Are the next-gen consoles the realization of AMD's HSA dream?


galego

Golden Member
Apr 10, 2013
I don't think that's the reason. The Cell itself is terribly weak. It's rather the nVidia GPU that's the reason.

Do you mean the same nVidia GPU with GFLOPS inflated by about a factor of ten? Sooo typical of nVidia...
 

galego

Golden Member
Apr 10, 2013
The fact is developers said the exact same thing about the PS3, whether you like it or not.

The fact is that the situation with the PS4 is clearly different. The technical details were presented as well, and it does not matter if you continue denying them.
 

ShintaiDK

Lifer
Apr 22, 2012
Do you mean the same nVidia GPU with GFLOPS inflated by about a factor of ten? Sooo typical of nVidia...

It was Sony that inflated them. MS did the same with the ATI GPU in theirs: 2 TFLOPS vs 1 TFLOPS. Both were purely made-up paper numbers.

And you already did the same, claiming the PS4 performs like 20 TFLOPS on the PC, with the fairytale nonsense about a 10x faster API.

http://forums.anandtech.com/showpost.php?p=34884018&postcount=75
http://forums.anandtech.com/showpost.php?p=34883683&postcount=71

The fact is that the situation with the PS4 is clearly different. The technical details were presented as well, and it does not matter if you continue denying them.

It's no different. It's simply you making up random nonsense to try to put a positive spin on a certain product and component.
 

zlatan

Senior member
Mar 15, 2011
Vs x86, where each core is the same as the others, and you have 8 of them (in the PS4's case).

I'd call that a different concept.

x86 is just an ISA. It has nothing to do with design concepts. What Sony/IBM/Toshiba wanted to do with Cell was a working design for heterogeneous computing. Sure, it was not perfect, but it was the very first step on a long road toward energy-efficient computing. AMD just came up with a more efficient solution.

Did you see the 1 million physics objects demo Havok had?
Yes, I saw it. The Cell architecture could also do it if it had the compute power.

That was said to use a tiny fraction of the GPU... and not be possible on a PS3.

Don't think about the GPU; it will just confuse you. Think about central processing and vector cores. Cell has 1 PPE (CPU core) and 6 programmable SPEs (vector cores). The PS4 APU is not really different: it has 8 CPU cores and 18 compute units (vector cores). The concept is mostly the same, but the SPEs (in Cell) don't really like traditional graphics-related workloads. You can emulate texturing, tessellation, and other stuff, but a fixed-function unit for these is much faster ... and this is the advantage of the PS4 APU.
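For a rough sense of the scale difference between those "vector cores" (my arithmetic from commonly quoted clocks and SIMD widths, not figures given in the post above):

```latex
% Peak single-precision throughput of the vector resources in each design
% (commonly quoted derivations; an illustration only):
\begin{align*}
\text{PS3 Cell (6 usable SPEs)} &: 6 \times 4~\text{lanes} \times 2~\tfrac{\text{FLOP}}{\text{cycle}} \times 3.2~\text{GHz} \approx 154~\text{GFLOPS}\\
\text{PS4 GPU (18 CUs)} &: 18 \times 64~\text{lanes} \times 2~\tfrac{\text{FLOP}}{\text{cycle}} \times 0.8~\text{GHz} \approx 1843~\text{GFLOPS}
\end{align*}
```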
 

Cerb

Elite Member
Aug 26, 2000
Do you think that the PS4 APU is a different concept compared to Cell? :) Well no, it's not. The main idea is mostly the same with evolutionary improvements.
Er, no. It's practically from a different reality. The only iteration it got was one or two specialized IBM chips before being ignore-killed (a common IBM tactic: don't say it wasn't very good, don't change the roadmaps, don't say that BG/Q is a better way to do the same sort of work... just pretend it never existed and say everything is fine).

The vision of the Cell was to give everyone a mini-supercomputer that could be extended by networking, to provide massive parallel computational capabilities everywhere. Neat idea, actually. But, to make it work, they made a processor with 8 vector coprocessors that were regressive in almost every conceivable capacity--but were small and fast. To keep speeds up, and size down, the main CPU was also made highly regressive, lacking basic features we've come to expect in CPUs. The result was a largely broken, fragile piece of hardware. If you could use the SIMD in the SPEs, fit your data set into chunks smaller than 256KB (minus code, and minus enough space to hide RAM latency), and get all the work done within the space of a frame (~33ms, say, for 30FPS), it might turn out really good. Otherwise, you'd have basically an Atom with fast array maths, and 2/3 or more of the die space sitting around wasted.
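To make that 256KB discipline concrete, here is a minimal C++ sketch of the chunk-and-double-buffer pattern an SPE forces on you. It is not real SPE code (that would use IBM's SDK and asynchronous mfc_get/mfc_put DMA, which is what actually hides the RAM latency); plain memcpy stands in for the DMA, so only the structure carries over.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

// Two staging buffers; on a real SPE these, plus code and stack, must fit in 256 KB.
constexpr std::size_t kChunkFloats = 16 * 1024;   // 64 KB per buffer

static void compute(float* chunk, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)           // stand-in for the SIMD kernel
        chunk[i] = chunk[i] * 2.0f + 1.0f;
}

// Stream a data set far larger than the local store through the two buffers.
void process_stream(float* data, std::size_t count) {
    static float local[2][kChunkFloats];
    int cur = 0;
    std::size_t off = 0;
    std::size_t n = std::min(kChunkFloats, count);
    std::memcpy(local[cur], data + off, n * sizeof(float));        // "DMA in" first chunk

    while (n > 0) {
        // Fetch the next chunk into the other buffer. On Cell this transfer is
        // asynchronous and overlaps with compute(); here it is just a memcpy.
        std::size_t next_off = off + n;
        std::size_t next_n = next_off < count ? std::min(kChunkFloats, count - next_off) : 0;
        if (next_n > 0)
            std::memcpy(local[cur ^ 1], data + next_off, next_n * sizeof(float));

        compute(local[cur], n);                                    // work on current chunk
        std::memcpy(data + off, local[cur], n * sizeof(float));    // "DMA out" the result

        cur ^= 1;
        off = next_off;
        n = next_n;
    }
}

int main() {
    std::vector<float> big(1 << 20, 1.0f);   // 4 MB data set, far larger than 256 KB
    process_stream(big.data(), big.size());
    std::printf("first=%.1f last=%.1f\n", big.front(), big.back());
    return 0;
}
```

Miss any of those constraints (chunk size, latency hiding, frame budget) and, as described above, the SPEs mostly sit idle.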

The PS4, OTOH, looks like a near-future AMD mobile APU, but on steroids and protein shakes (we will probably get quads only, and with a small fraction of the GPU power--but, include a Linux/X-compatible touch screen, and I'll bite, OK?).
 

galego

Golden Member
Apr 10, 2013
It was Sony that inflated them. MS did the same with the ATI GPU in theirs: 2 TFLOPS vs 1 TFLOPS. Both were purely made-up paper numbers.

No. ATI claimed 240 GFLOPS for the GPU. MS claimed 1 TFLOPS TOTAL, which included the CPU (115 GFLOPS) plus the non-programmable FLOPS (697 GFLOPS).

And you already did the same, claiming the PS4 performs like 20 TFLOPS on the PC, with the fairytale nonsense about a 10x faster API.

http://forums.anandtech.com/showpost.php?p=34884018&postcount=75
http://forums.anandtech.com/showpost.php?p=34883683&postcount=71

No. The 10x overhead factor is well known.

Your claim that 1 TFLOPS on a console equates to 1 TFLOPS on a PC is as wrong as claiming that a 100 HP car is as fast as a 100 HP motorbike. They are not: the motorbike (console) is faster than the car (PC).
 

showb1z

Senior member
Dec 30, 2010
No. The 10x overhead factor is well known.

Your claim that 1 TFLOPS on a console equates to 1 TFLOPS on a PC is as wrong as claiming that a 100 HP car is as fast as a 100 HP motorbike. They are not: the motorbike (console) is faster than the car (PC).

If the overhead really is 10x, how have PCs been running the same games much better (way higher res and framerate) than consoles for most of this console generation?

Whatever advantage consoles might have will probably already be made up when the 20nm GPUs come around.
 

ShintaiDK

Lifer
Apr 22, 2012
No. ATI claimed 240 GFLOPS for the GPU. MS claimed 1 TFLOPS TOTAL, which included the CPU (115 GFLOPS) plus the non-programmable FLOPS (697 GFLOPS).



No. The 10x overhead factor is well known.

Your claim that 1 TFLOPS on a console equates to 1 TFLOPS on a PC is as wrong as claiming that a 100 HP car is as fast as a 100 HP motorbike. They are not: the motorbike (console) is faster than the car (PC).

You are really out in deep water here.

How much of the execution time do the driver and API actually account for, percentage-wise, in DX10+?

And since the 10x factor is well known, can you please document it? Or is it just made up?
 

Lepton87

Platinum Member
Jul 28, 2009
If the overhead really is 10x, how have PCs been running the same games much better (way higher res and framerate) than consoles for most of this console generation?

Whatever advantage consoles might have will probably already be made up when the 20nm GPUs come around.

Do you really think the PS4 is better than a PC with a Titan? If that's true, I paid twice as much just for the GPU of an inferior gaming platform.
 

Cerb

Elite Member
Aug 26, 2000
Do you really think the PS4 is better than a PC with a Titan? If that's true, I paid twice as much just for the GPU of an inferior gaming platform.
You should feel bad about that, of course :).

Except for near-metal CPU+GPU exploitation on the PS4's side, exploitation of specific HW features that make little sense on PCs, and ideal exploitation of the local store on the new Xbox, there's not much to worry about. Optimizing for the HW will, just like every time, be a case of keeping the console from appearing to be too far behind after it's a year or two old. At least with a decent CPU and 8GB RAM, they might be able to do that this time.
 

gorobei

Diamond Member
Jan 7, 2007
You are really out in deep water here.

How much of the execution time do the driver and API actually account for, percentage-wise, in DX10+?

And since the 10x factor is well known, can you please document it? Or is it just made up?

A Google search comes up with some pretty relevant results.
The original Richard Huddy interview from 2011 seems to be the main source of the numbers for draw-call limits and direct-to-metal advantages. The other hits seem to be forums reacting to the numbers and asking for confirmation.
http://www.bit-tech.net/hardware/graphics/2011/03/16/farewell-to-directx/2

Tripwire forum discussion, quoting an unsourced Codemasters Operation Flashpoint programmer. More hard numbers and a better explanation of the DX penalty.
http://forums.tripwireinteractive.com/showthread.php?t=69260

I'm not going to search for the video, but there is a John Carmack interview in which he talks about the draw-call limits in DX and the benefit of direct-to-metal.
There are also some YouTube videos of Tim Sweeney and Andrew Richards debating the use of a single die and direct-to-metal. Sweeney makes some very specific comments early in the video about DX resulting in marginal gains regardless of hardware improvements. The 10-to-1 power and latency costs of going off-die come up later.
http://www.youtube.com/watch?v=tnogwO84O0Q
Full series:
http://semiaccurate.com/2010/09/22/tim-sweeney-and-andrew-richards-talk-about-future-graphics/
 

ShintaiDK

Lifer
Apr 22, 2012
The Xbox 360 uses DirectX. Did it do much worse than the PS3? Certainly not.

Also, "direct to metal" is kind of wrong. The PS3/PS4 use OpenGL ES and a driver, the same setup as on the Xbox 360 or a PC with driver + DirectX.

The only difference between the three is that, since Vista, the graphics driver has been moved out a bit further in the kernel rings.
 

gorobei

Diamond Member
Jan 7, 2007
The Xbox 360 uses DirectX. Did it do much worse than the PS3? Certainly not.

Also, "direct to metal" is kind of wrong. The PS3/PS4 use OpenGL ES and a driver, the same setup as on the Xbox 360 or a PC with driver + DirectX.

The only difference between the three is that, since Vista, the graphics driver has been moved out a bit further in the kernel rings.

Not really.
The PS3 has 3 interfaces:
* OpenGL|ES 1.0 + a few 2.0 extensions
* PSGL (~OpenGL|ES + custom NV extensions)
* libGCM (extreme low level API ~= no driver)


http://www.ps3devwiki.com/wiki/RSX
RSX Libraries

The RSX is dedicated to 3D graphics, and developers are able to use different API libraries to access its features. The easiest way is to use the high-level PSGL, which is basically OpenGL|ES with a programmable pipeline added in - but hardly anyone uses PSGL these days, preferring the native GPU command buffer generation library, libgcm. At a lower level developers can use LibGCM, which is an API that talks to the RSX more directly; PSGL is actually implemented on top of LibGCM. For the advanced programmer, you can program the RSX by sending commands to it directly using C or assembly. This can be done by setting up commands (via FIFO Context) and DMA Objects and issuing them to the RSX via DMA calls.
Early on, when devs are unfamiliar with the hardware, they use OpenGL; later they use libGCM and get way more performance.
 

Enigmoid

Platinum Member
Sep 27, 2012
Even Intel has cited the 10x figure. Google it.

I've said this many times, but look at the current consoles and a modern graphics card at the same GFLOPS level. The PC GPU can play any straight console port at the same settings as the Xbox 360.

E.g. Skyrim on console vs Skyrim with a 540M (generally the same performance; both are around 250 GFLOPS).

Unless, of course, the old Xbox architecture is holding it back by a factor of 10x ...

I'm not sure about this, but didn't DX10 and DX11 reduce the problems with draw calls, etc.?
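For reference, the rough peak numbers behind that ~250 GFLOPS pairing (my arithmetic from commonly quoted specs, not figures given in the post):

```latex
% Peak single-precision throughput, commonly quoted derivations:
\begin{align*}
\text{Xbox 360 Xenos} &: 48~\text{ALUs} \times 5~\text{lanes} \times 2~\tfrac{\text{FLOP}}{\text{cycle}} \times 0.5~\text{GHz} = 240~\text{GFLOPS}\\
\text{GeForce GT 540M} &: 96~\text{cores} \times 2~\tfrac{\text{FLOP}}{\text{cycle}} \times 1.344~\text{GHz} \approx 258~\text{GFLOPS}
\end{align*}
```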
 

gorobei

Diamond Member
Jan 7, 2007
http://www.neowin.net/forum/topic/1...k-8gbram-22gbvram-new-controller/page__st__15
Interesting article from Timothy Lottes (creator of FXAA):

Quote
Working assuming the Eurogamer Article is mostly correct with the exception of maybe exact clocks, amount of memory, and number of enabled cores (all of which could easily change to adapt to yields).

While the last console generation is around 16x behind in performance from the current high-end single chip GPUs, this was a result of much easier process scaling and this was before reaching the power wall. Things might be much different this round, a fast console might be able to keep up much longer as scaling slows down. If Sony decided to bump up the PS4 GPU, that was a great move, and will help the platform live for a long time. If PS4 is around 2 Tflop/s, this is roughly half what a single GPU high-end PC has right now, which is probably a lot better than what most PC users have. If desktop goes to 4K displays this requires 4x the perf over 1080p, so if console maintains a 1080p target, perf/pixel might still remain good for consoles even as PC continues to scale.

The real reason to get excited about a PS4 is what Sony as a company does with the OS and system libraries as a platform, and what this enables 1st party studios to do, when they make PS4-only games. If PS4 has a real-time OS, with a libGCM style low level access to the GPU, then the PS4 1st party games will be years ahead of the PC simply because it opens up what is possible on the GPU. Note this won't happen right away on launch, but once developers tool up for the platform, this will be the case. As a PC guy who knows hardware to the metal, I spend most of my days in frustration knowing damn well what I could do with the hardware, but what I cannot do because Microsoft and IHVs won't provide low-level GPU access in PC APIs. One simple example, drawcalls on PC have easily 10x to 100x the overhead of a console with a libGCM style API.

Assuming a 7970M in the PS4, AMD has already released the hardware ISA docs to the public, so it is relatively easy to know what developers might have access to do on a PS4. Let's start with the basics known from PC. AMD's existing profiling tools support true async timer queries (where the timer results are written to a buffer on the GPU, then async read on the CPU). This enables the consistent profiling game developers require when optimizing code. AMD also provides tools for developers to view the output GPU assembly for compiled shaders, another must for console development. Now let's dive into what isn't provided on PC but what can be found in AMD's GCN ISA docs:

Dual Asynchronous Compute Engines (ACE) :: Specifically "parallel operation with graphics and fast switching between task submissions" and "support of OCL 1.2 device partitioning". Sounds like at a minimum a developer can statically partition the device such that graphics and compute can run in parallel. For a PC, static partition would be horrible because of the different GPU configurations to support, but for a dedicated console, this is all you need. This opens up a much easier way to hide small compute jobs in a sea of GPU filling graphics work like post processing or shading. The way I do this on PC now is to abuse vertex shaders for full screen passes (the first triangle is full screen, and the rest are degenerates, use an uber-shader for the vertex shading looking at gl_VertexID and branching into "compute" work, being careful to space out the jobs by the SIMD width to avoid stalling the first triangle, or loading up one SIMD unit on the machine, ... like I said, complicated). In any case, this Dual ACE system likely makes it practical to port over a large amount of the Killzone SPU jobs to the GPU even if they don't completely fill the GPU (which would be a problem without complex uber-kernels on something like CUDA on the PC).

Dual High Performance DMA Engines :: Developers would get access to do async CPU->GPU or GPU->CPU memory transfers without stalling the graphics pipeline, and specifically the ability to control semaphores in the push buffer(s) to ensure no stalls and low latency scheduling. This is something the PC APIs get horribly wrong, as all memory copies are implicit without really giving control to the developer. This translates to much better resource streaming on a console.

Support for up to 6 Audio Streams :: HDMI supports audio, so the GPU actually outputs audio, but no PC driver gives you access. The GPU shader is in fact the ideal tool for audio processing, but on the PC you need to deal with the GPU->CPU latency wall (which can be worked around with pinned memory), but to add insult to injury the PC driver simply just copies that data back to the GPU for output, adding more latency. In theory on something like a PS4 one could just mix audio on the GPU directly into the buffer being sent out on HDMI.

Global Data Store :: AMD has no way of exposing this in DX, and in OpenGL they only expose this in the ultra-limited form of counters which can only increment or decrement by one. The chip has 64KB of this memory, effectively with the same access as shared memory (atomics and everything) and lower latency than global atomics. This GDS unit can be used for all sorts of things, like workgroup to workgroup communication, global locks, or doing an append or consume to an array of arrays where each thread can choose a different array, etc. To-the-metal access to GDS removes the overhead associated with managing huge data sets on the GPU. It is much easier to build GPU based hierarchical occlusion culling and scene management with access to these kinds of low level features.

Re-used GPU State :: On a console with low level hardware access (like the PS3) one can pre-build and re-use command buffer chunks. On a modern GPU, one could even write or modify pre-built command buffer chunks from a shader. This removes the cost associated with drawing, pushing up the number of unique objects which can be drawn with different materials.

FP_DENORM Control Bit :: On the console one can turn off both DX's and GL's forced flush-to-denorm mode for 32-bit floating point in graphics. This enables easier ways to optimize shaders because integer limited shaders can use floating point pipes using denormals.

128-bit to 256-bit Resource Descriptors :: With GCN all that is needed to define a buffer's GPU state is to set 4 scalar registers to a resource descriptor, similar with texture (up to 8 scalar registers, plus another 4 for sampler). The scalar ALU on GCN supports block fetch of up to 16 scalars with a single instruction from either memory or from a buffer. It looks to be trivially easy on GCN to do bind-less buffers or textures for shader load/stores. Note this scalar unit has its own data cache also. Changing textures or surfaces from inside the pixel shader looks to be easily possible. Note shaders still index resources using an instruction immediate, but the descriptor referenced by this immediate can be changed. This could help remove the traditional draw call based material limit.

S_SLEEP, S_SETPRIO, and GDS :: These provide all the tools necessary to do lock and lock-free retry loops on the GPU efficiently. DX11 specifically does not allow locks due to fear that some developer might TDR the system. With low level access, S_SLEEP enables putting a wavefront to sleep without busy spinning on the ALUs, and S_SETPRIO enables reducing priority when checking for unlock between S_SLEEPs.

S_SENDMSG :: This enables a shader to force a CPU interrupt. In theory this can be used to signal to a real-time OS completion of some GPU operation, to start up some CPU based tasks without needing the CPU to poll for completion. The other option would be maybe an interrupt signaled from a push buffer, but this wouldn't be able to signal from some intermediate point during a shader's execution. This on PS4 might enable tighter GPU and CPU task dependencies in a frame (or maybe even in a shader), compared to the latency wall which exists on a non-real-time OS like Windows, which usually forces CPU and GPU task dependencies to be a few frames apart.

Full Cache Flush Control :: DX has only implicit driver controlled cache flushes, it needs to be conservative, track all dependencies (high overhead), then assume conflict and always flush caches. On a console, the developer can easily skip cache flushes when they are not needed, leading to more parallel jobs and higher performance (overlap execution of things which on DX would be separated by a wait for machine to go idle).

GPU Assembly :: Maybe? I don't know if GCN has some hidden very complex rules for code generation and compiler scheduling. The ISA docs seem trivial to manage (manual insertion of barriers for texture fetch, etc). If Sony opens up GPU assembly, unlike the PS3, developers might easily crank out 30% extra from hand tuning shaders. The alternative is iterating on Cg, which is possible with real-time profiling tools. My experience on PC is micro-optimization of shaders yields some massive wins. For those like myself who love assembly of any arch, a fixed hardware spec is a dream.

...

I could continue here, but I'm not going to; by now you get the picture. Launch titles will likely be DX11 ports, so perhaps not much better than what could be done on PC. However, if Sony provides the real-time OS with libGCM v2 for GCN, then one or two years out, 1st party devs and Sony's internal teams like the ICE team will have had long enough to build up tech to really leverage the platform.

I'm excited for what this platform will provide for PS4-only 1st party titles and developers who still have the balls to do a non-portable game this next round.
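For readers unfamiliar with the "OCL 1.2 device partitioning" mentioned under the ACEs above, here is a hedged sketch (my illustration, not from Lottes' article) of what static partitioning looks like at the OpenCL host API level. On PCs it is mostly CPU devices that accept partitioning, and a typical GPU driver will simply return an error, which is essentially the point that a static graphics/compute split only makes sense on fixed console hardware.

```cpp
// Minimal OpenCL 1.2 device-partitioning sketch: split one device into two equal
// sub-devices and give each its own command queue ("graphics half" / "compute half").
// Expect clCreateSubDevices to fail on most PC GPU drivers; error handling is minimal.
#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_platform_id platform;
    cl_device_id device;
    if (clGetPlatformIDs(1, &platform, nullptr) != CL_SUCCESS ||
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, nullptr) != CL_SUCCESS) {
        std::fprintf(stderr, "no OpenCL device found\n");
        return 1;
    }

    cl_uint units = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(units), &units, nullptr);

    // Ask for sub-devices with half the compute units each.
    cl_device_partition_property props[] = {
        CL_DEVICE_PARTITION_EQUALLY,
        static_cast<cl_device_partition_property>(units / 2),
        0
    };

    cl_device_id sub[2];
    cl_uint num_sub = 0;
    if (clCreateSubDevices(device, props, 2, sub, &num_sub) != CL_SUCCESS || num_sub < 2) {
        std::fprintf(stderr, "this device/driver does not support static partitioning\n");
        return 1;
    }

    // Separate queues: work submitted to one partition cannot starve the other,
    // which is the "static partition" idea applied to graphics vs compute.
    cl_int err = CL_SUCCESS;
    cl_context ctx = clCreateContext(nullptr, num_sub, sub, nullptr, nullptr, &err);
    cl_command_queue q0 = clCreateCommandQueue(ctx, sub[0], 0, &err);
    cl_command_queue q1 = clCreateCommandQueue(ctx, sub[1], 0, &err);
    std::printf("created %u sub-devices, each with its own queue\n", num_sub);

    clReleaseCommandQueue(q1);
    clReleaseCommandQueue(q0);
    clReleaseContext(ctx);
    return 0;
}
```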
 

ShintaiDK

Lifer
Apr 22, 2012
So now we are down to draw calls only? How much do they affect the overall performance?

Draw calls are CPU bound. And you don't need a context switch in certain situations anymore in DX either (only in older versions). And draw calls are only performance-demanding if you actually submit too little data per call instead of batching.

But again, it's the CPU, not the GPU. And a PC CPU is much faster than a puny, weak console CPU.

It's simply idiocy to claim a console will perform at 10x the level of a PC.
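To put rough numbers on that batching point (my arithmetic, using the old NVIDIA "Batch, Batch, Batch" rule of thumb of roughly 25,000 draw calls per second per GHz of CPU spent entirely in the driver/API; treat it as an order-of-magnitude estimate):

```latex
% Draw-call budget if the CPU did nothing but issue draw calls (rule-of-thumb figures):
\begin{align*}
3~\text{GHz} \times 25{,}000~\tfrac{\text{calls}}{\text{s}\cdot\text{GHz}} &\approx 75{,}000~\text{calls per second}\\
75{,}000 \div 60~\text{fps} &\approx 1{,}250~\text{calls per frame}
\end{align*}
```

That is why tiny per-object draws are CPU-limited on PC, and why batching on PC and thinner console APIs are attacking the same cost from opposite ends.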
 

showb1z

Senior member
Dec 30, 2010
Do you really think the PS4 is better than a PC with a Titan? If that's true, I paid twice as much just for the GPU of an inferior gaming platform.

Well no, but Titan is nowhere near mainstream. I was thinking more along the lines of a GTX 760/770.
 

galego

Golden Member
Apr 10, 2013
If the overhead really is 10x, how have PCs been running the same games much better (way higher res and framerate) than consoles for most of this console generation?

Crysis 2 on Xbox 360 vs PC:

[Image: Crysis 2 comparison screenshot, PC vs Xbox 360]


Only in recent years have some games looked better on the latest high-end gaming PCs, and that is because those PCs are now more than 10x faster and have 10x more memory.

RSX GPU on PS3: 176 console GFLOPS ~= 1760 PC-equivalent GFLOPS.

An Nvidia GTX 680 has about 3000 GFLOPS, which is about double the power of a PS3.

Moreover, the above 1760 GFLOPS is a theoretical maximum; as is well known, the PS3 was very difficult to program and its full power was never fully unleashed.
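Spelling out the poster's claimed scaling (the 10x factor itself is exactly what is disputed in this thread; the GTX 680 figure is the commonly quoted peak):

```latex
% The claimed console-to-PC equivalence, made explicit:
\begin{align*}
\text{PS3 RSX (claimed PC-equivalent)} &: 176~\text{GFLOPS} \times 10 \approx 1760~\text{GFLOPS}\\
\text{GTX 680 (peak SP)} &: 1536~\text{ALUs} \times 2~\tfrac{\text{FLOP}}{\text{cycle}} \times 1.006~\text{GHz} \approx 3090~\text{GFLOPS}
\end{align*}
```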

Check memory specs.

RSX GPU on PS3: 256 MB.

An Nvidia GTX 680 has 2048 MB.

You cannot have the same resolution, textures... on the console.

Consoles are about eight years old and cannot compete with recent high-end gaming PCs. This is why a new generation of consoles (PS4, next Xbox...) is just about ready.
 

beginner99

Diamond Member
Jun 2, 2009
And since the 10x factor is well known, can you please document it? Or is it just made up?

Hey, come on. Don't play stupid. It's well known the earth is at the center of the universe! :D

I don't know much about this stuff, but common sense tells me that this 10x number is BS. Maybe certain specific tasks have that overhead, but overall the effect will be much smaller.

It's the same myth as Java being ultra slow and everything magically becoming 10x faster in C++.