pcm81

Senior member
Mar 11, 2011
584
9
81
Desktops are heading into parallel computing territory, and there is no reason to argue this. What we see happening is the development of Auxiliary Processing Units in the form of GPGPU. Currently 2 architectures exist: SIMD (AMD) and MIMD (NVIDIA).
In short:
In a MIMD architecture each core can perform its own task, so the programming model becomes very similar to that of any multicore CPU.
In a SIMD architecture all cores execute the same instruction but on different blocks of memory.

Both architectures work well; what differs is the set of algorithms needed to max out the potential of each.
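
To make the distinction concrete, here is a minimal Python sketch of the two models. It is only an illustration: `simd_step` and `mimd_step` are hypothetical names made up for this post, and CPU threads stand in for GPU cores, so it says nothing about how either vendor's hardware actually schedules work.

```python
from concurrent.futures import ThreadPoolExecutor


def simd_step(instruction, memory_blocks):
    """SIMD-style: one instruction, applied to every memory block."""
    return [instruction(block) for block in memory_blocks]


def mimd_step(tasks):
    """MIMD-style: each 'core' gets its own (instruction, data) pair."""
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(fn, data) for fn, data in tasks]
        return [f.result() for f in futures]


blocks = [[1, 2], [3, 4], [5, 6]]

# Same operation (sum) on every block.
simd_result = simd_step(sum, blocks)

# A different operation on each block.
mimd_result = mimd_step([(sum, blocks[0]), (max, blocks[1]), (len, blocks[2])])

print(simd_result)  # [3, 7, 11]
print(mimd_result)  # [3, 4, 2]
```

The SIMD version only pays off when every block really needs the same operation; the MIMD version can mix unrelated tasks, which is the flexibility argued about above.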

The question is: Which architecture will win in GPGPU?

The programming model for CUDA is more flexible than it is for Stream, but the power requirement for 1 CUDA core will always be bigger than that for 1 stream processor. Currently 1500 stream processors compete with 512 CUDA cores (6970 vs 580) on large datasets (high resolutions), but they lose at lower resolutions, when there is not enough load to use all 1500 stream processors in parallel.

Do you think that the simplicity of a stream processor will make it a better building block for GPGPU, or will the MIMD architecture of CUDA give it the upper hand? I don't know the answer, but I'd think that if fewer, more expensive cores in a MIMD architecture were the answer, we'd be seeing 24+ core CPUs by now...

Also, I argue that the SIMD model of stream processors shines with very large sets of data, while the MIMD model wins with smaller sets of data. Here a large set is anything that has 1000x more threads than the number of cores. So the question is: will the memory usage of PCs be large enough to feed 1000s of stream processors, or only large enough to feed 100s of CUDA cores? Currently we only see such large RAM usage in video compression or rendering applications, and of course in some scientific computing models as well.
 

Sylvanas

Diamond Member
Jan 20, 2004
3,752
0
0
There's more to it than the elements you mention. Nvidia and AMD have different approaches to handling data, both in the driver and in the hardware architecture. You may already know this, but let's recap for those who don't.

Nvidia focuses on Thread Level Parallelism (TLP), whereby incoming data sets are processed sequentially and the hardware doesn't need to worry about conflicts or execution resource utilization: all of those 512 CUDA cores should be in operation given a large number of threads (millions of threads per frame). This is what Nvidia's driver has been optimized for: finding the most efficient way of handling data sets and distributing the load evenly across all execution units. If you have 512 units executing 512 threads in a given clock cycle, they are not all going to finish at the same time; some will finish in the following cycle, and some may need another 3 cycles to finish processing that thread. The CUDA cores that finish in the next clock cycle must be kept fed with a new set of data to process, otherwise that CUDA core is idle, and that reduces efficiency because you are losing time that could be better put to use 'doing' something. It's still a very robust way of doing things, as you could imagine a 'queue' of threads (wavefronts/warps are the technical terms, but let's stick to 'thread' to describe a set of data) waiting to use each execution resource: as soon as one is done, the next starts, and so on.
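
The 'queue of threads' idea can be sketched as a toy simulation. This is a rough model under loud assumptions: the per-thread cycle counts are made up, `simulate_queue` and `simulate_lockstep` are hypothetical helpers, and the lockstep baseline is only there for contrast. Neither function describes how a real GPU dispatches work.

```python
import heapq


def simulate_queue(durations, num_cores):
    """Idle cores pull the next thread from a shared queue.

    Returns (total_cycles, utilization).
    """
    # Each heap entry is the cycle at which one core becomes free.
    free_at = [0] * num_cores
    heapq.heapify(free_at)
    for cycles in durations:
        start = heapq.heappop(free_at)  # earliest-free core takes the thread
        heapq.heappush(free_at, start + cycles)
    total = max(free_at)
    return total, sum(durations) / (total * num_cores)


def simulate_lockstep(durations, num_cores):
    """Contrast case: each batch waits for its slowest thread to finish."""
    total = 0
    for i in range(0, len(durations), num_cores):
        total += max(durations[i:i + num_cores])
    return total, sum(durations) / (total * num_cores)


# Assumed cycle counts for 8 threads on 4 cores (illustrative numbers only).
durations = [1, 2, 4, 1, 3, 2, 1, 2]

print(simulate_queue(durations, 4))     # 5 cycles, 80% utilization
print(simulate_lockstep(durations, 4))  # 7 cycles, about 57% utilization
```

With the queue, a core that finishes early immediately starts the next thread, which is why the same 16 cycles of work fit into 5 cycles instead of 7.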

AMD goes about things differently in its software and hardware architecture and focuses heavily on Instruction Level Parallelism (ILP). Cayman brought with it less dependence on ILP than previous generations, but ILP is still the foundation of AMD's architecture post R600. ILP looks to break down an incoming thread into what can be executed in parallel and what can't. One 'Stream Processor' or 'SP', as AMD puts it, consists of x, y, z, w and t units which process floating point numbers (with the exception of the t unit, which is an SFU). If instruction A comes first and its outcome is used to process instruction B, then they are 'dependent', as B cannot be processed until we know the outcome of A; think of this as taking 2 clock cycles. If, however, instructions A, B, C and D are not dependent on each other, then they can all be processed by the x, y, z and w units respectively in one clock cycle (a simplification, but you get the idea). So to compare, in one clock cycle an R600+ GPU has processed 4 instructions in the time it took a CUDA core (dependent) to process 1, but this is assuming the thread 'can' be executed in parallel. AMD's driver does a good job of making maximum use of the resources available and determining which instructions can be executed in what order, on which SPs and in what time frame.

If it can't be executed in parallel and instruction B has to wait for A to finish before it can start, then 3 of the 4 execution units (x, y, z, w) are sitting idle; that's only 25% efficiency.
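
The 100% vs 25% cases can be reproduced with a small greedy packer. This is a simplified sketch, assuming a 4-wide unit (x, y, z, w) and one cycle per instruction; `pack_bundles` and `slot_utilization` are hypothetical names, and the real shader compiler is far more involved.

```python
def pack_bundles(deps, width=4):
    """Greedily pack instructions into bundles of up to `width` slots.

    deps maps an instruction name to the set of names it depends on.
    An instruction may only issue after all of its dependencies have
    issued in an earlier bundle. Returns one list per clock cycle.
    """
    done = set()
    remaining = dict(deps)
    bundles = []
    while remaining:
        ready = [name for name, needs in remaining.items() if needs <= done]
        if not ready:
            raise ValueError("dependency cycle detected")
        bundle = ready[:width]
        bundles.append(bundle)
        done.update(bundle)
        for name in bundle:
            del remaining[name]
    return bundles


def slot_utilization(bundles, width=4):
    """Fraction of execution slots that actually did work."""
    issued = sum(len(bundle) for bundle in bundles)
    return issued / (len(bundles) * width)


# A, B, C, D independent: all four fit in one cycle.
independent = {"A": set(), "B": set(), "C": set(), "D": set()}
# Each instruction needs the previous result: one per cycle.
chain = {"A": set(), "B": {"A"}, "C": {"B"}, "D": {"C"}}

print(slot_utilization(pack_bundles(independent)))  # 1.0
print(slot_utilization(pack_bundles(chain)))        # 0.25
```

Real shader code sits somewhere between these two extremes, which is why the achieved throughput depends so heavily on the instruction mix.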

So what does all of that mean? Well, the architectures are different, so we can't really compare apples to apples; that is, we can't say 1500 SPs = 512 CUDA cores, because that depends on the instructions they are given. There is also more to the graphics pipeline than just execution units, so there could be stalls or inefficiencies to consider there as well (raster engine, geometry unit, dispatch units etc.). Sometimes executing things serially is all you can do, so doing that quicker will give better performance; sometimes things can be parallelized, and that obviously provides advantages. There will always be data that has to be processed serially. I suggest everyone read AT's excellent article regarding TLP and ILP here

CUDA and Stream are a different argument altogether; they provide the tools to handle what has been stated above. :thumbsup:
 

pcm81

Senior member
Mar 11, 2011
584
9
81
Many good points, and all are very accurate. However, the bottom line is still the same:
Will future applications have memory demands large enough to utilize 1000s of stream processors running in parallel on a single large set of data (SIMD), or will 90% of the stream processors sit idle, giving the CUDA cores better utilization, so that 100s of CUDA cores do more work per unit of time than 1000s of stream processors?

EDIT:
Most traditionally serial algorithms can be parallelized. For example, let's take a binary search. A single core will split the data into 2 chunks and check the middle value, then choose one of the two chunks and check the middle value again. This loop will repeat LOG2(N) times for an array with N elements. Now take a die with 1000 cores: the set is divided into 1000 memory subsets and each core checks whether the searched-for value is in its subset. Then that subset gets partitioned again into 1000 chunks and the loop repeats. So with 1000 cores we get LOG1000(N) iterations. Now let's say we have 512 CUDA cores vs 1500 stream processors. For CUDA we have LOG512(N); for Stream we have LOG1500(N). As you can see, when N is small the 512 CUDA cores do as well as the 1500 stream processors, and being MIMD they can all be utilized with different tasks. The stream processors will either all crunch one task or sit idle. So the question is: will future memory demands be high enough for LOG1500(N) to compensate for lower core utilization vs LOG512(N) with better core utilization, given that 2 separate tasks/applications can run on fractions of the 512 cores at the same time?
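
The iteration counts can be checked with a small sketch. Assumptions: `p_ary_search` is a hypothetical helper, the p "cores" are modelled by an ordinary sequential loop, and each round counts as one parallel step no matter how many cores take part. It counts rounds only and says nothing about real GPU timing.

```python
def p_ary_search(sorted_data, target, p):
    """Search sorted_data by splitting the window into p chunks per round.

    Each round models p cores that each check whether the target lies
    inside their own chunk. Returns (index or None, rounds).
    """
    lo, hi = 0, len(sorted_data)  # current search window is [lo, hi)
    rounds = 0
    while hi - lo > 1:
        rounds += 1
        chunk = -(-(hi - lo) // p)  # ceiling division
        next_window = None
        for k in range(p):  # one loop iteration per simulated core
            c_lo = lo + k * chunk
            c_hi = min(c_lo + chunk, hi)
            if c_lo < c_hi and sorted_data[c_lo] <= target <= sorted_data[c_hi - 1]:
                next_window = (c_lo, c_hi)
                break
        if next_window is None:
            return None, rounds
        lo, hi = next_window
    if lo < hi and sorted_data[lo] == target:
        return lo, rounds
    return None, rounds


n = 512 ** 2  # 262,144 elements, chosen so the splits come out even
data = list(range(n))

for cores in (2, 512, 1500):
    index, rounds = p_ary_search(data, 200_000, cores)
    print(cores, index, rounds)
# 2 cores need 18 rounds; 512 and 1500 cores both need 2 rounds
```

For this N the 512-core and 1500-core searches finish in the same number of rounds, which is the point made above: the extra cores only help once N is large enough.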
 

wahdangun

Golden Member
Feb 3, 2011
1,007
148
106
I think you missed one BIG detail: the stream processors come in 2 types, a complex one and a simpler one. In Cypress they are configured as 4+1, i.e. 4 simpler shaders and 1 complex shader. So in the best-case scenario, when games are optimized for it, it will use all 1600 shaders, but in the worst case it can only use 1600/5 = 320 shaders. The best case basically never happens, though; usually the simple shaders are underutilized. That's why AMD can increase performance quite significantly with every driver release: the driver team alters the game code to optimize it.
 

Barfo

Lifer
Jan 4, 2005
27,539
212
106
apu_nahasapeemapetilon.png


...sorry :p
 

pcm81

Senior member
Mar 11, 2011
584
9
81
This is all true, but:
I am curious about the hardware potential. It is possible to write stupid code on any platform, resulting in very poor performance even on a top-end CPU. Let's assume that the developers actually know what they are doing; I know this is rare these days, but such people still exist. Take the best optimized code for 512 CUDA cores and the best optimized code for 1500 stream processors: who will win 2, 5, 10 years from now?
 

wahdangun

Golden Member
Feb 3, 2011
1,007
148
106
Yup, this happens all the time; just look at DA2 and Shogun Total War performance on Nvidia hardware, or Civ5 on AMD cards.