Actual vs Claimed GPU performance in terms of GFLOP/s

You're still not paying attention to what I'm saying. Yes, the general GPU architecture and programming model means you have to have a task that is amenable to parallelization under that model, to ensure maximum utilization of resources at all times.

Take one of the best examples of a task that fulfills the above requirement: GEMM. Its compute cost is O(n^3) against only O(n^2) data, meaning most of the time is spent in the computation kernel, and studies show there is little data-transfer overhead for this task.
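As a quick sanity check on that O(n^3)-compute vs. O(n^2)-data point, here is a back-of-the-envelope sketch (plain host-side code; the constants are just the textbook SGEMM flop and element counts, nothing measured):

```
/* flops vs. bytes moved for an n x n SGEMM: C = A*B is ~2n^3 flops
   on 3n^2 floats of input/output (A, B, C each touched once). */
#include <stdio.h>

int main(void) {
    for (long n = 1024; n <= 16384; n *= 2) {
        double flops = 2.0 * n * n * n;
        double bytes = 3.0 * n * n * sizeof(float);
        printf("n=%6ld  flops per byte transferred = %.0f\n", n, flops / bytes);
    }
    return 0;
}
```

The ratio grows linearly with n, which is exactly why the host-to-device transfer stops mattering for large matrices.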

Now, for any large N the working set fills out the GPU's video memory, and since a float is only 4 bytes, it doesn't take long to figure out that even the most parallelizable tasks end up memory-bandwidth bound anyway, which is what I've been saying from the beginning.

Once you realize this, it isn't hard to put two and two together: even the best GPUs today have only around 3x-8x more memory bandwidth than the best CPU configurations (comparing the likes of the V100 and P100 against Xeon SP and Epyc), so figures claiming 100x or more performance should always be taken with a grain of salt.
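Running the spec-sheet numbers (approximate peak figures from the vendor datasheets; treat them as ballpark, not measured):

```
/* Rough peak memory bandwidth ratios, GB/s, from spec sheets. */
#include <stdio.h>

int main(void) {
    const double v100 = 900.0;   /* Tesla V100, HBM2              */
    const double p100 = 732.0;   /* Tesla P100, HBM2              */
    const double xeon = 128.0;   /* Xeon SP, 6ch DDR4-2666        */
    const double epyc = 170.6;   /* Epyc, 8ch DDR4-2666           */
    printf("V100 vs Xeon SP: %.1fx   V100 vs Epyc: %.1fx\n", v100 / xeon, v100 / epyc);
    printf("P100 vs Xeon SP: %.1fx   P100 vs Epyc: %.1fx\n", p100 / xeon, p100 / epyc);
    return 0;
}
```

That prints ratios between roughly 4x and 7x, squarely in the 3x-8x range.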

I glanced through the PDF; it is interesting and I will read it later on.
But I already touched on all of what you write here when I wrote about caches, data structures, and proper algorithms, which also includes making use of the DMA engines to stream data from system RAM to GPU RAM before it is needed, and to stream it back out when results need to go from GPU RAM to system RAM.
A GPU also has caches and internal registers, and it uses the caches to hide GPU memory latency. Also, bandwidth is only useful when you make proper use of the width of the memory channels.
Taking one byte at a time from GPU RAM will make even a wide memory bus look awfully slow; one needs to take the memory bus width into account as well. Do not fetch one float at a time, but a whole bunch of them. For that to be useful, the data structure must be laid out so that one request can use as many floats as the memory bus can provide. It is not as if you have seen the light: the engineers and mathematicians who put a lot of work into designing a GPU know all this and came up with solutions for it. Cray designers have been dealing with the same problem ever since the 1960s.
It does require an approach that makes maximum use of the hardware.
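A minimal sketch of that "whole bunch of floats per request" idea in CUDA (a hypothetical kernel, just for illustration): with float4, each thread issues one 16-byte load instead of four separate 4-byte ones, so a warp's requests stay wide and contiguous.

```
#include <cuda_runtime.h>

/* Consecutive threads in a warp read consecutive float4s, so the
   accesses coalesce into the widest transactions the bus can serve. */
__global__ void scale4(const float4 *in, float4 *out, float s, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = in[i];                   /* one wide, coalesced load  */
        v.x *= s; v.y *= s; v.z *= s; v.w *= s;
        out[i] = v;                         /* one wide, coalesced store */
    }
}

/* Usage, with n divisible by 4 and n4 = n / 4:
   scale4<<<(n4 + 255) / 256, 256>>>(d_in, d_out, 2.0f, n4); */
```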

Prefetching: it is a pipelined approach that one must take.
Also, take note that the paper is from 2011; much has changed since then.
Block transfers are key here...
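A minimal sketch of that pipelined, block-transfer approach (again hypothetical code, assuming h_data is pinned with cudaHostAlloc so the async copies can actually overlap): two streams ping-pong between two device buffers, so the DMA engine prefetches block k+1 while the GPU is still computing on block k.

```
#include <cuda_runtime.h>

__global__ void work(float *buf, int n) {       /* stand-in for real work */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = buf[i] * buf[i] + 1.0f;
}

void pipeline(float *h_data, int n, int chunk) {
    float *d_buf[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&d_buf[b], chunk * sizeof(float));
        cudaStreamCreate(&s[b]);
    }
    for (int off = 0, k = 0; off < n; off += chunk, ++k) {
        int b = k & 1;                          /* ping-pong buffers     */
        int len = (n - off < chunk) ? n - off : chunk;
        cudaMemcpyAsync(d_buf[b], h_data + off, len * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        work<<<(len + 255) / 256, 256, 0, s[b]>>>(d_buf[b], len);
        cudaMemcpyAsync(h_data + off, d_buf[b], len * sizeof(float),
                        cudaMemcpyDeviceToHost, s[b]);
    }
    for (int b = 0; b < 2; ++b) {
        cudaStreamSynchronize(s[b]);
        cudaStreamDestroy(s[b]);
        cudaFree(d_buf[b]);
    }
}
```

Since operations within one stream serialize, each buffer is safely reused every other block while the opposite stream's transfer overlaps the current kernel.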


edit:
I do not know if it is the case with all modern GPUs, but I have read that older Tesla versions had a problem with misaligned memory access. When memory is misaligned, it takes more clock cycles to get the data from GPU RAM into the GPU's cache. Accesses should be aligned at warp granularity (32 threads x 4 bytes = 128 bytes).
Of course, misalignment will reduce performance. That is another thing to think about: it is not that the GPU or compute unit is bad, but the programmer must know and understand the limitations of the hardware he or she is programming for.
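To make that concrete, a small hypothetical kernel: cudaMalloc returns pointers aligned to at least 256 bytes, but reading from base + offset shifts every warp's 128-byte request across segment boundaries, which cost extra transactions on those older Tesla parts.

```
#include <cuda_runtime.h>

__global__ void copy_off(const float *in, float *out, int offset, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i + offset];  /* offset != 0 breaks alignment */
}

/* copy_off<<<grid, block>>>(d_in, d_out, 0, n);  aligned: fewest transactions
   copy_off<<<grid, block>>>(d_in, d_out, 1, n);  shifted by 4 bytes: each
                                 warp's read straddles a 128-byte segment */
```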
 