When programming GPUs, we typically schedule many thousands of threads, and we can further organize those threads into tiles of threads.
Aside: These concepts also exist in other programming models: in HLSL they are called “threads” and “thread groups”, in CUDA “CUDA threads” and “thread blocks”, and in OpenCL “work items” and “work groups”. But we’ll stick with the C++ AMP terms of “threads” and “tiles (of threads)”.
From a correctness perspective, and in terms of the programming model’s concepts, that is the end of the story.
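For concreteness, here is a minimal C++ AMP sketch of those two concepts (the function name, data size assumption, and tile size of 256 are my own choices for illustration):

```cpp
#include <amp.h>
using namespace concurrency;

// Schedule one thread per element, organized in tiles of 256 threads.
// For brevity, n is assumed to be a multiple of 256.
void double_all(float* data, int n) {
    array_view<float, 1> av(n, data);
    parallel_for_each(av.extent.tile<256>(),
        [=](tiled_index<256> tidx) restrict(amp)
    {
        // tidx.global = this thread's position among all n threads;
        // tidx.local  = its position within its 256-thread tile.
        av[tidx.global] *= 2.0f;
    });
}
```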
The hardware scheduling unit
However, from a performance perspective, it is worth knowing that the hardware groups threads into an additional scheduling unit, which NVIDIA hardware calls a “warp” and AMD hardware calls a “wavefront”; other hardware, not on the market at the time of writing, will probably call it something else. If I’d had my way, they would be called a “team” of threads, but I lost that battle.
A “warp” (or “wavefront”) is the most basic unit of scheduling on an NVIDIA (or AMD) GPU. Equivalent definitions include: “the smallest executable unit of code”, “a group that processes a single instruction over all of its threads at the same time”, and “the minimum size of the data processed in SIMD fashion”.
A “warp” currently consists of 32 threads on NVIDIA hardware; a “wavefront” currently consists of 64 threads on AMD hardware. Each vendor may decide to change that, since this whole concept is an implementation detail, and new hardware vendors may come up with other sizes.
Note that on CPU hardware, this most basic level of parallelism is often expressed as a “vector width” (for example, when using the SSE instructions on Intel and AMD processors). The vector width is characterized by its total number of bits, which you can populate with, for example, a given number of floats or a given number of doubles. Current CPU vector widths are narrower than those of GPU hardware.
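To make the CPU analogy concrete, here is a tiny sketch using SSE intrinsics (my own example): a 128-bit SSE register holds four floats, and a single instruction operates on all four lanes at once, much like a warp/wavefront does across its threads, just at a much smaller width.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// One 128-bit register holds 4 floats; _mm_add_ps performs
// 4 additions with a single instruction.
void add4(const float* a, const float* b, float* out) {
    __m128 va = _mm_loadu_ps(a);     // load 4 floats from a
    __m128 vb = _mm_loadu_ps(b);     // load 4 floats from b
    __m128 vr = _mm_add_ps(va, vb);  // 4 lane-wise additions at once
    _mm_storeu_ps(out, vr);          // store 4 results
}
```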
So, without going to the undesirable extreme of tying your implementation to a specific card, a specific family of cards, or a specific hardware vendor’s cards, how can you easily use this information?
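One simple, portable tactic (my own suggestion, not something mandated by C++ AMP) is to pick a tile size that is a multiple of 64: such tiles divide evenly into warps of 32 and wavefronts of 64 alike, so no tile leaves a partially filled scheduling unit on either vendor’s current hardware. A sketch:

```cpp
#include <amp.h>
using namespace concurrency;

// 64 is a multiple of the current warp size (32) and equal to the
// current wavefront size (64), so tiles of this size fill the
// hardware scheduling unit exactly on both vendors' current cards.
static const int tile_size = 64;  // hypothetical choice for illustration
static_assert(tile_size % 64 == 0, "keep tiles warp/wavefront friendly");

void increment_all(const array_view<float, 1>& av) {
    // Assumes av.extent is evenly divisible by tile_size;
    // real code would pad() or handle the remainder.
    parallel_for_each(av.extent.tile<tile_size>(),
        [=](tiled_index<tile_size> tidx) restrict(amp)
    {
        av[tidx.global] += 1.0f;
    });
}
```

Another tactic is the subject of the next section.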
Avoid having diverged warps/wavefronts
Note: below, every occurrence of the term “warp” can be replaced with “wavefront” without changing the meaning of the text; I am just using the shorter of the two terms.
All the threads in a warp execute the same instruction in lock-step; the only difference is the data each thread operates on. So if your code does anything that prevents all the threads in a warp from executing the same instruction, some threads in the warp will be diverged during the execution of that instruction, and you’d be leaving some compute power on the table.
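Here is a sketch of my own contrasting the two cases in C++ AMP. In the first branch, even and odd threads sit side by side in the same warp, so the hardware must execute both paths, masking off part of the warp each time. In the second, the condition is uniform across each aligned group of 64 threads, so on today’s 32- and 64-wide hardware every thread in a warp/wavefront takes the same path:

```cpp
#include <amp.h>
using namespace concurrency;

// Assumes av.extent is a multiple of the 64-thread tile size.
void divergence_demo(const array_view<float, 1>& av) {
    parallel_for_each(av.extent.tile<64>(),
        [=](tiled_index<64> tidx) restrict(amp)
    {
        int i = tidx.global[0];

        // Diverged: neighboring threads disagree, so the warp runs
        // both branches, with half its threads masked off each time.
        if (i % 2 == 0) av[i] += 1.0f;
        else            av[i] -= 1.0f;

        // Not diverged (on current 32/64-wide hardware): the condition
        // is uniform across each aligned group of 64 threads, so every
        // thread in a warp/wavefront takes the same branch.
        if ((i / 64) % 2 == 0) av[i] *= 2.0f;
        else                   av[i] *= 0.5f;
    });
}
```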