Should I wait for the intel haswell processor to build a gaming rig?

iCyborg · Apr 12, 2012

BenchPress said:
There are many reasons why GPUs suck at generic computing (pardon my French). For starters, they don't have out-of-order execution, so when a thread has to access memory, it can stall for hundred of clock cycles.

Not the best comparison: GTX 680 was made for gaming with some clear sacrifices on the compute side. Why ignore GCN or previous nV 5xx series?
E.g. a $300 i7 3820 loses to a $150 Radeon 7770 in the same benchmark, and a $1000 3960X or a $600 3930K loses to a $250 7850. Also, putting 2x CPUs will require server MBs and server CPUs making it quite a bit more pricey, while two and even more GPUs aren't much of a problem.

BenchPress said:
Also, GPUs don't have large caches to avoid having to go to memory in the first place. Furthermore, there's a round-trip delay from sending a task to the GPU and reading back the result, and you have to go through several layers of driver software. With AVX2 on the CPU, the input and output are right where you want them.

Yes, they don't have caches, but 7970's mem bandwidth is 264GB/s, almost ~10x dual channel ddr3-1600.
Also, most of these don't apply to APUs, theoretically Trinity has bigger *potential* compute advantages than Haswell. And it's based on NI, Kaveri which will come around Haswell time will be based on more compute-friendly SI. Generally, I wouldn't count out heterogeneous computing either: Haswell is making strides with AVX2 towards GPUs, but GPUs are making strides too: support for full C++, unified address space, user-mode scheduling, context switching etc.

taltamir · Apr 12, 2012

amenx said:
Anyone considering an Intel CPU upgrade should be mindful of whether its a tick or tock. A tick is a minor improvement, a die shrink that doesnt usually bring much of a performance gain, but will have reduced TDP. A tock is major, it is a new arch that brings more significant improvements and performance gains. IB is a tick, Haswell is a tock.

Power savings bring major money savings

IntelUser2000 · Apr 12, 2012

BenchPress said:
So it's best to just let GPUs do what they do best: graphics. Making them do anything else requires big compromises. With the CPU on the other hand it takes surprisingly little changes to turn it into a high-throughput device, without sacrificing any of its qualities. The result, is Haswell.

I think you are way too positive on CPUs for throughput computing. You sound like the other guy that is absolutely sure on CPU replacing even graphics workloads.

If there's a reason why CPU might be better suited, could be because in lot of cases high FPU throughput and lot of threads aren't always the best approach. Now I'm not saying that makes CPU better, it may be that workloads will be always split into two, one better suited for CPU than GPU(and vice versa).

Let's be mindful that GTX680 isn't the best example on there. The 7970 is doing far better, and we haven't seen Tesla versions of GTX680 which doesn't compromise GPGPU performance for gaming frames/second.

BenchPress · Apr 12, 2012

amenx said:
Anyone considering an Intel CPU upgrade should be mindful of whether its a tick or tock. A tick is a minor improvement, a die shrink that doesnt usually bring much of a performance gain, but will have reduced TDP.

There is absolutely no guarantee that a process shrink will reduce TDP. In fact when things get smaller, the variability increases, which means some transistors leak more and this has an effect on the entire power consumption when not compensated for by other means.

Also, chip manufacturers can always choose to use the new process technology to alter various parameters which affect top clock speed, density, power consumption at low speed, and power consumption at high speed. There are a lot of possibilities.

It looks like with 22 nm Intel is specifically choosing not to lower power consumption for the high clock frequency parts, but to tweak the process to particularly get low power consumption at lower clock frequencies. This benefits the mobile parts, while keeping things fairly constant for the high-end parts. It makes it harder to overclock, but this is really of lesser concern to Intel.

BenchPress · Apr 12, 2012

iCyborg said:
Not the best comparison: GTX 680 was made for gaming with some clear sacrifices on the compute side. Why ignore GCN or previous nV 5xx series?

That's easy. Because game developers will ignore GPGPU as long as a significant portion of GPUs suck at it! And it looks like it's going to stay that way for a long time. It is critical to realize that GTX 680 is NVIDIA's flagship product and yet it loses against a quad-core CPU! So it's really irrelevant how well AMD's GPUs are doing. It's the law of the lowest common denominator, and NVIDIA has turned back the clock many years.

CPUs on the other hand are only getting better at throughput computing. And they don't suffer from round-trip delays like the discrete GPUs do. So game developers can reliably invest their time into developing algorithms for multi-core CPUs with wide SIMD. It's guaranteed to run better on next-gen hardware.

Yes, they don't have caches, but 7970's mem bandwidth is 264GB/s, almost ~10x dual channel ddr3-1600.

Bandwidth is irrelevant when you need low latency. Also, the CPU doesn't need such high bandwidth because most accesses hit the cache.

Haswell is making strides with AVX2 towards GPUs, but GPUs are making strides too: support for full C++, unified address space, user-mode scheduling, context switching etc.

Yeah, but CPUs support all these things for many decades, and GPUs still haven't caught up. Things like "full C++ support" only exist on paper. In reality when you try recursion it exhausts the tiny caches in just a few iterations. Worthless.

Note that there are so many difference GPU compute models that it's a nightmare to get good results out of each of them. So it's far more attractive for (game) developers to look into AVX2 because even if it's not supported their software will still run fine, just slower. For GPGPU there is no such guarantee since the older ones will simply not run certain algorithms, in any shape or form.

Edrick · Apr 12, 2012

BenchPress said:
Yeah, but CPUs support all these things for many decades, and GPUs still haven't caught up.

GPUs have supported FMA for many years now and Intel is only catching up in 2013.

I agree with pretty much everything you posted so far on this subject, except for this. CPUs are trying to be more GPU-like and GPUs are trying to be more CPU-like. They both are playing catchup in certain areas.

BenchPress · Apr 12, 2012

IntelUser2000 said:
I think you are way too positive on CPUs for throughput computing. You sound like the other guy that is absolutely sure on CPU replacing even graphics workloads.

He actually might be right about that in the very, very long run, for specific markets. AVX2 is really only the beginning of serious throughput computing for the CPU. It can scale to 1024-bit, and some 'dedicated' instructions for graphics could be added. It would basically be a homogeneous APU, and that might actually already be the direction AMD is planning to take (mixing GCN code into x86). Anyway it's pretty pointless to speculate about such a potential long-term future, unless for 'academic' purposes. Haswell on the other hand is a real product so discussing the impact of AVX2 and TSX seems far more relevant...

If there's a reason why CPU might be better suited, could be because in lot of cases high FPU throughput and lot of threads aren't always the best approach. Now I'm not saying that makes CPU better, it may be that workloads will be always split into two, one better suited for CPU than GPU(and vice versa).

Sure, I'm not saying the CPU is better in all cases. If a 'generic' workload is very similar to a graphics workload, it will be better to use a compute shader. But it's interesting that things are evolving in favor the CPU more so than the GPU. The GTX 680 is less well suited for complex GPGPU, and Haswell will be a leap forward in raw throughput for the CPU. In other words, fewer workloads become suited for the GPU, and more workloads become suited for the CPU.

Let's be mindful that GTX680 isn't the best example on there. The 7970 is doing far better, and we haven't seen Tesla versions of GTX680 which doesn't compromise GPGPU performance for gaming frames/second.

That's great, but utterly irrelevant. The GTX 680 represents a step backwards for GPGPU from a major manufacturer. Game developers are not going to rely heavily on generic GPU computing if a significant percentage of consumers with the latest hardware get abysmal performance. Let me say that again: newer hardware, worse performance! I cannot stress enough how damaging this will be to the mainstream adoption of GPGPU technology. AMD's progress is a wasted effort because most developers won't write for just one platform.

Instead, they'll turn to the CPU since it's making steady progress and there's backward compatibility. Anyone buying Haswell will be rewarded with a nice speedup, the way it's meant to be (played). :biggrin:

BenchPress · Apr 12, 2012

Edrick said:
GPUs have supported FMA for many years now and Intel is only catching up in 2013. I agree with pretty much everything you posted so far on this subject, except for this. CPUs are trying to be more GPU-like and GPUs are trying to be more CPU-like. They both are playing catchup in certain areas.

I don't think we're talking about the same thing here. GPUs still have a lot of catching up to do to even become capable of running certain things at all. It is physically impossible for them to run all of the workloads a CPU can run, and it gets worse with older ones. The current lack of FMA on Intel CPUs on the other hand isn't a compatibility issue at all since it can be replaced by a multiplication and addition. You'll lose some performance, but not a whole lot. An older generation GPU will instead result in a 100% reduction in performance, which is obviously unacceptable and so GPGPU won't be a viable option for many years to come. And I'm not even talking about the Kepler performance degradation screwup.

AVX2 and FMA are obviously awesome, but it's equally important that when developers choose to support it they can still fall back to older instructions when its not available, and still get reasonable performance. With the GPU there is no such guarantee, and performance can be worse on newer hardware!

CPUs are trying to be more GPU-like and GPUs are trying to be more CPU-like. They both are playing catchup in certain areas.

The GTX 680 is not catching up with the CPU. It's a step backward.

Tuna-Fish · Apr 12, 2012

BenchPress said:
That's great, but utterly irrelevant. The GTX 680 represents a step backwards for GPGPU from a major manufacturer.

No. As a chip it's the successor to the GF114, and is faster at compute than it.

tweakboy · Apr 12, 2012

BenchPress said:
Please take a look at the AVX2 and TSX technology links provided above. They can provided way higher performance than a 5-15 % increase.

Probably not with today's games, no. But next year new consoles will be launched, and the new generation of games that come with it will push your CPU to the limits.

You say the CPU plays a small role yet you do recommend a quad-core? :\

You probably recommended a dual-core a couple years back, since "you won't notice a difference". Anyone following that advice now regrets it because now you and everyone else recommends a quad-core. Likewise, I think you need to realize that Haswell is still a year out, and games that benefit from its technology simply don't exist yet. But they'll exist soon enough, and you don't want to be stuck with an Ivy Bridge CPU and motherboard when that happens.

Your advice is only an argument why one should not upgrade to Ivy Bridge. You won't notice the difference compared to Sandy Bridge or even older than that. Instead, wait for something that does make a difference...

I absolutely agree that there hasn't been a whole lot of progress in the last several years since the first quad-cores. It has all been small evolutionary steps. And Ivy Bridge is yet another one of those evolutionary steps.

But Haswell is different. It takes a revolutionary leap ahead. AVX2 is the first SIMD extension that is really suitable for high throughput SPMD processing. And TSX enables more aggressive fine-grained multi-threading that uses all your cores.

Nice post my friend, you are right...

bronxzv · Apr 12, 2012

Edrick said:
GPUs have supported FMA for many years now and Intel is only catching up in 2013.

IA-64 was featuring FMA since Merced in 2001, also Intel is a key contributor of IEEE 754-2008 (FMA standardisation) and, btw, at the origin of IEEE 754-1985

iCyborg · Apr 12, 2012

BenchPress said:
That's easy. Because game developers will ignore GPGPU as long as a significant portion of GPUs suck at it! And it looks like it's going to stay that way for a long time. It is critical to realize that GTX 680 is NVIDIA's flagship product and yet it loses against a quad-core CPU! So it's really irrelevant how well AMD's GPUs are doing. It's the law of the lowest common denominator, and NVIDIA has turned back the clock many years.

First of all, game developers tend to use GPUs for graphics.
And if they developed for the lowest common denominator, then the games would target Intel IGP that holds like 55-60% of the market. Most of which are not HD3000. So, yes, it is relevant, because you're looking at it very one-sidedly citing a single example while treating complete product lineups as irrelevant.

BenchPress said:
CPUs on the other hand are only getting better at throughput computing. And they don't suffer from round-trip delays like the discrete GPUs do. So game developers can reliably invest their time into developing algorithms for multi-core CPUs with wide SIMD. It's guaranteed to run better on next-gen hardware.

Well, OpenCL programs are also guaranteed to run better on next-gen hardware. And as mentioned earlier: GPUs are only getting better with general purpose programming too, and APUs/on-die IGPs don't suffer from these round-trip delays either.

BenchPress said:
Bandwidth is irrelevant when you need low latency. Also, the CPU doesn't need such high bandwidth because most accesses hit the cache.

That wasn't the point: yes, CPUs don't need such high bandwidth, but the vice versa is also true: GPUs have less need for large caches because of the high bandwidth as well as the generally memory intensive nature of the apps currently suitable for them that would not benefit greatly from caches, at least not from a die size per performance perspective.

BenchPress said:
Yeah, but CPUs support all these things for many decades, and GPUs still haven't caught up. Things like "full C++ support" only exist on paper. In reality when you try recursion it exhausts the tiny caches in just a few iterations. Worthless.

Yes, that's an example where GPU is inferior to CPU. An opposite example is bitcoin where heavily OC-ed 6-core SB-E is on par with a Radeon 4770, a mid-range 3 generations old GPU. Worthless.

Of course that GPUs aren't meant to do everything as well as CPUs nor replace them and there are many areas where they will never catch up. But there are many areas where CPUs will remain well behind GPUs in the foreseeable future, with or without AVX2, and these areas aren't just graphics as you claim.

BenchPress said:
Note that there are so many difference GPU compute models that it's a nightmare to get good results out of each of them. So it's far more attractive for (game) developers to look into AVX2 because even if it's not supported their software will still run fine, just slower. For GPGPU there is no such guarantee since the older ones will simply not run certain algorithms, in any shape or form.

This is not true: OpenCL targets both CPUs and GPUs. Depending on the app, it will run faster or slower on each of them. So if it runs better on GPUs and there's a GPU available, it can utilize it, if not, it will just run slower on a CPU(*). And an OpenCL program should benefit from Haswell's AVX2 too.

* assumes the app isn't discriminatory against CPUs

BenchPress · Apr 12, 2012

Tuna-Fish said:
No. As a chip it's the successor to the GF114, and is faster at compute than it.

Where can I buy one for 150$ ?

tweakboy · Apr 12, 2012

Edrick said:
Proof? Oh wait, you never have any proof for the sillyness that comes out of your mouth.

I typed it my mouth was closed the whole time. You don't like me don't read me. thank you

:biggrin:

Edrick · Apr 12, 2012

tweakboy said:
I typed it my mouth was closed the whole time. You don't like me don't read me. thank you

:biggrin:

Not a matter of liking you or not. It is a matter of you passing your OPINION as fact when it is clearly not, and without any proof at all.

Edrick · Apr 12, 2012

BenchPress said:
The GTX 680 is not catching up with the CPU. It's a step backward.

True, but "big kepler" will be much different. In fact, I think GK110 will match or even beat Intel's Knights Corner. But you are right, GPUs have a long way to go before they will catch up to CPUs, but you can not argue that they have been trying. Project Denver is also a step in that direction.

Edrick · Apr 12, 2012

bronxzv said:
IA-64 was featuring FMA since Merced in 2001, also Intel is a key contributor of IEEE 754-2008 (FMA standardisation) and, btw, at the origin of IEEE 754-1985

True. I guess I should have said Intel's non-Itanium CPUs then

taltamir · Apr 12, 2012

BenchPress said:
The GTX 680 is not catching up with the CPU. It's a step backward.

This is reminding of "evolution level" misconception.
The GTX680 is superior to its predecessor because it discarded unnecessary junk even if this makes it "dumber" (many creatures evolved smaller brains for various reasons).
The GTX780 might be a return to kepler and ALSO be an improvement if at that time GPGPU becomes relevant for video gamers. Sometimes an improvement is removing a feature that is too costly and not useful/needed at the moment.

BenchPress · Apr 12, 2012

iCyborg said:
First of all, game developers tend to use GPUs for graphics.
And if they developed for the lowest common denominator, then the games would target Intel IGP that holds like 55-60% of the market. Most of which are not HD3000. So, yes, it is relevant, because you're looking at it very one-sidedly citing a single example while treating complete product lineups as irrelevant.

Please don't ignore the context. You have to look at the lowest common denominator in each class of hardware. If a game developer decides to target today's high-end hardware, he can't ignore the fact that NVIDIA's latest architecture is far worse at complex GPGPU workloads than AMD's.

Well, OpenCL programs are also guaranteed to run better on next-gen hardware.

You're very wrong about that:

Make no mistake about it. This is NVIDIA saying that mainstream GPGPU computing is a lost cause. There's just too many compromises to graphics performance to make a GPU efficient at complex generic computing. And gamers don't buy a GPU for compute. They buy it for graphics.

NVIDIA also realizes that it can't stop the progress of CPUs in throughput computing. Multi-core, wide SIMD, gather and FMA all used to be technology exclusive to the GPU. Not any more. It would be a losing battle for NVIDIA to try to keep up with GPGPU because they'd need ever more advanced scheduling and bigger caches, sacrificing graphics performance and thus their main selling point. It's suicide. And hence GPGPU has no future in gaming.

IGPs don't suffer from these round-trip delays either.

Indeed, but again it's a matter of lowest common denominator (among the high-end hardware). If someone with a GTX 680 and a 6-core CPU can't run a game in full glory, there's something seriously wrong with the choices the developer made. You can't blame this gamer for not buying a CPU with an IGP instead.

An IGP will always remain an optional feature. AVX2 on the other hand will be supported by Haswell and everything that follows it. So game developers can safely invest into using this technology. And reversely it's a safe bet for gamers to buy a newer CPU and expect better performance.

That wasn't the point: yes, CPUs don't need such high bandwidth, but the vice versa is also true: GPUs have less need for large caches because of the high bandwidth as well as the generally memory intensive nature of the apps currently suitable for them that would not benefit greatly from caches, at least not from a die size per performance perspective.

So what was your point then exactly? GPUs have 10x the bandwidth but you admit CPUs don't need it. And no, high bandwidth doesn't compensate for a lack of large caches. Caches provide two things: lowering the bandwidth to RAM, and lowering the average latency. So without large caches you need extra bandwidth, and discrete GPUs have no trouble providing that, but you can't get lower latency any other way. And this is what makes GPUs much less suited for generic computing.

You see, for graphics it doesn't matter if it takes the GPU 16 milliseconds to show the result on screen. You're still getting 60 FPS. In fact the GPU can even lag behind multiple frames. Graphics is a one-way stream and the results aren't read back by the CPU. For game logic however you need the results of your calculations ASAP, often in a matter of microseconds. But GPU threads are horrendously slow. They constantly stall hundreds of cycles and the only reason the GPU gets any work done at all is because it has hundreds of threads.

Yes, that's an example where GPU is inferior to CPU. An opposite example is bitcoin where heavily OC-ed 6-core SB-E is on par with a Radeon 4770, a mid-range 3 generations old GPU. Worthless.

Has it cured cancer yet?

My point wasn't to show an example of where the GPU is just slower than the CPU. My point was to show an example of where a GPU crashes and burns. It simply cannot do deep recursion. And that's for the latest and greatest. It gets worse really fast when looking at somewhat older hardware. And my point with that is that it's very unlikely for GPGPU workloads to run properly on a sufficiently wide range of hardware. People highly prefer sacrificing some performance over not supporting something at all. So it's safer for developers to do the computing on the CPU and take advantage of AVX2 when available.

Of course that GPUs aren't meant to do everything as well as CPUs nor replace them and there are many areas where they will never catch up. But there are many areas where CPUs will remain well behind GPUs in the foreseeable future, with or without AVX2, and these areas aren't just graphics as you claim.

Care to give me some examples?

This is not true: OpenCL targets both CPUs and GPUs. Depending on the app, it will run faster or slower on each of them. So if it runs better on GPUs and there's a GPU available, it can utilize it, if not, it will just run slower on a CPU(*). And an OpenCL program should benefit from Haswell's AVX2 too.

Sure, but OpenCL doesn't support recursion, no function pointers, no bitfields, no variable length arrays, no variadic functions, etc. Also, you have to actually explicitly rewrite your code for it. With AVX2, you can let the compiler do all the work and your're not restricted by anything.

tweakboy · Apr 12, 2012

Edrick said:
Not a matter of liking you or not. It is a matter of you passing your OPINION as fact when it is clearly not, and without any proof at all.

Sorry, what was the opinion your talking about. thx

tc :|

iCyborg · Apr 12, 2012

BenchPress said:
Please don't ignore the context. You have to look at the lowest common denominator in each class of hardware. If a game developer decides to target today's high-end hardware, he can't ignore the fact that NVIDIA's latest architecture is far worse at complex GPGPU workloads than AMD's.

It's not worse. It's a widely accepted fact that GK104 was not meant to be GTX 580's successor. GTX680 is more like GTX 660/670 Superclocked. If Big Kepler is far slower in compute than GTX 580 or 7970, then I will accept that the new arch is far worse.

BenchPress said:
You're very wrong about that:

Make no mistake about it. This is NVIDIA saying that mainstream GPGPU computing is a lost cause. There's just too many compromises to graphics performance to make a GPU efficient at complex generic computing. And gamers don't buy a GPU for compute. They buy it for graphics.

NVIDIA also realizes that it can't stop the progress of CPUs in throughput computing. Multi-core, wide SIMD, gather and FMA all used to be technology exclusive to the GPU. Not any more. It would be a losing battle for NVIDIA to try to keep up with GPGPU because they'd need ever more advanced scheduling and bigger caches, sacrificing graphics performance and thus their main selling point. It's suicide. And hence GPGPU has no future in gaming.

Why do you keep focusing only on gaming? I already said that most resource intensive games are mainly graphics limited, trying to shift even more work to GPU makes no sense. Stuff that GPGPU is used for is: http://en.wikipedia.org/wiki/GPGPU#Applications
A pretty long list, with no gaming on it (arguably some items from the list could be used in games)...

And as mentioned, we should wait for BigK to make conclusions. Otherwise I can also use some low/mid-range Haswell and compare it to a 6-core IB-E and conclude that it's not true that next gen hardware with wider SIMD is guaranteed to be faster... Or i7 990X vs i5 2400 with AVX if you prefer current models.

BenchPress said:
So what was your point then exactly? GPUs have 10x the bandwidth but you admit CPUs don't need it. And no, high bandwidth doesn't compensate for a lack of large caches. Caches provide two things: lowering the bandwidth to RAM, and lowering the average latency. So without large caches you need extra bandwidth, and discrete GPUs have no trouble providing that, but you can't get lower latency any other way. And this is what makes GPUs much less suited for generic computing.

My point is that if you have to process 1GB of data in a short time, your fast low-latency 6MB cache will not be a deal breaker. No one is saying that in 2 years you will have a choice of using GTX 880 without a CPU, and no one bothers using GPUs for small loads.

BenchPress said:
You see, for graphics it doesn't matter if it takes the GPU 16 milliseconds to show the result on screen. You're still getting 60 FPS. In fact the GPU can even lag behind multiple frames. Graphics is a one-way stream and the results aren't read back by the CPU. For game logic however you need the results of your calculations ASAP, often in a matter of microseconds. But GPU threads are horrendously slow. They constantly stall hundreds of cycles and the only reason the GPU gets any work done at all is because it has hundreds of threads.

Again with the game logic... If you need to do a heavy computation that will require billions of FP operations, your AVX2 CPU will not give you back the result in microseconds either.
And, btw, why do you need results of computations back in microseconds if there's no way to present them to the user until the next frame?
Any CPU context switch wastes thousands of cycles (OK, hundreds if between 2 logical HT cores), and these happen quite often too.
The last sentence: it's like saying that cars are much faster than 18-wheelers, and the only reason these are used is because they can carry lots of freight. Of course the GPU's main advantage is in the large number of compute units, and that individual units are nowhere near an x86 core.

BenchPress said:
My point wasn't to show an example of where the GPU is just slower than the CPU. My point was to show an example of where a GPU crashes and burns. It simply cannot do deep recursion. And that's for the latest and greatest. It gets worse really fast when looking at somewhat older hardware. And my point with that is that it's very unlikely for GPGPU workloads to run properly on a sufficiently wide range of hardware. People highly prefer sacrificing some performance over not supporting something at all. So it's safer for developers to do the computing on the CPU and take advantage of AVX2 when available.

And deep recursion will benefit from AVX2 how exactly? Or say something like simulating Markov chains with very simple state transitions (next state relies only on the previous, so need to compute them one after the other, can't do in parallel)?

There's one example given: hashing in bitcoin.

BenchPress said:
Sure, but OpenCL doesn't support recursion, no function pointers, no bitfields, no variable length arrays, no variadic functions, etc. Also, you have to actually explicitly rewrite your code for it. With AVX2, you can let the compiler do all the work and your're not restricted by anything.

It will in time, take a look at C++ AMP e.g.. You said a lot of this is only on paper now. Well, I don't see Haswells out either, we're still waiting for IB to show up.

BenchPress · Apr 12, 2012

iCyborg said:
It's not worse. It's a widely accepted fact that GK104 was not meant to be GTX 580's successor. GTX680 is more like GTX 660/670 Superclocked. If Big Kepler is far slower in compute than GTX 580 or 7970, then I will accept that the new arch is far worse.

Still irrelevant. A large number of people will buy cards using the GK104 architecture, so its absymal GPGPU performance is something that will inevitably steer game developers away from using such technology. And even if Big Kepler is much better at compute, gamers are more likely to buy dual GK104 instead since it will offer superior gaphics performance. And it's doubtful that Big Kepler has a radically different architecture anyway.

Why do you keep focusing only on gaming?

Because that's the topic of this thread. You're welcome to start a more generic thread about GPGPU versus CPU computing if you like...

I already said that most resource intensive games are mainly graphics limited, trying to shift even more work to GPU makes no sense.

Exactly! Hence its worthwhile to wait for Haswell if it's games you care about. They will no doubt be among the first to take advantage of AVX2 and TSX.

Stuff that GPGPU is used for is: http://en.wikipedia.org/wiki/GPGPU#Applications
A pretty long list, with no gaming on it (arguably some items from the list could be used in games)...

Yawn. Only a fraction of those could be of interest to consumers. It's the very same reason why Intel demoted Larrabee to research and academics. There's just no big demand for a device that is great for generic throughput computing workloads but mediocre for rasterization graphics.

AVX2 and TSX on the other hand are applicable to all software, so it will have a far greater impact on the future of consumer software.

My point is that if you have to process 1GB of data in a short time, your fast low-latency 6MB cache will not be a deal breaker.

Don't jump to conclusions. For every bit of input data, there will be several accesses to temporary and constant data. And they can all reside in the cache instead of requiring additional RAM accesses. This is the case for just about any useful algorithm out there.

Also note that GPUs went from doing graphics in multiple passes, applying one texture at a time and reading and writing the frame buffer for each of them, to using programmable shaders where temporary results are stored in massive register files. So these register files do reduce the GPU's bandwidth needs, but they're only really efficient for graphics like workloads where the working set per thread is tiny and fixed. Generic computing workloads more often than not exceed the number of registers the GPU can optimally accommodate, and would benefit from true caches which adapt to any needs.

That's exactly what the CPU has. It is far more adaptable to various workloads, and it's the reason why a quad-core CPU can be faster at OpenCL than a GTX 680, even though it's still lacking 256-bit SIMD, fused-multiply-add, and gather! So I see a much brighter future for high throughput CPU processing than for GPGPU.

If you need to do a heavy computation that will require billions of FP operations, your AVX2 CPU will not give you back the result in microseconds either.

Actually it will. Nobody says you need to wait for the full result. You can split that huge billion operation task into smaller tasks and use the results of each one that has finished processing before the whole thing completes. That's only possible because AVX2 is part of the x86 ISA so you can seamlessly hop between different code, and because TSX enables fast synchronization of small tasks.

This is impossible for GPGPU. There is considerable overhead for trying to split things into smaller tasks. So you're stuck between a rock and a hard place.

And, btw, why do you need results of computations back in microseconds if there's no way to present them to the user until the next frame?

Because of dependencies. The GPU is great for performing few operations on many objects (like pixels), but lousy at performing many operations on few objects.

Any CPU context switch wastes thousands of cycles (OK, hundreds if between 2 logical HT cores), and these happen quite often too.

Sure, which is why you want to avoid it and use thread pools (preferably system global like GCD) to make it much less of an issue. TSX will help greatly in streamlining this as well.

And deep recursion will benefit from AVX2 how exactly?

I didn't say AVX2 would benefit deep recursion. It is orthogonal to the CPU's already excellent support of recursion.

It will in time, take a look at C++ AMP e.g.

AMP isn't any better:
- No support for char or short types, and some bool limitations apply as well.
- No support for pointers to pointers.
- No pointers in compound types.
- No casting between integers and pointers.
- No support for bitfields.
- No variable argument functions.
- No virtual functions, function pointers, or recursion.
- No support for exceptions.
- No goto statements.

You said a lot of this is only on paper now. Well, I don't see Haswells out either, we're still waiting for IB to show up.

Not the same thing. Some GPUs support recursion on paper, but in practice they fail after a few iterations. AVX2 and TSX on the other hand will be extremely useful in practice.

iCyborg · Apr 12, 2012

BenchPress said:
Because that's the topic of this thread. You're welcome to start a more generic thread about GPGPU versus CPU computing if you like...

OK, I took the part I originally quoted out of context.

BenchPress said:
Exactly! Hence its worthwhile to wait for Haswell if it's games you care about. They will no doubt be among the first to take advantage of AVX2 and TSX.

In the light of your agreeing that games are mainly GPU limited, and CPUs not being a bottleneck, why is it worthwhile to wait for Haswell then? Most games barely use more than 2 cores, so widening SIMD for more parallelism doesn't sound like must-have to me.
And you could add 2nd GPU as dedicated GPGPU, sort of like some people have GPUs for PhysX.

BenchPress said:
Yawn. Only a fraction of those could be of interest to consumers. It's the very same reason why Intel demoted Larrabee to research and academics. There's just no big demand for a device that is great for generic throughput computing workloads but mediocre for rasterization graphics.

AVX2 and TSX on the other hand are applicable to all software, so it will have a far greater impact on the future of consumer software.

One of the main reasons is that it's been quite cumbersome to program with OpenCL or CUDA, but that is changing.

I looked at one class at work that is performance critical, and out of 8 loops it has, not one can be auto-vectorized (mostly function calls for i'th object, or search/modify container which must be done one by one). One sort of looks like it could since it is an iteration through 32 items and we do for each index the same operation, but this operation is setting a bit in a 64-bit bitmap, and all 32 operations on the same 64-bit bitmap.

In short, I think you're over-optimistic about the amount of code that will see benefits from AVX2.

Edit: I actually, I lied, there is a memset over 32 bytes that would have a ~2x speedup (memset already uses 128-bit MMX registers), but that part is done only once per executable lifetime.

BenchPress said:
Don't jump to conclusions. For every bit of input data, there will be several accesses to temporary and constant data. And they can all reside in the cache instead of requiring additional RAM accesses. This is the case for just about any useful algorithm out there.

The register files that you mention below are 256KB per compute unit for GCN (this is 8MB for Tahiti XT, more than L3 on 2500K), plus 16KB L1 per CU, texture cache etc. I think this is enough for a couple of important constants and supporting data.
And if you need to process a large dataset, you'll have lots of cache misses anyway.

BenchPress said:
That's exactly what the CPU has. It is far more adaptable to various workloads, and it's the reason why a quad-core CPU can be faster at OpenCL than a GTX 680, even though it's still lacking 256-bit SIMD, fused-multiply-add, and gather! So I see a much brighter future for high throughput CPU processing than for GPGPU.

Sure. We just need to label GCN irrelevant and pretend it doesn't exist. Or APUs.

BenchPress said:
Actually it will. Nobody says you need to wait for the full result. You can split that huge billion operation task into smaller tasks and use the results of each one that has finished processing before the whole thing completes. That's only possible because AVX2 is part of the x86 ISA so you can seamlessly hop between different code, and because TSX enables fast synchronization of small tasks.

This is impossible for GPGPU. There is considerable overhead for trying to split things into smaller tasks. So you're stuck between a rock and a hard place.

There's a disconnect here: I am assuming here that you need to complete all those to produce a final result, not that you can retrieve individual results from ALU registers much faster...
I'm not sure I follow with this considerable overhead: the whole point of GPGPU is to split a large task into a large number of small tasks, much larger than for CPUs? If this was impossible for GPUs, why would anyone ever use them for anything?

BenchPress said:
Because of dependencies. The GPU is great for performing few operations on many objects (like pixels), but lousy at performing many operations on few objects.

If you have heavy dependencies, then your code is not massively parallel by definition. We all know GPUs will not perform well there nor were they designed for that.

BenchPress said:
Sure, which is why you want to avoid it and use thread pools (preferably system global like GCD) to make it much less of an issue. TSX will help greatly in streamlining this as well.

I didn't say AVX2 would benefit deep recursion. It is orthogonal to the CPU's already excellent support of recursion.

1. Well GPUs also have thread schedulers and various other mechanisms of mitigating expensive stalls.
2. You said: "People highly prefer sacrificing some performance over not supporting something at all. So it's safer for developers to do the computing on the CPU and take advantage of AVX2 when available."
And then you gave deep recursion? If you have deep recursion, you'll do it on CPU, not sure where's a sacrifice here.

BenchPress said:
Not the same thing. Some GPUs support recursion on paper, but in practice they fail after a few iterations. AVX2 and TSX on the other hand will be extremely useful in practice.

Again, not all C++ is currently supported, this serves to point that steps are taken in the right direction, and the plan is for full C++ support in the next couple years, I remember seeing that in some slide.

BenchPress · Apr 16, 2012

iCyborg said:
In the light of your agreeing that games are mainly GPU limited, and CPUs not being a bottleneck, why is it worthwhile to wait for Haswell then? Most games barely use more than 2 cores, so widening SIMD for more parallelism doesn't sound like must-have to me.

I've already covered that:
1) A new generation of games will be released when the next consoles hit the street. Those consoles are likely to have AVX2 support or other powerful SIMD instuctions.
2) In any case your argument supports to not go with Ivy Bridge but wait longer!

One of the main reasons is that it's been quite cumbersome to program with OpenCL or CUDA, but that is changing.

It's always going to be more cumbersome than when you don't have to use an API and a specific language. AVX2 can be used by any programming language of your choice, seamlessly.

I looked at one class at work that is performance critical, and out of 8 loops it has, not one can be auto-vectorized (mostly function calls for i'th object, or search/modify container which must be done one by one).

What about outer loops? And why would search/modify have to be done one by one? There are parallel algorithms for that.

In short, I think you're over-optimistic about the amount of code that will see benefits from AVX2.

And I think you're drawing conclusions from a single anecdotal example. And I doubt you're even right about that one example.

In any case there's more code that can benefit from AVX2 than code which benefits from GPGPU.

The register files that you mention below are 256KB per compute unit for GCN (this is 8MB for Tahiti XT, more than L3 on 2500K), plus 16KB L1 per CU, texture cache etc. I think this is enough for a couple of important constants and supporting data.

The total register space doesn't matter. All that's relevant is how many registers you can use per thread before the occupation drops. For Fermi this happens with just 20 registers.

And yes, AVX only has 16 registers, but spilling to cache is extremely fast and virtually unlimited. The GPU doesn't have such a graceful option. It has to lower the thread count and thus sacrifice performance.

And if you need to process a large dataset, you'll have lots of cache misses anyway.

No, the CPU has one other secret weapon: prefetching. Whenever a strided access pattern is detected, the next data will be fetched into cache before it is explicitly requested. This eliminates a lot of expensive misses.

Sure. We just need to label GCN irrelevant and pretend it doesn't exist. Or APUs.

You're not going to solve traffic jams by inventing a new jet plane. Likewise, I'm not ignoring what AMD brings to the table, but the reality is that a lot of people will equip a gaming system with an NVIDIA card. And they have just set back mainstream GPGPU adoption by many years. Hence, since you also agree that games are GPU bound anyway, game developers are more likely to take advantage of AVX2 than of GPGPU. Note that AMD won't wait very long to support AVX2 as well, and like I said before, there are reliable fallback solution on the CPU but not on the GPU.

There's a disconnect here: I am assuming here that you need to complete all those to produce a final result, not that you can retrieve individual results from ALU registers much faster...

You can't make that assumption. Even if a reduce operation has to be performed, this doesn't have to wait for all the intermediate results.

I'm not sure I follow with this considerable overhead: the whole point of GPGPU is to split a large task into a large number of small tasks, much larger than for CPUs? If this was impossible for GPUs, why would anyone ever use them for anything?

The GPU splits the work up internally. So you still can't read back any partial results to start using them on the CPU side.

So the dilemma is that once you start doing GPGPU, you have to move as much work over to the GPU as possible. But obviously this hurts graphics even more and the CPU has to wait even longer for the results. And if you need recursion or pointer chasing you have no choice but to wait on the GPU and perform those operations on the CPU. There is no such issue with AVX2. You can use intermediate results right away (e.g. on another thread), and recursion and pointer chasing are always supported.

If you have heavy dependencies, then your code is not massively parallel by definition.

Wrong. You can have for instance a neural network with very complex dynamic dependencies, and yet a massive amount of parallelism.

We all know GPUs will not perform well there nor were they designed for that.

Exactly my point. The CPU is very good at dependencies, but lacks parallel throughput and efficient thread synchronization. Both issues will be addressed by Haswell.

So unless you have an absolute ancient system and can't possibly wait any longer, it's clearly well worth waiting for Haswell.

Edrick · Apr 16, 2012

tweakboy said:
Sorry, what was the opinion your talking about. thx

tc :|

tweakboy said:
Look, Haswell is going to be 5 to 15 percent faster then Ivy bridge. I promise you, you wont notice a difference

That one. You have zero proof, yet you are making "promises".

Should I wait for the intel haswell processor to build a gaming rig?

Golden Member

Lifer

Elite Member

Senior member

Senior member

Golden Member

Senior member

Senior member

Golden Member

Diamond Member

Senior member

Golden Member

Senior member

Diamond Member

Golden Member

Golden Member

Golden Member

Lifer

Senior member

Diamond Member

Golden Member

Senior member

Golden Member

Senior member

Golden Member