It's not worse. It's a widely accepted fact that GK104 was not meant to be GTX 580's successor; the GTX 680 is more like a GTX 660/670 Superclocked. If Big Kepler is far slower in compute than the GTX 580 or 7970, then I will accept that the new architecture is far worse.
Still irrelevant. A large number of people will buy cards based on GK104, so its abysmal GPGPU performance will inevitably steer game developers away from relying on that technology. And even if Big Kepler is much better at compute, gamers are more likely to buy dual GK104 instead, since it will offer superior graphics performance. It's doubtful that Big Kepler has a radically different architecture anyway.
Why do you keep focusing only on gaming?
Because that's the topic of this thread. You're welcome to start a more generic thread about GPGPU versus CPU computing if you like...
I already said that most resource-intensive games are mainly graphics limited, so trying to shift even more work to the GPU makes no sense.
Exactly! Hence it's worthwhile to wait for Haswell if it's games you care about. Games will no doubt be among the first to take advantage of AVX2 and TSX.
Here's the stuff GPGPU is actually used for:
http://en.wikipedia.org/wiki/GPGPU#Applications
A pretty long list, with no gaming on it (arguably some items from the list could be used in games)...
Yawn. Only a fraction of those could be of interest to consumers. It's the very same reason why Intel demoted Larrabee to research and academics. There's just no big demand for a device that is great for generic throughput computing workloads but mediocre for rasterization graphics.
AVX2 and TSX, on the other hand, are applicable to all software, so they will have a far greater impact on the future of consumer software.
My point is that if you have to process 1 GB of data in a short time, your fast low-latency 6 MB cache will not save you.
Don't jump to conclusions. For every piece of input data there will be several accesses to temporary and constant data, and those can all be served from the cache instead of requiring additional RAM accesses. This is the case for just about any useful algorithm out there.
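To make that concrete, here's a minimal sketch (the lookup table and the gamma math are made up purely to illustrate the access pattern): the 1 GB input streams through once, while the small constant table and the per-element temporaries stay hot in cache and registers, so they never cost extra RAM bandwidth.

```cpp
#include <cstdint>
#include <cmath>
#include <cstddef>

// Hypothetical example: transform a large buffer through a constant LUT.
// Only input/output stream from RAM; the 256-entry table lives in L1 and
// the temporaries live in registers, so they add no bandwidth pressure.
void transform(const uint8_t* input, float* output, size_t count,
               const float (&lut)[256])      // small constant table, stays cached
{
    for (size_t i = 0; i < count; ++i)
    {
        float t = lut[input[i]];              // constant data: served from cache
        float gamma = std::sqrt(t);           // temporary: lives in a register
        output[i] = gamma * 0.5f + 0.5f;      // only input/output touch RAM
    }
}
```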
Also note that GPUs went from doing graphics in multiple passes, applying one texture at a time and reading and writing the frame buffer for each of them, to using programmable shaders where temporary results are stored in massive register files. Those register files do reduce the GPU's bandwidth needs, but they're only really efficient for graphics-like workloads where the working set per thread is tiny and fixed. Generic computing workloads more often than not exceed the number of registers the GPU can optimally accommodate, and would benefit from true caches, which adapt to whatever the workload needs.
That's exactly what the CPU has. It is far more adaptable to varied workloads, and it's the reason why a quad-core CPU can be faster at OpenCL than a GTX 680, even though it's still lacking 256-bit SIMD, fused multiply-add, and gather! So I see a much brighter future for high-throughput CPU processing than for GPGPU.
If you need to do a heavy computation that will require billions of FP operations, your AVX2 CPU will not give you back the result in microseconds either.
Actually, it will. Nobody says you need to wait for the full result. You can split that huge billion-operation task into smaller tasks and consume the result of each one as it finishes, before the whole thing completes. That's only possible because AVX2 is part of the x86 ISA, so you can seamlessly hop between vectorized and regular code, and because TSX enables fast synchronization of small tasks.
This is much harder with GPGPU: kernel launches and synchronization add considerable overhead when you try to split things into smaller tasks, so you're stuck between a rock and a hard place.
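Here's a rough sketch of what I mean on the CPU side. It uses plain std::async rather than AVX2 or TSX directly (those would live inside the per-chunk work), and all the names and sizes are made up, but it shows how you consume each partial result the moment its task finishes instead of waiting for the whole job:

```cpp
#include <cstdio>
#include <future>
#include <numeric>
#include <vector>

int main()
{
    std::vector<float> data(1 << 24, 1.0f);        // stand-in for the big input
    const size_t chunks = 16;
    const size_t chunkSize = data.size() / chunks;

    // Kick off one task per chunk; each returns its partial sum.
    std::vector<std::future<double>> partials;
    for (size_t c = 0; c < chunks; ++c)
    {
        partials.push_back(std::async(std::launch::async, [&, c] {
            auto begin = data.begin() + c * chunkSize;
            return std::accumulate(begin, begin + chunkSize, 0.0);
        }));
    }

    // Consume each partial result as soon as it's ready, long before
    // the full billion-operation job would have completed.
    for (size_t c = 0; c < chunks; ++c)
        std::printf("chunk %zu done: %f\n", c, partials[c].get());
}
```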
And, btw, why do you need results of computations back in microseconds if there's no way to present them to the user until the next frame?
Because of dependencies. The GPU is great at performing a few operations on many objects (like pixels), but lousy at performing many operations on a few objects.
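A contrived but concrete example of "many operations on few objects", just to illustrate the dependency problem (the function is mine, not from any game): every iteration of Newton's method needs the result of the previous one, so there is nothing to spread across thousands of GPU threads, and fast serial execution is what matters.

```cpp
#include <cstdio>

// Newton's method for sqrt(a): a long chain of dependent operations on a
// single value. Per-step latency dominates; massive parallelism doesn't help.
double newton_sqrt(double a)
{
    double x = a;
    for (int i = 0; i < 30; ++i)
        x = 0.5 * (x + a / x);   // step i+1 depends on the result of step i
    return x;
}

int main()
{
    std::printf("%f\n", newton_sqrt(2.0));
}
```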
Any CPU context switch wastes thousands of cycles (OK, hundreds if between 2 logical HT cores), and these happen quite often too.
Sure, which is why you want to avoid it and use thread pools (preferably system-global ones, like GCD) to make it much less of an issue. TSX will help greatly in streamlining this as well.
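Haswell obviously isn't shipping yet, but Intel has already documented the RTM intrinsics, so a minimal sketch of the kind of fine-grained synchronization TSX could streamline might look like this (the shared counter and function name are mine, purely for illustration):

```cpp
#include <immintrin.h>   // RTM intrinsics: _xbegin/_xend/_xabort (needs RTM support)
#include <atomic>

// Lock-elision sketch: try a hardware transaction first and only fall back
// to a real lock when the transaction aborts. Based on Intel's documented
// RTM intrinsics, not on shipping hardware.
static long counter = 0;
static std::atomic<bool> fallbackLock(false);

void add_result(long partial)
{
    if (_xbegin() == _XBEGIN_STARTED)
    {
        if (fallbackLock.load(std::memory_order_relaxed))
            _xabort(0xff);               // someone holds the real lock: abort and retry via fallback
        counter += partial;              // transactional update, no lock traffic
        _xend();
        return;
    }
    // Fallback path: take the lock the old-fashioned way.
    while (fallbackLock.exchange(true, std::memory_order_acquire))
        ;                                // spin
    counter += partial;
    fallbackLock.store(false, std::memory_order_release);
}
```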
And deep recursion will benefit from AVX2 how exactly?
I didn't say AVX2 would benefit deep recursion. It is orthogonal to the CPU's already excellent support of recursion.
It will in time; take a look at C++ AMP, for example.
AMP isn't any better. Code running under its restrict(amp) specifier comes with a long list of restrictions (see the sketch after this list):
- No support for char or short types, and some bool limitations apply as well.
- No support for pointers to pointers.
- No pointers in compound types.
- No casting between integers and pointers.
- No support for bitfields.
- No variable argument functions.
- No virtual functions, function pointers, or recursion.
- No support for exceptions.
- No goto statements.
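For reference, here's roughly what a trivial C++ AMP kernel looks like; everything inside the restrict(amp) lambda is subject to the restrictions above (the scale function is just an illustrative example, not from any real codebase):

```cpp
#include <amp.h>
#include <vector>
using namespace concurrency;

// Trivial C++ AMP kernel: multiply every element by a factor.
// Everything inside the restrict(amp) lambda is bound by the restrictions
// listed above (no char/short, no function pointers, no virtual calls,
// no recursion, and so on).
void scale(std::vector<float>& data, float factor)
{
    array_view<float, 1> av(static_cast<int>(data.size()), data);
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] *= factor;    // fine: plain float arithmetic
        // char c = 'a';      // would not compile inside restrict(amp)
    });
    av.synchronize();         // copy results back to the vector
}
```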
You said a lot of this is only on paper right now. Well, I don't see Haswells out either; we're still waiting for IB to show up.
Not the same thing. Some GPUs support recursion on paper, but in practice they fail after a few levels of depth. AVX2 and TSX, on the other hand, will be extremely useful in practice.