Originally posted by: evolucion8
But I'm not talking about the P4, I'm talking about the C2D/Athlon 64 generation. Is the Phenom 2 much faster on a per-clock basis than the Athlon 64? Is the Core i7 much faster than the C2Q on a per-clock basis? NO. We've reached a point where it's very hard and very expensive to increase the parallelism inside a CPU (IPC), so the best way forward is going multi-core, and that's what Intel is currently doing with its Core i7 architecture.
In the case of the i7, the IPC improvements were there (e.g. a better TLB, a larger out-of-order window, an integrated memory controller, etc.), but they were partly neutered by the 256 KB of L2 cache per core, compared to 3 MB per core on Penryn.
Of course when 32 nm comes along, Intel will no doubt increase the L2 cache, and it'll not only be faster clock-for-clock but also run at higher clocks. As an example, look at the i5's much higher Turbo Boost on the same manufacturing process as the i7.
I haven't been following the Phenom closely, but IIRC there were some architectural issues that held it back, along with very low initial clock speeds.
In any case, my original point still stands: even if IPC stays the same, a die shrink almost always brings higher clock speeds.
Why didn't Intel make the Core i7 a quad-core/4-thread CPU?
Because HT is very cheap in terms of die cost and design complexity. Most of the hardware is already there; you mainly need to replicate a small amount of per-thread architectural state and add some extra tracking to manage the two threads in flight.
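As a toy illustration of that point (not how real silicon works, and all names below are made up): each logical thread only needs its own small architectural state, while the expensive execution resources are shared, so two instruction streams can interleave on one pipeline:

```python
# Toy model of SMT: two logical threads share one execution pipeline.
# Each thread keeps only its own small architectural state (here, just
# a program counter); the expensive execution resources are shared.

def run_smt(threads):
    """Round-robin issue from each logical thread until all finish."""
    trace = []
    pcs = [0] * len(threads)           # per-thread state: just a PC
    while any(pc < len(t) for pc, t in zip(pcs, threads)):
        for tid, t in enumerate(threads):
            if pcs[tid] < len(t):      # issue one op from this thread
                trace.append((tid, t[pcs[tid]]))
                pcs[tid] += 1
    return trace

# Two instruction streams sharing the "pipeline":
t0 = ["load", "add", "store"]
t1 = ["mul", "sub"]
print(run_smt([t0, t1]))
# Ops from both threads interleave into the shared issue slots.
```

The cheap part is exactly what the sketch shows: adding a second thread only added a second entry in `pcs`, not a second pipeline.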
But that's far-fetched. The same could be said of CPUs.
Yes exactly, that's my point. A single CPU/GPU is the building block of multi-GPU/CPU; if the former hits a wall, then so does the latter.
But GPU graphics work is highly parallel; while there is no silver bullet yet, the possibility is there. CPU workloads aren't, and yet the benefits are there, but developers are definitely far behind in exploiting the new technology.
Again, I'm not sure you understand how multi-GPU works, especially AFR. Yes, graphics rendering is inherently more parallel than general-purpose code, and that's exactly why adding extra execution resources to a single GPU yields performance gains with very little effort.
However, multi-GPU breaks that parallelism, since AFR is serial in nature. That means you run into all sorts of interdependencies not present on a single GPU, and these need to be managed by hand on a per-application basis. Even SFR isn't optimal, since there's still duplicate data storage and processing happening.
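A hypothetical sketch of why AFR's serial nature hurts: if frame N+1 reads a result produced while rendering frame N (say, a render-to-texture effect), the GPU assigned to frame N+1 must wait for its partner, serializing the work. Everything below is a made-up toy model, not a real driver's scheduler:

```python
# Toy AFR model: two GPUs alternate frames. If each frame depends on
# data produced by the previous frame, the GPUs end up stalling on
# sync points instead of overlapping their work.

def afr_schedule(num_frames, frame_time, depends_on_prev):
    """Return per-frame finish times for two GPUs doing AFR."""
    gpu_free = [0.0, 0.0]              # when each GPU is next available
    finish = []
    for f in range(num_frames):
        gpu = f % 2                    # alternate-frame assignment
        start = gpu_free[gpu]
        if depends_on_prev and f > 0:
            start = max(start, finish[f - 1])  # wait for frame f-1
        end = start + frame_time
        gpu_free[gpu] = end
        finish.append(end)
    return finish

print(afr_schedule(4, 10.0, depends_on_prev=False))
# Independent frames: the two GPUs overlap, near-2x throughput.
print(afr_schedule(4, 10.0, depends_on_prev=True))
# Inter-frame dependency: frames finish one after another, as if
# a single GPU were doing all the work.
```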
This is much like how you can't simply take any general-purpose code, throw it into some kind of "threading machine", and expect meaningful performance gains. Most of it is done by hand on a per-application basis, and it's extremely complex compared to traditional code.
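The same issue in miniature: a loop whose iterations are independent parallelizes trivially, but a loop-carried dependency forces iterations to run in order no matter how many cores you throw at it. A small illustrative sketch (the functions are invented for the example):

```python
from concurrent.futures import ThreadPoolExecutor

# Independent iterations: each result depends only on its own input,
# so the work can be handed out to a pool of workers in any order.
def brighten(pixels):
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda p: min(p + 10, 255), pixels))

# Loop-carried dependency: each value needs the previous result, so
# no worker can start iteration i before iteration i-1 finishes.
def running_total(values):
    acc, out = 0, []
    for v in values:        # inherently serial; extra threads won't help
        acc += v
        out.append(acc)
    return out

print(brighten([100, 250, 30]))   # [110, 255, 40]
print(running_total([1, 2, 3]))   # [1, 3, 6]
```

Graphics looks like `brighten`; AFR's inter-frame dependencies turn it into something closer to `running_total`.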