CPUarchitect
Senior member
- Jun 7, 2011
> This one doesn't look 10x slower for example (page 26 of the advance program):
> 10.3 A 1.45GHz 52-to-162GFLOPS/W Variable-Precision Floating-Point Fused Multiply-Add Unit with Certainty Tracking in 32nm CMOS (9:30 AM)
That's just the execution unit. Nothing spectacular there for a 32 nm process.
> Good point, though IIRC you were mentioning the fact that thanks to gather a lot of codes will be vectorizable (not that I buy the argument much).
Yes, but I also mentioned that gather justifies the widening from 128-bit to 256-bit. It doesn't justify doubling it twice, at least not before the workloads have become much more vector oriented. Also note that FMA will have doubled peak floating-point performance as well.
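For anyone unfamiliar with the term, a gather fills the lanes of a vector from non-contiguous addresses given by an index vector. Here is a minimal Python sketch of the semantics only. The function name `gather` and the sample data are illustrative assumptions, not an ISA definition; the actual AVX2 gather instructions also take a mask and a scale factor.

```python
def gather(table, indices):
    """Model of a vector gather: lane i receives table[indices[i]]."""
    return [table[i] for i in indices]

# An indirect-indexing loop like  out[i] = table[idx[i]]  is the pattern
# that becomes vectorizable once the hardware offers a gather.
table = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0]
idx = [9, 0, 3, 3, 7, 1, 8, 2]   # 8 lanes = one 256-bit vector of 32-bit floats
lanes = gather(table, idx)
```

Without a gather, each of those eight loads has to be issued as a separate scalar load and inserted into the vector, which is why such loops rarely pay off when vectorized.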
> I don't really feel the risk since it's easy to clock gate (or even power gate AFAIK) unused lanes, like it's already the case on Sandy Bridge for the 128 high bits.
The problem is that inactive lanes still affect the routing. Also, your cores would be substantially larger but not offer higher scalar performance, which to some clients is more important.
> It will also take less chip area to double the SIMD width than to double the number of cores to get the same peak FLOPS.
That's hardly relevant. Again note that Sandy Bridge-E has room for 8 cores but only enables 6. At 14 nm there will be room for 32 cores, but unless the power consumption is addressed a large number of them can't be active. AVX-1024 offers a solution for this, while widening the SIMD units does not.
Also note that trying to double the peak throughput of a core requires increasing the cache sizes, besides doubling everything execution related. That's especially important since your suggestion does not help cover cache miss latencies. So your cores would be very large and still not achieve the efficiency that executing AVX-1024 in four cycles would offer.
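To make the four-cycle idea concrete, here is a rough Python model, not a description of any shipping hardware: one 1024-bit instruction is issued once and then sequenced over a 256-bit execution unit in four passes. The names `HW_LANES`, `LOGICAL_LANES`, and `execute_wide_add` are made up for illustration.

```python
HW_LANES = 8        # 256-bit unit / 32-bit floats
LOGICAL_LANES = 32  # hypothetical 1024-bit register / 32-bit floats

def execute_wide_add(a, b):
    """Add two 32-lane vectors on an 8-lane unit; returns (result, cycles)."""
    assert len(a) == len(b) == LOGICAL_LANES
    result = []
    cycles = 0
    for start in range(0, LOGICAL_LANES, HW_LANES):
        # One pass through the 256-bit unit handles one quarter of the register.
        result.extend(x + y for x, y in zip(a[start:start + HW_LANES],
                                            b[start:start + HW_LANES]))
        cycles += 1
    return result, cycles

a = [float(i) for i in range(LOGICAL_LANES)]
b = [1.0] * LOGICAL_LANES
total, cycles = execute_wide_add(a, b)
# One instruction issued, four execution cycles.
```

In this model the front end handles a single instruction for four cycles of execution work, while the execution unit itself stays at 256 bits.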
> It looks like your idea will be more difficult to design and validate than a simple doubling of the SIMD width but I'm a software guy
Intel has done it before in the Pentium 3 and 4. Designing and validating it is not a big challenge.
> There is the risk that the competition (IBM, Fujitsu, Oracle) reaches the market with far more powerful solutions before you do, because you stop increasing the peak FLOPS.
Effective performance is increased by increasing efficiency, active core count, and clock frequency. The competition can't offer anything with higher performance without addressing the power consumption issue even more effectively.
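As a back-of-the-envelope sketch of that relationship, in Python: effective throughput is peak FLOPs per cycle per core, times efficiency, times active cores, times clock. The function `effective_gflops` and every number below are hypothetical placeholders chosen only to show the shape of the trade-off, not measurements or projections for any real chip.

```python
def effective_gflops(lanes, fma, efficiency, active_cores, ghz, units=1):
    """Effective GFLOPS = peak FLOPs/cycle/core * efficiency * cores * GHz."""
    flops_per_lane = 2 if fma else 1   # an FMA counts as a multiply plus an add
    peak_per_cycle = lanes * units * flops_per_lane
    return peak_per_cycle * efficiency * active_cores * ghz

# Hypothetical: wider SIMD, but power-limited to fewer active cores ...
wide = effective_gflops(lanes=16, fma=True, efficiency=0.5, active_cores=8, ghz=2.0)
# ... versus 256-bit units kept busier, with more cores allowed to be active.
narrow = effective_gflops(lanes=8, fma=True, efficiency=0.75, active_cores=16, ghz=2.0)
```

With these made-up inputs the narrower configuration comes out ahead (384 versus 256 GFLOPS) despite half the peak FLOPs per core, simply because efficiency and active core count enter the product on equal terms with SIMD width.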
