Haswell Integrated GPU = Larrabee3?

CPUarchitect · Feb 1, 2012

bronxzv said:
This one doesn't look 10x slower for example (page 26 of the advance program)

10.3 A 1.45GHz 52-to-162GFLOPS/W Variable-Precision 9:30 AM
Floating-Point Fused Multiply-Add Unit with Certainty
Tracking in 32nm CMOS

That's just the execution unit. Nothing spectacular there for a 32 nm process.

good point, though IIRC you were mentioning the fact that thanks to gather a lot of codes will be vectorizable (not that I buy much the argument)

Yes, but I also mentioned that gather justifies the widening from 128-bit to 256-bit. It doesn't justify doubling it twice, at least not before the workloads have become much more vector oriented. Also note that FMA will have doubled peak floating-point performance as well.

I don't really feel the risk since it's easy to clock gate (or even power gate AFAIK) unused lanes, like it's already the case on Sandy Bridge for the 128 high bits.

The problem is that inactive lanes still affect the routing. Also, your cores would be substantially larger but not offer higher scalar performance, which to some clients is more important.

It will also take less chip area to double the SIMD width than to double the number of cores to get the same peak FLOPS.

That's hardly relevant. Again note that Sandy Bridge-E has room for 8 cores but only enables 6. At 14 nm there will be room for 32 cores, but unless the power consumption is addressed a large number of them can't be active. AVX-1024 offers a solution for this, while widening the SIMD units does not.

Also note that trying to double the peak throughput of a core requires increasing the cache sizes, besides doubling everything execution related. That's especially important since your suggestion does not help cover cache miss latencies. So your cores would be very large and still not achieve the efficiency that executing AVX-1024 in four cycles would offer.

It looks like your idea will be more difficult to design and validate than a simple doubling of the SIMD width but I'm a software guy

Intel has done it before in the Pentium 3 and 4. Designing and validating it is not a big challenge.

there is the risk that the competition (IBM, Fujitsu, Oracle) reach the market with way more powerful solutions before than you because you stop increasing the peak FLOPS

Effective performance is increased by increasing efficiency, active core count, and clock frequency. The competition can't offer anything with higher performance without addressing the power consumption issue even more effectively.

bronxzv · Feb 1, 2012

CPUarchitect said:
Also note that trying to double the peak throughput of a core requires increasing the cache sizes

not really, mostly memory bandwidth must be increased as you know probably, I dont see why it can have a significant impact on cache capacity (keeping the workload constant), just look at the history trend of effective FLOPS vs. cache capacities from Gallatin to Haswell for example, cache capacities have been increased far less than effective FLOPS and memory bandwidth

CPUarchitect said:
That's especially important since your suggestion does not help cover cache miss latencies.

Neither yours, btw I was also mentioning more hardware threads per core as the industry trend goes, this is the true weapon against the always widening bandwidth vs. latency gap

CPUarchitect said:
Intel has done it before in the Pentium 3 and 4. Designing and validating it is not a big challenge.

The P!!! was cracking the SSE instructions in two uops [1] and the P4 was with a 128-bit physical register file [2] (i.e. same width than the SIMD ISA), so your idea is arguably different than both designs since you don't crack the uops and you keep the register file width at 256-bit, for example it will require the allocation of 4 physical registers for executing an AVX-1024 instruction, and to track these 4 registers up to retirement, it looks like it will make the design more complex, I'll suggest you to ask hardware guys to have a better insight on the tradeofs involved with your idea, try to ask a question on a more technical forum such as RWT http://www.realworldtech.com/forums/index.cfm?action=list&roomid=2

also the fact that the logical registers width is at 128B and the cache line size at 64B makes it different than the classical situation where a whole register content fit in a cacheline with aligned loads/stores (btw AVX-512 will be a perfect match), I'm quite sure there is some nasty cases just for this reason, here again I'll suggest to ask an expert to have a better insight on the issue

[1] Pentium(R) III Processor Implementation Tradeoffs ftp://download.intel.com/technology/itj/Q21999/PDF/impliment.pdf
[2] The Microarchitecture of the Pentium(R) 4 Processor http://www.ecs.umass.edu/ece/koren/ece568/papers/Pentium4.pdf

CPUarchitect · Feb 1, 2012

bronxzv said:
not really, mostly memory bandwidth must be increased as you know probably, I dont see why it can have a significant impact on cache capacity (keeping the workload constant), just look at the history trend of effective FLOPS vs. cache capacities from Gallatin to Haswell for example, cache capacities have been increased far less than effective FLOPS and memory bandwidth

You can't use that as a trend. The Pentium 4 did not have an integrated memory controller and therefore the RAM access latencies were much higher. It required a big cache to compensate for that.

Neither yours, btw I was also mentioning more hardware threads per core as the industry trend goes, this is the true weapon against the always widening bandwidth vs. latency gap

Executing AVX-1024 does help cover cache miss latencies. And I've already indicated that SMT has significant downsides.

bronxzv · Feb 2, 2012

CPUarchitect said:
Executing AVX-1024 does help cover cache miss latencies.

I don't see why, remember this concrete example ?

the memory access pattern of 4x unrolled AVX-256 kernels will be much the same than with your proposal for AVX-1024

CPUarchitect · Feb 2, 2012

bronxzv said:
I don't see why, remember this concrete example ?

the memory access pattern of 4x unrolled AVX-256 kernels will be much the same than with your proposal for AVX-1024

Yes, unrolling has the same effect, but except for some anecdotal examples it would cause register spilling, and it consumes four times more power in the front-end per iteration (and high power translates to low performance). The whole purpose of AVX-1024 is precisely to get the advantages of unrolling, without the disadvantages!

bronxzv · Feb 2, 2012

CPUarchitect said:
and it consumes four times more power in the front-end per iteration

I give up, please wake me up when your claims will be "peer"-reviewed by someone in the know

CPUarchitect · Feb 2, 2012

bronxzv said:
I give up, please wake me up when your claims will be "peer"-reviewed by someone in the know

The benefits of executing vector instructions in multiple cycles have already been studied in great detail: Vector Processors. Just have a look at the many implementations in Figure G.2, and note that some have multiple lanes. AVX-1024 would fit right in with 16 elements and 4 lanes (64-bit each).

bronxzv · Feb 2, 2012

CPUarchitect said:
The benefits of executing vector instructions in multiple cycles have already been studied in great detail: Vector Processors.

As you know probably, the 1st vector processors were processing a single element per cycle because the FPUs were big and expensive *not to save power*, it was considered a progress when they started to process several elements per cycle, at the time power wasn't really the #1 concern, I still remember the Cray 2 with its strange cooling liquid, nice memories

Haswell Integrated GPU = Larrabee3?

CPUarchitect

Senior member

bronxzv

Senior member

CPUarchitect

Senior member

bronxzv

Senior member

CPUarchitect

Senior member

bronxzv

Senior member

CPUarchitect

Senior member

bronxzv

Senior member

TRENDING THREADS