Nemesis 1
Lifer
- Dec 30, 2006
"Awesome post! I think I am starting to 'get' your posts, Nemesis."
I think you're in deep doo-doo if my posts are starting to hit home with ya.
This isn't my idea at all. Intel announced AVX as extensible to 512 and 1024 bits. And it was also Intel who first implemented 128-bit SIMD instructions on 64-bit execution units back in 1999. Despite using the same execution width, SSE proved superior to AMD's 3DNow!. The main reason for that was the increase in available register space, which offered latency hiding while avoiding spilling. These are the exact same qualities AVX-1024 would offer. It's not a new idea by any stretch.
If you are Nicks, you changed your posting style from what I can see. But there are people out there who change posting methods. I'm not even sure how many YouTube accounts I have. But I do know how many videos I have created. LOTS.
Intel differentiates between AVX-128 and AVX-256 instructions. So obviously their 512-bit and 1024-bit instructions would be named AVX-512 and AVX-1024 respectively.
Other people appear to show an interest in AVX-512 and AVX-1024 as well, and some come to the same conclusion that splitting the execution over multiple cycles offers the best cost/gain ratio. I am not "Nick".
"But that means you're totally ignoring the latency problem with big loops!"
Indeed, that's exactly why I was reacting to this comment of yours:
"
In particular if you want to achieve latency hiding with AVX-256 it's easy to blow out the uop cache due to unrolling and spilling,
"
It never happens in practice, since we unroll only small loops.
"I'm sorry but this is where you're mistaken. All OpenCL kernels I've tested suffer from low ILP when run on a modern x86 out-of-order CPU. It's no coincidence that Intel increased the scheduling window from 36 to 54 between Nehalem and Sandy Bridge. But that doesn't mean that out-of-order scheduling can handle anything now. They also provided a cycle-accurate emulator for Sandy Bridge to tune critical paths, but it only proves useful for hand-written assembly. Unfortunately software pipelining didn't help for these OpenCL kernels due to the increased register pressure."
Indeed, and it changes absolutely nothing about the fact that you don't need loop unrolling or software pipelining to hide latencies on OoO cores.
"No, that part is not new either. The Pentium 4 still had the 64-bit SIMD units from the Pentium 3, but used a single uop for every 128-bit instruction."
I meant your idea of processing 1024-bit data in 4 chunks with a single uop for the 4 parts; that looks like a new idea in the x86 world.
"But that means you're totally ignoring the latency problem with big loops!"
"They also provided a cycle-accurate emulator for Sandy Bridge [...] but it only proves useful for hand-written assembly."
Not only for that: I use it extensively with C++ wrapper classes around intrinsics, and it has proved very useful.
"It's a dilemma that can only be solved by adding more registers."
It is not:
.B13.5:: ; Preds .B13.5 .B13.4
vmovups ymm1, YMMWORD PTR [rcx+rax*4] ;418.24
inc r10d ;416.3
vmovups ymm0, YMMWORD PTR [rdx+rax*4] ;418.32
vfmadd213ps ymm1, ymm0, YMMWORD PTR [r8+rax*4] ;419.20
vmovups YMMWORD PTR [r9+rax*4], ymm1 ;419.5
vmovups ymm3, YMMWORD PTR [32+rcx+rax*4] ;418.24
vmovups ymm2, YMMWORD PTR [32+rdx+rax*4] ;418.32
vfmadd213ps ymm3, ymm2, YMMWORD PTR [32+r8+rax*4] ;419.20
vmovups YMMWORD PTR [32+r9+rax*4], ymm3 ;419.5
vmovups ymm5, YMMWORD PTR [64+rcx+rax*4] ;418.24
vmovups ymm4, YMMWORD PTR [64+rdx+rax*4] ;418.32
vfmadd213ps ymm5, ymm4, YMMWORD PTR [64+r8+rax*4] ;419.20
vmovups YMMWORD PTR [64+r9+rax*4], ymm5 ;419.5
vmovups ymm1, YMMWORD PTR [96+rcx+rax*4] ;418.24
vmovups ymm0, YMMWORD PTR [96+rdx+rax*4] ;418.32
vfmadd213ps ymm1, ymm0, YMMWORD PTR [96+r8+rax*4] ;419.20
vmovups YMMWORD PTR [96+r9+rax*4], ymm1 ;419.5
vmovups ymm3, YMMWORD PTR [128+rcx+rax*4] ;418.24
vmovups ymm2, YMMWORD PTR [128+rdx+rax*4] ;418.32
vfmadd213ps ymm3, ymm2, YMMWORD PTR [128+r8+rax*4] ;419.20
vmovups YMMWORD PTR [128+r9+rax*4], ymm3 ;419.5
add rax, 40 ;416.3
cmp r10d, 25 ;416.3
jb .B13.5 ; Prob 95% ;416.3
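For context, a plausible C source shape behind that listing (reconstructed from the assembly, not the actual code; the function and argument names are assumptions): a simple fused multiply-add over arrays, which the compiler unrolled 5x into the 8-float AVX chunks above (25 iterations x 40 floats = 1000 elements).

// Hypothetical source for the listing above; rcx=a, rdx=b, r8=c, r9=d
// under the Windows x64 calling convention.
void fma_loop(const float* a, const float* b, const float* c,
              float* d, int n)  // n == 1000 would match the listing
{
    for (int i = 0; i < n; ++i)
        d[i] = a[i] * b[i] + c[i];  // maps to the vfmadd213ps above
}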
"it would also lower the power consumption. You can't achieve any of these advantages with AVX-256."
"No, that part is not new either. The Pentium 4 still had the 64-bit SIMD units from the Pentium 3, but used a single uop for every 128-bit instruction."
If the only exception is the most power-hungry of the bunch, I'll suggest not using it as an example to sell your idea...
"Quick question: is this Larrabee an actual GPU, or is it just some sort of VLC part like what seems to be in Ivy Bridge? Hope they fix the 23.976 bug so they could at least show a smooth fake."
This bug of course can't exist on LRB. Not just because it's a different arch, but also because they've fixed it in the new stepping.
"Guys, this has been a great debate. Thank you. I asked my daughter, whose major was computer science, as ya both sound like ya in the know. Daughter said this is some deep stuff, but so far she gives the nod to CPUarch, with reservation."
What's her idea?
"Isn't it a problem with OCL itself?"
I'm sorry but this is where you're mistaken. All OpenCL kernels I've tested suffer from low ILP when run on a modern x86 out-of-order CPU. It's no coincidence that Intel increased the scheduling window from 36 to 54 between Nehalem and Sandy Bridge. But that doesn't mean that out-of-order scheduling can handle anything now. They also provided a cycle-accurate emulator for Sandy Bridge to tune critical paths, but it only proves useful for hand-written assembly. Unfortunately software pipelining didn't help for these OpenCL kernels due to the increased register pressure.
At least its programming structure is at such a high level that it sounds inefficient.
"1) because MIC products are 512-bit wide and they will certainly try to make standard components for both (implementation and validation less costly, not more), with a grand plan to unify the ISAs at some point; it makes no sense in the long run to have two x86-based vector ISAs"
Indeed it seems beneficial to converge LRBni and AVX. But the ISAs don't dictate the execution width. The MIC products could execute future AVX-1024 instructions in 2 cycles and AVX-512 instructions in 1 cycle. Or they could be equipped with two 256-bit execution units which execute AVX-1024 instructions in 4 cycles.
"2) because it was the natural evolution so far, 64-bit MMX to 128-bit SSE to 256-bit AVX; the next logical step is 512-bit"
Note that somewhere in between that evolution we went from 32-bit to 64-bit processors, justifying at least one doubling of the SIMD width. The other one makes sense if we consider that future software should contain a larger number of parallelized loops thanks to gather support.
"3) because most ISVs will not bother to port their code with no performance gain but only a slight perf/power advantage"
AVX is squarely aimed at auto-vectorization. So it should only take a simple recompile to take advantage of AVX-1024. With JIT compilation, as with OpenCL, the ISV doesn't even have to do the recompile itself.
"4) with wider vectors there is on average more padding and thus more unused computation slots; if the peak AVX-256 and AVX-1024 throughputs are the same, the actual performance will be lower with AVX-1024 (more unused slots); all the pain of supporting yet another code path, and worse performance, i.e. a lose-lose situation"
It's certainly true that AVX-1024 won't help in all cases, but note that there would still be AVX-512, AVX-256, and AVX-128. And again the decision on the optimal approach would in most cases be handled by the (JIT) compiler. So no need to worry about additional code paths.
I'm actually quite convinced that the MIC product line will be short-lived. There are already 16-core Ivy Bridge parts on the roadmap, so Knights Corner might not offer much advantage over a 16-core AVX2-equipped Haswell part.
"we don't unroll them because it's actually faster on real targets; if there was a problem it would show in the timings. Anyway, why would there be a latency problem with typical 2-5 clock latencies for usual instructions? As already asked, I'll be glad to see an actual code sample that shows the issue well; maybe your use cases are so different from mine that I overlooked something?"
Indeed unrolling long loops is not beneficial. It increases the register pressure, which requires spilling code, which counteracts any latency-hiding benefits of the unrolling (with software pipelining). That's what I've been saying all along. It's a dilemma that can only be solved with extra register space, which can be provided in the form of AVX-1024.
"I'll be happy with some more registers, but first of all you can reuse logical registers before the physical register is free; another physical register will be allocated for you from the pool. See for example what the Intel compiler does with a 5x unrolled loop of your example. The true limiter is the number of physical registers divided by the number of running threads. Note that it allows *today already* to unroll 4x (as AVX-1024 over 256-bit hardware would achieve) any loop with independent iterations (as is the typical OpenCL case) with few extra register fills/spills."
Yes, that's a remarkable feature, but you'd get even higher performance with AVX-1024 by eliminating the spill and restore instructions. Nothing extraordinary, but together with the power savings it's ample justification for the nearly negligible cost.
"let's say you are right; how much % less power consumption do you envision?"
The OpenSPARC T1 is known to spend 40% of its power consumption in the front-end, and I expect x86 to be similar. For a realistic throughput-oriented workload, wide vector instructions which take 4 cycles would allow clock-gating the front-end roughly two times more often. Hence a reduction in power consumption of 20% can be achieved.
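As a rough sketch of the arithmetic behind that estimate (the 40% front-end share is the assumed input), in the same style as the GFlops estimates further down:
front-end share of core power: ~40%
fraction of cycles the front-end can be clock-gated with 4-cycle AVX-1024 uops: ~50%
power saved: 0.40 x 0.50 = 0.20, i.e. ~20%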
"if the only exception is the most power-hungry of the bunch, I'll suggest not using it as an example to sell your idea..."
I was merely indicating that it has been done before. NetBurst's high power consumption isn't caused by this feature.
"Isn't it a problem with OCL itself?"
No, the issue of not having sufficient register space to optimally hide latencies applies to the vast majority of parallel workloads. You can't fix that with another software API. You need hardware support for it, and unless someone comes up with a better idea it looks like AVX-1024 would be the most straightforward and most effective.
There's an instruction latency problem with long loops because they typically contain long chains of dependent instructions.
"Hence a reduction in power consumption of 20% can be achieved. Combined with a 20% increase in performance..."
"to go back to the issue discussed with CPUarchitect: with wide vectors like AVX-1024's (32 x 32-bit elements) there is a serious issue of padding for small arrays"
Yes, there is no denying that, but let's get something straight. GPUs have logical vector widths of 1024-bit and 2048-bit. So anything you could run efficiently on a GPU would also run well on a CPU with AVX-1024.
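To make the quoted padding concern concrete, a worked example (the 20-element array length is just an illustrative pick):
AVX-1024, 32 x 32-bit lanes: 20 elements pad to 32 -> 12 of 32 slots (37.5%) wasted
AVX-256, 8 x 32-bit lanes: 20 elements pad to 24 (3 vectors) -> 4 of 24 slots (~17%) wasted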
"You got it reversed: the ISA dictates the minimal execution width. For example, you can process the whole AVX instruction set in two 128-bit parts because there are no cross-128-bit-lane instructions, though you can't process it in 4 x 64-bit parts due to instructions such as VPERMILPS/PD, VEXTRACTF128 and VINSERTF128, which require 128-bit execution width."
Point taken, but we were talking about converging LRBni and AVX, which both logically split the vectors into 128-bit lanes. So while the minimum width is indeed 128-bit, the discussion was really about 256-bit versus 512-bit. In theory you could get all of the functionality of LRBni with AVX-512 instructions even if the execution units are 256-bit. Conclusion: any motivation for converging them isn't an argument why executing AVX-512 in a single cycle is more likely than executing AVX-1024 in four cycles.
"AVX2 introduces new instructions such as VPERMPS/PD/D/Q, VBROADCASTI128 and VPERM2I128 that you can't split in two 128-bit parts, so the minimal execution width is 256-bit for AVX2."
I don't think those strictly demand 256-bit. Note that SSE1/2 had 128-bit movlhps/movhlps/pshufd while the execution units were still 64-bit.
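For reference, a minimal intrinsics sketch of one of the cross-lane instructions named above (the function name is illustrative):

#include <immintrin.h>

// AVX2 vpermps: every output lane can pick any of the 8 source lanes,
// so neither 128-bit half of the result can be produced by an
// operation that sees only its own 128-bit half of the source.
__m256 reverse_lanes(__m256 v)
{
    const __m256i idx = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    return _mm256_permutevar8x32_ps(v, idx);
}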
"we can't really talk about what we call 'AVX-512' and 'AVX-1024' since these aren't disclosed yet; at this stage it's well possible that we will see AVX-512 only in the far future and never AVX-1024, but a true vector ISA instead, with variable-length arrays (much like the string instructions); it would be a far better fit with OpenCL for avoiding the padding issues"
You can have the choice between AVX-1024, AVX-512, AVX-256 and AVX-128. AVX-512 and AVX-1024 would use the same (currently reserved) VEX encoding bit, so basically you get AVX-1024 for free as far as the instruction format goes. Any vector string extension would require entirely new encodings, and I'm not sure if there's any space left for that. Plus I don't see what you'd gain over AVX-128/256/512/1024. It's quite similar in concept.
"I agree with you here; that's why I consider AVX2 a very important target for my code (3D rendering)."
That's very interesting, because 3D rendering consists of shaders which are fairly long, and thus there are long dependency chains and no opportunities for...
"Sorry, but I don't see how it's related."
Between the Core and Core 2 architectures, Intel doubled up everything. Scalar execution became twice as wide, cache bandwidth doubled, and SIMD execution doubled. So the same balance was maintained. They didn't leave anything bottlenecked or underutilized.
"This is probably true, except Ivy Bridge EX is coming in early 2013, about the same time as Knights Corner. Anything with 12 cores or more will be sold as EX parts, and considering their 1.5-year development schedule, that means you won't see Haswell EX until mid-2014. It would be competing with the successor to Knights Corner. Oh, and Ivy Bridge is said to have up to 12 cores, not 16."
According to the first post there will be 16-core parts. How many cores will be active remains to be seen, of course. Likewise, Knights Corner is said to have "more than 50 cores", which likely means 64, with several disabled depending on yield.
"Let's compare theoreticals, shall we?
Knights Corner: 64 cores @ ~1.2GHz
Ivy Bridge EX: 12 cores @ 2.8GHz
GFlops:
KC: 64 cores x 1.2GHz x 2 (FMA) x 8 DP Flops/cycle = 1228.8 GFlops (~1.2 TFlops)
IVB-EX: 12 cores x 2.8GHz x 2 FP units/core x 4 DP Flops/cycle = 268.8 GFlops
That's not counting the additional bandwidth that KC has with GDDR5+ memory."
Yes, but notice that Haswell will double the computing density with FMA, while Knights can't pull that trick any more. Granted, DDR4 is on its way as well, and CPUs require less RAM bandwidth due to their massive caches.
"Haswell is said to have 20 EUs. That's not Larrabee-based, it's Gen X-based."
Do you think all Haswell parts will have an IGP? And do you happen to know how many GFLOPS 20 Gen X EUs amount to?
"I don't think those strictly demand 256-bit. Note that SSE1/2 had 128-bit movlhps/movhlps..."
IIRC these were slow and to be avoided before Conroe, but I may be wrong.
"You can have the choice between AVX-1024, AVX-512, AVX-256 and AVX-128. AVX-512 and AVX-1024 would use the same (currently reserved) VEX encoding bit, so basically you get AVX-1024 for free as far as the instruction format goes. [...] Plus I don't see what you'd gain over AVX-128/256/512/1024. It's quite similar in concept."
Less useless padding and direct support for wider vectors; it should allow clock-gating the front-end more effectively than with these small AVX-1024 vectors. Processing long vectors over narrow physical execution units provides massive performance/power advantages, doesn't it? Hey, see how you convinced me!
"That's very interesting, because 3D rendering consists of shaders..."
I'm sure you know the subject very well.
"If instead you want to just double the SIMD width you'd be making compromises."
It is exactly what was done for Sandy Bridge: SIMD width doubled, GPR width unchanged. Now that a PRF is a given, it looks rather doable to double the SIMD width again.
"And that's what's expected to happen with AVX2, since its gather support will allow a lot of loops to be vectorized."
The same loops can already be vectorized using software-synthesized gather (there is a common misconception that it's not possible); hardware gather will speed them up, but it's already way faster than keeping them scalar. Now, I think we already discussed gather well enough in the past at RWT and on the Intel AVX forum.
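A minimal sketch of such a software-synthesized gather using plain AVX intrinsics (gather8 is an illustrative name, not a real API):

#include <immintrin.h>

// Emulate an 8-wide gather: spill the indices to memory, do eight
// scalar loads, and repack the results into one ymm register.
static inline __m256 gather8(const float* base, __m256i idx)
{
    alignas(32) int i[8];
    _mm256_store_si256((__m256i*)i, idx);
    return _mm256_set_ps(base[i[7]], base[i[6]], base[i[5]], base[i[4]],
                         base[i[3]], base[i[2]], base[i[1]], base[i[0]]);
}

AVX2's hardware gather (VGATHERDPS) collapses this into a single instruction, but as argued above, the synthesized form already beats a scalar loop.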
"not really, since spills/fills are mostly to/from the store-to-load forwarding buffer"
That sounds like a far too optimistic assumption. These buffers are finite in size and don't hold the data for long. In particular, when unrolling and software-pipelining loops, the distance between spills and fills increases, and the chances of still finding the values in the store buffer are slim. What doesn't help either is that spilling is specifically done on variables which aren't needed for a while (else they'd be kept in a register instead). And last but not least, there's only one store port and two load ports, so between all the spills/fills and regular loads/stores you quickly get contention...
"as already requested, do you have a real and complete example to share showing the problem concretely? For all the high-throughput AVX cases I'm dealing with, the main limitations come from L1D/L2 cache bandwidth. I'm stunned you can measure issues due to the instruction latencies; I'd say these are 2 or 3 orders of magnitude less important in the real world than the cache bandwidth limitation."
Exactly! All your spills and fills cause port congestion.
"instead of all this theory, why not share an illustrative example?"
The traces I've worked with are just too long to post here, and more importantly contain potential IP from clients. But since you mentioned that you work with 3D graphics, you should have access to plenty of shader kernels which are not suitable for unrolling.
"how can you achieve a 20% increase in performance with the same peak throughput?"
That's what you would typically gain from eliminating spill/fill instructions and improving scheduling opportunities.
"don't forget SMT already helps a lot in hiding memory hierarchy latencies, and 4-way SMT is a possibility for Haswell; AVX2 code will already be very near its maximum potential throughput, with the only limitations coming from cache and memory bandwidth, something that your idea will not improve"
Once again I'd like you to check whether it's the bandwidth or load/store port contention which is hampering your code, by using AVX-128 instead of AVX-256.
"...access to plenty of shader kernels which are not suitable for unrolling."
I'd say where unrolling is unnecessary, to be more precise; there are a lot of cases with not a single spill in the unrolled version and the unrolled version no faster. That's why I can't understand your POV on the matter. And since you use the IP / trade secrets excuse when it's so easy to devise a convincing example if we have something concrete to show, I'm afraid I'll not learn much from you.
"SMT doesn't help at all with load/store port contention."
So it should not be the problem, since we typically enjoy a 25%+ speedup from enabling Hyper-Threading (slightly better speedups with 64-bit than 32-bit code, and with AVX code than SSE code, for some reason; I'm sure you'll say due to less load/store contention in 64-bit mode, but the difference should be 1% or 2%). The workload that scales best in our series of 3D models for the regression tests is at a 31% speedup from HT. Note that power consumption isn't significantly up with HT enabled, according to the ASUS AI Suite.
"so, if I understand you well, you think AVX-512 and AVX-1024 will be introduced at the same time, right?"
Not necessarily. When AVX-1024 becomes available there will definitely be AVX-512 too, for obvious reasons. The reverse isn't strictly going to be true, and I can think of two reasons: renaming registers, and marketing. To keep the out-of-order engine running smoothly you need sufficient registers for renaming. Sandy Bridge is supposed to have 144 physical SIMD registers, of which 2 x 16 are required for two thread contexts, leaving 112 for renaming. AVX-1024 would increase each thread's context by 3 x 16 256-bit registers, so a total of 240 registers would be desirable to keep the same renaming capacity. Fortunately that's not a lot for the 16 nm process, where we might see AVX-1024 at the earliest (two 'tock' generations from Sandy Bridge), and due to the inherent latency hiding there's no strict need for the same renaming capacity; but at 176 registers, AVX-512 would be cheaper still.
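Spelling out the register bookkeeping in that reply (the 144-register figure and the 2 x 16 contexts are from the post; the rest follows from them):
Sandy Bridge: 144 physical - 2 threads x 16 architectural = 112 left for renaming
AVX-1024 state: 2 threads x 16 registers x 4 x 256-bit slices = 128 registers of context
same renaming capacity: 128 + 112 = 240 physical registers
AVX-512 state: 2 x 16 x 2 slices = 64, hence 64 + 112 = 176 physical registers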
"Less useless padding and direct support for wider vectors; it should allow clock-gating the front-end more effectively than with these small AVX-1024 vectors. Processing long vectors over narrow physical execution units provides massive performance/power advantages, doesn't it? Hey, see how you convinced me!"
I'm glad it's finally making sense to you, but keep in mind that it's a delicate balancing act. Too much of a good thing becomes a bad thing.
"It is exactly what was done for Sandy Bridge: SIMD width doubled, GPR width unchanged. Now that a PRF is a given, it looks rather doable to double the SIMD width again."
No, Sandy Bridge cleverly borrows ALUs from another execution stack for floating-point operations only. It doesn't double the bandwidth either, but makes a compromise with the symmetric load ports. So in many ways it's just a stepping stone toward Haswell, which will offer 256-bit execution for all SIMD instructions, bring us gather support to facilitate parallelizing generic code, and is next to certain to increase cache bandwidth.
"Fortunately that's not a lot for the 16 nm process. [...] So I'm hoping AMD is thinking of offering AVX-1024 support sooner rather than later. Of course their process disadvantage might make it harder to add sufficient registers."
Don't forget that AVX-1024 isn't disclosed yet, and remember the FMA4 debacle. You seem to think the register file takes a lot of chip area; are you sure? I'll be interested to see what a CPU designer has to say about it.
"Anyway, if I had to place a bet I would say that getting AVX-1024/512 in one go is more likely than getting just AVX-512 first."
I bet on AVX-512 first, full width. Let's see who will be right in a few years.
