Nemesis 1
Lifer
- Dec 30, 2006
"Awesome post! I think I am starting to 'get' your posts, Nemesis."
I think you're in deep doo-doo if my posts are starting to hit home with ya.
This isn't my idea at all. Intel announced AVX as extensible to 512 and 1024 bits. And it was also Intel who first implemented 128-bit SIMD instructions on 64-bit execution units back in 1999. Despite using the same execution width, SSE proved superior to AMD's 3DNow!. The main reason for that was the increase in available register space, which offered latency hiding while avoiding spilling. These are the exact same qualities AVX-1024 would offer. It's not a new idea by any stretch.
If you are Nicks, you changed your posting style from what I can see. But there are people out there who change posting methods. I'm not even sure how many YouTube accounts I have. But I do know how many videos I have created. LOTS.
Intel differentiates between AVX-128 and AVX-256 instructions. So obviously their 512-bit and 1024-bit instructions would be named AVX-512 and AVX-1024 respectively.
Other people appear to show an interest in AVX-512 and AVX-1024 as well, and some come to the same conclusion that splitting the execution over multiple cycles offers the best cost/gain ratio. I am not "Nick".
"But that means you're totally ignoring the latency problem with big loops!"
Indeed, that's exactly why I was reacting to this comment of yours:
"
In particular if you want to achieve latency hiding with AVX-256 it's easy to blow out the uop cache due to unrolling and spilling,
"
It never happens in practice, since we unroll only small loops.
"I'm sorry but this is where you're mistaken. All OpenCL kernels I've tested suffer from low ILP when run on a modern x86 out-of-order CPU. It's no coincidence that Intel increased the scheduling window from 36 to 54 between Nehalem and Sandy Bridge. But that doesn't mean that out-of-order scheduling can handle anything now. They also provided a cycle-accurate emulator for Sandy Bridge to tune critical paths, but it only proves useful for hand-written assembly. Unfortunately software pipelining didn't help for these OpenCL kernels due to the increased register pressure."
Indeed, and it changes absolutely nothing about the fact that you don't need loop unrolling or software pipelining to hide latencies on OoO cores.
"No, that part is not new either. The Pentium 4 still had the 64-bit SIMD units from the Pentium 3, but used a single uop for every 128-bit instruction."
I meant your idea of processing 1024-bit data in 4 chunks with a single uop for the 4 parts; that looks like a new idea in the x86 world.
"But that means you're totally ignoring the latency problem with big loops!"
"They also provided a cycle-accurate emulator for Sandy Bridge [...] but it only proves useful for hand-written assembly."
Not only for that: I use it extensively with C++ wrapper classes around intrinsics, and it has proved very useful.
"It's a dilemma that can only be solved by adding more registers."
It is not:
.B13.5:: ; Preds .B13.5 .B13.4
vmovups ymm1, YMMWORD PTR [rcx+rax*4] ;418.24
inc r10d ;416.3
vmovups ymm0, YMMWORD PTR [rdx+rax*4] ;418.32
vfmadd213ps ymm1, ymm0, YMMWORD PTR [r8+rax*4] ;419.20
vmovups YMMWORD PTR [r9+rax*4], ymm1 ;419.5
vmovups ymm3, YMMWORD PTR [32+rcx+rax*4] ;418.24
vmovups ymm2, YMMWORD PTR [32+rdx+rax*4] ;418.32
vfmadd213ps ymm3, ymm2, YMMWORD PTR [32+r8+rax*4] ;419.20
vmovups YMMWORD PTR [32+r9+rax*4], ymm3 ;419.5
vmovups ymm5, YMMWORD PTR [64+rcx+rax*4] ;418.24
vmovups ymm4, YMMWORD PTR [64+rdx+rax*4] ;418.32
vfmadd213ps ymm5, ymm4, YMMWORD PTR [64+r8+rax*4] ;419.20
vmovups YMMWORD PTR [64+r9+rax*4], ymm5 ;419.5
vmovups ymm1, YMMWORD PTR [96+rcx+rax*4] ;418.24
vmovups ymm0, YMMWORD PTR [96+rdx+rax*4] ;418.32
vfmadd213ps ymm1, ymm0, YMMWORD PTR [96+r8+rax*4] ;419.20
vmovups YMMWORD PTR [96+r9+rax*4], ymm1 ;419.5
vmovups ymm3, YMMWORD PTR [128+rcx+rax*4] ;418.24
vmovups ymm2, YMMWORD PTR [128+rdx+rax*4] ;418.32
vfmadd213ps ymm3, ymm2, YMMWORD PTR [128+r8+rax*4] ;419.20
vmovups YMMWORD PTR [128+r9+rax*4], ymm3 ;419.5
add rax, 40 ;416.3
cmp r10d, 25 ;416.3
jb .B13.5 ; Prob 95% ;416.3
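For context, a plausible C source shape behind that listing (reconstructed from the assembly, not the actual code; the function and argument names are assumptions): a simple fused multiply-add over arrays, which the compiler unrolled 5x into the 8-float AVX chunks above (25 iterations x 40 floats = 1000 elements).

// Hypothetical source for the listing above; rcx=a, rdx=b, r8=c, r9=d
// under the Windows x64 calling convention.
void fma_loop(const float* a, const float* b, const float* c,
              float* d, int n)  // n == 1000 would match the listing
{
    for (int i = 0; i < n; ++i)
        d[i] = a[i] * b[i] + c[i];  // maps to the vfmadd213ps above
}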
"it would also lower the power consumption. You can't achieve any of these advantages with AVX-256."
"No, that part is not new either. The Pentium 4 still had the 64-bit SIMD units from the Pentium 3, but used a single uop for every 128-bit instruction."
If the only exception is the most power-hungry of the bunch, I'll suggest not using it as an example to sell your idea...
"Quick question: is this Larrabee an actual GPU, or is it just some sort of VLC part like what seems to be in Ivy Bridge? Hope they fix the 23.976 bug so they could at least show a smooth fake."
This bug of course can't exist on LRB. Not just because it's a different arch, but also because they've fixed it in the new stepping.
"Guys, this has been a great debate. Thank you. I asked my daughter, whose major was computer science, as ya both sound like ya in the know. Daughter said this is some deep stuff, but so far she gives the nod to CPUarch, with reservation."
What's her idea?
"Isn't it a problem with OCL itself?"
I'm sorry but this is where you're mistaken. All OpenCL kernels I've tested suffer from low ILP when run on a modern x86 out-of-order CPU. It's no coincidence that Intel increased the scheduling window from 36 to 54 between Nehalem and Sandy Bridge. But that doesn't mean that out-of-order scheduling can handle anything now. They also provided a cycle-accurate emulator for Sandy Bridge to tune critical paths, but it only proves useful for hand-written assembly. Unfortunately software pipelining didn't help for these OpenCL kernels due to the increased register pressure.
At least its programming structure is at such a high level that it sounds inefficient.
"1) because MIC products are 512-bit wide and they will certainly try to make standard components for both (implementation and validation less costly, not more), with a grand plan to unify the ISAs at some point; it makes no sense in the long run to have two x86-based vector ISAs"
Indeed it seems beneficial to converge LRBni and AVX. But the ISAs don't dictate the execution width. The MIC products could execute future AVX-1024 instructions in 2 cycles and AVX-512 instructions in 1 cycle. Or they could be equipped with two 256-bit execution units which execute AVX-1024 instructions in 4 cycles.
"2) because it was the natural evolution so far, 64-bit MMX to 128-bit SSE to 256-bit AVX; the next logical step is 512-bit"
Note that somewhere in between that evolution we went from 32-bit to 64-bit processors, justifying at least one doubling of the SIMD width. The other one makes sense if we consider that future software should contain a larger number of parallelized loops thanks to gather support.
"3) because most ISVs will not bother to port their code with no performance gain but only a slight perf/power advantage"
AVX is squarely aimed at auto-vectorization. So it should only take a simple recompile to take advantage of AVX-1024. With JIT compilation, as with OpenCL, the ISV doesn't even have to do the recompile itself.
"4) with wider vectors there is on average more padding and thus more unused computation slots; if the peak AVX-256 and AVX-1024 throughputs are the same, the actual performance will be lower with AVX-1024 (more unused slots); all the pain of supporting yet another code path, and worse performance, i.e. a lose-lose situation"
It's certainly true that AVX-1024 won't help in all cases, but note that there would still be AVX-512, AVX-256, and AVX-128. And again the decision on the optimal approach would in most cases be handled by the (JIT) compiler. So no need to worry about additional code paths.
I'm actually quite convinced that the MIC product line will be short-lived. There are already 16-core Ivy Bridge parts on the roadmap, so Knights Corner might not offer much advantage over a 16-core AVX2-equipped Haswell part.
"we don't unroll them because it's actually faster on real targets; if there was a problem it would show in the timings. Anyway, why would there be a latency problem with typical 2-5 clock latencies for usual instructions? As already asked, I'll be glad to see an actual code sample that shows the issue well; maybe your use cases are so different from mine that I overlooked something?"
Indeed unrolling long loops is not beneficial. It increases the register pressure, which requires spilling code, which counteracts any latency-hiding benefits of the unrolling (with software pipelining). That's what I've been saying all along. It's a dilemma that can only be solved with extra register space, which can be provided in the form of AVX-1024.
"I'll be happy with some more registers, but first of all you can reuse logical registers before the physical register is free; another physical register will be allocated for you from the pool. See for example what the Intel compiler does with a 5x unrolled loop of your example. The true limiter is the number of physical registers divided by the number of running threads. Note that it allows *today already* to unroll 4x (as AVX-1024 over 256-bit hardware would achieve) any loop with independent iterations (as is the typical OpenCL case) with few extra register fills/spills."
Yes, that's a remarkable feature, but you'd get even higher performance with AVX-1024 by eliminating the spill and restore instructions. Nothing extraordinary, but together with the power savings it's ample justification for the nearly negligible cost.
"let's say you are right; how much % less power consumption do you envision?"
The OpenSPARC T1 is known to spend 40% of its power consumption in the front-end, and I expect x86 to be similar. For a realistic throughput-oriented workload, wide vector instructions which take 4 cycles would allow clock-gating the front-end roughly two times more often. Hence a reduction in power consumption of 20% can be achieved.
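As a rough sketch of the arithmetic behind that estimate (the 40% front-end share is the assumed input), in the same style as the GFlops estimates further down:
front-end share of core power: ~40%
fraction of cycles the front-end can be clock-gated with 4-cycle AVX-1024 uops: ~50%
power saved: 0.40 x 0.50 = 0.20, i.e. ~20%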
"if the only exception is the most power-hungry of the bunch, I'll suggest not using it as an example to sell your idea..."
I was merely indicating that it has been done before. NetBurst's high power consumption isn't caused by this feature.
"Isn't it a problem with OCL itself?"
No, the issue of not having sufficient register space to optimally hide latencies applies to the vast majority of parallel workloads. You can't fix that with another software API. You need hardware support for it, and unless someone comes up with a better idea it looks like AVX-1024 would be the most straightforward and most effective.
There's an instruction latency problem with long loops because they typically contain long chains of dependent instructions.
"Hence a reduction in power consumption of 20% can be achieved. Combined with a 20% increase in performance..."
"to go back to the issue discussed with CPUarchitect: with wide vectors like AVX-1024's (32 x 32-bit elements) there is a serious issue of padding for small arrays"
Yes, there is no denying that, but let's get something straight. GPUs have logical vector widths of 1024-bit and 2048-bit. So anything you could run efficiently on a GPU would also run well on a CPU with AVX-1024.
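To make the quoted padding concern concrete, a worked example (the 20-element array length is just an illustrative pick):
AVX-1024, 32 x 32-bit lanes: 20 elements pad to 32 -> 12 of 32 slots (37.5%) wasted
AVX-256, 8 x 32-bit lanes: 20 elements pad to 24 (3 vectors) -> 4 of 24 slots (~17%) wasted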
"You got it reversed: the ISA dictates the minimal execution width. For example, you can process the whole AVX instruction set in two 128-bit parts because there are no cross-128-bit-lane instructions, though you can't process it in 4 x 64-bit parts due to instructions such as VPERMILPS/PD, VEXTRACTF128 and VINSERTF128, which require 128-bit execution width."
Point taken, but we were talking about converging LRBni and AVX, which both logically split the vectors into 128-bit lanes. So while the minimum width is indeed 128-bit, the discussion was really about 256-bit versus 512-bit. In theory you could get all of the functionality of LRBni with AVX-512 instructions even if the execution units are 256-bit. Conclusion: any motivation for converging them isn't an argument why executing AVX-512 in a single cycle is more likely than executing AVX-1024 in four cycles.
"AVX2 introduces new instructions such as VPERMPS/PD/D/Q, VBROADCASTI128 and VPERM2I128 that you can't split in two 128-bit parts, so the minimal execution width is 256-bit for AVX2."
I don't think those strictly demand 256-bit. Note that SSE1/2 had 128-bit movlhps/movhlps/pshufd while the execution units were still 64-bit.
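For reference, a minimal intrinsics sketch of one of the cross-lane instructions named above (the function name is illustrative):

#include <immintrin.h>

// AVX2 vpermps: every output lane can pick any of the 8 source lanes,
// so neither 128-bit half of the result can be produced by an
// operation that sees only its own 128-bit half of the source.
__m256 reverse_lanes(__m256 v)
{
    const __m256i idx = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    return _mm256_permutevar8x32_ps(v, idx);
}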
"we can't really talk about what we call 'AVX-512' and 'AVX-1024' since these aren't disclosed yet; at this stage it's well possible that we will see AVX-512 only in the far future and never AVX-1024, but a true vector ISA instead, with variable-length arrays (much like the string instructions); it would be a far better fit with OpenCL for avoiding the padding issues"
You can have the choice between AVX-1024, AVX-512, AVX-256 and AVX-128. AVX-512 and AVX-1024 would use the same (currently reserved) VEX encoding bit, so basically you get AVX-1024 for free as far as the instruction format goes. Any vector string extension would require entirely new encodings, and I'm not sure if there's any space left for that. Plus I don't see what you'd gain over AVX-128/256/512/1024. It's quite similar in concept.
"I agree with you here; that's why I consider AVX2 a very important target for my code (3D rendering)."
That's very interesting, because 3D rendering consists of shaders which are fairly long, and thus there are long dependency chains and no opportunities for...
"Sorry, but I don't see how it's related."
Between the Core and Core 2 architectures, Intel doubled up everything. Scalar execution became twice as wide, cache bandwidth doubled, and SIMD execution doubled. So the same balance was maintained. They didn't leave anything bottlenecked or underutilized.
"This is probably true, except Ivy Bridge EX is coming in early 2013, about the same time as Knights Corner. Anything with 12 cores or more will be sold as EX parts, and considering their 1.5-year development schedule, that means you won't see Haswell EX until mid-2014. It would be competing with the successor to Knights Corner. Oh, and Ivy Bridge is said to have up to 12 cores, not 16."
According to the first post there will be 16-core parts. How many cores will be active remains to be seen, of course. Likewise, Knights Corner is said to have "more than 50 cores", which likely means 64, with several disabled depending on yield.
"Let's compare theoreticals, shall we?
Knights Corner: 64 cores @ ~1.2GHz
Ivy Bridge EX: 12 cores @ 2.8GHz
GFlops:
KC: 64 cores x 1.2GHz x 2 (FMA) x 8 DP Flops/cycle = 1228.8 GFlops (~1.2 TFlops)
IVB-EX: 12 cores x 2.8GHz x 2 FP units/core x 4 DP Flops/cycle = 268.8 GFlops
That's not counting the additional bandwidth that KC has with GDDR5+ memory."
Yes, but notice that Haswell will double the computing density with FMA, while Knights can't pull that trick any more. Granted, DDR4 is on its way as well, and CPUs require less RAM bandwidth due to their massive caches.
"Haswell is said to have 20 EUs. That's not Larrabee-based, it's Gen X-based."
Do you think all Haswell parts will have an IGP? And do you happen to know how many GFLOPS 20 Gen X EUs amount to?
"I don't think those strictly demand 256-bit. Note that SSE1/2 had 128-bit movlhps/movhlps..."
IIRC these were slow and to be avoided before Conroe, but I may be wrong.
"You can have the choice between AVX-1024, AVX-512, AVX-256 and AVX-128. AVX-512 and AVX-1024 would use the same (currently reserved) VEX encoding bit, so basically you get AVX-1024 for free as far as the instruction format goes. [...] Plus I don't see what you'd gain over AVX-128/256/512/1024. It's quite similar in concept."
Less useless padding and direct support for wider vectors; it should allow clock-gating the front-end more effectively than with these small AVX-1024 vectors. Processing long vectors over narrow physical execution units provides massive performance/power advantages, doesn't it? Hey, see how you convinced me!
"That's very interesting, because 3D rendering consists of shaders..."
I'm sure you know the subject very well.
"If instead you want to just double the SIMD width you'd be making compromises."
It is exactly what was done for Sandy Bridge: SIMD width doubled, GPR width unchanged. Now that a PRF is a given, it looks rather doable to double the SIMD width again.
"And that's what's expected to happen with AVX2, since its gather support will allow a lot of loops to be vectorized."
The same loops can already be vectorized using software-synthesized gather (there is a common misconception that it's not possible); hardware gather will speed them up, but it's already way faster than keeping them scalar. Now, I think we already discussed gather well enough in the past at RWT and on the Intel AVX forum.
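A minimal sketch of such a software-synthesized gather using plain AVX intrinsics (gather8 is an illustrative name, not a real API):

#include <immintrin.h>

// Emulate an 8-wide gather: spill the indices to memory, do eight
// scalar loads, and repack the results into one ymm register.
static inline __m256 gather8(const float* base, __m256i idx)
{
    alignas(32) int i[8];
    _mm256_store_si256((__m256i*)i, idx);
    return _mm256_set_ps(base[i[7]], base[i[6]], base[i[5]], base[i[4]],
                         base[i[3]], base[i[2]], base[i[1]], base[i[0]]);
}

AVX2's hardware gather (VGATHERDPS) collapses this into a single instruction, but as argued above, the synthesized form already beats a scalar loop.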
"not really, since spills/fills are mostly to/from the store-to-load forwarding buffer"
That sounds like a far too optimistic assumption. These buffers are finite in size and don't hold the data for long. In particular, when unrolling and software-pipelining loops, the distance between spills and fills increases, and the chances of still finding the values in the store buffer are slim. What doesn't help either is that spilling is specifically done on variables which aren't needed for a while (else they'd be kept in a register instead). And last but not least, there's only one store port and two load ports, so between all the spills/fills and regular loads/stores you quickly get contention...
"as already requested, do you have a real and complete example to share showing the problem concretely? For all the high-throughput AVX cases I'm dealing with, the main limitations come from L1D/L2 cache bandwidth. I'm stunned you can measure issues due to the instruction latencies; I'd say these are 2 or 3 orders of magnitude less important in the real world than the cache bandwidth limitation."
Exactly! All your spills and fills cause port congestion.
"instead of all this theory, why not share an illustrative example?"
The traces I've worked with are just too long to post here, and more importantly contain potential IP from clients. But since you mentioned that you work with 3D graphics, you should have access to plenty of shader kernels which are not suitable for unrolling.
"how can you achieve a 20% increase in performance with the same peak throughput?"
That's what you would typically gain from eliminating spill/fill instructions and improving scheduling opportunities.
"don't forget SMT already helps a lot in hiding memory hierarchy latencies, and 4-way SMT is a possibility for Haswell; AVX2 code will already be very near its maximum potential throughput, with the only limitations coming from cache and memory bandwidth, something that your idea will not improve"
Once again I'd like you to check whether it's the bandwidth or load/store port contention which is hampering your code, by using AVX-128 instead of AVX-256.
"...access to plenty of shader kernels which are not suitable for unrolling."
I'd say where unrolling is unnecessary, to be more precise; there are a lot of cases with not a single spill in the unrolled version and the unrolled version no faster. That's why I can't understand your POV on the matter. And since you use the IP / trade secrets excuse when it's so easy to devise a convincing example if we have something concrete to show, I'm afraid I'll not learn much from you.
"SMT doesn't help at all with load/store port contention."
So it should not be the problem, since we typically enjoy a 25%+ speedup from enabling Hyper-Threading (slightly better speedups with 64-bit than 32-bit code, and with AVX code than SSE code, for some reason; I'm sure you'll say due to less load/store contention in 64-bit mode, but the difference should be 1% or 2%). The workload that scales best in our series of 3D models for the regression tests is at a 31% speedup from HT. Note that power consumption isn't significantly up with HT enabled, according to the ASUS AI Suite.
"so, if I understand you well, you think AVX-512 and AVX-1024 will be introduced at the same time, right?"
Not necessarily. When AVX-1024 becomes available there will definitely be AVX-512 too, for obvious reasons. The reverse isn't strictly going to be true, and I can think of two reasons: renaming registers, and marketing. To keep the out-of-order engine running smoothly you need sufficient registers for renaming. Sandy Bridge is supposed to have 144 physical SIMD registers, of which 2 x 16 are required for two thread contexts, leaving 112 for renaming. AVX-1024 would increase each thread's context by 3 x 16 256-bit registers, so a total of 240 registers would be desirable to keep the same renaming capacity. Fortunately that's not a lot for the 16 nm process, where we might see AVX-1024 at the earliest (two 'tock' generations from Sandy Bridge), and due to the inherent latency hiding there's no strict need for the same renaming capacity; but at 176 registers, AVX-512 would be cheaper still.
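Spelling out the register bookkeeping in that reply (the 144-register figure and the 2 x 16 contexts are from the post; the rest follows from them):
Sandy Bridge: 144 physical - 2 threads x 16 architectural = 112 left for renaming
AVX-1024 state: 2 threads x 16 registers x 4 x 256-bit slices = 128 registers of context
same renaming capacity: 128 + 112 = 240 physical registers
AVX-512 state: 2 x 16 x 2 slices = 64, hence 64 + 112 = 176 physical registers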
"Less useless padding and direct support for wider vectors; it should allow clock-gating the front-end more effectively than with these small AVX-1024 vectors. Processing long vectors over narrow physical execution units provides massive performance/power advantages, doesn't it? Hey, see how you convinced me!"
I'm glad it's finally making sense to you, but keep in mind that it's a delicate balancing act. Too much of a good thing becomes a bad thing.
"It is exactly what was done for Sandy Bridge: SIMD width doubled, GPR width unchanged. Now that a PRF is a given, it looks rather doable to double the SIMD width again."
No, Sandy Bridge cleverly borrows ALUs from another execution stack for floating-point operations only. It doesn't double the bandwidth either, but makes a compromise with the symmetric load ports. So in many ways it's just a stepping stone toward Haswell, which will offer 256-bit execution for all SIMD instructions, bring us gather support to facilitate parallelizing generic code, and is next to certain to increase cache bandwidth.
"Fortunately that's not a lot for the 16 nm process. [...] So I'm hoping AMD is thinking of offering AVX-1024 support sooner rather than later. Of course their process disadvantage might make it harder to add sufficient registers."
Don't forget that AVX-1024 isn't disclosed yet, and remember the FMA4 debacle. You seem to think the register file takes a lot of chip area; are you sure? I'll be interested to see what a CPU designer has to say about it.
"Anyway, if I had to place a bet I would say that getting AVX-1024/512 in one go is more likely than getting just AVX-512 first."
I bet on AVX-512 first, full width. Let's see who will be right in a few years.
