Old 01-25-2012, 12:53 PM   #51
Nemesis 1
Banned
 
Join Date: Dec 2006
Posts: 11,379
Default

Quote:
Originally Posted by CPUarchitect View Post
Intel differentiates between AVX-128 and AVX-256 instructions. So obviously their 512-bit and 1024-bit instructions would be named AVX-512 and AVX-1024 respectively.

Other people appear to show an interest in AVX-512 and AVX-1024 as well, and some come to the same conclusion that splitting the execution over multiple cycles offers the best cost/gain ratio. I am not "Nick".
If you are Nick, you've changed your posting style as far as I can see. But there are people out there who change their posting style. I'm not even sure how many YouTube accounts I have. But I do know how many videos I have created. LOTS
Old 01-25-2012, 02:51 PM   #52
CPUarchitect
Senior Member
 
Join Date: Jun 2011
Posts: 223
Default

Quote:
Originally Posted by bronxzv View Post
indeed, that's exactly why I was reacting to this comment of yours:
"
In particular if you want to achieve latency hiding with AVX-256 it's easy to blow out the uop cache due to unrolling and spilling,
"
it never happens in practice since we unroll only small loops
But that means you're totally ignoring the latency problem with big loops!
Quote:
indeed and it changes absolutely nothing to the fact that you don't need to do loop unrolling or software pipelining to hide latencies on OoO cores
I'm sorry but this is where you're mistaken. All OpenCL kernels I've tested suffer from low ILP when run on a modern x86 out-of-order CPU. It's no coincidence that Intel increased the scheduling window from 36 to 54 between Nehalem and Sandy Bridge. But that doesn't mean that out-of-order scheduling can handle anything now. They also provided a cycle-accurate emulator for Sandy Bridge to tune critical paths, but it only proves useful for hand-written assembly. Unfortunately software pipelining didn't help for these OpenCL kernels due to the increased register pressure.

It's a dilemma that can only be solved by adding more registers. It is not feasible to directly add more logical registers to x86, but the same thing can be achieved by widening the logical registers and processing the instructions over multiple cycles.
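To illustrate the idea in code: a sketch only, with hypothetical names, since no AVX-1024 type or instruction exists. A 1024-bit logical register can be held as four 256-bit parts, with one logical instruction sequenced over four 256-bit operations.
Code:
#include <immintrin.h>

/* Hypothetical illustration: one "AVX-1024" register as four 256-bit parts;
   a single logical FMA is performed as four 256-bit FMAs, one chunk at a time
   (the hardware idea being one chunk per cycle). Requires FMA support (-mfma);
   the v1024 type and helper name are made up for this sketch. */
typedef struct { __m256 part[4]; } v1024;

static inline v1024 v1024_fmadd(v1024 a, v1024 b, v1024 c)
{
    v1024 r;
    for (int i = 0; i < 4; ++i)
        r.part[i] = _mm256_fmadd_ps(a.part[i], b.part[i], c.part[i]);
    return r;
}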

And AVX-1024 wouldn't just offer latency hiding, it would also lower the power consumption. You can't achieve any of these advantages with AVX-256.
Quote:
I meant your idea of processing 1024-bit data in 4 chunks with a single uop for the 4 parts; it looks like a new idea in the x86 world
No, that part is not new either. The Pentium 4 still had the 64-bit SIMD units from the Pentium 3, but used a single uop for every 128-bit instruction.
Old 01-25-2012, 03:17 PM   #53
bronxzv
Senior Member
 
Join Date: Jun 2011
Posts: 406
Default

Quote:
Originally Posted by CPUarchitect View Post
But that means you're totally ignoring the latency problem with big loops!
We don't unroll them because it's actually faster on real targets; if there were a problem it would show in the timings. Anyway, why would there be a latency problem with the typical 2-5 clock latencies of common instructions? As already asked, I'd be glad to see an actual code sample that clearly shows the issue; maybe your use cases are so different from mine that I overlooked something?

Quote:
Originally Posted by CPUarchitect View Post
They also provided a cycle-accurate emulator for Sandy Bridge
Sure, I've used IACA since v1.0. Btw, I was discussing IACA today with its developer and he says it doesn't model Sandy Bridge well
http://software.intel.com/en-us/foru...02334&o=a&s=lr

Quote:
Originally Posted by CPUarchitect View Post
but it only proves useful for hand-written assembly.
Not only for that; I use it extensively with C++ wrapper classes around intrinsics and it has proved very useful

Quote:
Originally Posted by CPUarchitect View Post
It's a dilemma that can only be solved by adding more registers. It is not
I'll be happy with some more registers, but first of all you can reuse a logical register before its physical register is free: another physical register will be allocated for you from the pool. See for example what the Intel compiler does with a 5x unrolled loop of your example, below. The true limiter is the number of physical registers divided by the number of running threads. Note that this already allows you, today, to unroll 4x (as AVX-1024 over 256-bit hardware would achieve) any loop with independent iterations (as is the typical OpenCL case) with few extra register fills/spills.

Code:
 
 
.B13.5:: ; Preds .B13.5 .B13.4
vmovups ymm1, YMMWORD PTR [rcx+rax*4] ;418.24
inc r10d ;416.3
vmovups ymm0, YMMWORD PTR [rdx+rax*4] ;418.32
vfmadd213ps ymm1, ymm0, YMMWORD PTR [r8+rax*4] ;419.20
vmovups YMMWORD PTR [r9+rax*4], ymm1 ;419.5
vmovups ymm3, YMMWORD PTR [32+rcx+rax*4] ;418.24
vmovups ymm2, YMMWORD PTR [32+rdx+rax*4] ;418.32
vfmadd213ps ymm3, ymm2, YMMWORD PTR [32+r8+rax*4] ;419.20
vmovups YMMWORD PTR [32+r9+rax*4], ymm3 ;419.5
vmovups ymm5, YMMWORD PTR [64+rcx+rax*4] ;418.24
vmovups ymm4, YMMWORD PTR [64+rdx+rax*4] ;418.32
vfmadd213ps ymm5, ymm4, YMMWORD PTR [64+r8+rax*4] ;419.20
vmovups YMMWORD PTR [64+r9+rax*4], ymm5 ;419.5
vmovups ymm1, YMMWORD PTR [96+rcx+rax*4] ;418.24
vmovups ymm0, YMMWORD PTR [96+rdx+rax*4] ;418.32
vfmadd213ps ymm1, ymm0, YMMWORD PTR [96+r8+rax*4] ;419.20
vmovups YMMWORD PTR [96+r9+rax*4], ymm1 ;419.5
vmovups ymm3, YMMWORD PTR [128+rcx+rax*4] ;418.24
vmovups ymm2, YMMWORD PTR [128+rdx+rax*4] ;418.32
vfmadd213ps ymm3, ymm2, YMMWORD PTR [128+r8+rax*4] ;419.20
vmovups YMMWORD PTR [128+r9+rax*4], ymm3 ;419.5
add rax, 40 ;416.3
cmp r10d, 25 ;416.3
jb .B13.5 ; Prob 95% ;416.3
Then you can use indexed addressing for frequently accessed read-only data; it looks as if the processor accesses them from buffers much like the store-forwarding buffers (some kind of virtual registers). I was really stunned to see that it's as fast as using a register. A recent optimization in the Intel compiler does just that, even in low register-pressure conditions; I discuss it here: http://software.intel.com/en-us/foru...01886&o=a&s=lr
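For reference, a plausible reconstruction of the C source behind the unrolled loop above (the array names and the element count are guesses inferred from the loop bounds and the compiler's line annotations, not the actual source):
Code:
/* Guessed source shape only: 25 iterations x 5 unrolled bodies x 8 floats per
   ymm register = 1000 elements. Array names are hypothetical; only the pattern
   d[i] = a[i] * b[i] + c[i] is inferred from the vmovups/vfmadd213ps sequence. */
void fma_loop(const float *a, const float *b, const float *c, float *d)
{
    for (int i = 0; i < 1000; ++i)
        d[i] = a[i] * b[i] + c[i];
}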

Quote:
Originally Posted by CPUarchitect View Post
it would also lower the power consumption. You can't achieve any of these advantages with AVX-256.
let's say you are right, how much % less power consumption do you envision?

Quote:
Originally Posted by CPUarchitect View Post
No, that part is not new either. The Pentium 4 still had the 64-bit SIMD units from the Pentium 3, but used a single uop for every 128-bit instruction.
If the only prior example is the most power-hungry chip of the bunch, I'd suggest not using it as an example to sell your idea...

Last edited by bronxzv; 01-26-2012 at 01:33 AM.
Old 01-25-2012, 04:52 PM   #54
Nemesis 1
Banned
 
Join Date: Dec 2006
Posts: 11,379
Default

Guys, this has been a great debate. Thank you. I asked my daughter, whose major was computer science, as you both sound like you're in the know. My daughter said this is some deep stuff, but so far she gives the nod to CPUarch, with reservations.
Old 01-25-2012, 06:01 PM   #55
piesquared
Golden Member
 
Join Date: Oct 2006
Posts: 1,058
Default

Quick question: is this Larrabee an actual GPU, or is it just some sort of VLC part like what seems to be in Ivy Bridge? Hope they fix the 23.976 bug so they can at least show a smooth fake.
Old 01-26-2012, 04:11 AM   #56
bronxzv
Senior Member
 
Join Date: Jun 2011
Posts: 406
Default

Quote:
Originally Posted by Nemesis 1 View Post
My daughter said this is some deep stuff, but so far she gives the nod to CPUarch
Greetings to your daughter. So she thinks AVX-1024 over 256-bit hardware is a more likely future after AVX-256 than AVX-512 over 512-bit hardware? Let her think a bit more about it and let us know.

Last edited by bronxzv; 01-26-2012 at 07:59 AM.
Old 01-26-2012, 07:55 AM   #57
denev2004
Member
 
denev2004's Avatar
 
Join Date: Dec 2011
Location: PRC
Posts: 105
Default

Quote:
Originally Posted by piesquared View Post
Quick question: is this Larrabee an actual GPU, or is it just some sort of VLC part like what seems to be in Ivy Bridge? Hope they fix the 23.976 bug so they can at least show a smooth fake.
This bug of course can't exist on LRB. It's not just that it's a different arch; they've also fixed it in the new stepping.
__________________
There is no fate but what we make for ourselves.
Old 01-26-2012, 08:05 AM   #58
denev2004
Member
 
denev2004's Avatar
 
Join Date: Dec 2011
Location: PRC
Posts: 105
Default

Quote:
Originally Posted by Nemesis 1 View Post
Guys, this has been a great debate. Thank you. I asked my daughter, whose major was computer science, as you both sound like you're in the know. My daughter said this is some deep stuff, but so far she gives the nod to CPUarch, with reservations.
What's her idea?
BTW, I'd like to say at least 10% of us have studied CS courses, although I'm kinda not one of that 10%.

Quote:
Originally Posted by CPUarchitect View Post
I'm sorry but this is where you're mistaken. All OpenCL kernels I've tested suffer from low ILP when run on a modern x86 out-of-order CPU. It's no coincidence that Intel increased the scheduling window from 36 to 54 between Nehalem and Sandy Bridge. But that doesn't mean that out-of-order scheduling can handle anything now. They also provided a cycle-accurate emulator for Sandy Bridge to tune critical paths, but it only proves useful for hand-written assembly. Unfortunately software pipelining didn't help for these OpenCL kernels due to the increased register pressure.
Isn't it a problem with OCL itself?
I think OpenCL is mostly designed to provide a platform that lets different kinds of architectures run the same general-purpose programs. It does not pay much attention to efficiency, as that might be opposed to that goal. At least its programming model is at such a high level that it sounds inefficient.
__________________
There is no fate but what we make for ourselves.
Old 01-26-2012, 09:58 AM   #59
bronxzv
Senior Member
 
Join Date: Jun 2011
Posts: 406
Default

Quote:
Originally Posted by denev2004 View Post
At least its programming model is at such a high level that it sounds inefficient.
One of the problems with such high-level portable code is that it can't be well suited to every specific target, including future unknown targets; for top performance you can't escape considering the physical implementation (cache line size, cache capacities, width of the SIMD units, ...).

To go back to the issue discussed with CPUarchitect: with vectors as wide as AVX-1024 will have (32 x 32-bit elements), there is a serious padding issue for small arrays. For example, if you have arrays of 130 elements in some existing application, you will need to process 160 elements with AVX-1024 instead of only 136 elements with AVX-256; on average 16 elements are wasted with AVX-1024 and only 4 with AVX-256. That's one of the reasons it's very difficult to sell (*) wider vectors to developers without a true wider implementation, even if with smart tricks the power consumption is 5% lower. If you want lower power consumption, just clock down a full-width implementation and enjoy 50% less power or better.

*: all the way back to the Pentium MMX, a major issue with SIMD ISAs has been convincing developers to use them (just look at how many AVX-enabled applications are available one year after the Sandy Bridge launch), since you have to keep supporting legacy targets missing the new ISAs; the more code paths, the higher the development and validation costs
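To make the padding arithmetic above concrete, a small stand-alone check (assumes 32-bit elements; the 130-element array size is just the example figure used above):
Code:
#include <stdio.h>

/* Elements actually processed when a loop over n items is rounded up to a
   whole number of vectors of the given width (in 32-bit lanes). */
static unsigned padded(unsigned n, unsigned lanes)
{
    return (n + lanes - 1) / lanes * lanes;
}

int main(void)
{
    printf("AVX-256 : %u elements\n", padded(130,  8));   /* 136 -> 6 wasted  */
    printf("AVX-1024: %u elements\n", padded(130, 32));   /* 160 -> 30 wasted */
    return 0;   /* averaged over arbitrary sizes the waste is ~4 vs ~16 lanes */
}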

Last edited by bronxzv; 01-26-2012 at 11:09 AM.
Old 01-26-2012, 09:43 PM   #60
CPUarchitect
Senior Member
 
Join Date: Jun 2011
Posts: 223
Default

Quote:
Originally Posted by bronxzv View Post
1) because MIC products are 512-bit wide and they will certainly try to make standard components for both, (implementation and validation less costly, not more), with a grand plan to unify the ISAs at some point, it makes no sense in the long run to have two x86 based vector ISAs
Indeed it seems beneficial to converge LRBni and AVX. But the ISAs don't dictate the execution width. The MIC products could execute future AVX-1024 instructions in 2 cycles and AVX-512 instructions in 1 cycle. Or, they could be equipped with two 256-bit execution units which execute AVX-1024 instructions in 4 cycles.

So the fact that LRBni consists of 512-bit instructions which Knight's Corner executes in a single cycle is no indication at all of where things could be heading. I'm actually quite convinced that the MIC product line will be short lived. There are already 16-core Ivy Bridge parts on the roadmap, so Knight's Corner might not offer much advantage over a 16-core AVX2 equipped Haswell part.
Quote:
2) because it was the natural evolution so far, 64-bit MMX to 128-bit SSE to 256-bit AVX, next logical step is 512-bit
Note that somewhere in between that evolution we went from 32-bit to 64-bit processors, justifying at least one doubling of the SIMD width. The other one makes sense if we consider that future software should contain a larger number of parallelized loops thanks to gather support.

But beyond that it would lose balance. It's obvious that the SIMD execution width can't keep doubling again and again. It has to stop at a point where a wide range of workloads can be executed efficiently. 512-bit SIMD execution units would compromise scalar performance. It made sense for a MIC since it can still rely on a CPU to run the scalar workloads. For a homogeneous CPU however I believe 256-bit is the maximum sensible width.
Quote:
3) because most ISVs will not bother to port their code with no performance gain but only a slight perf/power advantage
AVX is squarely aimed at auto-vectorization. So it should only take a simple recompile to take advantage of AVX-1024. With JIT-compilation like with OpenCL the ISV doesn't even have to do a recompile itself.

And any free performance improvement will certainly be welcomed. Also, with many things going ultra-mobile these days it's definitely a concern of ISVs to ensure they get the best performance/Watt.
Quote:
4) with wider vectors there is on average more padding and thus unused computation slots, if the peak AVX-256 and AVX-1024 throughputs are the same the actual performance will be lower with AVX-1024 (more unused slots), all the pain to support yet another code path and worse performance, i.e. a loss-loss situation
It's certainly true that AVX-1024 won't help in all cases, but note that there would still be AVX-512, AVX-256, and AVX-128. And again the decision about the optimal approach would in most cases be handled by the (JIT-)compiler. So no need to worry about additional code paths.
Old 01-27-2012, 03:01 AM   #61
bronxzv
Senior Member
 
Join Date: Jun 2011
Posts: 406
Default

Quote:
Originally Posted by CPUarchitect View Post
Indeed it seems beneficial to converge LRBni and AVX. But the ISAs don't dictate the execution width.
You got it reversed: the ISA dictates the minimal execution width. For example, you can process the whole AVX instruction set in two 128-bit parts because there are no cross-128-bit-lane instructions, though you can't process it in 4 x 64-bit parts due to instructions such as VPERMILPS/PD, VEXTRACTF128 and VINSERTF128, which require a 128-bit execution width.

AVX2 introduces new instructions such as VPERMPS/PD/D/Q, VBROADCASTI128 and VPERM2I128 that you can't split into two 128-bit parts, so the minimal execution width for AVX2 is 256-bit.
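As an illustration (a sketch of my own, not code from the thread): VPERMPS, via its intrinsic, can route any source element to any destination lane, which is why it can't be executed as two independent 128-bit halves.
Code:
#include <immintrin.h>

/* Requires AVX2 (-mavx2). Reverses the 8 floats of a ymm register: element 7
   ends up in lane 0 and element 0 in lane 7, so data crosses the 128-bit lanes. */
__m256 reverse8(__m256 v)
{
    const __m256i idx = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    return _mm256_permutevar8x32_ps(v, idx);   /* VPERMPS */
}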

We can't really talk about what we call "AVX-512" and "AVX-1024" since these aren't disclosed yet. At this stage it's quite possible that we will see AVX-512 only in the far future and never AVX-1024, but instead a true vector ISA with variable-length arrays (much like the string instructions); it would be a far better fit with OpenCL for avoiding the padding issues.

Quote:
Originally Posted by CPUarchitect View Post
so Knight's Corner might not offer much advantage over a 16-core AVX2 equipped Haswell part.
I agree with you here, that's why I consider AVX2 a very important target for my code (3D rendering)

Quote:
Originally Posted by CPUarchitect View Post
Note that somewhere in between that evolution we went from 32-bit to 64-bit processors, justifying at least one doubling of the SIMD width.
Sorry but I don't see how it's related

Quote:
Originally Posted by CPUarchitect View Post
So no need to worry about additional code paths.
If you want top performance you need several code paths. One reason is that any intermediate language will always miss some features when a new ISA is disclosed, and so is not a good match for all the instructions you can use by targeting the native ISA. That's exactly why Intel dispatches to a lot of different code paths in MKL & IPP, for example; the comparison with OpenCL implementations will not be pretty, I'm sure.
Even if the compiler can help you a lot, it's still a lot of development and validation effort for each path, especially if you factor in the combination of all your ISA paths with 32-bit and 64-bit modes. It's not practical to release something without heavy regression testing and it rapidly becomes a significant cost.

Last edited by bronxzv; 01-27-2012 at 09:57 AM.
Old 01-27-2012, 04:51 AM   #62
IntelUser2000
Elite Member
 
IntelUser2000's Avatar
 
Join Date: Oct 2003
Posts: 3,494
Default

Quote:
Originally Posted by CPUarchitect View Post
I'm actually quite convinced that the MIC product line will be short lived. There are already 16-core Ivy Bridge parts on the roadmap, so Knight's Corner might not offer much advantage over a 16-core AVX2 equipped Haswell part.
This is probably true, except that Ivy Bridge EX is coming in early 2013, about the same time as Knights Corner. Anything with 12 cores or more will be sold as EX parts, and considering their 1.5-year development schedule, that means you won't see Haswell EX until mid-2014. It would be competing with the successor to Knights Corner.

Oh, and Ivy Bridge is said to have up to 12 cores, not 16.

Let's compare theoreticals, shall we?

Knights Corner: 64 cores @ ~1.2GHz
Ivy Bridge EX: 12 cores @ 2.8GHz

GFlops-
KC: 64 cores x 1.2GHz x FMA x 8 DP Flops/cycle = 1.2 TFlops
IVB-EX: 12 cores x 2.8GHz x 2 FP units/core x 4 DP Flops/cycle = 268.8 GFlops

That's not counting the additional bandwidth that KC has with GDDR5+ memory, and that I am being reasonable with EX's clocks while being slightly conservative with KC's. I'm pretty sure the latter's market opportunities are limited, but it looks good for what it'll be targeted as: a co-processor for HPC workstations. No one except Intel knows at this point where they are going with this.
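For reference, the peak figures quoted above simply multiply out as follows (all inputs are the poster's estimates, not official specifications):
Code:
#include <stdio.h>

int main(void)
{
    double kc  = 64 * 1.2 * 2.0 * 8.0;   /* cores x GHz x FMA x 8 DP FLOPs/cycle: ~1228.8 */
    double ivb = 12 * 2.8 * 2.0 * 4.0;   /* cores x GHz x 2 FP units x 4 DP FLOPs: ~268.8 */
    printf("KC: %.1f GFLOPS, IVB-EX: %.1f GFLOPS\n", kc, ivb);
    return 0;
}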

Haswell is said to have 20 EUs. That's not Larrabee-based, it's Gen X based.
__________________
Core i7 2600K + Turbo Boost | Intel DH67BL/GMA HD 3000 IGP | Corsair XMS3 2x2GB DDR3-1600 @ 1333 9-9-9-24 |
Intel X25-M G1 80GB + Seagate 160GB 7200RPM | OCZ Modstream 450W | Samsung Syncmaster 931c | Windows 7 Home Premium 64-bit | Microsoft Sidewinder Mouse | Viliv S5-Atom Z520 WinXP UMPC

Last edited by IntelUser2000; 01-27-2012 at 04:56 AM.
Old 01-27-2012, 11:47 AM   #63
CPUarchitect
Senior Member
 
Join Date: Jun 2011
Posts: 223
Default

Quote:
Originally Posted by bronxzv View Post
We don't unroll them because it's actually faster on real targets; if there were a problem it would show in the timings. Anyway, why would there be a latency problem with the typical 2-5 clock latencies of common instructions? As already asked, I'd be glad to see an actual code sample that clearly shows the issue; maybe your use cases are so different from mine that I overlooked something?
Indeed unrolling long loops is not beneficial. It increases the register pressure which requires spilling code which counteracts any latency hiding benefits of the unrolling (with software pipelining). That's what I've been saying all along. It's a dilemma that can only be solved with extra register space, which can be provided in the form of AVX-1024.

There's an instruction latency problem with long loops because they typically contain long chains of dependent instructions. Such a chain can easily exceed the length of the scheduling window. For instance Sandy Bridge has a 54 uop scheduler, which means that when you're aiming for an IPC of 3 you basically get 18 uops per arithmetic port. And when the average instruction latency is 3 cycles that means it can only fit dependency chains of 6 instructions. Any longer dependency chains, or any higher instruction latencies, lead to some loss of performance at the scheduler. Note that high latencies are one of the reasons AMD's Bulldozer architecture fails to achieve high performance. So you definitely should care about them.

That said, we're only talking about a potential ~20% performance gain from hiding more latency. There has been lots of research on the topic of increasing the scheduling window and they all agree ~256 entries would offer near-optimal scheduling for a 3-issue architecture but it's just not feasible to have such a large scheduler. With AVX-1024 a scheduler with 54 entries would behave like one with 216 entries! So just because you're not observing obvious latency issues doesn't mean there's not a small but worthwhile amount of performance that can be gained from unrolling with software pipelining if you had more register space.
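The scheduling arithmetic above can be spelled out directly (the window size, port count and latency are the figures assumed in this post, not measurements):
Code:
#include <stdio.h>

int main(void)
{
    int window = 54, ports = 3, latency = 3;   /* assumed Sandy Bridge figures */
    printf("uops per arithmetic port: %d\n", window / ports);              /* 18  */
    printf("dependency chain that still fits: %d\n",
           window / ports / latency);                                      /* 6   */
    printf("effective entries if each uop occupies 4 cycles: %d\n",
           window * 4);                                                    /* 216 */
    return 0;
}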
Quote:
I'll be happy with some more registers, but first of all you can reuse a logical register before its physical register is free: another physical register will be allocated for you from the pool. See for example what the Intel compiler does with a 5x unrolled loop of your example. The true limiter is the number of physical registers divided by the number of running threads. Note that this already allows you, today, to unroll 4x (as AVX-1024 over 256-bit hardware would achieve) any loop with independent iterations (as is the typical OpenCL case) with few extra register fills/spills.
Yes that's a remarkable feature, but you'd get even higher performance with AVX-1024 by eliminating the spill and restore instructions. Nothing extraordinary, but together with the power savings it's ample justification for the nearly negligible cost.

I fully understand that you're trying to explain that things are already pretty good as-is, but compared to throughput-oriented architectures it is really necessary to further improve performance/Watt and for x86 I believe AVX-1024 would be a crucial step.
Quote:
let's say you are right, how much % less power consumption do you envision?
The OpenSPARC T1 is known to spend 40% of its power consumption in the front-end, and I expect x86 to be similar. For a realistic throughput-oriented workload, wide vector instructions which take 4 cycles would allow the front-end to be clock-gated roughly twice as often. Hence a reduction in power consumption of 20% can be achieved.

Combined with a 20% increase in performance that's a 50% improvement in performance/Watt. With a higher clock frequency and/or more cores you can also get 50% higher performance at equal power consumption! And this is just a conservative estimate. There is reduced switching activity for other stages as well, and the latency hiding can be used to reduce aggressive prefetching.
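Spelling out the claimed numbers (all of them estimates from this post): 40% of power in the front-end, clock-gated twice as often, saves roughly half of that, and combined with a 20% performance gain this gives the quoted 50%.
Code:
#include <stdio.h>

int main(void)
{
    double power = 1.0 - 0.40 * 0.5;   /* front-end share halved -> 0.8x power */
    double perf  = 1.20;               /* assumed 20% performance gain         */
    printf("perf/Watt improvement: %.0f%%\n", (perf / power - 1.0) * 100.0);   /* 50% */
    return 0;
}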
Quote:
If the only prior example is the most power-hungry chip of the bunch, I'd suggest not using it as an example to sell your idea...
I was merely indicating that it has been done before. NetBurst's high power consumption isn't caused by this feature.
Old 01-27-2012, 12:21 PM   #64
CPUarchitect
Senior Member
 
Join Date: Jun 2011
Posts: 223
Default

Quote:
Originally Posted by denev2004 View Post
Isn't it a problem with OCL itself?
No, the issue of not having sufficient register space to optimally hide latencies applies to the vast majority of parallel workloads. You can't fix that with another software API. You need hardware support for it, and unless someone comes up with a better idea it looks like AVX-1024 would be most straightforward and most effective.
Old 01-27-2012, 12:36 PM   #65
bronxzv
Senior Member
 
Join Date: Jun 2011
Posts: 406
Default

Quote:
Originally Posted by CPUarchitect View Post
Indeed unrolling long loops is not beneficial. It increases the register pressure which requires spilling code which counteracts any latency hiding benefits of the unrolling (with software pipelining).
Not really, since spills/fills mostly go to/from the store-to-load forwarding buffer.

Quote:
Originally Posted by CPUarchitect View Post
There's an instruction latency problem with long loops because they typically contain long chains of dependent instructions.
As already requested, do you have a real and complete example to share that concretely shows the problem? For all the high-throughput AVX cases I'm dealing with, the main limitations come from L1D/L2 cache bandwidth. I'm stunned that you can measure issues due to instruction latencies; I'd say these are 2 or 3 orders of magnitude less important in the real world than the cache bandwidth limitation. Instead of all this theory, why not share an illustrative example?

Quote:
Originally Posted by CPUarchitect View Post
Hence a reduction in power consumption of 20% can be achieved.
not bad, if indeed the case


Quote:
Originally Posted by CPUarchitect View Post
Combined with a 20% increase in performance
How can you achieve a 20% increase in performance with the same peak throughput? Don't forget that SMT already helps a lot in hiding memory-hierarchy latencies, and 4-way SMT is a possibility for Haswell. AVX2 code will already be very near its maximum potential throughput; the only limitations will come from cache and memory bandwidth, something that your idea will not improve (I don't really buy your "less aggressive prefetching" argument since there are already very effective safeguards in Sandy Bridge against excessive prefetch).

Since some codes already reach 90% of peak throughput
http://software.intel.com/en-us/arti...ntel-mkl-v103/
do you mean you'll achieve 110% of the theoretical peak?

Btw, do you have a concrete AVX-256 example suffering a 20% slowdown from the latency or spill/fill problems you are referring to? Excuse me, but it absolutely doesn't match what we see when tuning actual code.

Quote:
Originally Posted by CPUarchitect View Post
I was merely indicating that it has been done before. NetBurst's high power consumption isn't caused by this feature.
I was kidding. Anyway, I'd be glad to have a pointer to some documentation explaining this single uop for SSE on the P4; it isn't explained at all in the documentation I have at hand.

Last edited by bronxzv; 01-27-2012 at 01:47 PM.
Old 01-27-2012, 12:40 PM   #66
CPUarchitect
Senior Member
 
Join Date: Jun 2011
Posts: 223
Default

Quote:
Originally Posted by bronxzv View Post
To go back to the issue discussed with CPUarchitect: with vectors as wide as AVX-1024 will have (32 x 32-bit elements), there is a serious padding issue for small arrays.
Yes, there is no denying that, but let's get something straight. GPUs have logical vector widths of 1024-bit and 2048-bit. So anything you could run efficiently on a GPU would also run well on a CPU with AVX-1024.

But the CPU has two more advantages. You can still use AVX-512 and AVX-256 without losing peak throughput (just a ~20% loss in scheduling efficiency). And secondly the CPU requires far fewer threads per core. If you give a GPU just 1024-bit worth of work it takes forever to get the result back due to moving the data to and from another core, and it can't hide any latencies without a massive amount of additional work. A homogeneous CPU with wide vectors doesn't suffer from that.

So just because I'm advocating a simple extension like AVX-1024 doesn't mean you should use it in situations where it offers no benefit. Use it for workloads which are more parallel than 256-bit but don't have to be as massively parallel as GPGPU workloads.
Old 01-27-2012, 01:48 PM   #67
CPUarchitect
Senior Member
 
Join Date: Jun 2011
Posts: 223
Default

Quote:
Originally Posted by bronxzv View Post
You got it reversed: the ISA dictates the minimal execution width. For example, you can process the whole AVX instruction set in two 128-bit parts because there are no cross-128-bit-lane instructions, though you can't process it in 4 x 64-bit parts due to instructions such as VPERMILPS/PD, VEXTRACTF128 and VINSERTF128, which require a 128-bit execution width.
Point taken, but we were talking about converging LRBni and AVX, which both logically split the vectors into 128-bit lanes. So while the minimum width is indeed 128-bit, the discussion was really about 256-bit versus 512-bit. In theory you could get all of the functionality of LRBni with AVX-512 instructions even if the execution units are 256-bit. Conclusion: any motivation for converging them isn't an argument why executing AVX-512 in a single cycle is more likely than executing AVX-1024 in four cycles.
Quote:
AVX2 introduces new instructions such as VPERMPS/PD/D/Q, VBROADCASTI128, VPERM2I128 that you can't split in two 128-bit parts, so the minimal execution width is 256-bit for AVX2
I don't think those strictly demand 256-bit. Note that SSE1/2 had 128-bit movlhps/movhlps/pshufd while the execution units were still 64-bit.
Quote:
we can't really talk of what we call "AVX-512" and "AVX-1024" since these aren't disclosed yet, at this stage it's well possible that we will see AVX-512 only in a far future and never AVX-1024 but a true vector ISA instead with variable length arrays (much like the string instructions) it will be a far better fit with OpenCL for avoiding the padding issues
You can have the choice between AVX-1024, AVX-512, AVX-256 and AVX-128. AVX-512 and AVX-1024 would use the same (currently reserved) VEX encoding bit so basically you get AVX-1024 for free as far as the instruction format goes. Any vector string extension would require entirely new encodings and I'm not sure if there's any space left for that. Plus I don't see what you'd gain over AVX-128/256/512/1024. It's quite similar in concept.
Quote:
I agree with you here, that's why I consider AVX2 a very important target for my code (3D rendering)
That's very interesting because 3D rendering consists of shaders which are fairly long and thus there are long dependency chains and no opportunities for unrolling and software pipelining unless you have massive register files like a GPU. So I can't see why you're trying so hard to find counter-arguments for AVX-1024 while your software is probably among those that would benefit from it the most.
Quote:
Sorry but I don't see how it's related
Between the Core and Core 2 architecture, Intel doubled up everything. Scalar execution became twice as wide, cache bandwidth doubled, and SIMD execution doubled. So the same balance was maintained. They didn't leave anything bottlenecked or underutilized.

If instead you want to just double the SIMD width you'd be making compromises. You can leave the bandwidth as-is but then it becomes starved, or you could double it too but then that's overkill for scalar workloads (increasing power consumption and chip area). The only way to justify changing the balance is if the workloads are changing too. And that's what is expected to happen with AVX2, since its gather support will allow a lot of loops to be vectorized.

But I don't see anything that would justify disturbing the balance even more with 512-bit SIMD units. At least not in the foreseeable future. It's of course still possible that in the distant future it will make some sense due to another shift in workloads, but before that becomes even remotely a viable option we'll have had many years during which executing AVX-1024 on 256-bit units was far easier to justify.
Old 01-27-2012, 02:12 PM   #68
CPUarchitect
Senior Member
 
Join Date: Jun 2011
Posts: 223
Default

Quote:
Originally Posted by IntelUser2000 View Post
This is probably true, except that Ivy Bridge EX is coming in early 2013, about the same time as Knights Corner. Anything with 12 cores or more will be sold as EX parts, and considering their 1.5-year development schedule, that means you won't see Haswell EX until mid-2014. It would be competing with the successor to Knights Corner.

Oh, and Ivy Bridge is said to have up to 12 cores, not 16.
According to the first post there will be 16 core parts. How many cores will be active remains to be seen of course. Likewise Knights Corner is said to have "more than 50 cores" which likely means 64 and several disabled depending on yield.
Quote:
Let's compare theoreticals, shall we?

Knights Corner: 64 cores @ ~1.2GHz
Ivy Bridge EX: 12 cores @ 2.8GHz

GFlops-
KC: 64 cores x 1.2GHz x FMA x 8 DP Flops/cycle = 1.2 TFlops
IVB-EX: 12 cores x 2.8GHz x 2 FP units/core x 4 DP Flops/cycle = 268.8 GFlops
Yes but notice that Haswell will double the computing density with FMA while Knights can't pull that trick any more.

Also please note that I didn't claim Knights doesn't stand a chance. I said it will be short lived. Of course they won't put something on the market and remove it one year later so short lived still means multiple years. But clearly Knights will have to face Haswell and its successors. There's an unmistakable convergence taking place, and a generic CPU is a better value than a MIC even if the peak throughput is lower because it can run more diverse workloads efficiently. Note that a MIC requires a massive number of threads while a CPU with out-of-order execution doesn't suffer that much from Amdahl's Law. So outside of some niche markets I don't expect the MIC to survive for long.
Quote:
That's not counting additional bandwidth that KC has with GDDR5+ memory.
Granted, but DDR4 is on its way as well and CPUs require less RAM bandwidth due to massive caches.
Quote:
Haswell is said to have 20 EUs. That's not Larrabee-based, its Gen X based.
Do you think all Haswell parts will have an IGP? And do you happen to know how many GFLOPS 20 Gen X EU's amount to?
Old 01-27-2012, 02:25 PM   #69
bronxzv
Senior Member
 
Join Date: Jun 2011
Posts: 406
Default

Quote:
Originally Posted by CPUarchitect View Post
I don't think those strictly demand 256-bit.
If a generic permute of a full register doesn't require full width, well, I don't see what would require full width for decent performance.

Quote:
Originally Posted by CPUarchitect View Post
Note that SSE1/2 had 128-bit movlhps/movhlps
IIRC these were slow and to be avoided before Conroe, but I may be wrong

Quote:
Originally Posted by CPUarchitect View Post
You can have the choice between AVX-1024, AVX-512, AVX-256 and AVX-128. AVX-512 and AVX-1024 would use the same (currently reserved) VEX encoding bit so basically you get AVX-1024 for free as far as the instruction format goes.
So, if I understand you well, you think AVX-512 and AVX-1024 will be introduced at the same time, right?

Quote:
Originally Posted by CPUarchitect View Post
Plus I don't see what you'd gain over AVX-128/256/512/1024. It's quite similar in concept.
Less useless padding and direct support for wider vectors; it should allow clock-gating the front-end more effectively than with these small AVX-1024 vectors. Processing long vectors over narrow physical execution units provides massive performance/power advantages, doesn't it? Hey, see how you convinced me!

Quote:
Originally Posted by CPUarchitect View Post
That's very interesting because 3D rendering consists of shaders
I'm sure you know the subject very well

Quote:
Originally Posted by CPUarchitect View Post
If instead you want to just double the SIMD width you'd be making compromises.
That is exactly what was done for Sandy Bridge: SIMD width doubled, GPR width unchanged. Now that a PRF is a given, it looks rather doable to double the SIMD width again.

Quote:
Originally Posted by CPUarchitect View Post
And that's what expected to happen with AVX2 since its gather support will allows a lot of loops to be vectorized.
The same loops can already be vectorized using software-synthesized gather (there is a common misconception that it's not possible). Hardware gather will speed them up, but it's already way faster than keeping them scalar. I think we already discussed gather well enough in the past at RWT and on the Intel AVX forum.
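For readers unfamiliar with the term, a software-synthesized gather is typically just scalar loads repacked into a vector register. A minimal sketch (my illustration, not bronxzv's code, and much simpler than a tuned version):
Code:
#include <immintrin.h>

/* Gather 8 floats from base[idx[0..7]] into one ymm register using plain
   scalar loads; needs only AVX1, no AVX2 gather instruction. */
static inline __m256 gather8_ps(const float *base, const int idx[8])
{
    return _mm256_setr_ps(base[idx[0]], base[idx[1]], base[idx[2]], base[idx[3]],
                          base[idx[4]], base[idx[5]], base[idx[6]], base[idx[7]]);
}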

Last edited by bronxzv; 01-27-2012 at 03:15 PM.
Old 01-27-2012, 03:13 PM   #70
CPUarchitect
Senior Member
 
Join Date: Jun 2011
Posts: 223
Default

Quote:
Originally Posted by bronxzv View Post
Not really, since spills/fills mostly go to/from the store-to-load forwarding buffer.
That sounds like a far too optimistic assumption. These buffers are finite in size and don't hold the data for long. In particular when unrolling and software pipelining loops, the distance between spills and fills increases and the chances of still finding the values in the store buffer are slim. What doesn't help either is that spilling is specifically done on variables which aren't needed for a while (else they'd be kept in a register instead). And last but not least, there's only one store port and two load ports so between all the spills/fills and regular load/store you quickly get contention...
Quote:
As already requested, do you have a real and complete example to share that concretely shows the problem? For all the high-throughput AVX cases I'm dealing with, the main limitations come from L1D/L2 cache bandwidth. I'm stunned that you can measure issues due to instruction latencies; I'd say these are 2 or 3 orders of magnitude less important in the real world than the cache bandwidth limitation.
Exactly! All your spills and fills cause port congestion.

I'm expecting Haswell to double the width of each load and store port, but while that will certainly help it won't completely eliminate the contention problem. You should be able to observe this by using AVX-128 code.
Quote:
Instead of all this theory, why not share an illustrative example?
The traces I've worked with are just too long to post here, and more importantly contain potential IP from clients. But since you mentioned that you work with 3D graphics you should have access to plenty of shader kernels which are not suitable for unrolling.
Quote:
How can you achieve a 20% increase in performance with the same peak throughput?
That's what you would typically gain from eliminating spill/fill instructions and improving scheduling opportunities.
Quote:
Don't forget that SMT already helps a lot in hiding memory-hierarchy latencies, and 4-way SMT is a possibility for Haswell. AVX2 code will already be very near its maximum potential throughput; the only limitations will come from cache and memory bandwidth, something that your idea will not improve
Once again I'd like you to check whether it's the bandwidth or load/store port contention which is hampering your code, by using AVX-128 instead of AVX-256.

SMT doesn't help at all with load/store port contention. And implementing 4-way SMT is far more expensive than AVX-1024, worsens data locality, and consumes more power. AVX-1024 offers latency hiding like SMT but without the bad properties.
Old 01-27-2012, 03:35 PM   #71
bronxzv
Senior Member
 
Join Date: Jun 2011
Posts: 406
Default

Quote:
Originally Posted by CPUarchitect View Post
access to plenty of shader kernels which are not suitable for unrolling.
I'd say "where unrolling is unnecessary", to be more precise; there are a lot of cases with not a single spill in the unrolled version and yet the unrolled version is no faster. That's why I can't understand your POV on the matter, and since you use the IP / trade secrets excuse when it would be so easy to devise a convincing example if you had something concrete to show, I'm afraid I'll not learn much from you.

Quote:
Originally Posted by CPUarchitect View Post
SMT doesn't help at all with load/store port contention.
So it should not be the problem, since we typically enjoy a 25%+ speedup from enabling Hyper-Threading (slightly better speedups with 64-bit than 32-bit code, and with AVX code than with SSE code, for some reason; I'm sure you'll say it's due to less load/store contention in 64-bit mode, but the difference should be 1% or 2%). The workload that scales best in our series of 3D models for the regression tests sees a 31% speedup from HT. Note that power consumption isn't significantly higher with HT enabled according to the ASUS AI Suite.

Last edited by bronxzv; 01-27-2012 at 04:50 PM.
Old 01-27-2012, 04:31 PM   #72
CPUarchitect
Senior Member
 
Join Date: Jun 2011
Posts: 223
Default

Quote:
Originally Posted by bronxzv View Post
So, if I understand you well, you think AVX-512 and AVX-1024 will be introduced at the same time, right?
Not necessarily. When AVX-1024 becomes available there will definitely be AVX-512 too for obvious reasons. The reverse isn't strictly going to be true and I can think of two reasons: renaming registers, and marketing. To keep the out-of-order engine running smoothly you need sufficient registers for renaming. Sandy Bridge is supposed to have 144 physical SIMD registers, of which 2 x 16 are required for two thread contexts, leaving 112 for renaming. AVX-1024 would increase the contexts with 3 x 16 256-bit registers, so a total of 240 registers would be desirable to keep the same renaming capacity. Fortunately that's not a lot for the 16 nm process where we might see AVX-1024 at the earliest (two 'tock' generations from Sandy Bridge), and due to the inherent latency hiding there's no strict need to have the same renaming capacity, but at 176 registers AVX-512 would be cheaper still.
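The register-file arithmetic in the paragraph above works out as follows (the 144-entry figure and the per-thread context sizes are this post's assumptions about Sandy Bridge, not published numbers):
Code:
#include <stdio.h>

int main(void)
{
    int physical = 144, threads = 2, arch_regs = 16;   /* assumed SNB figures   */
    int renaming = physical - threads * arch_regs;     /* 112 left for renaming */
    /* 256-bit slots per architectural register: 1 = AVX-256, 2 = AVX-512, 4 = AVX-1024 */
    for (int slots = 1; slots <= 4; slots *= 2)
        printf("AVX-%d: %d physical registers for equal renaming capacity\n",
               256 * slots, renaming + threads * arch_regs * slots);
    return 0;   /* prints 144, 176 and 240, matching the figures above */
}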

As for the marketing reason, well, they might rather have you buy both a CPU with AVX-512 and then one with AVX-1024 instead of immediately offering the latter and leaving less incentive to upgrade soon. This completely depends on what the competition might offer. So I'm hoping AMD is thinking of offering AVX-1024 support sooner rather than later. Of course their process disadvantage might make it harder to add sufficient registers.

Anyway, if I had to place a bet I would say that getting AVX-1024/512 in one go is more likely than getting just AVX-512 first. It's only a 30% register file increase per 'tock', in line with previous generations, and it's probably fine with less.
Quote:
Less useless padding and direct support for wider vectors; it should allow clock-gating the front-end more effectively than with these small AVX-1024 vectors. Processing long vectors over narrow physical execution units provides massive performance/power advantages, doesn't it? Hey, see how you convinced me!
I'm glad it's finally making sense to you but keep in mind that it's a delicate balancing act. Too much of a good thing becomes a bad thing.

In particular it's really still beneficial to have SIMD execution units. Making them narrower gets rid of the 'padding', but the average performance/Watt goes down because the power consumption doesn't grow linearly with the execution width, due to fixed costs.
Quote:
That is exactly what was done for Sandy Bridge: SIMD width doubled, GPR width unchanged. Now that a PRF is a given, it looks rather doable to double the SIMD width again.
No, Sandy Bridge cleverly borrows ALUs from another execution stack for floating-point operations only. It doesn't double the bandwidth either but makes a compromise with the symmetric load ports. So in many ways it's just a stepping stone toward Haswell, which will offer 256-bit execution for all SIMD instructions, bring us gather support to facilitate parallelizing generic code, and is next to certain going to increase cache bandwidth.

In other words, saying that Sandy Bridge doubled SIMD throughput is wrong. It's only 50% there, with Haswell completing the task and justifying it with gather support. I still don't see any justification for going to 512-bit any time soon.
Old 01-27-2012, 05:32 PM   #73
bronxzv
Senior Member
 
Join Date: Jun 2011
Posts: 406
Default

Quote:
Originally Posted by CPUarchitect View Post
Fortunately that's not a lot for the 16 nm process
That's 14 nm, and it's certainly not a lot of chip area, even if we get 512-bit-wide registers, which I expect as the next natural step

Quote:
Originally Posted by CPUarchitect View Post
So I'm hoping AMD is thinking of offering AVX-1024 support sooner rather than later. Of course their
Don't forget that AVX-1024 isn't disclosed yet, and remember the FMA4 debacle

Quote:
Originally Posted by CPUarchitect View Post
process disadvantage might make it harder to add sufficient registers.
You seem to think the register file takes a lot of chip area; are you sure? I'll be interested to see what a CPU designer has to say about it.
See for example the small area taken by 128 x 128-bit registers in the good old 130 nm Northwood:
http://www.chip-architect.org/news/N..._1600x1200.jpg
It's certainly not a stretch to imagine 144 x 512-bit registers at 14 nm, is it?

Quote:
Originally Posted by CPUarchitect View Post
Anyway, if I had to place a bet I would say that getting AVX-1024/512 in one go is more likely than getting just AVX-512 first.
I bet on AVX-512 first, at full width; let's see who will be right in a few years

Quote:
Originally Posted by CPUarchitect View Post
and is next to certain going to increase cache bandwidth.
sure, I expect Ivy Bridge to increase cache bandwidth already (at least DL1 to L2 bandwidth), providing nice speedups to AVX-256 code

Last edited by bronxzv; 01-27-2012 at 06:34 PM.
Old 01-27-2012, 07:34 PM   #74
IntelUser2000
Elite Member
 
IntelUser2000's Avatar
 
Join Date: Oct 2003
Posts: 3,494
Default

Quote:
Originally Posted by CPUarchitect View Post
According to the first post there will be 16 core parts. How many cores will be active remains to be seen of course. Likewise Knights Corner is said to have "more than 50 cores" which likely means 64 and several disabled depending on yield.
I think we're giving too much attention to a resume. About 2 years or so ago, there was a slide that said 12 cores max on Ivy Bridge EX, and 16 for Haswell. I assume targets change as the product gets closer to reality.

Quote:
Yes but notice that Haswell will double the computing density with FMA while Knights can't pull that trick any more.
How do you know that? Stampede, the 2013 supercomputer that uses Knights Corner, has 10 PFlops, and 8 of them come from KC. The successor to Knights Corner is said to increase Stampede's output to 15 PFlops or more, suggesting the co-processor part will double.

Assuming Haswell EX is a 16 core part @ 3GHz, it'll have a theoretical peak of 768GFlops. 2x Knights Corner using my estimates will result in 2.4 TFlops. That's still a big gap.

Quote:
Granted, but DDR4 is on its way as well and CPUs require less RAM bandwidth due to massive caches.
Yes, comparing DDRx to GDDR5 and its derivatives, sure. Four channels of DDR4-3200 would give 100 GB/s, but the GDDR5 in the Radeon HD 6970 is already at 175 GB/s. Also, in practice, 4 DDR4 channels will stay closer to 50 GB/s than 100 GB/s because they only reach their maximum frequencies at the end of the standard's lifetime. We're also talking about the EX, which uses very conservative speeds. But let's say it uses DDR4-2133 for 68 GB/s.

Knights Ferry had a cache size that's way bigger than regular GPUs too. Of course not as big as the EX.

Most importantly, it's a co-processor that's not designed to replace Xeons entirely. And you seem right that it doesn't make sense. But who knows what Intel's real goal is?
__________________
Core i7 2600K + Turbo Boost | Intel DH67BL/GMA HD 3000 IGP | Corsair XMS3 2x2GB DDR3-1600 @ 1333 9-9-9-24 |
Intel X25-M G1 80GB + Seagate 160GB 7200RPM | OCZ Modstream 450W | Samsung Syncmaster 931c | Windows 7 Home Premium 64-bit | Microsoft Sidewinder Mouse | Viliv S5-Atom Z520 WinXP UMPC
Old 01-27-2012, 08:09 PM   #75
IntelUser2000
Elite Member
 
IntelUser2000's Avatar
 
Join Date: Oct 2003
Posts: 3,494
Default

The real reason Haswell's GPU should be based on Gen X is that the driver support is already in place thanks to its predecessors. A complete revamp is bad news.

Quote:
Do you think all Haswell parts will have an IGP? And do you happen to know how many GFLOPS 20 Gen X EU's amount to?
Sandy Bridge achieves 8GFlops/EU @ 1GHz, and Ivy Bridge is said to double that with enhanced co-issue. But even if it doubled again, would it really matter for compute? Would they bother to stick something like DP units there? You don't see high DP performance even on enthusiast cards, only on workstation variants. Why would they do it differently on integrated?

And how much room would there be to stick it in the non-PC parts like EPs and EXs? Are there enough gains to be worth the trouble?
__________________
Core i7 2600K + Turbo Boost | Intel DH67BL/GMA HD 3000 IGP | Corsair XMS3 2x2GB DDR3-1600 @ 1333 9-9-9-24 |
Intel X25-M G1 80GB + Seagate 160GB 7200RPM | OCZ Modstream 450W | Samsung Syncmaster 931c | Windows 7 Home Premium 64-bit | Microsoft Sidewinder Mouse | Viliv S5-Atom Z520 WinXP UMPC

Last edited by IntelUser2000; 01-27-2012 at 08:11 PM.