That is only the case for very short loops. Longer code, such as OpenCL kernels, typically has critical paths the out-of-order execution engine can't schedule around. Sandy Bridge has a 54-entry scheduler, and while that may seem sufficient, keep in mind that the average arithmetic instruction latency is about 4 cycles and you're really aiming for 3 instructions per cycle. Last but not least, cache misses severely limit the number of instructions that can be scheduled. Prefetching attempts to avoid them, but it comes at a cost in power consumption.
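To put rough numbers on that: by Little's law, instructions in flight = throughput x latency, so sustaining 3 instructions per cycle at 4 cycles of latency already takes 12 independent instructions in flight. A last-level cache miss costing on the order of 200 cycles would need hundreds of independent instructions to cover, which a 54-entry scheduler can't come close to providing.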
GPUs hardly suffer from this because they hide arithmetic latency by executing very wide vectors over multiple cycles. AVX-1024 would achieve the same thing for CPUs. The sustained throughput would be higher, while the power consumption would be lower.
The Intel compiler can do software pipelining (SWP) and unrolling to hide latencies. And I'm pretty sure LLVM-based compilers have similar capabilities.
That said, it's not very effective in practice because you easily run out of registers, and the additional spill/restore instructions negate what you gain from latency hiding. Adding more architectural registers to x86 is not feasible, but with AVX-1024 you wouldn't have to: each instruction would implicitly access four 256-bit registers.
Take for example this pseudo-code:
Code:
for(i = 0; i < 1000; i++)
{
    t = a[i] * b[i];    // Takes 4 cycles to complete
    c[i] = t + d[i];    // Has to wait on the multiplication
}
For simplicity's sake there are only two operations, but assume that in a real-world example there's lots more code inside the loop and the out-of-order execution engine can't overlap full iterations due to long dependency chains. To hide the latency of the dependency in our simplified example, let's unroll the loop four times:
Code:
for(i = 0; i < 1000; i += 4)
{
    t0 = a[i+0] * b[i+0];    // Four independent multiplications issue back to back
    t1 = a[i+1] * b[i+1];
    t2 = a[i+2] * b[i+2];
    t3 = a[i+3] * b[i+3];
    c[i+0] = t0 + d[i+0];    // By now t0 is ready, so no stall
    c[i+1] = t1 + d[i+1];
    c[i+2] = t2 + d[i+2];
    c[i+3] = t3 + d[i+3];
}
There's no more waiting required. Note however that more temporary variables are used. In a more complex real-world example, you'd easily exceed the available architectural register count. Thus there would be extra store and load instructions required for spilling registers, which make things slower again.
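To make the spilling concrete, here's a rough sketch of what the compiler effectively ends up generating once the live temporaries exceed the register count (purely illustrative; the local array stands in for stack slots, and in reality this happens at higher unroll factors or with more live values per iteration):
Code:
for(i = 0; i < 1000; i += 4)
{
    float spill[4];                  // stack slots standing in for registers
    spill[0] = a[i+0] * b[i+0];      // each result now costs an extra store...
    spill[1] = a[i+1] * b[i+1];
    spill[2] = a[i+2] * b[i+2];
    spill[3] = a[i+3] * b[i+3];
    c[i+0] = spill[0] + d[i+0];      // ...and an extra load to read it back
    c[i+1] = spill[1] + d[i+1];
    c[i+2] = spill[2] + d[i+2];
    c[i+3] = spill[3] + d[i+3];
}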
To not require more registers, we could instead use vector registers to do the same thing:
Code:
for(i = 0; i < 1000; i += 4)
{
    v = a[i+0..3] * b[i+0..3];    // Vector multiplication
    c[i+0..3] = v + d[i+0..3];    // Vector addition, has to wait
}
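For the curious, here's roughly what that vector loop looks like with real AVX intrinsics. This is just a sketch: the function name is made up, the arrays are assumed to hold single-precision floats, and n is assumed to be a multiple of 8 (note that a 256-bit AVX register actually holds 8 floats, not 4; the pseudo-code uses 4 lanes only to keep the illustration small):
Code:
#include <immintrin.h>

void mul_add_arrays(const float *a, const float *b,
                    const float *d, float *c, int n)
{
    for(int i = 0; i < n; i += 8)    // 8 floats per 256-bit register
    {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vd = _mm256_loadu_ps(d + i);
        __m256 v  = _mm256_mul_ps(va, vb);    // vector multiplication
        __m256 vc = _mm256_add_ps(v, vd);     // dependent vector addition
        _mm256_storeu_ps(c + i, vc);
    }
}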
Great, the number of registers has been reduced and we're doing four times more work in parallel. But now we're back to the issue of having to hide the latency! We could unroll again, but that would require more registers...
...unless the unrolling is done at the hardware level. Take this code:
Code:
for(i = 0; i < 1000; i += 16)
{
    v = a[i+0..15] * b[i+0..15];    // 16-wide multiply, executed 4 lanes per cycle
    c[i+0..15] = v + d[i+0..15];    // Addition starts as soon as the first 4 products are ready
}
The vectors are made four times wider again, but instead of performing 16 multiplications or additions in parallel, only 4 are performed each cycle. This means that by the time the first 4 multiplications have finished, the vector addition can start on the first 4 values, and every subsequent cycle the results are ready for the next 4 values. It's very similar to the unrolled case above, except that it doesn't explicitly use four separate instructions and four separate registers. It just uses one very wide register, which at the hardware level can of course be implemented as 4 individual 256-bit registers.
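To visualize the staggering, here's a simplified cycle-by-cycle view (assuming 256-bit execution units, i.e. 4 lanes of the pseudo-code per cycle, and the 4-cycle multiply latency from before):
Code:
Cycle 0: start mul of lanes  0..3
Cycle 1: start mul of lanes  4..7
Cycle 2: start mul of lanes  8..11
Cycle 3: start mul of lanes 12..15
Cycle 4: lanes  0..3  products ready -> start add of lanes  0..3
Cycle 5: lanes  4..7  products ready -> start add of lanes  4..7
Cycle 6: lanes  8..11 products ready -> start add of lanes  8..11
Cycle 7: lanes 12..15 products ready -> start add of lanes 12..15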
x86-64 only has 16 architectural vector registers visible to the software, but Sandy Bridge, for instance, has 144 renamed 256-bit physical registers. AVX-1024 would implicitly let the software access 64 256-bit registers (16 architectural x 4) for latency hiding, without requiring any additional spill code.
No, the front-end contains several queues, and as long as those queues are sufficiently full there's no risk of creating bubbles further down the pipeline. Of course the queues themselves should not be clock gated. An AVX-1024 instruction would take the same time to execute as four AVX-256 instructions, so the execution units are kept busy even if the front-end is regularly clock gated.
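To put a rough number on it: if an AVX-1024 instruction occupies a 256-bit execution unit for 4 cycles, the front-end only needs to deliver a quarter as many instructions for the same amount of work, so in the limit it could be clock gated roughly three out of every four cycles without starving the back-end.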