Thoughts on "8 Core" Bulldozer and "4 Core Sandy Bridge"


bronxzv

Senior member
Jun 13, 2011
So what's this prove? Nothing.

It was just to show you the kind of code actual compilers spit out, and that there isn't anything magical or secret about VEX; many people use it daily to produce software.

Because he understands that the REX instruction code is encoded into the VEX prefix.

VEX makes REX redundant, since all the features of REX are provided by VEX, and more; in fact it's not only useless but forbidden to use a REX prefix immediately before VEX.

Intel has to pad the YMM values. Padding, as he refers to it, is clearing the upper values to 0.
Intel, AMD, and any provider of an x86 CPU with AVX support must clear the 128 MSBs with VEX.128 instructions, but obviously not for the VEX.256 variants, as in the example I just provided. It's easy to spot whether it's 128 or 256: with VEX.128 you'll see "xmm" operands in the ASM dump, and with VEX.256 they are called "ymm".
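That "easy to spot" rule can be sketched as a trivial classifier over a disassembly line. The helper and sample lines below are my illustration, not output from any particular disassembler:

```python
# Hypothetical helper: classify a disassembly line as VEX.128 or VEX.256
# by its operand register names, as described above.

def vex_width(asm_line):
    """Return 128 for xmm operands, 256 for ymm, None otherwise."""
    line = asm_line.lower()
    if "ymm" in line:
        return 256   # VEX.256: operates on the full 256-bit registers
    if "xmm" in line:
        return 128   # VEX.128: the upper 128 bits are zeroed on write
    return None      # not an AVX SIMD instruction

print(vex_width("vmulps ymm2, ymm0, ymm1"))  # 256
print(vex_width("vaddps xmm3, xmm1, xmm2"))  # 128
```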


But if you read the Mitosis PDF you can clearly see why the upper YMM register is cleared.

No need to re-read this good Mitosis PDF since it's completely irrelevant to the execution of VEX.128 instructions
 

Cerb

Elite Member
Aug 26, 2000
Nowhere, anywhere other than in forums, can you show AMD has the VEX prefix. The VEX prefix is Intel's Mitosis, plain and simply put. I have shown the proof. Now you guys in denial have to debunk that proof. Beware, I will eat ya alive.
It's a CISCy encoding scheme, using very large, complicated instructions to decrease the total number of instructions and the total instruction size, while improving efficiency by keeping most of the instructions fairly close in size and easy to get a read on without the hacky tricks used for much of normal x86. I don't get the link to Mitosis, or x86 equivalents (which are likely to bog the system down as much as help).

Prove that with your link. Copy and paste your proof. LOL at YOU. Not the whole link, just the part that says AMD can use the VEX prefix. AMD doesn't have the compilers to do Mitosis and never will have.
Intel's compiler has a very small share of some small markets, and that isn't likely to change. It often takes more than just changing the executable you call; aggressive optimizations often backfire, and if you want good performance across different CPUs, it may turn out to be an inferior choice of compiler. AMD only needs to offer the most minimal help to MS, GCC, and LLVM to get good-enough support. Their extensions have historically been simpler and easier to use than Intel's, so they don't really need their own compiler. Likewise, unless the major non-Intel compilers start supporting some feature set, Intel having it is only worth a yawn.

IF AMD tried to use the VEX prefix it would cause a #UD.
Why would this be? The flags that need to be checked do not appear to be Intel-only. Can you cite where there is an equivalent of a "GenuineIntel" check?

Barring a direct cite of something that will prevent AMD from having a CPU that supports it (not whether BD does or not, as the answer to that is of course mixed), I will now bow out.

I will, however, end on this:
AMD has XOP and they do not have encodings in XOP for the REX value
And say, very clearly: any instructions, or encoding schemes, that AMD comes up with are entirely irrelevant to the discussion of potential for AVX and AVX2 support with VEX. They are a red herring. It is like saying, were this the year 2000, that AMD would not be able to support SSE because 3DNow! isn't encoded the same way.

P.S.
Well Cerb, we will just have to wait for some SSE2 recompiles to be done, then benchmarked, to find the truth. Godspeed.
Another red herring. There could be a 10x difference in performance, and that would not matter one little bit as to whether or not AMD could bring a CPU to market in, say, 2015, that has support for all currently defined instructions encoded with VEX.
 

Nemesis 1

Lifer
Dec 30, 2006
As early as 2002, Intel announced that they preferred tri-gate over FinFET
http://www.intel.com/technology/silicon/tri-gate.htm

so there was little left to speculation already, 10 years before the first products were brought to market

BTW, this is a very bad example of names that change, since in this case they kept the original nomenclature, unlike for many other technologies

No, it was a good example, as IDC and I went through this before; we both know Intel has always called theirs tri-gate. Hell, we've been speculating for a long time about when it would show up; I figured at 32nm. But my glass is half full. If I had used an example YOU wouldn't complain about, I would have chosen a GPU in research, like the NV 480. You're nitpicking.
 

bronxzv

Senior member
Jun 13, 2011
As you can clearly see, JF implies Intel has to do this because they haven't figured out AMD's AVX instructions.

I don't think so; he is referring to the penalty from switching from SSE to AVX and back on Sandy Bridge if your code misses the necessary VZEROUPPER/VZEROALL instructions. I suppose Bulldozer will handle that better, and he uses this for some good guerrilla marketing.

It's a necessary action for Mitosis to work
why, can you elaborate on this?

If you read the AVX PDF, the length is a big deal. One of these two processors can do greater-than or less-than; the other processor can't, and that will cause a #UD. You tell me which is which: SB or BD?
Sorry, I can't, since I have no clue what the question is
 

Nemesis 1

Lifer
Dec 30, 2006
Well Cerb, we will just have to wait for some SSE2 recompiles to be done, then benchmarked, to find the truth. Godspeed.
 

Nemesis 1

Lifer
Dec 30, 2006
It was just to show you the kind of code actual compilers spit out and that there isn't anything magical or secret about VEX, many people use it daily for producing software

We're not talking about VEX; we're talking about the VEX prefix.

Read the dang AVX PDF: when Intel is talking about VEX they use the term VEX. When they are talking about the VEX prefix they write it as such; it's not some oversight on Intel's part.



VEX makes REX redundant since all the features of REX are provided by VEX + more, in fact it's not only useless but forbidden to use REX just before VEX or right after.

That's not so, and it's covered in the PDF. AMD, when using AVX, has to use all the REX code that was originally there, plus more. Intel encodes the VEX prefix and the REX code in bit form rather than byte form, and it's inside the VEX prefix, so it's not before or after, as already clearly shown. But you're right: IF AMD uses REX anywhere but inside the VEX prefix, you're correct in your statement and we get a #UD. Way less code is involved in the VEX prefix scheme. It's in the PDF. You are blatantly trolling, and that's a forum offense


Intel, AMD and any provider of x86 CPU with AVX support must clear the 128 MSBs with VEX.128 instructions, but obviously not for the VEX.256 variants as in the example I just provided. It's easy to spot if it's 128 or 256: with VEX.128 you'll see "xmm" operands in the ASM dump and with VEX.256 these are called "ymm"

Yep, 'tis true. The difference, again, is in the PDF: Intel's is auto-clear, AMD's is not. It's posted here in this thread, clearly posted, as it was a main point from the PDF




No need to re-read this good Mitosis PDF since it's completely irrelevant to the execution of VEX.128 instructions

Really? The VEX prefix can support up to a 5-operand syntax. It can take a single operand and make it 2 or more, AUTOMATICALLY.

The VEX prefix can make 128-bit instructions into 256-bit instructions, but not always. This is why a length (L) expressed in code on an Intel SB can be greater than or less than. AMD does not have the VEX prefix and can't do greater than or less than without switching the registers around and other work. I have asked JF-AMD that question several times (does AMD have the VEX prefix?) and he skirts it.
 

bronxzv

Senior member
Jun 13, 2011
Read the dang AVX PDF: when Intel is talking about VEX they use the term VEX. When they are talking about the VEX prefix they write it as such; it's not some oversight on Intel's part.
I'm lost here, sorry, try to rephrase?


That's not so and it's covered in the PDF

Have a look in the AVX Reference Guide (Ref. # 319433-011), page 4-2:

4.1.3 VEX and the REX prefix​
Any VEX-encoded instruction with a REX prefix preceding VEX will #UD.

[...]

the three byte VEX provides a compact replacement of REX and 3-byte opcode instructions (including AVX and FMA instructions)​
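To make the "compact replacement of REX" point concrete, here is a small sketch (mine, not code from the reference guide) that pulls the REX-equivalent fields out of the two payload bytes of a three-byte VEX prefix, following the field layout the guide describes. The example bytes 0xE2, 0x7D are the payload of a VEX.256.66.0F38 encoding:

```python
# Sketch: decode the two payload bytes that follow the C4 escape byte of
# a three-byte VEX prefix. REX's R/X/B/W live inside it, with R/X/B
# stored inverted.

def decode_vex3(b1, b2):
    return {
        "R": 1 - (b1 >> 7 & 1),      # stored inverted, like REX.R
        "X": 1 - (b1 >> 6 & 1),      # like REX.X
        "B": 1 - (b1 >> 5 & 1),      # like REX.B
        "mmmmm": b1 & 0x1F,          # opcode map (1=0F, 2=0F38, 3=0F3A)
        "W": b2 >> 7 & 1,            # like REX.W
        "vvvv": ~(b2 >> 3) & 0xF,    # extra source register, inverted
        "L": b2 >> 2 & 1,            # 0 = 128-bit (xmm), 1 = 256-bit (ymm)
        "pp": b2 & 0x3,              # implied SIMD prefix (1 = 66H)
    }

fields = decode_vex3(0xE2, 0x7D)
print(fields["L"], fields["pp"], fields["mmmmm"])  # 1 1 2 -> 256-bit, 66, 0F38
```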
 

podspi

Golden Member
Jan 11, 2011
Nemesis, are you envisioning this as something that is done in real-time, or during compilation?

The only way I could see this going down the way you claim it would is if it was done in real-time somehow.

I still maintain anything done in hardware should be fair game.
 

Nemesis 1

Lifer
Dec 30, 2006
I don't think so; he is referring to the penalty from switching from SSE to AVX and back on Sandy Bridge if your code misses the necessary VZEROUPPER/VZEROALL instructions. I suppose Bulldozer will handle that better, and he uses this for some good guerrilla marketing.

He is skirting the issue: the YMM is cleared automatically to zeros so that it can revert to the SSE code. That's the whole idea behind Mitosis: you read the top field and bottom field before you do a write. Gosh darn, it's plainly written and has been shown to you several times now

why, can you elaborate on this?


Sorry, I can't, since I have no clue what the question is

This is maybe true, that you don't have a clue, but I think not. Where was your first post ever here in these forums? You seem to know coding well, and you're using disinformation.
 

Nemesis 1

Lifer
Dec 30, 2006
Nemesis, are you envisioning this as something that is done in real-time, or during compilation?

The only way I could see this going down the way you claim it would is if it was done in real-time somehow.

I still maintain anything done in hardware should be fair game.

A JIT compiler, possibly a software layer.
 

Nemesis 1

Lifer
Dec 30, 2006
I'm lost here, sorry, try to rephrase?

SURE, just for you: IN the INTEL PDF, when Intel is referring to VEX they express it that way; when they are referring to the VEX PREFIX they refer to it in that manner. It's not an oversight on Intel's part, as it occurs thousands of times. Plain enough?




Have a look in the AVX Reference Guide (Ref. # 319433-011), page 4-2:

4.1.3 VEX and the REX prefix​
Any VEX-encoded instruction with a REX prefix preceding VEX will #UD.

[...]

the three byte VEX provides a compact replacement of REX and 3-byte opcode instructions (including AVX and FMA instructions)​


SURE, just for you: IN the INTEL PDF, when Intel is referring to VEX they express it that way; when they are referring to the VEX PREFIX they refer to it in that manner. It's not an oversight on Intel's part, as it occurs thousands of times. Plain enough? This PDF is about AVX and FMA, not AMD's FMA4 either. So if Intel uses both terms differently thousands of times throughout the PDF, I assume there is a difference.
 

Nemesis 1

Lifer
Dec 30, 2006
4.1.3 VEX and the REX prefix
Any VEX-encoded instruction with a REX prefix preceding VEX will #UD.


Come on, stop! MODS need to look at what you're doing here. The AVX PDF clearly states that the REX encoding resides inside the VEX prefix, so it can't follow it. VEX and the VEX prefix are not the same, AS clearly shown in the INTEL AVX PDF. Everything you're saying applies to AMD only, and that's why you're doing what you're doing. It applies to Intel only if they use the AVX instruction extension, not the VEX prefix. HERE, I will paste it one more time so the MODS can clearly see what you're up to.

1.3.3 VEX Prefix Instruction Encoding Support
Intel AVX introduces a new prefix, referred to as VEX, in the Intel 64 and IA-32
instruction encoding format. Instruction encoding using the VEX prefix provides the
following capabilities:
• Direct encoding of a register operand within VEX. This provides instruction syntax
support for non-destructive source operand.
• Efficient encoding of instruction syntax operating on 128-bit and 256-bit register
sets.
• Compaction of REX prefix functionality: The equivalent functionality of the REX
prefix is encoded within VEX.
• Compaction of SIMD prefix functionality and escape byte encoding: The functionality
of SIMD prefix (66H, F2H, F3H) on opcode is equivalent to an opcode
extension field to introduce new processing primitives. This functionality is
replaced by a more compact representation of opcode extension within the VEX
prefix. Similarly, the functionality of the escape opcode byte (0FH) and two-byte
escape (0F38H, 0F3AH) are also compacted within the VEX prefix encoding.
• Most VEX-encoded SIMD numeric and data processing instruction semantics with memory operands have more relaxed memory alignment requirements than instructions
encoded using SIMD prefixes (see Section 2.5).
VEX prefix encoding applies to SIMD instructions operating on YMM registers, XMM
registers, and in some cases with a general-purpose register as one of the operands.
VEX prefix is not supported for instructions operating on MMX or x87 registers.
Details of VEX prefix and instruction encoding are discussed in Chapter 4.




Intel AVX introduces a new prefix, referred to as VEX, in the Intel 64 and IA-32
What does the above say? Do you see AMD written anywhere above?

This new instruction set is called AVX, not VEX. As far as I know AMD doesn't have VEX; EVERYTHING in that PDF refers to INTEL. When they talk about AMD they say AMD, and they make reference to how AMD has to use the REX prefix when using the AVX instruction set. As far as I know AMD has XOP. AMD should never use VEX when referring to the AVX instruction set; XOP and VEX are not the same. Ya, I caught myself making the same error you did in that respect. Actually they don't ever mention AMD, but they do reference XOP
 

Tuna-Fish

Golden Member
Mar 4, 2011
Intel AVX introduces a new prefix, referred to as VEX, in the Intel 64 and IA-32
What does the above say? Do you see AMD written anywhere above?
Well, it's an Intel PDF, what do you expect?

This new instruction set is called AVX, not VEX. As far as I know AMD doesn't have VEX; EVERYTHING in that PDF refers to INTEL.
The instruction set extension is AVX. VEX is the prefix used for encoding all AVX instructions. The PDF doesn't mention AMD because Intel never does.

When they talk about AMD they say AMD, and they make reference to how AMD has to use the REX prefix when using the AVX instruction set. As far as I know AMD has XOP. AMD should never use VEX when referring to the AVX instruction set; XOP and VEX are not the same.

No. AMD has to use XOP when they define their own new instructions. AMD absolutely can, and eventually will, decode instructions that were defined by Intel and have the VEX prefix, just like they decode SSE instructions and all other extensions to x86 made by Intel.

The VEX prefix is simply a new encoding method for some x86 instructions. It is not anything else. It's not some shoo-in for VLIW, it's not some harbinger of specific JIT methods. It's just two or three bytes that will precede the opcodes of some instructions. Nothing more.
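A tiny sketch of that "two or three bytes" point: in 64-bit mode, the first byte of a VEX prefix is either C5h (two-byte form) or C4h (three-byte form). This is my illustration, not code from the thread:

```python
# Sketch: the VEX prefix length is determined by its escape byte
# (valid in 64-bit mode; in 32-bit mode C4/C5 can also be LDS/LES).

def vex_prefix_length(first_byte):
    if first_byte == 0xC5:
        return 2   # two-byte form: C5 + one payload byte
    if first_byte == 0xC4:
        return 3   # three-byte form: C4 + two payload bytes
    return 0       # not a VEX prefix

# e.g. vaddps ymm2, ymm0, ymm1 encodes as c5 fc 58 d1: a 2-byte VEX prefix
print(vex_prefix_length(0xC5))  # 2
```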
 

JFAMD

Senior member
May 16, 2009
POST 311 .

Originally Posted by JFAMD
They will do that because Sandy Bridge has an issue with handling mixed SSE and AVX instructions. They need to clear out their pipeline between switching instructions, and this takes clock cycles. They recommended at IDF that companies convert all SSE instructions to AVX-128 to avoid performance penalties.





http://news.softpedia.com/news/Intel...n-187568.shtml

Well, if you say so. But this is more likely the case.


As you can clearly see, JF implies Intel has to do this because they haven't figured out AMD's AVX instructions, so they don't have the same functionality as AMD's real-deal AVX. This is dishonest, in a manner. INTEL invented AVX for Intel CPUs; the very fact that JF-AMD says Intel has to do something different than AMD should tell you something. JF is referring to clearing the YMM to all zeros, then he implies this takes more clock cycles. If you read the AVX PDF you will see this is not a fact. It's a necessary action for Mitosis to work

If you read the AVX PDF, the length is a big deal. One of these two processors can do greater-than or less-than; the other processor can't, and that will cause a #UD. You tell me which is which: SB or BD?

I am not implying that at all, do not put words in my mouth.

Intel provided this info at IDF 2010. Session ARCS004 by Pallavi Mehrotra

In the presentation, on slide #8, he explained how when running AVX-128 the top register bits (128-255) are all padded with zeroes. This means that a 128-bit AVX instruction consumes the whole 256-bit pipe. Our Flex FP allows 2 128-bit instructions to run in a 256-bit pipe.

On slide 28 it calls out "SSE instruction followed by an AVX256 instruction, dozens of cycles penalty is expected"

On slide 42 the first bullet is "Performance penalty for each transition between Intel AVX and Legacy Intel SSE"

On slide 43 it says "avoid Intel AVX/SSE Transitions" in a big yellow box at the bottom, followed by "Re-compile all codes with /QxAVX flag"


So, INTEL is implying this, not me.
 

jones377

Senior member
May 2, 2004
I am not implying that at all, do not put words in my mouth.

Intel provided this info at IDF 2010. Session ARCS004 by Pallavi Mehrotra

In the presentation, on slide #8, he explained how when running AVX-128 the top register bits (128-255) are all padded with zeroes. This means that a 128-bit AVX instruction consumes the whole 256-bit pipe. Our Flex FP allows 2 128-bit instructions to run in a 256-bit pipe.

On slide 28 it calls out "SSE instruction followed by an AVX256 instruction, dozens of cycles penalty is expected"

On slide 42 the first bullet is "Performance penalty for each transition between Intel AVX and Legacy Intel SSE"

On slide 43 it says "avoid Intel AVX/SSE Transitions" in a big yellow box at the bottom, followed by "Re-compile all codes with /QxAVX flag"


So, INTEL is implying this, not me.

Wait a second, are you saying that Bulldozer can do 4×128-bit *FP* SSE(x) per cycle by issuing 2 128-bit instructions into each FMA pipe? This was speculated early on by Dresdenboy while he was perusing AMD patents, trying to get a handle on Bulldozer before any real information was released. I know the BD FPU has 4 ports, but 2 of those are for SIMD integer instructions (you strangely call them MMX in your slides).

Intel has been running SSE integer instructions in their ALUs since Conroe. This is why Core 2 Duo was almost 3x faster than K8 in the Sandra SSE integer synthetic benchmark. In Sandy Bridge, the FPU pipes are still 128-bit, so to get 2×256-bit AVX they are using 1 ALU port (these have been reworked to also handle the upper 128 bits of AVX FP instructions) and 1 FPU port for each AVX instruction. That would make it impossible to schedule any SSE instruction at the same time as an AVX instruction (like you said, it's just common sense).

BTW, what code would be recompiled to use half AVX and half SSE for any sections that would use both at the same time? Even if both are present in a binary, that doesn't mean the CPU would execute them concurrently, even if it is capable of that. ILP and all that....
 

bronxzv

Senior member
Jun 13, 2011
This is maybe true, that you don't have a clue

I was referring to this question of yours :

"
If you read the AVX PDF, the length is a big deal. One of these two processors can do greater-than or less-than; the other processor can't, and that will cause a #UD. You tell me which is which: SB or BD?
"

http://forums.anandtech.com/showpost.php?p=31858535&postcount=354

I can't tell you if it's "SB" or "BD" since I don't understand this question, even less after rereading it

Now I have to confess it's starting to bore me to speak with you, so I'll stop for a while; please wake me up when your wife tells you where she has stored that damned CD with your missing post
 

bronxzv

Senior member
Jun 13, 2011
That would make it impossible to schedule any SSE instruction at the same time as an AVX instruction (like you said, it's just common sense).

For well-known performance reasons I won't advise mixing SSE and AVX instructions; it is way slower than simply keeping the legacy SSE code. You can, though, issue for example one VEX.128 VMULPS at the same time as a VEX.256 VADDPS, since they are issued to two distinct ports (*1)

Maximum throughput on SNB is reached when you issue one VEX.256 VMULPS + one VEX.256 VADDPS per clock, with well-balanced adds and muls in your critical loops; that's more than what a Bulldozer module (*2) is able to do (unless you're using FMA4)

*1: http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=6
*2: http://www.realworldtech.com/page.cfm?ArticleID=RWT082610181333&p=7
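To put numbers on that peak-throughput claim: 8 single-precision lanes per YMM register, one add plus one mul retired per clock. The 3.4 GHz figure below is an assumed example frequency for a Sandy Bridge core, not a number from the post:

```python
# Back-of-envelope peak FLOPS for one SNB core issuing one VEX.256 VADDPS
# plus one VEX.256 VMULPS per clock. Clock frequency is an assumption.

floats_per_ymm = 256 // 32            # 8 single-precision lanes per register
flops_per_clock = floats_per_ymm * 2  # one add + one mul retired per clock
clock_ghz = 3.4                       # assumed core frequency, for illustration

peak_gflops = flops_per_clock * clock_ghz
print(peak_gflops)  # 54.4 GFLOPS per core, single precision
```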
 

Riek

Senior member
Dec 16, 2008
For well-known performance reasons I won't advise mixing SSE and AVX instructions; it is way slower than simply keeping the legacy SSE code. You can, though, issue for example one VEX.128 VMULPS at the same time as a VEX.256 VADDPS, since they are issued to two distinct ports (*1)

Maximum throughput on SNB is reached when you issue one VEX.256 VMULPS + one VEX.256 VADDPS per clock, with well-balanced adds and muls in your critical loops; that's more than what a Bulldozer module (*2) is able to do (unless you're using FMA4)

*1: http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=6
*2: http://www.realworldtech.com/page.cfm?ArticleID=RWT082610181333&p=7

But can SB sustain those AVX256 ADDs and MULs at the same time (if you can create such a scenario without them being dependent on each other)?

And wouldn't HT pose a potential problem if mixing AVX and SSE instructions is a performance issue?
 

jones377

Senior member
May 2, 2004
For well-known performance reasons I won't advise mixing SSE and AVX instructions; it is way slower than simply keeping the legacy SSE code. You can, though, issue for example one VEX.128 VMULPS at the same time as a VEX.256 VADDPS, since they are issued to two distinct ports (*1)

Maximum throughput on SNB is reached when you issue one VEX.256 VMULPS + one VEX.256 VADDPS per clock, with well-balanced adds and muls in your critical loops; that's more than what a Bulldozer module (*2) is able to do (unless you're using FMA4)

*1: http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=6
*2: http://www.realworldtech.com/page.cfm?ArticleID=RWT082610181333&p=7

Yes, I am aware of that. Trust me, I am not asking an AMD rep about the capabilities of an Intel product.

Edit: If BD is capable of 1 256-bit AVX op plus 2×128-bit SSE INT ops in a single cycle, then it should also be capable of 2×128-bit AVX INT (+1×256-bit AVX) at the same time too, because currently AVX integer ops are still only 128-bit. This will only be extended to 256 bits in the recently announced AVX2.
 

bronxzv

Senior member
Jun 13, 2011
but can SB sustain those ADD and MUL AVX256 at the same time?

Sure it can; the latency is higher for AVX-256 than for AVX-128 but the throughput (CPI) is the same, i.e. the theoretical peak FLOPS doubles with AVX-256

The main limiters in practice are the L1D$ bandwidth (32B/clock with SSE or AVX-128, 48B/clock with AVX-256) and the L2$ bandwidth, which is lackluster on SNB; I really hope it will be enhanced in Ivy Bridge
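A rough sketch of the arithmetic behind that limiter, using the bandwidth figures quoted above: feeding two fresh 256-bit source operands every clock would need more load bandwidth than the quoted L1D figure provides, which is why kernels must reuse data in registers to approach peak FLOPS:

```python
# Bandwidth demand vs. supply for one 256-bit op with two fresh memory
# operands per clock. The 48 B/clock and 32 B/clock figures are the ones
# quoted in the post above.

ymm_bytes = 256 // 8        # 32 bytes per 256-bit operand
demand = 2 * ymm_bytes      # two fresh source operands per clock = 64 B
l1d_avx256 = 48             # B/clock quoted for AVX-256
l1d_sse = 32                # B/clock quoted for SSE / AVX-128

print(demand > l1d_avx256)  # True: loads can't keep up with the FP pipes
```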


if you can create such a scenario without them being dependent on each other

Such scenarios are pretty common; see here for a real-world example:
http://forums.anandtech.com/showpost.php?p=31858073&postcount=345
High-performance code will typically run two threads with such code on each core, further decreasing the dependencies between adds and muls

And wouldn't HT pose a potential problem if mixing AVX and SSE instructions is a performance issue?

Yes, good point. It will certainly lead to a performance issue if you run two binaries on the same core where one is all SSE and the other one all AVX
 

Riek

Senior member
Dec 16, 2008
Sure it can; the latency is higher for AVX-256 than for AVX-128 but the throughput (CPI) is the same, i.e. the theoretical peak FLOPS doubles with AVX-256

The main limiters in practice are the L1D$ bandwidth (32B/clock with SSE or AVX-128, 48B/clock with AVX-256) and the L2$ bandwidth, which is lackluster on SNB; I really hope it will be enhanced in Ivy Bridge
I was under the impression that SB isn't capable of sustaining that rate due to those limitations, e.g. keeping the 2×256-bit data loading for execution while writing out the calculated data.



Such scenarios are pretty common; see here for a real-world example:
http://forums.anandtech.com/showpost.php?p=31858073&postcount=345
High-performance code will typically run two threads with such code on each core, further decreasing the dependencies between adds and muls
Didn't know that. Not that familiar with assembler code though.
 

bronxzv

Senior member
Jun 13, 2011
I was under the impression that SB isn't capable of sustaining that rate due to those limitations, e.g. keeping the 2×256-bit data loading for execution while writing out the calculated data.

Indeed, it can't sustain it for realistic real-world use cases: it can load only one 256-bit value per clock with AVX-256, but it can load two 128-bit values per clock with SSE / AVX-128, i.e. the load bandwidth is the same for 128-bit and 256-bit code

Only kernels with a significant ratio of in-register computation show a good SSE to AVX-256 speedup; more than a 1.5x speedup is reachable if your workload can be L1D cache blocked


Didn't know that. Not that familiar with assembler code though.

it's quite easy to understand

for example:

vaddps ymm2, ymm0, ymm1 ; ymm2 = ymm0 + ymm1
adds the contents (8 × 32-bit float values) of two registers (ymm0 and ymm1) and stores the result in a third register (ymm2)

vmulps ymm2, ymm0, ymm1 ; ymm2 = ymm0 * ymm1
multiplies the contents (8 × 32-bit float values) of two registers (ymm0 and ymm1) and stores the result in a third register (ymm2)
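For readers without an assembly background, the two instructions above can be modeled in a few lines of plain Python; each applies its operation to all 8 lanes independently (a sketch of the semantics only, obviously not how the hardware works):

```python
# Plain-Python model of vaddps/vmulps on ymm registers: the same
# operation applied independently to each of the 8 float lanes.

def vaddps(a, b):
    return [x + y for x, y in zip(a, b)]  # lane-wise add, like ymm0 + ymm1

def vmulps(a, b):
    return [x * y for x, y in zip(a, b)]  # lane-wise multiply

ymm0 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ymm1 = [0.5] * 8
print(vaddps(ymm0, ymm1)[0])  # 1.5
print(vmulps(ymm0, ymm1)[0])  # 0.5
```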
 

Tuna-Fish

Golden Member
Mar 4, 2011
I was under the impression that SB isn't capable of sustaining that rate due to those limitations, e.g. keeping the 2×256-bit data loading for execution while writing out the calculated data.

This depends entirely on how much computing you are doing on each piece of data. If the algorithms you are using are complex enough, it's entirely feasible to run the FPU at peak throughput.

However, in real-world cases, you usually just want to load some data, do two or three calculations on it, and store it away. And in that case you will be completely limited by the memory subsystem.
 

JFAMD

Senior member
May 16, 2009
Wait a second, are you saying that Bulldozer can do 4×128-bit *FP* SSE(x) per cycle by issuing 2 128-bit instructions into each FMA pipe? This was speculated early on by Dresdenboy while he was perusing AMD patents, trying to get a handle on Bulldozer before any real information was released. I know the BD FPU has 4 ports, but 2 of those are for SIMD integer instructions (you strangely call them MMX in your slides).

Intel has been running SSE integer instructions in their ALUs since Conroe. This is why Core 2 Duo was almost 3x faster than K8 in the Sandra SSE integer synthetic benchmark. In Sandy Bridge, the FPU pipes are still 128-bit, so to get 2×256-bit AVX they are using 1 ALU port (these have been reworked to also handle the upper 128 bits of AVX FP instructions) and 1 FPU port for each AVX instruction. That would make it impossible to schedule any SSE instruction at the same time as an AVX instruction (like you said, it's just common sense).

BTW, what code would be recompiled to use half AVX and half SSE for any sections that would use both at the same time? Even if both are present in a binary, that doesn't mean the CPU would execute them concurrently, even if it is capable of that. ILP and all that....

No, I am not saying that.

What I am saying is that someone on this thread is getting completely wrapped around the axle on one aspect, and given the two choices (argue about it, or wait until benchmarks are out) I would choose plan B.