Originally Posted by JFAMD
They will do that because Sandy Bridge has an issue with handling mixed SSE and AVX instructions. They need to clear out their pipeline when switching between the two instruction types, and this takes clock cycles. They recommended at IDF that companies convert all SSE instructions to AVX-128 to avoid performance penalties.
Interesting, but I had already posted this here. I actually used Intel's words, though, and not mine:
256-bit VEX-encoded instruction and legacy 128-bit SIMD instructions has internal
state to manage the upper and lower halves of the YMM states. Functionally, VEX-encoded
SIMD instructions can be intermixed with legacy SSE instructions (non-VEX-encoded
SIMD instructions operating on XMM registers). However, there is a performance
impact with intermixing VEX-encoded SIMD instructions (AVX, FMA) and
Legacy SSE instructions that only operate on the XMM register state.
The general programming considerations to realize optimal performance are the
following:
• Minimize transition delays and partial register stalls with YMM registers accesses:
Intermixed 256-bit, 128-bit or scalar SIMD instructions that are encoded with
VEX prefixes have no transition delay due to internal state management.
Sequences of legacy SSE instructions (including SSE2, and subsequent
generations non-VEX-encoded SIMD extensions) that are not intermixed with
VEX-encoded SIMD instructions are not subject to transition delays.
• When an application must employ AVX and/or FMA, along with legacy SSE code,
it should minimize the number of transitions between VEX-encoded instructions
and legacy, non-VEX-encoded SSE code. Section 2.8.1 provides recommendation
for software to minimize the impact of transitions between VEX-encoded code
and legacy SSE code.
In addition to performance considerations, programmers should also be cognizant of
the implications of VEX-encoded AVX instructions with the expectations of system
software components that manage the processor state components enabled by
XCR0. For additional information see Section 4.1.9.1, “Vector Length Transition and
Programming Considerations”.
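To make the "minimize the number of transitions" advice concrete, here is a toy sketch in Python. It only counts how often an instruction stream switches between VEX-encoded and legacy SSE instructions; it does not model cycles or any real pipeline. The helper names (`is_vex`, `count_transitions`) and the rule that a leading "v" marks a VEX-encoded mnemonic are assumptions made for illustration.

```python
# Toy model: count switches between VEX-encoded and legacy SSE instructions.
# It counts transitions only; it says nothing about how many cycles each costs.

def is_vex(mnemonic: str) -> bool:
    # Assumption for this sketch: VEX-encoded AVX mnemonics start with "v"
    # (vaddps, vmulps, ...), legacy SSE mnemonics do not (addps, mulps, ...).
    return mnemonic.lower().startswith("v")

def count_transitions(mnemonics):
    """Return how many times the stream switches between VEX and legacy SSE."""
    transitions = 0
    previous = None
    for m in mnemonics:
        current = is_vex(m)
        if previous is not None and current != previous:
            transitions += 1
        previous = current
    return transitions

mixed = ["vaddps", "addps", "vmulps", "mulps"]       # alternating encodings
grouped = ["vaddps", "vmulps", "addps", "mulps"]     # same work, grouped by encoding
all_vex = ["vaddps", "vaddps", "vmulps", "vmulps"]   # SSE converted to VEX forms

print(count_transitions(mixed))    # 3
print(count_transitions(grouped))  # 1
print(count_transitions(all_vex))  # 0
```

Grouping the legacy code together cuts the switches from three to one, and converting everything to VEX-encoded forms (the AVX-128 suggestion quoted above) removes them entirely.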
I myself would like to see how AMD uses 256-bit operations together with SSE2 instructions that have a REX prefix. John, could you give a code example written for this case: a 256-bit operation on the YMM state, and then a 128-bit operation on the lower XMM state using SSE2 with a REX prefix? It seems to me that, according to the PDF, this creates an up in some cases. I am more interested in the code length, as XOP would have to include the bytes of the REX prefix. Intel has space reserved for more than 128 bits in the lower XMM state; if code goes over 128 bits it is a UP, the same as with the 256-bit YMM upper state, where Intel has space reserved for more than 256 bits. =L, then we have the memory SS and the pp. So I sure would like to see an example of the coding. Unless in the XMM state you are using legacy SSE2 instructions, in which case AMD would need Intel to stall to keep up.
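On the code-length question, a rough sketch of the byte counts for one instruction may help. The Python below hand-assembles the register-to-register form of ADDPS in its legacy encoding (with a REX prefix when XMM8 and up are used) and in its VEX encoding, where the register-extension bits, the L (vector length) bit and the pp field all live inside the VEX prefix, so no separate REX byte is emitted. The function names are hypothetical helpers written for this post, it covers only this one opcode with Intel's VEX prefix, and it says nothing about how XOP is encoded.

```python
# Sketch: encoded length of ADDPS (reg, reg) in legacy SSE vs. VEX form.

def vex_prefix(r, x, b, vvvv, l, pp, mmmmm=1, w=0):
    """Build a VEX prefix. r/x/b: register-extension bits (1 = register 8-15),
    vvvv: first source register, l: 0 = 128-bit, 1 = 256-bit, pp: SIMD prefix."""
    inv_vvvv = (~vvvv) & 0xF  # vvvv is stored inverted
    if x == 0 and b == 0 and w == 0 and mmmmm == 1:
        # Two-byte form (C5) is enough when only the R extension might be needed.
        return bytes([0xC5, ((r ^ 1) << 7) | (inv_vvvv << 3) | (l << 2) | pp])
    # Three-byte form (C4) also carries X, B, the opcode map and W.
    return bytes([0xC4,
                  ((r ^ 1) << 7) | ((x ^ 1) << 6) | ((b ^ 1) << 5) | mmmmm,
                  (w << 7) | (inv_vvvv << 3) | (l << 2) | pp])

def modrm_reg(reg, rm):
    # mod = 11 selects the register-direct form.
    return bytes([0xC0 | ((reg & 7) << 3) | (rm & 7)])

def legacy_addps(dst, src):
    # REX.R extends dst, REX.B extends src; omit REX if neither is needed.
    rex = 0x40 | ((dst >> 3) << 2) | (src >> 3)
    prefix = bytes([rex]) if rex != 0x40 else b""
    return prefix + bytes([0x0F, 0x58]) + modrm_reg(dst, src)

def vex_addps(dst, src1, src2, l):
    # pp = 0 because ADDPS has no mandatory 66/F3/F2 prefix.
    return (vex_prefix(dst >> 3, 0, src2 >> 3, src1, l, pp=0)
            + bytes([0x58]) + modrm_reg(dst, src2))

print(legacy_addps(0, 1).hex())     # 0f58c1     addps  xmm0, xmm1
print(legacy_addps(8, 9).hex())     # 450f58c1   addps  xmm8, xmm9 (REX)
print(vex_addps(0, 0, 1, 0).hex())  # c5f858c1   vaddps xmm0, xmm0, xmm1
print(vex_addps(0, 0, 1, 1).hex())  # c5fc58c1   vaddps ymm0, ymm0, ymm1
print(vex_addps(8, 8, 9, 0).hex())  # c4413858c1 vaddps xmm8, xmm8, xmm9
```

For this opcode the 128-bit and 256-bit VEX forms are the same length and differ only in the L bit, and the VEX form is one byte longer than the legacy form of the same register pair.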