Core 2 Duo Pipeline

lambchops511

Senior member
Apr 12, 2005
659
0
0
Can anyone point me to where I can find specifics about the Core 2 Duo pipeline architecture?

I'm interested in the number of ALUs it has, how they are organized, cache latency, IPC, etc.

I remember seeing some marketing docs on Anandtech or Xbitlabs before, but I can't find them anymore.
 

lambchops511

Thanks, but a lot of the info on those sites is just copied from the Intel marketing kit (which I already found on their site).

However, I did find some stuff that wasn't on the Intel site:


>> http://www.behardware.com/arti...l-core-2-duo-test.html

Can anyone explain to me why the bswap instruction is so much slower on the Core 2 Duo?

I am using this instruction a lot, and I am wondering if I should find alternatives or switch processors.
 

NXIL

Senior member
Apr 14, 2005
774
0
0
http://www.xbitlabs.com/articl...ore2duo-preview_4.html

http://forum.doom9.org/archive/index.php/t-54160.html

http://www.ercb.com/ddj/1995/ddj.9501.html

Incidentally, my claim that BSWAP ECX is a three-cycle instruction may surprise some people. Doesn't Intel say that BSWAP takes one cycle? Yes, but Intel also says to add one cycle for prefixes (there are some exceptions, but that's the general rule, and it's certainly applicable in these examples). Aha, everyone thinks, we're operating on a 32-bit operand; therefore, there's a "32-bit operand" prefix (DB 66h), so BSWAP takes 1+1=2 cycles. That's right, but not everyone realizes that BSWAP, like many other instructions, always has another prefix (DB 0Fh), so BSWAP takes 1+1+1=3 cycles. Intel's documentation has a little note that 0Fh is a prefix, but nowhere have I seen them spell out the horrible implication: Most of the time on 386/486s, all instructions whose first machine opcode byte is 0Fh take one cycle longer than the manual says they will. This includes many of the instructions that appeared with the introduction of the 386, including BSR, BT, BTC, BTR, BTS, CMPXCHG, IMUL, Jxx, LFS, LGS, MOVSX, MOVZX, POP FS, POP GS, PUSH FS, PUSH GS, SHLD, SHRD, SETxx, XADD, and various protected-mode instructions--and BSWAP. I'll suggest that BSWAP is not "more useful than you think."


Example 5: This code is faster than using BSWAP.

        mov  cx,[InitialValue]
        and  ecx,0000ffffh    ;Clear the upper part of ECX to 0
        or   ecx,00630000h    ;Put 63h directly in the upper part of ECX
looptop:...
        add  bx,cx            ;Skip BX ahead
        inc  cx               ;Set next skip value
        sub  ecx,00010000h    ;Count down loop
        jnc  looptop          ;The loop will repeat 64h times.
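
To see why that loop repeats 64h times, here is a rough C model of the register arithmetic. It is only a sketch: `count_iterations` and its parameters are names made up for illustration, BX is assumed to start at 0, and the carry flag is modeled by checking for a borrow before the subtraction.

```c
#include <stdint.h>

/* Hypothetical helper: models ECX as a 32-bit value whose low half (CX)
   is the skip value and whose high half is the loop counter (63h). */
unsigned count_iterations(uint16_t initial_value, uint16_t *bx_out)
{
    /* mov cx,[InitialValue] / and ecx,0000ffffh / or ecx,00630000h */
    uint32_t ecx = ((uint32_t)0x0063 << 16) | initial_value;
    uint16_t bx = 0;            /* assumption: BX starts at 0 */
    unsigned iterations = 0;
    int borrow;

    do {
        uint16_t cx = (uint16_t)ecx;
        bx = (uint16_t)(bx + cx);            /* add bx,cx */
        cx = (uint16_t)(cx + 1);             /* inc cx: wraps inside 16 bits */
        ecx = (ecx & 0xFFFF0000u) | cx;      /* upper half is left untouched */
        borrow = ecx < 0x00010000u;          /* carry that SUB would set */
        ecx -= 0x00010000u;                  /* sub ecx,00010000h */
        iterations++;
    } while (!borrow);                       /* jnc looptop */

    *bx_out = bx;
    return iterations;
}
```

The point of the trick is that INC CX only touches the low 16 bits, so the counter in the upper half counts down from 63h independently and the borrow happens on the 64h-th pass.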


http://homes.esat.kuleuven.be/~cosicart/pdf/AB-9701.pdf

Table 3 is the updated version of [BGV96, Table 4]. All implementations now only use 1-cycle instructions, except for SHA-1 that uses the bswap instruction taking an additional cycle to decode due to the 0Fx-prefix. A value for the cycles per instruction (CPI) of close to 0.5 is therefore an indication of the high percentage of simple paired instructions in the code. Table 1 gives a better idea of the resulting improvement.


http://webster.cs.ucr.edu/AoA/...BitManipulationa2.html

http://flint.cs.yale.edu/cs421...rt-of-asm/pdf/CH06.PDF

http://www.phatcode.net/res/22...13/13-02.html#Heading4

 

NXIL

Search inside books for bswap:

http://www.amazon.com/gp/reade...19-2828045#reader-link

http://www.amazon.com/gp/reade...19-2828045#reader-link

http://www.amazon.com/gp/reade...19-2828045#reader-link

p 243:

http://www.amazon.com/gp/reade...t/002-4063819-2828045#


Since you are down at the register level doing assembly-language work, maybe it's not the pipeline length, etc., that is the issue.

Depending on what you are using bswap for, can you step back from the problem and look at it from a different direction? A programming or software-engineering forum might also have some insight.
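
For instance, if the bswap is only there for endianness conversion, one option is a plain shift-and-mask swap that you can time against the instruction on your own workload. A minimal sketch in C (`swap32` is a made-up name; whether this beats BSWAP on a given CPU is something you would have to measure, and a compiler may well turn it back into a bswap anyway):

```c
#include <stdint.h>

/* Hypothetical helper: reverses the byte order of a 32-bit value
   using only shifts, ANDs and ORs (no BSWAP needed in the source). */
uint32_t swap32(uint32_t x)
{
    return (x >> 24)                    /* byte 3 -> byte 0 */
         | ((x >> 8) & 0x0000FF00u)     /* byte 2 -> byte 1 */
         | ((x << 8) & 0x00FF0000u)     /* byte 1 -> byte 2 */
         | (x << 24);                   /* byte 0 -> byte 3 */
}
```

Applying it twice gives back the original value, which makes for an easy sanity check before benchmarking.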

HTH, GL