P4 vs. P3 vs. Mustang vs. T-Bird


Soccerman

Elite Member
Oct 9, 1999
6,378
0
0
VladTrishkin, where did you get that 10-stage P3 (Coppermine) spec? I never saw that before!
 

Sunner

Elite Member
Oct 9, 1999
11,641
0
76
Its pipe depth is 12, just like all other CPUs that are based on the P6 core.
 

Sunner

Elite Member
Oct 9, 1999
11,641
0
76
I stand corrected, though as previously stated, all the P6-based cores have equally long pipelines.
 

VladTrishkin

Senior member
Sep 11, 2000
421
0
0
Everyone was shouting 12, not 10... The PIII Katmai has a 12-stage pipeline, and then Intel reduced it to 10 with the release of the Coppermine.
 

Sunner

Elite Member
Oct 9, 1999
11,641
0
76
Even on the pages you link to they refer to the "P6 Pipeline", not the Coppermine pipe, and I hardly think Intel would reduce the pipe for the Cumine, since that's quite a big redesign that would need to be done. It's not as simple as to "just reduce it".
 

VladTrishkin

Senior member
Sep 11, 2000
421
0
0
I am at work now, but maybe I'll try to dig up some Intel spec sheets later... Anyway, it's 10, used to be 12. I am not sure how they reduced it, but I know it gave the P6 a boost. If I remember correctly it was in the ALU, and it has something to do with the reduction of L2 cache.
 

Sunner

Elite Member
Oct 9, 1999
11,641
0
76
Got a reply at Ace's (what better place to ask than the tech BBS there :)).



<< Actually, it's both. The entire pipeline is about 12 stages long. However, that includes a 3 cycle retirement pipeline. The retirement pipeline is not in the critical path from branch mispredict to next execution, nor is it in the latency from operation to use of the result. As a result, the effective pipeline length is shorter, at around 9-10 pipe stages. This also corresponds to the way AMD draws their pipeline diagrams. AMD shows a 10 stage pipeline, but they never include a retirement pipe; they end the pipeline after execution and register file writeback. >>



So I guess we were both right sorta, assuming he knows what he's talking about, which they typically do over there :)
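
For what it's worth, the arithmetic in that reply fits in a few lines of Python. The stage counts (about 12 total, 3 of them retirement) come straight from the quote; the function name is made up for illustration, and this is only a sketch of the counting, not a model of the actual P6 pipeline.

```python
def effective_pipeline_depth(total_stages: int, retirement_stages: int) -> int:
    """Stages on the critical path once the retirement pipe is left out."""
    if not 0 <= retirement_stages <= total_stages:
        raise ValueError("retirement stages must be between 0 and total stages")
    return total_stages - retirement_stages


# Numbers from the Ace's reply: ~12 stages total, 3 of them retirement.
p6_effective = effective_pipeline_depth(12, 3)
print(p6_effective)
```

Depending on whether you count the retirement stages, the same core comes out as a "12 stage" or a "roughly 9-10 stage" design, which is why both figures were floating around.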
 

IBMer

Golden Member
Jul 7, 2000
1,137
0
76
I think SSE2 will be a big difference.

If you look at the GTS Ultra benchmarks that AnandTech originally did, at 1600x1200x32, he compared a P3 550E and a T-bird 1GHz.

With every card including the Ultra, the P3 was 1-2fps faster than that Athlon. This is because Q3 is SSE optimized.

Try running 3DMark2000 without SSE. You can do this by trying the Software T&L vs. the P3 Optimizations.

Anyway, SSE2 is more than just extra instructions. It can also do 128-bit calculations, but this really means four 32-bit calculations packed together in one operation.
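
To picture what "128-bit means four 32-bit" looks like, here is a toy Python sketch of one 128-bit value treated as four 32-bit float lanes. It does not execute any SSE or SSE2 instructions; `packed_add_4x32` is a hypothetical helper that only imitates the layout, with plain Python doing the lane-by-lane add.

```python
import struct


def packed_add_4x32(a: bytes, b: bytes) -> bytes:
    """Add two 128-bit values lane by lane as four 32-bit floats."""
    if len(a) != 16 or len(b) != 16:
        raise ValueError("each operand must be exactly 16 bytes (128 bits)")
    lanes_a = struct.unpack("<4f", a)  # four little-endian 32-bit floats
    lanes_b = struct.unpack("<4f", b)
    return struct.pack("<4f", *(x + y for x, y in zip(lanes_a, lanes_b)))


# Two "registers", each holding four 32-bit values in 16 bytes.
x = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
y = struct.pack("<4f", 0.5, 0.5, 0.5, 0.5)
result = struct.unpack("<4f", packed_add_4x32(x, y))
print(result)
```

The point is only that one wide operand carries several narrow values, so a single instruction can touch all of them at once.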

I think we will find out later whether it is good or not, but really nobody knows. I will bet that nobody here has even seen a Mustang or P4 to talk about what is faster and what is not. We will find out when they are available for sale.
 

VladTrishkin

Senior member
Sep 11, 2000
421
0
0
LOL Sunner, I saw your thread @ Ace's, didn't reply to make it fair :)

Anyway, there is a bit more to it. I remember reading (Intel spec sheets) that the Coppermine differs from the Katmai a bit when it comes to the pipeline, but that reply you got sounds pretty fair.
 

Sunner

Elite Member
Oct 9, 1999
11,641
0
76
I figured maybe you were a member(well not that it has members in the same sense as this BBS, but you know what I mean) at Ace's, most people who are somewhat knowledgeable tend to hang out there :)
 

VladTrishkin

Senior member
Sep 11, 2000
421
0
0
Ya I hang out @ Ace's BBS (along with BurntK.), but there is not too much activity, so I come here when I get bored. :p


BTW: That's a very nice article @ Ace's (P4 vs. Mustang), but a few technical errors. Here are a few quick ones:

link

"Yes, for today's games, which are the most demanding applications most desktop users run, sustainable memory bandwidth has become a serious bottleneck. No wonder, if you consider that the fastest x86 processor (Athlon 1200 MHz) today runs with a multiplier of 9x."

-It looks like the author has mixed up the PC-2100 DDR (266MHz) Tbird platform with current platforms (SDR, 100MHz). The current 1200MHz Tbird has a 12.0x clock multiplier (not 9.0)...


"The trace cache, the huge instruction buffers, and many other considerations have resulted in a Pentium 4 architecture with 20 stages after the trace cache and no less than 28 total. Notice that the branch check is at the 19th cycle, and therefore the branch misprediction penalty is no less than 19 cycles! If an instruction is not the in the Trace cache, the penalty could be even worse (context switches). Luckily, Intel has implemented an excellent branch predictor, which should be better than any existent branch predictor today to minimize the impact of branch misprediction. There is also a bypass between the decoder and the rename/allocate unit, which should lower the performance decrease caused by trace cache (L1- Instruction cache) misses."

-This is controversial, but it's an interesting topic. I have talked with a few reliable Intel sources, and they have leaned towards 24 final stages, not 28. It should be more than 20 any way you look at it. It looks like Intel will be doubling P4's "Rename" stages, and there will most likely be more than one "Dispatch" stage. The rest is speculation.
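
A rough Python sketch of why the stage count matters so much: the deeper the pipe, the more cycles each mispredicted branch throws away. The 19-cycle penalty is the figure from the quoted article. Everything else (the 10-cycle comparison point, the branch frequency, the mispredict rate, the base CPI of 1.0) is a made-up assumption for illustration, not a measurement of any real chip.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average cycles per instruction once mispredicted branches are charged."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload: 1 in 5 instructions is a branch, 5% get mispredicted.
BRANCH_FRACTION = 0.20
MISPREDICT_RATE = 0.05

# 10 cycles: assumed shorter pipe for comparison; 19 cycles: quoted P4 penalty.
cpi_short = effective_cpi(1.0, BRANCH_FRACTION, MISPREDICT_RATE, 10)
cpi_long = effective_cpi(1.0, BRANCH_FRACTION, MISPREDICT_RATE, 19)

print(f"10-cycle penalty: {cpi_short:.2f} CPI")
print(f"19-cycle penalty: {cpi_long:.2f} CPI")
```

With these invented numbers the longer penalty costs roughly 8% more cycles per instruction. Lowering `MISPREDICT_RATE` shrinks that gap, which is the job the article assigns to the improved branch predictor.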

My 0.02 cents.
 

Soccerman

Elite Member
Oct 9, 1999
6,378
0
0
VladTrishkin, contrary to popular belief (lol yeah right), I am not that knowledgeable. I've been corrected sooo much for the past little while, I don't even know why I try!

Then again, the more I get corrected, the more I learn!

In any case, I suspect they did SOMETHING to the pipelines. Whatever it was, it sure helped them in the MHz-to-MHz race. However, most tests that we use to compare sheer clock speed are tipped in Intel's favor (Quake 3 is a perfect example of that: it SAYS 3DNow! support, but it really doesn't matter much).
 

VladTrishkin

Senior member
Sep 11, 2000
421
0
0
Soccerman, yes, that's what I suspect as well... but it looks like current cB0's are good up to 1.1GHz or so; we need cC0's (and .13 micron later on) for 1.2+GHz...
 

jpprod

Platinum Member
Nov 18, 1999
2,373
0
0
VladTrishkin: Actually that's not a mixup; the writer refers to memory speed vs. CPU clock frequency, which on the KT133 Athlon platform is
1200MHz / 133MHz = 9
 

VladTrishkin

Senior member
Sep 11, 2000
421
0
0
jpprod, the default multiplier for a 1.2GHz Tbird is set to 12x. Yes, you can run it at 9x on a 133MHz bus for the same 1.2GHz, but this is not the default setting. This will be implemented in the near DDR-266 future.
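
The two readings of "multiplier" being argued over come down to core clock divided by bus (or memory) clock. A quick Python sketch, assuming the nominal "133MHz" is really 133.33MHz (400/3); the thread only says 133, and `clock_multiplier` is a hypothetical helper name:

```python
from fractions import Fraction


def clock_multiplier(core_mhz, bus_mhz):
    """Core clock divided by bus clock, kept exact with Fraction."""
    return Fraction(core_mhz) / Fraction(bus_mhz)


FSB_100 = Fraction(100)     # current default bus clock for the 1.2GHz Tbird
FSB_133 = Fraction(400, 3)  # assumed exact value of the nominal 133MHz clock

default_mult = clock_multiplier(1200, FSB_100)
ddr_mult = clock_multiplier(1200, FSB_133)
print(default_mult, ddr_mult)
```

Same 1200MHz core either way; whether you call it 12x or 9x depends entirely on which clock you divide by.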
 

Sunner

Elite Member
Oct 9, 1999
11,641
0
76
jp, Vlad's(sure we can:)) right about that, referring to the mem speed instead of bus speed would be sorta messy anyway, since some people run at 100 MHz mem(like me with my crappy noname PC100).
 

jpprod

Platinum Member
Nov 18, 1999
2,373
0
0
Sunner, Vlad: Read into the very essence of this quote


<< Yes, for today's games, which are the most demanding applications most desktop users run, sustainable memory bandwidth has become a serious bottleneck. No wonder, if you consider that the fastest x86 processor (Athlon 1200 MHz) today runs with a multiplier of 9x. >>


I'm sure Johan of Ace's Hardware doesn't think 1200MHz Athlon runs on a 133MHz FSB :)
 

VladTrishkin

Senior member
Sep 11, 2000
421
0
0


<< I'm sure Johan of Ace's Hardware doesn't think 1200MHz Athlon runs on a 133MHz FSB >>



-This is not for me to judge, I just pointed out an error (not to be confused with a mistake). True, in a few weeks or so we WILL see 1.2GHz Tbirds on the 133MHz FSB @ 9.0x