Originally posted by: Concillian
Originally posted by: darfur
How do AMD CPUs keep getting faster
while still running at only 2.4 GHz???
I believe the REAL question is:
How do Intel P4 processors do so little while running at 3.6 GHz???
In the days of the PIII vs. Athlon, Athlons were slightly faster than PIIIs per clock. Not by much, though; things were pretty close to AMD MHz = Intel MHz.
Then Intel put out the P4. It was clocked at insanely high speeds like 1700 and 1800 MHz. Intel lost the race to 1000 MHz but was not going to lose the race to 2000 MHz. Unfortunately these cores clocked at such high speeds did not perform anywhere near what a comparable PIII would have.
Traditionally a new "generation" of architecture has yielded more work done per MHz.
1 MHz on a 386 > 1 MHz on a 286
1 MHz on a 486 > 1 MHz on a 386
1 MHz on a Pentium > 1 MHz on a 486
1 MHz on a Pentium Pro (pretty much a PII) > 1 MHz on a Pentium
PII and PIII weren't really significant architecture changes.
And
1 MHz on a P4 WAY less than 1 MHz on a PIII
Because of this I feel you have to ask why Intel P4 processors do so little at such a high MHz rather than asking why AMD processors do so much at such a low MHz.
AMD is actually going the more traditional route, where an architecture change = more crunching power per MHz.
Given the history of the microprocessor in a PC, it should not be at all surprising that a 2.4 GHz A64 outpaces a 2.4 GHz AthlonXP, which outpaced a similar speed Athlon, which outpaced a similar speed K6, which outpaced a similar speed 5x86.
I believe the answer you're looking for is a mixed bag of reasons. One reason Intel gets higher MHz than AMD is that they sacrifice efficiency to reach the highest clock they can. Now they have kind of hit a wall, and the 4 GHz part has been cancelled, so they are going to try to focus on efficiency, as they have done with their Dothan core (the Pentium M, a Socket 479 laptop CPU).
The main effect on efficiency is how many stages a CPU has in its pipelines. Stages are the steps and holding places an instruction passes through inside the CPU. The more stages there are, the more work the processor has to do to keep them all full; any stage that is not full is a wasted slot in that clock cycle. More stages also mean a bigger performance hit on a branch misprediction, since the processor has to throw away all the work in its pipeline so far and refill it. With fewer stages a CPU doesn't take as much of a hit. Theoretically, when you increase the number of stages in a core the clock speed can be taken higher, but at the cost of efficiency.
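To put rough numbers on the misprediction part, here is a toy model in Python. It is only a sketch: the branch fraction and mispredict rate are made-up illustrative values, and it assumes a mispredicted branch costs about one full pipeline's worth of cycles. The function names are mine, not from any tool.

```python
# Toy model: average cycles per instruction (CPI) when every branch
# misprediction costs a full pipeline refill.
# All rates here are illustrative assumptions, not measurements.

def effective_cpi(stages, base_cpi=1.0, branch_fraction=0.2, mispredict_rate=0.05):
    """Assume a mispredicted branch throws away roughly `stages` cycles of work."""
    penalty_cycles = stages
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles

def relative_speed(stages, clock_ghz):
    """Instructions per nanosecond = clock (cycles per ns) / CPI."""
    return clock_ghz / effective_cpi(stages)

# Same clock, different pipeline depths: the deeper pipeline loses more
# cycles to each misprediction, so it gets less done per clock.
for stages in (12, 20, 30):
    cpi = effective_cpi(stages)
    speed = relative_speed(stages, 2.4)
    print(f"{stages} stages at 2.4 GHz: CPI={cpi:.2f}, {speed:.2f} instr/ns")
```

This only captures the misprediction cost. It ignores cache misses and the extra clock speed a deeper pipeline allows, so it shows the per-clock penalty and nothing more.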
Intel chose to take clock speed as high as they could, sacrificing per-clock performance, while AMD has stuck around 2.4 GHz on the same architecture and concentrated on as much efficiency as possible.
How many pipelines a CPU has, how many registers, the efficiency of its L1 and L2 caches, and the properties of the buses between them also play into efficiency.
Cache differences between the current architectures are interesting as well. A processor with a small number of stages (Athlon) likes low-latency cache and low-latency memory. On an Athlon system, Socket 754 for example, bandwidth does not affect performance nearly as much as single vs. dual channel does on a P4. For the Athlon, maintaining decent (not huge) bandwidth while keeping memory latency down to around 95 ns on an A64 setup is what it loves, and that's why it performs awesome despite its bandwidth being only medium.
The A64 L1 cache is another prime example. Its benched bandwidth is around 20 GB/sec while the P4's is around 35 GB/sec, but AMD's L1 latency is about half of Intel's (1-2 ns vs. 2-3 ns). The same pattern follows into the L2, where AMD benches around 10 GB/sec and Intel around 20 GB/sec, yet AMD's L2 latency is around 13-17 ns and Intel's is around 15-20 ns.
The P4, on the other hand, LOVES bandwidth. It cares about latency too, but not as much as it craves bandwidth. Longer pipelines (more stages) depend on bandwidth and can deal with bad latency.
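A quick way to see why latency matters for small accesses and bandwidth matters for big streaming ones is to add up the two parts of a transfer. This is just a sketch: the two memory setups below are hypothetical numbers picked for illustration, not benchmarks of any real Athlon or P4 system.

```python
def transfer_time_ns(size_bytes, latency_ns, bandwidth_gb_s):
    """Time to fetch a block: fixed latency plus size divided by bandwidth.

    1 GB/s is 1 byte per nanosecond, so size / bandwidth is already in ns.
    """
    return latency_ns + size_bytes / bandwidth_gb_s

# Hypothetical setups (illustrative values only).
low_latency = {"latency_ns": 50, "bandwidth_gb_s": 3.2}
high_bandwidth = {"latency_ns": 80, "bandwidth_gb_s": 6.4}

for label, size in [("64-byte cache line", 64), ("1 MiB stream", 1 << 20)]:
    t_lat = transfer_time_ns(size, **low_latency)
    t_bw = transfer_time_ns(size, **high_bandwidth)
    print(f"{label}: low-latency setup {t_lat:.0f} ns, high-bandwidth setup {t_bw:.0f} ns")
```

With these made-up numbers the low-latency setup wins on the single cache line and the high-bandwidth setup wins on the large stream, which is the trade-off described above.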
Athlon and P4 also use two different cache designs: one is exclusive and one inclusive. On the Athlon the L1 and L2 contain totally different data, while on the P4 whatever is in the L1 is also copied into the L2, so you will find a clone of the L1 data in the P4's L2. Not sure of the advantages and disadvantages of each, just pointing that out =).
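One thing that does follow straight from those definitions is how much distinct data the two levels can hold together. A minimal sketch, with cache sizes that are placeholder values for illustration and not the specs of any particular chip:

```python
def unique_capacity_kb(l1_kb, l2_kb, exclusive):
    """Distinct data the L1 + L2 pair can hold at once.

    Exclusive: L1 and L2 hold different lines, so the capacities add up.
    Inclusive: every L1 line is duplicated in L2, so L2 bounds the total.
    """
    return l1_kb + l2_kb if exclusive else l2_kb

# Placeholder sizes, chosen only to show the difference.
l1_kb, l2_kb = 128, 512
print("exclusive:", unique_capacity_kb(l1_kb, l2_kb, exclusive=True), "KB")
print("inclusive:", unique_capacity_kb(l1_kb, l2_kb, exclusive=False), "KB")
```

The flip side is that an exclusive design has to move lines between L1 and L2 when something is evicted, so the extra capacity is not free.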
Back in the day when the PIII 700 and Athlon 700 were around, Athlons still held the efficiency crown, but not by much, because Intel was nearly as efficient back then. Both CPU architectures had roughly the same specs as far as pipelines and the stages in them. Their cache architectures differed a bit: AMD used a narrower bit width but higher associativity (more ways) than the P3.
When Intel released the P4 its efficiency SUCKED. Remember, the first 1.4 GHz P4s performed worse than the 1 GHz P3, and a lot of customers were angry about that. During this time AMD didn't reinvent the Athlon AT ALL; it's the same old Athlon that has been around since the original Athlon 550 (even the Athlon 64 is much the same architecture as the original Slot A Athlon 550).
The Athlon XPs remained huge competition for the P4, even though the P4 had to be clocked nearly 400-600 MHz higher to get the same performance. The first P4s had a 20-stage pipeline, while the Athlons still had a 12-stage one, I believe.
As the P4 ramped up in MHz, and now that the Prescotts are out, they had to add another 10 stages to that pipeline, so it has around 30 stages. Going from the Athlon XP architecture to the Athlon 64 they only went to 14 stages, I believe. So as you can see right there, the Athlon 64 is theoretically 2x more efficient.
Intel developed the L1 trace cache (about 12K micro-ops) to try to help with the P4's massive inefficiency on branch mispredictions caused by its long pipeline, so that helped a bit and brought its efficiency up a few percent.
So, in the end, that's why we have AMD processors at the same clock speed that perform 25% better than the old Athlon XP architecture and around 50% better than a same-clocked P4.
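Turning those percentages around gives the clock speed the other chips would need to keep up. This is back-of-the-envelope arithmetic only: it assumes performance scales linearly with clock, which ignores memory and everything else, and the function name is just something I made up for the example.

```python
def equivalent_clock_ghz(clock_ghz, per_clock_advantage):
    """Clock a slower-per-cycle chip needs to match a chip running at clock_ghz.

    per_clock_advantage is the faster chip's lead at equal clocks
    (0.25 means 25% faster). Assumes performance scales linearly with clock.
    """
    return clock_ghz * (1.0 + per_clock_advantage)

a64_clock = 2.4
print(f"Athlon XP needs about {equivalent_clock_ghz(a64_clock, 0.25):.1f} GHz")
print(f"P4 needs about {equivalent_clock_ghz(a64_clock, 0.50):.1f} GHz")
```

With the 50% figure, a 2.4 GHz A64 lines up with a P4 at roughly 3.6 GHz, which is the clock speed mentioned in the quoted post.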
The big change going from Athlon XP to Athlon 64 was the integrated memory controller, and that upped efficiency a TON over the XP, because latency to memory was cut in half compared to going through the motherboard. That's simply because the link to the memory controller now runs inside the CPU at full CPU speed.
Another interesting thing is that the P4 really, really, really SUCKS at basic x87 FPU code; it needs SSE or SSE2 optimized code to be worth anything. Take a look at Sandra's CPU benchmarks: you will notice it shows FPU and SSE results in the same bar, and the bare FPU score is about 3/5 of the SSE-optimized one. AMD really excels in this area, as it does well in bare x87 FPU and awesome in ALU.
That is why Win2k may appear faster on an Athlon 64 than on a Pentium 4: Windows XP is heavily SSE optimized and Win2k is not. Hyperthreading also helps in Windows XP.
One thing hyperthreading really helps the P4 with is keeping its pipelines fuller. That's why the P4 architecture performs well with lots of threads: its pipelines are getting fed the way they want.
There is still the huge branch misprediction penalty, though.
Note: I may be wrong on a few numbers here and there. My memory of the exact numbers has kind of faded, but the ideas remain =) and my numbers should be close.
PS: if you took a P1 and were able to overclock it to 3 GHz, it would kick the P4's ass in math calculations (not memory-heavy stuff, of course) =) and even an Athlon 64's, but not by much =)