Thoughts and speculations about Prescott design decisions

RaynorWolfcastle

Diamond Member
Feb 8, 2001
8,968
16
81
I just wanted to put this thread here as opposed to the other forums because I'm pretty sure we can keep the discussion technical here instead of it inevitably degenerating into a flamewar.

Am I the only one who thinks it's absolutely amazing that, with a 55% pipeline increase, Prescott is competitive with Northwood on a clock-for-clock basis? I'm really kind of curious how the Intel engineers broke the traditional 5-stage pipeline up into 31 steps while keeping everything balanced (if you have references, feel free to share them).

Also, I'm a little surprised that even with a 55% longer pipeline, Intel was unable to clock Prescott any higher than Northwood at launch. Could it be that the 90nm process is holding Prescott back? Is Prescott's high power output an indication that the 90nm process isn't working as well as they had hoped (103W @ 3.2 GHz :Q)? I think that if Prescott had been released at 3.6 GHz or so, it would have received a much warmer reception than it has at the current 3.2 GHz.

I'm also wondering how difficult it would have been for Intel to release a Northwood with the better branch predictor and increased caches without completely reworking the pipeline (doable in a 12-month timeframe?). Given Intel's initial release of Prescott, I'm left wondering what performance would have been like on a 90nm Northwood using some of Prescott's architectural enhancements.

I'm curious to hear what everyone's thoughts are on all this and Prescott's architecture/performance in general.
 

Lynx516

Senior member
Apr 20, 2003
272
0
0
I think the design decisions demonstrate how poorly their 90nm process is doing.
There are also some interesting differences between the reviews floating around.

Can I take this opportunity to point out a mistake in AnandTech's coverage?

The 5th traditional pipeline stage should be "Write to Registers" or "Retire" rather than "Store to Cache".
It is impossible to write directly to cache, and any mirroring done by the cache happens independently of the pipeline stages.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
In learning about Prescott and trying to understand just why Intel did what they did, we came to a realization: not only is Prescott designed to be ramped in speed, but there was something else hiding under the surface.

When overclocking a processor, we can expect a kind of linear trend in performance. As Northwood's speed increases, its performance increases. The same is true for Prescott, but what is important to look at is the increase in performance relative to the increase in clock speed.

Prescott's enhancements actually give it a steeper increase in performance per increase in clock. Not only can Prescott be clocked higher than Northwood, but as its clock speed is increased, it will start to outperform similarly clocked Northwood CPUs.

Not a very good explanation. You would only expect a linear increase if you also increased the speed of the supporting hardware. But anyway, it's not that Prescott gets faster - it's that Northwood gets slower, since every cache miss is relatively more expensive at higher clock speeds. Prescott's larger cache does a better job of masking the delays in talking to RAM - so when you ramp up the clocks, Northwood's higher miss rate creates more slowdown.
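To put rough numbers on that (a minimal sketch; the miss rates and DRAM latency below are made up for illustration, not measured data):

#include <stdio.h>

/* ns per instruction = base CPI * cycle time + misses/instr * DRAM latency.
   DRAM latency is fixed in nanoseconds, so it costs more *cycles* as the
   clock rises, and the chip with the higher miss rate degrades faster. */
static double ns_per_instr(double ghz, double cpi,
                           double miss_per_instr, double mem_ns) {
    return cpi / ghz + miss_per_instr * mem_ns;
}

int main(void) {
    const double mem_ns = 100.0;  /* assumed DRAM access latency */
    for (double ghz = 2.0; ghz <= 5.0; ghz += 1.0) {
        /* assumed miss rates standing in for a 512KB vs. a 1MB L2 */
        printf("%.0f GHz: Northwood-ish %.2f ns/instr, Prescott-ish %.2f ns/instr\n",
               ghz,
               ns_per_instr(ghz, 1.0, 0.010, mem_ns),
               ns_per_instr(ghz, 1.0, 0.005, mem_ns));
    }
    return 0;
}

With these made-up numbers the gap widens as the clock climbs, which is the whole point: the memory-stall term doesn't shrink with cycle time.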

What we have seen here today does not bode well for the forthcoming Prescott-based Celerons. With a 31-stage pipeline and 1/4 the cache size of the P4 Prescott, it doesn't look like Intel will be able to improve Celeron performance anytime soon.
Everyone knew that the Willamette Celerons were going to suck - what happens if you take a bandwidth-starved processor, decrease the available bandwidth, AND chop the cache? You make it need even MORE bandwidth for decent performance and end up with a very idle CPU. This will obviously only become more of a problem with higher-clocked processors unless, by some magic, the trend of CPU speed growth outpacing memory speed growth reverses.

edit: ...and for all the people claiming that 31 pipe stages is too many, read this (summary: up to ~60 pipeline stages is not necessarily bad, if you make the other necessary changes).
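For the curious, the model behind results like that usually looks something like this (a sketch; all three constants are my own assumptions, not numbers from the linked article):

#include <stdio.h>

int main(void) {
    const double logic_ns = 10.0;  /* assumed total logic depth of the machine */
    const double latch_ns = 0.08;  /* assumed flop/latch overhead per stage */
    const double stall    = 0.05;  /* assumed extra stall CPI per added stage */
    for (int stages = 10; stages <= 80; stages += 10) {
        double cycle_ns = logic_ns / stages + latch_ns;  /* shorter per stage */
        double cpi      = 1.0 + stall * stages;          /* but costlier flushes */
        printf("%2d stages: %6.1f instructions/us\n",
               stages, 1000.0 / (cycle_ns * cpi));
    }
    return 0;
}

With these constants throughput peaks around 50 stages: deeper isn't automatically worse until latch overhead and flush costs start to dominate.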
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
I thought he meant more like 52 stages, BUT only if the clock frequency doubled and the memory access latency was figured to be a constant percentage of clock speed, meaning your memory access time would drop to 25% ((1/2)^2) of the baseline, right? That just plain seems wrong. Please correct my interpretation, and correct my logic where I went wrong.
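One way to sanity-check the arithmetic (a sketch of my own, not from whatever he wrote): if memory latency is fixed in nanoseconds, doubling the clock doubles the latency measured in cycles - it doesn't fall to 25% of baseline.

#include <stdio.h>

int main(void) {
    const double mem_ns = 100.0;  /* assumed fixed DRAM latency in ns */
    for (double ghz = 1.0; ghz <= 4.0; ghz *= 2.0)
        printf("at %.0f GHz, %.0f ns of DRAM latency = %.0f core cycles\n",
               ghz, mem_ns, mem_ns * ghz);
    return 0;
}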
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
Here's the Prescott under the microscope: Image
Compare to the original P4 - Willamette: Link

The picture looks suspiciously like dual cores. The original poster claims it is two D-caches and two execution units. Back in 2002 AT did an article stipulating that Intel was investigating twin cores, but cores at different speeds - one fast, one slow - to keep thermal dissipation down. So if it is twin cores, then don't necessarily expect two 3GHz+ cores.
 

pm

Elite Member Mobile Devices
Jan 25, 2000
7,419
22
81
Originally posted by: Lynx516
The 5th traditional pipeline stage should be "Write to Registers" or "Retire" rather than "Store to Cache".
It is impossible to write directly to cache, and any mirroring done by the cache happens independently of the pipeline stages.
I'm used to seeing that 5th stage referred to as write-back - which I guess is the same thing in different words. But why is it impossible to write directly to a cache?
 

Lynx516

Senior member
Apr 20, 2003
272
0
0
It's handled at a different level. Cache is used to mirror memory locations, so if you wrote to memory (which would be an entire instruction, not part of one), the cache would be updated immediately. The retire stage should be writing to registers, whose entire purpose is to be faster than main memory and cache.

There is also no implementation of direct cache access in any ISA. Cache by its very nature is invisible.
 

pm

Elite Member Mobile Devices
Jan 25, 2000
7,419
22
81
I see now what you are saying. Briefly I was wondering if you meant that cache was somehow read-only and that's why I asked for clarification.
There is also no implementation of direct cache access in any ISA. Cache by its very nature is invisible.
Several ISAs allow speculative prefetches to cache. For example, the SSE instruction PREFETCHx. Additionally, the IA64 instruction set on the Itanium processor family allows locality hints in the ISA for memory loads and stores, so a compiler or programmer can indicate specifically which level of cache the load or store should affect. Section 4, pages 22-24 of the Intel IA64 Architecture Software Developer's Manual provides an overview of this feature.
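For example, a programmer (or compiler) can drop a hint like this - the summing loop and the 16-element lookahead are my own invented illustration, but _mm_prefetch compiles down to the PREFETCHh family of instructions:

#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

double sum(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* speculative hint: start pulling a future line toward L1 now */
        if (i + 16 < n)
            _mm_prefetch((const char *)&a[i + 16], _MM_HINT_T0);
        s += a[i];
    }
    return s;
}

The hint is still architecturally invisible in the sense that dropping it changes nothing about the program's results - it only pokes a small hole in the "software can't see the cache" view.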
 

Lynx516

Senior member
Apr 20, 2003
272
0
0
I stand corrected. I don't have much knowledge of IA64 or SSE; I was mainly speaking about the normal x86 or ARM ISAs, Thumb, etc. Traditionally the cache is invisible because cores should be able to work without cache enabled, and because the CPU should have total control over what is in the cache, though hints from the program would help with cache misses a lot. But the point I wanted to make is that since the cache is a mirror of main memory, you would either write to the registers, at which point the cache is pointless since it is slower than the registers, or you would write to main memory, which is an entire instruction in itself.


That being totally off topic - back to the topic. IMHO Prescott is a problem for Intel. The CPU by itself is pretty good; it performs pretty well. However, it is going to have problems scaling due to the heat output it produces. I have seen delta-Ts of 48C with the standard Intel HSF on the Prescott and delta-Ts of 28C on water cooling. This is way too hot. Prescott will stop scaling beyond 4GHz not due to core limitations but due to it being so insanely hot.

Getting a core to run at the speed it does with the pipeline length it has is a major achievement by itself.
 

sao123

Lifer
May 27, 2002
12,653
205
106
Here are my thoughts...

I'm tired of everyone complaining that Prescott is too hot to ________ (fill in the blank).
Let's face the facts here... heat is a problem for everyone in the microprocessor business.

If a processor runs too hot, what do you do? You improve its stock cooling, which is exactly what Intel is doing. New heatsink, new fan type, and the new BTX standard all help cooling. Additionally, I believe the new Socket 775 will also help it consume less power and put out less heat.

But people must realize...
It's only a matter of time until stock cooling can no longer be an air-based fan, but water or vapor.
Neither AMD nor Intel will make it to 7-10GHz on air. Eventually video cards, memory, and motherboard chipsets will have to be liquid-cooled as well.

Remember the automobile? The original autos were air-cooled... I bet people said the same thing back then about how hot the new 6- and 8-cylinder motors were going to get, and now they're water/antifreeze-cooled.
Boohoo.

No matter how hot the new processors get, no matter how much heat they dissipate (even if they're not as efficient as they could be), as long as a proper cooling solution is applied I see no reason to whine about it... other than to be an AMD-vs-Intel nitpicker. We'll see what happens when AMD makes its switch to 90nm next year.


thank you for allowing me to rant.
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
The heat issue is nonsensical. Prescott was designed from the outset for a future socket that hasn't even been released. When the new socket comes out, the power-delivery problems that are possibly causing the heat will be addressed. I know my electrical sockets run much cooler when I plug everything into separate circuits in the house, whereas it's possible to overload any one circuit by plugging everything into it. When they get the load balanced a little better, it should drastically cut down on the heat. This might be why they couldn't get the voltage down below 1V on their 90nm process with the S-478 package.

I'm not so sure they'll need to make any exotic cooling jumps any time soon. Copper as a material has yet to be pushed to its capacity for moving heat away from a processor. The mass and area of heatsinks can increase significantly before there's any need for liquid or vapor cooling. I don't know why it would be wrong to move to an 8"x8" heatsink with a slow-speed, high-bladecount, low-profile fan. A copper band 1" wide, 99.9% pure, and .040" thick can draw quite a lot of heat from a processor, which is exactly what older IBM-designed Pentium laptops used so that they could employ the laptop's frame as a heatsink. Widen the band of copper, push for the greater purity that is available now, and thicken it to handle the load, and there is no need to mount a fan and radiator on even a Prescott processor. Sounds fanciful, but we've yet to push what is possible with copper.
 

Lynx516

Senior member
Apr 20, 2003
272
0
0
I would disagree with the claim that S775 will fix the heat issue. It will help power delivery, but the main heat producer is current leakage in the transistors. I can foresee us not getting to 7-10GHz; I can see parallel computing taking more of a role. IA64 with Itanium shows that parallel computing is very, very fast.
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
The limited plumbing is forcing Intel to use more voltage than they would like. Once this is resolved, we should see a noticeable drop in electrical resistance inside the core. We saw the same thing when Intel moved from PPGA to FCPGA and again from FCPGA to FCPGA2, even though all three standards used the same S-370 socket. More pins were assigned to the task and voltages varied with each standard, and with each successive generation we got cooler-running designs. AMD ran into the same challenges with their Socket-A format over the years too, so it's not just limited to Intel.
 

kpb

Senior member
Oct 18, 2001
252
0
0
I'm not sure that even if the new socket does allow them to drop the voltage on the processors, it will actually decrease the heat generated. They are running into problems with leakage current, and getting to the point where the decrease in voltage is canceled out by the increase in leakage. Decreasing the supply voltage also shrinks the thresholds for what reads as a 1 vs. a 0. For example, with old-school 5-volt logic, a 1 might be anything between 4.5 and 5.5 volts, and anything below 0.5 volts might be a 0. If you drop the supply to 1V like newer processors are approaching, you obviously can't have a 0 be anything up to 0.5V - you have to drop it down to something like 0.1V or lower, and holding signals that low ends up letting more current through, and that's the leakage.
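A quick sketch of that scaling (the 10%/90% thresholds are assumptions for illustration, not real process numbers):

#include <stdio.h>

static void margins(double vdd) {
    double v0_max = 0.1 * vdd;  /* anything below this reads as a 0 */
    double v1_min = 0.9 * vdd;  /* anything above this reads as a 1 */
    printf("Vdd=%.1fV: '0' <= %.2fV, '1' >= %.2fV, separation %.2fV\n",
           vdd, v0_max, v1_min, v1_min - v0_max);
}

int main(void) {
    margins(5.0);  /* old-school 5V logic */
    margins(1.0);  /* modern ~1V logic: same fractions, 1/5 the room */
    return 0;
}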
 

SuperTool

Lifer
Jan 25, 2000
14,000
2
0
I find the Pentium M processor much more appealing than the P4 derivatives. I think the P3 is the last Intel processor designed with common sense, rather than MHz marketing, driving the decisions.
I don't understand why Intel decided to use the P4 as their bread-and-butter 32-bit processor and the Itanic as their 64-bit processor when they have both the P3 derivatives and the Alpha in their design portfolio, which I believe are superior to the previously mentioned "innovations" from Intel. Hyperpipelining and VLIW are all great for certain high-throughput applications, but for a general-purpose processor, I think the P3 and Alpha designs are superior.
As for how they were able to release Prescott so soon after Northwood: they were probably designed in parallel, not in series. That is, Prescott's design started before Northwood finished.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: SuperTool
I find the Pentium M processor much more appealing than the P4 derivatives. I think the P3 is the last Intel processor designed with common sense, rather than MHz marketing, driving the decisions.
I don't understand why Intel decided to use the P4 as their bread-and-butter 32-bit processor and the Itanic as their 64-bit processor when they have both the P3 derivatives and the Alpha in their design portfolio, which I believe are superior to the previously mentioned "innovations" from Intel.
Forgive me for defending my competitors... but what P3-based design comes CLOSE in performance to a high-end AMD processor? None. The fastest P6 implementation runs up to 1.6GHz (Banias). Assuming AMD's PR methodology is at all accurate (e.g. an XP2800 really is about twice as fast as a 1.4GHz Thunderbird), Intel would need to be at almost 3GHz with the P6 core, since the Athlon and P3 were generally pretty similar performers.

The 0.13u Tualatins topped out at 1.2GHz, and everyone remembers the whole unstable-1.13GHz processor episode, but AMD didn't move to 0.13u until somewhere after 1.4GHz (IIRC). It is generally accepted that Intel has better fabs than AMD, so why did they have to move to 0.13u earlier? I would conclude that it would have taken Intel more effort to get the P3 running fast enough to match AMD than it did to design the P4, which could be clocked high enough to compete.

...Alpha now in their design portfolio...
It hasn't been in their portfolio long enough for them to have actually done much with it yet. Besides, non-x86 doesn't sell, and Itanium proves that. Also, how do you know none of the Alpha tricks are in Prescott?

Hyperpipelining and VLIW are all great for certain high throughput applications, but for general purpose processor, I think P3 and Alpha designs are superior.
I would argue that SMT is good in general... and benchmarks that involve stuff happening in multiple threads show that. I don't know enough about the performance aspects of VLIW to comment on IA64.

As far as how they were able to release the prescott so soon after northwood, it's because they were probably designed in parallel, and not in series. That is prescott design started before northwood finished.
Of course. And Tejas was being designed in parallel with Prescott. The 2006 Ford Taurus is probably being designed in parallel with the 2005 Taurus ;). The amount of time it takes to get a processor from the drawing board to consumers is WAY too long not to work in parallel.
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
CTho9305-

You probably meant to say the 180nm P!!!s topped out at 1.2GHz, not the 130nm P!!!s. They did release the 130nm P!!! up to 1.4GHz on a 133fsb, and the 130nm Celeron went up to 1.4GHz on a 100fsb. Banias, on the other hand, runs on a 400fsb (100MHz-QDR) just like the original P4 family. To be perfectly honest, though, the P!!! never really was given much consideration for Intel's upper end once they brought out the 130nm version, because it was so starved for front-side bus that raising it above 1.4GHz made little sense. Plenty of people were able to push 130nm P!!!s to over 1.6GHz on air, so it's not like it couldn't have scaled more in raw MHz.

And if Banias is to be considered the top of the P6 line, then it's not a 1.6GHz cap but rather a 1.7GHz cap. Banias helped uncork the P!!! legacy by mating it to the P4's front-side bus. It makes me wonder how much headroom it would have on either the 533fsb (133MHz-QDR) or the 800fsb (200MHz-QDR) of the Northwood, considering that AMD was only able to keep the XP line alive above 2GHz by scaling up to a 200MHz-DDR fsb and strapping on twice the L2 cache. Last time I checked, the Banias L2 was 1MB, twice what we find in Mr. Athlon Barton. Banias on the higher fsb could probably compete with the Barton easily.

I'm not sure what your point was about SMT. The main reason Intel uses it in the P4 is that it needs to fill idle pipelines to fight its otherwise high-bandwidth, high-latency memory design. AMD, on the other hand, keeps the need for SMT to a minimum by keeping the pipeline short and the memory latency low. SMT and other TLP designs benefit from high memory bandwidth (not necessarily low latency) and prefetching. As a design decreases in pipeline complexity, load latency becomes more important than raw memory bandwidth, hence the reason AMD chose a very fast front-side bus. The future for Intel is long pipelines reinforced by SMT tricks, multiple cores running multiple threads in parallel, high memory bandwidth, small L1 caches (since a cache miss isn't necessarily doom and gloom), and large L2/L3 caches for efficient prefetching. The future for AMD would seem to be short pipelines, large L1 caches, SMP on a chip, HyperTransport links between CPUs, NUMA memory architecture, and a low-latency front-side bus. They both have the same endgame and actually aim at somewhat differing markets, but in the end the performance is nearly identical across the board.
 

Sahakiel

Golden Member
Oct 19, 2001
1,746
0
86
Originally posted by: MadRat
CTho9305-

You probably meant to say the 180nm P!!!s topped out at 1.2GHz, not the 130nm P!!!s. They did release the 130nm P!!! up to 1.4GHz on a 133fsb, and the 130nm Celeron went up to 1.4GHz on a 100fsb. Banias, on the other hand, runs on a 400fsb (100MHz-QDR) just like the original P4 family. To be perfectly honest, though, the P!!! never really was given much consideration for Intel's upper end once they brought out the 130nm version, because it was so starved for front-side bus that raising it above 1.4GHz made little sense. Plenty of people were able to push 130nm P!!!s to over 1.6GHz on air, so it's not like it couldn't have scaled more in raw MHz.

If I remember correctly, my impression at the time was that Intel didn't want to raise P6 core speeds too fast because otherwise it would cannibalize Pentium 4 sales. Also, the Pentium 4 was a year or two late out the door thanks to a last-minute decision to shove in HyperThreading. If it had been released on schedule, I think Intel would've been first to 1GHz and hyperpipelining would never have become such a public issue. I mean, when Alpha took the speed crown with their 500MHz chip, I don't remember much hoopla over the deep (for the time) pipeline.
As for Banias, its bus interface is the Pentium 4's design, last I heard. It actually would be pin-compatible if it weren't for Intel changing one little pin on the Banias chip (you have no idea how sore I am over that).
Oh yeah, and by the time the Pentium 4 came out, I think the P6 core was almost five years old. That's pretty damn old.

And if Banias is to be considered the top of the P6 line, then it's not a 1.6GHz cap but rather a 1.7GHz cap. Banias helped uncork the P!!! legacy by mating it to the P4's front-side bus. It makes me wonder how much headroom it would have on either the 533fsb (133MHz-QDR) or the 800fsb (200MHz-QDR) of the Northwood, considering that AMD was only able to keep the XP line alive above 2GHz by scaling up to a 200MHz-DDR fsb and strapping on twice the L2 cache. Last time I checked, the Banias L2 was 1MB, twice what we find in Mr. Athlon Barton. Banias on the higher fsb could probably compete with the Barton easily.
I can't help but recall reading that the Banias core was designed from scratch. It borrowed lots of features from the P6 and Pentium 4 cores and also mixed in a few new ideas. Also, scaling a chip has little to do with the FSB. You could technically scale a CPU to 3GHz on a 33MHz FSB. Obviously it's a bad idea, but I'm just pointing out that the ability to increase core clock is not dependent on the FSB. Well, unless you run out of multipliers or something.

I'm not sure what your point was about SMT. The main reason Intel uses it in the P4 is that it needs to fill idle pipelines to fight its otherwise high-bandwidth, high-latency memory design. AMD, on the other hand, keeps the need for SMT to a minimum by keeping the pipeline short and the memory latency low. SMT and other TLP designs benefit from high memory bandwidth (not necessarily low latency) and prefetching.
Unfortunately, I keep seeing a different picture than you. Everywhere I look, the trend is towards longer pipelines, multiple threads, and VLIW designs. Not a single paper I've read that has been published in the last ten years has concluded that shorter pipelines are required for future designs. In fact, most papers tend towards decreasing complexity, longer pipelines, and multithreading. Some papers are suggesting anywhere from 40-60 pipeline stages depending on the complexity of the architecture. However, the biggest problem with deep pipelines is ILP followed by wire delay. The ILP is such a vexing problem due to x86, which is why Intel chose to try ditching it. In fact, they've been trying since before 1994, the year Intel and HP announced they were already working on Itanium.
Oh, and btw, pipelines have always been "high latency" designs. They sacrifice latency for throughput, and that's especially true these days due to increasing latency everywhere in the system. SMT can benefit from low latency. It would be like comparing a wide data bus with a slow clock to a high clock with a narrow bus.
I would venture to say AMD doesn't use SMT on the Athlon because they don't have the resources to implement it. Unfortunately, I don't know much about the memory bus for Athlon or Pentium 4. They both run at the same clock speed so I assume the memory latency is pretty close. Unless, of course, one or the other decided to deeply pipeline the memory bus.

As a design decreases in pipeline complexity, load latency becomes more important than raw memory bandwidth, hence the reason AMD chose a very fast front-side bus.

The Pentium 4 doesn't "need" HyperThreading in the sense that it's useless otherwise. HyperThreading was thrown in smack-dab in the middle of the design cycle because tests showed that, on average, only about 50% of the Pentium 4's execution engines were in use at any given time. 50% meant you could almost fit another thread in there. It's not a "trick" to keep the Pentium 4 on par with the Athlon, much the same as the Athlon's 200MHz FSB isn't a "trick" to keep it competitive with the Pentium 4. The 200MHz EV6 bus was on the charts long before the Pentium 4 came out, simply because a high FSB becomes a necessity when the core speed scales that high.

The future for Intel is long pipelines reinforced by SMT tricks, multiple cores running multiple threads in parallel, high memory bandwidth, small L1 caches (since a cache miss isn't necessarily doom and gloom), and large L2/L3 caches for efficient prefetching. The future for AMD would seem to be short pipelines, large L1 caches, SMP on a chip, HyperTransport links between CPUs, NUMA memory architecture, and a low-latency front-side bus. They both have the same endgame and actually aim at somewhat differing markets, but in the end the performance is nearly identical across the board.
I am willing to bet that the future of both companies is "long pipelines reinforced by SMT tricks." Single thread performance has pretty much hit a brick wall due to ISA constraints and the nature of programs in general. The easiest way to increase performance is multi-threading.
Multiple cores are a byproduct of shrinking transistors. When your logic consumes on the order of 15% of your die space, it suddenly becomes feasible to shove another CPU on the die. I remember both Intel and AMD saying that CMP could become feasible at 90nm, and more likely than not at 65nm.
Smaller L1 caches are a byproduct of lower-latency designs. They may become the trend if clock speeds ramp high enough, or we may see several small caches feeding several cores. L2 caches may become smaller as well to keep up with clock speed, in which case an L3 would be required to avoid capacity misses.
The future of AMD is longer pipelines, as evidenced by the K8 core. Granted, it's only 2 more stages, but what can you expect from a cash-strapped company? At least the rest of the core isn't simply copied over from the K7.
The larger size of AMD's caches comes from the shorter pipelines. Shorter pipes don't scale as well in raw clock speed, so it's kind of pointless to make a really low-latency cache when the larger capacity makes up for the small loss in latency. If AMD does come out with a hyperpipelined CPU, I don't think it would have large L1 caches either. The P6 core only had small L1 caches because it was designed at a time when a 16K L1 wasn't really all that small. As to why the Pentium III revision didn't have larger caches, I can only speculate.
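The size-vs-latency tradeoff falls out of the standard average-memory-access-time formula (a sketch; every latency and miss rate below is an assumption):

#include <stdio.h>

/* AMAT = L1 hit time + L1 miss rate * (L2 time + L2 miss rate * memory time) */
static double amat(double l1, double l1_miss,
                   double l2, double l2_miss, double mem) {
    return l1 + l1_miss * (l2 + l2_miss * mem);
}

int main(void) {
    /* small fast L1 with a worse hit rate vs. a big slow L1, same L2 behind both */
    printf("small 2-cycle L1: %.2f cycles average\n", amat(2.0, 0.06, 15.0, 0.10, 300.0));
    printf("large 4-cycle L1: %.2f cycles average\n", amat(4.0, 0.04, 15.0, 0.10, 300.0));
    return 0;
}

With these numbers the small, fast L1 wins even with the worse hit rate, because the L2 catches most of the extra misses - which is the bet a hyperpipelined design makes.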
HyperTransport is definitely AMD's CPU interface for years to come. Heck, they developed it in the first place and brought memory controllers on-die for that reason. A byproduct of the on-die memory controller: a low-latency CPU-to-northbridge connection. Unfortunately, main memory (read: DRAM) is still high latency, so thanks to Amdahl's law that idea is going to hit a brick wall unless memory technology improves. However, Intel would also benefit.
In the same vein, NUMA memory architecture is likely something AMD would want done, but isn't something the company would kill for. Opteron systems have localized memory banks for sure, but I don't know enough about the latency between banks to say whether NUMA would bring any benefit.
Intel and AMD have their own roadmaps, but there are similarities. The main differences come from Intel's massive financial and manufacturing advantage and possibly some differences in opinion. If both companies were equally wealthy and capable, I don't think their roadmaps would be too different. After all, the K5 was technically superior to the Pentium, yet shared much of the same features.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: MadRat
And if Banias is to be considered the top of the P6 line, then it's not a 1.6GHz cap but rather a 1.7GHz cap. Banias helped uncork the P!!! legacy by mating it to the P4's front-side bus. It makes me wonder how much headroom it would have on either the 533fsb (133MHz-QDR) or the 800fsb (200MHz-QDR) of the Northwood, considering that AMD was only able to keep the XP line alive above 2GHz by scaling up to a 200MHz-DDR fsb and strapping on twice the L2 cache. Last time I checked, the Banias L2 was 1MB, twice what we find in Mr. Athlon Barton. Banias on the higher fsb could probably compete with the Barton easily.
Increasing the cache size is a good way to get a lower clock speed, not a higher one. Bigger arrays are, in general, slower (see Prescott vs. Northwood for one example). And the FSB doesn't necessarily relate to the clock speed - by your logic, taking an Athlon that runs above 2GHz and decreasing the FSB while raising the multiplier shouldn't work.

I'm not sure what your point was about SMT. The main reason Intel uses it in the P4 is that it needs to fill idle pipelines to fight its otherwise high-bandwidth, high-latency memory design.
I had no real point, I was just saying that I felt it was good technology and disagreed with SuperTool's remark that it wasn't relevant for most tasks.
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
I'd hate to see that CPU that runs 3GHz on a 33fsb. Sure has some scoot when it has data available, but can't do a thing with that PC-33 memory. Even the hard drive outperforms it... Talk about memory bound. ;)

As far as going above 2GHz with the Athlon XP, it simply was not a viable competitor in the market on either the 133fsb or the 256K L2 cache. (If it were, we would see all the people on motherboards that max out at 133fsb using 2GHz+ Athlon MPs.) Only with a combination of these improvements, 512K L2 and a 166-200fsb, was AMD able to push the performance of their core to compete against Intel's high-end variants. There was only so much they could squeeze out of a processor with but one 64-bit DDR pathway to the L2 cache and main memory. The A64 fixes this problem by making it a pair of 64-bit DDR pathways to the L2 cache and main memory. Likewise, the P!!! was hampered by a single 64-bit pathway to main memory, but at least it had a four-way pathway to the L2 cache.

I'm not so sure that larger caches make for slower clock speeds, given that caches are the easiest thing to demonstrate whenever a new, smaller manufacturing process comes out. Add in the fact that faulty caches no longer hamper the sale of a processor, since they can remark them and disable the cache, and it's hard to find real-world examples to prove that argument.

The future is not long pipelines "reinforced by SMT tricks" (my quote), but rather ideal pipeline lengths. Scaling MHz with the longer pipelines is a bonus in a way, not really a hindrance. With Prescott's thermal output, it makes me wonder whether it's not wire delays that suffer under long pipelines, but rather power input. Prescott scaled in stages but didn't get any voltage drop when it moved to 90nm. Could it be because it takes more voltage to push those additional stages? They had to keep the voltage up because the signal strength is too weak in its current incarnation. I'm guessing that Socket-T could make a huge difference in thermal output, since it just happens to cure the input problems.

I'm also not so sure that "long pipelines reinforced by SMT tricks" is a position anyone wants to be in when comparing performance. Any time SMT has to be used to hide latencies in the design, it leaves one vulnerable. AMD could play the same game as Intel and clump a nice load of crap benchmarks together to benefit their design, but it's not going to happen. Intel is the market leader and AMD is the symbiotic organism. Intel needs AMD's competition to fend off the market regulators, just like AMD needs Intel's marketing crush to keep people buying x86-compatible processors. If not for a healthy x86 market, what future does AMD really have? Intel could crush AMD at the "higher IPC game" at any time, but it serves no purpose. Intel almost needs to keep itself vulnerable lest it accidentally kill off its faked competition.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: MadRat
I'd hate to see that CPU that runs 3GHz on a 33fsb. Sure has some scoot when it has data available, but can't do a thing with that PC-33 memory. Even the hard drive outperforms it... Talk about memory bound. ;)

As far as going above 2Ghz wi....blah blah blah blah blah blah ...ay pathway to the L2 cache.
Oh. I thought you were saying they had to raise the FSB to reach >2GHz, but I guess you actually meant that without raising the FSB they weren't going to be getting any more performance. I think we actually agree ;).

I'm not so sure that larger caches make for slower clock speeds, given that caches are the easiest thing to demonstrate whenever a new, smaller manufacturing process comes out.
Probably because you can't just take an existing processor and shrink it as fast as you can redesign a cache. What's in a cache? One cell repeated a few million times and some sense amps?

Add in the fact that faulty caches no longer hamper the sale of a processor, since they can remark them and disable the cache, and it's hard to find real-world examples to prove that argument.
As I said, look at the latencies of the Prescott vs Northwood.

With Prescott's thermal output, it makes me wonder whether it's not wire delays that suffer under long pipelines, but rather power input. Prescott scaled in stages but didn't get any voltage drop when it moved to 90nm. Could it be because it takes more voltage to push those additional stages?
Additional stages won't affect voltage requirements. Each individual gate gets Vdd across it, and each gate's output is either Vdd or 0, regardless of how many gates you put in a series. I can draw a picture if it would help (don't interpret that as "talking down"... if you don't know how the logic is implemented a picture would really clear it up ;))
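In lieu of the picture, here's the point as a toy model (my illustration; real gates are analog, but static CMOS restores levels the same way):

#include <stdio.h>

/* an idealized static CMOS inverter: the output snaps to a full rail */
static double inverter(double vin, double vdd) {
    return (vin < vdd / 2.0) ? vdd : 0.0;
}

int main(void) {
    const double vdd = 1.3;  /* assumed core voltage */
    double v = 0.8 * vdd;    /* a somewhat degraded "1" coming in */
    for (int stage = 1; stage <= 31; stage++)
        v = inverter(v, vdd);
    printf("after 31 gates in series: %.2f V - still a full rail\n", v);
    return 0;
}

Chain length never erodes the levels, so stage count by itself doesn't demand more voltage.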

They had to keep the voltage up because the signal strength is too weak in its current incarnation.
Sounds plausible.

I'm guessing that Socket-T could make a huge difference in thermal output, since it just happens to cure the input problems.
I haven't been following the speculation on why the voltage was not decreased - what is Socket-T? S-478 with more power/ground pins?

I'm also not so sure that "long pipelines reinforced by SMT tricks" is a position anyone wants to be in when comparing performance. Any time SMT has to be used to hide latencies in the design, it leaves one vulnerable. AMD could play the same game as Intel and clump a nice load of crap benchmarks together to benefit their design, but it's not going to happen. Intel is the market leader and AMD is the symbiotic organism. Intel needs AMD's competition to fend off the market regulators, just like AMD needs Intel's marketing crush to keep people buying x86-compatible processors.
Hardly. If Intel dropped x86, it would be suicide. They've been trying since whenever they buddied up with HP for IA64, and the low sales show quite clearly that people do not like change.

If not for a healthy x86 market, then what future does AMD really have?
AMD produces MIPS chips too. If you look at their sales numbers, you'll see that their flash products make up a HUGE portion of their revenue. (Total sales: $1.206 billion, CPG [desktop/server processors]: $581 million, FASL [flash memory]: $566 million) source

Intel could crush AMD at the "higher IPC game" at any time, but it serves no purpose. Intel almost needs to keep itself vulnerable lest it accidentally kill off its faked competition.
"faked competition" doesn't usually see that Intel screwed the pooch with IA64 and then proceed to DESTROY Intel with respect to sales volume of 64-bit parts. By the way, the "faked competition" is expanding its operations pretty significantly right now.

Everywhere I look, the trend is towards longer pipelines, multiple threads, and VLIW designs. Not a single paper I've read that has been published in the last ten years has concluded that shorter pipelines are required for future designs. In fact, most papers tend towards decreasing complexity, longer pipelines, and multithreading. Some papers are suggesting anywhere from 40-60 pipeline stages depending on the complexity of the architecture. However, the biggest problem with deep pipelines is ILP followed by wire delay.
Why would wire delay get worse as you do less each cycle? (I haven't thought this through)

Smaller L1 caches are a byproduct of lower-latency designs. They may become the trend if clock speeds ramp high enough, or we may see several small caches feeding several cores. L2 caches may become smaller as well to keep up with clock speed, in which case an L3 would be required to avoid capacity misses.
^ supporting my statement that bigger caches hurt your cycle time ;)

After all, the K5 was technically superior to the Pentium, yet shared much of the same features.
IIRC, the K5 was used in the 486 socket, and was severely held back by the very low memory throughput (hence the PR ratings so far below the clock speed)
 

Sahakiel

Golden Member
Oct 19, 2001
1,746
0
86
Originally posted by: MadRat
As far as going above 2GHz with the Athlon XP, it simply was not a viable competitor in the market on either the 133fsb or the 256K L2 cache. (If it were, we would see all the people on motherboards that max out at 133fsb using 2GHz+ Athlon MPs.) Only with a combination of these improvements, 512K L2 and a 166-200fsb, was AMD able to push the performance of their core to compete against Intel's high-end variants. There was only so much they could squeeze out of a processor with but one 64-bit DDR pathway to the L2 cache and main memory. The A64 fixes this problem by making it a pair of 64-bit DDR pathways to the L2 cache and main memory. Likewise, the P!!! was hampered by a single 64-bit pathway to main memory, but at least it had a four-way pathway to the L2 cache.
My point was that the 200MHz EV6 bus for Athlon processors did not come about as a response to the introduction of the Pentium 4. Rather, it's simply the same evolutionary process we've seen in the past when processor generations introduced new memory bus speeds. The oldest instance I can remember is the 486 generation, with the migration from 25 to 33MHz. The Pentium had 60 and 66MHz. The P6 had 60, 66, 100, and 133MHz. The Pentium 4 went from 100 to 200. In the same fashion, the Athlon went from 100 to 200 simply because it was part of the architecture's design. I would almost say the Pentium 4 was actually designed to counter the Athlon if I weren't sure that the design team started planning the architecture before the Athlon was released. Of course, with the cross-licensing between the two companies, it's hard to say what is designed to one-up the other company.

The future is not long pipelines "reinforced by SMT tricks" (my quote), but rather ideal pipeline lengths. Scaling MHz with the longer pipelines is a bonus in a way, not really a hindrance. With Prescott's thermal output, it makes me wonder whether it's not wire delays that suffer under long pipelines, but rather power input. Prescott scaled in stages but didn't get any voltage drop when it moved to 90nm. Could it be because it takes more voltage to push those additional stages? They had to keep the voltage up because the signal strength is too weak in its current incarnation. I'm guessing that Socket-T could make a huge difference in thermal output, since it just happens to cure the input problems.
Prescott has a bunch of problems, I'm sure. I also doubt many of them can be attributed to the architecture.
You call for ideal pipeline lengths, but I hope you realize that recent research has suggested ideal pipelines longer than current designs. I also wonder if you realize that superpipelining is nothing new. In the last 10-15 years, if you just look at x86 processor designs, you see a couple of things. To improve performance, you can either increase your clock rate or increase your issue rate. Increasing clock rate involves better process technology and deeper pipelines. Increasing issue rate involves parallel execution through superscalar or VLIW designs. Unfortunately, it's very hard to do both, which is why Intel went with deeper pipelines. The designers basically came to the conclusion that they could increase clock rate faster than they could increase parallelism. Hence the introduction of the Pentium 4.
On the AMD side, they pretty much came to the same conclusion after a slight detour. The K5 had around 5-6 pipeline stages. Instead of lengthening the pipeline, they made a really nice branch predictor and shoved it onto what I think is technically a similar core. The K6 didn't do too well. The K7 doubled pipeline depth to 10 or 12, and see how well that turned out. That's why the K7 did slightly better clock for clock than the P6's 12-14 stage design. That, and a non-pipelined FP unit, an SDR data bus, a much older design, etc.
To be honest, if we simply lengthened the Pentium pipeline to 20 stages and ran it at 3GHz today, it wouldn't perform nearly as well as a Pentium 4 at the same speed. What really kept performance in line with Moore's law was OOOE. More recently, DDR and QDR have helped, but more and more people are turning towards SMT to help prevent performance from slacking off. At this point, it's really hard to extract much more parallelism from single-threaded applications. The only way is increasing clock speed, which means increasing pipeline depth. However, increasing pipeline depth is difficult because, on average, your basic block is only 4-7 instructions long, and branch mispredicts will prevent attaining ideal speed increases. It's much easier to try SMT with a deep pipeline (or even CMP) to improve performance because, quite frankly, there isn't much else on the table.
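Putting those basic-block numbers into the usual flush-cost estimate (a sketch; the branch frequency and predictor accuracy are assumed):

#include <stdio.h>

int main(void) {
    const double branch_rate = 1.0 / 5.0;  /* a branch every ~5 instructions */
    const double miss_rate   = 0.05;       /* assumed mispredict rate */
    for (int depth = 10; depth <= 60; depth += 10)
        /* each mispredict flushes roughly the whole pipeline */
        printf("depth %2d: +%.2f CPI lost to mispredict flushes\n",
               depth, branch_rate * miss_rate * depth);
    return 0;
}

Every stage you add costs another hundredth of a cycle per instruction in flushes here, which is why better predictors (or another thread to switch to) have to accompany deeper pipes.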

Originally posted by: CTho9305
I'm not so sure that larger caches make for slower clock speeds, given that caches are the easiest thing to demonstrate whenever a new, smaller manufacturing process comes out.
Probably because you can't just take an existing processor and shrink it as fast as you can redesign a cache. What's in a cache? One cell repeated a few million times and some sense amps?
Cache is something like one cell per data bit, plus tag bits, status bits, and logic. The amount of logic depends on the complexity of the cache, varying as you go from direct-mapped to fully associative. The difficulty in scaling larger caches to high speeds likely comes from the number of gates on the longest access path, as well as the physical location of the furthest bit. Caches are getting to the point where they take anywhere from 30% to 70% (really rough estimates) of the die space. If you look at Itanium, the core logic is like 20% and the rest is cache. That's why L2 and L3 have higher latencies than L1 (that, and also because they're accessed after misses).
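A bare-bones version of that structure as a direct-mapped example (sizes are assumptions; real L1s add associativity, which is where the extra compare logic comes in):

#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 64
#define NUM_SETS   512  /* a 32KB direct-mapped cache with 64-byte lines */

struct cache_line {
    uint32_t tag;    /* high address bits identifying the resident line */
    bool     valid;  /* status bits */
    bool     dirty;
    uint8_t  data[LINE_BYTES];
};

static struct cache_line cache[NUM_SETS];

/* split an address into offset / set index / tag, then compare the tag */
static bool is_hit(uint32_t addr) {
    uint32_t set = (addr / LINE_BYTES) % NUM_SETS;
    uint32_t tag = addr / (LINE_BYTES * NUM_SETS);
    return cache[set].valid && cache[set].tag == tag;
}

The bigger the array, the farther the worst-case bit sits from the sense amps and the wider the index decode, which is where the clock-speed cost shows up.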

Add in the fact that faulty caches no longer hamper the sale of a processor, since they can remark them and disable the cache, and it's hard to find real-world examples to prove that argument.
As I said, look at the latencies of the Prescott vs Northwood.
Northwood's 1-cycle L1 access on a hit vs. Prescott's 4-cycle access. Pretty compelling evidence, it seems to me.

With Prescott's thermal output, it makes me wonder whether it's not wire delays that suffer under long pipelines, but rather power input. Prescott scaled in stages but didn't get any voltage drop when it moved to 90nm. Could it be because it takes more voltage to push those additional stages?
Additional stages won't affect voltage requirements. Each individual gate gets Vdd across it, and each gate's output is either Vdd or 0, regardless of how many gates you put in a series. I can draw a picture if it would help (don't interpret that as "talking down"... if you don't know how the logic is implemented a picture would really clear it up ;))
Don't know much about that. Perhaps the only problem with voltage I can think of is driving more stages with the same voltage source.

They had to keep the voltage up because the signal strength is too weak in its current incarnation.
Sounds plausible.
I hear there are problems with leakage current which, if I understand correctly, would mean higher voltages are needed to distinguish a 1 from a 0.

I'm also not so sure that "long pipelines reinforced by SMT tricks" is a position anyone wants to be in when comparing performance. Any time SMT has to be used to hide latencies in the design, it leaves one vulnerable. AMD could play the same game as Intel and clump a nice load of crap benchmarks together to benefit their design, but it's not going to happen. Intel is the market leader and AMD is the symbiotic organism. Intel needs AMD's competition to fend off the market regulators, just like AMD needs Intel's marketing crush to keep people buying x86-compatible processors.
Hardly. If Intel dropped x86, it would be suicide. They've been trying since whenever they buddied up with HP for IA64, and the low sales show quite clearly that people do not like change.
I wish Intel could successfully kill x86. It embodies just about every bad design decision I can remember off the top of my head. Too bad they didn't add a new mode to decouple the x86 decode stage and simply forward instructions to the internal core based on a new ISA. Well, that might be quite difficult, now that I think about it...

Everywhere I look, the trend is towards longer pipelines, multiple threads, and VLIW designs. Not a single paper I've read that has been published in the last ten years has concluded that shorter pipelines are required for future designs. In fact, most papers tend towards decreasing complexity, longer pipelines, and multithreading. Some papers are suggesting anywhere from 40-60 pipeline stages depending on the complexity of the architecture. However, the biggest problem with deep pipelines is ILP followed by wire delay.
Why would wire delay get worse as you do less each cycle? (I haven't thought this through)
The wire delay does not decrease much (if at all) with process shrinks. So what happens is that previously wire delay was essentially zero, whereas now it is becoming an increasingly large percentage of each gate delay. On the same topic, resistance is increasing thanks to thinner wires.

After all, the K5 was technically superior to the Pentium, yet shared much of the same features.
IIRC, the K5 was used in the 486 socket, and was severely held back by the very low memory throughput (hence the PR ratings so far below the clock speed)
Hmm... I remember the Pentium, K5, and 6x86 using Socket 7. The K5 probably came out after the Pentium, so it may have used Socket 5 as a means of upgrading legacy 486 systems relatively cheaply. However, I do distinctly remember buying Socket 7 motherboards that took the Pentium, K5, and 6x86. As for memory throughput, Pentium chips used 60 and 66MHz, but AMD and Cyrix chips sometimes used 75, 83, or even 95MHz.
Also, the PR ratings were above the clock speed. For example, a 133MHz 6x86 had a PR rating of 166. I remember it as such because I used to build systems around Cyrix 6x86 chips (don't kill me, plz :p), and I think the K5s and 6x86s had similar if not equal PR ratings, at least in the sub-200MHz arena.
 

Sohcan

Platinum Member
Oct 10, 1999
2,127
0
0
Originally posted by: CTho9305

Intel could crush AMD at the "higher IPC game" at any time, but it serves no purpose. Intel almost needs to keep itself vulnerable lest it accidentally kill off its faked competition.
"faked competition" doesn't usually see that Intel screwed the pooch with IA64 and then proceed to DESTROY Intel with respect to sales volume of 64-bit parts.
A little OT, but while IDC's Q3 numbers showed Opteron shipped more system units than IPF (~10,000 vs. 5,000), IPF shipped moderately more processor units and brought in much more system revenue ($123 million vs. $61 million).
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
To be honest, if we simply lengthened the Pentium pipeline to 20 stages and ran it at 3GHz today, it wouldn't perform nearly as well as a Pentium 4 at the same speed. What really kept performance in line with Moore's law was OOOE. More recently, DDR and QDR have helped, but more and more people are turning towards SMT to help prevent performance from slacking off.
You have to remember that older processors were designed to work with main memory that was "almost" as fast as the processors themselves, so other improvements (superscalar stuff, all that fancy technology) aside, you still wouldn't get as much performance out of them with memory as relatively slow as it is now.

With Prescott's thermal output, it makes me wonder whether it's not wire delays that suffer under long pipelines, but rather power input. Prescott scaled in stages but didn't get any voltage drop when it moved to 90nm. Could it be because it takes more voltage to push those additional stages?
Additional stages won't affect voltage requirements. Each individual gate gets Vdd across it, and each gate's output is either Vdd or 0, regardless of how many gates you put in a series. I can draw a picture if it would help (don't interpret that as "talking down"... if you don't know how the logic is implemented a picture would really clear it up ;))
Don't know much about that. Perhaps the only problem with voltage I can think of is driving more stages with the same voltage source.
Extra stages alone don't change anything.

After all, the K5 was technically superior to the Pentium, yet shared much of the same features.
IIRC, the K5 was used in the 486 socket, and was severely held back by the very low memory throughput (hence the PR ratings so far below the clock speed)
Hmm... I remember the Pentium, K5, and 6x86 using Socket 7. The K5 probably came out after the Pentium, so it may have used Socket 5 as a means of upgrading legacy 486 systems relatively cheaply. However, I do distinctly remember buying Socket 7 motherboards that took the Pentium, K5, and 6x86. As for memory throughput, Pentium chips used 60 and 66MHz, but AMD and Cyrix chips sometimes used 75, 83, or even 95MHz.
Also, the PR ratings were above the clock speed. For example, a 133MHz 6x86 had a PR rating of 166. I remember it as such because I used to build systems around Cyrix 6x86 chips (don't kill me, plz :p), and I think the K5s and 6x86s had similar if not equal PR ratings, at least in the sub-200MHz arena.

It seems you're right. So what were the ~133MHz CPUs I put in 486 boards that were rated as a P90?
 

SuperTool

Lifer
Jan 25, 2000
14,000
2
0
I was talking about the Pentium M processor in the Centrino, which is based on the P3 core with a large cache and gets comparable performance at half the clock rate of the P4.
I frankly don't see the point of having 31 pipeline stages. You are getting to the point where diminishing returns kick in and you are just adding flops and wasting area and power.