Prescott: What would you have done?


Dough1397

Senior member
Nov 3, 2004
343
0
0
Thanks for the read, guys. I was bored, and this used up a good 30 minutes of my time, as well as giving me something to talk about with friends... very interesting...
 

uOpt

Golden Member
Oct 19, 2004
1,628
0
0
Originally posted by: dmens
Originally posted by: MartinCracauer
Well, word is that the dual-core CPUs will be based on what is now the Prescott, same power consumption.

Oh, that's old stuff. I'm talking about new generation P4's.

Oh, ok. Do you have some links to learn more? The last thing I read was that the dual-cores will consume even more power than the Prescotts (less per core, more per unit).
 

Trente

Golden Member
Apr 19, 2003
1,750
0
0
"Intel needs to heavily beef up the L1 cache size, add more decoders, raise the transfer rate from the trace cache to the core, lower the cost of shift operations, and add additional FPU and MMX execution units." Copyright (C) 2001 by Darek Mihocka

But wait... if they do that, they would end up with an Athlon in their hands!
 

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
Originally posted by: MartinCracauer
Oh, ok. Do you have some links to learn more? The last thing I read was that the dual-cores will consume even more power than the Prescotts (less per core, more per unit).

Each core on Smithfield will be underclocked with a lower Vcc, but the total die will still consume more power. I think the Intel website has the press releases.

 

imgod2u

Senior member
Sep 16, 2000
993
0
0
Originally posted by: Trente
"Intel needs to heavily beef up the L1 cache size, add more decoders, raise the transfer rate from the trace cache to the core, lower the cost of shift operations, and add additional FPU and MMX execution units." Copyright (C) 2001 by Darek Mihocka

But wait... if they do that, they would end up with an Athlon in their hands!

Actually, this is what they did with Prescott. Didn't really make much of a difference considering they quadrupled L1 cache latency and doubled L2 latency.
 

uOpt

Golden Member
Oct 19, 2004
1,628
0
0
I'm not sure the caches are the only thing that makes the AMD64 CPUs faster in so many applications.

In my (biased) opinion as a programmer, the main problem, especially with the early Pentium 4s, is that Intel drastically overestimated how much effort programmers would invest in tuning their code for the Pentium 4 when all other CPUs (Pentium IIIs, AMDs, Pentium M) still obeyed the same old performance characteristics. Granted, it was great fun to read through the P4 optimization manual every few months, and our application got a lot better, but this was more of a personal fun project for me, not something my employer would willingly invest time in.

And we even have control of our own compiler; most other companies are stuck with whatever their compiler does. And C++ doesn't really allow you to freely set inlining thresholds per function without very messy macrology. That makes applying most of the lessons in the optimization manual very difficult.
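
To give you an idea of the macrology involved -- a minimal sketch of my own, not from any manual, and the macro names are made up:

```cpp
// Per-function inline control in C++ has to go through compiler-specific
// attributes, typically hidden behind macros like these.
#if defined(_MSC_VER)
  #define ALWAYS_INLINE __forceinline
  #define NEVER_INLINE  __declspec(noinline)
#elif defined(__GNUC__)
  #define ALWAYS_INLINE inline __attribute__((always_inline))
  #define NEVER_INLINE  __attribute__((noinline))
#else
  #define ALWAYS_INLINE inline
  #define NEVER_INLINE
#endif

// Hot path: we want this expanded at every call site.
ALWAYS_INLINE int clamp01(int x) { return x < 0 ? 0 : (x > 1 ? 1 : x); }

// Cold path: keep it out of line to save instruction cache.
NEVER_INLINE void report_error(const char* msg);
```

And even then it's all-or-nothing per function, not a tunable threshold.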
 

BitByBit

Senior member
Jan 2, 2005
474
2
81
The Athlon 64's good performance is, in my view, primarily the result of its integrated memory controller, combined with its generally more efficient architecture.
I don't believe the Athlon 64's cache system is particularly impressive, however, given that it is, after all, exclusive, and that its L2 cache is connected to the core via a 128-bit bus; an improvement over the K7's 64-bit bus, but still half the width of Intel's 256-bit 'advanced transfer cache'.
Dothan seems to have the best cache system out there: a large, inclusive L1 cache (64KB) and a huge, low-latency L2 (2048KB).
One overlooked aspect of the Athlon's design is its low instruction latency, that is, the time taken for a single instruction to propagate through the pipeline.
According to my calculations, a Prescott P4 would have to operate at a clock speed of 4.65GHz in order to achieve the same instruction latency as a 1.8GHz K8 (of course, at this clock speed, the Prescott's maximum throughput would be a lot higher).
Then we have the Athlon's ability to do more in parallel, owed to its wider execution core, and the reasons for its high performance become clearer.
Had Intel been successful in scaling Prescott into 5GHz territory as was their intention, we'd very likely be seeing AMD playing catch-up (although it is foolish to speculate on this matter, especially given AMD's resilience).
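
For anyone who wants to check that 4.65GHz figure, here's the back-of-envelope version, assuming the commonly quoted depths of 31 stages for Prescott and 12 for the K8's integer pipeline:

```cpp
// Sketch: instruction latency = pipeline depth / clock frequency.
#include <cstdio>

int main() {
    const double k8_stages = 12.0, k8_ghz = 1.8;
    const double prescott_stages = 31.0;
    double k8_latency_ns = k8_stages / k8_ghz;              // ~6.67 ns
    double required_ghz  = prescott_stages / k8_latency_ns; // ~4.65 GHz
    std::printf("Prescott needs ~%.2f GHz to match\n", required_ghz);
    return 0;
}
```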


 

MyK Von DyK

Member
Nov 24, 2004
63
0
0
I'd shorten the pipeline first and lower the frequency of the main ALU. I'd also cut the cache size in half (power and thermal dissipation issues). Then I'd attach several smaller/faster FPUs, each with its own cache (not big, say 2x16K) and a higher clock frequency, plus one 8-bit DSP running at half the main clock frequency (separate addressing). This should give games/multimedia/scientific calculations quite a boost. If quality/usability is more of an issue than performance, then I'd do the same, except with only one or two separate FPUs, plus one more DSP and one physics unit.
 

hurtstotalktoyou

Platinum Member
Mar 24, 2005
2,055
9
81
This may be an over-simplified way of looking at things, but it seems like part of the problem with x86 CPUs in general is that they are too reliant on old, out-of-date technology. All this concern for backwards compatibility seems to be getting in the way of real progress. I'd be curious to see what would happen if Intel and Microsoft worked together on a new, built-from-scratch CPU/OS combination. The new configuration would have to run old software, but a collaboration between the two companies could allow for a sort of x86 emulation. Microsoft, in the meantime, could focus on stability.
 

uOpt

Golden Member
Oct 19, 2004
1,628
0
0
Originally posted by: hurtstotalktoyou
This may be an over-simplified way of looking at things, but it seems like part of the problem with x86 CPUs in general is that they are too reliant on old, out-of-date technology. All this concern for backwards compatibility seems to be getting in the way of real progress. I'd be curious to see what would happen if Intel and Microsoft worked together on a new, built-from-scratch CPU/OS combination. The new configuration would have to run old software, but a collaboration between the two companies could allow for a sort of x86 emulation. Microsoft, in the meantime, could focus on stability.

Itanium?
 

Leper Messiah

Banned
Dec 13, 2004
7,973
8
0
It all comes down to an evolutionary vs. a revolutionary process. NetBurst was supposed to be revolutionary and allow for these massive clock speeds, correct? It was going to be the brute-force method to own AMD.

AMD's processor is more evolutionary. The A64 is basically a Barton with an integrated memory controller and 64-bit instructions tacked on. Still, it's far more efficient than a Prescott, and apparently AMD's wizards have been able to tweak the process to allow for higher speeds (although we should have known that Winchesters could do 2.6, since mobile Athlons can).

If I were Intel, what would I do? Buy AMD.
 

clarkey01

Diamond Member
Feb 4, 2004
3,419
1
0
Originally posted by: MartinCracauer
Originally posted by: hurtstotalktoyou
This may be an over-simplified way of looking at things, but it seems like part of the problem with x86 CPUs in general is that they are too reliant on old, out-of-date technology. All this concern for backwards compatibility seems to be getting in the way of real progress. I'd be curious to see what would happen if Intel and Microsoft worked together on a new, built-from-scratch CPU/OS combination. The new configuration would have to run old software, but a collaboration between the two companies could allow for a sort of x86 emulation. Microsoft, in the meantime, could focus on stability.

Itanium?



x86 does have its problems, but a lot of PPC and IA64 supporters honestly couldn't tell you what these problems are, only that they exist. The PPC supporters are especially guilty of this.. not to say that the latest PowerPC processors are nothing special (they're really quite nice), just that Apple fanboyism often leaks over into architecture.

If you ask me, we're past most of what made x86, well, suck. Modern x86 CPUs are really just RISC in sheep's clothing (hey, who said I couldn't randomly mix analogies?) for the most part, and the only reason x86 is said to drag us down is that CPUs basically have to translate x86 instructions into an easier-to-digest form (basically/semi-incorrectly: CISC to RISC, stupid to smart). Do we lose some speed doing this? Sure! Do we lose enough that we need to go through making an entirely new architecture for PCs? Well.. maybe not.

The trouble isn't so much designing the new architecture, really. The engineers doing this probably find it fun. The trouble is that you have to tell the market "hey, we're going to break compatibility with everything out right now, but look at it this way-- if you buy all this expensive new hardware then run the expensive software coded for this new architecture, you'll get a moderate speed boost over the old hardware!" Who out there is going to say "ooh! Me first!"?

There are more.. delicate ways to handle this situation, of course. Ace's Hardware went over it in that Kill x86 article of theirs. It may not be the easiest thing in the world, but it would be relatively painless for the market. The point is that these light-handed (on the "market treatment" side, never mind the poor engineers who're told that they have to make a CPU that's effectively two architectures in one) ways of introducing a new instruction set / arch were not the ways that Intel chose.

But I digress, a lot. Let's assume that Intel somehow makes the Itanium 2 emulate x86 code at a reasonable pace. Now you just have the issue of getting it to the market, right? Surely that's all? No, sadly, it isn't. The 1GHz LV Deerfield puts out 62 watts of heat over a 180mm^2 die. Does that sound familiar? A die the size of a farm animal, power consumption in the low sixties? Why, that's what the 130nm Opteron 246s look like. Except they're a lot faster than 1GHz Deerfields, even with this horrible "maintaining backwards compatibility" deal, and even running in 32-bit mode only.

But that's not really fair, is it? The 1GHz Deerfield is awfully slow. It's an LV part, after all. But that brings me to my other point: the LOW-VOLTAGE part puts out 62W of heat. Even someone a few crayons short of a box (someone such as myself, I guess) can see that you might just have a few heat output issues with the non-LV parts. Huge die or no, that's a lot of heat to dump into a PC. And what do you get out of it? Something maybe as fast as an Opteron that, if market adoption suddenly grew by an ENORMOUS amount, might not cost TOO much more.

The Itanium may have been promising at its debut, but the fact is that it's ill-suited to anything except massively parallel supercomputers... and even in those, there aren't really many reasons to use them over other, better processors (did someone say POWER5?)

In short:

- The Itanium is a bad choice for desktops for several obvious reasons. Even assuming excellent software support and dirt cheap motherboards, you still have heat output, power consumption, and price : performance.

- The Itanium is a bad choice for most workstations simply due to low price : performance. Workstations perform a lot of different work, and this work is often float-heavy (and when it's int-heavy, why you shouldn't pick an Itanium is pretty obvious), but it also generally benefits from multiple CPUs, and you easily could get a quad Opteron for the price of a dual Itanium 2.

- The Itanium is a bad choice for servers because of low price : performance and the integer-heavy work that servers are meant to perform. (What, you really think serving web pages involves a lot of floating point ops?)

- The Itanium is a moderately poor choice for massive parallel computers because of low price : performance and high heat output (both of which become very important on a large scale, especially the former). I say moderately poor because, hey, at least it scales well, right?

In short: IA-64 has/had promise, but the Itanium doesn't really have a niche.

Performance relative to clock speed is very good, but really, who cares about performance relative to clock speed when you can look at performance or performance relative to price? Frankly, I don't care what clock speed a CPU runs at, I just want to see how it performs. To put it another way, an 800MHz Pentium III offers better performance : clock speed than a P4 3.4C, but which is better? (It's an extreme example, of course; I'm just pointing out that performance : clock speed isn't important when you just want to see performance.)

The Itanium does well in fp, yeah, but you still have price-performance (stupid :p emoticon ruining my ratio sign!) to worry about. You could easily get a dual Opteron 250 for much less than a single Itanium 2 1.5GHz, and the dual Opteron 250 would outperform the Itanium 2 in any SMP-aware tasks (and like I said before, if you're paying that much for CPUs, you're going to be using an SMP-aware app). Why bother paying more for nothing?

Performance is VERY dependent on the compiler for the Itanium, yeah. We've already gotten a series of compiler speed boosts, though... if we see anything more, I doubt it'll be any greater than what we'll see in the future for the Opteron.

You could say it's about performance, yes, but ignoring price is silly. If you want to ignore price, you could say there's no reason to buy anything other than the FX-55 for a gaming desktop, because everything else is slower in games. Truth is, not everyone would want to pay that much.. and in this thread's case, it's nice to get better performance at a lower price point (dual Opteron : single Itanium 2).

Take your pick: very large die, yield problems, lack of demand. I think it's a combination of the first and third, myself... which, unfortunately, means that demand alone wouldn't bring prices down.

Development of the first Itanium cost HP and Intel around $4 billion; I don't think they even sold $4 billion worth of Itanium/Itanium 2 processors in those four years.
 

clarkey01

Diamond Member
Feb 4, 2004
3,419
1
0
But in Prescott's case, it's outweighed by the longer pipeline that NEEDS more cache to feed it just to reach Northwood's performance level. It's sort of like saying that Car X has more horsepower than Car Y and therefore must be faster, even though Car X is so much heavier that they perform exactly the same. The processor needs the extra cache just to haul it up to the speed of its predecessor; since the advantage of extra cache is extra performance, and since said extra performance is not provided in this case, the extra cache isn't really a benefit. Then you have the higher latency of the larger cache, which surely helps to mask any small performance advantage that could theoretically exist despite the drastically lengthened integer pipeline.

Prescott does have its advantages, mind, I just think listing cache as one of them is silly when it's not doing anything for performance overall. I'm sure there are a few isolated examples where it actually helps, but I'm equally sure there are just as many or more isolated examples where the higher latency (and/or the longer pipeline, possibly) hurts.

You'll notice I'm not commenting (well, until now) on the vast number of scenarios in which Prescott is slightly slower than Northwood. This is because no one in their right mind should care about a microscopic performance difference, no matter how often it occurs.
 

uOpt

Golden Member
Oct 19, 2004
1,628
0
0
On the topic of whether x86 is actually bad:

Working on a compiler for a garbage-collected language, I can tell you that x86 with its few registers sucks compared to actual RISC chips.

Modern x86 CPUs are different CPUs with an x86 emulator on top anyway. Which is mostly fine, unless the top view has a severe limitation that the actual CPU underneath wouldn't have. The few registers on i386 and the "architecture" of i387 are examples.

Same thing as with the Java VM: it's fine for certain languages with certain features, but once you have, e.g., a language that needs integer overflow checking, you get a huge performance drop that you wouldn't have when compiling to the actual processor rather than the JVM.
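
As a minimal sketch of what I mean (my own illustration), here is what an overflow-checked add can compile down to natively -- a widening add plus a range test, which a decent compiler reduces to little more than an add and a jump-on-overflow:

```cpp
#include <cstdint>
#include <stdexcept>

int32_t checked_add(int32_t a, int32_t b) {
    // Widen, add, and test the 32-bit range: a couple of cheap ALU ops
    // when compiling to the actual processor. Stack-VM bytecode gives
    // you no such access to the CPU's overflow flag.
    int64_t r = static_cast<int64_t>(a) + b;
    if (r < INT32_MIN || r > INT32_MAX)
        throw std::overflow_error("integer overflow");
    return static_cast<int32_t>(r);
}
```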
 

clarkey01

Diamond Member
Feb 4, 2004
3,419
1
0
Considering Northwood's 21-stage pipeline vs. Prescott's 31-stage pipeline, it just goes to show that pipeline increases do not affect performance the way most people think. In fact, most instructions will never go down all 31 pipe stages. Increasing the pipe stage count increases the latency it takes for a single instruction to complete, but it allows higher clock speeds and thus greater throughput (number of instructions completed in a given amount of time). Fact is, Intel does actually know how to design a decent microprocessor. Granted, a number of decisions may be made for marketing reasons, but they always back it up with a great product. Hyperpipelining is simply a design approach, not a right or wrong decision. Intel has chosen this method and it appears to work for them.

Also remember, when frequency goes up, the time required for a single cycle gets smaller. There are a huge number of operations that benefit from very short cycles, like additions and simple logic ops (ORs, ANDs, NOTs, adds, subtracts). These are the types of instructions that, once the operands are in registers, don't need to access memory or the FPU pipes, so they complete in the pipeline very quickly. This explains why video work (which is mostly masking, scaling and interlacing operations like adds, parallel SIMD multiplies and other packed SIMD ops) benefits so much from Intel's design. Simple ALU ops take only one cycle to get through the ALU; the faster you can make that single cycle (frequency increase), the faster those operations will go. And the faster the video performance will be.
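
To illustrate (a toy example of my own, not real codec code), this is the shape of a typical video inner loop -- practically nothing but single-cycle ANDs, ORs and adds:

```cpp
#include <cstdint>
#include <cstddef>

// Blend src into dst wherever mask is set: pure mask/merge integer work,
// no FPU, no complex addressing -- exactly the ops that love a fast cycle.
void blend_mask(uint8_t* dst, const uint8_t* src,
                const uint8_t* mask, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = (dst[i] & ~mask[i]) | (src[i] & mask[i]);
}
```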
 

clarkey01

Diamond Member
Feb 4, 2004
3,419
1
0
Originally posted by: MartinCracauer
On the topic of whether x86 is actually bad:

Working on a compiler for a garbage-collected language, I can tell you that x86 with its few registers sucks compared to actual RISC chips.

Modern x86 CPUs are different CPUs with an x86 emulator on top anyway. Which is mostly fine, unless the top view has a severe limitation that the actual CPU underneath wouldn't have. The few registers on i386 and the "architecture" of i387 are examples.

Same thing as with the Java VM: it's fine for certain languages with certain features, but once you have, e.g., a language that needs integer overflow checking, you get a huge performance drop that you wouldn't have when compiling to the actual processor rather than the JVM.

Yup, I'd have to agree the instruction set blows. I recently took a look at SSE and it too blows (compared to MIPS VU and PowerPC AltiVec). Just count the number of instructions to do a 4x4 matrix multiply(!) No multiply-accumulate??? No broadcast during arithmetic??? Overlapping src/dst? Only eight XMM registers? And forget using MMX inline intrinsics, blast that f'ing emms instruction. And the equivalent integer operations for SSE didn't mature until SSE2, which limits you to the P4 or Athlon 64. Doh!
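
Seriously, try it -- here's one output row of a row-major 4x4 multiply in SSE intrinsics (my own sketch): every "broadcast" is an explicit shuffle, and with no multiply-accumulate each term is a separate mul plus add. Multiply the instruction count by four rows and weep:

```cpp
#include <xmmintrin.h>

// out = rowA * B, where B is a row-major 4x4 matrix of floats.
void mat4_mul_row(float* out, const float* rowA, const float* B) {
    __m128 a  = _mm_loadu_ps(rowA);   // a0 a1 a2 a3
    __m128 b0 = _mm_loadu_ps(B + 0);
    __m128 b1 = _mm_loadu_ps(B + 4);
    __m128 b2 = _mm_loadu_ps(B + 8);
    __m128 b3 = _mm_loadu_ps(B + 12);

    // No broadcast-during-arithmetic: shuffle to splat each element first.
    __m128 r =        _mm_mul_ps(_mm_shuffle_ps(a, a, 0x00), b0);  // a0*B[0]
    r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(a, a, 0x55), b1)); // +a1*B[1]
    r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(a, a, 0xAA), b2)); // +a2*B[2]
    r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(a, a, 0xFF), b3)); // +a3*B[3]
    _mm_storeu_ps(out, r);
}
```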

I do wish Intel would standardize the internal micro-op instruction set and allow asm coders to write directly to it, instead of treating the chip as a CPU that can only "emulate" x86. Maybe that's Itanium's role, although it appears to be facing the same fate as the i860 -- pieces migrated back into the x86 black box. Doh!

But at the end of the day, I can't really complain about how fast the latest x86 designs can chew up general-purpose C code.
 

Promit

Member
Mar 28, 2005
55
0
0
If I could compile to Intel or AMD microcode instead of x86...

Well, in one sense that'd be pretty awesome. But it seems like I'd risk fragmentation when suddenly the x86 platform becomes a dozen different platforms. And shipping a binary for every CPU and asking the user to figure out what CPU they have...nope, doesn't work.


As far as the P4 goes, well, in short I'd build it like an Athlon. The IBM 970 (G5) is a fairly interesting architecture, though I'm a little skeptical of some things in it. It might theoretically be able to have 200+ instructions in flight, but given the restrictions on how instructions can be grouped, I'm wondering how many are actually in flight at any given time in general use.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Does anyone remember the Xeon "Irwindale" DP article by this site a couple of months ago? Can't you see that the performance increase from doubling the cache was significant?

Does anyone also remember the 6xx Pentium 4 review by this site? Did you see how the performance increase was almost nothing?

Does anyone see the significance? Prescott is more of a workstation/server chip than a desktop chip.



TO THOSE IDIOTS WHO THINK INTEL'S 90nm LEAKAGE IS THE MAIN REASON PRESCOTT RUNS HOT: Dothan runs at a 17.6% higher clock speed with a 13.3% lower TDP, which means the power reduction is somewhere around 30% at least.
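
Check the math yourself (I'm assuming the comparison baseline here is Banias):

```cpp
#include <cstdio>

int main() {
    const double clock_ratio = 1.176;  // Dothan clocks 17.6% higher
    const double tdp_ratio   = 0.867;  // at 13.3% lower TDP
    // Dynamic power scales roughly linearly with frequency, so scaled
    // back to an equal clock the power ratio is about:
    const double equal_clock = tdp_ratio / clock_ratio;   // ~0.74
    std::printf("~%.0f%% lower power at the same clock\n",
                (1.0 - equal_clock) * 100.0);              // ~26%
    return 0;
}
```

Linear frequency scaling alone gives ~26%; Dothan's lower core voltage pushes the real figure higher still, which is presumably where "at least 30%" comes from.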


Another speculation: Intel had a presentation, and one of the slides was titled "Power as a competitive advantage". According to Intel, their process technology has lower leakage at the same clock speed than any other 90nm process, SOI or not. Why might Prescott's transistors have high amounts of leakage? Because Intel said that for 10% faster transistors, leakage increases by a COUPLE OF TIMES!!!

Intel described the 65nm power-saving trade-off like this: "You can either have 15% faster switching transistors, or 4x the leakage reduction at the same switching speed." So we'll see something like this: Yonah using low-leakage, slower transistors; Presler using higher-leakage, faster transistors, for example.

 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
Originally posted by: IntelUser2000
Intel described the 65nm power-saving trade-off like this: "You can either have 15% faster switching transistors, or 4x the leakage reduction at the same switching speed." So we'll see something like this: Yonah using low-leakage, slower transistors; Presler using higher-leakage, faster transistors, for example.

That trade-off has been used in industry and academia for years now. If you have two paths leading to a gate, and one has a lot of crap on it and the other has almost nothing, there's no point in putting blazingly fast gates in the 'almost nothing' path, because the speed will be gated by the longer path. So you dump in small, low-leakage transistors.
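
In toy-tool form, the assignment is basically this (a sketch of my own, all names and numbers made up):

```cpp
#include <vector>

struct Gate {
    double path_delay_ns;  // worst path delay through this gate, fast cells
    bool   low_leakage;    // true -> use the slow, low-leakage cell
};

// If the slow cell still meets timing, take the leakage savings;
// only the paths with no slack get the fast, leaky transistors.
void assign_cells(std::vector<Gate>& gates, double clock_period_ns,
                  double slow_penalty /* e.g. 1.15 = 15% slower */) {
    for (Gate& g : gates)
        g.low_leakage = g.path_delay_ns * slow_penalty <= clock_period_ns;
}
```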
 

clarkey01

Diamond Member
Feb 4, 2004
3,419
1
0
Originally posted by: IntelUser2000
Dothan runs at a 17.6% higher clock speed with a 13.3% lower TDP, which means the power reduction is somewhere around 30% at least.

You may want to rethink that, or maybe it's me, but I didn't know Dothan was hitting 3GHz+.
 

travers

Junior Member
Apr 1, 2005
4
0
0
Reduce the pipeline depth and ditch the extra branch predictors. Make the thing more efficient per clock cycle. Don't chop everything into little pieces. Don't suggest a new form factor to cool your hot chips.
 

Fox5

Diamond Member
Jan 31, 2005
5,957
7
81
Hmm, this is what the future holds for AMD and Intel:
The Pentium M will be made more like the Athlon core (more powerful).
The Athlon core will be made more like the Pentium M (more cache, or at least that's what AMD planned for the Athlon core when it was introduced six years ago).
And both will converge on a common middle ground with multi-core CPUs.

I wouldn't say there's anything to be done to fix Prescott, because there's nothing wrong with it besides running too hot. Just slap some water cooling on it, or go back to Northwood and focus on dual-core Northwoods, or Northwoods with large amounts of cache. Hmm, well, that could have made Prescott less complicated, and then maybe things would have worked out.

BTW, I know one person who argued for Itanium 2 over Opteron, saying something about how the massive Opteron farm that was being developed somewhere (Red Octane or something like that) is way late, way over budget, and costs more than the more powerful Itanium system that NASA bought.

Considering Northwood's 21-stage pipeline vs. Prescott's 31-stage pipeline, it just goes to show that pipeline increases do not affect performance the way most people think. In fact, most instructions will never go down all 31 pipe stages.

As long as the pipelines are kept full, there is no performance decrease. Plus, when the choice is a long pipeline versus a wider pipeline like the Athlon's, I think there are more advantages to the long one than the wide one.
 

BitByBit

Senior member
Jan 2, 2005
474
2
81
Originally posted by: Fox5
As long as the pipelines are kept full, there is no performance decrease.

Operating at the same clock speed, a longer-pipelined processor will take a performance hit in branch-intensive code due to its higher instruction latency (something I talked about previously in this thread), which means the core takes longer to flush and refill than a shorter-pipelined processor at the same clock.
Non-branch-intensive programs, like rendering for example, will show a negligible performance hit.
The point of extending the pipeline is to allow for higher clock speeds, and hence higher throughput.
If you double a processor's pipeline depth while doubling its clock speed, the instruction latency should be maintained, which means a branch misprediction costs the same amount of time spent refilling the pipeline.
However, doubling the clock speed in this case has also doubled the theoretical maximum throughput - something that would be especially evident in rendering and encoding.
As we have seen with Prescott, though, extending the pipeline does not always guarantee higher clock speeds.
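
A quick numeric illustration of that doubling argument (made-up numbers):

```cpp
#include <cstdio>

int main() {
    const double stages = 20.0, ghz = 2.0;
    // Misprediction penalty ~ time to refill the pipe: depth / frequency.
    double penalty_ns        = stages / ghz;                  // 10 ns
    double penalty_ns_scaled = (stages * 2.0) / (ghz * 2.0);  // still 10 ns
    std::printf("penalty: %.1f ns vs %.1f ns (but throughput doubled)\n",
                penalty_ns, penalty_ns_scaled);
    return 0;
}
```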



 

Slaimus

Senior member
Sep 24, 2000
985
0
76
I do not understand why Intel does not handpick less leaky but slower transistors in the new revision, since they know they will not have to break the 4GHz barrier.