What's the hangup with AMD's MHz?

oconnect

Member
Jun 29, 2004
50
0
0
I've noticed that AMD is lacking in the GHz arena. I've also noticed that AMD utilizes clock cycles more effectively than Intel, which is why Intel must push more MHz than AMD. My system is clocked at 2.3 GHz. If I bench my computer against a P4 at 2.3 GHz, I kick its butt.

I think an excellent thing for AMD to do would be to work on catching up with Intel in MHz while continuing the better utilization of clock cycles. If that could be done, it would shut up those Intel lovers out there.

This is not an appropriate opening post. Next time, this will be locked.

AnandTech Moderator
 

SuperTool

Lifer
Jan 25, 2000
14,000
2
0
Thanks for your highly technical advice. Has it occurred to you that the slower the clock rate is, the more time you have per clock, and the more you can accomplish per clock? AMD is not lacking in the GHz arena. Their engineers just made different design decisions from Intel, better ones in my opinion. Intel is moving away from P4-style high-MHz designs because they aren't very efficient in terms of energy, cooling, design resources, etc.
 

AFB

Lifer
Jan 10, 2004
10,718
3
0
Originally posted by: SuperTool
Thanks for your highly technical advice. Has it occurred to you that the slower the clock rate is, the more time you have per clock, and the more you can accomplish per clock? AMD is not lacking in the GHz arena. Their engineers just made different design decisions from Intel, better ones in my opinion. Intel is moving away from P4-style high-MHz designs because they aren't very efficient in terms of energy, cooling, design resources, etc.

:D
 

Mday

Lifer
Oct 14, 1999
18,647
1
81
Originally posted by: SuperTool
Thanks for your highly technical advice. Has it occurred to you that the slower the clock rate is, the more time you have per clock, and the more you can accomplish per clock? AMD is not lacking in the GHz arena. Their engineers just made different design decisions from Intel, better ones in my opinion. Intel is moving away from P4-style high-MHz designs because they aren't very efficient in terms of energy, cooling, design resources, etc.

That has to be the worst defense of AMD I have ever heard.

It's not that the Athlon chips are doing MORE per clock cycle. It's that the P4 takes more clock cycles to do things. Without multi-core or multiprocessor setups, ANY PROCESSOR will do ONE THING and ONE THING ONLY per clock cycle, and that's either something or nothing.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: Mday
Originally posted by: SuperTool
Thanks for your highly technical advice. Has it occurred to you that the slower the clock rate is, the more time you have per clock, and the more you can accomplish per clock? AMD is not lacking in the GHz arena. Their engineers just made different design decisions from Intel, better ones in my opinion. Intel is moving away from P4-style high-MHz designs because they aren't very efficient in terms of energy, cooling, design resources, etc.

That has to be the worst defense of AMD I have ever heard.

It's not that the Athlon chips are doing MORE per clock cycle. It's that the P4 takes more clock cycles to do things. Without multi-core or multiprocessor setups, ANY PROCESSOR will do ONE THING and ONE THING ONLY per clock cycle, and that's either something or nothing.

Given that Intel canned Tejas because it was too hot, and Sun canned their SPARC V, it seems like many of the big guys are having problems gaining more performance by scaling frequency. MHz is overrated. Power = C*V*V*F, where F is frequency, so as you scale frequencies up, power tends to go up too. Obviously process shrinks (0.18µm -> 0.13µm) can help mitigate this, but even though we're at 90nm now, any desktop chip burns more power than even high-end Alpha processors, which were considered hot at just 30W not too long ago.
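To see how that scales, a quick back-of-the-envelope sketch; all the capacitance, voltage, and frequency numbers below are made up for illustration, not specs of any real chip.

# Rough dynamic-power scaling: P ~ C * V^2 * F (activity factor folded into C).
# All numbers below are invented for illustration.
def dynamic_power(c_farads, v_volts, f_hz):
    return c_farads * v_volts ** 2 * f_hz

base = dynamic_power(50e-9, 1.5, 2.0e9)      # hypothetical 2 GHz part
faster = dynamic_power(50e-9, 1.5, 3.0e9)    # same chip pushed to 3 GHz
hotter = dynamic_power(50e-9, 1.65, 3.0e9)   # 3 GHz, but needing a 10% vcore bump
print(faster / base)   # 1.5  -> power grows linearly with frequency at fixed voltage
print(hotter / base)   # ~1.8 -> grows faster once the higher clock also needs more voltage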

AMD, Intel, and Sun have all announced plans to go multi-core. This makes sense, because like in a lot of fields, getting 90% of the performance takes 50% of the effort (made up those numbers, but you should see my point). For example, to predict branches with 90% accuracy requires only a set of 2-bit counters. To get 99% requires ridiculously complicated logic (and large amounts of it). As another example, a "standard" out-of-order pipeline that just handles the basic data dependencies isn't too complicated, but to gain more performance, you have to do some really crazy optimizations (for example, do tricks like value prediction by throwing in large amounts of hardware, even though you only gain a few percent more performance). With multi-core parts, you do the 50% of the work to reach 90%, and then just dump a whole bunch of those CPUs in a system and convince the software guys to write multi-threaded software.
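For reference, the 2-bit counter scheme mentioned above is simple enough to sketch in a few lines; this is a toy model with an invented table size, not any shipping predictor.

# Toy 2-bit saturating-counter branch predictor (the cheap "90%-ish accuracy" case above).
# States 0-1 predict not-taken, 2-3 predict taken; one counter per table entry.
class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.table = [2] * entries        # start in "weakly taken"
        self.mask = entries - 1

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)

# A loop branch taken 9 times and then falling through: only the exit mispredicts.
p = TwoBitPredictor()
correct = 0
for taken in [True] * 9 + [False]:
    correct += (p.predict(0x400) == taken)
    p.update(0x400, taken)
print(correct, "/ 10")   # 9 / 10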

Maybe in an ideal world with perfect branch prediction, no interrupts, and no exceptions, but in reality, longer pipelines aren't always better performers (as you're implying). It's reasonable to say that the P4 does less useful work per cycle.

Your statement that the P4 takes more cycles to do stuff, but doesn't do less per cycle, implies that it does more work overall. This is true from the perspective of the raw number of instructions executed, but not true if you look at the number of instructions that are committed and not killed (if it were, Athlons would be getting slaughtered in pretty much every benchmark).

At a circuit level, higher clock speeds imply less work per nanosecond because you spend a higher percent of your time doing nothing but waiting in flip flops - if you assume your flipflop takes 10ps, then at 1GHz you can do 990ps of useful work per cycle, so you're doing work 99% of the time, but at 10GHz you can do 90ps of work per cycle, 10 cycles per nanosecond, and end up wasting 100ps in flipflops doing nothing every nanosecond.
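Putting numbers on that overhead; the 10ps flop delay is the illustrative figure from the paragraph above, not a measured value.

# Fraction of each cycle left for useful logic after paying a fixed flip-flop delay.
def useful_fraction(freq_ghz, flop_ps=10.0):
    period_ps = 1000.0 / freq_ghz
    return (period_ps - flop_ps) / period_ps

print(useful_fraction(1.0))    # 0.99 -> 990 ps of logic per 1000 ps cycle
print(useful_fraction(10.0))   # 0.90 -> only 90 ps of logic per 100 ps cycle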

At an architectural level, you have things like data dependencies whose relative effects get worse as latencies in number of cycles goes up. If you have a 1GHz processor that can multiply two 32-bit numbers in 4 cycles, and a 2GHz processor that can do it in 8 cycles (same amount of time), when you have a data dependency so a future instruction depends on a multiply, the 1GHz part has to wait 3 cycles, so 3 cycles were wasted, but the 2GHz part wasted 7 cycles (more than double).
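The same dependent-multiply example worked out as arithmetic; both multipliers are hypothetical, with the latencies taken straight from the paragraph above.

# Idle cycles seen by an instruction that depends on a multiply's result.
# A dependent op can issue the cycle the result is forwarded, so latency-1 cycles are spent waiting.
def wait_cycles(mul_latency_cycles):
    return mul_latency_cycles - 1

print(wait_cycles(4))   # 3 -> the 1 GHz part with a 4-cycle multiply
print(wait_cycles(8))   # 7 -> the 2 GHz part with an 8-cycle multiply (same wall-clock latency)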
 

cquark

Golden Member
Apr 4, 2004
1,741
0
0
Originally posted by: Mday
It's not that the Athlon chips are doing MORE per clock cycle. It's that the P4 takes more clock cycles to do things. Without multi-core or multiprocessor setups, ANY PROCESSOR will do ONE THING and ONE THING ONLY per clock cycle, and that's either something or nothing.

It's more complex than that. Both AMD and Intel processors can issue multiple instructions per clock cycle.

Even if a processor did perform one action per pipeline stage, that one action on one processor can be divided into several things on another processor. Look at how the P4 reduced the work done by some pipeline stages when they went from around 20 stages in WMT/NWD to 31 stages in Prescott.
 

imgod2u

Senior member
Sep 16, 2000
993
0
0
Originally posted by: CTho9305
Originally posted by: Mday
Originally posted by: SuperTool
Thanks for your highly technical advice. Has it occurred to you that the slower the clock rate is, the more time you have per clock, and the more you can accomplish per clock? AMD is not lacking in the GHz arena. Their engineers just made different design decisions from Intel, better ones in my opinion. Intel is moving away from P4-style high-MHz designs because they aren't very efficient in terms of energy, cooling, design resources, etc.

That has to be the worst defense of AMD I have ever heard.

It's not that the Athlon chips are doing MORE per clock cycle. It's that the P4 takes more clock cycles to do things. Without multi-core or multiprocessor setups, ANY PROCESSOR will do ONE THING and ONE THING ONLY per clock cycle, and that's either something or nothing.

Given that Intel canned Tejas because it was too hot, and Sun canned their SPARC V, it seems like many of the big guys are having problems gaining more performance by scaling frequency. MHz is overrated. Power = C*V*V*F, where F is frequency, so as you scale frequencies up, power tends to go up too. Obviously process shrinks (0.18µm -> 0.13µm) can help mitigate this, but even though we're at 90nm now, any desktop chip burns more power than even high-end Alpha processors, which were considered hot at just 30W not too long ago.

You'll notice in that equation that capacitance, C, is just as much a factor in power consumption as frequency. What do you think you would need to do to make a K7 instead of a Netburst core? You need to add more decoders, more execution units, more issue ports, etc. That increases capacitance and thus increases power consumption. Power issues come from both wide-and-short and narrow-and-long designs alike, and neither is inherently "worse" at it. It's a matter of implementation and how efficiently you take advantage of the resources you have, not how much work you get done per cycle. Case in point: Itanium. It extracts massive amounts of ILP and achieves a very high IPC, and yet it consumes tons of power (more than any Prescott out there, at only 1.6 GHz).
The K7 doesn't utilize "cycles" more, it simply has more parallel execution resources. The K8, on the other hand, does utilize its cycles more, as it spends less time idling on memory misses (thanks to the integrated memory controller), and *that* is efficiency, and it has little to do with whether the chip is a high-clockspeed or high-IPC design.

AMD, Intel, and Sun have all announced plans to go multi-core. This makes sense, because like in a lot of fields, getting 90% of the performance takes 50% of the effort (made up those numbers, but you should see my point). For example, to predict branches with 90% accuracy requires only a set of 2-bit counters. To get 99% requires ridiculously complicated logic (and large amounts of it). As another example, a "standard" out-of-order pipeline that just handles the basic data dependencies isn't too complicated, but to gain more performance, you have to do some really crazy optimizations (for example, do tricks like value prediction by throwing in large amounts of hardware, even though you only gain a few percent more performance). With multi-core parts, you do the 50% of the work to reach 90%, and then just dump a whole bunch of those CPUs in a system and convince the software guys to write multi-threaded software.

Unfortunately, while it is easy, it's also incredibly inefficient. With dual-core you have twice the power consumption, twice the heat output and, depending on your packaging (I'm assuming stacked cores), twice the heat/area. And you gain 0% increase in single-threaded applications and perhaps 80% performance increase in well-balanced multithreaded applications. People aren't going dual-core because it's more efficient than other methods (increasing clockspeed, widening the core), they're going that way because the other two methods have reached the point of diminishing returns (whether it's by jamming more execution resources into the chip or extending the pipeline) and dual-core is another method (albeit another inefficient method) to increase performance for some applications.
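A rough way to see that trade-off; the 80% figure comes from the paragraph above, and the parallel fractions and scaling efficiency are made up for illustration.

# Amdahl-style payoff of a second core under the assumptions above.
def speedup(parallel_fraction, cores=2, scaling_efficiency=0.9):
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / (cores * scaling_efficiency))

print(speedup(0.0))   # 1.0   -> purely single-threaded code gains nothing
print(speedup(1.0))   # 1.8   -> "well-balanced" multithreaded code, the ~80% gain above
print(speedup(0.5))   # ~1.29 -> a mixed workload lands somewhere in between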

Maybe in an ideal world with perfect branch prediction, no interrupts, and no exceptions, but in reality, longer pipelines aren't always better performers (as you're implying). It's reasonable to say that the P4 does less useful work per cycle.

It has been in the past. The Alphas around the time of the Pentium II didn't dominate their competition because of a wide core; they dominated because they clocked the highest. Traditionally, higher clockspeeds have been a better method to gain performance. Of course, as with anything, you get to a point where it stops being so great, and I think Prescott has reached and exceeded that point.

Your statement that the P4 takes more cycles to do stuff, but doesn't do less per cycle, implies that it does more work overall. This is true from the perspective of the raw number of instructions executed, but not true if you look at the number of instructions that are committed and not killed (if it were, Athlons would be getting slaughtered in pretty much every benchmark).

It does take more cycles to do stuff. The P4 is a 6-way, 6-issue (counting the double-pumped ALUs) design while the K7/K8 is a 9-way, 9-issue design. That's 50% more the K7/K8 can do at peak every clock cycle (requiring, of course, more hardware on the chip). The K7/K8 also has a 3-way decoder that's capable of decoding any 3 x86 instructions and issuing 3 macro-ops (each a fused pair of micro-ops) per clock cycle. This issue rate (although micro-op implementations differ, they're usually similar for simple instructions) is twice that of the P4's trace cache. Of course, there are all sorts of things in the P4 (Northwood) design that made it come up short: only 1 FP issue port, no dedicated shifters, etc., but solving those problems requires more hardware. Intel relied on software to get around them.

At a circuit level, higher clock speeds imply less work per nanosecond because you spend a higher percent of your time doing nothing but waiting in flip flops - if you assume your flipflop takes 10ps, then at 1GHz you can do 990ps of useful work per cycle, so you're doing work 99% of the time, but at 10GHz you can do 90ps of work per cycle, 10 cycles per nanosecond, and end up wasting 100ps in flipflops doing nothing every nanosecond.

I don't know where you get this notion, but the whole point of a clock cycle is to synchronize things, which means multiple events *do not occur* in one clock cycle. Each clock cycle, each part of a circuit does exactly one thing, whether that means doing it and waiting for 990ps (as in the case of accessing a flip-flop that takes 10ps to access) or doing it and waiting for 90ps (as with the 10GHz circuit). There are tricks of course, such as using both edges of the clock to trigger, but again, it has to be synchronized. Only one event occurs; if you had more, you could never synchronize multiple events.
Sorry, but delays due to higher frequencies do not occur at the circuit level.

At an architectural level, you have things like data dependencies whose relative effects get worse as latencies in number of cycles goes up. If you have a 1GHz processor that can multiply two 32-bit numbers in 4 cycles, and a 2GHz processor that can do it in 8 cycles (same amount of time), when you have a data dependency so a future instruction depends on a multiply, the 1GHz part has to wait 3 cycles, so 3 cycles were wasted, but the 2GHz part wasted 7 cycles (more than double).

Erm, no, data dependencies do not stall modern processors (at least, not on a scalar level). Modern processors use forwarding to deal with dependencies between instructions in the pipeline. The only exceptions to this would be 1. branches and 2. dependencies on loads (memory or cache). The latter is somewhat solved (at least for cache latency) by out-of-order execution. The former is a huge problem (even with 99%+ accurate branch predictors).

Stalls due to memory loads are just as big a problem (perhaps the biggest problem) for a wide-but-short processor as for a narrow-but-long processor. Using your example (but with simpler numbers), let's say the 1 GHz processor does 2 operations each cycle, so 2 32-bit ops take 1 cycle. The other, the 2 GHz processor, does 1 operation each cycle, so 2 32-bit ops take 2 cycles.
If there is a cache miss and the processor stalls on memory (assume similarly clocked memory), then the 1 GHz chip waits 10 cycles (assuming 100 MHz memory and a 1-clock delay for the load) and the 2 GHz chip waits 20 cycles. So yes, the 2 GHz chip wasted more clock cycles, so it's more inefficient, right? No. Clock cycles aren't the only resource a processor has. As I mentioned before, capacitance is also a factor in power usage, and the 1 GHz chip has twice the execution width of the 2 GHz chip, so during that stall it wasted just as much "potential work" (read: idle transistors) as the 2 GHz chip. Had it not stalled, the 1 GHz chip could've done 10 clock cycles x 2 ops/clock = 20 ops. The 2 GHz chip, waiting 20 cycles, could've done 20 clock cycles x 1 op/clock = 20 ops. The same amount of potential work (and active transistor time) is wasted. The only difference is that the 1 GHz chip would have a higher statistical IPC (which is often confused with efficiency). It would still draw as much power and produce as much heat due to the waste.
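The arithmetic in that example, spelled out; the widths, clocks, and the 10ns stall are the hypothetical numbers above.

# Potential work thrown away during one memory stall, for the two hypothetical chips above.
def lost_ops(freq_ghz, ops_per_cycle, stall_ns=10.0):
    stall_cycles = stall_ns * freq_ghz       # cycles that elapse while waiting on memory
    return stall_cycles * ops_per_cycle

print(lost_ops(1.0, 2))   # 20.0 -> wide 1 GHz chip: 10 idle cycles x 2 ops/cycle
print(lost_ops(2.0, 1))   # 20.0 -> narrow 2 GHz chip: 20 idle cycles x 1 op/cycle, same waste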

So no, there's nothing inherently "more wasteful" about high-frequency, narrow-issue processors vs low-frequency, wide-issue processors. It's implementation-specific (i.e. one processor, say Prescott, may be less efficient than another processor, say the Pentium-M). Again, look at Itanium, very short pipeline, relatively low clockspeeds, very high IPC, and yet, huge power requirements.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: imgod2u
Unfortunately, while it is easy, it's also incredibly inefficient. With dual-core you have twice the power consumption, twice the heat output and, depending on your packaging (I'm assuming stacked cores), twice the heat/area. And you gain 0% increase in single-threaded applications and perhaps 80% performance increase in well-balanced multithreaded applications. People aren't going dual-core because it's more efficient than other methods (increasing clockspeed, widening the core), they're going that way because the other two methods have reached the point of diminishing returns (whether it's by jamming more execution resources into the chip or extending the pipeline) and dual-core is another method (albeit another inefficient method) to increase performance for some applications.
The second processor can be in C1 HLT when you're running single-threaded applications, so you aren't using twice the power. You said yourself you gain 80% performance in well-threaded applications - and your two single cores can be less complex than one core that would have gotten maybe 15% extra performance. Your design cost falls dramatically.

At a circuit level, higher clock speeds imply less work per nanosecond because you spend a higher percent of your time doing nothing but waiting in flip flops - if you assume your flipflop takes 10ps, then at 1GHz you can do 990ps of useful work per cycle, so you're doing work 99% of the time, but at 10GHz you can do 90ps of work per cycle, 10 cycles per nanosecond, and end up wasting 100ps in flipflops doing nothing every nanosecond.

I don't know where you get this notion, but the whole point of a clock cycle is to synchronize things, which means multiple events *do not occur* in one clock cycle. Each clock cycle, each part of a circuit does exactly one thing, whether that means doing it and waiting for 990ps (as in the case of accessing a flip-flop that takes 10ps to access) or doing it and waiting for 90ps (as with the 10GHz circuit). There are tricks of course, such as using both edges of the clock to trigger, but again, it has to be synchronized. Only one event occurs; if you had more, you could never synchronize multiple events.
Sorry, but delays due to higher frequencies do not occur at the circuit level.
It's not that events necessarily occur in less than one clock cycle, it's how many cycles it takes to do something. Take the K7 multiplier: 4 cycles are required to multiply two 32-bit numbers to produce a 32-bit result. If you wanted to drop it into a CPU running at twice the frequency, you'd need to make it an 8-cycle latency, meaning 4 extra flip-flop stages (actually a few hundred or thousand extra flip-flops, meaning HUGE power consumption and extra clock load - even MORE power - but any given path only goes through 4 extra). Since gate delays and flop clk->q delays aren't clock-speed dependent, this means you're wasting 40ps in my 10ps flop example... but real flops are much slower than 10ps, so you're wasting a lot of time doing nothing.
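To put numbers on the extra flop time in the deeper pipe; the 10ps flop is again the illustrative figure, not a real design value.

# Flop delay accumulated along one multiply's critical path, 4-stage vs 8-stage.
def flop_overhead_ps(stages, flop_ps=10.0):
    return stages * flop_ps

print(flop_overhead_ps(4))   # 40 ps through the 4-cycle multiplier
print(flop_overhead_ps(8))   # 80 ps through the 8-cycle version: 40 ps more spent in flops, not logic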

You say more than one event can't occur in a cycle, and depending on your definition of event, this is true. But an event can comprise multiple stages of logic - consider an x86 AGU: you need to add effective address, displacement, and index (potentially also segment base, but K7 and K8 just added an extra cycle in the pipeline if it's nonzero). This could be done in a low-frequency design with a 3:2 compressor, then an adder. In a (ridiculously) high-frequency design with maybe 5 gate delays per cycle, you'd need to put flipflops between the compressor and the adder. You'd be spending unnecessary time in flops as a result of the shorter pipelines used to increase frequency. I'd suspect that any arbitrarily-long event that doesn't need control signal changes part way through it can be done in one clock stage.

At an architectural level, you have things like data dependencies whose relative effects get worse as latencies in number of cycles goes up. If you have a 1GHz processor that can multiply two 32-bit numbers in 4 cycles, and a 2GHz processor that can do it in 8 cycles (same amount of time), when you have a data dependency so a future instruction depends on a multiply, the 1GHz part has to wait 3 cycles, so 3 cycles were wasted, but the 2GHz part wasted 7 cycles (more than double).

Erm, no, data dependencies do not stall modern processors (at least, not on a scalar level).
Please demonstrate how your modern processor schedules the following code:
IMUL AX, 7
IMUL AX, 7
IMUL AX, 7
IMUL AX, 7
....

A MUL reading from and writing to AX (e.g. IMUL AX, 3... on x86, IMUL writes to [DX:]AX) would create stalls if you didn't have enough independent instructions between them. Forwarding doesn't help.

Modern processors use forwarding to deal with dependencies between instructions in the pipeline. The only exceptions to this would be 1. branches and 2. dependencies on loads (memory or cache). The latter is somewhat solved (at least for cache latency) by out-of-order execution. The former is a huge problem (even with 99%+ accurate branch predictors).
That doesn't help when you have functional units with >1 cycle latency. All floating point ops (ignoring the SSE byte swizzles / negate / absolute value / etc. instructions) have long latencies.

So no, there's nothing inherently "more wasteful" about high-frequency, narrow-issue processors vs low-frequency, wide-issue processors. It's implementation-specific (i.e. one processor, say Prescott, may be less efficient than another processor, say the Pentium-M). Again, look at Itanium, very short pipeline, relatively low clockspeeds, very high IPC, and yet, huge power requirements.
Itanium is a different beast - but originally my point was supposed to be that going for the extra 10% performance costs much more than 10% in design effort. The lack of an x86 frontend alone probably saves the IA64 guys more than 10% work ;).
 

imgod2u

Senior member
Sep 16, 2000
993
0
0
Originally posted by: CTho9305
The second processor can be in C1 HLT when you're running single-threaded applications, so you aren't using twice the power. You said yourself you gain 80% performance in well-threaded applications - and your two single cores can be less complex than one core that would have gotten maybe 15% extra performance. Your design cost falls dramatically.

Looking at Prescott, today's power usage isn't dominated by how much of the processor is active, it's dominated by leakage. That means, even at idle, the extra core will leak and drain a lot of power and generate a lot of heat. Perhaps not 2x, but definitely, a lot more.
If an application is well-balanced enough to gain 80% performance from dual-core, then it has enough threads to feed a wider, more complex core with SMT. The only problem is the design cost, which is prohibitive. Even though a more complex core with SMT would enhance both single-threaded and multithreaded applications more than dual simple cores would, it's, unfortunately, too much design complexity.

It's not that events necessarily occur in less than one clock cycle, it's how many cycles it takes to do something. Take the K7 multiplier: 4 cycles are required to multiply two 32-bit numbers to produce a 32-bit result. If you wanted to drop it into a CPU running at twice the frequency, you'd need to make it an 8-cycle latency, meaning 4 extra flip-flop stages (actually a few hundred or thousand extra flip-flops, meaning HUGE power consumption and extra clock load - even MORE power - but any given path only goes through 4 extra). Since gate delays and flop clk->q delays aren't clock-speed dependent, this means you're wasting 40ps in my 10ps flop example... but real flops are much slower than 10ps, so you're wasting a lot of time doing nothing.

Erm, not in a pipelined processor you don't. That's the whole point of pipelining: if something takes much longer than another component, you split that task into multiple parts so that each stage is well-balanced and takes a roughly similar amount of time. Meaning that while a single multiply may take 4 clock cycles on a high-frequency machine and 2 cycles on a low-frequency machine, if you pipeline the multiplies, your throughput will still be 1 multiply per cycle. Of course, in real life, some things, such as integer arithmetic operations, aren't pipelined (although FP operations are), which is why multiply and add latencies do matter and are the main things inhibiting clockspeed (you want your adds to take 1 cycle as a rule of thumb, last I recall, so the highest clockspeed you can go is how fast your adder can go). Seeing as Intel has managed to make 10 GHz adders (with a 1-cycle latency), I don't think it's much of a problem. So no, higher-frequency processors don't necessarily have longer add/multiply latencies; in fact, a general rule among architects is to *never* have adds take longer than 1 cycle.
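A small sketch of that latency-vs-throughput distinction; the cycle counts and the 100-op stream are hypothetical.

# Cycles to finish a stream of multiplies on a pipelined unit, independent vs dependent.
def total_cycles(n_ops, latency, dependent=False):
    if dependent:
        return n_ops * latency            # each op waits for the previous result
    return latency + (n_ops - 1)          # a new op enters the pipe every cycle

print(total_cycles(100, latency=4))                  # 103 -> ~1 multiply/cycle throughput
print(total_cycles(100, latency=8))                  # 107 -> still ~1/cycle despite the higher latency
print(total_cycles(100, latency=8, dependent=True))  # 800 -> a dependent chain exposes the full latency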

You say more than one event can't occur in a cycle, and depending on your definition of event, this is true. But an event can comprise multiple stages of logic - consider an x86 AGU: you need to add effective address, displacement, and index (potentially also segment base, but K7 and K8 just added an extra cycle in the pipeline if it's nonzero).

All of which occur either in different circuit components or in multiple clockcycles. At the circuit level, there's simply no way to trigger multiple events (short of using multiple edges of the clockcycle, such as with edge-trigger flip-flops).

This could be done in a low-frequency design with a 3:2 compressor, then an adder. In a (ridiculously) high-frequency design with maybe 5 gate delays per cycle, you'd need to put flipflops between the compressor and the adder. You'd be spending unnecessary time in flops as a result of the shorter pipelines used to increase frequency. I'd suspect that any arbitrarily-long event that doesn't need control signal changes part way through it can be done in one clock stage.

Last I checked, flip-flops were clock-driven. So if you do put flip-flops between the components, you've effectively decoupled them into multiple stages. Unless, of course, there's an asynch circuit in there. And btw, relative to the rest of the circuit, flip-flops are pretty much instantaneous. The delay they present is trivial compared to even an AND gate, last I checked.

Please demonstrate how your modern processor schedules the following code:
IMUL AX, 7
IMUL AX, 7
IMUL AX, 7
IMUL AX, 7
....

A MUL reading from and writing to AX (e.g. IMUL AX, 3... on x86, IMUL writes to [DX:]AX) would create stalls if you didn't have enough independent instructions between them. Forwarding doesn't help.

Why not? Instructions go through tons of stages before they are actually executed by the ALU. There's fetching, decoding (multiple decoding stages in modern MPUs), cracking into micro-ops in some cases (Power4, K8), scheduling and finally execution and retire. All the stages except execution could be done without the result from the first IMUL, and on a scalar level, the second IMUL can be done as soon as the first IMUL is finished. If IMUL had a one-cycle latency, then you'd effectively reach a throughput of 1 CPI. It may even be possible (using some ungodly weird logic design) to pipeline multiplies that take multiple cycles, by splitting multiplies into different intermediate results that can be used by the next multiply. I'm not sure how many modern MPUs do this, but even assuming they don't, the processor is not stalled; instructions are still fetched, decoded, and issued independent of the instruction before them no matter what.

That doesn't help when you have functional units with >1 cycle latency. All floating point ops (ignoring the SSE byte swizzles / negate / absolute value / etc. instructions) have long latencies.

Last I checked, FP ops were pipelined on most modern MPU's. I'm not sure whether they have some form of forwarding between stages, and I'm not sure it's even worth it as FP code generally has a lot of ILP to provide (seeing as how the CPI of your average FP op isn't 30+ cycles).

Most dependencies occur in integer code and most integer operations have very low latency (with add being always 1 as a rule of thumb and 1/2 on Netburst).

Itanium is a different beast - but originally my point was supposed to be that going for the extra 10% performance costs much more than 10% in design effort. The lack of an x86 frontend alone probably saves the IA64 guys more than 10% work ;).

There does come a point where it becomes much more complicated to design just to gain a bit of performance (and eventually it reaches ridiculous levels, at which point something else should be considered; now we have dual-core - what after that?). Today's modern MPUs aren't limited by their ALUs and the latencies they take. The fact that the "ridiculously" high-clocked Netburst core has a *lower* add latency than the "reasonably clocked" K7/K8 should say as much.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: imgod2u
Looking at Prescott, today's power usage isn't dominated by how much of the processor is active, it's dominated by leakage. That means, even at idle, the extra core will leak and drain a lot of power and generate a lot of heat. Perhaps not 2x, but definitely, a lot more.
Cool'n'Quiet seems to reduce K8's power consumption significantly. Of course, it scales down vcore, and I don't know if the dual core K8s will be able to run at different voltages, but I think you'd get pretty good power savings.

If an application is well-balanced enough to gain 80% performance from dual-core, then it has enough threads to feed a wider, more complex core with SMT. The only problem is the design cost, which is prohibitive. Even though a more complex core with SMT would enhance both single-threaded and multithreaded applications more than dual simple cores would, it's, unfortunately, too much design complexity.
I don't think we disagree.

It's not that events necessarily occur in less than one clock cycle, it's how many cycles it takes to do something. Take the K7 multiplier: 4 cycles are required to multiply two 32-bit numbers to produce a 32-bit result. If you wanted to drop it into a CPU running at twice the frequency, you'd need to make it an 8-cycle latency, meaning 4 extra flip-flop stages (actually a few hundred or thousand extra flip-flops, meaning HUGE power consumption and extra clock load - even MORE power - but any given path only goes through 4 extra). Since gate delays and flop clk->q delays aren't clock-speed dependent, this means you're wasting 40ps in my 10ps flop example... but real flops are much slower than 10ps, so you're wasting a lot of time doing nothing.

Erm, not in a pipelined processor you don't. That's the whole point of pipelining: if something takes much longer than another component, you split that task into multiple parts so that each stage is well-balanced and takes a roughly similar amount of time. Meaning that while a single multiply may take 4 clock cycles on a high-frequency machine and 2 cycles on a low-frequency machine, if you pipeline the multiplies, your throughput will still be 1 multiply per cycle.
Not if one multiply depends on a preceding multiply. Even if you can theoretically issue one multiply per cycle, you can't issue a multiply until all of its operands are ready, meaning an instruction that depends on a multiply is going to have to wait between 3 and 5 cycles (depending on what size multiply it was).

Of course, in real life, some things, such as integer arithmetic operations, aren't pipelined (although FP operations are), which is why multiply and add latencies do matter and are the main things inhibiting clockspeed (you want your adds to take 1 cycle as a rule of thumb, last I recall, so the highest clockspeed you can go is how fast your adder can go). Seeing as Intel has managed to make 10 GHz adders (with a 1-cycle latency), I don't think it's much of a problem. So no, higher-frequency processors don't necessarily have longer add/multiply latencies; in fact, a general rule among architects is to *never* have adds take longer than 1 cycle.
I was talking about floating point adds:
Add/Subtract/Multiply, the most common FP instructions, require 1 throughput clock cycle and 4 latency clock cycles on the K7.
(source)

You say more than one event can't occur in a cycle, and depending on your definition of event, this is true. But an event can comprise multiple stages of logic - consider an x86 AGU: you need to add effective address, displacement, and index (potentially also segment base, but K7 and K8 just added an extra cycle in the pipeline if it's nonzero).

All of which occur either in different circuit components or in multiple clockcycles. At the circuit level, there's simply no way to trigger multiple events (short of using multiple edges of the clockcycle, such as with edge-trigger flip-flops).
Of course the steps occur in different circuit components - I agree with you that you can't use the same piece of logic twice in one cycle. I'm saying you have different circuit components (see the quote below).

This could be done in a low-frequency design with a 3:2 compressor, then an adder. In a (ridiculously) high-frequency design with maybe 5 gate delays per cycle, you'd need to put flipflops between the compressor and the adder. You'd be spending unnecessary time in flops as a result of the shorter pipelines used to increase frequency. I'd suspect that any arbitrarily-long event that doesn't need control signal changes part way through it can be done in one clock stage.

Last I checked, flip-flops were clock-driven. So if you do put flip-flops between the components, you've effectively decoupled them into multiple stages. Unless, of course, there's an asynch circuit in there. And btw, relative to the rest of the circuit, flip-flops are pretty much instantaneous. The delay they present is trivial compared to even an AND gate, last I checked.
Yes, you're breaking it into multiple stages. That's how you increase the frequency. No, flipflops are FAR from instantaneous. I have numbers for multiple designs, but I don't know which are public, and I couldn't find numbers for modern processors with some quick googling, so I won't post them.

The simplest type of flip-flop to understand, made of master-slave D latches is going to give you two NAND gate delays.

Even if your flip flop WAS instantaneous, thanks to clock skew and clock jitter, you can't use all of the cycle time (if you assume worst case jitter + skew = 10%, that means on a 1ns clock, the receiving flipflop's clock might fire up to 100ps early relative to the launching flip flop, giving you only 900ps usable time for logic. It may also fire up to 100ps late, giving you 1100ps, but you don't know this at design time, thus you'd have to design assuming that at 1GHz, you only have 900ps cycles).
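The skew/jitter budget in that example as arithmetic; the 10% worst case is the assumption stated above.

# Usable logic time per cycle after budgeting worst-case skew + jitter.
def usable_ps(freq_ghz, skew_jitter_fraction=0.10):
    period_ps = 1000.0 / freq_ghz
    return period_ps * (1.0 - skew_jitter_fraction)

print(usable_ps(1.0))   # 900.0 -> 900 ps of a 1000 ps cycle left for logic
print(usable_ps(3.0))   # ~300  -> a ~333 ps cycle shrinks to ~300 ps before any flop delay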

Check out this paper; they discuss latch-based design (basically you have 2 latches to define a stage, rather than 1 flipflop). Latches are faster than flip flops (though not twice as fast). They came up with 1.8 gate delays for a latch (.8 delays of that is clock skew/jitter, with 1 gate delay from the actual latch circuitry).

Please demonstrate how your modern processor schedules the following code:
IMUL AX, 7
IMUL AX, 7
IMUL AX, 7
IMUL AX, 7
....

A MUL reading from and writing to AX (e.g. IMUL AX, 3... on x86, IMUL writes to [DX:]AX) would create stalls if you didn't have enough independent instructions between them. Forwarding doesn't help.

Why not? Instructions go through tons of stages before they are actually executed by the ALU. There's fetching, decoding (multiple decoding stages in modern MPUs), cracking into micro-ops in some cases (Power4, K8), scheduling and finally execution and retire. All the stages except execution could be done without the result from the first IMUL, and on a scalar level, the second IMUL can be done as soon as the first IMUL is finished. If IMUL had a one-cycle latency, then you'd effectively reach a throughput of 1 CPI.
1) Yes, the second IMUL will execute immediately after the first executes (thanks to forwarding)
2) NO multiplies on K7 or K8 take a single cycle to produce a result that can be forwarded. The fastest, 8 bits * 8 bits, requires 3 cycles.

It may even be possible (using some ungodly weird logic design) to pipeline multiplies that take multiple cycles, by splitting multiplies into different intermediate results that can be used by the next multiply. I'm not sure how many modern MPUs do this, but even assuming they don't, the processor is not stalled; instructions are still fetched, decoded, and issued independent of the instruction before them no matter what.
If you had a hypothetical instruction sequence composed solely of dependent multiplies or FPU ops, eventually the processor would HAVE to stall as the reservation stations and reorder buffer get full.
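A toy model of that back-pressure; the buffer size, fetch width, and latency are invented for illustration.

# Front end keeps fetching, but a fully dependent chain retires slowly and the ROB fills up.
def cycles_until_frontend_stalls(rob_entries=24, fetch_per_cycle=3, dep_op_latency=4):
    occupancy, cycle = 0, 0
    while occupancy + fetch_per_cycle <= rob_entries:   # stop when the next fetch group won't fit
        cycle += 1
        occupancy += fetch_per_cycle
        if cycle % dep_op_latency == 0:
            occupancy -= 1        # the dependent chain completes only one op every 'latency' cycles
    return cycle

print(cycles_until_frontend_stalls())   # 8 -> after 8 cycles the ROB is full and fetch must stall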


That doesn't help when you have functional units with >1 cycle latency. All floating point ops (ignoring the SSE byte swizzles / negate / absolute value / etc. instructions) have long latencies.

Last I checked, FP ops were pipelined on most modern MPU's. I'm not sure whether they have some form of forwarding between stages, and I'm not sure it's even worth it as FP code generally has a lot of ILP to provide (seeing as how the CPI of your average FP op isn't 30+ cycles).
There is forwarding, even in FPUs. However, a 3-cycle multiply, no matter what pipelining tricks you do, WILL NOT be able to provide data to a dependent instruction until 3 cycles after it begins, meaning the other instruction waits, and if all succeeding instructions are dependent, you'll have to stall. If it's only a few instructions, you won't have to stall the front end of the pipeline, but there will be bubbles in the execution and writeback stages.

Most dependencies occur in integer code and most integer operations have very low latency (with add being always 1 as a rule of thumb and 1/2 on Netburst).
I think you only get the 1/2-cycle latency for 16-bit adds.
 

Sahakiel

Golden Member
Oct 19, 2001
1,746
0
86
1. Clock skew, jitter, wire delay, and gate delays are starting to chew up a significant portion of each clock period. That means the more you increase the frequency, the less time you have to do useful work. Extending the pipeline only exacerbates the problem by introducing more circuitry into the critical path. Increasing the issue rate does the same.

2. In general, circuit elements are active at all times. Depending on how you define an event, each clock cycle can be broken down into a series of events, each corresponding to a state change. Not all events are triggered by the global clock, but each event is followed by a period of time during which signals must propagate to the next logic gate. At the end of each clock period, faster signals are held until the next clock period, which allows the slowest signals the necessary time to propagate. Thus, CPU clocks are used to synchronize the million or so elements of each CPU.

3. An old question is why doesn't anyone design a pipeline with one gate per stage. The physics means that such a design ends up wasting more time than a shorter, equivalent architecture. If I recall correctly, the latest limit is about 4 or 5 FO4 gate delays per stage. Any shorter and you end up using more time.

4. Speculative execution at the component level introduces the possibility of reducing data dependencies among multi-cycle operations for a certain percentage of time. However, good luck developing the necessary algorithms.
 

imgod2u

Senior member
Sep 16, 2000
993
0
0
Originally posted by: CTho9305
Cool'n'Quiet seems to reduce K8's power consumption significantly. Of course, it scales down vcore, and I don't know if the dual core K8s will be able to run at different voltages, but I think you'd get pretty good power savings.

Yes, and dynamic clock gating seems to work fine for Banias/Dothan. But again, at 90nm, there will still be leakage. Meaning dual-core will still draw quite a bit more power than single-core (and seeing as leakage grows with smaller processes, it may be close to 2x at some point).

Of course, in real life, some things, such as integer arithmetic operations, aren't pipelined (although FP operations are), which is why multiply and add latencies do matter and are the main things inhibiting clockspeed (you want your adds to take 1 cycle as a rule of thumb, last I recall, so the highest clockspeed you can go is how fast your adder can go). Seeing as Intel has managed to make 10 GHz adders (with a 1-cycle latency), I don't think it's much of a problem. So no, higher-frequency processors don't necessarily have longer add/multiply latencies; in fact, a general rule among architects is to *never* have adds take longer than 1 cycle.
I was talking about floating point adds:
Add/Subtract/Multiply, the most common FP instructions, require 1 throughput clock cycle and 4 latency clock cycles on the K7.
(source)

And like I said, FP latencies usually don't matter as ILP in FP code is usually rich enough to mask any such latencies.

Last I checked, flip-flops were clock-driven. So if you do put flip-flops between the components, you've effectively decoupled them into multiple stages. Unless, of course, there's an asynch circuit in there. And btw, relative to the rest of the circuit, flip-flops are pretty much instantaneous. The delay they present is trivial compared to even an AND gate, last I checked.
Yes, you're breaking it into multiple stages. That's how you increase the frequency. No, flipflops are FAR from instantaneous. I have numbers for multiple designs, but I don't know which are public, and I couldn't find numbers for modern processors with some quick googling, so I won't post them.

The simplest type of flip-flop to understand, made of master-slave D latches is going to give you two NAND gate delays.

Even if your flip flop WAS instantaneous, thanks to clock skew and clock jitter, you can't use all of the cycle time (if you assume worst case jitter + skew = 10%, that means on a 1ns clock, the receiving flipflop's clock might fire up to 100ps early relative to the launching flip flop, giving you only 900ps usable time for logic. It may also fire up to 100ps late, giving you 1100ps, but you don't know this at design time, thus you'd have to design assuming that at 1GHz, you only have 900ps cycles).

Check out this paper; they discuss latch-based design (basically you have 2 latches to define a stage, rather than 1 flipflop). Latches are faster than flip flops (though not twice as fast). They came up with 1.8 gate delays for a latch (.8 delays of that is clock skew/jitter, with 1 gate delay from the actual latch circuitry).

News to me. Last time I used a NAND gate, it took a hell of a lot longer than any flip-flop did.

1) Yes, the second IMUL will execute immediately after the first executes (thanks to forwarding)
2) NO multiplies on K7 or K8 take a single cycle to produce a result that can be forwarded. The fastest, 8 bits * 8 bits, requires 3 cycles.

Last I recall, a multiply of a 32-bit x 32-bit number on Prescott had a 3-cycle latency. Yes, in the purely hypothetical situation in which all multiplies are dependent (assuming there's no method of forwarding between intermediate results of multiplies), it would stall. Of course, the original implication was that a higher-frequency processor would have a longer multiply latency. As I've pointed out, clockspeed is limited by ALU operation latencies. As a general rule of thumb, nobody would ever make an add take 2 cycles. So in practice, higher-clocked processors would not have any longer latencies than lower-clocked processors.

If you had a hypothetical instruction sequence composed solely of dependent multiplies or FPU ops, eventually the processor would HAVE to stall as the reservation stations and reorder buffer get full.

Which rarely, if ever, happens in FP code.

Last I checked, FP ops were pipelined on most modern MPU's. I'm not sure whether they have some form of forwarding between stages, and I'm not sure it's even worth it as FP code generally has a lot of ILP to provide (seeing as how the CPI of your average FP op isn't 30+ cycles).
There is forwarding, even in FPUs. However, a 3-cycle multiply, no matter what pipelining tricks you do, WILL NOT be able to provide data to a dependent instruction until 3 cycles after it begins, meaning the other instruction waits, and if all succeeding instructions are dependent, you'll have to stall. If it's only a few instructions, you won't have to stall the front end of the pipeline, but there will be bubbles in the execution and writeback stages.

If we're talking about FP code, then the situation you've portrayed is extremely rare.

Most dependencies occur in integer code and most integer operations have very low latency (with add being always 1 as a rule of thumb and 1/2 on Netburst).
I think you only get the 1/2-cycle latency for 16-bit adds.

Yes.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
News to me. Last time I used a NAND gate, it took a hell of a lot longer than any flip-flop did.
I can't imagine how a flipflop like that would work. Got a circuit? The NAND gates I play with aren't that much slower than inverters, and there isn't really any logic you can do in less than an inverter delay. Various published flip flop circuits I've seen all present multiple gate delays in the setup + clock-to-q.

Last I recall, a multiply of a 32-bit x 32-bit number on Prescott had a 3-cycle latency. Yes, in the purely hypothetical situation in which all multiplies are dependent (assuming there's no method of forwarding between intermediate results of multiplies), it would stall. Of course, the original implication was that a higher-frequency processor would have a longer multiply latency.
Regardless of the tricks used, multiplying two numbers in a given manufacturing process is going to take at least x picoseconds. If you want to have a clock period faster than that, you're going to have to break the multiply into more stages, and the higher the frequency you want to reach, the shorter the stages have to be. There's no way around it. Higher frequencies for a given process require more stages, and thus more flipflop delays.
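That relationship as a quick formula; the 4000 ps of multiplier logic and the 30 ps flop overhead are stand-ins for the "x picoseconds" above, not real process numbers.

import math

# Minimum pipeline stages needed to fit a fixed amount of logic at a target frequency.
def stages_needed(work_ps, freq_ghz, flop_ps=30.0):
    period_ps = 1000.0 / freq_ghz
    usable_ps = period_ps - flop_ps        # every stage pays the flop overhead again
    return math.ceil(work_ps / usable_ps)

print(stages_needed(4000, 1.0))   # 5 stages at 1 GHz
print(stages_needed(4000, 2.0))   # 9 stages at 2 GHz -> more than double, because each stage re-pays the flop tax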

If you had a hypothetical instruction sequence composed solely of dependent multiplies or FPU ops, eventually the processor would HAVE to stall as the reservation stations and reorder buffer get full.

Which rarely, if ever, happens in FP code.
You said processors never stall:
Erm, no, data dependencies do not stall modern processors (at least, not on a scalar level). Modern processors use forwarding to deal with dependencies between instructions in the pipeline. The only exceptions to this would be 1. branches and 2. dependencies on loads (memory or cache). The latter is somewhat solved (at least for cache latency) by out-of-order execution. The former is a huge problem (even with 99%+ accurate branch predictors).
I presented a case where data dependencies force a stall.

If we're talking about FP code, then the situation you've portrayed is extremely rare.
That doesn't mean it can't happen. I've never looked at assembly for FPU-intensive software, so I'll have to take your word for it.

The point is, data dependencies can limit your pipelining. If we went to an extreme and pipelined to the level where even a 32-bit integer add takes two cycles (say, make the whole Pentium 4 run at the ALU double-speed clock), you'd have a design in which dependent integer operations are going to force a 1-cycle stall if you don't have other independent operations. I'm not arguing that this is a sensible thing to do, only that data dependencies DO put upper limits on realistic pipelining.
 

zephyrprime

Diamond Member
Feb 18, 2001
7,512
2
81
It's not that the Athlon chips are doing MORE per clock cycle. It's that the P4 takes more clock cycles to do things.
Uh, isn't that tautologically equivalent to the contention you are trying to disprove?

Without multi-core or multiprocessor setups, ANY PROCESSOR will do ONE THING and ONE THING ONLY per clock cycle, and that's either something or nothing.
Not so. Ever hear of superscalar? It's been around since the Pentium classic. Chips can do more than one thing per cycle even without multi-whatever setups. And also, many instructions still require multiple cycles to complete. For example, MUL and FMUL on both the Athlon and P4 take 4 cycles to complete.