Originally posted by: CTho9305
Originally posted by: Mday
Originally posted by: SuperTool
Thanks for your highly technical advice. Has it occurred to you that the slower the clock rate is, the more time you have per clock, and the more you can accomplish per clock? AMD is not lacking in the GHz arena. Their engineers just made different design decisions from Intel, better ones in my opinion. Intel is moving away from P4-style high-MHz designs because they aren't very efficient in terms of energy, cooling, design resources, etc.
That has to be the worst defense of AMD I have ever heard.
It's not that the Athlon chips are doing MORE per clock cycle. It's that the P4 takes more clock cycles to do things. Without multi-core or multiprocessor setups, ANY PROCESSOR will do ONE THING and ONE THING ONLY per clock cycle, and that's either something or nothing.
Given that Intel canned Tejas because it was too hot, and Sun canned their SPARC V, it seems like many of the big guys are having problems gaining more performance by scaling frequency. MHz is overrated. Power = C * V^2 * F, where F is frequency, so as you scale frequency up, power tends to go up too. Obviously process shrinks (0.18u -> 0.13u) can help mitigate this, but even though we're at 90nm now, any desktop chip burns more power than even high-end Alpha processors, which were considered hot at just 30W not too long ago.
You'll notice in that equation that capacitance, C, is just as much a factor in power consumption as frequency. What do you think you would need to do to make a K7 instead of a Netburst core? You need to add more decoders, more execution units, more issue ports, etc. That increases capacitance and thus increases power consumption. Power issues afflict wide-and-short and narrow-and-long designs alike, and neither is inherently "worse" at it. It's a matter of implementation and how efficiently you take advantage of the resources you have, not how much work you get done per cycle. Case in point: Itanium. It extracts massive amounts of ILP and achieves a very high IPC, and yet it consumes tons of power (more than any Prescott out there, and at just 1.6 GHz).
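To put rough numbers on that (everything below is made up purely for illustration; real switched capacitance and voltage vary a lot by design):

# Dynamic power: P = C * V^2 * F (illustrative, made-up values)
def dynamic_power(c_farads, v_volts, f_hertz):
    return c_farads * v_volts ** 2 * f_hertz

# Narrow core: less switched capacitance, higher clock.
narrow = dynamic_power(c_farads=30e-9, v_volts=1.4, f_hertz=3.0e9)
# Wide core: more decoders/execution units means more capacitance, lower clock.
wide = dynamic_power(c_farads=45e-9, v_volts=1.4, f_hertz=2.0e9)
print(narrow, wide)  # both ~176 W: neither approach is inherently cheaper

The point of the made-up numbers: trade clock for width and the C * V^2 * F product can come out exactly the same.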
The K7 doesn't utilize "cycles" more; it simply has more parallel execution resources. The K8, on the other hand, does utilize its cycles more, as it spends less time idling on memory misses (thanks to the integrated memory controller), and *that* is efficiency, and it has little to do with whether the chip is a high-clockspeed or high-IPC design.
AMD, Intel, and Sun have all announced plans to go multi-core. This makes sense because, as in a lot of fields, getting 90% of the performance takes 50% of the effort (I made those numbers up, but you should see my point). For example, predicting branches with 90% accuracy requires only a table of 2-bit counters (a sketch of one follows below). Getting to 99% requires ridiculously complicated logic (and large amounts of it). As another example, a "standard" out-of-order pipeline that just handles the basic data dependencies isn't too complicated, but to gain more performance you have to do some really crazy optimizations (for example, tricks like value prediction, which throw in large amounts of hardware even though you only gain a few percent more performance). With multi-core parts, you do the 50% of the work to reach 90%, then just dump a whole bunch of those CPUs in a system and convince the software guys to write multi-threaded software.
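Here's roughly what that simple scheme looks like (a minimal sketch; the table size and indexing are made up):

# 2-bit saturating-counter branch predictor (toy sketch).
# States 0-1 predict not-taken, states 2-3 predict taken.
TABLE_SIZE = 1024
counters = [1] * TABLE_SIZE  # start weakly not-taken

def predict(pc):
    return counters[pc % TABLE_SIZE] >= 2  # True means "predict taken"

def update(pc, taken):
    i = pc % TABLE_SIZE
    if taken:
        counters[i] = min(counters[i] + 1, 3)  # saturate at strongly taken
    else:
        counters[i] = max(counters[i] - 1, 0)  # saturate at strongly not-taken

That's it: a small table and an increment/decrement, and on typical code it already gets you most of the way there.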
Unfortunately, while it's easy, it's also incredibly inefficient. With dual-core you have twice the power consumption, twice the heat output, and, depending on your packaging (I'm assuming stacked cores), twice the heat per area. And you gain 0% in single-threaded applications and perhaps an 80% performance increase in well-balanced multithreaded applications. People aren't going dual-core because it's more efficient than the other methods (increasing clockspeed, widening the core); they're going that way because the other two methods have reached the point of diminishing returns (whether by jamming more execution resources into the chip or by extending the pipeline), and dual-core is another method (albeit another inefficient one) to increase performance for some applications.
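Run the numbers above (twice the power, 0% single-threaded gain, ~80% multithreaded gain) and the inefficiency is plain:

# Perf/watt of dual-core vs single-core, using the figures from the text.
single_perf, single_power = 1.0, 1.0   # normalized single-core baseline
dual_power = 2.0 * single_power        # two cores, twice the power
print((1.0 * single_perf) / dual_power)  # single-threaded: 0.5, half the perf/watt
print((1.8 * single_perf) / dual_power)  # multithreaded best case: 0.9, still worse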
Maybe in an ideal world with perfect branch prediction, no interrupts, and no exceptions, but in reality, longer pipelines aren't always better performers (as you're implying). It's reasonable to say that the P4 does less useful work per cycle.
It has been in the past. The Alphas around the time of the Pentium II didn't dominate their competition because of a wide core; they dominated because they clocked the highest. Traditionally, higher clockspeeds have been a better method of gaining performance. Of course, as with anything, you get to a point where it stops being so great, and I think Prescott has reached and exceeded that point.
Your statement that the P4 takes more cycles to do stuff, but doesn't do less per cycle, implies that it does more work overall. This is true from the perspective of the raw number of instructions executed, but not true if you look at the number of instructions that are committed rather than killed (if it were, Athlons would be getting slaughtered in pretty much every benchmark).
It does take more cycles to do stuff. The P4 is a 6-way, 6-issue design (counting the double-pumped ALUs), while the K7/K8 is a 9-way, 9-issue design. That's 50% more the K7/K8 can do at peak every clock cycle (requiring, of course, more hardware on the chip). The K7/K8 also has a 3-way decoder capable of decoding any 3 x86 instructions and issuing 3 macro-ops (each a fused pair of micro-ops) per clock cycle. This issue rate (although micro-op implementations differ, they're usually similar for simple instructions) is twice that of the P4's trace cache. Of course, there are all sorts of things in the P4 (Northwood) design that made it come up short: only 1 FP issue port, no dedicated shifters, etc. But solving those problems requires more hardware; Intel relied on software to get around them.
At a circuit level, higher clock speeds imply less work per nanosecond, because you spend a higher percentage of your time doing nothing but waiting in flip-flops. If you assume your flip-flop takes 10ps, then at 1GHz you can do 990ps of useful work per cycle, so you're doing work 99% of the time; but at 10GHz you can only do 90ps of work per cycle, and at 10 cycles per nanosecond you end up wasting 100ps every nanosecond sitting in flip-flops doing nothing.
I don't know where you get this notion, but the whole point of a clock cycle is to synchronize things, which means multiple events *do not occur* in one clock cycle. Each part of a circuit does exactly one thing per clock cycle, whether that means doing it and waiting for 990ps (as in the case of accessing a flip-flop that takes 10ps) or doing it and waiting for 90ps (as with the 10GHz circuit). There are tricks, of course, such as triggering on both edges of the clock, but again, everything has to be synchronized. Only one event occurs per cycle; if you had more, you could never synchronize multiple events.
Sorry, but delays due to higher frequencies do not occur at the circuit level.
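For reference, the arithmetic in the quoted paragraph works out as stated; the disagreement above is over whether that time counts as "waste" at all:

# Flip-flop overhead per cycle at two clock rates (10 ps flop, per the quote).
FLOP_PS = 10
for f_ghz in (1, 10):
    cycle_ps = 1000 / f_ghz            # cycle time in picoseconds
    useful_ps = cycle_ps - FLOP_PS     # time available for logic each cycle
    print(f_ghz, "GHz:", useful_ps, "ps of logic per cycle,",
          round(100 * useful_ps / cycle_ps), "% of the time")
# 1 GHz: 990 ps (99%); 10 GHz: 90 ps (90%)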
At an architectural level, you have things like data dependencies, whose relative effects get worse as latencies (in cycles) go up. If you have a 1GHz processor that can multiply two 32-bit numbers in 4 cycles, and a 2GHz processor that does it in 8 cycles (the same amount of time), then when a future instruction depends on that multiply, the 1GHz part has to wait 3 cycles, so 3 cycles were wasted, but the 2GHz part wasted 7 cycles (more than double).
Erm, no, data dependencies do not stall modern processors (at least, not at a scalar level). Modern processors use forwarding to deal with dependencies between instructions in the pipeline. The only exceptions would be 1. branches and 2. dependencies on loads (memory or cache). The latter is somewhat solved (at least for cache latency) by out-of-order execution. The former is a huge problem (even with 99%+ accurate branch predictors).
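A toy model of what forwarding buys you (a made-up textbook 5-stage pipeline, for illustration only):

# Dependent back-to-back ALU ops in a classic 5-stage pipeline.
# Without forwarding the consumer waits for the producer's writeback;
# with forwarding the EX-stage result is routed straight into the next EX.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def bubbles(forwarding):
    result_ready_after = STAGES.index("EX") if forwarding else STAGES.index("WB")
    consumer_needs_it_at = STAGES.index("EX")
    return max(0, result_ready_after - consumer_needs_it_at)

print(bubbles(forwarding=False))  # 2 stall cycles waiting on writeback
print(bubbles(forwarding=True))   # 0 stall cycles: the dependency is hidden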
Loads from memory are just as big a problem (perhaps the biggest problem) for a wide-but-short processor as for a narrow-but-long processor. Using your example (but with simpler numbers), let's say the 1 GHz processor does 2 operations each cycle, so 2 32-bit ops take 1 cycle. The other, the 2 GHz processor, does 1 operation each cycle, so 2 32-bit ops take 2 cycles.
If there is a cache miss and the processor stalls on memory (assume the same memory for both), then the 1 GHz chip waits 10 cycles (assuming 100 MHz memory and 1 memory clock of delay for the load) and the 2 GHz chip waits 20 cycles. So yes, the 2 GHz chip wasted more clock cycles, so it's more inefficient, right? No. Clock cycles aren't the only resource a processor has. As I mentioned before, capacitance is also a factor in power usage, and the 1 GHz chip has twice the execution width of the 2 GHz chip; during that stall, it wasted just as much "potential work" (read: idle transistors) as the 2 GHz chip. Had it not stalled, the 1 GHz chip could've done 10 clock cycles x 2 ops/cycle = 20 ops. The 2 GHz chip, waiting 20 cycles, could've done 20 clock cycles x 1 op/cycle = 20 ops. The same amount of potential work (and active transistor time) is wasted; the only difference is that the 1 GHz chip would show a higher IPC statistic (which is often confused with efficiency). It would still draw as much power and produce as much heat through waste.
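Spelled out (same made-up numbers: 100 MHz memory, one memory clock of load delay):

# Potential work thrown away during one memory stall, for both designs.
MEM_MHZ = 100
designs = {"wide/slow": (1000, 2), "narrow/fast": (2000, 1)}  # (core MHz, ops/cycle)
for name, (mhz, width) in designs.items():
    idle_cycles = mhz // MEM_MHZ  # core cycles burned per memory clock
    print(name, ":", idle_cycles, "idle cycles x", width,
          "ops/cycle =", idle_cycles * width, "ops lost")
# wide/slow: 10 x 2 = 20 ops; narrow/fast: 20 x 1 = 20 ops -- identical waste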
So no, there's nothing inherently "more wasteful" about high-frequency, narrow-issue processors versus low-frequency, wide-issue processors. It's implementation-specific (i.e., one processor, say Prescott, may be less efficient than another, say the Pentium M). Again, look at Itanium: very short pipeline, relatively low clockspeeds, very high IPC, and yet huge power requirements.