How much faster can single-core CPUs be

bryanl · Apr 30, 2013

without higher clock rates?

TuxDave · Apr 30, 2013

I think you underestimate the difficulty in answering the question. So here's my shot (spoiler: there is no answer in this post)

Take the nVidia Titan review recently from Anandtech. DGEMM is a double precision general matrix multiply that mostly involves doing the same type of floating point calculation on a massive amount of data. It has a specific series of instruction it needs and has a specific memory footprint and a very specific sequence of instructions that repeat for a long time. The nVidia Titan has plenty of floating point units to work on. So the graph below shows how amazing it's suited for these types of workloads.

Then you have this.

The Titan EU utilization has tanked and it's hard to tell why without knowing what's limiting performance here.

So back to your original question regarding "how much faster can a CPU go without increasing cores or clocks", it depends on the workload because each program will be stressing a different part of the chip. Maybe it's just sheer floating point capacity. Maybe it's cache thrashing or cache misses. Perhaps you're getting a huge pile of branch mispredicts. Maybe integer performance is limiting you. Maybe software just needs some new instructions or maybe it needs to learn how to schedule workloads better. Maybe something else (who knows, I suck at architecture)

There's a lot of hardware to help keep the execution units busy in face of all the above problems. So when the CPU stalls and the instruction is forced to wait before it executes, you have hyperthreading or OOO execution to make sure the CPU stays busy. This, again, only helps absorb unpredictable stalls. If you rarely stall, maybe all you really wanted was just more execution units to do more work at the same time.

There's plenty of other ways too but it really starts to depend on "what % of all workloads does this help vs doesn't help" and how much complexity/power/cost are you willing to pay for it. There are definitely some "mostly good idea for many workloads but not all workloads" ideas out there but those are getting harder to find and harder to design/validate.

Oric · May 19, 2013

Higher frequency mean shorter distances so that the electrons have time to travel. Shorter distances mean miniaturization Ans at that moment you reach your limits on how small you can build a circuit without falling into quantum effects

Sheep221 · May 19, 2013

Oric said:
Higher frequency mean shorter distances so that the electrons have time to travel. Shorter distances mean miniaturization Ans at that moment you reach your limits on how small you can build a circuit without falling into quantum effects

This is not correct, higher frequency means, something(electrical current, signal)is repeating more times within time of one second, regardless the distance the electrons must travel.
For example you go from home to school which is 10 miles far, you need 3 hours to go there and back once, with car you would be able to do that 10 times within 3 hours. Do you think that after you traveled by car, the school was closer to your house?

TuxDave said:
I think you underestimate the difficulty in answering the question. So here's my shot (spoiler: there is no answer in this post)

Take the nVidia Titan review recently from Anandtech. DGEMM is a double precision general matrix multiply that mostly involves doing the same type of floating point calculation on a massive amount of data. It has a specific series of instruction it needs and has a specific memory footprint and a very specific sequence of instructions that repeat for a long time. The nVidia Titan has plenty of floating point units to work on. So the graph below shows how amazing it's suited for these types of workloads.

Then you have this.

The Titan EU utilization has tanked and it's hard to tell why without knowing what's limiting performance here.

So back to your original question regarding "how much faster can a CPU go without increasing cores or clocks", it depends on the workload because each program will be stressing a different part of the chip. Maybe it's just sheer floating point capacity. Maybe it's cache thrashing or cache misses. Perhaps you're getting a huge pile of branch mispredicts. Maybe integer performance is limiting you. Maybe software just needs some new instructions or maybe it needs to learn how to schedule workloads better. Maybe something else (who knows, I suck at architecture)

There's a lot of hardware to help keep the execution units busy in face of all the above problems. So when the CPU stalls and the instruction is forced to wait before it executes, you have hyperthreading or OOO execution to make sure the CPU stays busy. This, again, only helps absorb unpredictable stalls. If you rarely stall, maybe all you really wanted was just more execution units to do more work at the same time.

There's plenty of other ways too but it really starts to depend on "what % of all workloads does this help vs doesn't help" and how much complexity/power/cost are you willing to pay for it. There are definitely some "mostly good idea for many workloads but not all workloads" ideas out there but those are getting harder to find and harder to design/validate.

The difference between multicore and single core CPU is that when single core CPU reach 100% utilization on its one core, it has no other cores available and therefore other programs are working by time-sharing its execution capacity. It also happens on multicore CPUs when all cores are running on 100% but you still can time-share for cores rather than one, and you still have ture multicore performance if your application is multi-threaded, in that sense it can utilize all cores to the max just for itself. Sadly most programs are not able to do so and one instance of program is able to use only one core, however it don't gets time shared unless all other cores are used by something else due to system management in operating system.
The multicore CPU is simply like 2 or 4 or 8 single core CPUs, but they are more energy efficient and faster due to latest additions such as shared L3 cache and integrated memory controller.
They are also better for low power segment because the less chance of time-sharing makes processors perform well also under lower frequencies and therefore saving power.

intx13 · May 20, 2013

Sheep221 said:
This is not correct, higher frequency means, something(electrical current, signal)is repeating more times within time of one second, regardless the distance the electrons must travel.
For example you go from home to school which is 10 miles far, you need 3 hours to go there and back once, with car you would be able to do that 10 times within 3 hours. Do you think that after you traveled by car, the school was closer to your house?

What Oric meant was that higher frequencies require smaller distances or clock skew becomes an insurmountable issue.

In a low-frequency circuit, propagation times (dependent on the speed of electrons in copper and the distance between circuit elements) are insignificant compared to the delay between clock edges. Therefore, signals can be imagined to propagate instantaneously at the clock edge.

In a high-frequency circuit, clock edges appear so quickly that the electrons may not have propagated across the entire circuit before the next edge. This introduces clock skew, where some parts of the circuit may still be propagating the previous clock edge, while others are on the next.

Clock distribution to minimize skew is an important part of CPU design, and it gets tougher as clock rates increase. The easiest solution is to shrink the distance between circuit elements, minimizing propagation time, but there's a fundamental limit to that.

Additionally, as clock rates increase, signals begin to move into the microwave range. The physical layout of the chip becomes even more important as physical structures (like pad sizes, distance between channels, etc.) have complex effects on the signal.

Finally, higher clock rates require crazy high frequency harmonics. To get a clean square wave you need many harmonics of the fundamental frequency. So if you want a clock rate of 10 GHz you'll need to generate harmonics beyond 30 GHz. That's well into the microwave range. Sinusoidal clocks exacerbate skew problems and start putting the transistors into linear modes, driving up power consumption.

Personally, I think clock speeds are "high enough" right now. I haven't had a project in many years that could be improved by throwing more cycles at it. But I have had many projects that could have been drastically improved through the use of multiple-cores, offloading algorithms into FPGAs, direct digital synthesis, multi-channel memory, wider buses, and other modern microcontroller features.

koshling · May 20, 2013

intx13 said:
In a high-frequency circuit, clock edges appear so quickly that the electrons may not have propagated across the entire circuit before the next edge. This introduces clock skew, where some parts of the circuit may still be propagating the previous clock edge, while others are on the next.

The electrons certainly will NOT have had enough time. You mean (as I'm sure you know actually, but in case other readers might not) the electric field may not have had enough time to propagate. The electrons themselves travel at order of fractions of a millimeter per second.

TuxDave · May 20, 2013

Sheep221 said:
The difference between multicore and single core CPU is that when single core CPU reach 100% utilization on its one core, it has no other cores available and therefore other programs are working by time-sharing its execution capacity.

I'll just resummarize my post based on what you mention here. My statement is true for multicore systems and single core system. Getting high CPU utilization depends on the workload. You can run DGEMM on a single core CPU (instead of on the nVidia Titan) and if the machine is properly balanced, you probably will be using 100% of the AVX execution width. In that case, you probably could improve performance by widening the execution stack some more until you become cache limited or something. However, in branch heavy code, you probably aren't AVX limited and your theoretical speedup starts becoming branch prediction limited. So I'm back to my original point of "the speedup you want depends on what you're running"

On that note, 100% CPU utilization is a funny number, I'm not really sure what metric is being used.

intx13 · May 21, 2013

koshling said:
The electrons certainly will NOT have had enough time. You mean (as I'm sure you know actually, but in case other readers might not) the electric field may not have had enough time to propagate. The electrons themselves travel at order of fractions of a millimeter per second.

Yes - an important distinction; I was being sloppy.

I'll try to clarify this for people interested in the OP's question. In a CPU, binary signals are represented by voltage levels. The clock signal is a square wave at a particular frequency: it's a signal that alternates between 0 and 1, or more specifically, between the two voltages chosen to represent 0 and 1, such as 0.5 V and 2.7 V.

Suppose you have a long copper wire. On one end you attach a clock generator. On the other end you attach some piece of equipment (like an ALU) that needs to "see" that clock in order to do something (like add two values). The clock generator will alternately increase and decrease the voltage at the first end of the wire to represent a 0 and a 1. But how does that voltage "move" from the clock generator to the CPU?

The copper wire is full of charge carriers (electrons). Most of the time they're just bouncing around randomly, not really going anywhere. And most of the time they all have the same energy. Suppose the clock generator increases the voltage at one end of the wire by stuffing more electrons into the wire. More electrons bouncing around near the start of the wire means more voltage at that end.

The "pressure" of these electrons all bunched up at the start of the wire forces all the electrons in the wire to start drifting towards the other end. Eventually the electrons are evenly spread again, but now there are more electrons in the wire, so the voltage at both ends (and everywhere in between) is higher. The ALU at the end of the wire now "sees" the same voltage that the clock generator "created" at the start of the wire.

Unfortunately, it takes a very long time for this drift to balance out the voltage in the wire. The clock generator would have to increase the voltage (to signal a rising clock edge) and then wait entire seconds for the electrons to finish drifting towards the ALU. If it didn't wait long enough, parts of the CPU that are closer to the clock generator would "see" the new voltage before the ALU. For example, the CPU might alert the motherboard that an answer is ready before the ALU even gets the signal to start computing! Different parts of the CPU would no longer be in sync and everything breaks.

So if it takes such a long time for voltage to propagate from one end of a wire to the other via electrons "bunching up" and drifting, how can CPUs operate so quickly? It turns out that there is another physical phenomenon at work other than drift. When the electrons at the start of the wire start drifting they actually cause the electrons at the end of the wire to start drifting almost immediately. We don't have to wait for the "bunching up" and drift to stabilize.. everything starts moving nearly at the same time.

This happens through the instantaneous generation of an electromagnetic field, which itself moves through the wire at the speed of light. In other words, we don't have to wait for the drift to "ripple" through the wire. The fact that electrons at one end are moving causes electrons at the other end to start moving almost immediately. This means that the ALU "sees" the new voltage almost immediately after the clock generator establishes it. In fact, the electromagnetic field propagates so fast that it doesn't really matter whether some parts of the CPU are close to the clock generator or far away. They all "see" the new voltage almost at the same time. The clock generator doesn't have to wait very long to ensure that everybody "sees" the new voltage level.

For the first 30+ years of CPU design that worked fine, because the electromagnetic field transmits voltage changes ludicrously fast.

But is it fast enough? The speed of light is approximately 300 million meters per second. Suppose the ALU is 3 cm away from the clock generator. How long will it take for the voltage at the clock generator to propagate to the ALU? If we had to wait for the electron drift to stabilize it could take tens of seconds! The electromagnetic field, propagating at the speed of light, takes only (3 cm) / (300 million m/s) = 100 picoseconds to propagate the new voltage. This means that the clock generator must wait at least 100 picoseconds between clock edges to ensure that the ALU "saw" the new clock edge. 100 picoseconds between edges means 200 picoseconds per clock cycle, or a clock speed of 5 GHz. That's not much higher than current clock rates!

In summary, the electrons themselves move very slowly through the CPU. Too slowly to be useful in propagating voltage changes. The clock rate would have to be really, really slow if we had to wait for electrons to slowly drift from one part of the CPU to another to transmit voltage levels around. Luckily, we don't have to wait that long, because the electromagnetic field propagates voltage changes at the speed of light. However, even the speed of light incurs a slight delay between the time that different parts of the CPU "see" a new voltage. The clock generator must wait for every part of the CPU to see the previous voltage level before sending out a new one. If the furthest part of a CPU is 3 cm away from the clock generator, this means a maximum clock rate of 5 GHz.

Now in reality, the clock generator does not actually have to wait for all parts of the circuit to "see" the previous voltage before sending out the new one. It's possible to use a clock rate that is too high, so long as you never incur bugs (like telling the motherboard an answer is ready before you've computed it). Large CPUs with very high clock rates (like modern x86 chips) have clock skew, but work anyway, thanks to clever techniques to clock distribution.

TL;DR: Koshling makes the important point that electrons in a CPU move very slowly, but the voltage at different points propagates at the speed of light because it is the electromagnetic field, not electron drift, that transmits new voltage levels (like a clock signal) from one part of a CPU to another. Oric's point (that CPU size limits clock rates) is still valid, because clock rates have reached the point that the speed of light no longer seems "instantaneous".

Note: I was somewhat intentionally vague and clumsy in this description. The exact nature of electrons "bunching" and "bouncing" and "drifting" is not as simple as I made it seem, and I completely ignored the definitions of voltage and current and the connection to potential energy, charge, and flux. But hopefully this description was accurate enough to explain clock skew and the connection between CPU size and clock rate.

oynaz · May 23, 2013

Good post!

Search

How much faster can single-core CPUs be

bryanl

Golden Member

TuxDave

Lifer

Oric

Senior member

Sheep221

Golden Member

intx13

Member

koshling

Member

TuxDave

Lifer

intx13

Member

oynaz

Platinum Member

TRENDING THREADS