Can a single core get much faster, or are we hitting a dead end?

GundamF91

Golden Member
May 14, 2001
1,827
0
0
There are two ways to make a processor faster: have a single core that runs faster, or have multiple cores working in parallel and split the work between them. It looks like the future is focused entirely on parallel processing, since we're not seeing single cores clock much higher than 4GHz (you can go higher, but only with dedicated cooling). So is this limited by the physical size of the atoms that make up the processor? Is there no chance for a single core to get much faster?
 

Gillbot

Lifer
Jan 11, 2001
28,830
17
81
Originally posted by: GundamF91
There are two ways to make a processor faster: have a single core that runs faster, or have multiple cores working in parallel and split the work between them. It looks like the future is focused entirely on parallel processing, since we're not seeing single cores clock much higher than 4GHz (you can go higher, but only with dedicated cooling). So is this limited by the physical size of the atoms that make up the processor? Is there no chance for a single core to get much faster?

I think most manufacturers have shifted their thinking toward multiple CPUs being the best option, as opposed to a faster single core. Now they're making what the consumer wants: more and more cores.
 

magreen

Golden Member
Dec 27, 2006
1,309
1
81
Don't forget that 4GHz != 4GHz. A Pentium 4 @4GHz < 1 core of a c2d @4GHz < 1 core of an i7 @4GHz.

No reason to think architectures won't keep improving.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Single cores have quite a bit of room for improvement still.

Clockspeed scaling clearly helps performance to first order (but not IPC), and improved branch prediction (expanded tables) combined with larger on-die caches will continue to improve the IPC.

The challenge for the logic guys is that the point of diminishing returns in parallelism is brought on pretty severely once the core count hits about 16. Amdahl's law really rears its ugly head at that point for all but the most embarrassingly parallel codepaths.

So don't be too worried about this most recent phase of "core wars" seeming to cause a slowdown in the pace of IPC advancements for the individual core. It's only temporary.

Another process node or two (circa 16nm) and the logic guys will have beaten the "add more cores" mantra to death, and they'll be back to the drawing board on the next best thing.
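To put rough numbers on that Amdahl's law point, here's a minimal sketch in C (the parallel fractions are made-up illustrative workloads, not measurements):

#include <stdio.h>

/* Minimal sketch of Amdahl's law: speedup = 1 / ((1 - p) + p / n),
   where p is the parallel fraction of the work and n is the core count.
   The fractions below are illustrative assumptions, not measurements. */
static double amdahl_speedup(double parallel_fraction, int cores)
{
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores);
}

int main(void)
{
    const double fractions[] = { 0.75, 0.90, 0.99 };      /* hypothetical workloads */
    const int core_counts[]  = { 2, 4, 8, 16, 32, 64 };

    for (int i = 0; i < 3; i++) {
        printf("parallel fraction %.2f:\n", fractions[i]);
        for (int j = 0; j < 6; j++)
            printf("  %2d cores -> %5.2fx speedup\n",
                   core_counts[j], amdahl_speedup(fractions[i], core_counts[j]));
    }
    return 0;
}

Even with a 90% parallel workload, going from 16 to 64 cores only takes you from roughly 6.4x to 8.8x, which is why the returns dry up so quickly.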
 

magreen

Golden Member
Dec 27, 2006
1,309
1
81
Originally posted by: Idontcare
Single cores have quite a bit of room for improvement still.

Clockspeed scaling clearly helps performance to first order (but not IPC), and improved branch prediction (expanded tables) combined with larger on-die caches will continue to improve the IPC.

The challenge for the logic guys is that the point of diminishing returns in parallelism is brought on pretty severely once the core count hits about 16. Amdahl's law really rears its ugly head at that point for all but the most embarrassingly parallel codepaths.

So don't be too worried about this most recent phase of "core wars" seeming to cause a slowdown in the pace of IPC advancements for the individual core. It's only temporary.

Another process node or two (circa 16nm) and the logic guys will have beaten the "add more cores" mantra to death, and they'll be back to the drawing board on the next best thing.

So do you think that, in order to improve them, the individual cores on the CPU will come to look like an SoC (system-on-a-chip) by today's standards? What I mean is, they'll pack more transistors onto them, and many/most of those transistors will be used for cache, creating a few GB of high-speed on-die cache. So from today's perspective it will be as if the system RAM were on the chip with the CPU. Of course, systems then will probably have terabytes of system RAM off the chip, so it won't look like an SoC from the perspective of the people designing it.
 

Cogman

Lifer
Sep 19, 2000
10,286
147
106
Well, there are really several things that can be done. Idontcare mentioned a few of them.

One thing that might help branch prediction is to not do it at all, and instead just take both branches (of course, if there are a lot of branches you can't do that forever, so eventually you'll still have to predict; however, for most applications I think that processing both branches would provide a decent speed increase).

The next would be to add more specialized processing units onto the processor. We have an ALU and an FPU, and now an IMC. Perhaps the next step will be something like adding media-decoding instructions to the CPU (they are already kind of doing this in SSE4, I know). But maybe an on-chip raytracer, or an on-chip physics processor, etc. The problem, of course, with doing this is making compilers aware of the new instructions and getting them to use them effectively (also, legacy chips kill a lot of new instructions).

We have room for improvement still. At the very least, if we hit an architectural limit, we can focus our efforts on die shrinks and clock speed increases.
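As a rough software analogy of the "take both branches" idea (a hedged sketch only; real eager execution happens in hardware, and the clamp function here is just a made-up example): compute both arms of a cheap branch and select the result afterwards, so nothing has to be predicted.

#include <stdint.h>
#include <stdio.h>

/* Software analogy of eager execution: evaluate both arms of a cheap branch
   and select one result afterwards, instead of predicting which arm runs.
   Only sensible when both arms are cheap and free of side effects. */
static int32_t clamp_to_zero_branchy(int32_t x)
{
    if (x < 0)                 /* the predictor has to guess this branch */
        return 0;
    return x;
}

static int32_t clamp_to_zero_eager(int32_t x)
{
    int32_t if_taken     = 0;                   /* result if the branch is taken */
    int32_t if_not_taken = x;                   /* result if it is not taken     */
    int32_t mask         = -(int32_t)(x < 0);   /* all ones when x < 0, else 0   */
    return (if_taken & mask) | (if_not_taken & ~mask);
}

int main(void)
{
    printf("%d %d\n", clamp_to_zero_branchy(-5), clamp_to_zero_eager(-5));  /* 0 0 */
    printf("%d %d\n", clamp_to_zero_branchy(7),  clamp_to_zero_eager(7));   /* 7 7 */
    return 0;
}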
 

soonerproud

Golden Member
Jun 30, 2007
1,874
0
0
Originally posted by: GundamF91
There are two ways to make processor faster, having a single one that works faster, or having multiples of them working in parallel and split the work. It looks like the future is fully focused on parallel processing, since we're not seeing single core to perform much higher than 4Ghz (you can go higher but only with dedicated cooling). So is this limited by the physical size of the atoms that make up the processors? Is there no chance for single core to get much faster?

Using clock speed to gauge how fast a CPU runs is flawed. Even though we have moved to multiple cores, CPUs are getting more efficient at the same time, meaning that each core is becoming faster at the same clock speed. So in essence, single cores are becoming faster as they become more efficient in how much work they can perform per clock cycle. This is why a Phenom II with all but one core disabled is faster than a single-core A64 at the same clock speed.
 

magreen

Golden Member
Dec 27, 2006
1,309
1
81
Originally posted by: Cogman
Well, there are really several things that can be done. Idontcare mentioned a few of them.

One thing that might help branch prediction is to not do it at all, and instead just take both branches (of course, if there are a lot of branches you can't do that forever, so eventually you'll still have to predict; however, for most applications I think that processing both branches would provide a decent speed increase).

The next would be to add more specialized processing units onto the processor. We have an ALU and an FPU, and now an IMC. Perhaps the next step will be something like adding media-decoding instructions to the CPU (they are already kind of doing this in SSE4, I know). But maybe an on-chip raytracer, or an on-chip physics processor, etc. The problem, of course, with doing this is making compilers aware of the new instructions and getting them to use them effectively (also, legacy chips kill a lot of new instructions).

We have room for improvement still. At the very least, if we hit an architectural limit, we can focus our efforts on die shrinks and clock speed increases.

Hey, that's great. I'd never thought of that. That's a fascinating use for multiple cores after we've reached the stage where adding more parallelism doesn't help due to Amdahl's law, as idc eloquently expressed. Even if we can't process more threads (i.e. there aren't any more useful threads to process), we can run more and more of the possible branches. Kind of like a different type of quantum computing, where many possible outcomes are all computed, and you "collapse" the result down afterwards to the desired outcome.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: magreen
Originally posted by: Cogman
Well, there are really several things that can be done. Idontcare mentioned a few of them.

One thing that might help branch prediction is to not do it at all, and instead just take both branches (of course, if there are a lot of branches you can't do that forever, so eventually you'll still have to predict; however, for most applications I think that processing both branches would provide a decent speed increase).

The next would be to add more specialized processing units onto the processor. We have an ALU and an FPU, and now an IMC. Perhaps the next step will be something like adding media-decoding instructions to the CPU (they are already kind of doing this in SSE4, I know). But maybe an on-chip raytracer, or an on-chip physics processor, etc. The problem, of course, with doing this is making compilers aware of the new instructions and getting them to use them effectively (also, legacy chips kill a lot of new instructions).

We have room for improvement still. At the very least, if we hit an architectural limit, we can focus our efforts on die shrinks and clock speed increases.

Hey, that's great. I'd never thought of that. That's a fascinating use for multiple cores after we've reached the stage where adding more parallelism doesn't help due to Amdahl's law, as idc eloquently expressed. Even if we can't process more threads (i.e. there aren't any more useful threads to process), we can run more and more of the possible branches. Kind of like a different type of quantum computing, where many possible outcomes are all computed, and you "collapse" the result down afterwards to the desired outcome.

Chips nowadays are often (usually?) power-limited, meaning the design could run faster and reliably enough at a higher voltage and frequency, but nobody* wants to buy 200-watt processors. Given that branch prediction accuracies are well over 90%, it seems insane to me to quadruple the power (if you handle 2 branches) for a few percent performance. Also, in real-world integer code**, about 20% of the instructions are branches. With a conservative estimate of 32 instructions in-flight***, you're already dealing with ~6 branches (so your chip that burns 4X the power still has to use prediction for the majority of branches). There's an additional complexity too - if you decide to take a branch, you have to figure out where the "branch taken" path actually is. In some cases, it's encoded into the instruction, but sometimes it's a calculated value. If that calculation isn't finished yet, you have to guess where to jump to, which is 1 of ~2^32 options (anywhere in the address space), not just 1 or 2 options (taken or not). Even when there are only 2 branches in flight, you may not be able to guess where they branch to.

*Nobody = not enough people who are willing to pay enough extra
**Floating point code tends to have fewer branches, but prediction accuracy for floating point code tends to be in the 98%+ range, so you almost never mispredict the branches. Few branches + low mispredict rate = minimal possible performance benefit.
***P4 apparently had ~128, I can't find Phenom / i7 numbers.
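If anyone wants to play with the arithmetic above, here's a toy version of it (same assumed 20% branch ratio and the in-flight counts mentioned; these are estimates, not measured figures):

#include <math.h>
#include <stdio.h>

/* Toy re-derivation of the estimates above: with ~20% of instructions being
   branches, an N-instruction window holds ~0.2*N unresolved branches, and
   eagerly running both sides of every one needs 2^branches paths. */
int main(void)
{
    const double branch_ratio = 0.20;           /* rough integer-code estimate */
    const int windows[] = { 32, 96, 128 };      /* in-flight instruction counts */

    for (int i = 0; i < 3; i++) {
        double branches = branch_ratio * windows[i];
        printf("%3d in flight -> ~%4.1f branches -> ~%6.0f paths if both sides run\n",
               windows[i], branches, pow(2.0, branches));
    }
    return 0;
}

Even the smallest window already implies far more simultaneous paths than any power budget could justify.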
 

magreen

Golden Member
Dec 27, 2006
1,309
1
81
Originally posted by: CTho9305
Originally posted by: magreen
Originally posted by: Cogman
Well, there are really several things that can be done. Idontcare mentioned a few of them.

One thing that might help branch prediction is to not do it at all, and instead just take both branches (of course, if there are a lot of branches you can't do that forever, so eventually you'll still have to predict; however, for most applications I think that processing both branches would provide a decent speed increase).

The next would be to add more specialized processing units onto the processor. We have an ALU and an FPU, and now an IMC. Perhaps the next step will be something like adding media-decoding instructions to the CPU (they are already kind of doing this in SSE4, I know). But maybe an on-chip raytracer, or an on-chip physics processor, etc. The problem, of course, with doing this is making compilers aware of the new instructions and getting them to use them effectively (also, legacy chips kill a lot of new instructions).

We have room for improvement still. At the very least, if we hit an architectural limit, we can focus our efforts on die shrinks and clock speed increases.

Hey, that's great. I'd never thought of that. That's a fascinating use for multiple cores after we've reached the stage where adding more parallelism doesn't help due to Amdahl's law, as idc eloquently expressed. Even if we can't process more threads (i.e. there aren't any more useful threads to process), we can run more and more of the possible branches. Kind of like a different type of quantum computing, where many possible outcomes are all computed, and you "collapse" the result down afterwards to the desired outcome.

Chips nowadays are often (usually?) power-limited, meaning the design could run faster and reliably enough at a higher voltage and frequency, but nobody* wants to buy 200-watt processors. Given that branch prediction accuracies are well over 90%, it seems insane to me to quadruple the power (if you handle 2 branches) for a few percent performance. Also, in real-world integer code**, about 20% of the instructions are branches. With a conservative estimate of 32 instructions in-flight***, you're already dealing with ~6 branches (so your chip that burns 4X the power still has to use prediction for the majority of branches). There's an additional complexity too - if you decide to take a branch, you have to figure out where the "branch taken" path actually is. In some cases, it's encoded into the instruction, but sometimes it's a calculated value. If that calculation isn't finished yet, you have to guess where to jump to, which is 1 of ~2^32 options (anywhere in the address space), not just 1 or 2 options (taken or not). Even when there are only 2 branches in flight, you may not be able to guess where they branch to.

*Nobody = not enough people who are willing to pay enough extra
**Floating point code tends to have fewer branches, but prediction accuracy for floating point code tends to be in the 98%+ range, so you almost never mispredict the branches. Few branches + low mispredict rate = minimal possible performance benefit.
***P4 apparently had ~128, I can't find Phenom / i7 numbers.

Aha. I had no idea that branch prediction success was up above 90%. But if it's that accurate, can't you use the extra parallel power in a different strategy? Don't calculate the other branches, which have a <10% chance of turning out useful, as you explained. Instead, put more power into the calculations along the predicted branch. Instead of just using the otherwise-idle stages in the CPU's pipeline (is that what it does today?... I'm no expert in this) for the forward calculations of the predicted branch, use a separate core or cores.

Another point, but more of a question: once you're calculating the next prediction based on a prediction, is there more uncertainty in the prediction of what the next branch will be? If so, and the probabilities fall far below 90%, it might become useful to calculate both possible outcomes of that branch around the corner, since the probability of the non-predicted branch being correct is elevated. OTOH, if the only question is whether the branch you're calculating now will end up being useful, then the non-predicted side of the next branch prediction doesn't have a higher probability of being right. Sorry if I'm rambling ;)

But you could be right about it not being worth doubling/quadrupling the power for this. It all depends on whether there's extra power available in our thermal envelope that nobody minds burning, at some future date when there's plenty of power envelope to spare for extra transistors. Although if it's really quadrupling the power envelope like you said, then that would only be insignificant once those 16 cores use only 5W or so -- when your choice is between a 5W energy-efficient chip and a 20W barn burner ;)
 

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
Branch prediction is a dead end. :)

C2D has 96 instructions in flight, i7 has 128. But i7 has a mechanism to recover from an incorrect branch prediction even faster than C2D, without waiting for retirement up to the bad branch. I'm not sure if that feature is publicly announced yet so I won't describe it in detail. Anyways, the cost of sending speculative work to another core and the subsequent merge and kill would be prohibitively expensive and not much faster than the baseline case. The P4 replay system would send a single core down a deeply speculative path and that was already very power inefficient. By the time another core gets the speculative work, the master core would have already recovered, I would think.

Back to the original question: there are still plenty of optimizations left. There won't be any giant leaps, but single threads will still creep forward, if that is what you're concerned with.
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
Originally posted by: GundamF91
There are two ways to make processor faster, having a single one that works faster, or having multiples of them working in parallel and split the work. It looks like the future is fully focused on parallel processing, since we're not seeing single core to perform much higher than 4Ghz (you can go higher but only with dedicated cooling). So is this limited by the physical size of the atoms that make up the processors? Is there no chance for single core to get much faster?

Well, to answer the question: if you look at Intel's roadmaps, it looks like in 2011 we get a P4-to-C2D type jump in core performance. Intel adds AVX, which in theory should double FP throughput.

So programs that use FP heavily are going to see a dramatic increase in performance. Then in 2012, at 22nm, Ivy Bridge arrives. That's supposed to add FMA, which doubles FP again; how this relates to AVX is unknown to me. I also read that Intel would add DMA.

I can't wait to see Intel do this with a CISC backend. So it goes from 2-operand (Nehalem) to a 4-operand core with Sandy, with only 3-operand functionality on Sandy and 4-operand with FMA on Ivy Bridge.

In apps where FP is important, this is staggering. Not to mention Intel can increase AVX from 256 bits to 512 bits, all the way to 1024 bits. That is a monumental increase in FP on a CPU in just three years.
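For what FMA actually buys you, here's a minimal sketch (fma() is the standard C99 math routine, which fuses the multiply and add with a single rounding, the same thing the hardware instruction does; the dot product is just a made-up example):

#include <math.h>
#include <stdio.h>

/* Minimal sketch of what FMA means: a*b + c as one fused operation (one
   instruction, one rounding) instead of a separate multiply and add.
   fma() is the standard C99 routine; on FMA hardware it maps to one instruction. */
static double dot(const double *a, const double *b, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc = fma(a[i], b[i], acc);    /* acc = a[i] * b[i] + acc, fused */
    return acc;
}

int main(void)
{
    double a[4] = { 1.0, 2.0, 3.0, 4.0 };
    double b[4] = { 0.5, 0.5, 0.5, 0.5 };
    printf("dot = %f\n", dot(a, b, 4));   /* 5.000000 */
    return 0;
}

The doubling comes from getting a multiply and an add out of every issued operation instead of one or the other.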

 

PandaBear

Golden Member
Aug 23, 2000
1,375
1
81
In the early 90s it was RISC vs. CISC.
Then it was the internal cache.
Then it was pipelining, branch prediction, and superscalar execution that changed the world.
Then it was the integrated memory controller.
Then it was dual / quad core.

Now I think it is how fast you can get code into the CPU, because memory has hit a limit, and the designs that get around the memory limit better will win.

We'll probably start seeing dedicated co-processors for various operations, with the north bridge and GPU integrated soon, because we are running out of ways to get a significant speed boost at low power/cost.

When that reaches its end, well, you'll probably have an FPGA onboard that can be reprogrammed on the fly to run hardware-accelerated processing instead of doing it in software.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: CTho9305
Chips nowadays are often (usually?) power-limited, meaning the design could run faster and reliably enough at a higher voltage and frequency, but nobody* wants to buy 200-watt processors. Given that branch prediction accuracies are well over 90%, it seems insane to me to quadruple the power (if you handle 2 branches) for a few percent performance. Also, in real-world integer code**, about 20% of the instructions are branches. With a conservative estimate of 32 instructions in-flight***, you're already dealing with ~6 branches (so your chip that burns 4X the power still has to use prediction for the majority of branches). There's an additional complexity too - if you decide to take a branch, you have to figure out where the "branch taken" path actually is. In some cases, it's encoded into the instruction, but sometimes it's a calculated value. If that calculation isn't finished yet, you have to guess where to jump to, which is 1 of ~2^32 options (anywhere in the address space), not just 1 or 2 options (taken or not). Even when there are only 2 branches in flight, you may not be able to guess where they branch to.

*Nobody = not enough people who are willing to pay enough extra
**Floating point code tends to have fewer branches, but prediction accuracy for floating point code tends to be in the 98%+ range, so you almost never mispredict the branches. Few branches + low mispredict rate = minimal possible performance benefit.
***P4 apparently had ~128, I can't find Phenom / i7 numbers.

Awesome post :thumbsup:

Originally posted by: soonerproud
Using clock speed to gauge how fast a CPU runs is flawed.

Is a 3GHz PhII faster than a 2.8GHz PhII? Is a 3.33GHz i7 faster than a 2.93GHz i7?

Clockspeed is a perfectly effective metric for gauging processor speed. But just as with all characterization metrics, it can become ineffective if employed and interpreted outside its applicable (intended) context. A 4GHz P4 is not faster than a 3.2GHz A64, because now I've changed more than one variable in my comparison (clockspeed and architecture (IPC)).

The metric is not flawed, nor is its use, but some do use the metric in a flawed manner. User error.
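To spell out the single-variable point, a toy model (performance ~ IPC x clock; the IPC numbers below are invented purely for illustration):

#include <stdio.h>

/* Toy model: performance ~ IPC * clock. Holding IPC fixed, clockspeed alone
   ranks chips correctly; comparing across architectures changes two variables.
   The IPC figures below are invented purely for illustration. */
int main(void)
{
    /* Same architecture, different clocks: clockspeed alone is a valid ranking. */
    double ipc_same = 1.0;
    printf("3.0GHz vs 2.8GHz, same IPC: %.2f vs %.2f\n", ipc_same * 3.0, ipc_same * 2.8);

    /* Different architectures: a higher clock no longer implies faster. */
    double ipc_p4_like = 0.5, ipc_a64_like = 0.9;   /* hypothetical relative IPC */
    printf("4.0GHz 'P4-like' vs 3.2GHz 'A64-like': %.2f vs %.2f\n",
           ipc_p4_like * 4.0, ipc_a64_like * 3.2);
    return 0;
}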

Originally posted by: Nemesis 1
Well, to answer the question: if you look at Intel's roadmaps, it looks like in 2011 we get a P4-to-C2D type jump in core performance. Intel adds AVX, which in theory should double FP throughput.

So programs that use FP heavily are going to see a dramatic increase in performance. Then in 2012, at 22nm, Ivy Bridge arrives. That's supposed to add FMA, which doubles FP again; how this relates to AVX is unknown to me. I also read that Intel would add DMA.

I can't wait to see Intel do this with a CISC backend. So it goes from 2-operand (Nehalem) to a 4-operand core with Sandy, with only 3-operand functionality on Sandy and 4-operand with FMA on Ivy Bridge.

In apps where FP is important, this is staggering. Not to mention Intel can increase AVX from 256 bits to 512 bits, all the way to 1024 bits. That is a monumental increase in FP on a CPU in just three years.

It makes sense to incorporate a heterogeneous processing-unit hierarchy at some point around the Haswell time-frame, where Larrabee-like cores are brought onto the die to carry out the embarrassingly parallel stuff where having 100 threads processing at the same time is still an effective use of transistors, power consumption, and die space.

Basically, have the hardware replicate processing units for a subset of the ISA which handles the kind of instructions one would expect to have in applications that would effectively scale beyond 16 or so cores. Video and audio (media) ISAs... your AVX stuff, etc.

Kind of like (barely like it, really) the initial implementation of Niagara, where the 8 cores really were for processing integer threads and the chip only had a single FP unit to be shared. In the 90nm process technology for Niagara, prioritizing integer processing at an 8:1 ratio over FP made sense for the applications in mind when bounded by die size, transistor count, clockspeed, and cost.
 

soonerproud

Golden Member
Jun 30, 2007
1,874
0
0
Originally posted by: Idontcare
Is a 3GHz PhII faster than a 2.8GHz PhII? Is a 3.33GHz i7 faster than a 2.93GHz i7?

Clockspeed is a perfectly effective metric for gauging processor speed. But just as with all characterization metrics, it can become ineffective if employed and interpreted outside its applicable (intended) context. A 4GHz P4 is not faster than a 3.2GHz A64, because now I've changed more than one variable in my comparison (clockspeed and architecture (IPC)).

The metric is not flawed, nor is its use, but some do use the metric in a flawed manner. User error.

Isn't that the same thing I just said? The way clock speed was used in the past by Intel, Dell, and others to decide whether one CPU is faster than another, however flawed and error-filled it was, became the industry standard. For years AMD users said instructions per clock cycle was a better way to gauge a CPU's speed and efficiency, yet Intel fans were in a race for the CPU with the fastest clock regardless of how efficient the CPU actually was (the P4 era). AMD was proved right, and Intel was forced to adopt an architecture that focused more on instructions per clock cycle instead of pure clock speed, because they hit a wall in the GHz race due to massive power increases and were being spanked in benchmarks by AMD's A64 processors.

Most consumers still look to GHz first as the gauge of how fast any CPU is, regardless of efficiency and architecture differences, because the industry convinced them long ago this was the best way to pick the better CPU.
 

Elias824

Golden Member
Mar 13, 2007
1,100
0
76
So die shrinks can help a lot in terms of GHz, but what is the limit on how small we can make transistors? I know there is a limit to the manufacturing techniques that is fast approaching, but what about the limits of physics? What's the smallest device we can make that holds a charge?
 

A5

Diamond Member
Jun 9, 2000
4,902
5
81
Originally posted by: Elias824
So die shrinks can help a lot in terms of GHz, but what is the limit on how small we can make transistors? I know there is a limit to the manufacturing techniques that is fast approaching, but what about the limits of physics? What's the smallest device we can make that holds a charge?

1 Si atom takes up ~0.2nm, but I have no idea what the minimum number of atoms it takes to make a useful transistor is.

Intel believes that they can take current lithography down to 11nm, but most researchers think that getting there (and especially beyond that) will require a move to EUV, which uses much higher-energy photons to pattern the chip. EUV is apparently having some issues, but I won't bore you with that.

EUV: http://en.wikipedia.org/wiki/E...ltraviolet_lithography
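As a back-of-envelope sanity check on how close that is to atomic scale (using only the rough ~0.2nm-per-atom figure above; feature width is a crude stand-in for what lithography actually draws):

#include <stdio.h>

/* Back-of-envelope: how many silicon atoms span one feature width, using the
   rough ~0.2nm-per-atom figure mentioned above. */
int main(void)
{
    const double atom_nm = 0.2;                               /* rough Si atom size */
    const double nodes[] = { 45.0, 32.0, 22.0, 16.0, 11.0 };  /* process nodes, nm  */

    for (int i = 0; i < 5; i++)
        printf("%4.0fnm feature ~ %3.0f atoms across\n", nodes[i], nodes[i] / atom_nm);
    return 0;
}

At 11nm you're looking at something on the order of 50 atoms across a feature, which is why people start worrying about physics rather than just manufacturing.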
 

magreen

Golden Member
Dec 27, 2006
1,309
1
81
Originally posted by: raisethe3
Single core is pretty much dead these days.
If I understand your post, I don't think you're quite grasping the discussion here. Nobody's discussing whether a single-core CPU is better than a multi-core CPU. The question is whether there's room for each of the individual cores in a multi-core CPU to get faster, or whether the speed of each core has reached its limit and the only way to improve CPUs is to add more and more cores.

<gratuitous car analogy>
We're talking about whether each cylinder in a V6 engine can be made larger and more powerful. Nobody's talking about whether a single-cylinder engine is better than a V6.
</gratuitous car analogy>

EDIT: Hey, that's my 1000th post. I'm feelin' golden!
 

magreen

Golden Member
Dec 27, 2006
1,309
1
81
Originally posted by: Idontcare
Originally posted by: magreen
EDIT: Hey, that's my 1000th post. I'm feelin' golden!

Congrats magreen! :thumbsup: 1k in the bag. What a neffer.

Thanx. I must be if I replied 2 mins after you posted ;)
 

Spoelie

Member
Oct 8, 2005
54
0
0
Just to add: Intel's EPIC processors do calculate multiple branches and discard the invalid results. I'm not convinced it would double power at all to calculate 2 branches at the same time. However, the EPIC instruction set is specifically designed to facilitate this; I'm not quite sure how feasible it would be on an x86 processor.

EDIT: It's called branch predication
http://www.cs.umd.edu/class/fa...11.htm#Intel%E2%80%99s
 

Vee

Senior member
Jun 18, 2004
689
0
0
Of course single-threaded, single-core performance can be increased.
I don't think the point has been made directly here yet: I think the question is how you get the best value for your transistor budget. And in those terms we have come to a point where diminishing returns often make more cores more interesting (= more computing done) than larger cores.

But there is one aspect where there is plenty of room to increase performance, with good returns on the transistors dedicated to it, and that is vectorized and floating-point computing. This is a fairly simple case of increasing the computing width of the processor: more execution pipes, longer vectors, wider memory access.
I also see a good deal of future demand for this kind of computing performance. And this is already the method Intel has used to boost benchmark results for the P4 and C2.
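To make "increasing the computing width" concrete, a minimal sketch with plain SSE intrinsics (the standard <xmmintrin.h> ones; four float adds per instruction versus one, and AVX just pushes the same idea wider):

#include <stdio.h>
#include <xmmintrin.h>   /* SSE: 128-bit vectors, four floats per operation */

/* Minimal sketch of "wider computing": the scalar loop does one add per
   iteration, the SSE loop does four. Wider vectors (AVX and beyond) push
   the same idea further. */
static void add_scalar(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

static void add_sse(const float *a, const float *b, float *out, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));   /* four adds at once */
    }
    for (; i < n; i++)                                /* leftover elements */
        out[i] = a[i] + b[i];
}

int main(void)
{
    float a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    float b[8] = { 8, 7, 6, 5, 4, 3, 2, 1 };
    float c[8], d[8];

    add_scalar(a, b, c, 8);
    add_sse(a, b, d, 8);
    printf("%.1f %.1f\n", c[0], d[7]);   /* 9.0 9.0 */
    return 0;
}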

And AMD, Intel, and NVIDIA all intend to radically ramp up this kind of performance, in different ways. Whether anything really dramatic ever evolves from this will, in my estimate, depend on the market situation. That is, an Intel monopoly will lead to only one or a few iterations before stagnation, and that will not change the computing landscape by opening up new fields for applications.
So, despite the fact that his dearly beloved Intel is the least interesting part, and maybe even the obstacle, Nemesis1 actually has a point this time, in his own confusing way. :cookie: ;)