Future of CPU architecture

WaitingForNehalem · Jan 28, 2014

As someone wanting to get into CPU design, I always wonder what the future of CPU architecture holds. I've talked to industry experts and I've heard some very interesting and different things:

CPUs will be relegated to low power, low cost and that the future is really in software and the user experience.
The Von Neumann architecture has been exhausted and that more exotic architectures such as neural networks will take its place.
We are in the dark ages of parallelism and that highly parallel, many core CPUs will come after compiler breakthroughs.
Heterogeneous CPU/GPU architecture will take over.
Analog computers will make a comeback.

While I realize there currently isn't a need for more performance in conventional computing for most average users, things like computer vision, big data, security, and artificial intelligence will play a big role in the future.

As of now, CPU design has really stagnated and all of the performance tricks such as OoO execution, pipelining, instruction level parallelism, and branch prediction have all been used. In fact, many of these techniques are either scaled down or discarded to save power.

Since computing requirements won't stay constant, what do you think future CPU architectures will be like?

Kippa · Jan 28, 2014

I think that there will be a big shift with regards to internal networking in the way multiple cpus are connected with regards to a nodal network. For example rather than have one large chip with 48 cores you might get much smaller chips say 12 small chips with 4 cores instead, it is how they are connected and communicate which will be of greater importance.

I also believe that cpus will possibly become 3d dimension in design so that rather than a cpu being 1 layer, it might be a multiple layer say 10 or 20 layers like pancake. Also the possibility that a computer in with hundreds of cpus/cores that they might work asymmetrically, in that they aren't all working to the beat of the same clock.

gdansk · Jan 28, 2014

I find the Mill architecture very interesting, though I don't feel it is the only way forward.

Rakehellion · Jan 28, 2014

I think the modern CPU will be extinct and everything will be run on the GPU.

Homeles · Jan 28, 2014

Kippa said:
I also believe that cpus will possibly become 3d dimension in design so that rather than a cpu being 1 layer, it might be a multiple layer say 10 or 20 layers like pancake. Also the possibility that a computer in with hundreds of cpus/cores that they might work asymmetrically, in that they aren't all working to the beat of the same clock.

I agree. It's going to take some time for that shift to happen, though. But 3D logic is inevitable.

What I think will happen in the short term is that architectures will leverage the high bandwidth, low latency, relatively high capacity stacked memory solutions. By this, I mean the architectures will actually be optimized around stacked DRAM, rather than simply strapping stacked DRAM to today's modern processors.

What I'd really like to see is ternary computing. I am doubtful we'll see that concept see major adoption any time soon, if ever, but it does offer some serious advantages over binary.

I think the inevitable computing endgame will be with synthesized architectures. There's so much computation involved with making today's modern processors. Computational lithography is incredible:

The image on the far right is the "optimal" mask for creating the structure on the far left, but it's practically impossible for humans to solve for that on their own. Creating such a mask requires lots of computing power, though.

Rakehellion said:
I think the modern CPU will be extinct and everything will be run on the GPU.

Why would everything run on a high latency processor that's difficult to code for? While GPUs are becoming increasingly useful for computing, they're simply not going to ever displace CPUs.

ninaholic37 · Jan 28, 2014

What about all that graphene / quantum stuff?

tynopik · Jan 28, 2014

process stagnation

more special purpose blocks (heterogeneous) to make up for the lack of general performance increase

ARM

Homeles · Jan 29, 2014

tynopik said:
more special purpose blocks (heterogeneous) to make up for the lack of general performance increase

Yes.

ARM

Not necessarily.

WaitingForNehalem · Jan 29, 2014

One of the many issues facing highly parallel architectures is that certain algorithms can't be made to run in parallel. There still has to be good single threaded performance. Maybe a big.LITTLE-like approach where there is one large core surrounded by many small cores. This is similar to what AMD is going for but I think if the smaller cores used the same ISA as the large core, there would not be a huge hurdle to adoption like with their APUs. Something like Xeon Phi mixed with a single large, highly clocked Xeon core with a dedicated hardware scheduler all on one die. Again though, running certain tasks in hundreds of threads would be very difficult. Compiler support would have to improve tremendously.

Torn Mind · Jan 29, 2014

Rakehellion said:
I think the modern CPU will be extinct and everything will be run on the GPU.

I might be ignorant of a lot about how CPUs work, but damn, people do drink the Kool-Aid with regards to "parallel computing" without even bothering to understand it.

Telling 6 people to calculate 1+1 works in parallel.

But tell them to do a sequence such as Solve for X, then multiply X by 62 and call that result Y, and then subtract Y by 90 won't benefit from parallelism. One "core" will be waiting for X and the another for Y.

And don't even bother with logic. There will always be some sequentialness to solving logic problems, although you can tell multiple cores to solve more than one at the same time.

Kippa · Jan 29, 2014

What if you knew that x would take longer to do than y and process the problem asymmetrically giving a program weighting values depending on the problem they're trying to solve? For example if you were running x on one core and gave it a heavy weighting and overclocked it to 6ghz+ whilst leaving the other core solving y running at 1ghz on a low weighting. Basically going back to what I was saying about asymmetric processing and not all cores/cpus running at the same speed.

It makes me wonder, how fast could you get a current tech cpu based on the assumption that only 1 core is going to be maxed out all the time and the other running at much lower frequencies. Maybe 5ghz+ for one single core? So long as the others are running at much lower frequencies?

BrightCandle · Jan 29, 2014

I don't honestly know how they get out of the current funk. I would like to think the future is parallel but I have been writing parallel programs for well over a decade and nothing I have seen so far has fixed the problems we have in the real world with using it. Functional programming helps in some ways but it doesn't solve all the issues like all the other small advances in this direction. If that is the only possible future a whole class of problems that have no known parallel solution are going to turn out to be impossible to compute.

DigDog · Jan 29, 2014

i'd be happy with consumer-price boards with double CPU socket

Torn Mind · Jan 29, 2014

Kippa said:
What if you knew that x would take longer to do than y and process the problem asymmetrically giving a program weighting values depending on the problem they're trying to solve? For example if you were running x on one core and gave it a heavy weighting and overclocked it to 6ghz+ whilst leaving the other core solving y running at 1ghz on a low weighting. Basically going back to what I was saying about asymmetric processing and not all cores/cpus running at the same speed.

In my example Y depends on the value of X; it is the result of multiplying X by 62. It probably is better expressed as in equations.
Find X
X*62=Y
Y-90=?
So, finding Y involves how fast it takes to find X first and then how fast it takes to multiply X by 62.

I don't see how X can be solved at the same time as Y when Y needs X solved first. It seems this is more of a matter of programming the program such that it is aware of telling which CPU execution units should calculate a problem based on the difficulty of the problem

In your example, it seems that Y is a separate number to be solved. The value does not depend on the answer of another problem being solved. It seems to more of a matter of programming to the program to make sure solving for X is executed on the "big" core while Y is executed on the "little" core. Of course, the cores you speak can be traditional CPU cores. It is a matter of coding the program of which core to use

Anyway, my response was to a guy who said everything will be run on GPUs. As far as I know, on the GPU, the "lower level" design itself that is much different from the CPU. For example, while the CPU has typically only 3 ALUs per core, a GPU has many more. Your example is more of properly telling CPU execution units what to do. In a GPU, there are a ton of ALUs in one execution unit, and they are ready to do math simultaneously, such as solving 1+1 a thousand times in one fetch of the data.

CPUs actually also have instruction sets that do SIMD processing, such as MMX, SSE, etc.

But I'm no expert in this area, being pretty much a total igornamus of many of these concepts even until just now. Task-level parallelism vs data-level parallelism is something I'm still struggling to comprehend, and my knowledge of GPUs is practically non-existent. It seems that with SIMD multiprocessing, everything is executed simultaneously once the data is fetched from memory, so it doesn't seem feasible to do big-little, as that would slow things down, somehow.

Galatian · Jan 29, 2014

3D structures are all nice and dandy, but heat transfer will put a limit on that. We're already swing the problem with Ivybridge/Haswell where the smaller transitor size actually resulted in a smaller area the heat could be transferred. Add several layer above and beneath that, how are you going to cool that?

el etro · Jan 29, 2014

WaitingForNehalem said:
One of the many issues facing highly parallel architectures is that certain algorithms can't be made to run in parallel. There still has to be good single threaded performance. Maybe a big.LITTLE-like approach where there is one large core surrounded by many small cores. This is similar to what AMD is going for but I think if the smaller cores used the same ISA as the large core, there would not be a huge hurdle to adoption like with their APUs. Something like Xeon Phi mixed with a single large, highly clocked Xeon core with a dedicated hardware scheduler all on one die. Again though, running certain tasks in hundreds of threads would be very difficult. Compiler support would have to improve tremendously.

Big.Little... AMD+ARM packed in big.Little style

NTMBK · Jan 29, 2014

CPUs will have multiple, massively wide vector units, with ridiculous bandwidth from stacked DRAM, replacing on-die GPUs with a few extra CPU cores and using the vector capabilities of the cores to perform graphics operations.

sefsefsefsef · Jan 29, 2014

Here are the programs from 2013's "big 3" micoarchitectural research conferences:

http://www.microarch.org/micro46/files/program.html
http://www.carch.ac.cn/~hpca19/confprog.html
http://isca2013.eew.technion.ac.il/programs/main-program/

That will give you an idea about what academia and industry thinks the future is. I've noticed the following topics come up again and again, and I think they will be big in the next 5-10 years: 3D memory integration, 2.5D interposer integration, task-specific accelerators, heterogeneity, GPUs, and tiered memory.

It's hard to tell at this point if AMD's version of heterogeneity (HSA on an APU) is going to win out, but heterogeneity is absolutely required moving forward. Also, I'm a lot more bullish on parallel computing than most people here, especially moving forward. Here's how you will do it in the near-ish future: parallelize most of the computation and run it on a GPU (most interesting applications have some large component that can be parallelized), and then build or reuse an existing accelerator/ASIC for the necessarily sequential part.

WaitingForNehalem · Jan 29, 2014

Torn Mind said:
In my example Y depends on the value of X; it is the result of multiplying X by 62. It probably is better expressed as in equations.
Find X
X*62=Y
Y-90=?
So, finding Y involves how fast it takes to find X first and then how fast it takes to multiply X by 62.

I don't see how X can be solved at the same time as Y when Y needs X solved first. It seems this is more of a matter of programming the program such that it is aware of telling which CPU execution units should calculate a problem based on the difficulty of the problem

In your example, it seems that Y is a separate number to be solved. The value does not depend on the answer of another problem being solved. It seems to more of a matter of programming to the program to make sure solving for X is executed on the "big" core while Y is executed on the "little" core. Of course, the cores you speak can be traditional CPU cores. It is a matter of coding the program of which core to use

Anyway, my response was to a guy who said everything will be run on GPUs. As far as I know, on the GPU, the "lower level" design itself that is much different from the CPU. For example, while the CPU has typically only 3 ALUs per core, a GPU has many more. Your example is more of properly telling CPU execution units what to do. In a GPU, there are a ton of ALUs in one execution unit, and they are ready to do math simultaneously, such as solving 1+1 a thousand times in one fetch of the data.

CPUs actually also have instruction sets that do SIMD processing, such as MMX, SSE, etc.

But I'm no expert in this area, being pretty much a total igornamus of many of these concepts even until just now. Task-level parallelism vs data-level parallelism is something I'm still struggling to comprehend, and my knowledge of GPUs is practically non-existent. It seems that with SIMD multiprocessing, everything is executed simultaneously once the data is fetched from memory, so it doesn't seem feasible to do big-little, as that would slow things down, somehow.

But in the context of a program containing thousands of instructions, even now they aren't executed sequentially. In most higher-end CPUs instructions are executed out of order and techniques like pipeline forwarding are used to avoid data hazards in the pipeline. Also compilers themselves do a lot of optimization for the target CPU so that the assembly looks nothing like the C code you just wrote.

What if a large program could be subdivided to execute on many little cores with no data dependencies between them?

Homeles · Jan 29, 2014

Galatian said:
3D structures are all nice and dandy, but heat transfer will put a limit on that. We're already swing the problem with Ivybridge/Haswell where the smaller transitor size actually resulted in a smaller area the heat could be transferred. Add several layer above and beneath that, how are you going to cool that?

They'll find a way. We'll eventually have room temperature superconductivity anyway.

In the meantime, tunnel FETs and other transistor structures offer much better performance than planar FETs. We'll see those in the early 2020s, or perhaps a little earlier.

jdubs03 · Jan 29, 2014

i think we'll see tfets in 2018 with the introduction of 7nm (sige/iii-v)

Rakehellion · Jan 29, 2014

Homeles said:
Why would everything run on a high latency processor that's difficult to code for? While GPUs are becoming increasingly useful for computing, they're simply not going to ever displace CPUs.

Both of those things can be reworked, and will probably need to be sometime soon.

Torn Mind said:
I might be ignorant of a lot about how CPUs work, but damn, people do drink the Kool-Aid with regards to "parallel computing" without even bothering to understand it.

Telling 6 people to calculate 1+1 works in parallel.

But tell them to do a sequence such as Solve for X, then multiply X by 62 and call that result Y, and then subtract Y by 90 won't benefit from parallelism. One "core" will be waiting for X and the another for Y.

And don't even bother with logic. There will always be some sequentialness to solving logic problems, although you can tell multiple cores to solve more than one at the same time.

Those calculations won't benefit from parallelism. Neither will low-wattage eight-core chips, which is where the CPU industry is headed.

Cogman · Jan 29, 2014

My guess is that CPUs will work their way towards being 100% asynchronous. Right now, we have a pesky clock which is just holding us back man (and using a whole boatload of power while doing it).

A 100% async chip wouldn't use power unless it was doing something. No need for clock throttling and gating. Components would only use power when they are doing something (ok, there might be some gate leak, but that wouldn't be TOO bad of a power draw).

Why hasn't this been done? Because it is terribly hard and terribly different from anything we have done before it. Our CPUs today require precision timing, a fully async CPU would have to somehow overcome the need for that timing.

Cogman · Jan 29, 2014

ninaholic37 said:
What about all that graphene / quantum stuff?

Graphene solves issues of heat (I believe it is more resilient) and it can be clocked up to craziness. It doesn't, however, handle problems of power consumption. There is also an issue with the output voltage. Graphene circuits cut the output voltage by about 40x. This is very bad if you want to make complex circuits with it. The voltage attenuation thing is the biggest drawback currently (the next is the manufacturing process).

Quantum computers are interesting, but they aren't the be all end all of computing. They can solve some problems remarkably fast, but they aren't suited for all problems (just like your GPU isn't the greatest at running sequential things but does an awesome job at parallel data processing).

The biggest problem with quantum computing is the fact that it will require a completely new programming paradigm. Those don't come around very often. It is a concept that has been around for a long time, but it has remarkably few algorithms. Of those algorithms, they depend on very specific types of quantum computing.

Cogman · Jan 29, 2014

gdansk said:
I find the Mill architecture very interesting, though I don't feel it is the only way forward.

A couple of years ago I would have said that Mill had no shot. Now, in the age of mobile computing and android, it might be able to take a hold.

Future of CPU architecture

Platinum Member

Senior member

Diamond Member

Lifer

Platinum Member

Golden Member

Diamond Member

Platinum Member

Platinum Member

Lifer

Senior member

Diamond Member

Lifer

Lifer

Senior member

Golden Member

Lifer

Senior member

Platinum Member

Platinum Member

Golden Member

Lifer

Lifer

Lifer

Lifer