Future of CPU architecture

Cerb · Jan 30, 2014

WaitingForNehalem said:
Oh I see. It is interesting to see that with Nvidia's Denver and Apple's Cyclone, they are going for super wide architectures. That is assuming they are really 7-way superscalars and not just counting lots of micro-ops.

Apple's is somewhat of a mystery, though it's surely a 'normal' superscalar CPU. nVidia's, too, but whether it's VLIW or not, it will likely be simplified, compared to most CPUs, and able to run software in itself to run your software.

Galatian · Jan 30, 2014

Morbus said:
Submerge it in water.

Like our brain... you know... It's basically a 3D CPU...

I guess you are making a joke, but just to be on the safe side: the brain takes up a lot more area...it's a lot less dense then modern processors.

Morbus · Jan 31, 2014

Galatian said:
I guess you are making a joke, but just to be on the safe side: the brain takes up a lot more area...it's a lot less dense then modern processors.

Only partly joking. Conductive liquid wouldn't do it...

StrangerGuy · Jan 31, 2014

The future is software optimization needs a kick in the butt instead of getting lazy and eating the Moore's Law free lunch.

NTMBK · Jan 31, 2014

StrangerGuy said:
The future is software optimization needs a kick in the butt instead of getting lazy and eating the Moore's Law free lunch.

If you think it's so easy, go download the GCC and start coding. It's free, what's stopping you? I eagerly await your amazing revelations.

StrangerGuy · Jan 31, 2014

NTMBK said:
If you think it's so easy, go download the GCC and start coding. It's free, what's stopping you? I eagerly await your amazing revelations.

Yeah because software developers obviously work for free too and therefore aren't responsible for the quality of work.

What a nonsensical argument like movies reviewers don't have to right to trash lousy movies because they are so hard and expensive to make.

Cogman · Jan 31, 2014

NTMBK said:
If you think it's so easy, go download the GCC and start coding. It's free, what's stopping you? I eagerly await your amazing revelations.

*shudder* don't look at the GCC code base. Try the LLVM/Clang codebase first. GCC's code base is quite old and it shows.

Don't get me wrong. The GCC has done amazing things when it comes to optimizations. It really is a well functioning tool. Its just the code that supports it isn't for the faint of heart.

Cogman · Jan 31, 2014

StrangerGuy said:
Yeah because software developers obviously work for free too and therefore aren't responsible for the quality of work.

What a nonsensical argument like movies reviewers don't have to right to trash lousy movies because they are so hard and expensive to make.

Huh? No, his point is that as far as automated optimizations go, we are just about as good as we can get (for established languages). It is to the point where it is hard to hand write assembly to get faster performance than the compiler. MAYBE you can do it to get vectorization speed bumps, but even that is disappearing pretty fast (and most code can't be vectorized).

As for the "lazy developer" jib. The software industry currently operates on a "make it work, refine it if it becomes a problem" standard. They are much more concerned about making the software bug free and fixable than they are concerned about performance. And guess what? The guys that are paying them care about the same thing.

When looking at code performance, you often find that 90% of the time is spent on 10% of the code. Worrying about making 100% of the code as fast as possible means you waste a boatload of time on things that will never matter. Writing clean code, on the other hand, impacts yours and every developer who looks at the code in the future. The cleaner the code, the less money the company has to spend maintaining it.

For most applications, performance isn't even a secondary concern.

norseamd · Jan 31, 2014

a question about universal memory.

what are the benefits?

can you put 2 tb of memory next to the cpu?

are there still benefits if you divide the memory into close to cpu and large external memory?

what about using the memory for storage and having less for ram usage?

redhotiron2004 · Feb 1, 2014

norseamd said:
a question about universal memory.

what are the benefits?

can you put 2 tb of memory next to the cpu?

are there still benefits if you divide the memory into close to cpu and large external memory?

what about using the memory for storage and having less for ram usage?

The fact of the matter is that even if you can put 2tb of memory near the cpu. It would be extremely costly and the CPU won't be able/need to utilize more than some MB's of space. The CPU needs to perform calculations at a rate much faster than todays Ram. And so, the cache that is given to them is limited and quite costly. Unlike the Ram that we use today.

On the brighter side though. The overall responsiveness and speed would increase dramatically as you won't need separate slower ram for system or for graphics. But, you would still need a separate ssd to store and access data. Which is surely a bottleneck.

But the cost of implementing something like this would be so high. That minor improvements in speed and responsiveness could easily be offset looking at the cost.

The only benefit of keeping/dividing the memory close to Cpu and different memory(ram) for system is the cost factor. Which is significantly less.

Storage memory and ram memory are quite different. Storage memory is extremely slow. And unsuitable for running applications. And ram memory is much faster but can't store data without electricity. So, reducing ram would decrease the speed of the computer and not adviceable.

norseamd · Feb 1, 2014

http://en.wikipedia.org/wiki/Universal_memory

believe this is what i am talking about. i think it has more to do with combing ram and the data storage than combing ram and cache. although if they could combine the cache with the other two that seems like they would want to do that. but the difficulty of putting 2 tb of memory near the cpu is why i even asked in the first place. sram seems to still be faster than possible universal memory options. but you would be running from a ramdisk all the time.

NTMBK · Feb 1, 2014

norseamd said:
http://en.wikipedia.org/wiki/Universal_memory

believe this is what i am talking about. i think it has more to do with combing ram and the data storage than combing ram and cache. although if they could combine the cache with the other two that seems like they would want to do that. but the difficulty of putting 2 tb of memory near the cpu is why i even asked in the first place. sram seems to still be faster than possible universal memory options. but you would be running from a ramdisk all the time.

The main issue would be getting the CPU/GPU and the NVRAM storage to play nicely on the same process, but stacked dies would eliminate that problem.

norseamd · Feb 1, 2014

NTMBK said:
The main issue would be getting the CPU/GPU and the NVRAM storage to play nicely on the same process, but stacked dies would eliminate that problem.

so you would have the apu with cache on one level and the universal ram on the next level?

Cogman · Feb 1, 2014

norseamd said:
http://en.wikipedia.org/wiki/Universal_memory

believe this is what i am talking about. i think it has more to do with combing ram and the data storage than combing ram and cache. although if they could combine the cache with the other two that seems like they would want to do that. but the difficulty of putting 2 tb of memory near the cpu is why i even asked in the first place. sram seems to still be faster than possible universal memory options. but you would be running from a ramdisk all the time.

Assuming you can get the theoretical memory near SRAM speed, while maintaining DRAMs size, I would imagine that the new universal memory would find its first applications in L3 cache. That would probably equate to L3 caches going from 12mb to about 512MB in size (rough "I'm probably off by an order of magnitude or 2" estimate). And system memory would be replaced by it. Performance benefits would mostly impact highly threaded applications looking at large amounts of data.

I don't see the Hard drive -> memory -> CPU process disappearing. We may use the same memory at each stage, but the stages serve more than one purpose. I'm not sure that I would want those purposes jumbled (at least not initially).

NTMBK · Feb 1, 2014

norseamd said:
so you would have the apu with cache on one level and the universal ram on the next level?

That's what I'd go for, yes, but I don't know that much about stacked dies. I'm just assuming that the latency for communication between two stacked dies is still higher than accessing on-die cache, but I could be wrong.

norseamd · Feb 2, 2014

the universal memory could probably be built at smaller than 2.5 but larger than a modern processor. so the apu could be larger than is normally but with audio and networking processors built in as well as more space for cpu and gpu cores.

think that a combination of large x86 cores and small x86 cores would be the right thought

William Gaatjes · Feb 2, 2014

Here is my view on it together with a few questions ?

* When thinking of the x86 instruction set, with the arrival of x86-64 a lot of old rarely used instructions were removed. Any chance this will happen again to clean up the isa ? I see a lot of microcontrollers(exception the old pic based mcu's) and processors that use a 16 registers maximum in a load store architecture meaning all the operations are done in the registers.
Data is retreived from memory and operations are performed and then the results are written back to memory. Examples are AVR, ARM and MSP430 series.

How is X86 different from this originally and since the x86-64 bit mode ?

* I think the only way we can get faster cpu arcitectures is when the cpu and memory become a part of each other. A lot of cores each with its own memory. The problem is how to get high bandwidth, low latency on die interconnections between all these cores.

* I also think that we will end up with a "Big, lots of little" scenario. The big core for the single threaded code and also code that tells the big cpu to dispatch data to the other smaller but specialized processors.

* Write a program in a fashion that each thread has more or less its own cpu.
Each cpu can still switch threads but the trick is to perform thread load balancing with threads. Threads that depend on each other, must each be located at different cpu cores. When the cpu core has to wait, it switches to another thread. But then there must in the code also be instruction that indicate which threads must be run on seperate cpu cores.

For example :
Thread A and thread B are dependent on each other and need to wait for results from each other. Thread C and D are not and are independent.
Thread E and F are dependent on each other.
CPU core 1 runs thread A, C and E. And CPU core 2 runs thread B, D and F.
This way, the cores are constantly busy and when a thread stalls because it has to wait for input from another thread, the thread is swapped out for another thread that can be executed.

But this requires a lot of effort on the compiler and would need i think instructions to tell the cpu that a thread switch can be performed.
In the end, this would require a very tight cooperation between a compiler and the OS doing the scheduling of threads and the hardware. Now an OS does the thread scheduling and i wonder if it is possible to let a CPU do it by using some sort of instructions. CPU cores that communicate their running state with each other and to the OS by using some sort of exception.
*Another way might be that when a thread stalls, a cpu triggers an exception telling the OS that it needs another thread.

William Gaatjes · Feb 2, 2014

The more i think about it, the more pops up.
It is a lot more complex then i thought.

grimpr · Feb 2, 2014

WaitingForNehalem said:
As someone wanting to get into CPU design, I always wonder what the future of CPU architecture holds. I've talked to industry experts and I've heard some very interesting and different things:

CPUs will be relegated to low power, low cost and that the future is really in software and the user experience.

The Von Neumann architecture has been exhausted and that more exotic architectures such as neural networks will take its place.

We are in the dark ages of parallelism and that highly parallel, many core CPUs will come after compiler breakthroughs.

Heterogeneous CPU/GPU architecture will take over.

Analog computers will make a comeback.

While I realize there currently isn't a need for more performance in conventional computing for most average users, things like computer vision, big data, security, and artificial intelligence will play a big role in the future.

As of now, CPU design has really stagnated and all of the performance tricks such as OoO execution, pipelining, instruction level parallelism, and branch prediction have all been used. In fact, many of these techniques are either scaled down or discarded to save power.

Since computing requirements won't stay constant, what do you think future CPU architectures will be like?

Ramblings of a madman but truths spoken, I had lots of fun reading this book, im sure you will too.

http://www.rebelscience.org/download/cosa002.pdf

Also this fantastic talk at Hotchips 25 from ex-Intel engineer Dr. Robert Colwell that works in DARPA now makes you really think...

http://www.youtube.com/watch?v=JpgV6rCn5-g

norseamd · Feb 2, 2014

grimpr said:
Also this fantastic talk at Hotchips 25 from ex-Intel engineer Dr. Robert Colwell that works in DARPA now makes you really think...

transcripts

norseamd · Feb 2, 2014

William Gaatjes said:
Here is my view on it together with a few questions ? * When thinking of the x86 instruction set, with the arrival of x86-64 a lot of old rarely used instructions were removed. Any chance this will happen again to clean up the isa ? I see a lot of microcontrollers(exception the old pic based mcu's) and processors that use a 16 registers maximum in a load store architecture meaning all the operations are done in the registers. Data is retreived from memory and operations are performed and then the results are written back to memory. Examples are AVR, ARM and MSP430 series. How is X86 different from this originally and since the x86-64 bit mode ? * I think the only way we can get faster cpu arcitectures is when the cpu and memory become a part of each other. A lot of cores each with its own memory. The problem is how to get high bandwidth, low latency on die interconnections between all these cores. * I also think that we will end up with a "Big, lots of little" scenario. The big core for the single threaded code and also code that tells the big cpu to dispatch data to the other smaller but specialized processors. * Write a program in a fashion that each thread has more or less its own cpu. Each cpu can still switch threads but the trick is to perform thread load balancing with threads. Threads that depend on each other, must each be located at different cpu cores. When the cpu core has to wait, it switches to another thread. But then there must in the code also be instruction that indicate which threads must be run on seperate cpu cores. For example : Thread A and thread B are dependent on each other and need to wait for results from each other. Thread C and D are not and are independent. Thread E and F are dependent on each other. CPU core 1 runs thread A, C and E. And CPU core 2 runs thread B, D and F. This way, the cores are constantly busy and when a thread stalls because it has to wait for input from another thread, the thread is swapped out for another thread that can be executed. But this requires a lot of effort on the compiler and would need i think instructions to tell the cpu that a thread switch can be performed. In the end, this would require a very tight cooperation between a compiler and the OS doing the scheduling of threads and the hardware. Now an OS does the thread scheduling and i wonder if it is possible to let a CPU do it by using some sort of instructions. CPU cores that communicate their running state with each other and to the OS by using some sort of exception. *Another way might be that when a thread stalls, a cpu triggers an exception telling the OS that it needs another thread.

what else can we use bsides x86? power pc? ibm z? what other hpc cpu designs are there?

William Gaatjes · Feb 2, 2014

norseamd said:
what else can we use bsides x86? power pc? ibm z? what other hpc cpu designs are there?

We can use x86 but i wonder if there are instructions inside the cpu that are not efficient, meaning there is a better solution for it.
To be honest, the x86 instruction set is already converted into these streams of micro operations. and these streams are executed in parallel when possible. Nowadays, inside the cpu there are multiple buses (called ports) and multiple optimized execution units. A FPU unit, a vector unit, two or three ALU's for example. Inside a single core, lots of instructions really run in parallel (as streams of micro-ops) because there are all these optimized execution units.
Today's cpu designs are marvels of technology. The problem is with more execution units inside a cpu is that you need to keep feeding instructions faster and faster. That is a memory problem. The bigger the memory the longer the latency.

Maybe in the near future, DRAM might be replaced one day by ferroelectric RAM. But at the moment that is not going to happen. FERAM has advantages over DRAM but DRAM is even with its refresh cycle faster then FERAM. As long as that is the case, and FERAM is not available in the density DRAM is available .
There are already microcontrollers available with FERAM (FRAM).
The MSP430 FRAM series from Texas Instruments.

http://en.wikipedia.org/wiki/Ferroelectric_RAM

Or it is waiting for memristors technology.

William Gaatjes · Feb 2, 2014

Why i came up with FRAM ?
Because now the cpu is in general for most consumers fast enough.
But many customers would enjoy an "instant on" feature. Memory that does not forget. And FRAM has that advantage in comparison to DRAM. FRAM is not volatile.

jpiniero · Feb 2, 2014

Yeah, stacked nonvolatile memory on top of the CPU. Even if it hurts top end CPU performance (because of the additional heat now on top of the CPU) the responsiveness and potential power savings would more than make up for it.

I see it headed toward more integration on the cpu die/package beyond the memory and disk. Spend any space gained by die shrinks by adding more things to the cpu until it really is a 'System on a Chip'. That Moore's Law radio would be pretty useful.

norseamd · Feb 2, 2014

http://en.wikipedia.org/wiki/List_of_emerging_technologies

what about a thermal copper pillar bump

you guys also mentioned memristors which people seem to think are promising. what about spintronics

Future of CPU architecture

Elite Member

Senior member

Senior member

Diamond Member

Lifer

Diamond Member

Lifer

Lifer

Lifer

Member

Lifer

Lifer

Lifer

Lifer

Lifer

Lifer

Lifer

Lifer

Golden Member

Lifer

Lifer

Lifer

Lifer

Lifer

Lifer