Future of CPU architecture

NTMBK · Jan 29, 2014

WaitingForNehalem said:
What if a large program could be subdivided to execute on many little cores with no data dependencies between them?

They've been trying to do that for decades, and it's still not working well. Automagical compilers won't solve everything. Some problems just don't parallelise well.

Phynaz · Jan 29, 2014

Rakehellion said:
I think the modern CPU will be extinct and everything will be run on the GPU.

Not going to happen. What's the standardized instruction set?

How's the low thread count performance?

I think MIT just published an article about why the GPU will not replace the CPU.

Torn Mind · Jan 29, 2014

WaitingForNehalem said:
But in the context of a program containing thousands of instructions, even now they aren't executed sequentially. In most higher-end CPUs instructions are executed out of order and techniques like pipeline forwarding are used to avoid data hazards in the pipeline. Also compilers themselves do a lot of optimization for the target CPU so that the assembly looks nothing like the C code you just wrote.

What if a large program could be subdivided to execute on many little cores with no data dependencies between them?

Parallelizing instructions is not the same as parallelizing operations on data. You can even parallelize both, which is the case on a multi-core CPU. This much I do know.

I am still trying to read up on the material. It seems that what you say doable to some extent right now via coding to program to use instruction sets such as AVX or GPU offloading. But to change the whole hardware design of the CPU into a more GPU-design and program everything to be done with SIMD processing, never mind backwards compatibility with apps such as Word, which probably is coded to interact with a typical CPU design.

It seems that understanding parallel computing involves understanding what exactly is parallelized; not every form of parallel computing involves the same "tricks" and hence the hardware design at the transistor level can vary. Hyperthreading, Bulldozer's modularity is also "parallel computing" but much different from a GPU.

In addition, it seems understanding data and instructions is also need. SISD, MISD, SIMD, MIMD provide a clear categorization of whether data and/or instruction execution are parallelized. As far as I can understand the whole modern CPUs are MIMD with each individual core able to do SIMD or SISD depending on the program.

What GPUs use is SIMD, short for single instruction, multiple data. This apparently refers to how the processing works. One of the math problems those units do well in is solving operations involving vectors. For us, we can understand vector ops by using matrices. So if matrix A is [1 2 3] and matrix B is [2 4 6], basic vector ops can be performed. For example doing vector addition does the following:
A+B=[1 2 3] + [2 4 6] = [1+2 2+4 3+6]
So, in a GPU "core" with parallel architecture, one ALU would do 1+2, another does 2+4, and yet another does 3+6. So modern graphics cards can process these huge vector operations real quickly.

Out of order execution is actually not present on a GPU "core", I think. Everything is done in order, but because there are so many ALUs operating at the same time, many problems are executed simultaneously.

CPUs do implement SIMD instruction sets, thus enabling hardware support for SIMD processing if coders want to have their application take advantage of those instruction sets, such as SSE, AVX, and other instruction sets. The coder would tell the program to use those instruction set. It could be very possible for the portions of code that take advantage of SSE or AVX to also be computed on a GPU.

el etro · Jan 29, 2014

NTMBK said:
CPUs will have multiple, massively wide vector units, with ridiculous bandwidth from stacked DRAM, replacing on-die GPUs with a few extra CPU cores and using the vector capabilities of the cores to perform graphics operations.

ARM cores help would be to help on less demanding operations in order to idle Big AMD cores and save power.
It's all hypothetical, i'm talking about ARM+X86 CPU cores.

Cerb · Jan 29, 2014

Kippa said:
It makes me wonder, how fast could you get a current tech cpu based on the assumption that only 1 core is going to be maxed out all the time and the other running at much lower frequencies. Maybe 5ghz+ for one single core? So long as the others are running at much lower frequencies?

You do know that Sandy Bridge came out in 2011, and that we've now had 3 generations of Intel CPUs, and now 2 generations of AMD CPUs that do that...right? The answer you seek is out there for sale, right now, and it's on the order of +500MHz to go from 4 cores at low speed to 1 at max and others doing nothing.

el etro said:
ARM cores help would be to help on less demanding operations in order to idle Big AMD cores and save power.
It's all hypothetical, i'm talking about ARM+X86 CPU cores.

Why? If that is a good thing to do, just use a weaker x86 core. There's nothing special ARM has on that front, and both Intel and AMD could do it with x86.

Cerb · Jan 29, 2014

WaitingForNehalem said:
Again though, running certain tasks in hundreds of threads would be very difficult. Compiler support would have to improve tremendously.

Not so. OS bottlenecks would have to be worked around, as many of the scaling features are currently theoretical in nature. But, running tasks in hundreds of threads can be done with current compilers just fine, and has been done. Powerful web and DB servers have done exactly this, and without anything exotic being used. But, what's special about them? Shared data is all read-only, and non-shared data is all thread-exclusive, so it's logically simple to reason the correctness of, until it hits writes need to reach the DBMS' and/or OS' storage layers. Compiler technology will keep improving, but like other computing problems, the low-hanging fruit has already been taken, and there won't be a magic compiler to fix your program.

It's the complexity of shared memory that gets you, with sufficiently parallelizable tasks, and languages that require you to tell the computer how to handle all the fine details of managing those threads. A compiler cannot and will not make C easy to use, for arbitrary scaling out, if you need to work with shared data structures that regularly get updated. That's where transactional memory, which basically applies speculation like speculative OOOE and branch prediction to arbitrary memory operations, to help get around the emergent complexity of locks and mutexes, as the count of potential accessors, and data protection structures, grows. But it still only will help, and the languages used need to be able to abstract the finer details of memory, to allow the compiler to do that grunt work.

Languages and compilers that can handle this already exist, including support for Intel's TSX, on existing CPUs. The problem we face is one of square pegs and round holes, more or less. You need to use some amount of code that can't take advantage of the feature, yet, because of how it is written, and because of that, you have no incentive to make the expensive and disruptive change to use a language/environment that may bypass these problems, because that code will be a huge bottleneck. C++ is getting more amenable to parallel work, Linux--I assume Windows and FreeBSD, too--are working on helping out in the kernels, with support for these new CPU features, and common libraries are working on it, too, but it's going to take a long time for them to all catch up. And then the programmers will have to catch up, which is a large expense for most businesses. And then you'll have to have to be dealing with work that can scale out well enough, to boot.

Cerb · Jan 29, 2014

WaitingForNehalem said:
CPUs will be relegated to low power, low cost and that the future is really in software and the user experience.

Th first part hasn't happened yet. I'm not sure if it will come to pass or not, TBH. Hardware in general won't, but the CPU very well could, as other bottlenecks become more important (even now, "the CPU," encompasses a great deal of networking and storage handling, FI). The second part has been under way for years, and rightfully so: we've needed it.

The Von Neumann architecture has been exhausted and that more exotic architectures such as neural networks will take its place.

Only if you drink the Kool-aid that multiple Von Neuman machines strung together are no longer based on a Von Neuman architecture, or that virtual memory breaks VN architecture, or caches do, etc..

While different than the basic architecture described in the 40s, they are all derived directly from it, and share its basic workings. To make use of something not that, code written for said CPU would need to have some way of being validated by other means than instruction type, order, register allocation order, and memory access location and order, which would be difficult and time-consuming, to the point that nobody wants to do that, except in very low cost/space/power embedded situations, where the code base used is small and targeted to a specific application.

If you consider Von Neumann to only be a pure Von Neumann, we've already long since ditched it, rather than it being some future thing. More importantly, forget the, "novel CPU," circle jerking, and look at memory. Memory is what makes us slow, today, from SRAM caches to DRAM caches to inches-away DIMMs, to feet-away disks to milliseconds-away servers with the info you need right now.

Other architecture types, that truly aren't Von Neumann-based, are and will remain used for special tasks, where they can be thousands of times faster, at a fraction of the cost...but are a PITA to develop for, so remain niche. A lot of them are cool, and it would probably be much more enjoyable to work on one--and its supporting software stack--than a general purpose CPU, but they aren't going to become general purpose CPUs.

We are in the dark ages of parallelism and that highly parallel, many core CPUs will come after compiler breakthroughs.

Nope. We are in the Renaissance age of parallelism. Those CPUs exist, though aren't common, and programming languages are catching up. You can tinker with Erlang, Haskell, Ocaml, etc., and see the future, but you have to wait for that future stuff to work with C, C++, C#, etc., before it makes sense to put into production.

The compiler crap has been a silly myth since ever high level languages came into being, and will stay that way. The work needs to be described to the compiler in a way that facilitates thread-level parallelism. Compilers that can handle the work after that already exist, and have for years (not decades, but maybe up to 15 years, depending on your you define it), but programming languages that you can reasonably use with existing software bases, are still works in progress. C and C++, FI, have so many ticky little rules that you can't just write code that has no logical dependencies and expect a miracle--it takes effort and experience, on top of knowing what you're trying to accomplish, performance-wise, without premature optimization. Java and C# are basically impossible to add any significant parallelization to, without being explicit. Magical compilers that will auto-paralellize code written sequentially have been, are, and will remain, a fantasy, and common programming languages aren't well suited to not writing sequentially. The compiler to cure all serializing ills is like cold fusion--it might be possible, but don't expect to see it in your lifetime.

Also, as long as memory is the bottleneck that it is, and power efficiency isn't improved by leaps and bounds, predication will suck, as an option for low-parallelism algorithms, just as it has in the past. However, if new memory technologies can allow many random accesses over just a few wires, then game on (Google "wish joins" for a couple papers describing an efficient methodology for this sort of thing to be added, that builds upon common speculative features already in most high-performance processors).

Heterogeneous CPU/GPU architecture will take over.

Already happening. Due to the above programming language related issues, it's not happening super fast, but it is happening. The early CPUs/SoCs are already in consumers' hands (in many cases, literally), popular applications use the GPU for more than just 3D, and proper combination of support features is being worked on by various means, including HSA.

Analog computers will make a comeback

Doubtful, but it would be interesting. Now, we're getting things like SDR, showing that simple digital processors are fast enough to gradually replace fixed-function digital/analog combo units (they're still quite specialized, but not to the degree they were, and have allowed for things like firmware updates to draft hardware when a spec comes out, firmware updates to meet some random country's new regulations, etc..). Processing bit streams can be done plenty fast, if there's never a question of which bits need to where and in what order.

While I realize there currently isn't a need for more performance in conventional computing for most average users

They are insulated from it, but if you get out of the world that marketing has made, it's still there, but in a very different form than 10+ years ago. They don't care what went on to make their new iPad/Nexus/Note faster, they just know it is and that was worth upgrading for. The work to making mobile computers better, and networked infrastructures function well enough for average users, is nowhere near complete, and has been a major paradigm shift. Mobile is becoming conventional. Performance in terms of raw operations per time in a single thread, though, have hit a wall due to memory, and exceed the needs of most users, when power consumption is not an issue.

Since computing requirements won't stay constant, what do you think future CPU architectures will be like?

GPGPU and/or GPDSP (such as Hexagon), with programming languages abstracting them, will finally get enough support to get widespread use, leading to better implementations of them and wider adoption of varied competing designs. Along with this, more and more R&D money will go into on-chip and chip-to-chip networking, which is more of an unknown.

A simple 2D grid like Tilera has should be fine for real-time applications and basic networking tasks, but it's not going to be usable for general-purpose work. OTOH, a 3D mesh like Fujitsu has (6D by their marketing) is going to scale too poorly to be used anywhere but expensive clusters (which is fine for Fujitsu, since that's what it's made for), unless 3D transistor layouts can be made cheaply, allowing layers of connected 2D grids. More flexible topologies like AMD has allowed with HT, OTOH, or that proprietary networking vendors have added that run over PCIe, require a lot of software effort to make efficient, so aren't good general fits, on top of being expensive add-ons. There's a lot to do, here, a lot of CPU time spent waiting to do something, and nobody yet has it truly figured out.

rtsurfer · Jan 29, 2014

Cogman said:
My guess is that CPUs will work their way towards being 100% asynchronous. Right now, we have a pesky clock which is just holding us back man (and using a whole boatload of power while doing it).

A 100% async chip wouldn't use power unless it was doing something. No need for clock throttling and gating. Components would only use power when they are doing something (ok, there might be some gate leak, but that wouldn't be TOO bad of a power draw).

Why hasn't this been done? Because it is terribly hard and terribly different from anything we have done before it. Our CPUs today require precision timing, a fully async CPU would have to somehow overcome the need for that timing.

I believe ARM SOCs already have asynchronous clocking.

Although I think it is harder to implement on the more complex X86 CPUs.

Cerb · Jan 29, 2014

Galatian said:
3D structures are all nice and dandy, but heat transfer will put a limit on that. We're already swing the problem with Ivybridge/Haswell where the smaller transitor size actually resulted in a smaller area the heat could be transferred. Add several layer above and beneath that, how are you going to cool that?

With a directly-soldered, or otherwise bonded, IHS, and a severely limited TDP? Since it's all hypothetical anyway, however, one thing that comes to mind might be to use such 3D manufacturing to stack the processors on top of or below lower-power SRAM, or even make a design with that all jumbled together between the layers.

Cerb · Jan 29, 2014

rtsurfer said:
I believe ARM SOCs already have asynchronous clocking.

That's a contradiction, in this context. What he means by asynchronous is non-clocked, where timing signals are entirely peripheral to the CPU. This is primarily useful in certain niche embedded cases where either EMI/RFI from clocks is an issue, or internal clock variations create more correctness complexities than are worth it, and performance doesn't much matter (in the context of being able to cram more into a grain of salt than computers that took up whole buildings many years ago). Automotive safety systems, FI, have long been fans of either fully asynchronous systems, or systems who's activities are triggered only by external inputs (which has common with microcontrollers for decades), and that do absolutely nothing without them. Some have been made with ARM ISAs, but nothing close to mainstream.

WaitingForNehalem · Jan 29, 2014

Torn Mind said:
Parallelizing instructions is not the same as parallelizing operations on data. You can even parallelize both, which is the case on a multi-core CPU. This much I do know.

I am still trying to read up on the material. It seems that what you say doable to some extent right now via coding to program to use instruction sets such as AVX or GPU offloading. But to change the whole hardware design of the CPU into a more GPU-design and program everything to be done with SIMD processing, never mind backwards compatibility with apps such as Word, which probably is coded to interact with a typical CPU design.

It seems that understanding parallel computing involves understanding what exactly is parallelized; not every form of parallel computing involves the same "tricks" and hence the hardware design at the transistor level can vary. Hyperthreading, Bulldozer's modularity is also "parallel computing" but much different from a GPU.

In addition, it seems understanding data and instructions is also need. SISD, MISD, SIMD, MIMD provide a clear categorization of whether data and/or instruction execution are parallelized. As far as I can understand the whole modern CPUs are MIMD with each individual core able to do SIMD or SISD depending on the program.

What GPUs use is SIMD, short for single instruction, multiple data. This apparently refers to how the processing works. One of the math problems those units do well in is solving operations involving vectors. For us, we can understand vector ops by using matrices. So if matrix A is [1 2 3] and matrix B is [2 4 6], basic vector ops can be performed. For example doing vector addition does the following:
A+B=[1 2 3] + [2 4 6] = [1+2 2+4 3+6]
So, in a GPU "core" with parallel architecture, one ALU would do 1+2, another does 2+4, and yet another does 3+6. So modern graphics cards can process these huge vector operations real quickly.

Out of order execution is actually not present on a GPU "core", I think. Everything is done in order, but because there are so many ALUs operating at the same time, many problems are executed simultaneously.

CPUs do implement SIMD instruction sets, thus enabling hardware support for SIMD processing if coders want to have their application take advantage of those instruction sets, such as SSE, AVX, and other instruction sets. The coder would tell the program to use those instruction set. It could be very possible for the portions of code that take advantage of SSE or AVX to also be computed on a GPU.

I think you misunderstood my point or maybe I did a poor job of explaining. I was trying to show that modern CPUs/compilers avoid data hazards by reorganizing instructions, forwarding results in the pipeline out of the EX stage, inserting nops, etc...I was not saying anything about GPUs.

My point was that maybe in the future, instructions can be subdivided to run on many smaller cores where there are no data dependencies between them. If a couple of instructions must be run sequentially due to data dependenetcs that is ok.

cytg111 · Jan 29, 2014

Homeles said:
room temperature superconductivity..

- forget cpu's then and give me a flying car instead please.

cytg111 · Jan 29, 2014

CPUs will be relegated to low power, low cost and that the future is really in software and the user experience.
- One could hope about the latter part .. While Apple has shown the way to be successful in the usability space players such as MS odly enough is not playing ball. Here's to hoping thou'

The Von Neumann architecture has been exhausted and that more exotic architectures such as neural networks will take its place.
- A neural net is highly inefficent at general computational work. You can do the calculations yourself, go train a simple neural network to add integers up to 10. Execute a calculation on the network and count the number of cycles it takes to complete and compare it a reference manual of your favorite architecture for MUL. The neural approach will be many orders of magnitude slower.

We are in the dark ages of parallelism and that highly parallel, many core CPUs will come after compiler breakthroughs.
- Here is to hoping, but amdahl and all that.. But hey, smart folks do smart things..

Heterogeneous CPU/GPU architecture will take over.
- Sort of the same Q as above.

Analog computers will make a comeback.
- I dont see that angle, elaborate?

I think we'll see graphene or similar, and if, as another pointed out, we're going to get room-temp super conductors at some point - then what is the limit to the clocking potential?
At some point I think we'll see our super fast (graphene? 500Ghz, x86/x64++) cores (many, 20+) married to a quantum construct of some kind, d-wave is just the beginning (will we be counting quantum cores? as coprocessors?).

Torn Mind · Jan 29, 2014

WaitingForNehalem said:
I think you misunderstood my point or maybe I did a poor job of explaining. I was trying to show that modern CPUs/compilers avoid data hazards by reorganizing instructions, forwarding results in the pipeline out of the EX stage, inserting nops, etc...I was not saying anything about GPUs.

My point was that maybe in the future, instructions can be subdivided to run on many smaller cores where there are no data dependencies between them. If a couple of instructions must be run sequentially due to data dependenetcs that is ok.

Nah, it is more on my end, since I don't know programming at all except for a brief attempt to get comfortable with object-oriented programming and failing, and most of my knowledge of CPU hardware came from google searching yesterday night. I'm not well-versed in these matters at all. In addition, my earlier posts were regarding GPUs, so I thought you were continuing along.

Morbus · Jan 29, 2014

Galatian said:
3D structures are all nice and dandy, but heat transfer will put a limit on that. We're already swing the problem with Ivybridge/Haswell where the smaller transitor size actually resulted in a smaller area the heat could be transferred. Add several layer above and beneath that, how are you going to cool that?

Submerge it in water.

Like our brain... you know... It's basically a 3D CPU...

el etro · Jan 29, 2014

Cerb said:
Why? If that is a good thing to do, just use a weaker x86 core. There's nothing special ARM has on that front, and both Intel and AMD could do it with x86.

Yes. So goes Jaguar on the way. But ARM stong no X86 processor can match, and you know it.

Cerb · Jan 29, 2014

WaitingForNehalem said:
My point was that maybe in the future, instructions can be subdivided to run on many smaller cores where there are no data dependencies between them. If a couple of instructions must be run sequentially due to data dependenetcs that is ok.

How is that not a description of common OoOE implementations with clustered execution units? Or, if you mean more granular, like SIMD/MIMD/SPMD on a bigger scale, some GPUs (including all 3 major PC GPU vendors) run many data paths of the same program loops, on many very small cores (1000+ now, for midrange video cards), to improve throughput. If you mean at a higher level, like regular threads, Haskell is probably the most widely used language capable of multithreading without explicitly being told to do so, though Erlang is a bit more entrenched for networking-based tasks.

Cerb · Jan 29, 2014

el etro said:
Yes. So goes Jaguar on the way. But ARM stong no X86 processor can match, and you know it.

Then they should both be ARM...

Homeles · Jan 29, 2014

cytg111 said:
- forget cpu's then and give me a flying car instead please.

http://www.eetimes.com/document.asp?doc_id=1320283

WaitingForNehalem · Jan 29, 2014

cytg111 said:
CPUs will be relegated to low power, low cost and that the future is really in software and the user experience.
- One could hope about the latter part .. While Apple has shown the way to be successful in the usability space players such as MS odly enough is not playing ball. Here's to hoping thou'

The Von Neumann architecture has been exhausted and that more exotic architectures such as neural networks will take its place.
- A neural net is highly inefficent at general computational work. You can do the calculations yourself, go train a simple neural network to add integers up to 10. Execute a calculation on the network and count the number of cycles it takes to complete and compare it a reference manual of your favorite architecture for MUL. The neural approach will be many orders of magnitude slower.

We are in the dark ages of parallelism and that highly parallel, many core CPUs will come after compiler breakthroughs.
- Here is to hoping, but amdahl and all that.. But hey, smart folks do smart things..

Heterogeneous CPU/GPU architecture will take over.
- Sort of the same Q as above.

Analog computers will make a comeback.
- I dont see that angle, elaborate?

I think we'll see graphene or similar, and if, as another pointed out, we're going to get room-temp super conductors at some point - then what is the limit to the clocking potential?
At some point I think we'll see our super fast (graphene? 500Ghz, x86/x64++) cores (many, 20+) married to a quantum construct of some kind, d-wave is just the beginning (will we be counting quantum cores? as coprocessors?).

To clarify, those are various answers from people I've talked to in the industry and academia. They aren't mine.

With regards to neural networks, I assure you they are very impressive :sneaky:.

To elaborate on analog computing, a professor presented to us on the research he was doing. One of the examples he demonstrated was how using transistor (maybe FET?) characteristics to create a Gaussian distribution and calculate probability was much faster and more efficient than on a conventional digital computer using software.

I've heard things like this from other people too. That analog estimation versus precise digital calculation is much more useful for AI as it is works much in the same way as our brains.

WaitingForNehalem · Jan 29, 2014

Cerb said:
How is that not a description of common OoOE implementations with clustered execution units? Or, if you mean more granular, like SIMD/MIMD/SPMD on a bigger scale, some GPUs (including all 3 major PC GPU vendors) run many data paths of the same program loops, on many very small cores (1000+ now, for midrange video cards), to improve throughput. If you mean at a higher level, like regular threads, Haskell is probably the most widely used language capable of multithreading without explicitly being told to do so, though Erlang is a bit more entrenched for networking-based tasks.

mov r1,r2
sub r1,r1,#1
--------------
add r3,r3,#3
--------------
add r4,r4,#3
--------------
sub r5,r5,#7

Basically what I was thinking of was a large highly clocked core, surrounded by hundreds of simple, in-order cores using the same ISA. So maybe a 5GHz Haswell core surrounded by several Pentium cores for example. The compiler or some sort of scheduler would then segregate the assembly based on data dependencies. The sequential code like the first two lines above would be executed on the large core and the other lines that have no dependencies would be fed to the small cores. Of course I didn't mean single lines of assembly but different groupings of segregated assembly would be executed by the small cores in parallel. All the results would then be written into a giant public cache and the big core would load the computed results into the respective registers for the calling program. I guess the big core would be a master core that would feed the smaller cores.

TBH, I haven't thought about this very much so I'm sure I've missed a lot but since you asked

Also, I realize that there would need to be some way of identifying or tagging the results in the large cache...idk just thinking out loud here

Cerb · Jan 30, 2014

Nifty block diagram of Nehalem (might not be 100% right) and the P6, here: http://arstechnica.com/gadgets/2006/04/core/2/

And, AMD, back in the day: http://www.anandtech.com/show/1098/2
(note the 3 scheduler clusters, which is what I was thinking of in my post)

That's what most wide superscalar CPUs do, some how, some way. Keeping track of it all, and making sure it's done right (or if not, that what wasn't right can be undone), is why we have CPUs with 4 cores, when nVidia and AMD can fit 100+ GPU cores in the same space as each one of them. The CPU goes and converts the incoming x86 instructions to a tailored internal set, and runs those.

Trying to explicitly give dependencies has been tried, and keeps failing, at least for improving single-threaded performance, for numerous reasons, not the least among them being that memory, being fast&small or slow&big, becomes a bottleneck, because it can't be done without using more bits for instructions, to state stale registers, microthreads, blocks, etc.; and for short pieces of code, register renaming still ends up being needed, anyway, to be able to re-use registers without waiting for old instructions to finish with them, so the gains aren't much. It also helps that compilers will try to make temporary values sit in a handful of hot registers, spreading them around a little bit for better ILP discovery, that get written to, used once or twice, then overwritten again and again and again, to hint to the CPU that the values in them are dead. The registers and addresses used are thus a sufficient way to describe parallelism to the CPU (not ideal, maybe--there are some neat ideas, out there). That it could be done well enough on x86, prior to x86-64, is pretty impressive, too, both on account of CPU designers and compiler writers (IA32 has no true general-purpose registers, simply registers who's related instructions aren't being used right now).

Ken g6 · Jan 30, 2014

WaitingForNehalem said:
Basically what I was thinking of was a large highly clocked core, surrounded by hundreds of simple, in-order cores using the same ISA. So maybe a 5GHz Haswell core surrounded by several Pentium cores for example.

I was thinking along those lines. But if you need to design your code specifically to work in parallel, why stick with the same ISA? Why not just use AMD's HSA instead?

That does seem to be the direction processors are going it, but I'd say it's a question of whether they'll use AMD's HSA or Intel's AVX to get there. If Intel made very wide AVX units that were shared between cores like AMD is doing, that would practically be the same thing as HSA.

WaitingForNehalem · Jan 30, 2014

Ken g6 said:
I was thinking along those lines. But if you need to design your code specifically to work in parallel, why stick with the same ISA? Why not just use AMD's HSA instead?

That does seem to be the direction processors are going it, but I'd say it's a question of whether they'll use AMD's HSA or Intel's AVX to get there. If Intel made very wide AVX units that were shared between cores like AMD is doing, that would practically be the same thing as HSA.

Because it would still use x86 assembly as all the cores would still be x86. You wouldn't need HSA or OpenCL.

WaitingForNehalem · Jan 30, 2014

Cerb said:
Nifty block diagram of Nehalem (might not be 100% right) and the P6, here: http://arstechnica.com/gadgets/2006/04/core/2/

And, AMD, back in the day: http://www.anandtech.com/show/1098/2
(note the 3 scheduler clusters, which is what I was thinking of in my post)

That's what most wide superscalar CPUs do, some how, some way. Keeping track of it all, and making sure it's done right (or if not, that what wasn't right can be undone), is why we have CPUs with 4 cores, when nVidia and AMD can fit 100+ GPU cores in the same space as each one of them. The CPU goes and converts the incoming x86 instructions to a tailored internal set, and runs those.

Trying to explicitly give dependencies has been tried, and keeps failing, at least for improving single-threaded performance, for numerous reasons, not the least among them being that memory, being fast&small or slow&big, becomes a bottleneck, because it can't be done without using more bits for instructions, to state stale registers, microthreads, blocks, etc.; and for short pieces of code, register renaming still ends up being needed, anyway, to be able to re-use registers without waiting for old instructions to finish with them, so the gains aren't much. It also helps that compilers will try to make temporary values sit in a handful of hot registers, spreading them around a little bit for better ILP discovery, that get written to, used once or twice, then overwritten again and again and again, to hint to the CPU that the values in them are dead. The registers and addresses used are thus a sufficient way to describe parallelism to the CPU (not ideal, maybe--there are some neat ideas, out there). That it could be done well enough on x86, prior to x86-64, is pretty impressive, too, both on account of CPU designers and compiler writers (IA32 has no true general-purpose registers, simply registers who's related instructions aren't being used right now).

Oh I see. It is interesting to see that with Nvidia's Denver and Apple's Cyclone, they are going for super wide architectures. That is assuming they are really 7-way superscalars and not just counting lots of micro-ops.

Future of CPU architecture

Lifer

Lifer

Lifer

Golden Member

Elite Member

Elite Member

Elite Member

Senior member

Elite Member

Elite Member

Platinum Member

Lifer

Lifer

Lifer

Senior member

Golden Member

Elite Member

Elite Member

Platinum Member

Platinum Member

Platinum Member

Elite Member

Programming Moderator, Elite Member

Platinum Member

Platinum Member