CPU design Q: superscalar vs multicore

limelava1 · Jun 27, 2011

Question/ discussion topic regarding CPU design choices...
Why not go more superscalar instead of more multicore? Why not have dozens of functional units (a.k.a. datapaths, pipelines, ALUs), instead of adding cores?

Okay, yes, I have heard that the complexity of dependency checking increases exponentially in the number of functional units. So it becomes prohibitive to have too many. Multiple cores avoids this by forcing the dependency problem back on the software and programmer to handle concurrency, threading, IPC, and the like.

But what if... we limited the number of instructions issued per clock per thread, to something much less than the number of functional units? This should reduce the complexity to a manageable level. Wouldn't this solve the problem?

You might say, "ah yes, but it introduces a new problem: how to make use of those now-empty functional units!" Well, how about this:
i) SIMD. Let a single instruction operate on multiple data via multiple functional units. Ideally we would dynamically group functional units together to form SIMD abilities instruction by instruction.
ii) SMT. Let many threads share all the functional units. POWER and especially Oracle/ Sun's T3 have shown it is entirely workable to go beyond Intel's two threads per core.

To make this more concrete, if you look at an Intel E7, it's got up to 10 cores and, let's say, 4 ALUs each (plus probably some FPs etc). So, 40 ALUs a chip. So I'm thinking, let's put 40 ALUs in a single core, and limit to 4 or 8 instructions per clock per thread, but have 8 or 16 threads, and let each single instruction use 4 or 8 or all 40 ALUs on vector data.

Why bother? In a nutshell, the parallelism that must be extracted to use multiple cores is very coarse or granular compared to what the CPU can extract at run time, and is consequently less efficient. CPUs are also losing to GPUs. For example:
-- multiple cores do not share all caches, and cache coherency traffic has become a problem. As we add cores, we'll have more trouble getting data from one end of the chip to the other. In contrast, functional units usually share cache automatically.
-- the number of instructions 'ready to go' on a given thread is highly variable. Sometimes we are idling, waiting for data from main memory; other times we are compute limited. When these two situations arise on different cores at the same time, we can't do anything. Even if the busy thread could somehow be 'split,' the time to move it to the idle core may be too high.
-- synchronization mechanisms available at the programmer levels can be very slow (e.g. thousands of clock cycles to wake via mutex)
-- GPUs are already very wide. They manage it by imposing restrictions to simplify the scheduling -- operate on vectors of data, strictly in order execution, etc.

In short, even if we severely limit the scheduling choices for making use of dozens of functional units in a CPU in order to make it feasible, it is still probably a lot more efficient than the coarse software-level scheduling we get now under multicore.

Opinions?

Ben90 · Jun 27, 2011

I love how easy it is to find competent answers on this forum.
Step one: Search for posts by Idontcare containing "superscalar"
Step two: Click first link.

Idontcare said:
Ben90 said:

am legitimately curious as to why extremely superscalar designs never took off.

Click to expand...

Basically physics and economics got in the way.

http://en.wikipedia.org/wiki/Pollack's_Rule

You can keep doubling the complexity but your rate of performance improvements dies off as the square root of your efforts while power-consumption and production costs increase linearly with die-area.

If your customers are willing to accept the alternative, multi-core/multi-thread processing, then you can build higher performance chips without spending a bundle on development and production costs associated with non-silicon based semiconductor technologies.

If the law holds reasonably true to even today, I would totally go for a superscalar quad core vs a future octo. Sure you lose a lot of the theoretical performance, but per thread these things would kick butt. I don't expect Intel to develop a chip like this though since R&D is so expensive, they want to cater to whatever will give them the most return on investment. Multicore.

Fortunately for us superscalar guys, we have Amdahl's Law. Per core improvements, while slowed, haven't been standing still either. It's all a balance.

Welcome to the forum! Excellent question btw.

Borealis7 · Jun 27, 2011

and this is the first post you make? who's fake account are you?

and Welcome to Anandtech Forums!

VirtualLarry · Jun 27, 2011

I've suggested something like this in the past too. Basically, future CPUs designed much like GPUs, with arrays of processing units (functional units), and running an insane version of SMT across all of them, with many threads in play at once, all utilizing the available functional units as much as possible.

jones377 · Jun 27, 2011

Current CPU designs (since Pentium 1 for x86) are already superscalar. ILP and exponential complexity prevents the CPU designers from creating very wide superscalar CPUs.

Borealis7 · Jun 27, 2011

also while GPUs do a lot of vector operations, CPUs do alot of moving bits from one place to another. they're completely different function units, and it'd be a waste to run then on the GPU.

you just end up with a "jack of all trades, master of none".

Tuna-Fish · Jun 27, 2011

There is a limited amount of ILP (instruction-level parallelism) available in the code -- if all the future operations in a thread are dependent on a single instruction, (which is extremely common), then having a wider core buys you nothing.

As for SIMD, there are really fine SIMD units in all modern processors, and they are practically almost unused. Getting good simd code with current tools is still hard enough that it is only ever done for very small computing kernels on very few programs.

The final problem with wide cores is the register file -- all those units have to load operands, and output results. Read and write ports in the register file are not cheap -- the area complexity of a register file is amount_of_ports^2. This, combined with the fact that making the register file larger means longer wire delays, means that if you want to run high at clock speeds, you very simply cannot increase the port counts on register files much above 4. On their new designs, both Intel and AMD duplicate the entire register file to keep the amount of ports low. So both the leading CPU manufacturers agree that you cannot have any wider register files.

Then why not have multiple separate register files, each dedicated to one SMT thread? The ALU's have to physically very close to the register file to be able to reach high clocks, and placing lots of ALU's and lots of register files so that they all are close to each other is, well, problematic.

But this is not actually that big of a problem, because ALU's are actually extremely cheap in modern processors. So instead of a lot of alus sharing lots of register files, you can couple a few alu's to a register file, and duplicate that, sharing all other parts. Which is the design of AMD bulldozer.

GammaLaser · Jun 27, 2011

limelava1 said:
Question/ discussion topic regarding CPU design choices...
Why not go more superscalar instead of more multicore? Why not have dozens of functional units (a.k.a. datapaths, pipelines, ALUs), instead of adding cores?

The first reply pretty much hit the nail on the head. Not only does it take exponentially more resources to increase issue widths (leading to a non-area and non-power efficient design), but the practical benefits you get won't even scale. Most programs can't fully utilize the existing superscalar resources of a modern 4-issue CPU. That's why HTT brings overall throughput benefits, since most of the time, there is no additional ILP to exploit in a single thread.

limelava1 said:
SMT. Let many threads share all the functional units. POWER and especially Oracle/ Sun's T3 have shown it is entirely workable to go beyond Intel's two threads per core.

T3 implements multithreading a bit differently than Intel's HTT. Bimplementations can only issue between two different threads in a single cycle. The difference is that T3 can also interleave threads in time so that the execution of one thread can cover the otherwise idle execution resources when another thread is stalled on a memory access.

limelava1 said:
In short, even if we severely limit the scheduling choices for making use of dozens of functional units in a CPU in order to make it feasible, it is still probably a lot more efficient than the coarse software-level scheduling we get now under multicore.

I believe your argument is that CPU's should have much more flexible execution units so they can "burst" their issue widths when the ILP is available but otherwise operate in SMT mode. The reality is that most of the time, the CPU will not be able to "burst". There are too many complications involved--program could be waiting for memory most of the time, the branch prediction may be fumbling (8 issues/cycle would require on average 2 correct predictions per cycle even if the ILP was available), or the program may just have too many inherent data dependancies. A lot of power would be wasted to perform the checks needed to see if this bursting is possible, when most of the time it won't be.

The CPU will then just turn into a multithreaded system, it'll look the same as a multicore system to software, except that it needed a lot more power and transistors.

Bill Brasky · Jun 27, 2011

Ben90 said:
I love how easy it is to find competent answers on this forum.
Step one: Search for posts by Idontcare containing "superscalar"
Step two: Click first link.

LOL. Funny because it's true.

limelava1 · Jun 27, 2011

Ben90 said:
Step one: Search for posts by Idontcare containing "superscalar"

A good thread, thank you. I didn't know to specify Idontcare. <

Ben90 said:
Welcome to the forum! Excellent question btw.

Borealis7 said:
and this is the first post you make? who's fake account are you?

Thanks. I have been reading Anandtech's for articles and news for probably 10-12 years but have not really participated/ posted in the forums. First time caller.

Borealis7 said:
also while GPUs do a lot of vector operations, CPUs do alot of moving bits from one place to another. they're completely different function units, and it'd be a waste to run then on the GPU.

You're of the specialization camp. Like the neuroscientists. They will tell you about how the human brain has the computational power of the top entry on top500.org, but uses several orders of magnitude lower wattage, mainly via specialization. But, look how long it takes to program a human -- a good 20-25 years before we're fully educated. And with greatly varying results. Specializing hardware in the computer world demands a lot from programmers to make full use. Few programmers are capable of using a CPU well, very very few can really use a GPU or both. And you will never get around the time lag to move data between CPU and GPU ("never" via software, only via newer hardware than the PCIe bottleneck). Even context switches and moving stuff between cores is expensive in cycles. Hence the proposal to do it in one core -- more physically efficient for data movement and WAY easier on the software side.

Tuna-Fish said:
if all the future operations in a thread are dependent on a single instruction, (which is extremely common), then having a wider core buys you nothing.

Is this not why we use TLP & SMT then? So that when our thread is waiting on something, somebody else can make use of our scarce compute resources?

Tuna-Fish said:
As for SIMD, there are really fine SIMD units in all modern processors, and they are practically almost unused.

A good point. That said, I think SIMD is used when the speed matters. Media apps such Photoshop or encoders/ decoders use it, I believe. Also, I thought gcc put this in a couple years ago, where the compiler would automatically look for places it could unroll your loops into 4's and 8's for vectorization. In other words, I think more (gcc-created) binaries use SSE & friends under the covers than you might expect. See here: http://gcc.gnu.org/projects/tree-ssa/vectorization.html

Tuna-Fish said:
The final problem with wide cores is the register file

I think this, and the remainder of your post, really explains it and answers the question. Thank you, Tuna-Fish. I was thinking of ALUs as more expensive, but good to hear that Bulldozer is going in this direction.

GammaLaser said:
The difference is that T3 can also interleave threads in time so that the execution of one thread can cover the otherwise idle execution resources when another thread is stalled on a memory access.

I think such temporal multithreading could be pretty useful. Yes, the OS temporally multithreads already, but it's at too coarse of a level. Letting the CPU do it seems much preferable in terms of better utilizing resources. It seems like this would reduce the complexity for a given core.

To the general idea is that the reason is complexity, is not intercore communication and cache coherency complex, too? And NUMA or the like between sockets? There's already talk for things like Intel MIC that maybe we shouldn't keep all of a chip's caches coherent because it's too difficult, but no one really knows how to form the programming model for that. That sounds like a scaling bottleneck for multicore.

GammaLaser · Jun 27, 2011

limelava1 said:
Also, I thought gcc put this in a couple years ago, where the compiler would automatically look for places it could unroll your loops into 4's and 8's for vectorization. In other words, I think more (gcc-created) binaries use SSE & friends under the covers than you might expect. See here: http://gcc.gnu.org/projects/tree-ssa/vectorization.html

Last I heard, autovectorization was not very mature, and using intrinsics to insert SSE ops was the way to gain significant speedups. Maybe recent versions of the compilers have improved in that regard.

limelava1 said:
To the general idea is that the reason is complexity, is not intercore communication and cache coherency complex, too? And NUMA or the like between sockets? There's already talk for things like Intel MIC that maybe we shouldn't keep all of a chip's caches coherent because it's too difficult, but no one really knows how to form the programming model for that. That sounds like a scaling bottleneck for multicore.

Yeah, these are the big issues facing multicore architectures. It's why we've seen CPUs go to ring-bus interconnects, use directory-based coherency protocols, implement complex cache hierarchies, etc.

I wonder if there is any literature on how Intel has been able to scale coherency in MIC.

On the other hand, using private memory pools/explicit coherency has proven to be a software challenge, as shown by Cell.

Also for anyone interested here is a great RWT article on the DEC EV8, which was going to be an 8-issue, 4-way SMT CPU ('till Compaq killed it after acquiring DEC):
http://www.realworldtech.com/page.cfm?ArticleID=RWT122600000000

limelava1 · Jun 27, 2011

If we restrict ourselves to in-order execution, must dependencies be checked, or does that reduce the complexity? Do we additionally need to restrict each thread to issuing a single instruction? If one or both of these restrictions reduces complexity, then we can make the tradeoff -- give up out-of-order execution and possibly multiple issue but go very wide with lots of datapaths and lots of SMT to make use of them.

In-order would seem to go hand-in-hand with no speculative execution, which has the additional performance per watt benefit of never wasting cycles on branch mispredictions. So I could also envision deepening the pipeline and cranking up the GHz.

Basically, I'm asking what would happen if we:
-- gave up out-of-order execution
-- gave up speculative execution
-- gave up multiple instruction issue (per thread per clock cycle)
So that scheduling becomes really easy and simple. You simply find a thread that has all of its operands ready to go for the first instruction on its stack. We'll use multiple register files per the Bulldozer solution, but if we're talking single issue per thread, many register sets doesn't sound too onerous. Maybe we don't even need all of these restrictions. In return, we'll:
-- get a very wide (many functional units) core (e.g. 16-64)
-- get a very deep and highly clocked core
-- use extreme SMT (far more threads than functional units, e.g. 64-256)

I am guessing that per-thread performance would be really, really crappy, compensated only by the fact that the GHz is high. However, if the task is highly parallelizable, I'm guessing that this *could* outperform existing designs. The main reason is that the threads are running on the same core and not incuring as steep of penalties either communicating or moving themselves into places with open functional units.

Consumer apps might not benefit much, but a lot of 'big iron' apps are naturally very parallelizable. For instance, you have 100,000 users hitting a website -- each one can be its own thread. Or you have 10,000 employees making database queries within a company. Or you have 1,000 trading customers each sending you stock orders.

Also, to what extent did I just describe how a GPU design differs from a CPU (mod the instruction set/ general purposeness)?

jones377 · Jun 28, 2011

What advantage would a 64 wide issue core, where each thread can only issue a single instruction, have over 64 separate cores each with single issue (scalar)?

I'd also like to see the design of the L1i cache for a core that has 256 threads

Ben90 · Jun 28, 2011

My simulation program routed the wires just fine.

Idontcare · Jun 28, 2011

Ben90 said:
My simulation program routed the wires just fine.

LOL! If I'm not mistaken I think you just published a strictly internal document regarding Intel's fundamental architecture innovation underlying Itanium and EPIC

Expect to have your computers confiscated by the DHS tonight after midnight.

Tuna-Fish · Jun 28, 2011

limelava1 said:
Also, to what extent did I just describe how a GPU design differs from a CPU (mod the instruction set/ general purposeness)?

Add to that spmd simd, and you have just described modern GPU's (Fermi and AMD's GCN).

For example, in the AMD GCN the frontend bundles threads to run into "wavefronts" of 16 threads, which run in a simd fashion -- their instruction pointers are always in sync, and just the data they operate on differ. (if there are no 16 threads that would run the same program, some of the lanes just idle).

These wavefronts are fed into the compute units for scheduling/execution. A compute unit runs 40-wavefront (so 640 individual threads) SMT with 4 SIMD units (which can complete a single fp madd for all the threads of a single wavefront once per clock), and some scalar units for common operations.

Memory latency is also hidden by smt -- why build complex cache structures for hiding latency when you can just execute something else while you wait? The caches that are available are mostly designed to boost bandwidth.

The end result is truly abysmal performance for a single thread. Think before PPro. But, you're running 640 of them at a time in each compute unit, and the compute units are small enough that you can stuff 20 in a chip at 40nm...

dmens · Jun 28, 2011

Ben90 said:
My simulation program routed the wires just fine.

Cool, you found the feasibility study for the P4 trace cache.

Search

CPU design Q: superscalar vs multicore

limelava1

Junior Member

Ben90

Platinum Member

Borealis7

Platinum Member

VirtualLarry

No Lifer

jones377

Senior member

Borealis7

Platinum Member

Tuna-Fish

Golden Member

GammaLaser

Member

Bill Brasky

Diamond Member

limelava1

Junior Member

GammaLaser

Member

limelava1

Junior Member

jones377

Senior member

Ben90

Platinum Member

Idontcare

Elite Member

Tuna-Fish

Golden Member

dmens

Platinum Member

TRENDING THREADS