Question/discussion topic regarding CPU design choices...
Why not go more superscalar instead of more multicore? Why not have dozens of functional units (a.k.a. datapaths, pipelines, ALUs), instead of adding cores?
Okay, yes, I have heard that the complexity of dependency checking grows superlinearly in the number of functional units (roughly quadratically in issue width: each issuing instruction has to be checked against every other in-flight one, and the wakeup/select logic scales even worse), so it becomes prohibitive to have too many. Multicore avoids this by pushing the dependency problem back onto the software and the programmer, who must handle concurrency, threading, IPC, and the like.
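To see where that blowup comes from, here is a toy counter in plain C (the two-source/one-destination instruction format is just an assumption for illustration) that tallies the register comparisons a wide issue stage would have to make every cycle:

    #include <stdio.h>

    /* Toy model: each instruction reads two source registers and
       writes one destination. To co-issue n instructions in one
       cycle, every instruction's sources must be compared against
       every older instruction's destination: n*(n-1) comparisons,
       i.e. O(n^2) in the issue width. */
    static long raw_checks(int n)
    {
        long checks = 0;
        for (int i = 1; i < n; i++)        /* each later instruction...  */
            for (int j = 0; j < i; j++)    /* ...against each older one  */
                checks += 2;               /* src1 vs dst, src2 vs dst   */
        return checks;
    }

    int main(void)
    {
        int widths[] = { 4, 8, 16, 40 };
        for (int k = 0; k < 4; k++)
            printf("issue width %2d -> %4ld comparisons per cycle\n",
                   widths[k], raw_checks(widths[k]));
        return 0;
    }

Going from 4-wide to 40-wide takes you from 12 comparisons per cycle to 1560, and that's before the wakeup/select logic.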
But what if... we limited the number of instructions issued per clock per thread to something much smaller than the number of functional units? That should reduce the complexity to a manageable level. Wouldn't this solve the problem?
You might say, "ah yes, but it introduces a new problem: how to make use of those now-empty functional units!" Well, how about this:
i) SIMD. Let a single instruction operate on multiple data elements via multiple functional units. Ideally we would dynamically group functional units together to form SIMD capability instruction by instruction. (A minimal sketch of today's fixed-width version follows this list.)
ii) SMT. Let many threads share all the functional units. IBM's POWER7 (four threads per core) and especially Oracle/Sun's T3 (eight) have shown it is entirely workable to go beyond Intel's two threads per core.
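For reference, here is what the fixed-width version of (i) looks like today: a minimal C sketch using AVX intrinsics (assumes an AVX-capable x86; compile with something like gcc -mavx). The single _mm256_add_ps drives eight lanes at once; the proposal above is essentially to make that ganging dynamic rather than baked in at design time:

    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
        float a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
        float b[8] = { 10, 20, 30, 40, 50, 60, 70, 80 };
        float c[8];

        __m256 va = _mm256_loadu_ps(a);      /* load 8 floats           */
        __m256 vb = _mm256_loadu_ps(b);
        __m256 vc = _mm256_add_ps(va, vb);   /* ONE instruction, 8 adds */
        _mm256_storeu_ps(c, vc);

        for (int i = 0; i < 8; i++)
            printf("%g ", c[i]);
        printf("\n");
        return 0;
    }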
To make this more concrete: look at an Intel Xeon E7. It has up to 10 cores with, say, 4 ALUs each (plus some FP units etc.), so roughly 40 ALUs per chip. So I'm thinking: put all 40 ALUs in a single core, limit issue to 4 or 8 instructions per clock per thread, but run 8 or 16 threads, and let each single instruction use 4, 8, or all 40 ALUs on vector data.
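To sanity-check that arithmetic, here is a toy cycle-by-cycle simulation in C. Every behavioral parameter (the 25% stall rate, the vector mix) is invented purely for illustration; this is not a model of any real microarchitecture, just the 40-ALU/16-thread/4-issue numbers from above:

    #include <stdio.h>
    #include <stdlib.h>

    #define ALUS        40     /* shared pool, the E7-style budget        */
    #define THREADS     16     /* hardware threads (SMT)                  */
    #define ISSUE_WIDTH  4     /* max instructions per clock per thread   */
    #define CYCLES   100000

    int main(void)
    {
        long used = 0;
        srand(1);
        for (long cyc = 0; cyc < CYCLES; cyc++) {
            int free_alus = ALUS;
            for (int t = 0; t < THREADS && free_alus > 0; t++) {
                if (rand() % 4 == 0)          /* ~25%: stalled on memory */
                    continue;
                for (int i = 0; i < ISSUE_WIDTH; i++) {
                    /* ~1 in 4 instructions is an 8-wide vector op */
                    int lanes = (rand() % 4 == 0) ? 8 : 1;
                    if (lanes > free_alus)
                        break;
                    free_alus -= lanes;
                }
            }
            used += ALUS - free_alus;
        }
        printf("average ALU utilization: %.1f%%\n",
               100.0 * used / ((double)ALUS * CYCLES));
        return 0;
    }

With these made-up mixes, demand far exceeds 40 lanes and the pool saturates nearly every cycle; the interesting case is the opposite one, where only a thread or two is runnable and each can grab a large slice of the pool.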
Why bother? In a nutshell: the parallelism that must be extracted to use multiple cores is very coarse-grained compared to what the CPU can extract at run time, and is consequently less efficient. CPUs are also losing ground to GPUs on throughput workloads. For example:
-- multiple cores do not share all caches, and cache coherency traffic has become a problem. As we add cores, we'll have more trouble getting data from one end of the chip to the other. In contrast, functional units within a core share its caches automatically.
-- the number of instructions 'ready to go' on a given thread is highly variable. Sometimes a thread is idling, waiting for data from main memory; other times it is compute-bound. When one core is stalled and another is saturated at the same time, we can't do anything about it: even if the busy thread could somehow be 'split', the cost of migrating work to the idle core may be too high.
-- synchronization mechanisms available at the programmer level can be very slow, e.g. thousands of clock cycles to wake a thread via a mutex (see the micro-benchmark after this list)
-- GPUs are already very wide. They manage it by imposing restrictions that simplify the scheduling: operate on vectors of data, strictly in-order execution, and so on.
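On the mutex point, you can measure the wakeup path yourself. A crude one-shot micro-benchmark using POSIX threads (compile with gcc -pthread; the 100 ms pause just makes sure the waiter is asleep, and the numbers vary a lot by OS and hardware, but single-digit microseconds already mean thousands of cycles at GHz clocks):

    #include <pthread.h>
    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static int ready = 0;
    static struct timespec t_signal, t_wake;

    static void *waiter(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!ready)
            pthread_cond_wait(&cond, &lock);   /* sleep until signaled */
        clock_gettime(CLOCK_MONOTONIC, &t_wake);
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t th;
        pthread_create(&th, NULL, waiter, NULL);

        struct timespec pause = { 0, 100 * 1000 * 1000 };
        nanosleep(&pause, NULL);               /* let the waiter block */

        pthread_mutex_lock(&lock);
        ready = 1;
        clock_gettime(CLOCK_MONOTONIC, &t_signal);
        pthread_cond_signal(&cond);            /* wake the sleeper */
        pthread_mutex_unlock(&lock);

        pthread_join(th, NULL);
        int64_t ns = (t_wake.tv_sec - t_signal.tv_sec) * 1000000000LL
                   + (t_wake.tv_nsec - t_signal.tv_nsec);
        printf("signal-to-wakeup: %lld ns\n", (long long)ns);
        return 0;
    }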
In short, even if we severely restrict the scheduling choices to make dozens of functional units per core feasible, hardware scheduling at that granularity is probably still a lot more efficient than the coarse software-level scheduling we get today under multicore.
Opinions?