
Trends in Multithreaded processing

Martimus

Diamond Member
I was wondering how everyone feels chip makers will tackle multithreaded processing in the future. Intel started by letting a second thread use the idle portions of a core (Hyper-Threading); AMD followed by adding a second core (CMP). Now we are seeing a merging of the two philosophies, where some of the core is shared and some resources are dedicated to each thread (CMT).
I can see this third method really expanding in the future, to the point where there are odd numbers of each type of resource in a module, depending on the percentage of each type of work you expect the module to do (say 3 FPUs, 8 ALUs, 4 AGUs, etc.), where a single "module" would run a multitude of threads at once.

Seeing how early we are in this whole mess, I am somewhat excited to think of what kinds of twists the designs can have in the future. I know how I would take it, but I wonder what others think will be the path forward for the different companies?

I wouldn't limit each module (core if you are talking Intel, but that is just semantics at this point) to just two threads at once. If one portion of the processor is used 1/4 as often as other portions of the processor (and is relatively large or has some other non-trivial penalty for being there unused), I would make four (4) threads share that one resource, not two (2). Of course this would be very application specific, so the optimizations will be interesting for these general purpose processors (I wonder where they will optimize most, and what kinds of applications will see detriments).
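The "four threads share one quarter-used unit" arithmetic can be sketched with a toy model. This is my own illustration with made-up probabilities, not any real design's numbers:

```python
# Toy model: if each thread needs a given execution unit with probability p
# in any cycle, average demand is p * threads, so a unit that is busy 1/4
# of the time can serve roughly four threads before it becomes a bottleneck.

def expected_demand(p, n_threads):
    """Average number of per-cycle requests for the shared unit."""
    return p * n_threads

def threads_per_unit(p, n_units=1):
    """Largest thread count whose average demand still fits in n_units."""
    return int(n_units / p)

print(threads_per_unit(0.25))    # 4 threads can share one such unit
print(expected_demand(0.25, 4))  # 1.0 request per cycle on average
```

Averages hide bursts, of course; real hardware would still need queues for the cycles when more than one thread wants the unit at once.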
 
This is pretty much how it's done now. Hardware features that are sharable are shared, ones that are not are augmented.

For example, with Hyper-Threading Intel beefed up the portions of the chip where running a second thread would otherwise cause resource contention.
 
I wouldn't limit each module (core if you are talking Intel, but that is just semantics at this point) to just two threads at once. If one portion of the processor is used 1/4 as often as other portions of the processor (and is relatively large or has some other non-trivial penalty for being there unused), I would make four (4) threads share that one resource, not two (2). Of course this would be very application specific, so the optimizations will be interesting for these general purpose processors (I wonder where they will optimize most, and what kinds of applications will see detriments).

If you've spent much time compiling Linux kernels, you are prolly familiar with the old tip of setting the -j# parameter to a number that exceeds the physical core count of the system: spawning more threads than cores will (up to a point) actually result in faster compile times, because the over-subscribed threads take advantage of hardware stalls and pipeline inefficiencies.

I think you are basically saying do this but for apps in general.

For example, I don't have the option of forcing TMPGEnc to spawn more than 4 encoding threads on my quad-core Q6600, but it would be nice if I could set it to 5 or some such and have an extra thread languishing in the background, ready to soak up an idle CPU cycle here and there as the otherwise fully active threads hit a stall for any reason.
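The oversubscription trick can be demonstrated with a toy Python sketch (illustrative only: the "stall" here is just a sleep, and the worker counts stand in for cores):

```python
# Toy demonstration of why oversubscribing helps when tasks stall: each
# "job" spends most of its time blocked, so extra workers soak up the idle
# time, much like running `make -j` with more jobs than cores.
import time
from concurrent.futures import ThreadPoolExecutor

def job(_):
    time.sleep(0.05)          # stand-in for a stall (I/O wait, cache miss...)
    return sum(range(1000))   # a little actual work

def run(workers, jobs=8):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(job, range(jobs)))
    return time.perf_counter() - start

t_matched = run(workers=4)    # one worker per (pretend) core
t_oversub = run(workers=8)    # oversubscribed
print(t_oversub < t_matched)  # True: the stalls overlap, total time drops
```

If the jobs were pure CPU work with no stalls, the extra workers would buy nothing (and the context switches would cost a little), which is why the trick only helps "up to a point".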
 
If you've spent much time compiling Linux kernels, you are prolly familiar with the old tip of setting the -j# parameter to a number that exceeds the physical core count of the system: spawning more threads than cores will (up to a point) actually result in faster compile times, because the over-subscribed threads take advantage of hardware stalls and pipeline inefficiencies.

I think you are basically saying do this but for apps in general.

For example, I don't have the option of forcing TMPGEnc to spawn more than 4 encoding threads on my quad-core Q6600, but it would be nice if I could set it to 5 or some such and have an extra thread languishing in the background, ready to soak up an idle CPU cycle here and there as the otherwise fully active threads hit a stall for any reason.

I am talking about adjusting how the processor physically handles the operations. Let's say that your expected application load only uses 1 floating-point instruction for every 5 integer instructions. Why not have one FPU for every 5 integer units? And I mean this across the board, where each function is shared throughout the processor, so that there aren't really separate full cores, but rather an amalgam of components that can be used by whatever thread needs them.

The BD module system is a step in this direction, but it is limited to a maximum of two (2) threads for each shared component. I was thinking of what would happen in the future as this limitation is removed.
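The 1-FPU-per-5-ALUs idea can be framed as a simple throughput bound (a sketch with invented numbers, not a real pipeline model): the unit type that is scarcest relative to its share of the instruction stream caps overall throughput.

```python
# Upper bound on sustained instructions/cycle for a given instruction mix
# and set of execution units: whichever unit type runs out first, relative
# to its share of the stream, is the bottleneck (numbers are invented).

def sustainable_ipc(mix, units):
    """mix: fraction of the instruction stream per unit type.
    units: number of execution units of each type."""
    return min(units[k] / mix[k] for k in mix if mix[k] > 0)

mix   = {"int": 5 / 6, "fp": 1 / 6}  # 5 integer ops for every FP op
units = {"int": 5, "fp": 1}          # hardware matched to the 5:1 mix
print(sustainable_ipc(mix, units))   # 6.0 -- both unit types saturate together

# An unbalanced design wastes hardware: doubling the FPUs changes nothing
# here, because the integer units are still the bottleneck.
print(sustainable_ipc(mix, {"int": 5, "fp": 2}))  # still 6.0
```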
 
And I mean this across the board, where each function is shared throughout the processor, so that there aren't really separate full cores, but rather an amalgam of components that can be used by whatever thread needs them.

Ah, I see now. I agree it is heading that way. Not sure what that means for turbo-clocking and power-gating though...having discrete clock domains and logic domains has some advantages.

Definitely going to be trade-offs either way. But I see where you are going with this. A chip in 2016 might well be a single "core" capable of simultaneously processing 48 threads, depending on the specific instruction mix involved across all the threads (which determines resource contention).
 
I am talking about adjusting how the processer physically handles the operations. Lets say that your expected application load only uses 1 floating point instruction for every 5 integer instruction. Why not have one FPU for every 5 integer units? And I mean this accross the board, where each function is shared throughout the processor, so that there aren't really seperate full cores, but rather an amalgam of components that can be used by whatever thread will need them.

The BD module system is a step in this direction, but it is limited to a maximum of two (2) threads for each shared component. I was thinking of what would happen in the future as this limitation is removed.
It can happen in the future.
Imagine the following: OpenCL matures, and C/C++ math libraries enable compilers to generate code for the APU/GPU. Let's say vector and matrix operations and manipulation can be offloaded. At that point the FPU could become 'too fat', and the ratio could change as you suggested.
 
I'm not really a processor architecture expert like you guys, but what is keeping Intel/AMD from keeping core counts the same and just expanding upon the superscalar design?

I haven't done much research on it, but it seems like, with the ever-increasing xtor count, it was just easier to get more performance out of MCM'ing two cores together. Is there something physically, fundamentally wrong with a truly massive superscalar design, or is it that the money/effort to research and design something that massive exceeds the benefits?

Bulldozer's modules seem to be treading on the blurry line between being two cores and one. What is really keeping them from creating a massive integer scheduler that feeds both "cores", allowing the whole module to handle a single thread?

I'm sure what I've suggested is probably unreasonable, but like I said, I don't know much about the design of processors and am legitimately curious as to why extremely superscalar designs never took off.
 
am legitimately curious as to why extremely superscalar designs never took off.

Basically physics and economics got in the way.

http://en.wikipedia.org/wiki/Pollack's_Rule



You can keep doubling the complexity, but your rate of performance improvement dies off as the square root of your effort, while power consumption and production costs increase linearly with die area.

If your customers are willing to accept the alternative, multi-core/multi-thread processing, then you can build higher performance chips without spending a bundle on development and production costs associated with non-silicon based semiconductor technologies.
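The square-root trade-off is easy to put in numbers. A minimal sketch of Pollack's Rule applied to a fixed transistor budget:

```python
# Pollack's Rule: single-thread performance grows roughly with the square
# root of the complexity (die area) poured into a core, while power and
# production cost grow roughly linearly with area.
import math

def pollack_perf(area_ratio):
    """Relative single-thread performance of a core scaled by area_ratio."""
    return math.sqrt(area_ratio)

# Spend a 4x transistor budget one way or the other:
one_big_core = pollack_perf(4.0)          # ~2.0x single-thread speed
four_small_cores = 4 * pollack_perf(1.0)  # 4.0x aggregate, IF the workload threads well
print(one_big_core, four_small_cores)
```

The catch, of course, is the "IF": the multi-core number is only reachable when the software actually scales across threads.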
 
Bulldozer's modules seem to be treading on the blurry line between being two cores and one. What is really keeping them from creating a massive integer scheduler that feeds both "cores", allowing the whole module to handle a single thread?

I am guessing that you spend so much time taking apart and putting together the thread that you lose the benefit of having multiple execution units.
 
^ that would be my guess too...be it a net loss in performance or a very power-expensive way to try and get a meager performance increase.
 
Yeah, the next time your wife is trying to make dinner, you can help her get it done quicker if you make the potatoes while she makes the meatloaf. That is two cores.

She can't make the meatloaf faster if you both have your hands in the bowl at the same time. It will take longer.

The goal is not to figure out how to make one thread run faster; it's about trying to make sure that everything is out of the way of that thread so that it can run faster.
 
It can happen in the future.
Imagine the following: OpenCL matures, and C/C++ math libraries enable compilers to generate code for the APU/GPU. Let's say vector and matrix operations and manipulation can be offloaded. At that point the FPU could become 'too fat', and the ratio could change as you suggested.

So long as we have discrete graphics cards, this is never going to happen (It MAY happen with the APU). Vector/Matrix operations are relatively quick on a CPU. However, the latency for transferring data from CPU to GPU is pretty dang big.

The time it becomes efficient is when you have LOTS of vector operations (think of physics calculations for 1000+ particles) or LOTS of matrix operations. That, or fairly big vectors/matrices. (though, not so much).

The whole process of building specific GPU modules for the code, piping those modules to the GPU compiler, and loading them onto the GPU only adds to the issues. It wouldn't be a simple MatrixA * MatrixB and voilà, it is done on the GPU. There would at least have to be a setup phase for the operation.

Compilers today have a difficult time effectively using things like SSE instructions; I can't see them doing any better a job with GPU offloading, which by its nature would be far more complex than a simple "when should I use an XMM register?".
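The transfer-latency argument above can be put in numbers with a rough break-even model. All of the figures here are invented for illustration; real per-element and transfer costs vary wildly by hardware:

```python
# Rough model: offloading a vector op to a GPU only pays once the batch is
# big enough to amortize the fixed transfer/launch overhead.

def cpu_time(n, per_elem_ns=1.0):
    """Time to process n elements on the CPU (no setup cost)."""
    return n * per_elem_ns

def gpu_time(n, overhead_ns=50_000.0, per_elem_ns=0.05):
    """Time on the GPU: fixed transfer/launch overhead plus fast per-element work."""
    return overhead_ns + n * per_elem_ns

def breakeven(overhead_ns=50_000.0, cpu_ns=1.0, gpu_ns=0.05):
    # Solve overhead + n*gpu_ns = n*cpu_ns for n.
    return overhead_ns / (cpu_ns - gpu_ns)

print(cpu_time(1_000) < gpu_time(1_000))          # True: small job, CPU wins
print(cpu_time(1_000_000) > gpu_time(1_000_000))  # True: big job, GPU wins
print(breakeven())  # ~52631.6 elements with these made-up numbers
```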
 
I am guessing that you spend so much time taking apart and putting together the thread that you lose the benefit of having multiple execution units.

I agree as well. It could also result in some pretty weird performance problems. E.g. ThreadA runs at full speed doing A * B, but ThreadB struggles doing A + B because all of the integer units are being used by ThreadA. From the OS level, there would be no way to really control that, or to schedule threads so that something like that doesn't happen.
 
I agree as well. It could also result in some pretty weird performance problems. E.g. ThreadA runs at full speed doing A * B, but ThreadB struggles doing A + B because all of the integer units are being used by ThreadA. From the OS level, there would be no way to really control that, or to schedule threads so that something like that doesn't happen.

I don't agree at all that you cannot code an OS to take the hardware limitations into account when assigning resources to threads.
 
I don't agree at all that you cannot code an OS to take the hardware limitations into account when assigning resources to threads.

The OS would have to READ THE CODE and in essence execute it before it is executed... I know some people think this is easy, but it isn't at all. The OS has no way to tell if a thread has been waiting forever to perform A + B (and thus force one of the other threads to wait).

The OS only has heuristic methods to determine which thread should run when. It does not, however, have the ability to determine which thread will consume which resources. Again, that is akin to solving the halting problem.
 
The OS would have to READ THE CODE and in essence execute it before it is executed... I know some people think this is easy, but it isn't at all. The OS has no way to tell if a thread has been waiting forever to perform A + B (and thus force one of the other threads to wait).

The OS only has heuristic methods to determine which thread should run when. It does not, however, have the ability to determine which thread will consume which resources. Again, that is akin to solving the halting problem.

Maybe it is because I was writing it for a microcontroller (HC12), but the last OS I wrote did read the code and execute it based on the limitations of the platform.

Even so, it shouldn't be much different from the way it is currently set up. You would just have more shared resources than are currently available, and they would be shared by more threads. (Intel already shares nearly 100% of the resources between 2 threads; I just expect this to grow to more than 2 threads over time.)
 
To the OP - I think I have also suggested a similar thing: that the design of future CPUs will imitate that of current GPUs, which have a very wide array of execution resources and a pool of threads that runs on top of those execution resources.

I think one of the GPU makers once called it "Ultra Threaded Arch".
 
It's quite hard to keep a wide array of execution resources fed with compilers alone. Having a similar RISC-based coprocessor to sort and distribute the threads, à la AMD's Ultra Dispatch Processor (or Command Queue Processor), would be a nice way to increase parallelism without too many software tweaks, and giving the Command Queue Processor things like pointer-register access could help it reduce cache latency and coherency issues and speed up data fetches for the threads. (I think I'm mumbling a bit.)

I think that explains why AMD didn't fail with a VLIW approach the way nVidia did with their FX series (though I think there were other severe flaws in that architecture per se). I don't know much about architectures, but I think there may be more to gain that way than by having a general-purpose CPU (for example, Fusion) do the sorting for the wide execution array, which would reduce the execution resources available for other tasks.
 
If you've spent much time compiling Linux kernels, you are prolly familiar with the old tip of setting the -j# parameter to a number that exceeds the physical core count of the system: spawning more threads than cores will (up to a point) actually result in faster compile times, because the over-subscribed threads take advantage of hardware stalls and pipeline inefficiencies.

I think you are basically saying do this but for apps in general.

For example, I don't have the option of forcing TMPGEnc to spawn more than 4 encoding threads on my quad-core Q6600, but it would be nice if I could set it to 5 or some such and have an extra thread languishing in the background, ready to soak up an idle CPU cycle here and there as the otherwise fully active threads hit a stall for any reason.

The team at Be Computing basically did this 15 years ago, albeit with a slightly different twist.

They created an OS that easily spawns threads, gave the kernel thread management across resources, initiated a watchdog timer to retask the pipeline with stalled threads, and then slapped it all together with less-than-cutting-edge hardware and called it the BeBox, which ran the BeOS.

Which, btw, is a really light operating system that focuses on user responsiveness.

There is an open-source group attempting to rebuild the BeOS; they call it Haiku. It follows the BeOS design, but the code is completely rewritten, so as to avoid copyright issues.

If you want to see a threading implementation like you're suggesting, check it out. The latest nightlies are substantially improved over the R1A1 and R1A2 releases and support most of the generic PC hardware on the market.

www.haiku-os.org
www.haiku-files.org

Check it out; the source is openly available too.
 
Intel has experimented with quad-hyperthreading on some non-consumer CPUs.
I do wonder if that will ever reach the consumer level though and how it'll work... (i.e. pricing, effectiveness)
 
Now we are seeing a merging of the two philosophies, where some of the core is shared and some resources are dedicated to each thread (CMT).
I can see this third method really expanding in the future

I don't, really... It seems to offer no advantages over fully shared resources as in HyperThreading (it requires more hardware resources, e.g. having two integer schedulers in a Bulldozer module instead of just one scheduler for everything, and it still doesn't solve the problem of units sitting idle, so you are still wasting precious execution resources).
I think the future is in scaling HyperThreading up: adding more execution units to each single core, and allowing more than two threads to run on that core.
I think the ideal solution is to have only one mega-core, where HyperThreading handles all the logical cores.
 