Trends in Multithreaded processing

Martimus

Diamond Member
Apr 24, 2007
4,490
157
106
I was wondering how everyone thinks chip makers will tackle multithreaded processing in the future. Intel started by letting a second thread use the idle portions of a core (Hyper-Threading); AMD followed by adding a second core (CMP). Now we are seeing a merging of the two philosophies, where some of the core is shared, and some resources are dedicated to each thread (CMT).
I can see this third method really expand in the future, to the point where there are odd numbers of each type of resource in a module, depending on the percentage of each type of work you expect the module to do (say 3 FPUs, 8 ALUs, 4 AGUs, etc.), with a single "module" running a multitude of threads at once.

Seeing how early we are in this whole mess, I am somewhat excited to think about what kinds of twists the designs could take in the future. I know how I would take it, but I wonder what others think the path forward will be for the different companies.

I wouldn't limit each module (core if you are talking Intel, but that is just semantics at this point) to just two threads at once. If one portion of the processor is used 1/4 as often as other portions (and is relatively large or has some other non-trivial penalty for sitting there unused), I would make four (4) threads share that one resource, not two (2). Of course this would be very application-specific, so the optimizations will be interesting for these general-purpose processors (I wonder where they will optimize most, and what kinds of applications will see detriments).
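Just to put rough numbers on that intuition, here is a toy Python sketch (the 1/4 usage rate, the single shared unit, and the thread counts are all made-up assumptions, not real instruction mixes) of how busy one shared unit gets as more threads are allowed to issue to it:

Code:
import random

def simulate(threads, p_use=0.25, cycles=100_000, seed=1):
    """Each cycle, every thread wants the shared unit with probability p_use.
    The unit serves one request per cycle; extra requests stall."""
    random.seed(seed)
    busy = stalled = 0
    for _ in range(cycles):
        requests = sum(random.random() < p_use for _ in range(threads))
        busy += 1 if requests else 0
        stalled += max(requests - 1, 0)
    return busy / cycles, stalled / cycles

for n in (1, 2, 4, 8):
    util, stalls = simulate(n)
    print(f"{n} thread(s): unit utilization {util:.2f}, stalled requests/cycle {stalls:.2f}")

Four threads push a unit that each thread only touches a quarter of the time toward roughly 70% busy, at the cost of a modest amount of stalling, which is about the trade-off being described.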
 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
This is pretty much how it's done now. Hardware features that are sharable are shared, ones that are not are augmented.

For example, with Hyper-Threading Intel beefed up the portions of the chip where running a second thread would otherwise cause resource contention.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
I wouldn't limit each module (core if you are talking Intel, but that is just semantics at this point) to just two threads at once. If one portion of the processor is used 1/4 as often as other portions of the processor (and is relatively large or has some other non-trivial penalty for being there unused), I would make four (4) threads share that one resource, not two (2). Of course this would be very application specific, so the optimizations will be interesting for these general purpose processors (I wonder where they will optimize most, and what kinds of applications will see detriments).

If you've spent much time compiling Linux kernels you are prolly familiar with the old tip of setting make's -j parameter to a number that exceeds the physical core count of the system: spawning more threads than cores will (up to a point) actually result in faster compile times, because the over-subscribed threads take advantage of hardware stalls and pipeline inefficiencies.

I think you are basically saying do this but for apps in general.

For example, I don't have the option of forcing TMPGEnc to spawn more than 4 encoding threads on my quad-core Q6600, but it would be nice if I could set it to 5 or some such and have an extra thread languishing in the background, ready to soak up an idle CPU cycle here and there whenever the otherwise fully active threads hit a stall for any reason.
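The same effect is easy to reproduce outside of make. Here is a minimal Python sketch (my own illustration, not anything TMPGEnc exposes) where the sleep stands in for whatever makes a thread stall; running more workers than cores finishes the same batch of tasks sooner:

Code:
import os
import time
from concurrent.futures import ThreadPoolExecutor

def task(_):
    time.sleep(0.01)           # simulated stall: I/O wait, cache miss, etc.
    return sum(range(10_000))  # a little real work

def run(workers, n_tasks=200):
    start = time.time()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(task, range(n_tasks)))
    return time.time() - start

cores = os.cpu_count()
print(f"{cores} workers:     {run(cores):.2f}s")
print(f"{2 * cores} workers: {run(2 * cores):.2f}s  # usually faster while tasks keep stalling")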
 

Martimus

Diamond Member
Apr 24, 2007
4,490
157
106
If you've spent much time compiling Linux kernels you are prolly familiar with the old tip of setting make's -j parameter to a number that exceeds the physical core count of the system: spawning more threads than cores will (up to a point) actually result in faster compile times, because the over-subscribed threads take advantage of hardware stalls and pipeline inefficiencies.

I think you are basically saying do this but for apps in general.

For example, I don't have the option of forcing TMPGEnc to spawn more than 4 encoding threads on my quad-core Q6600, but it would be nice if I could set it to 5 or some such and have an extra thread languishing in the background, ready to soak up an idle CPU cycle here and there whenever the otherwise fully active threads hit a stall for any reason.

I am talking about adjusting how the processor physically handles the operations. Let's say that your expected application load only uses 1 floating point instruction for every 5 integer instructions. Why not have one FPU for every 5 integer units? And I mean this across the board, where each function is shared throughout the processor, so that there aren't really separate full cores, but rather an amalgam of components that can be used by whichever thread needs them.

The BD module system is a step in this direction, but it is limited to a maximum of two (2) threads for each shared component. I was thinking of what would happen in the future as this limitation is removed.
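As a back-of-envelope example of sizing the unit mix to an instruction mix (the mix and issue width below are made-up assumptions, just to illustrate the idea):

Code:
# Hypothetical per-bundle instruction mix: 5 integer ops for every 1 FP op and 2 memory ops.
mix = {"ALU": 5, "FPU": 1, "AGU": 2}
issue_width = 8   # total operations the module should sustain per cycle

total = sum(mix.values())
units = {kind: max(1, round(issue_width * share / total)) for kind, share in mix.items()}
print(units)      # -> {'ALU': 5, 'FPU': 1, 'AGU': 2}: unit counts matched to the expected mix

Any number of threads could then be layered on top, as long as their aggregate mix stays close to what the units were provisioned for.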
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
And I mean this across the board, where each function is shared throughout the processor, so that there aren't really separate full cores, but rather an amalgam of components that can be used by whichever thread needs them.

Ah, I see now. I agree it is heading that way. Not sure what that means for turbo-clocking and power-gating though...having discrete clock domains and logic domains has some advantages.

Definitely going to be trade-offs either way. But I see where you are going with this. A chip in 2016 might well be a single "core" capable of simultaneously processing 48 threads, depending on the specific instruction mix across all the threads (which determines the resource contention).
 

Janooo

Golden Member
Aug 22, 2005
1,067
13
81
I am talking about adjusting how the processor physically handles the operations. Let's say that your expected application load only uses 1 floating point instruction for every 5 integer instructions. Why not have one FPU for every 5 integer units? And I mean this across the board, where each function is shared throughout the processor, so that there aren't really separate full cores, but rather an amalgam of components that can be used by whichever thread needs them.

The BD module system is a step in this direction, but it is limited to a maximum of two (2) threads for each shared component. I was thinking of what would happen in the future as this limitation is removed.
It can happen in the future.
Imagine the following: OpenCL matures, and C/C++ math libraries enable compilers to generate code for the APU/GPU. Let's say vector and matrix operations and manipulation can be offloaded. At that point the FPU could become 'too fat', and the ratio could change as you suggested.
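For a sense of what that offload looks like today when done by hand, here is a minimal PyOpenCL sketch (a simple vector add rather than a full matrix routine; the kernel and variable names are just for illustration). All of the boilerplate around the kernel is exactly what mature compilers and math libraries would have to hide:

Code:
import numpy as np
import pyopencl as cl

a = np.random.rand(1 << 20).astype(np.float32)
b = np.random.rand(1 << 20).astype(np.float32)

# Context, queue, and device buffers -- the setup phase the compiler would generate.
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

prg = cl.Program(ctx, """
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *out) {
    int gid = get_global_id(0);
    out[gid] = a[gid] + b[gid];
}
""").build()

prg.vadd(queue, a.shape, None, a_buf, b_buf, out_buf)  # run on the GPU/APU device

out = np.empty_like(a)
cl.enqueue_copy(queue, out, out_buf)                   # copy the result back to the host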
 

Ben90

Platinum Member
Jun 14, 2009
2,866
3
0
I'm not really a processor architecture expert like you guys, but what is keeping Intel/AMD from keeping core counts the same and just expanding upon the superscalar design?

I haven't done much research on it, but it seems like with the ever-increasing xtor count it was just easier to get more performance out of MCM'ing two cores together. Is a truly massive superscalar design fundamentally limited by physics, or is it that the money/effort to research and design something that massive outweighs the benefits?

Bulldozer's modules seem to be treading on the blurry line of being two/one cores. What is really keeping them from creating a massive integer scheduler that feeds both "cores", allowing the whole module to handle a single thread?

I'm sure what I've suggested is probably unreasonable, but like I said I don't know much about the design of processors, and am legitimately curious as to why extremely superscalar designs never took off.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
am legitimately curious as to why extremely superscalar designs never took off.

Basically physics and economics got in the way.

http://en.wikipedia.org/wiki/Pollack's_Rule


You can keep doubling the complexity, but your rate of performance improvement dies off as the square root of your effort, while power consumption and production costs increase linearly with die area.

If your customers are willing to accept the alternative, multi-core/multi-threaded processing, then you can build higher-performance chips without spending a bundle on the development and production costs associated with non-silicon-based semiconductor technologies.
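As a back-of-envelope comparison (assuming, per Pollack's Rule, that single-thread performance scales roughly as the square root of the transistor budget):

Code:
from math import sqrt

base = 1.0                   # performance of one core built from N transistors
fat_core = sqrt(2) * base    # spend 2N transistors on one wider core: ~1.41x single-thread
dual_core = 2 * base         # spend 2N transistors on two cores: up to ~2x throughput,
                             # but only if the workload actually splits into two threads

print(f"one 2N-transistor core: ~{fat_core:.2f}x")
print(f"two N-transistor cores: ~{dual_core:.2f}x (parallel workloads only)")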
 

JFAMD

Senior member
May 16, 2009
565
0
0
Bulldozer's modules seem to be treading on the blurry line of being two/one cores. What is really keeping them from creating a massive integer scheduler that feeds both "cores", allowing the whole module to handle a single thread?

I am guessing that you spend so much time taking apart and putting together the thread that you lose the benefit of having multiple execution units.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
^ that would be my guess too...be it a net loss in performance or a very power-expensive way to try and get a meager performance increase.
 

JFAMD

Senior member
May 16, 2009
565
0
0
Yeah, the next time your wife is trying to make dinner, you can help her do it quicker if you make the potatoes while she makes the meatloaf. That is two cores.

She can't make the meatloaf faster if you both have your hands in the bowl at the same time. It will take longer.

The goal is not to figure out how to make one thread run faster; it's about trying to make sure that everything is out of the way of that thread so that it can run faster.
 

Cogman

Lifer
Sep 19, 2000
10,286
147
106
It can happen in the future.
Imagine the following: OpenCL matures, and C/C++ math libraries enable compilers to generate code for the APU/GPU. Let's say vector and matrix operations and manipulation can be offloaded. At that point the FPU could become 'too fat', and the ratio could change as you suggested.

So long as we have discrete graphics cards, this is never going to happen (it MAY happen with an APU). Vector/matrix operations are relatively quick on a CPU. However, the latency for transferring data from CPU to GPU is pretty dang big.

It only becomes efficient when you have LOTS of vector operations (think of physics calculations for 1000+ particles) or LOTS of matrix operations. That, or fairly big vectors/matrices (though even then, not so much).

The whole process of making specific GPU modules for the code, piping those modules to the GPU compiler, and loading those modules onto the GPU only adds to the issues. It wouldn't be a simple MatrixA * MatrixB and voilà, it is done on the GPU. There would at least have to be a setup phase for the operation.

Compilers today have a difficult time effectively using things like SSE instructions; I can't see them doing any better a job with GPU offloading, which by its nature would be far more complex than a simple "when should I use an XMM register?" decision.
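A rough break-even estimate makes the same point. All the numbers below are ballpark assumptions (PCIe-2.0-ish usable bandwidth, a few tens of GFLOPS sustained on the host CPU), not measurements, and the sketch ignores kernel-launch latency and the GPU's own compute time, which hurt the small cases even more:

Code:
pcie_bw   = 6e9    # usable PCIe bytes/s (assumed)
cpu_flops = 40e9   # sustained CPU FLOP/s (assumed)

def ms(seconds):
    return seconds * 1e3

# Element-wise vector op: O(N) work on O(N) data -- the transfer swamps the compute.
n = 1_000_000
transfer = 3 * n * 4 / pcie_bw            # two inputs over, one result back
compute  = n / cpu_flops
print(f"vector add, N=1e6: transfer {ms(transfer):.3f} ms vs CPU {ms(compute):.3f} ms")

# Dense matmul: O(N^3) work on O(N^2) data -- the compute eventually dwarfs the transfer.
for n in (128, 1024, 4096):
    transfer = 3 * n * n * 4 / pcie_bw
    compute  = 2 * n ** 3 / cpu_flops
    print(f"matmul, N={n:4d}:   transfer {ms(transfer):.3f} ms vs CPU {ms(compute):.3f} ms")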
 

Cogman

Lifer
Sep 19, 2000
10,286
147
106
I am guessing that you spend so much time taking apart and putting together the thread that you lose the benefit of having multiple execution units.

I agree as well. It could also result in some pretty weird performance problems. E.g. ThreadA runs at full speed doing A * B, but ThreadB struggles doing A + B because all of the integer units are being used by ThreadA. From the OS level, there would be no way to really control that or schedule threads so that something like that doesn't happen.
 

Martimus

Diamond Member
Apr 24, 2007
4,490
157
106
I agree as well. It could also result in some pretty weird performance problems. E.g. ThreadA runs at full speed doing A * B, but ThreadB struggles doing A + B because all of the integer units are being used by ThreadA. From the OS level, there would be no way to really control that or schedule threads so that something like that doesn't happen.

I don't agree at all that you cannot code an OS to take the hardware limitations into account when assigning resources to threads.
 

Cogman

Lifer
Sep 19, 2000
10,286
147
106
I don't agree at all that you cannot code an OS to take the hardware limitations into account when assigning resources to threads.

The OS would have to READ THE CODE and in essence execute it before it is executed...I know some people think this is easy, but it isn't at all. The OS has no way to tell if a thread has been waiting forever to perform A + B (and thus force one of the other threads to wait).

The OS only has heuristic methods to determine which thread should run when. It does not, however, have the ability to determine which thread will consume which resources. Again, that is akin to solving the halting problem.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
The last thing Windows needs is yet another excuse to make its Chaitin's constant asymptotically approach unity...
 

Martimus

Diamond Member
Apr 24, 2007
4,490
157
106
The OS would have to READ THE CODE and in essence execute it before it is executed...I know some people think this is easy, but it isn't at all. The OS has no way to tell if a thread has been waiting forever to perform A + B (and thus force one of the other threads to wait).

The OS only has heuristic methods to determine which thread should run when. It does not, however, have the ability to determine which thread will consume which resources. Again, that is akin to solving the halting problem.

Maybe it is because I was writing it for a microcontroller (HC12), but the last OS I wrote did read the code and execute it based on the limitations of the platform.

Even so, it shouldn't be much different from the way it is currently set up. You would just have more shared resources than are currently available, and they would be shared by more threads. (Intel already shares nearly 100% of the resources between 2 threads; I just expect this to grow to more than 2 threads over time.)
 

VirtualLarry

No Lifer
Aug 25, 2001
56,587
10,225
126
To the OP - I think I have also suggested a similar thing: that the design of future CPUs will imitate that of current GPUs, which have a very wide array of execution resources and a pool of threads that runs on top of those execution resources.

I think one of the GPU makers once called it "Ultra Threaded Arch".
 
Sep 9, 2010
86
0
0
It's quite hard to keep a wide array of execution resources fed with compilers alone. Having a RISC-based coprocessor to sort and distribute the threads, à la AMD's Ultra Dispatch Processor (or Command Queue Processor), seems like a nice way to increase parallelism without too many software tweaks, and giving the Command Queue Processor things like pointer register access could help reduce cache latency/coherency issues and speed up data fetches for the threads. (I think I'm mumbling a bit.)

I think that explains why AMD didn't fail with a VLIW approach the way nVidia did with their FX series (though I think there were other severe flaws in that architecture per se). I don't know much about architectures, but I think there's a possibility of bigger gains that way than in having a general-purpose CPU (for example Fusion) do the sorting for the wide execution array, which would reduce the execution resources available for other tasks.
 

ModestGamer

Banned
Jun 30, 2010
1,140
0
0
If you've spent much time compiling linux kernels you are prolly familiar with the old tip of setting the -j# parameter to a # that exceeds the physical core count of the system because spawning more threads than cores will (up to a point) actually result in faster compile times because over-subscribing threads will take advantage of hardware stalls and pipeline inefficiencies.

I think you are basically saying do this but for apps in general.

For example I don't have the option of forcing TMPGEnc to spawn more than 4 encoding threads on my quad-core Q6600, but it would be nice if I could set it to 5 or some such and have an extra thread languishing in the background ready to soak up an extra idle cpu cycle here and there as otherwise fully active threads hit a stall for any reason.

The team at Be Computing basically did this 15 years ago, albeit with a slightly different twist.

They created an OS that spawned threads easily, gave the kernel thread management across resources, used a watchdog timer to retask the pipeline when threads stalled, and then slapped it all together on less-than-cutting-edge hardware and called it the BeBox, which ran the BeOS.

BeOS, by the way, is a really light operating system that focuses on user responsiveness.

There is an open-source group attempting to rebuild the BeOS, and they call it Haiku. It is based on the BeOS design but completely rewritten, so as to avoid copyright issues.

If you want to see a threading implementation like you're suggesting, check it out. The latest nightlies are substantially improved over the R1A1 and R1A2 releases and support most of the generic PC hardware on the market.

www.haiku-os.org
www.haiku-files.org

Check it out; the source is openly available too.
 

xd_1771

Member
Sep 19, 2010
72
0
0
Intel has experimented with quad-hyperthreading on some non-consumer CPUs.
I do wonder if that will ever reach the consumer level though and how it'll work... (i.e. pricing, effectiveness)
 

Scali

Banned
Dec 3, 2004
2,495
1
0
Now we are seeing a merging of the two philosophies, where some of the core is shared, and some resources are dedicated to each thread (CMT).
I can see this third method really expand in the future

I don't, really... It seems to offer no advantages over fully shared resources as in HyperThreading: it requires more hardware (e.g. two integer schedulers in a Bulldozer module instead of just one scheduler for everything), and it still doesn't solve the problem of units sitting idle, so you are still wasting precious execution resources.
I think the future is in scaling HyperThreading up... adding more execution units to each single core, and allowing more than two threads to run on that core.
I think the ideal solution is to have only one mega-core, where HyperThreading handles all the logical cores.
 