Trends in Multithreaded processing

Martimus · Sep 20, 2010

Scali said:
I don't, really... It seems to offer no advantages over fully shared resources like in HyperThreading (it requires more hardware resources, eg having two integer schedulers in a Bulldozer module, instead of just one scheduler for everything, and it still doesn't solve the problem of units sitting idle, so you are still wasting precious execution resources).
I think the future is in scaling HyperThreading up... adding more execution units to each single core, and allowing more than two threads to run on that core.
I think the ideal solution is to have only one mega-core, where HyperThreading handles all the logical cores.

This is exactly what I said, and it is describing the CMT option (adding execution units). According to your post, you agree completely, but decided to say that you completely disagree while explaning why you agree.

Scali · Sep 20, 2010

Martimus said:
This is exactly what I said, and it is describing the CMT option (adding execution units). According to your post, you agree completely, but decided to say that you completely disagree while explaning why you agree.

Uhh no...
I responded to "Now we are seeing a merging of the two philosophies, where some of the core is shared, and some resources are dedicated to each thread (CMT)."
I disagree on the "some resources dedicated", I say "all should be shared".

I think the problem is more that your understanding of HyperThreading is incorrect... the second thread is not just using the idle portions. Search for my earlier posts on the subject, they'll direct you to Intel's optimization manuals, which will explain the resource partitioning scheme that HT uses, and why 'using idle' is incorrect.

Martimus · Sep 20, 2010

Scali said:
Uhh no...
I responded to "Now we are seeing a merging of the two philosophies, where some of the core is shared, and some resources are dedicated to each thread (CMT)."
I disagree on the "some resources dedicated", I say "all should be shared".

I think the problem is more that your understanding of HyperThreading is incorrect... the second thread is not just using the idle portions. Search for my earlier posts on the subject, they'll direct you to Intel's optimization manuals, which will explain the resource partitioning scheme that HT uses, and why 'using idle' is incorrect.

If you only have two threads per module, and one integer unit cannot be used by two threads at the same time, what is the difference between this and HyperThreading? The biggest difference is that there are two resources, so both threads can use them at the same time.

Don't worry about the marketing calling this two cores, it is just the next evolution to HyperThreading where extra execution units are added to areas where the most conflicts between the treads would likely be. The more I read about how the Bulldozer module was set up, the more I see it as a better version of HyperThreading, and this thread was where I postulated on the next improvements (where I talked about 1 core with a mix of resources much the same way you did).

Scali · Sep 20, 2010

Martimus said:
If you only have two threads per module, and one integer unit cannot be used by two threads at the same time, what is the difference between this and HyperThreading? The biggest difference is that there are two resources, so both threads can use them at the same time.

Not sure what you're talking about here...
If it is Bulldozer, the difference is simple:
Bulldozer has its integer scheduler and execution units fixed to the logical core.
That is, thread A can never use the scheduler and units of logical core B, no matter how idle they are.
HyperThreading instead would simply have one scheduler, put all execution units in one shared pool, and two threads could make use of as many execution units as they need, unlike the hard limit of 2 units per thread with Bulldozer.

Bulldozer therefore is not the next evolutionary step beyond HyperThreading, but rather a half-hearted implementation of it, where only the floating point unit is truly 'HyperThreaded', and the integer portion is still the same as a regular two physical core processor.

JFAMD · Sep 20, 2010

But HT is limited to the execution units of a single core. Having all the front end in the world won't help you when your pipelines are totally full.

You can't get more than 100% efficiency. Period.

So if one thread is at 100% efficiency, the second thread is at 0% (stalled).

People can argue all they want about the "best" implementation, but the truth remains, that no matter how you build your front end, you are always limited by the number of execution resources. If you have enough exeuction to run two threads, doesn't it make more sense to run them as two threads and remove any context switching from the front end?

Scali · Sep 20, 2010

JFAMD said:
But HT is limited to the execution units of a single core. Having all the front end in the world won't help you when your pipelines are totally full.

But a single core is not really limited to a certain number of execution units.
A single physical core of the Nehalem architecture already has more execution units than an entire Bulldozer module. And there's no reason why they should stop there. They could add more execution units and deepen the ooo-buffers to support more instructions/threads (as I already mentioned earlier, Sun/Oracle's Niagara architecture can run 8 threads per core).

JFAMD said:
You can't get more than 100% efficiency. Period.

So if one thread is at 100% efficiency, the second thread is at 0% (stalled).

The problem with CPUs in general and x86 in particular is that you rarely reach anything remotely close to 100% efficiency.
And as you add more execution units to the core, the efficiency per thread reduces further (which is one of the reasons why you can drop one of the ALUs from your Bulldozer cores without much pain compared to the previous arch, it was mostly sitting idle anyway). This opens up more room for HyperThreading.

JFAMD said:
People can argue all they want about the "best" implementation, but the truth remains, that no matter how you build your front end, you are always limited by the number of execution resources. If you have enough exeuction to run two threads, doesn't it make more sense to run them as two threads and remove any context switching from the front end?

Uhhh, what does that have to do with anything?
Both HyperThreading and Bulldozer's module show two logical cores to the OS per core/module, so neither needs any context switching.

Martimus · Sep 20, 2010

Scali said:
Not sure what you're talking about here...
If it is Bulldozer, the difference is simple:
Bulldozer has its integer scheduler and execution units fixed to the logical core.
That is, thread A can never use the scheduler and units of logical core B, no matter how idle they are.
HyperThreading instead would simply have one scheduler, put all execution units in one shared pool, and two threads could make use of as many execution units as they need, unlike the hard limit of 2 units per thread with Bulldozer.

Bulldozer therefore is not the next evolutionary step beyond HyperThreading, but rather a half-hearted implementation of it, where only the floating point unit is truly 'HyperThreaded', and the integer portion is still the same as a regular two physical core processor.

I understand that you believe this, but I am going to assume that you haven't read this article: http://www.realworldtech.com/page.cfm?ArticleID=RWT082610181333&p=1

You can see from here that the majority of the resources are shared between the two cores, and only a few are dedicated per thread. If anything, Intels version of Hyperthreading is half-hearted implementation, since they do very little to eliminate conflicts between the threads. I am sure that intel will remedy this at some point, since they are not stupid, but as it currently sits it isn't very optimized.

Scali · Sep 20, 2010

Martimus said:
I understand that you believe this, but I am going to assume that you haven't read this article: http://www.realworldtech.com/page.cfm?ArticleID=RWT082610181333&p=1

You can see from here that the majority of the resources are shared between the two cores, and only a few are dedicated per thread. If anything, Intels version of Hyperthreading is half-hearted implementation, since they do very little to eliminate conflicts between the threads. I am sure that intel will remedy this at some point, since they are not stupid, but as it currently sits it isn't very optimized.

The ones that matter most, aren't shared: the schedulers and integer units (as I already said, so what's the point of your entire post?).
They're saving on die area more than they are improving execution efficiency.
And saving on die-area is not as spectacular as the 'savings' with HT (I say 'savings', because it's more like you get a 'free' extra core with only a handful of logic, so it makes much more sense to look at it from the nr of physical cores than the other way around).

zephyrprime · Sep 20, 2010

Ben90 said:
I'm not really a processor architecture expert like you guys, but what is keeping Intel/AMD from keeping core counts the same and just expanding upon the superscalar design?

I haven't done much research on it, but it seems like with the ever increasing xtor count, it was just easier to get more performance out MCM'ing two cores together. Is there something fundamentally wrong with a truly massive superscalar design in physics, or is it something where the money/effort to research and design something that massive is more than the benefits?

Nobody can figure out a way to extract more instruction level parallelism to make a wider superscalar effective.

Scali · Sep 20, 2010

zephyrprime said:
Nobody can figure out a way to extract more instruction level parallelism to make a wider superscalar effective.

Problem is that most such attempts start at the instruction set.
A big reason for the limited superscalar effectivity of x86 processors is implicit in the instructionset itself.
The majority of instructions only have 2 operands, where one of the source operands is re-used as a destination operand.
This means that you'll generally need extra dependent instructions to copy data around to other registers, when you don't want to overwrite your source operands.

More modern instructionsets use a 3-operand system, so the source operands will not be overwritten (unless you use the same source and destination operand ofcourse).

For example, if you want to do something like this:
C = A + B

The x86 will have to do this:
C = A
C = C + B (dependent on previous)

A 3-operand processor can do this directly:
C = A + B
So you have only one instruction, rather than 2 dependent ones.

You can take the removal of dependencies even further, with the VLIW/EPIC philosophy (like the Transmeta Crusoe or the Intel Itanium), and pack multiple independent instructions into a single 'very large instruction word'.
Itanium works with a system of 'code bundles', where 3 instructions are packed together. These 3 instructions are guaranteed to be independent of eachother. So there is your explicit instruction level parallelism (when less than 3 independent instructions can be found, nop instructions are inserted).
Itanium then goes one step beyond that and executes 2 bundles at the same time.

But with x86 you're pretty much stuck with the limited instructionset.
We've not really improved on ILP since the first out-of-order architectures.
Extra performance mainly comes from new instructions (SSE/AVX), improved caching and higher clock speeds.
But if you take the cache out of the equation, a Pentium Pro can do up to 3 instructions per cycle, and with well-optimized code, sustaining 2 instructions per cycle is possible.
A modern Core i7 won't really do that better.

Schmide · Sep 20, 2010

Scali said:
For example, if you want to do something like this:
C = A + B

The x86 will have to do this:
C = A
C = C + B (dependent on previous)

Wouldn't you want to it as so?

C = A
A = A + B

As long as the source register isn't written to, they both can be in flight at the same time.

Cogman · Sep 20, 2010

Schmide said:
Wouldn't you want to it as so?

C = A
A = A + B

As long as the source register isn't written to, they both can be in flight at the same time.

Then you would be smashing register A.

Scali · Sep 20, 2010

I think that pretty much demonstrates my point:
With x86 it is pretty difficult to maneuver your code in such a way that you can extract ILP.
Note also that a compiler may be able to make the switch of:
C = A
A = A + B
(it basically swaps around A and C internally, so C becomes the new A)

But a CPU can not do this.
So unless the code is already compiled as such, the CPU will just have to execute it in its dependent form.
Intel has been experimenting with op fusion since the original Core mobile processors, but it hasn't really made much of a difference (eg in Core2 processors the op fusion only works in 32-bit mode, not in 64-bit mode, but it's not exactly like this makes 64-bit slower).

With a better instructionset, there would be less room for error for the compiler. It will just output C = A + B, no matter how naive it is about optimization.
And obviously it is just a single instruction, instead of two. So where your x86 processor needs to have the compiler optimize it properly, and then the processor has to figure out that there are no dependencies... it still has to decode and execute two instructions, which in the best case can be executed in parallel.
The other CPU would just do it in one instruction, so the whole meaning of ILP is completely different. Since the instructions themselves are designed more efficiently, it doesn't need to execute as many of them in parallel in order to get the same amount of work done as the x86 CPU.

Schmide · Sep 20, 2010

Cogman said:
Then you would be smashing register A.

Some register is going to be smashed regardless. For me the only benefit I can see from 3 operand instructions is a combining of instructions. The same micro ops still have to go through the pipeline and the same operations are done. You're really only saving space in the code cache at the expense of complexity, I fear a RISC/CISC debate forthcoming.

Scali · Sep 20, 2010

Schmide said:
Some register is going to be smashed regardless. For me the only benefit I can see from 3 operand instructions is a combining of instructions. The same micro ops still have to go through the pipeline and the same operations are done.

That's not true. A 3-operand instructionset will just be implemented with 3-operand micro ops.
In fact, because of op fusion, we have some 3-operand micro ops in the x86 CPUs as well.
Problem is that you have to rely on the decoder frontend to extract this and fuse the ops in the first place. It cannot work in all cases.

Scali · Sep 20, 2010

On a slightly related note... Google published a paper on how they don't really like the trend of cores for the sake of cores, and prefer to have cores with the best possible single-threaded performance:
http://static.googleusercontent.com...esearch.google.com/en//pubs/archive/36448.pdf

This has been a pet peeve of mine for many years as well, nice to see that Google took the time to publish this.

Schmide · Sep 20, 2010

I've always thought, in terms of core disparity, it would be nice to have one type of core dedicated to OS type operations. An OS doesn't need a large floating point unit but could benefit from extra large many way associative caches and more address/int units. Maybe even some interrupt optimization such that other resources aren't stalled by this type of event or snoop operations to monitor other types of cores.

So in the multi-core world you would have this hyper-visor core, a few general purpose cores, and maybe a GPU/FPU highly parallel core or so.

ModestGamer · Sep 20, 2010

Scali said:
Problem is that most such attempts start at the instruction set.
A big reason for the limited superscalar effectivity of x86 processors is implicit in the instructionset itself.
The majority of instructions only have 2 operands, where one of the source operands is re-used as a destination operand.
This means that you'll generally need extra dependent instructions to copy data around to other registers, when you don't want to overwrite your source operands.

More modern instructionsets use a 3-operand system, so the source operands will not be overwritten (unless you use the same source and destination operand ofcourse).

For example, if you want to do something like this:
C = A + B

The x86 will have to do this:
C = A
C = C + B (dependent on previous)

A 3-operand processor can do this directly:
C = A + B
So you have only one instruction, rather than 2 dependent ones.

You can take the removal of dependencies even further, with the VLIW/EPIC philosophy (like the Transmeta Crusoe or the Intel Itanium), and pack multiple independent instructions into a single 'very large instruction word'.
Itanium works with a system of 'code bundles', where 3 instructions are packed together. These 3 instructions are guaranteed to be independent of eachother. So there is your explicit instruction level parallelism (when less than 3 independent instructions can be found, nop instructions are inserted).
Itanium then goes one step beyond that and executes 2 bundles at the same time.

But with x86 you're pretty much stuck with the limited instructionset.
We've not really improved on ILP since the first out-of-order architectures.
Extra performance mainly comes from new instructions (SSE/AVX), improved caching and higher clock speeds.
But if you take the cache out of the equation, a Pentium Pro can do up to 3 instructions per cycle, and with well-optimized code, sustaining 2 instructions per cycle is possible.
A modern Core i7 won't really do that better.

Cogman said:
Then you would be smashing register A.

Scali said:
I think that pretty much demonstrates my point:
With x86 it is pretty difficult to maneuver your code in such a way that you can extract ILP.
Note also that a compiler may be able to make the switch of:
C = A
A = A + B
(it basically swaps around A and C internally, so C becomes the new A)

But a CPU can not do this.
So unless the code is already compiled as such, the CPU will just have to execute it in its dependent form.
Intel has been experimenting with op fusion since the original Core mobile processors, but it hasn't really made much of a difference (eg in Core2 processors the op fusion only works in 32-bit mode, not in 64-bit mode, but it's not exactly like this makes 64-bit slower).

With a better instructionset, there would be less room for error for the compiler. It will just output C = A + B, no matter how naive it is about optimization.
And obviously it is just a single instruction, instead of two. So where your x86 processor needs to have the compiler optimize it properly, and then the processor has to figure out that there are no dependencies... it still has to decode and execute two instructions, which in the best case can be executed in parallel.
The other CPU would just do it in one instruction, so the whole meaning of ILP is completely different. Since the instructions themselves are designed more efficiently, it doesn't need to execute as many of them in parallel in order to get the same amount of work done as the x86 CPU.

I should assume you guys have some screwy work arounds from C and c++ looking at this. the x86 instruction set is pretty damned feature rich.

the issue is compilers and compiler librarys.

introduce yourselves to the world of assembly. writing hybraid c c++ and assembly makes alot of sense, if the compiler supports it.

Schmide · Sep 20, 2010

ModestGamer said:
I should assume you guys have some screwy work arounds from C and c++ looking at this. the x86 instruction set is pretty damned feature rich.

the issue is compilers and compiler librarys.

introduce yourselves to the world of assembly. writing hybraid c c++ and assembly makes alot of sense, if the compiler supports it.

WTF? (I know we're not supposed to cuss but WTF?)

Compilers do a darn good job at scheduling instructions such that they can be executed an parallel.

Regardless!!! Whether the above was hand tuned assembly or code generated from a compiler, the restrictions are the same, a dependent variable must be resolved before you can use it.

The implications of a 3 operand instruction is little more than a change in where the final output is written; thus in effect, combining 2 instructions into 1. Against properly scheduled/renamed instructions it will probably save very little as you're basically allowing some other operation to be scheduled early. One offset in a deep pipeline.

The real advantage is when you get into the AVX instructions that combine a multiply with an add. This will shorten the pipeline by much more and allow results to return to the pipeline without assignment to a physical register. It also has the advantage of limiting the number of entries needed in the physical register file to do the needed operations allowing more operations to be pre-scheduled.

Voo · Sep 20, 2010

ModestGamer said:
introduce yourselves to the world of assembly. writing hybraid c c++ and assembly makes alot of sense, if the compiler supports it.

Really, there's a 3op add in the x86 ISA that's not just emulated on the µop level?

But register renaming more or less neglects the impact of such a operation, doesn't it? After all in reality we don't really overwrite the same register (or at least we don't have to), so that's essentially a 3op instruction. Compilers are getting extraordinary good at most things, if there's an area where handtuned asm can still be noticeably faster are vector instructions

Cogman · Sep 20, 2010

Schmide said:
WTF? (I know we're not supposed to cuss but WTF?)

Compilers do a darn good job at scheduling instructions such that they can be executed an parallel.

Regardless!!! Whether the above was hand tuned assembly or code generated from a compiler, the restrictions are the same, a dependent variable must be resolved before you can use it.

The implications of a 3 operand instruction is little more than a change in where the final output is written; thus in effect, combining 2 instructions into 1. Against properly scheduled/renamed instructions it will probably save very little as you're basically allowing some other operation to be scheduled early. One offset in a deep pipeline.

The real advantage is when you get into the AVX instructions that combine a multiply with an add. This will shorten the pipeline by much more and allow results to return to the pipeline without assignment to a physical register. It also has the advantage of limiting the number of entries needed in the physical register file to do the needed operations allowing more operations to be pre-scheduled.

While compilers do a pretty good job at instruction ordering and choosing (Many instructions in the x86 architecture go unused because they are worthless) They do somewhat suck at effective register use (That is, at least the gcc does). In 99.999% of cases the compiler is going to churn out code that is pretty hard to improve on from an experienced assemblers viewpoint, however, there are cases where someone that knows what they are doing can churn out something better.

That isn't to say every program should have assembly in it. Just that programmers focused on performance should have SOME idea how to use assembly.

BTW, I wouldn't take what modest gamer has to say too seriously. In other threads he has pretty much come out and said that he is a troll.

Dufus · Sep 20, 2010

Scali said:
The x86 will have to do this:
C = A
C = C + B (dependent on previous)

lea eax,[ebx+ecx]

Scali · Sep 21, 2010

Dufus said:
lea eax,[ebx+ecx]

That's not the point (I'm just trying to keep things simple, on advice of some of the moderators).
The CPU cannot rewrite an add to a lea. And for anything other than an add, your little trick will not work at all.
And then we're not getting into the background of lea being executed by the AGU rather than the ALU on most x86 CPUs... etc.

Point remains: the instructionset makes it a lot more difficult than it could be.

Scali · Sep 21, 2010

ModestGamer said:
introduce yourselves to the world of assembly. writing hybraid c c++ and assembly makes alot of sense, if the compiler supports it.

Hi, I've been a valued member of http://www.asmcommunity.net for many years now.

degibson · Sep 21, 2010

Scali said:
Point remains: the instructionset makes it a lot more difficult than it could be.

I agree intellectually, but realistically the ISA doesn't matter. I'ts just a method for describing dataflow dependences. In a modern out-of-order x86 core, there's enough uOp cracking and fusion going on that the CPU is basically dynamically rewriting code on the fly anyway. And then, executing it out of order.

1. x86 is meant for a compiler. Sure, some people are better than compilers at writing x86 assembly, but it's an explicit stack-based ISA.

2. Register renaming exposes a lot of independence, right in the CPU.

Code:

A = B + C
X = A

A = D + E
Y = A

After renaming, those operations are independent, so long as A is a register. I.e., now more WAW.

3. All ISAs suck in their own ways.

4. At least there are more registers now in x64. Kinda gums up instruction segments to use them, though.

Trends in Multithreaded processing

Diamond Member

Banned

Diamond Member

Banned

Senior member

Banned

Diamond Member

Banned

Diamond Member

Banned

Diamond Member

Lifer

Banned

Diamond Member

Banned

Banned

Diamond Member

Banned

Diamond Member

Golden Member

Lifer

Senior member

Banned

Banned

Golden Member