AMD working on reverse Hyper-Threading technology

Gamingphreek · Apr 14, 2006

I suggest you re-read the fundamentals of SMT/HT if you really think that is what HT/SMT is.

I suggest you re-read, because what I said is HT Technology in a nutshell.

The instruction set of a chip has nothing to do with fundamental computer science architecture.

Yes it does! Itanium uses the IA64 instruction set. It executes instructions much differently than an x86 core.

And no, it is not because I think AMD is "teh roxor". I completely understand how HT is carried out, and I know that it is how Intel compensates for a very long pipeline.

-Kevin

Viditor · Apr 14, 2006

Originally posted by: Furen
Errr... Isn't this what Intel's Mitosis will do, too?

Good point Furen...in reading Intel's description of Speculative Threading (Mitosis), it occurs to me that AMD might have an advantage here due to their use of MOESI cache protocol instead of MESI.
This is not my field (so any comments are more than welcome), but it seems to me that performing speculative threading within L2 (or even L3?) would be easier and quicker if you add an Owned state to the cache coherency protocol...
Am I off on this?
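
For reference, here is a minimal Python sketch of the one transition where the two protocols differ. It is a textbook simplification, not a model of any AMD or Intel part, and the `State` enum and `snoop_read` function are names made up for illustration. It shows what happens to a cache holding a dirty line when another cache reads it: under MESI the holder writes the line back to memory and drops to Shared, while under MOESI it moves to Owned, keeps the dirty data, and supplies it cache-to-cache.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    OWNED = "O"      # exists only in MOESI
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def snoop_read(state, protocol):
    """Holder's reaction when another cache reads the line.

    Returns (new_state, writes_back_to_memory). Simplified textbook
    behaviour; real implementations differ in detail.
    """
    if protocol not in ("MESI", "MOESI"):
        raise ValueError("unknown protocol: " + protocol)
    if state is State.OWNED and protocol == "MESI":
        raise ValueError("MESI has no Owned state")
    if state is State.MODIFIED:
        if protocol == "MOESI":
            # Keep the dirty data, serve it cache-to-cache, skip the write-back.
            return State.OWNED, False
        # MESI: memory must be updated before the line can be shared.
        return State.SHARED, True
    if state is State.EXCLUSIVE:
        return State.SHARED, False
    # Owned, Shared and Invalid holders do not change state on a remote read.
    return state, False


print(snoop_read(State.MODIFIED, "MESI"))
print(snoop_read(State.MODIFIED, "MOESI"))
```

The sketch only covers what the Owned state changes at the coherency level. Whether that helps speculative threading is a separate question.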

RallyMaster · Apr 14, 2006

Sounds good. I'd like to see the result of it.

zephyrprime · Apr 14, 2006

Originally posted by: Gamingphreek

I suggest you re-read the fundamentals of SMT/HT if you really think that is what HT/SMT is.

I suggest you re-read, because what I said is HT Technology in a nutshell.

The instruction set of a chip has nothing to do with fundamental computer science architecture.

Yes it does! Itanium uses the IA64 instruction set. It executes instructions much differently than an x86 core.

And no, it is not because I think AMD is "teh roxor". I completely understand how HT is carried out, and I know that it is how Intel compensates for a very long pipeline.

-Kevin

Sorry dude, but you're completely wrong and Accord99 is completely right. It's clear that you have a misconception of both pipelining and simultaneous multithreading.

I recommend these articles:
http://arstechnica.com/paedia/c/cpu/part-2/cpu2-1.html
http://arstechnica.com/articles/paedia/cpu/hyperthreading.ars

Viditor · Apr 15, 2006

Originally posted by: zephyrprime
Sorry dude, but you're completely wrong and Accord99 is completely right. It's clear that you have a misconception of both pipelining and simultaneous multithreading.

I recommend these articles:
http://arstechnica.com/paedia/c/cpu/part-2/cpu2-1.html
http://arstechnica.com/articles/paedia/cpu/hyperthreading.ars

They are both partially correct...
With longer pipelines (and higher clockspeeds), the latency introduced by HT is a relatively trivial amount and HT gains much more than it loses.
Contrarily, with a shorter pipeline, latency and efficiency are everything! The gains from SMT are far outweighed by the latency penalty.
The reason Conroe isn't HT enabled yet is that it is far more latency sensitive than Netburst. Going forward, however, as clockspeeds increase, the scales should tip the other direction so that the net gain from SMT is once again greater than the latency penalties.

dmens · Apr 15, 2006

Originally posted by: Viditor
Good point Furen...in reading Intel's description of Speculative Threading (Mitosis), it occurs to me that AMD might have an advantage here due to their use of MOESI cache protocol instead of MESI.
This is not my field (so any comments are more than welcome), but it seems to me that performing speculative threading within L2 (or even L3?) would be easier and quicker if you add an Owned state to the cache coherency protocol...
Am I off on this?

AFAIK the described form of speculative threading should be transparent to the bus so MESI/MOESI or whatever coherency protocol used makes no difference.

Also from what I've seen this isn't really "reverse-SMT" (imo), rather it is yet another type of (aggressive) speculation with a branch and commit. A chip can do SMT and speculative threading at the same time probably...

dmens · Apr 15, 2006

Originally posted by: Viditor
They are both partially correct...
With longer pipelines (and higher clockspeeds), the latency introduced by HT is a relatively trivial amount and HT gains much more than it loses.
Contrarily, with a shorter pipeline, latency and efficiency are everything! The gains from SMT are far outweighed by the latency penalty.
The reason Conroe isn't HT enabled yet is that it is far more latency sensitive than Netburst. Going forward, however, as clockspeeds increase, the scales should tip the other direction so that the net gain from SMT is once again greater than the latency penalties.

P4 is far more latency sensitive than merom due to replay, but that is an architectural effect anyways... in general, latency needs to be talked about from a workload pov, as opposed to a uarch. All this talk about SMT being affected by long/short pipes is a lot of junk, there are chips out there and in the works that are far shorter than P4 and incorporate dual or multi SMT.

The real metric that needs to be looked at is whether two threads running simultaneously on one core come out faster than two threads switching on one core, because the partitioned resources on a core with SMT harm both threads through smaller buffers.

Viditor · Apr 15, 2006

Originally posted by: dmens

Originally posted by: Viditor
Good point Furen...in reading Intel's description of Speculative Threading (Mitosis), it occurs to me that AMD might have an advantage here due to their use of MOESI cache protocol instead of MESI.
This is not my field (so any comments are more than welcome), but it seems to me that performing speculative threading within L2 (or even L3?) would be easier and quicker if you add an Owned state to the cache coherency protocol...
Am I off on this?

AFAIK the described form of speculative threading should be transparent to the bus so MESI/MOESI or whatever coherency protocol used makes no difference.

Also from what I've seen this isn't really "reverse-SMT" (imo), rather it is yet another type of (aggressive) speculation with a branch and commit. A chip can do SMT and speculative threading at the same time probably...

Thanks for the reply dmens!
My (uneducated) thought on this was that if you could utilize a very large L2 or L3 for the speculative threads instead of main memory, then the increased granularity of MOESI would allow for quicker flagging of these threads and thereby decrease latency tremendously...what do you think?

Viditor · Apr 15, 2006

Originally posted by: dmens

Originally posted by: Viditor
They are both partially correct...
With longer pipelines (and higher clockspeeds), the latency introduced by HT is a relatively trivial amount and HT gains much more than it loses.
Contrarily, with a shorter pipeline, latency and efficiency are everything! The gains from SMT are far outweighed by the latency penalty.
The reason Conroe isn't HT enabled yet is that it is far more latency sensitive than Netburst. Going forward, however, as clockspeeds increase, the scales should tip the other direction so that the net gain from SMT is once again greater than the latency penalties.

P4 is far more latency sensitive than merom due to replay, but that is an architectural effect anyways... in general, latency needs to be talked about from a workload pov, as opposed to a uarch. All this talk about SMT being affected by long/short pipes is a lot of junk, there are chips out there and in the works that are far shorter than P4 and incorporate dual or multi SMT.

The real metric that needs to be looked at is whether two threads running simultaneously on one core come out faster than two threads switching on one core, because the partitioned resources on a core with SMT harm both threads through smaller buffers.

I assume you are talking about IBM here...there is a big difference in that the Power5 chip lets the OS (AIX) control the whole stack and implements SMT in a different and much simplified manner (because they can!).
Ars Technica article on Power5 SMT
Therefore, the latency penalty for SMT on Power5 is far less than it is for HT...it's less of a uA design issue than it is a platform (including OS) issue.

dmens · Apr 15, 2006

Not really that different... the IBM implementation allows the OS to tune the frontend decode rate, which becomes useful if you have more than two threads in play, as opposed to two-thread SMT, for which thread priority is a binary decision. Livelock detection and thread priority logic already exist on the P4 and are totally transparent to the user.

Instructions-in-flight for P4 in MT have the exact same instruction latency as ST mode, from a purely processor POV. A clean SMT implementation should not incur any additional instruction latency. If the OS can hint the frontend to do a better job, that is good.

dmens · Apr 15, 2006

Originally posted by: Viditor
Thanks for the reply dmens!
My (uneducated) thought on this was that if you could utilize a very large L2 or L3 for the speculative threads instead of main memory, then the increased granularity of MOESI would allow for quicker flagging of these threads and thereby decrease latency tremendously...what do you think?

My understanding is that all the speculative threads are in flight in the backend portion of the core, and the memory system should not be aware of all the guessing. It'd be very difficult to recover state if incorrect speculation managed to leak to a cache, afaik.

Memory latency is mentioned in the paper because the mitosis idea masks latency by doing work first then confirming later, which is done now already but more conservatively. A mitosis machine wouldn't care much about slow memory, it'd happily process down long blocks as long as the machine has enough slice buffers, and if the dependency is not real, you get good speedup. Otherwise, as the paper said, it is no slower than what we have now, being a conservative compiler, and a hell of a lot of wasted power.

BitByBit · Apr 15, 2006

Originally posted by: Gamingphreek

In a long pipeline one packet is sent through. It takes so long to go through that much of the pipeline is not working and is idle. Additionally, if there were a cache miss then it would have to do the same thing over again. HT staggers packets and sends another behind the first packet. Therefore the pipeline is working as close to theoretical as possible all the time.

In a short pipeline the pipeline is in use most of the time. There would be no point in trying to jam another packet in there; it would only delay the other packets. Additionally, a cache miss or a branch misprediction is MUCH MUCH less costly on a short pipeline.

If you understand HT this is common knowledge...

-Kevin

I seem to remember having this conversation with you a while back, Gamingphreek, but obviously met little success.

Intel's primary reason for introducing SMT on the P4 was not to cover up some design flaw, but to encourage software developers to start writing multithreaded code that would be able to take advantage of future dual and multicore processors.
If HT was designed to cover up Netburst's inefficiencies, how, then, would it help the P4 when executing single-threaded code?

Now I'm not willing to yet again discuss the difficulties superscalar processors face in executing instructions in parallel. Suffice it to say that a Hyperthreading-enabled P4 can extract greater ILP from multithreaded code, and hence improve its IPC when executing that code.
Pipeline depth has absolutely nothing to do with the suitability of a core for SMT.
In fact, the wider the core, the more suitable it is.
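
A toy model of the point about core width, sketched in Python. All numbers are invented for illustration, and `issued_per_cycle` is a hypothetical helper, not a model of any real core: each thread has a made-up count of independent instructions ready per cycle, and the core issues at most `width` of them. It also assumes a thread's ready count is unaffected by sharing the core, which real partitioned buffers would violate.

```python
def issued_per_cycle(ready_per_thread, width):
    """Instructions issued each cycle by a core with `width` issue slots.

    ready_per_thread holds one list per hardware thread, giving how many
    independent instructions that thread has ready in each cycle.
    """
    return [min(sum(cycle), width) for cycle in zip(*ready_per_thread)]


# Invented per-cycle ready counts for two threads.
thread_a = [1, 2, 0, 1, 3, 0]
thread_b = [2, 0, 1, 2, 1, 1]

for width in (2, 4):
    single = sum(issued_per_cycle([thread_a], width))
    smt = sum(issued_per_cycle([thread_a, thread_b], width))
    slots = width * len(thread_a)
    print(f"width {width}: single thread {single}/{slots} slots, SMT {smt}/{slots} slots")
```

With these made-up inputs the second thread fills 4 extra slots on the 2-wide core and 7 on the 4-wide core, since the wider core leaves more idle slots for it to use.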

The final advantage SMT offers is of course, multitasking.
By appearing as two logical processors, the OS can simultaneously send threads from two concurrent programs, making context-switching much more fluid. Without SMT, a processor has to continually fetch a different thread from memory each timeslice. When latency is considered, the advantage SMT offers in multitasking is obvious.

Viditor · Apr 15, 2006

Originally posted by: dmens

My understanding is that all the speculative threads are in flight in the backend portion of the core, and the memory system should not be aware of all the guessing. It'd be very difficult to recover state if incorrect speculation managed to leak to a cache, afaik.

Memory latency is mentioned in the paper because the mitosis idea masks latency by doing work first then confirming later, which is done now already but more conservatively. A mitosis machine wouldn't care much about slow memory, it'd happily process down long blocks as long as the machine has enough slice buffers, and if the dependency is not real, you get good speedup. Otherwise, as the paper said, it is no slower than what we have now, being a conservative compiler, and a hell of a lot of wasted power.

I guess I still don't understand then...I thought the idea was to utilize multiple cores for processing a single thread. I envisioned different cores processing the speculative threads and flagging them as such in the cache. That's why MOESI made sense to me (the "Owned" portion of the protocol seemed well suited to this task). Guess I'm going to have to study more...thanks for the reply!

Madwand1 · Apr 15, 2006

The original link was in Dutch, which is not supported by Google. I found the following article in French, and the Google translation is hilarious -- if any of you complain about offshore tech support ... just be grateful they're not Google bots.

http://translate.google.com/translate?u...=UTF-8&oe=UTF-8&prev=%2Flanguage_tools

At INTEL, medium term is not pinker forcing

(The original French article is linked within. I don't think it really has any new information, but I'm not sure I'd know.)

liebremx · Apr 15, 2006

Originally posted by: Viditor
I thought the idea was to utilize multiple cores for processing a single thread. I envisioned different cores processing the speculative threads and flagging them as such in the cache.

Actually there are still multiple threads in the Mitosis system, although in this case the threads are compiler-defined, not programmer-defined. The cores, still with some help from the compiler, do aggressive speculation regarding data dependencies between the threads. The aggressive speculation results in the cores executing, for example, two threads in parallel even if one of them has a data dependency on the other one. The dependent thread executes using a probably-correct data value, and then when the real data becomes available the executing core decides whether to commit or discard the results of the dependent thread.
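
A minimal sketch of that commit-or-discard step, in Python. It runs sequentially for clarity, whereas the description above has the two threads on different cores; `run_speculatively`, `producer`, `consumer`, and `predicted` are hypothetical names, not anything taken from Intel's paper.

```python
def run_speculatively(producer, consumer, predicted):
    """Run `consumer` on a predicted input, then validate against `producer`.

    Returns (result, outcome) where outcome is "commit" or "discard".
    """
    # The dependent thread starts early, using the predicted value.
    speculative_result = consumer(predicted)
    # The real value becomes available once the producing thread finishes.
    actual = producer()
    if actual == predicted:
        return speculative_result, "commit"
    # Wrong guess: throw the speculative work away and redo it.
    return consumer(actual), "discard"


print(run_speculatively(lambda: 4, lambda x: x * 10, predicted=4))
print(run_speculatively(lambda: 4, lambda x: x * 10, predicted=3))
```

Either way the final result is the same; a wrong prediction only costs the wasted work.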

And as for the cache, why would you want to flag the threads in the cache? I don't see why this would be needed.

🙂

Screech · Apr 15, 2006

Originally posted by: Madwand1
The original link was in Dutch, which is not supported by Google. I found the following article in French, and the Google translation is hilarious -- if any of you complain about offshore tech support ... just be grateful they're not Google bots.

http://translate.google.com/translate?u...=UTF-8&oe=UTF-8&prev=%2Flanguage_tools

At INTEL, medium term is not pinker forcing

(The original French article is linked within. I don't think it really has any new information, but I'm not sure I'd know.)

haha, that's awesome.....especially this:
"Conscious that K8 architecture could not compete with the next high-speed motorboat of INTEL,"...

IntelUser2000 · Apr 15, 2006

- Memory technology (because Rambus left the desktop memory market) was stagnant and could not provide the necessary bandwidth to feed the Pentium-4
- The shared FSB that could also not feed the Pentium-4.
- Too much leakage and bad design on Prescott that prevented high scaling

Oh right, then what was up with the 3.46GHz EE with 1066MHz FSB not more than 2% faster than its 800MHz brethren?? Shared FSB is only apparent for SMP systems!!!

The Itanium-2 Montecito has a mere 10 stage long pipeline, but also carries its own SMT.

That's not really a valid argument, as the Itanium 2 Montecito core has SoEMT, not SMT, which is primarily for hiding memory latency by switching threads when necessary. Itanium 2, by the way, has an 8-stage pipeline, not 10.

With longer pipelines (and higher clockspeeds), the latency introduced by HT is a relatively trivial amount and HT gains much more than it loses.
Contrarily, with a shorter pipeline, latency and efficiency are everything! The gains from SMT are far outweighed by the latency penalty.

Latency as in.... You are being very general here.

HT is not the godsend you are making it out to be. It does nothing, and even hinders performance on efficient processors (i.e., short instruction pipelines).

Oh really?? You would know it would hinder on short pipelined processors because...? The only other real SMT implementation is IBM's Power, which takes considerably more die size and is more advanced than Intel's HT. Add to that the fact that IBM's Power 5 is VERY wide, and they are not comparable.

HT IS the godsend in terms of efficiency. Contrary to what people are claiming, single thread performance was basically equal in most apps. Since it adds less than 5% to die size while giving a potential 30% and an average 5% gain, it is VERY efficient. Which other technology does that??

imported_goku · Apr 15, 2006

I hope this technology becomes optional in the sense that you can turn it off and on.. Because there are times when I'd like more powerful multitasking and other times when I'd like better single threaded performance.

xtknight · Apr 15, 2006

I hope you can tweak the time slicing and "multithreadatorization" or something. I love having things to tweak. 🙂

Fox5 · Apr 16, 2006

Originally posted by: IntelUser2000

- Memory technology (because Rambus left the desktop memory market) was stagnant and could not provide the necessary bandwidth to feed the Pentium-4
- The shared FSB that could also not feed the Pentium-4.
- Too much leakage and bad design on Prescott that prevented high scaling

Oh right, then what was up with the 3.46GHz EE with 1066MHz FSB not more than 2% faster than its 800MHz brethren?? Shared FSB is only apparent for SMP systems!!!

The Itanium-2 Montecito has a mere 10 stage long pipeline, but also carries its own SMT.

That's not really a valid argument as Itanium 2 Montecito core has SoEMT, not SMT, which is primarily for hiding memory latency, by switching threads when necessary. Itanium 2 by the way, has 8 stage pipeline not 10.

With longer pipelines (and higher clockspeeds), the latency introduced by HT is a relatively trivial amount and HT gains much more than it loses.
Contrarily, with a shorter pipeline, latency and efficiency are everything! The gains from SMT are far outweighed by the latency penalty.

Latency as in.... You are being very general here.

HT is not the god send you are making out to be. It does nothing, and even hinders performance on efficient processors (ie: Short Instruction Pipelines).

Oh really?? You would know it would hinder on short pipelined processors because...? Only other real SMT implementation is IBM's Power, which takes considerably more die size, and is more advanced than Intel's HT. Add to the fact that IBM's Power 5 is VERY wide, and they are not comparable.

HT IS the god send in terms of efficiency. Contrary to what people are claiming, single thread performance was basically equal in most apps. Since it only adds less than 5% to die size, giving potential 30%, and average 5%, is VERY efficient. Which other technology does that??

Are you sure about 5% die size only? I thought part of what made Prescott have so many more transistors than Northwood was that they really souped up its HT.

stevty2889 · Apr 16, 2006

Originally posted by: Fox5

Originally posted by: IntelUser2000

- Memory technology (because Rambus left the desktop memory market) was stagnant and could not provide the necessary bandwidth to feed the Pentium-4
- The shared FSB that could also not feed the Pentium-4.
- Too much leakage and bad design on Prescott that prevented high scaling

Oh right, then what was up with the 3.46GHz EE with 1066MHz FSB not more than 2% faster than its 800MHz brethren?? Shared FSB is only apparent for SMP systems!!!

The Itanium-2 Montecito has a mere 10 stage long pipeline, but also carries its own SMT.

That's not really a valid argument as Itanium 2 Montecito core has SoEMT, not SMT, which is primarily for hiding memory latency, by switching threads when necessary. Itanium 2 by the way, has 8 stage pipeline not 10.

With longer pipelines (and higher clockspeeds), the latency introduced by HT is a relatively trivial amount and HT gains much more than it loses.
Contrarily, with a shorter pipeline, latency and efficiency are everything! The gains from SMT are far outweighed by the latency penalty.

Latency as in.... You are being very general here.

HT is not the god send you are making out to be. It does nothing, and even hinders performance on efficient processors (ie: Short Instruction Pipelines).

Oh really?? You would know it would hinder on short pipelined processors because...? Only other real SMT implementation is IBM's Power, which takes considerably more die size, and is more advanced than Intel's HT. Add to the fact that IBM's Power 5 is VERY wide, and they are not comparable.

HT IS the god send in terms of efficiency. Contrary to what people are claiming, single thread performance was basically equal in most apps. Since it only adds less than 5% to die size, giving potential 30%, and average 5%, is VERY efficient. Which other technology does that??

Are you sure about 5% die size only? I thought part of what made Prescott have so many more transistors than Northwood was that they really souped up its HT.

5% die size is about accurate for hyperthreading in the Prescott. Prescott also had EM64T, Execute Disable, and virtualization built in since the beginning, but disabled, so that was a big part of the extra transistor count.

dmens · Apr 16, 2006

Most of the new logic in the Prescott core, minus the last level cache, is taken up by the extended replay window and, by extension, the longer global pipeline. SMT is pretty much the same as it was before; EM64T had a moderate impact (especially on the real register file and certain datapaths). NX bit is basically zilch, and VT1 is pretty light too.

DARQ MX · Apr 30, 2006

FreedomGUNDAM · Jul 21, 2006

With the advent of cheap dual-core AMD processors next week, has anyone heard any more on the reverse Hyper-Threading?

HopJokey · Jul 21, 2006

Originally posted by: FreedomGUNDAM
With the advent of cheap dual-core AMD processors next week, has anyone heard any more on the reverse Hyper-Threading?

Basically, RHT was just a rumor and is not real.
