Hyperthreading P4 vs P4 Dual Core


carlosd

Senior member
Aug 3, 2004
Originally posted by: intangir
Untrue. While the first wave of Intel's Next-Generation Microarchitecture (Merom/Conroe/Woodcrest) will not have SMT, the later cores will.

That will not be HT; HT will die with NetBurst.

Originally posted by: intangir
One: it is a myth that SMT is not worth it on shorter pipelines or single cores. The Alpha EV8 would have implemented 4-way simultaneous multithreading in a single core with a 9-stage pipeline. They estimated it would have doubled performance with a die-size increase of less than 10%.

The now-dead Alpha was quite a different architecture and cannot be directly compared to x86; I am talking specifically about x86 cores. EV8 was a RISC architecture, where instruction-level parallelism is much easier to reach. With a post-RISC design this goal is quite difficult, since the micro- or macro-ops are not directly under the control of software; they are up to the hardware and microcode decoders. In pure RISC architectures the instructions are directly under software control.



Originally posted by: intangir
Two: Proliferations of the Merom core *will* have SMT. It probably will not double the performance, but I know for a fact it will increase it significantly.

I don't think they will have the same kind of multithreading as HT; it would be SMT using multiple cores. Multiple cores + HT = no gains in performance, as you see with the 840XE CPU.

Originally posted by: intangir
Well, if AMD thought as you, I think Intel has nothing to fear for the next 3 years. You remind me of the people that claim register renaming helps register-starved CISC designs more than RISC, and so is a necessary added cost of designing CISC chips. Well, the fact is, any serious high-performance RISC design also implements register renaming, because the performance gain is worth it.

You are talking about a TOTALLY different issue.

Originally posted by: intangir

You could say the exact same thing about x86-64. In many cases, it bloats code and data sizes, decreasing cache effectiveness and slowing things down. Many benchmarks run slower in Windows x64 than 32-bit Windows XP. And the Linux applications I run (especially the lattice siever) are sensitive to codesize bloat, and actually run significantly faster in 32-bit mode.

I am not defending 64 bits. Yes, I agree with you that x86-64 at this moment doesn't give performance advantages, mainly because the software and hardware are still not optimized, but that has nothing to do with the HT discussion.

As I said, HT was only useful in the P4; with Intel's new designs, the same kind of multithreading will not have significant advantages.

 

Leper Messiah

Banned
Dec 13, 2004
By no means would I say that HT is useless with the NetBurst architecture, but every way it has been described to me is that it helps keep the long pipelines filled on the P4. Since AMD's pipelines are shorter, they don't take the same performance hit that Intel does when the pipelines are flushed. So AMD64 would see a performance increase, but it would be maybe half of what Intel sees (AMD64 is a 13-stage pipeline, IIRC). Granted, there are probably other ways of providing virtual cores that may be more efficient for a high-IPC design, but I'm not a chip engineer.
 

intangir

Member
Jun 13, 2005
Originally posted by: carlosd
Originally posted by: intangir
Untrue. While the first wave of Intel's Next-Generation Microarchitecture (Merom/Conroe/Woodcrest) will not have SMT, the later cores will.

That will not be HT; HT will die with NetBurst. HT uses coarse multithreading, which is not the same as SMT using multiple cores.

Now you're just abusing the terminology. HT is what Intel marketing calls SMT, nothing more and nothing less. When Intel's new cores get SMT, Intel will call it HT. If you're saying HT is SMT as applied to the Netburst core, well then yes, duh, your definition of HT will go away with Netburst. But that's tautological and pointless to say, so you can't really mean that, can you?

Also, HT is the property of a single core. SMT with multiple cores is just duplicating the SMT with a single core. There are no new challenges or features that make it different from a single-core implementation.

And finally, HT is fine-grained simultaneous multithreading. Don't you check your facts at all? Where do you get this misinformation?? Stop spreading it!! :confused: :disgust:

http://arstechnica.com/articles/paedia/cpu/hyperthreading.ars/4
"The scheduler has no idea that it's scheduling code from multiple threads. It simply looks at each instruction in the scheduling queue on a case-by-case basis, evaluates the instruction's dependencies, compares the instruction's needs to the physical processor's currently available execution resources, and then schedules the instruction for execution. To return to the example from our hyper-threading diagram, the scheduler may issue one red instruction and two yellow to the execution core on one cycle, and then three red and one yellow on the next cycle. So while the scheduling queue is itself aware of the differences between instructions from one thread and the other, the scheduler in pulling instructions from the queue sees the entire queue as holding a single instruction stream."

http://intel.com/design/pentium4/manuals/index_new.htm#aorm
"The core can dispatch up to six µops per cycle, provided the µops are
ready to execute. Once the µops are placed in the queues waiting for
execution, there is no distinction between instructions from the two
logical processors. The execution core and memory hierarchy is also
oblivious to which instructions belong to which logical processor."

Or if you're a little slow:
"Figure 1-6 shows a typical bus-based symmetric multiprocessor (SMP)
based on processors supporting Hyper-Threading Technology. Each
logical processor can execute a software thread, allowing a maximum of
two software threads to execute simultaneously on one physical
processor. The two software threads execute simultaneously, meaning
that in the same clock cycle an "add" operation from logical processor 0
and another "add" operation and load from logical processor 1 can be
executed simultaneously by the execution engine."
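The behavior those quotes describe, a scheduler that pulls ready µops from a shared queue without caring which logical processor they came from, can be sketched as a toy model in Python (hypothetical illustration code, not Intel's actual scheduler):

```python
# Toy model of an SMT issue scheduler: the scheduling queue holds uops
# tagged with a logical-processor id, but issue decisions look only at
# readiness, never at which thread a uop belongs to.
from collections import namedtuple

Uop = namedtuple("Uop", ["thread", "name", "ready"])

def issue_cycle(queue, width=6):
    """Issue up to `width` ready uops this cycle, thread-blind."""
    issued = [u for u in queue if u.ready][:width]
    for u in issued:
        queue.remove(u)
    return issued

queue = [
    Uop(0, "add", True), Uop(1, "load", True), Uop(0, "mul", False),
    Uop(1, "add", True), Uop(0, "store", True),
]
cycle1 = issue_cycle(queue)
# uops from both logical processors issue in the same cycle;
# the unready uop waits, regardless of which thread it came from
print([(u.thread, u.name) for u in cycle1])
```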


Originally posted by: carlosd
Originally posted by: intangir
One: it is a myth that SMT is not worth it on shorter pipelines or single cores. The Alpha EV8 would have implemented 4-way simultaneous multithreading in a single core with a 9-stage pipeline. They estimated it would have doubled performance with a die-size increase of less than 10%.

The now-dead Alpha was quite a different architecture and cannot be directly compared to x86; I am talking specifically about x86 cores. EV8 was a RISC architecture, where instruction-level parallelism is much easier to reach. With a post-RISC design this goal is quite difficult, since the micro- or macro-ops are not directly under the control of software; they are up to the hardware and microcode decoders. In pure RISC architectures the instructions are directly under software control.

OMG. This is hilarious. That was pure gobbledygook. It has become blindingly obvious you have no clue what you're talking about. Heck, I don't even know what you were trying to say, but I'll try to cover every point you might possibly have been making.

First of all, SMT does not care about instruction parallelism. It exploits thread-level parallelism.

Second, SMT does not rely on the software being aware of it to work. It relies on no software hints or support. All the software cares about is that its instructions get executed. The processor automatically takes care of all scheduling and updating of architectural state. With hyperthreading, it just fetches and schedules instructions from two processes at the same time. That's all the software stack has to know.

Third of all, there is no relevant difference between the Alpha's backend and the backend of modern x86 processors. They've pretty much converged. Once the decode and register allocation has been done, the backend just sees a single stream of instructions. The scheduler is free to make whatever scheduling choices it desires among the available execution units. All relevant data dependencies are taken care of by the register allocation. Whether these are RISC ops or x86 uops makes no difference whatsoever.

Originally posted by: carlosd
Originally posted by: intangir
Two: Proliferations of the Merom core *will* have SMT. It probably will not double the performance, but I know for a fact it will increase it significantly.

I don't think they will have the same kind of multithreading as HT; it would be SMT using multiple cores. Multiple cores + HT = no gains in performance, as you see with the 840XE CPU.

Again, whatever architectural form it takes, it will be called HT by Intel, and it has no relevant differences from SMT. And I don't have an 840XE so I can't test it, but I doubt in the extreme that HT gives it no gains. Dual-processor Xeons have benefited from HT indisputably, as running 4 threads gives more performance than 2. Why would dual-core chips be any different? And the benchmarks for dual-core Xeons with HT show obvious gains.

For a dual-processor dual-core Bensley (Netburst-based Xeon) system (4 cores, 8 threads):

http://www.realworldtech.com/page.cfm?ArticleID=RWT112905011743&p=4
"The Bensley system also scales perfectly to four physical processors (which is quite an achievement), and then gets a 35% boost from Hyper-Threading. At eight threads, the Bensley system executed the kernel 74% faster."

http://www.realworldtech.com/page.cfm?ArticleID=RWT112905011743&p=5
"The Nocona system scales by a factor of 2.44, and the Bensley system by a factor of 4.81."

That gives a 20% increase from hyperthreading.
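The arithmetic behind that 20% figure, from the quoted scaling factors: perfect scaling to 4 cores alone would be 4.0x, so the remainder of the 4.81x is attributable to Hyper-Threading.

```python
# Derive the Hyper-Threading gain from the quoted RWT scaling numbers.
cores = 4
bensley_scaling = 4.81      # 8 threads (4 cores x 2 with HT) vs. 1 thread
ht_gain = bensley_scaling / cores - 1.0
print(f"HT gain: {ht_gain:.0%}")   # roughly 20%
```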

And stop moving the goalposts. First you complained that HT was useless. Then you said it was only useful in multicore processors. Now you're saying it's only useful in single core processors. Make up your damn mind.

Originally posted by: carlosd
Originally posted by: intangir
Well, if AMD thought as you, I think Intel has nothing to fear for the next 3 years. You remind me of the people that claim register renaming helps register-starved CISC designs more than RISC, and so is a necessary added cost of designing CISC chips. Well, the fact is, any serious high-performance RISC design also implements register renaming, because the performance gain is worth it.

You are talking about a TOTALLY different issue.

That is what is known as an "analogy". We are debating the cost/benefit of a microarchitectural optimization. I was showing you the historic decisions made concerning an optimization with the same characteristics as HT.

Originally posted by: carlosd
As I said, HT was only useful in the P4; with Intel's new designs, the same kind of multithreading will not have significant advantages.

I showed you my numbers, you show me yours.
 

dmens

Platinum Member
Mar 18, 2005
Since AMD's pipelines are shorter, they don't take the same performance hit that Intel does when the pipelines are flushed.

Nukes are rare. Much more common are stalls, and SMT allows forward progress in those circumstances. Stall "bypassing" accounts for far more of the gain in SMT machines than nuke "bypassing" does. Pipeline length has no impact on this metric.
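dmens's point can be illustrated with a toy utilization model (the numbers are made up for illustration): a single thread leaves issue slots idle while it stalls on memory; a co-resident SMT thread fills some of those slots, and pipeline length never enters the arithmetic.

```python
# Toy model: fraction of issue slots used with and without SMT.
# A thread alternates compute cycles with memory-stall cycles; during a
# stall it issues nothing, but a second SMT thread can keep issuing.
def utilization(stall_fraction, smt=False):
    busy = 1.0 - stall_fraction
    if smt:
        # idealized: the second thread fills the first thread's stall
        # cycles whenever it is not stalled itself (no contention)
        busy += stall_fraction * (1.0 - stall_fraction)
    return busy

single = utilization(0.4)              # 60% of slots used
with_smt = utilization(0.4, smt=True)  # stalls partly hidden: 84%
```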
 

Viditor

Diamond Member
Oct 25, 1999
Originally posted by: BlvdKing
HT was useful on the P4 because it increased the efficiency of the long pipeline. Dual core is a better technology that wasn't possible in the days of .13-micron CPUs. The cost of a dual-core .13-micron CPU would be huge and the profit margin smaller, not to mention yields. HT was a good tradeoff that increased efficiency without a huge increase in die size.

QFT
I see that someone actually understands the true benefits of HT! It was an excellent addition for the Netburst architecture, but is not a very good design for more efficient uA.
If you increase the transistor count by 5%/core, you would also be increasing the power/heat level significantly for a very modest gain in performance...
This was acceptable in Netburst because it was already highly inefficient, and the pipelines needed feeding...not to mention that thermals weren't really a priority during its design.
I suspect that the Inq is incorrect about HT being included in Intel's NextGen chips.
 

dmens

Platinum Member
Mar 18, 2005
I fail to see how SMT is "less efficient". A perf return that exceeds its cost in dollars, die size, and power is efficient. SMT's average perf ROI is >1 on all three axes.
 

Acanthus

Lifer
Aug 28, 2001
Originally posted by: dmens
Notice how I've been saying that comparisons between DC and SMT are pointless due to vastly different hardware cost. My only goal has been to disprove the notion that SMT is a "joke", and frankly, your assertions are completely off-target:

1) "Not worth it". WRONG. 5% die area increase for >30% averaged wall clock gain on server workloads.
2) "Useless". WRONG. With the introduction of multicore, workloads will move to become even more threaded, and SMT will benefit.
3) "Outdated". WRONG. SMT is a feature now in Intel and IBM cores, both current and upcoming.
4) "Joke compared to dual cores". Not applicable due to investment difference.

Admit it, you just don't know jack. I'm done here, there is no point talking any more.

Not even owned, demolished :p
 

Viditor

Diamond Member
Oct 25, 1999
Originally posted by: dmens
I fail to see how SMT is "less efficient". A perf return that exceeds its cost in dollars, die size, and power is efficient. SMT's average perf ROI is >1 on all three axes.

You would have to show me how the perf return is >1 on all three axes for a shorter-pipelined uA...don't forget to add the yield factor into it.
The more transistors, the greater the chance of a flawed die. Most especially, the heat/power numbers would seem to me to be much less than 1 on an efficient uA.

As anecdotal evidence of this, please note that neither AMD nor the P-M utilizes HT...
AMD holds the patents on a good portion of HT (they developed many of the processes that Intel uses for HT), so they could certainly use it if they thought it would be an improvement...
 

dmens

Platinum Member
Mar 18, 2005
Current P-Ms don't have SMT for reasons not yet publicly disclosed. And I'm not going to talk about the future of SMT in Intel procs, since that is also confidential afaik, sorry.

As for the P4, the die size increase is about the same as the total transistor increase (~5%). I never saw any numbers regarding the power increase, since I wasn't around in the early Willamette days. Assuming that SMT's logic makeup is pretty similar to the rest of the chip (which is reasonable, since there aren't any large new structures introduced to handle SMT), the power increase would go up linearly with the other metrics.

Given the perf numbers seen all over this thread, for P4, SMT ROI >1 on above three axes.

Also see my above post regarding pipeline flush vs. stall and how it relates to SMT perf gain on shorter pipelines.
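The ROI claim can be written out explicitly. Using the ~20% wall-clock gain derived earlier in the thread and the ~5% cost figure given above (power assumed linear, as stated; all numbers are the thread's, not official):

```python
# ROI per axis = performance gain / cost increase on that axis.
perf_gain = 0.20                  # ~20% from the Bensley HT numbers
costs = {"die_area": 0.05,        # ~5% die size increase
         "transistors": 0.05,     # ~5% more transistors
         "power": 0.05}           # assumed linear with the above

roi = {axis: perf_gain / cost for axis, cost in costs.items()}
print(roi)   # well above 1 on every axis
```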
 

carlosd

Senior member
Aug 3, 2004
Originally posted by: intangir

OMG. This is hilarious. That was pure gobbledygook. It has become blindingly obvious you have no clue what you're talking about. Heck, I don't even know what you were trying to say, but I'll try to cover every point you might possibly have been making.

First of all, SMT does not care about instruction parallelism. It exploits thread-level parallelism.

Second, SMT does not rely on the software being aware of it to work. It relies on no software hints or support. All the software cares about is that its instructions get executed. The processor automatically takes care of all scheduling and updating of architectural state. With hyperthreading, it just fetches and schedules instructions from two processes at the same time. That's all the software stack has to know.

Third of all, there is no relevant difference between the Alpha's backend and the backend of modern x86 processors. They've pretty much converged. Once the decode and register allocation has been done, the backend just sees a single stream of instructions. The scheduler is free to make whatever scheduling choices it desires among the available execution units. All relevant data dependencies are taken care of by the register allocation. Whether these are RISC ops or x86 uops makes no difference whatsoever.

It seems you are the one who doesn't know what he is talking about. That's why I won't discuss this anymore; I could explain it all day and you wouldn't get it. The front end is very important. Look at IBM's SMT implementation, where many of the tasks HT does in hardware are done in software, reaching a higher level of efficiency on a shorter-pipeline, limited-resource CPU. With a very complex front end the CPU needs major resources; in the case of the P4, HT gives some improvement because of all the wasted resources.

Originally posted by: intangir
Again, whatever architectural form it takes, it will be called HT by Intel, and it has no relevant differences from SMT. And I don't have an 840XE so I can't test it, but I doubt in the extreme that HT gives it no gains. Dual-processor Xeons have benefited from HT indisputably, as running 4 threads gives more performance than 2. Why would dual-core chips be any different? And the benchmarks for dual-core Xeons with HT show obvious gains.
It gives no gains to the 840XE. Haven't you looked at the benchmarks? They have been around for a while!

Originally posted by: intangir
For a dual-processor dual-core Bensley (Netburst-based Xeon) system (4 cores, 8 threads):
http://www.realworldtech.com/page.cfm?ArticleID=RWT112905011743&p=4
"The Bensley system also scales perfectly to four physical processors (which is quite an achievement), and then gets a 35% boost from Hyper-Threading. At eight threads, the Bensley system executed the kernel 74% faster."

http://www.realworldtech.com/page.cfm?ArticleID=RWT112905011743&p=5
"The Nocona system scales by a factor of 2.44, and the Bensley system by a factor of 4.81."

That gives a 20% increase from hyperthreading.
But they still get ass-kicked by multi dual-core Opteron configs at a lower price while consuming less power!

Originally posted by: intangir
And stop moving the goalposts. First you complained that HT was useless. Then you said it was only useful in multicore processors. Now you're saying it's only useful in single core processors. Make up your damn mind.
I said HT was only useful for the NetBurst architecture. There are ways SMT (not specifically HT) would be useful for multicore CPUs; look at IBM's implementations. I see you don't read well.


Originally posted by: intangir

That is what is known as an "analogy". We are debating the cost/benefit of a microarchitectural optimization. I was showing you the historic decisions made concerning an optimization with the same characteristics as HT.
Bad analogy
 

carlosd

Senior member
Aug 3, 2004
Originally posted by: Viditor
Originally posted by: BlvdKing
HT was useful on the P4 because it increased the efficiency of the long pipeline. Dual core is a better technology that wasn't possible in the days of .13-micron CPUs. The cost of a dual-core .13-micron CPU would be huge and the profit margin smaller, not to mention yields. HT was a good tradeoff that increased efficiency without a huge increase in die size.

QFT
I see that someone actually understands the true benefits of HT! It was an excellent addition for the Netburst architecture, but is not a very good design for more efficient uA.
If you increase the transistor count by 5%/core, you would also be increasing the power/heat level significantly for a very modest gain in performance...
This was acceptable in Netburst because it was already highly inefficient, and the pipelines needed feeding...not to mention that thermals weren't really a priority during its design.
I suspect that the Inq is incorrect about HT being included in Intel's NextGen chips.

That is one of the things I was trying to say.
 

dmens

Platinum Member
Mar 18, 2005
So wrong...

1. "Shorter pipelines limited resources CPU": I'm having a hard time reading that, but if it means what I think it means, you should know that pipeline length isn't nearly as important as you think (try reading my post).

2. The only thing IBM's implementation has over Intel's is the addition of OS hints on thread priority, which is handled entirely by software on a P4 platform. So you got it backwards; IBM added hardware to handle this SW/HW interaction.

3. I am skeptical about the "efficiency" of this extra hardware. IMO the CPU shouldn't even have to worry about thread priority, because that is a higher level of abstraction. The only time the CPU should force a thread priority at the hardware level is in livelock/starvation scenarios, which is an entirely different issue. I can imagine tagging threads in flight with a priority and arbitrating in hardware, but with good software that can be avoided. But I will wait for real analysis before commenting further.

4. "Frontend... wasted resources". This is a real shocker. It is painfully obvious you have no idea what a frontend or backend is. What's this about P4's frontend and "wasted resources" and how SMT ties into this? I want to hear this.

5. Still carping about SMT vs. dual core? OMG still beating at that dead horse?

6. No gains on the 840XE? Nice blanket statement. You might want to look at all the benchmarks... since the 840 is similar to Nocona, and we already have numbers for that setup.

7. You don't know how SMT works, period. That article you linked describes new additions designed by IBM. The theory of SMT and its high-level implementation remain the same for all CPUs. I want you to tell me how Intel's and IBM's hardware implementations differ at a high level.

8. Still whining about SMT only useful for P4 uarch? Learn to read already.

9. How is it a bad analogy? ROI assessment is the most crucial factor in deciding whether to implement any kind of uarch feature. You know what that is, right?
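As a footnote to point 3: the tag-and-arbitrate idea described there can be sketched as follows (purely hypothetical, since the post itself is speculating about hardware that may not exist):

```python
# Toy arbiter: in-flight threads carry a priority tag, and the arbiter
# splits each cycle's issue slots between threads by priority weight
# instead of round-robin.
def arbitrate(slots, priorities):
    """Divide `slots` issue slots among threads, weighted by priority."""
    total = sum(priorities.values())
    return {t: round(slots * p / total) for t, p in priorities.items()}

# a high-priority OS thread gets twice the slots of a background thread
share = arbitrate(6, {"os_thread": 2, "background": 1})
print(share)   # {'os_thread': 4, 'background': 2}
```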