Modules are more effective than hyperthreading, right?

jhu · Oct 15, 2011

aigomorla said:
^ i stand corrected... i completely ignored the Godtanium's.. D:

And Atom too! hehe! Atoms have had hyperthreading since the beginning. It works pretty well too.

ocre · Oct 15, 2011

aigomorla said:
P3s were the mobile CPUs; what do you think the first Dothans and Yonahs were?

http://en.wikipedia.org/wiki/NetBurst_%28microarchitecture%29

"Despite these enhancements, the NetBurst architecture created obstacles for engineers trying to scale up its performance. With this microarchitecture, Intel looked to attain clock speeds of 10 GHz, but because of rising clock speeds, Intel faced increasing problems with keeping power dissipation within acceptable limits."

that 10ghz number i heard is where it was theoretically supposed to work.

And your last comment applies because cpu's started to have more cores.

LOL! That doesn't mean that hyperthreading on NetBurst needs to be at 10 GHz to be worth anything. I like how you tried to pull that, though.

As far as the Core 2 and the Pentium 3 go, there is a big gap in the middle you keep ignoring. It was the Pentium M and the original Core technology that were the direct basis of the Core 2 Duo. It did not go directly from Pentium 3 to Core 2 like you are saying. The Pentium M came from a path paved with a lot of advancements.

Intel did go backwards before they went forward, and it's my belief AMD must do something similar. Your focus on the Pentium 3 completely missed my reasoning, which is more about AMD's BD and less about the Pentium 3. But yes, if you connect the dots from where the Core 2 came from, the Pentium 3 is in there, but its direct architectural forefather is the Pentium M: an ultra-low-power path Intel worked on, which became extremely efficient and scaled like a dream. They used this as the basis of the Core 2; it wasn't just a die-shrunk Pentium 3! If that's all it was, then AMD would have had a much better hand.

aigomorla · Oct 16, 2011

ocre said:
LOL! That doesn't mean that hyperthreading on NetBurst needs to be at 10 GHz to be worth anything. I like how you tried to pull that, though.

This is what I heard from a lot of Intel people.

So unless you have some different views, share them.

Because if I was really off, I'm sure JHU would have corrected me, or IDC.

AtenRa · Oct 16, 2011

HT was OK on the P4 with Windows XP SP2 and multi-threaded apps. Windows 2000 had a lot of performance problems with it, and Intel themselves recommended disabling HT in Win/Server 2000.

HT was working fine and producing up to +30% even at 3GHz with P4s.

Nemesis 1 · Oct 16, 2011

aigomorla said:
P3s were the mobile CPUs; what do you think the first Dothans and Yonahs were?

Actually, the Israel team chose the P6 (Pentium Pro) core, based off of that failed Intel processor built around an on-die IGP and hypertransport. It also had the possibility of being used as a 2-core processor, since it had the link built in. We saw this with Intel's first true dual core. This was one of 2 reasons they chose this core, the other being that the chief engineer also worked on the failed on-die IGP processor. One of the engineers of the P6 is in a recently linked video in this forum. It's a really good talk he gives.

Also, before you go slapping P4C hyperthreading, I suggest you go back to the proper reviews where hyperthreading worked. You make it sound like it did nothing. There we were, here we are, and the next step is better yet.

aigomorla · Oct 16, 2011

regardless they skipped an entire generation of HT on all their processors so the ones which followed were optimized.

jhu · Oct 16, 2011

aigomorla said:
This is what I heard from a lot of Intel people.

So unless you have some different views, share them.

Because if I was really off, I'm sure JHU would have corrected me, or IDC.

Have some testing to do, will report back later.

ocre · Oct 16, 2011

aigomorla said:
regardless they skipped an entire generation of HT on all their processors so the ones which followed were optimized.

I pretty much explained how they skipped a whole generation earlier.......

So, the Israel team (who had just finished the Pentium M) went directly to work on the designs Conroe was born from (Core). The Pentium M sits on a direct path leading all the way back past the P3 to the Pentium Pro. Every step was pretty big. What made the Core 2 Duo great was its performance per watt, which is directly related to the Israel team's work on a super-efficient CPU. The Pentium M gets little notice these days, but it shares a lot with the Core 2 Duo. The Pentium 3 is several steps down the ladder.

This path did not have hyperthreading, but it wasn't that hyperthreading wasn't working. It's just that they sidestepped to an entirely different beast that worked very well even without it! In time Intel designed hyperthreading back into their CPUs, and we have it today.

Blastman · Oct 16, 2011

XtremeSystems ran some benches with 4M/4C and 4M/8C configurations (4 modules with one or two cores active per module). It's essentially like turning HT (hyperthreading) on/off.

Chess 11800/8813=1.3389 ?
Wprime 13.814/9.531=1.4494
Winrar 4467/3027=1.4757
3d06 5803/4134=1.4037
3dvantage 19215/12102=1.5878
3d11 6340/4289=1.4782
CB R10 20552/15033=1.3671
CB R11.5 6/3.8=1.5789
Blender 9.76/7.16=1.3631
X264 37.23/25.18=1.4786
Transcode (222+210)/(185+135)=1.35

and AMD gets 33-59% boost from CMT (cluster-based multithreading).

I ran a few of these benches on my i3-530 with HT on/off

Fritz

FX-8150

11,807 4M/8C 34.0% (faster)
8813 4M/4C

i3-530

5418 HT on 31.2%
4129 HT off

Cinebench 11.5

FX-8150

6.0 4M/8C 57.9%
3.8 4M/4C

i3-530

2.32 HT on 31.1%
1.77 HT off

wPrime 32M

FX-8150

9.531 4M/8C 44.9%
13.814 4M/4C

i3-530

19.281 HT on 33.1%
25.671 HT off
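The percentages above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, using the scores posted in this thread; `percent_gain` is a made-up helper name, and wPrime is treated as a time in seconds (lower is better), so its ratio is inverted.

```python
def percent_gain(on, off, higher_is_better=True):
    """Percent speedup of the 'on' result (HT/CMT enabled) over the 'off' result."""
    ratio = on / off if higher_is_better else off / on
    return (ratio - 1.0) * 100.0

# (label, result with second thread enabled, result without, higher_is_better)
results = [
    ("Fritz   FX-8150", 11807, 8813, True),
    ("Fritz   i3-530", 5418, 4129, True),
    ("CB 11.5 FX-8150", 6.0, 3.8, True),
    ("CB 11.5 i3-530", 2.32, 1.77, True),
    ("wPrime  FX-8150", 9.531, 13.814, False),  # seconds, lower is better
    ("wPrime  i3-530", 19.281, 25.671, False),  # seconds, lower is better
]

for label, on, off, hib in results:
    print(f"{label}: {percent_gain(on, off, hib):.1f}%")
```

Getting the direction right matters: dividing the wPrime times the wrong way round would show a slowdown instead of a gain.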

The problem is how much additional die size is traded for what performance gains. The addition of HT only adds about 5% additional resources to the i3, where AMD is adding what? 60-70% to the die size to get CMT? On the low end of the gains, like in Fritz, HT gains 31.2% compared to 34.0% for the FX -- so AMD has added a huge die size penalty to gain marginally more than Intel does from HT. While a lot of those benches gain more than 40% and as high as 59% with CMT on the FX, it's not enough to offset the die size hit the chip takes for the performance gain. If the FX could see at least a 60% gain across the board in multithreaded benches like Fritz, the FX would probably be looking like a pretty good chip right now.

One has to hope the bottleneck on the FX in multithreaded benches can be fixed and AMD can get consistently 60-80% from CMT. This could turn the FX into a fast competitive processor.

Idontcare · Oct 17, 2011

Blastman said:
and AMD gets 33-59% boost from CMT (chip level multithreading).

I ran a few of these benches on my i3-530 with HT on/off

Can you provide a one-line summary for what Intel gets from HT along the same lines of what you stated for AMD and CMT?

Blastman said:
The problem is how much additional die size is traded for what performance gains. The addition of HT only adds about 5% additional resources to the i3 where AMD is adding what ? 60-70% to the die size to get CMT?

AMD said the 2nd INT core present in a module occupies only 5% of area, not a 60-70% increase.

To do justice here you will need to make thread/mm^2 core (including L2$) comparisons for each architecture.

beginner99 · Oct 17, 2011

Idontcare said:
AMD said the 2nd INT core present in a module occupies only 5% of area, not a 60-70% increase.

I thought it was more like 12%. But that does not include the huge L2 cache (compared to SB). So if you include cache, HT seems to be a much better approach.
Never really checked, but isn't the huge L2 and L3 cache mainly responsible for the large die size? A shrunk 8-core Thuban would use less space...

Also, some cache benches show extremely poor performance (worse than Phenom) and abysmal compared to SB. Since I'm not an expert at all, I wonder if it isn't the slow cache causing the poor performance.

Blastman · Oct 17, 2011

Idontcare said:
Can you provide a one-line summary for what Intel gets from HT along the same lines of what you stated for AMD and CMT?

I only have 3 of the benches they ran on hand to run on my i3-530.

% gain with HT/CMT

           i3-530    FX-8150
Fritz       31.2      34.0
wPrime      33.1      44.9
CB 11.5     31.1      57.9

Idontcare said:
AMD said the 2nd INT core present in a module occupies only 5% of area, not a 60-70% increase.

I wasn't sure about the additional die size that the extra integer core adds to the FX -- I just threw out a number based on reading about Bulldozer. I checked, and according to an article on Bulldozer by AnandTech in 2009 …

AMD has come back to us with a clarification: the 5% figure was incorrect. AMD is now stating that the additional core in Bulldozer requires approximately an additional 50% die area. That's less than a complete doubling of die size for two cores, but still much more than something like Hyper Threading.

The 5% figure for HT on the i3 is from the Intel website.
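One rough way to put the gains and the area costs on the same scale is throughput per unit of area. The sketch below assumes the 5% (HT) and 50% (second Bulldozer integer core) figures quoted above apply to the same kind of "core area", which is not established in this thread; `throughput_per_area` is a hypothetical helper, and the gains are the three measured above.

```python
def throughput_per_area(gain_pct, extra_area_pct):
    """Relative throughput divided by relative area; 1.0 is the single-thread baseline."""
    return (1 + gain_pct / 100) / (1 + extra_area_pct / 100)

# Area costs as claimed in the posts above (assumed to be comparable, which they may not be).
HT_AREA_PCT = 5.0
CMT_AREA_PCT = 50.0

# benchmark -> (HT gain % on i3-530, CMT gain % on FX-8150), from the table above
gains = {"Fritz": (31.2, 34.0), "wPrime": (33.1, 44.9), "CB 11.5": (31.1, 57.9)}

for bench, (ht, cmt) in gains.items():
    print(f"{bench}: HT {throughput_per_area(ht, HT_AREA_PCT):.2f}  "
          f"CMT {throughput_per_area(cmt, CMT_AREA_PCT):.2f}")
```

A value above 1.0 means the second thread adds more throughput than it costs in area. The result swings a lot with the area figure you plug in, which is why the 5% vs 12% vs 50% question matters so much.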

Idontcare · Oct 17, 2011

Yeah it really seems the way to make these comparisons is 1 BD Module versus 1 Intel "core"...1 vs 2 threads on each...then compare the mm^2 of the cores (including L2$).

You can do the same with thuban (preferably Llano), but you'd compare 1core and 2core results.

Blastman · Oct 17, 2011

Idontcare said:
Yeah it really seems the way to make these comparisons is 1 BD Module versus 1 Intel "core"...1 vs 2 threads on each...then compare the mm^2 of the cores (including L2$).

I was thinking along the same lines. If we consider Nehalem

Anand said:
A single Nehalem core isn't made up of a majority of cache. Approximately 1/3 of the core is L1/L2 cache, 1/3 is the out of order execution engine and the remaining 1/3 is decode, the branch prediction logic, memory ordering and paging.

So adding 50% to the size of the execution engine would only add 1/3 x 0.5 = 16.7% to the die size of a Nehalem core, excluding L3 cache. Since AMD's Bulldozer has larger L2 caches, the size increase to the overall CPU core could be lower than 16.7%, depending on how and what you include in the measure.
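For anyone who wants to check the estimate: one third of the core growing by one half is one sixth of the core. A minimal sketch, assuming Anand's one-third split and that only the execution engine grows (`core_growth` is an illustrative name):

```python
from fractions import Fraction

def core_growth(engine_share, engine_growth):
    """Growth of the whole core when only the execution engine grows."""
    return engine_share * engine_growth

# Assumptions: engine is 1/3 of the core (Anand's estimate), engine grows by 50%.
growth = core_growth(Fraction(1, 3), Fraction(1, 2))
print(f"core grows by {growth} = {float(growth) * 100:.1f}%")
```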

Ancalagon44 · Oct 17, 2011

The advantage of hyperthreading is that it will never cost performance. There are no cases in which a Sandy Bridge without HT will outperform a SB with HT. However, depending on workload, a BD with odd numbered cores disabled can outperform a 2 module/4 core BD. That is why it is a terrible idea.

3DVagabond · Oct 17, 2011

There's going to be a bigger improvement going from 2 thread to 4 thread, than going from 4 thread to 8 thread. Performance doesn't scale linearly as you increase threads. If you go from 1 to 2 to 4 to 8... performance doesn't double each time.
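Amdahl's law is one simple model of this (my choice of model, not something the benchmarks above were fitted to). The sketch assumes a made-up 90% parallel fraction and ignores that HT/CMT threads share hardware, which makes real scaling worse still.

```python
def amdahl_speedup(threads, parallel_fraction):
    """Ideal speedup over one thread when only part of the work runs in parallel."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / threads)

P = 0.90  # hypothetical parallel fraction, chosen only for illustration

previous = amdahl_speedup(1, P)
for n in (2, 4, 8):
    current = amdahl_speedup(n, P)
    print(f"{n} threads: {current:.2f}x total, {current / previous:.2f}x over the previous step")
    previous = current
```

Each doubling of the thread count buys less than the one before it, even before any resource sharing is taken into account.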

AtenRa · Oct 17, 2011

Ancalagon44 said:
The advantage of hyperthreading is that it will never cost performance. There are no cases in which a Sandy Bridge without HT will outperform a SB with HT. However, depending on workload, a BD with odd numbered cores disabled can outperform a 2 module/4 core BD. That is why it is a terrible idea.

It is the same with HT: a 2C/4T SB will lose to a 4C/4T SB.

HT can cost performance in some circumstances too.

pantsaregood · Oct 17, 2011

Ancalagon44 said:
The advantage of hyperthreading is that it will never cost performance. There are no cases in which a Sandy Bridge without HT will outperform a SB with HT. However, depending on workload, a BD with odd numbered cores disabled can outperform a 2 module/4 core BD. That is why it is a terrible idea.

There are cases of HT lowering performance, actually. There are some isolated cases of it hurting performance enough for an i3-2100 to perform worse than a Pentium G840.

or whatever the highest pentium model is

soccerballtux · Oct 17, 2011

Blastman said:
xtremesystsems ran some benches with 4M/4C and 4M/8C configurations. It's essentially like turning HT (hyperthreading) on/off …

Chess 11800/8813=1.3389 ?
Wprime 13.814/9.531=1.4494
Winrar 4467/3027=1.4757
3d06 5803/4134=1.4037
3dvantage 19215/12102=1.5878
3d11 6340/4289=1.4782
CB R10 20552/15033=1.3671
CB R11.5 6/3.8=1.5789
Blender 9.76/7.16=1.3631
X264 37.23/25.18=1.4786
Transcode (222+210)/(185+135)=1.35

… and AMD gets 33-59% boost from CMT (cluster-based multithreading).

I ran a few of these benches on my i3-530 with HT on/off …

Fritz

FX-8150

11,807 … 4M/8C … 34.0% … (faster)
8813 … 4M/4C

i3-530

5418 … HT on … 31.2%
4129 … HT off

Cinebench 11.5

FX-8150

6.0 … 4M/8C … 57.9%
3.8 … 4M/4C

i3-530

2.32 … HT on …31.1%
1.77 … HT off

wPrime 32M

FX-8150

9.531 … 4M/8C … 44.9%
13.814 … 4M/4C

i3-530

19.281 … HT on … 33.1%
25.671 … HT off

The problem is how much additional die size is traded for what performance gains. The addition of HT only adds about 5% additional resources to the i3 where AMD is adding what … ? … 60-70% to the die size to get CMT? On the low end of the gains like in Fritz, HT gains 31.2% compared to 34.0% for the FX -- so AMD has added a huge die size penalty to gain marginally more than Intel does from HT. While a lot of those benches gain more than 40% and as high as 59% with CMT on the FX, it's not enough to offset the die size hit the chip takes for the performance gain. If the FX could see at least a 60% gain across the board in multithreaded benches like Fritz the FX would probably be looking like a pretty good chip right now.

One has to hope the bottleneck on the FX in multithreaded benches can be fixed and AMD can get consistently 60-80% from CMT. This could turn the FX into a fast competitive processor.

I don't understand why they are doing tests like this. This is wrong... We want to look at single-threaded performance, not multithreaded performance. Of course multithreaded performance is going to be better when one turns on the extra unused cores.

Rifter · Oct 17, 2011

I'm going to vote no, they are not as effective as HT, because SB manages to beat BD in pretty much all tests while using half the die space and less power.

Perhaps if AMD knew how to design a CPU core that was as efficient as SB, the whole module thing might work out better, I dunno. But as of right now it seems HT is more effective both in die space (and therefore cost to produce/sell) and power use.

Idontcare · Oct 17, 2011

Rifter said:
I'm going to vote no, they are not as effective as HT, because SB manages to beat BD in pretty much all tests while using half the die space and less power.

Perhaps if AMD knew how to design a CPU core that was as efficient as SB, the whole module thing might work out better, I dunno. But as of right now it seems HT is more effective both in die space (and therefore cost to produce/sell) and power use.

I think what we learned from this, as laymen, is that CMT is not an effective method of improving the performance of a core any more than hyperthreading is.

If the core (base) compute microarchitecture is weak (be it netburst/prescott or bulldozer/zambezi) then expanding the architecture in the direction of multithreading by way of CMT/SMT is essentially a fool's errand because you've merely diluted (shared resources) something that was already weak in the first place.

Making a weak core even weaker by forcing it to share resources is not going to result in stronger performance in a consistent, robust manner. There will be niche apps, corner cases, that can take advantage of it, but it hardly makes for a compelling argument that it is a good general-purpose processor.

Caza · Oct 17, 2011

I read that the extra core added about 50% more transistors to the module size. If they had gotten the 80% performance gain they anticipated it would have been worth it.

They're not able to feed the cores fully so in practice they're getting 30-59% which explains a lot. When it's at the 30% end it's just horrible. This also explains why they're so close to Thuban 6 core performance. [1.6x4 = 6.4]
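The bracketed figure is just modules times (1 + CMT gain). A quick sketch of that arithmetic across the range of gains mentioned in this thread; `effective_cores` is a made-up helper, and a "core-equivalent" here means one thread running alone on a Bulldozer module, which is only comparable to a Thuban core if per-thread performance is similar.

```python
def effective_cores(modules, cmt_gain_pct):
    """Throughput of `modules` two-thread modules, in units of one thread alone on a module."""
    return modules * (1 + cmt_gain_pct / 100)

# 30% and 59% are the observed range quoted above; 60% and 80% are the hoped-for figures.
for gain in (30, 59, 60, 80):
    print(f"{gain}% CMT gain -> {effective_cores(4, gain):.2f} core-equivalents")
```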

It also seems they threw too much cache in and the latency is not as good as older designs. More transistors equals more heat at a given voltage. A bunch of "minor" things adding up to tip the scales in a negative direction.

The good thing is that, assuming these deficiencies can be tweaked, Piledriver could come out 20% ahead.

Ferzerp · Oct 17, 2011

Idontcare said:
I think what we learned from this, as laymen, is that CMT is not an effective method of improving the performance of a core any more than hyperthreading is.

That depends on what your definition of effective is. A modulified core appears to provide more benefit to that core in overall throughput than an HT'ed core, but at a cost (in both single-threaded performance and transistor count) that makes that benefit of dubious overall merit. It *does* provide more of a benefit than HT, though. To make it attractive, it needs to not come at such a steep cost (as it appears even enabling the ability hamstrings the performance per thread, regardless of how many cores are actually loaded).

It makes one wonder why, after seeing the behavior, they did not just trash the 8 "core" part, separate the cores into true cores, and make an actual 6-core part as top of the line.

Vesku · Oct 17, 2011

Idontcare said:
I think what we learned from this, as laymen, is that CMT is not an effective method of improving the performance of a core any more than hyperthreading is.

If the core (base) compute microarchitecture is weak (be it netburst/prescott or bulldozer/zambezi) then expanding the architecture in the direction of multithreading by way of CMT/SMT is essentially a fool's errand because you've merely diluted (shared resources) something that was already weak in the first place.

Making a weak core even weaker by forcing it to share resources is not going to result in stronger performance in a consistent, robust manner. There will be niche apps, corner cases, that can take advantage of it, but it hardly makes for a compelling argument that it is a good general-purpose processor.

I was saying that pre-launch, actually, that if they couldn't hit close to Nehalem performance after CMT penalty then they wouldn't be near Sandybridge. The bright side of FX is that it seems their CMT design is pretty functional. Unfortunately the cores being CMTed are not what I or many were expecting.

Although, I would love to see a technical site more thoroughly examine the FX design. A lot to explore in terms of the front end, cache, deeper exploration of new instruction set performance, etc. Also, a closer look at bottleneck scenarios where the 8 series seems to not be running at full pace would be great.

ocre · Oct 17, 2011

Idontcare said:
AMD said the 2nd INT core present in a module occupies only 5% of area, not a 60-70% increase.

AMD would say something like this! What a claim! But it doesn't work like that. While hyperthreading could be added to Intel chips, AMD had to design an entirely new CPU for their CMT to be feasible. Their claim, 5% die space? Well, I don't know how that is a useful scale. Transistor count tells it all.

It's obvious: they have 2 billion transistors and an 8-core CPU. Thuban had 6 cores and was only 900 million (150 million per core). So comparing these designs, you can see the CMT design has an extremely large number of transistors. That's 250 million transistors per core for BD.
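The per-core numbers are a plain division of the transistor counts quoted in this post. A small sketch, taking those counts at face value (they are the post's figures, not verified here) and noting that caches and uncore get lumped into every "core":

```python
def transistors_per_core(total_transistors, cores):
    """Naive average; L2/L3 cache and uncore are counted as part of each core."""
    return total_transistors / cores

# Figures as quoted in the post above.
chips = {
    "Bulldozer (FX, 8 cores)": (2_000_000_000, 8),
    "Thuban (6 cores)": (900_000_000, 6),
}

for name, (total, cores) in chips.items():
    print(f"{name}: {transistors_per_core(total, cores) / 1e6:.0f} million per core")
```

Because cache is included, this says more about the whole chip than about what the second integer core in a module costs.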

This is just a rough comparison aimed at showing that AMD's claim of 5% die space for 80% gains is completely absurd! Why didn't they just slap their 5% companion CMT cores onto their existing designs, then? Because you can't do that. It's not possible. They had to completely design it from the ground up. As we can all see now, it's a huge freakin' chip design. 5%, lol!
