Multithreading with Intel and AMD


Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
Something is wrong with mine.

The 8-thread run reports a scaling factor of 16.7, but only 49% of the 1-thread cycle count?

i7-3630qm

Something seems wrong.

It is consistent
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
<image redacted> i7-3630qm

Something seems wrong.

It is consistent

I'd be willing to bet power-management is screwing up the 1-thread reading, resulting in way too many CPU cycles being logged to get the job done.

Then when subsequent runs are initiated the power-management gets out of the way, speed goes up but it seems like it is all because of 2-threads now so the scaling goes crazy high.

This can happen when core-parking is allowed. It is an issue with LinX too, even on desktops, when HT is enabled.
 

monstercameron

Diamond Member
Feb 12, 2013
3,818
1
0
Something is wrong with mine.

8 thread is 16.7

but only 49%
?

i7-3630qm

Something seems wrong.

It is consistent
Aren't mobile i7s 2x2 core/thread (2c2t?)? It seems that this also happened to the previous poster who ran the hex-core AMD, where the last two results were messed up, probably because it doesn't have 8 cores.
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
I'd be willing to bet power-management is screwing up the 1-thread reading, resulting in way too many CPU cycles being logged to get the job done.

Then when subsequent runs are initiated the power-management gets out of the way, speed goes up but it seems like it is all because of 2-threads now so the scaling goes crazy high.

This can happen when core-parking is allowed. It is an issue with LinX too, even on desktops, when HT is enabled.

You are right

On maximum performance mode.

hyperthreading.png


Would it be possible to edit that quote? I removed the image because there was some stuff showing up in the background that I would prefer not to show.

Edit: Thanks
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
arent mobile i7s 2x2 core/thread(2c2t?)? it seems that this also happened to the previous poster who put the hex core amd where the last two results were messed up probably because it doesn't have 8 cores.

Mobile QM chips are quad-cores.
 

mrle

Member
Mar 27, 2009
33
0
0
Thank you everyone for your results, especially JQuilty for the FX-8350 run :thumbsup:

It seems to me that these power management and turbo features influence results much more than anything else, so they should be taken with some reserve. Still, I think this is already an indication that this type of workload might be very well suited for AMD's CMT, and not so much for HT.
 

SlowSpyder

Lifer
Jan 12, 2005
17,305
1,002
126
Ran it on my Thuban just for fun:

1 thread(s) 2607641 cpu cycles (100 % of 1 thread). MT scaling factor: 1.00
2 thread(s) 2607693 cpu cycles (100 % of 1 thread). MT scaling factor: 2.00
3 thread(s) 2608526 cpu cycles (100 % of 1 thread). MT scaling factor: 3.00
4 thread(s) 2609853 cpu cycles (100 % of 1 thread). MT scaling factor: 4.00
5 thread(s) 2609338 cpu cycles (100 % of 1 thread). MT scaling factor: 5.00
6 thread(s) 2612983 cpu cycles (100 % of 1 thread). MT scaling factor: 5.99
7 thread(s) 5208636 cpu cycles (199 % of 1 thread). MT scaling factor: 3.50
8 thread(s) 5215254 cpu cycles (199 % of 1 thread). MT scaling factor: 4.00
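
For anyone wondering how the two derived columns relate to the raw cycle counts, here is a minimal sketch. The formulas are an assumption inferred from the numbers in this thread, not taken from the benchmark's source: the percentage looks like cycles(n) / cycles(1), truncated, and the scaling factor looks like n * cycles(1) / cycles(n), rounded to two decimals.

```python
# Cycle counts copied from the Thuban run above (threads -> cpu cycles).
cycles = {
    1: 2607641, 2: 2607693, 3: 2608526, 4: 2609853,
    5: 2609338, 6: 2612983, 7: 5208636, 8: 5215254,
}

def percent_of_one_thread(n):
    """Duration of the n-thread run relative to the 1-thread run, in percent."""
    return 100.0 * cycles[n] / cycles[1]

def mt_scaling(n):
    """Assumed formula: n work units done in cycles[n] versus n * cycles[1] serially."""
    return n * cycles[1] / cycles[n]

for n in sorted(cycles):
    # int() truncates, which is what the logged percentages appear to do.
    print(f"{n} thread(s): {int(percent_of_one_thread(n))} % of 1 thread, "
          f"MT scaling factor {mt_scaling(n):.2f}")
```

Under that assumption the drop at 7 and 8 threads is just the run taking about twice as long once there are more work units than cores.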
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
Ran it on my Thuban just for fun:

1 thread(s) 2607641 cpu cycles (100 % of 1 thread). MT scaling factor: 1.00
2 thread(s) 2607693 cpu cycles (100 % of 1 thread). MT scaling factor: 2.00
3 thread(s) 2608526 cpu cycles (100 % of 1 thread). MT scaling factor: 3.00
4 thread(s) 2609853 cpu cycles (100 % of 1 thread). MT scaling factor: 4.00
5 thread(s) 2609338 cpu cycles (100 % of 1 thread). MT scaling factor: 5.00
6 thread(s) 2612983 cpu cycles (100 % of 1 thread). MT scaling factor: 5.99

Got to love that near-perfect scaling through 6-cores on your hex-core :thumbsup:

Thank you everyone for your results, especially JQuilty for the FX-8350 run :thumbsup:

It seems to me that these power management and turbo features influence results much more than anything else, so they should be taken with some reserve. Still, I think this is already an indication that this type of workload might be very well suited for AMD's CMT, and not so much for HT.

Thread scaling is kinda like clockspeed (GHz): on its own it doesn't tell you much about performance.

You have to factor in IPC and clockspeed, along with thread-scaling performance, to determine whether the workload is well suited for AMD's CMT.

We see examples of this all over the place, Cinebench is a good example:

cinebench.gif


In Cinebench, AMD's CMT-based FX-8350 scales better than Intel's 3770k.

FX-8350 8-threads is 6.94/1.11 = 6.25x

i7-3770K 8-threads is 7.56/1.66 = 4.55x

Based on this you might be tempted to conclude this type of workload (Cinebench rendering) is well suited to AMD's CMT...but that's not true if you look at the actual performance numbers.

It takes more clockspeed, more power-consumption, and more cores (8 vs. 4) for the FX-8350 to turn in a score of 6.94pts. Meanwhile the Intel approach with fewer cores, and less power-consumption, turns in the higher score of 7.56pts.

(Now, one can argue that the FX-8350 is much cheaper, resulting in better performance/dollar than the 3770K, but that is not the same as discussing the merits of the microarchitecture. Superior performance/dollar is due to the sales and marketing team pricing the chips at a discount, not to the CPU design team designing a superior product.)
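
The arithmetic above can be sketched in a few lines, using only the Cinebench scores quoted in this post. The variable and function names are made up for illustration. It shows that the chip with the better scaling ratio is not the chip with the higher score.

```python
# Cinebench scores quoted in the post: (1-thread score, 8-thread score).
scores = {
    "FX-8350": (1.11, 6.94),
    "i7-3770K": (1.66, 7.56),
}

def thread_scaling(single, multi):
    """Multi-thread score divided by single-thread score."""
    return multi / single

for cpu, (single, multi) in scores.items():
    print(f"{cpu}: scaling {thread_scaling(single, multi):.2f}x, "
          f"8-thread score {multi:.2f} pts")

# The CPU that scales best and the CPU that scores best are different chips.
best_scaling = max(scores, key=lambda c: thread_scaling(*scores[c]))
best_score = max(scores, key=lambda c: scores[c][1])
print("best scaling:", best_scaling, "| best score:", best_score)
```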
 

BrightCandle

Diamond Member
Mar 15, 2007
4,762
0
76
One thing that continues to concern me is how much impact the power saving is having on what is quite obviously an "all resources are necessary" problem. The power saving doesn't just work as it's meant to; it seems to downclock and park cores at moments where the CPU appears to be fully utilised.

Does disabling parked cores make a difference or is it just about going to max performance that fixes the problem?

Since my main desktop is down I thought an old Core 2 Duo result might be of interest as well:

Max performance
Code:
1 thread(s) 2249908 cpu cycles (100 % of 1 thread). MT scaling factor: 1.00
2 thread(s) 2277602 cpu cycles (101 % of 1 thread). MT scaling factor: 1.98
3 thread(s) 4536493 cpu cycles (201 % of 1 thread). MT scaling factor: 1.49
4 thread(s) 4561711 cpu cycles (202 % of 1 thread). MT scaling factor: 1.97
5 thread(s) 6827625 cpu cycles (303 % of 1 thread). MT scaling factor: 1.65
6 thread(s) 6851558 cpu cycles (304 % of 1 thread). MT scaling factor: 1.97
7 thread(s) 9059221 cpu cycles (402 % of 1 thread). MT scaling factor: 1.74
8 thread(s) 9140645 cpu cycles (406 % of 1 thread). MT scaling factor: 1.97

Balanced
Code:
1 thread(s) 2255724 cpu cycles (100 % of 1 thread). MT scaling factor: 1.00
2 thread(s) 2278640 cpu cycles (101 % of 1 thread). MT scaling factor: 1.98
3 thread(s) 4556735 cpu cycles (202 % of 1 thread). MT scaling factor: 1.49
4 thread(s) 4591366 cpu cycles (203 % of 1 thread). MT scaling factor: 1.97
5 thread(s) 6808452 cpu cycles (301 % of 1 thread). MT scaling factor: 1.66
6 thread(s) 6829825 cpu cycles (302 % of 1 thread). MT scaling factor: 1.98
7 thread(s) 9079475 cpu cycles (402 % of 1 thread). MT scaling factor: 1.74
8 thread(s) 9124735 cpu cycles (404 % of 1 thread). MT scaling factor: 1.98

There is a small amount of negative scalability starting to kick in even at 4 threads, where the over-utilisation of the machine's resources is causing a slight drop in 1:1 performance.
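
The sawtooth in the scaling column is roughly what a very simple model predicts. This is a sketch under the assumption that every work unit takes the same time and that a machine with c cores runs them in ceil(n / c) back-to-back rounds. It ignores scheduling overhead, so the measured values sit slightly below it.

```python
import math

def ideal_scaling(work_units, cores):
    """Scaling if equal-sized work units run in whole rounds of `cores` at a time."""
    rounds = math.ceil(work_units / cores)
    return work_units / rounds

# "Max performance" scaling factors from the Core 2 Duo run above.
measured = {1: 1.00, 2: 1.98, 3: 1.49, 4: 1.97, 5: 1.65, 6: 1.97, 7: 1.74, 8: 1.97}

for n, m in measured.items():
    print(f"{n} thread(s): model {ideal_scaling(n, 2):.2f}, measured {m:.2f}")
```

The gap between model and measurement at 4, 6 and 8 threads (2.00 versus about 1.97) is the small loss described above.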
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
One thing that continues to concern me is how much impact the power saving is having on what is quite obviously an "all resources are necessary" problem. The power saving doesn't just work as it's meant to; it seems to downclock and park cores at moments where the CPU appears to be fully utilised.

Does disabling parked cores make a difference or is it just about going to max performance that fixes the problem?

You will observe the same issue with LinX thread-scaling as well.

When I want to investigate the capabilities of the microarchitecture I disable all power-features, all turbo features, and I disable core-parking.

Additionally I manually manage the thread-affinity to avoid thread-migration (undermines performance) as well as thread balancing across modules and cores.
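
As a rough illustration of pinning work to a fixed CPU, here is a minimal Python sketch. It is not what was used for these runs. It relies on `os.sched_setaffinity`, which only exists on Linux; on Windows a different mechanism would be needed, and the sketch simply runs unpinned there. The helper names `run_pinned` and `visible_cpus` are hypothetical.

```python
import os

def run_pinned(func, cpu=None):
    """Run func() with the current process restricted to one logical CPU, then restore."""
    if not hasattr(os, "sched_setaffinity"):
        return func()              # unsupported platform: run unpinned
    original = os.sched_getaffinity(0)
    target = min(original) if cpu is None else cpu
    try:
        os.sched_setaffinity(0, {target})
    except OSError:
        return func()              # pinning refused: run unpinned
    try:
        return func()
    finally:
        os.sched_setaffinity(0, original)   # undo the restriction

def visible_cpus():
    """Number of logical CPUs the process may currently run on."""
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count()

print("CPUs visible normally:    ", visible_cpus())
print("CPUs visible while pinned:", run_pinned(visible_cpus))
```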

And depending on the motherboard, sometimes I have to go in and manually tweak the BCLK to get it to be 100MHz (and FSB w/AMD) so the CPU's clockspeed is what it is supposed to be. Some mobo makers try to be sneaky by OC'ing the BCLK/FSB just a smidge so their mobo comes out ahead in benchmarks.
 

BrightCandle

Diamond Member
Mar 15, 2007
4,762
0
76
This should not be the case; we should consider it a bug and report it as such to Microsoft. It's not correct that the machine is parking itself and leaving large amounts of unused cycles when it's clearly fully utilised.
 

mrle

Member
Mar 27, 2009
33
0
0
Please note that this particular benchmark does not really represent any real-world application. It operates on only 1 MB of randomly generated data, which fits easily in L3 cache, and even in L2 cache on the FX series. The purpose was to see if HT/CMT cores can be used as pure integer cores without (much) penalty.

Actually, lately I have had quite a few multithreaded development projects and I am trying to decide on which platform to buy for my new workstation. I don't care that much about power consumption, IPC or performance, because it would be fully loaded only for very small fractions of time, so I think any new cpu would be good enough for me -- even my old workstation with Core 2-based dual core Xeon and a fast SSD is still mostly fine. What I need is a many core cpu to be able to simulate and debug concurrency and other threading effects in MT applications. And these are mostly server applications, not something anyone would want to run at home on commodity hardware anyway. I'll take a little more time to decide, but I think I will probably go for either a 4C8T E3-series Xeon or FX-8350.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
This should not be the case; we should consider it a bug and report it as such to Microsoft. It's not correct that the machine is parking itself and leaving large amounts of unused cycles when it's clearly fully utilised.

I agree.

At the same time we must also take stock of the fact that the web is littered with "disable core parking" programs because (1) core-parking is somewhat borked and lots of people have issues with it, and (2) Microsoft does not appear inclined to prioritize addressing it, given that enough time has passed for dozens of 3rd party "disable core parking" apps to have been created in the meantime.

I'm with you in being an optimist and thinking MS will address it eventually. At the same time I espouse the philosophy of being pragmatic about it and disabling core-parking altogether.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
Please note that this particular benchmark does not really represent any real world application, it operates on only 1 MB of random generated data, which fits easily in L3 cache, even L2 cache in FX series. The purpose was to see if HT/CMT cores can be used as pure integer cores without (much) penalty.

Actually, lately I have had quite a few multithreaded development projects and I am trying to decide on which platform to buy for my new workstation. I don't care that much about power consumption, IPC or performance, because it would be fully loaded only for very small fractions of time, so I think any new cpu would be good enough for me -- even my old workstation with Core 2-based dual core Xeon and a fast SSD is still mostly fine. What I need is a many core cpu to be able to simulate and debug concurrency and other threading effects in MT applications. And these are mostly server applications, not something anyone would want to run at home on commodity hardware anyway. I'll take a little more time to decide, but I think I will probably go for either a 4C8T E3-series Xeon or FX-8350.

Not that I am telling you anything you don't already know, but others reading this thread might not be aware: the challenge you are seeking to address is all the more complicated by non-CMP-based microarchitectures.

Both HT and CMT make such efforts microarchitecture-dependent, something which wasn't an issue just a couple of years ago when cores were cores and having more than one per socket simply meant you accounted for standard CMP effects in terms of cache and ram bandwidth.

I'm very curious to see how you manage the characterization of the compute topology in today's (and tomorrow's) CMT/HT type microarchitectures :thumbsup:
 

mrle

Member
Mar 27, 2009
33
0
0
From my experience, if you are not deliberately stressing a particular architecture, special CMT/HT optimizations should be left to the operating system. If you try to be smart and set thread affinities and the order in which cores should be loaded, you'd better have a good reason to do it, because not only will you make the scheduler's job more difficult, but you will also lose flexibility if the application is eventually migrated to a different/faster platform, or to virtual (cloud) infrastructure.

Let me illustrate my point with a screenshot of Intel's VTune analyzing execution of my benchmark on a dual-core machine:
6ilkUqA.png

I use ThreadPoolWork in the code, which is a nice abstraction over the plain thread scheduler with good performance on various underlying hardware. Basically, you give it "work" units and it assigns them to threads. You can see that it behaves exactly as expected for one and two concurrent work units (far left): it assigns them to separate threads and they run on the 2 available cores uninterrupted.

More than 2 is obviously suboptimal for a dual-core machine, but it is worth analyzing what is going on. For 8 threads, you might expect that it will just spawn 8 threads, assign a work unit to each one of them and let them fight it out for CPU resources. But something else is actually going on: there are only 4 threads and the work units are quite well balanced between them. This reduces context switches and gives an ideal scaling factor (for a dual-core machine) of almost 2 even with 8 concurrent work units, while still being able to use all 8 cores effectively on, for example, an FX-8350. I think that wouldn't be so easy to achieve with manual optimizations.
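
To make the "work units versus threads" distinction concrete, here is a small sketch using Python's `concurrent.futures.ThreadPoolExecutor` as a stand-in. It is not the Windows thread pool used by the benchmark, and the worker count is fixed by hand here (4, to mirror the screenshot) rather than chosen by the pool. The `work_unit` body is a placeholder sleep, not the real computation.

```python
import threading
import time
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def work_unit(unit_id):
    """Placeholder work unit: reports which pool thread executed it."""
    time.sleep(0.01)  # stand-in for the real computation
    return unit_id, threading.current_thread().name

def run_units(n_units, n_workers):
    """Submit n_units work units to a pool of at most n_workers threads."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(work_unit, range(n_units)))
    # Count how many work units each thread ended up running.
    return Counter(name for _, name in results)

per_thread = run_units(n_units=8, n_workers=4)
for name, count in sorted(per_thread.items()):
    print(f"{name} ran {count} work unit(s)")
```

Eight work units are submitted, but they are executed by at most four threads, so there are fewer threads competing for the cores than there are units of work.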
 

SPBHM

Diamond Member
Sep 12, 2012
5,066
418
126
1 thread(s) 2460357 cpu cycles (100 % of 1 thread). MT scaling factor: 1.00
2 thread(s) 2462251 cpu cycles (100 % of 1 thread). MT scaling factor: 2.00
3 thread(s) 2819033 cpu cycles (114 % of 1 thread). MT scaling factor: 2.62
4 thread(s) 2963254 cpu cycles (120 % of 1 thread). MT scaling factor: 3.32
5 thread(s) 5387725 cpu cycles (218 % of 1 thread). MT scaling factor: 2.28
6 thread(s) 5407131 cpu cycles (219 % of 1 thread). MT scaling factor: 2.73
7 thread(s) 5733985 cpu cycles (233 % of 1 thread). MT scaling factor: 3.00
8 thread(s) 5927344 cpu cycles (240 % of 1 thread). MT scaling factor: 3.32

Core i3-2100
 

Dinkydau

Member
Apr 1, 2012
50
5
71
AMD Opteron 6272 (8 modules, 16 integer cores)

Code:
1 thread(s) 3041928 cpu cycles (100 % of 1 thread). MT scaling factor: 1.00
2 thread(s) 3051962 cpu cycles (100 % of 1 thread). MT scaling factor: 1.99
3 thread(s) 3059131 cpu cycles (100 % of 1 thread). MT scaling factor: 2.98
4 thread(s) 3068054 cpu cycles (100 % of 1 thread). MT scaling factor: 3.97
5 thread(s) 3080683 cpu cycles (101 % of 1 thread). MT scaling factor: 4.94
6 thread(s) 3099179 cpu cycles (101 % of 1 thread). MT scaling factor: 5.89
7 thread(s) 3127752 cpu cycles (102 % of 1 thread). MT scaling factor: 6.81
8 thread(s) 3128746 cpu cycles (102 % of 1 thread). MT scaling factor: 7.78

Restricted to one node (4 modules, like the FX-8350)
Code:
1 thread(s) 3017522 cpu cycles (100 % of 1 thread). MT scaling factor: 1.00
2 thread(s) 3031920 cpu cycles (100 % of 1 thread). MT scaling factor: 1.99
3 thread(s) 3211052 cpu cycles (106 % of 1 thread). MT scaling factor: 2.82
4 thread(s) 3231453 cpu cycles (107 % of 1 thread). MT scaling factor: 3.74
5 thread(s) 3237524 cpu cycles (107 % of 1 thread). MT scaling factor: 4.66
6 thread(s) 3244135 cpu cycles (107 % of 1 thread). MT scaling factor: 5.58
7 thread(s) 3265295 cpu cycles (108 % of 1 thread). MT scaling factor: 6.47
8 thread(s) 3286271 cpu cycles (108 % of 1 thread). MT scaling factor: 7.35
 

lakedude

Platinum Member
Mar 14, 2009
2,778
528
126
So it is really no surprise that AMD scales better, after all the 8350 has 8 real integer cores so you should expect nearly 8x the performance.

HT is just using 4 cores with the extra threads beyond 4 taking up otherwise wasted cycles. No way anyone actually expects perfect scaling from HT.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
So it is really no surprise that AMD scales better, after all the 8350 has 8 real integer cores so you should expect nearly 8x the performance.

HT is just using 4 cores with the extra threads beyond 4 taking up otherwise wasted cycles. No way anyone actually expects perfect scaling from HT.

Tailoring the workload to avoid the bottlenecks in the microarchitecture does that, true.

I suppose one could perform some deep interrogation of the inner-workings of HT, figure out how to maximize its relative performance and create a workload that sees nearly 100% speedup with HT as well. I imagine the code would have to involve something that causes a ridiculous amount of stalls in the pipeline.
 

mrle

Member
Mar 27, 2009
33
0
0
Yeah, this particular workload might be tailored for scaling, but another thing is also interesting. For sure, power management and turbo might have influenced the single-thread cycle counts, but they should be at least somewhat comparable for IPC assessment: on Intel CPUs the TSC runs at BCLK x base multiplier (tested 3400 MHz on a stock i7-2600K), and on an FX CPU it runs at P0 speed (the highest non-turbo P-state, according to the information available in AMD docs). So it is in fact the base clock frequency for both.

Now, Intel cpus I tried need around 2.1 - 2.2 million cycles for 1 MB of data, which gives around 2 clocks per byte, about right for the algorithm used. But here I see reports of Phenoms needing 2.6 M, and FX even more, 2.8 - 3.0 M. That is a big difference clock-for-clock, more than enough to offset 4C/8T Intel HT-scaling handicap vs 8C/8T AMD.
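
The clocks-per-byte figures follow directly from the cycle counts quoted above. A minimal sketch, assuming "1 MB" means 2**20 bytes and using the rounded ranges from this post rather than any single logged run:

```python
DATA_BYTES = 1 << 20  # assumption: "1 MB" of input means 2**20 bytes

# Approximate 1-thread cycle counts quoted in the post.
quoted_cycles = {
    "Intel (low)": 2.1e6,
    "Intel (high)": 2.2e6,
    "Phenom": 2.6e6,
    "FX (low)": 2.8e6,
    "FX (high)": 3.0e6,
}

def cycles_per_byte(cycles):
    """Base-clock cycles spent per input byte."""
    return cycles / DATA_BYTES

for name, cycles in quoted_cycles.items():
    print(f"{name}: {cycles_per_byte(cycles):.2f} clocks per byte")

# Clock-for-clock gap between the slowest Intel and fastest FX figures.
print(f"FX needs at least {2.8e6 / 2.2e6:.2f}x the cycles of Intel")
```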
 

shady28

Platinum Member
Apr 11, 2004
2,520
397
126
Just for reference, this is for my (new) FX-8320, all *stock* 3.5 GHz frequencies, nothing optimized.

I ran this 5 times and, while some of the intermediate numbers varied a bit, the end 8 thread number was always within +/- 0.03 of the results below

(FX-8320 is 3.5 GHz stock vs FX-8350 @ 4 GHz)


1 thread(s) 2550540 cpu cycles (100 % of 1 thread). MT scaling factor: 1.00
2 thread(s) 2796467 cpu cycles (109 % of 1 thread). MT scaling factor: 1.82
3 thread(s) 2727673 cpu cycles (106 % of 1 thread). MT scaling factor: 2.81
4 thread(s) 2711951 cpu cycles (106 % of 1 thread). MT scaling factor: 3.76
5 thread(s) 2894195 cpu cycles (113 % of 1 thread). MT scaling factor: 4.41
6 thread(s) 3029573 cpu cycles (118 % of 1 thread). MT scaling factor: 5.05
7 thread(s) 3065688 cpu cycles (120 % of 1 thread). MT scaling factor: 5.82
8 thread(s) 3151845 cpu cycles (123 % of 1 thread). MT scaling factor: 6.47
 

mrle

Member
Mar 27, 2009
33
0
0
Just for reference, this is for my (new) FX-8320, all *stock* 3.5Ghz frequencies, nothing optimized.

If it's not too much trouble, could you please run it with turbo-core disabled? You should see much more linear scaling, because in your case turbo significantly affects the 1-thread result, which is the baseline for all other results.

I ran this 5 times and, while some of the intermediate numbers varied a bit, the end 8 thread number was always within +/- 0.03 of the results below

Intermediate results may vary because they depend on the scheduler assigning work units to threads. But at least the 1-thread and 8-thread results should be constant, as you noticed.
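
Here is a toy model of why turbo on the 1-thread run depresses every later scaling factor. It assumes the cycle counter ticks at a fixed base frequency while the core itself may run faster. The 3.5 GHz and 4.0 GHz figures and the 2,800,000-cycle workload are illustrative assumptions, not measurements from this run.

```python
def counter_ticks(core_cycles, counter_ghz, core_ghz):
    """Ticks of a fixed-rate counter while the core executes core_cycles."""
    return core_cycles * counter_ghz / core_ghz

def apparent_scaling(n_threads, counter_ghz, one_thread_ghz, all_thread_ghz,
                     core_cycles=2_800_000):
    """Scaling factor as the benchmark would report it, assuming perfect true scaling."""
    baseline = counter_ticks(core_cycles, counter_ghz, one_thread_ghz)
    loaded = counter_ticks(core_cycles, counter_ghz, all_thread_ghz)
    return n_threads * baseline / loaded

# Hypothetical: counter at 3.5 GHz, 1-thread run boosted to 4.0 GHz,
# 8-thread run held at 3.5 GHz.
print(f"with turbo on the baseline: {apparent_scaling(8, 3.5, 4.0, 3.5):.2f}")
print(f"with turbo disabled:        {apparent_scaling(8, 3.5, 3.5, 3.5):.2f}")
```

In this model a chip with perfect 8x scaling would still report about 7x, purely because the baseline run was clocked higher than the loaded run.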
 

Rainer

Junior Member
Mar 14, 2013
14
0
0
It is interesting how right Idontcare is :) The Power Management really distorts the results, as the scaling factor is more or less all over the place:

F:\Temp>crcbench
1 thread(s) 2592039 cpu cycles (100 % of 1 thread). MT scaling factor: 1.00
2 thread(s) 2592619 cpu cycles (100 % of 1 thread). MT scaling factor: 2.00
3 thread(s) 3165399 cpu cycles (122 % of 1 thread). MT scaling factor: 2.46
4 thread(s) 3281959 cpu cycles (126 % of 1 thread). MT scaling factor: 3.16
5 thread(s) 3145442 cpu cycles (121 % of 1 thread). MT scaling factor: 4.12
6 thread(s) 2914909 cpu cycles (112 % of 1 thread). MT scaling factor: 5.34
7 thread(s) 2866660 cpu cycles (110 % of 1 thread). MT scaling factor: 6.33
8 thread(s) 2610735 cpu cycles (100 % of 1 thread). MT scaling factor: 7.94

F:\Temp>crcbench
1 thread(s) 2587300 cpu cycles (100 % of 1 thread). MT scaling factor: 1.00
2 thread(s) 2591936 cpu cycles (100 % of 1 thread). MT scaling factor: 2.00
3 thread(s) 2966658 cpu cycles (114 % of 1 thread). MT scaling factor: 2.62
4 thread(s) 3318273 cpu cycles (128 % of 1 thread). MT scaling factor: 3.12
5 thread(s) 3311967 cpu cycles (128 % of 1 thread). MT scaling factor: 3.91
6 thread(s) 2997183 cpu cycles (115 % of 1 thread). MT scaling factor: 5.18
7 thread(s) 2793220 cpu cycles (107 % of 1 thread). MT scaling factor: 6.48
8 thread(s) 2605273 cpu cycles (100 % of 1 thread). MT scaling factor: 7.94
Now two test runs with Power Management set to "CPU always 100%"

F:\Temp>crcbench
1 thread(s) 2591451 cpu cycles (100 % of 1 thread). MT scaling factor: 1.00
2 thread(s) 2592157 cpu cycles (100 % of 1 thread). MT scaling factor: 2.00
3 thread(s) 2592291 cpu cycles (100 % of 1 thread). MT scaling factor: 3.00
4 thread(s) 2593343 cpu cycles (100 % of 1 thread). MT scaling factor: 4.00
5 thread(s) 2595047 cpu cycles (100 % of 1 thread). MT scaling factor: 4.99
6 thread(s) 2594137 cpu cycles (100 % of 1 thread). MT scaling factor: 5.99
7 thread(s) 2598689 cpu cycles (100 % of 1 thread). MT scaling factor: 6.98
8 thread(s) 2599617 cpu cycles (100 % of 1 thread). MT scaling factor: 7.97

F:\Temp>crcbench
1 thread(s) 2591609 cpu cycles (100 % of 1 thread). MT scaling factor: 1.00
2 thread(s) 2592904 cpu cycles (100 % of 1 thread). MT scaling factor: 2.00
3 thread(s) 2591829 cpu cycles (100 % of 1 thread). MT scaling factor: 3.00
4 thread(s) 2592925 cpu cycles (100 % of 1 thread). MT scaling factor: 4.00
5 thread(s) 2593471 cpu cycles (100 % of 1 thread). MT scaling factor: 5.00
6 thread(s) 2594952 cpu cycles (100 % of 1 thread). MT scaling factor: 5.99
7 thread(s) 2598849 cpu cycles (100 % of 1 thread). MT scaling factor: 6.98
8 thread(s) 2596708 cpu cycles (100 % of 1 thread). MT scaling factor: 7.98

F:\Temp>
CPU is an Opteron 6128.
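
To put a number on "all over the place", one option is to parse the output and look at per-thread efficiency (scaling factor divided by thread count). A minimal sketch, assuming the line format shown above, fed with the first default run and the first "CPU always 100%" run from this post:

```python
import re

# Matches lines such as:
# "3 thread(s) 3165399 cpu cycles (122 % of 1 thread). MT scaling factor: 2.46"
LINE = re.compile(
    r"(\d+) thread\(s\) (\d+) cpu cycles \((\d+) % of 1 thread\)\. "
    r"MT scaling factor: ([\d.]+)"
)

def parse(log):
    """Return a list of (threads, cycles, scaling) tuples from benchmark output."""
    return [(int(m[1]), int(m[2]), float(m[4])) for m in LINE.finditer(log)]

def worst_efficiency(rows):
    """Lowest scaling-per-thread across the run (1.0 would be perfect)."""
    return min(scaling / n for n, _, scaling in rows)

default_run = """
1 thread(s) 2592039 cpu cycles (100 % of 1 thread). MT scaling factor: 1.00
2 thread(s) 2592619 cpu cycles (100 % of 1 thread). MT scaling factor: 2.00
3 thread(s) 3165399 cpu cycles (122 % of 1 thread). MT scaling factor: 2.46
4 thread(s) 3281959 cpu cycles (126 % of 1 thread). MT scaling factor: 3.16
5 thread(s) 3145442 cpu cycles (121 % of 1 thread). MT scaling factor: 4.12
6 thread(s) 2914909 cpu cycles (112 % of 1 thread). MT scaling factor: 5.34
7 thread(s) 2866660 cpu cycles (110 % of 1 thread). MT scaling factor: 6.33
8 thread(s) 2610735 cpu cycles (100 % of 1 thread). MT scaling factor: 7.94
"""

full_power_run = """
1 thread(s) 2591451 cpu cycles (100 % of 1 thread). MT scaling factor: 1.00
2 thread(s) 2592157 cpu cycles (100 % of 1 thread). MT scaling factor: 2.00
3 thread(s) 2592291 cpu cycles (100 % of 1 thread). MT scaling factor: 3.00
4 thread(s) 2593343 cpu cycles (100 % of 1 thread). MT scaling factor: 4.00
5 thread(s) 2595047 cpu cycles (100 % of 1 thread). MT scaling factor: 4.99
6 thread(s) 2594137 cpu cycles (100 % of 1 thread). MT scaling factor: 5.99
7 thread(s) 2598689 cpu cycles (100 % of 1 thread). MT scaling factor: 6.98
8 thread(s) 2599617 cpu cycles (100 % of 1 thread). MT scaling factor: 7.97
"""

for label, log in (("default", default_run), ("always 100%", full_power_run)):
    print(f"{label}: worst per-thread efficiency {worst_efficiency(parse(log)):.3f}")
```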
 