Multithreading with Intel and AMD

mrle

Member
Mar 27, 2009
33
0
0
I have been searching around, but I couldn't find any good thread-scaling comparisons of Intel and AMD CPUs with pure integer workloads. Hope some of you guys can help me out here.

The problem is, at the moment I have a task that can be easily run on multiple threads - basically it's error checking and statistical analysis of some binary data, so only integer load, without any FPU or special SIMD instructions. I want to run it on 8 threads if possible, so I created a small benchmark for an intensive section of the code (CRC calculation) to see if it can benefit from HT. Here's the result on an Intel i7-2600K:

Code:
1 thread(s) 2199714 cpu cycles (100 % of 1 thread). MT scaling factor: 1.00
2 thread(s) 2203655 cpu cycles (100 % of 1 thread). MT scaling factor: 2.00
3 thread(s) 2376672 cpu cycles (108 % of 1 thread). MT scaling factor: 2.78
4 thread(s) 2570756 cpu cycles (116 % of 1 thread). MT scaling factor: 3.42
5 thread(s) 2603060 cpu cycles (118 % of 1 thread). MT scaling factor: 4.23
6 thread(s) 2666084 cpu cycles (121 % of 1 thread). MT scaling factor: 4.95
7 thread(s) 2701991 cpu cycles (122 % of 1 thread). MT scaling factor: 5.70
8 thread(s) 3080511 cpu cycles (140 % of 1 thread). MT scaling factor: 5.71

What the numbers above mean is that processing 8 threads in parallel takes about 40% more time than with only one thread, but the amount of data processed is of course 8 times as much, which gives a 5.71 scaling factor.
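For anyone who wants to check the columns by hand, here is a minimal sketch of the arithmetic, assuming the scaling factor is simply threads x (1-thread cycles) / (n-thread cycles). The function name `mt_scaling` is made up for illustration and is not taken from the benchmark source; the percentage is truncated rather than rounded because that appears to match the printed output.

```python
def mt_scaling(threads: int, cycles_one: int, cycles_n: int) -> tuple[int, float]:
    """Return (percent of the 1-thread time, MT scaling factor)."""
    # Truncated percentage, which appears to match the benchmark's printout.
    percent = 100 * cycles_n // cycles_one
    # n threads process n times the data in cycles_n instead of cycles_one.
    factor = round(threads * cycles_one / cycles_n, 2)
    return percent, factor


# Values taken from the 8-thread line of the i7-2600K run above.
print(mt_scaling(8, 2199714, 3080511))  # (140, 5.71)
```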

I'd like to know if AMD CPUs can do better in this scenario. Could someone with an FX-8350 please download and run http://www.sendspace.com/file/6cuvcq and post the output here? It's a tiny console Win32 app that doesn't read, write or connect to anything; it just runs threads and prints output like the one above. Also feel free to test with any other CPU. Thanks.
 

guskline

Diamond Member
Apr 17, 2006
5,338
476
126
I feel like a fool asking this, but I finally downloaded it (it was a bear downloading the supporting software). When I click on the file to run it, it runs quickly and disappears. Walk me through the steps to keep the results up on my screen. BTW, I ran it on my 8350 rig below.
 

Maximilian

Lifer
Feb 8, 2004
12,604
15
81
Hmm looks interesting!

i7 3930k
1 thread(s) 2123115 cpu cycles (100 % of 1 thread). MT scaling factor: 1.00
2 thread(s) 2168113 cpu cycles (102 % of 1 thread). MT scaling factor: 1.96
3 thread(s) 2204964 cpu cycles (103 % of 1 thread). MT scaling factor: 2.89
4 thread(s) 2248677 cpu cycles (105 % of 1 thread). MT scaling factor: 3.78
5 thread(s) 2251502 cpu cycles (106 % of 1 thread). MT scaling factor: 4.71
6 thread(s) 2260676 cpu cycles (106 % of 1 thread). MT scaling factor: 5.63
7 thread(s) 2669511 cpu cycles (125 % of 1 thread). MT scaling factor: 5.57
8 thread(s) 2647994 cpu cycles (124 % of 1 thread). MT scaling factor: 6.41

i5 2400s
1 thread(s) 1893164 cpu cycles (100 % of 1 thread). MT scaling factor: 1.00
2 thread(s) 1997567 cpu cycles (105 % of 1 thread). MT scaling factor: 1.90
3 thread(s) 2313105 cpu cycles (122 % of 1 thread). MT scaling factor: 2.46
4 thread(s) 2368806 cpu cycles (125 % of 1 thread). MT scaling factor: 3.20
5 thread(s) 4530681 cpu cycles (239 % of 1 thread). MT scaling factor: 2.09
6 thread(s) 4559187 cpu cycles (240 % of 1 thread). MT scaling factor: 2.49
7 thread(s) 4674793 cpu cycles (246 % of 1 thread). MT scaling factor: 2.83
8 thread(s) 4738848 cpu cycles (250 % of 1 thread). MT scaling factor: 3.20
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
Cool program and concept mrle!

A couple of bits of advice though, based on what I learned while collecting the data presented in this thread: when it comes to thread scaling you really have to proactively manage the core loading in terms of balancing physical vs. logical cores (Intel w/HT) and modules vs. cores (AMD).

If you don't do that proactively, and just leave it to the Windows scheduler, then the thread-scaling results will vary from user to user, because the load balancing will cause performance penalties (unlocked thread affinities, cache thrashing, etc.) as well as resource contention on given cores and modules.

And last but not least is power-management and turbo-boost or turbo-core. If that is not disabled then your measured (apparent) thread scaling will be much lower than it is in reality because the lower-thread count runs will process at much higher core clockspeeds than those of the larger thread count runs.
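To put a rough number on the turbo effect, here is a small sketch. It assumes throughput is proportional to core clock, and the clock speeds in the example are hypothetical, not measured on any of the CPUs in this thread.

```python
def apparent_scaling(true_scaling: float,
                     clock_single_ghz: float,
                     clock_multi_ghz: float) -> float:
    """Scaling factor you would measure by time if the clock drops under load.

    Assumes throughput is proportional to clock speed (a simplification).
    """
    return true_scaling * clock_multi_ghz / clock_single_ghz


# Hypothetical clocks: 3.8 GHz with one core loaded, 3.5 GHz with four loaded.
# A perfectly scaling 4-thread run would then look like roughly 3.68x.
print(round(apparent_scaling(4.0, 3.8, 3.5), 2))
```

With turbo disabled both clocks are equal and the measured factor matches the true one.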

Here are a few examples taken from that linked thread above:

ThreadsvsGFlopsat4GHz.png


Here are the Cinebench results:

ThreadsvsCineBenchat4GHz.png


Cinebench scales well with Piledriver, but the CMT tax is right near the expected 20%.

CineBenchCMTTaxat4GHz.png


Comparing the M1/M2 test versus the M1 C0 test, the thread scaling is a nearly perfect 2:1. M1/M2 = 1 thread per module

Compare that to the M1-M4 test with four threads, thread scaling is again a nearly perfect 2:1. M1-M4 = 1 thread per module

Likewise, if we look at the C0/C1 test versus the M1 C0 test, the CMT tax is ~18%. That is pretty close to the 0.8x scaling that AMD mentioned would be the trade-off with their specific implementation of CMT on Bulldozer. C0/C1 = 2 threads per module

The CMT tax is consistent: extending the calculation from 2 threads to 4 threads, the C0-C3 test shows the same 0.82x scaling relative to the M1-M4 test. C0-C3 = 2 threads per module.
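As a back-of-the-envelope sketch of what that tax would imply for total thread scaling: the model below assumes modules are filled one thread at a time first, that a lone thread in a module runs at full speed, and that each thread in a fully loaded module runs at `shared_factor` of that speed. The function name and the model are illustrative assumptions; the 0.82 figure is the Cinebench number from above and a pure integer workload may behave differently.

```python
def estimated_scaling(threads: int, modules: int, shared_factor: float) -> float:
    """Estimate the MT scaling factor for a CMT-style CPU (2 cores per module)."""
    if not 0 <= threads <= 2 * modules:
        raise ValueError("threads must fit on the available cores")
    # Threads beyond one-per-module force that many modules to run two threads.
    paired_modules = max(0, threads - modules)
    solo_modules = min(threads, modules) - paired_modules
    return solo_modules * 1.0 + paired_modules * 2 * shared_factor


# 4 modules, 0.82x per thread when a module is shared.
for n in (4, 5, 8):
    print(n, round(estimated_scaling(n, 4, 0.82), 2))
```

Under these assumptions an 8-thread run on 4 modules would land around 6.56.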


ThreadsvsFritzChessat4GHz.png



To recap - if you want your thread-scaling numbers to make sense and be relevant then you need to (1) lock thread affinities to avoid thread migration performance penalties, (2) actively balance the threads to load physical cores or modules first with minimal shared resource conflicts, and (3) disable turbo-boost and turbo-core.
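A minimal sketch of points (1) and (2), assuming logical CPUs are numbered so that the two siblings of a core or module sit next to each other (0/1, 2/3, ...). That numbering is an assumption and should be checked on the actual machine; the function names are made up for illustration.

```python
def placement_order(cores: int, threads_per_core: int = 2) -> list[int]:
    """Logical CPU indices in the order threads should be pinned.

    Every physical core (or module) gets one thread before any core
    gets a second one.
    """
    return [core * threads_per_core + sibling
            for sibling in range(threads_per_core)
            for core in range(cores)]


def affinity_mask(cpu: int) -> int:
    """Single-CPU bit mask, the kind of value an affinity call expects."""
    return 1 << cpu


order = placement_order(4)
print(order)                               # [0, 2, 4, 6, 1, 3, 5, 7]
print([hex(affinity_mask(c)) for c in order])
```

The n-th worker thread would then be pinned to `order[n]`, for example with a mask passed to SetThreadAffinityMask on Windows or a CPU set passed to `os.sched_setaffinity` on Linux.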
 

mrle

Member
Mar 27, 2009
33
0
0
Thanks for the comments, Idontcare.

You are right, currently I don't set affinity for the threads, so they run on whichever core the Windows scheduler assigns them to. In fact, I only care about the 8-thread run, because I want to fully load the CPU, and I expect Windows to distribute the threads evenly. But in this case implementing thread affinity in code is easy for me, and I can do it if there is interest.

Second, I tried to "go around" the issue of various turbo and overclocking settings by measuring CPU cycles with the RDTSC instruction instead of time. Unfortunately, different architectures run the TSC differently, so it's not really an accurate measurement either. I might eventually revert to the QueryPerformanceCounter method, which can use the HPET timer (if available).

In the end, I expect the FX-8350 to come in around 7.0 - 7.2 for 8 threads, if AMD's estimate of a 0.8x penalty is correct. Also, in a lightly loaded system, I expect the Windows scheduler and background tasks not to influence the results significantly.
 

BrightCandle

Diamond Member
Mar 15, 2007
4,762
0
76
While the best performance might come from setting affinity and tailoring the program to particular hardware, it's not really a scalable solution for software. It's the operating system's task to assign threads in an appropriate manner, and it will work differently on different architectures. Setting affinity might make the performance slightly higher, but it's not the right solution for programming and concurrency in general.

As to single versus multithreaded, it's actually accurate. If you run the program single threaded it will see a higher clock speed; that is by design. So you don't want to remove that improvement to single threading, as the impact when multithreading is a reduction in clock speed and hence in single-core performance.

The benchmark is fine as is. It produces results that represent how software is written and how it utilises multiple cores when embarrassingly parallel problems are implemented.
 

mrle

Member
Mar 27, 2009
33
0
0
Forgot to mention that each n-threads run is repeated 10 times and the lowest value is taken. That should represent best case scenario with regard to scheduling and other thread interference.
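Roughly, the repeat-and-take-the-lowest idea looks like the sketch below. This is an illustration in Python, not the benchmark's actual source: it times with `time.perf_counter_ns` (wall-clock time, which on Windows is backed by QueryPerformanceCounter as far as I know) instead of RDTSC, and uses `zlib.crc32` as a stand-in workload.

```python
import time
import zlib


def best_of(work, repeats: int = 10) -> int:
    """Run work() `repeats` times and return the lowest elapsed time in ns."""
    best = None
    for _ in range(repeats):
        start = time.perf_counter_ns()
        work()
        elapsed = time.perf_counter_ns() - start
        # Keep the minimum: it is the run least disturbed by other activity.
        if best is None or elapsed < best:
            best = elapsed
    return best


data = bytes(1 << 20)  # 1 MiB of zeros as stand-in input
print(best_of(lambda: zlib.crc32(data)), "ns")
```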
 

tarmc

Senior member
Mar 12, 2013
322
5
81
1 thread(s) 2195518 cpu cycles (100 % of 1 thread). MT scaling factor: 1.00
2 thread(s) 2197893 cpu cycles (100 % of 1 thread). MT scaling factor: 2.00
3 thread(s) 2197141 cpu cycles (100 % of 1 thread). MT scaling factor: 3.00
4 thread(s) 2196016 cpu cycles (100 % of 1 thread). MT scaling factor: 4.00
5 thread(s) 3277059 cpu cycles (149 % of 1 thread). MT scaling factor: 3.35
6 thread(s) 3380844 cpu cycles (153 % of 1 thread). MT scaling factor: 3.90
7 thread(s) 3971853 cpu cycles (180 % of 1 thread). MT scaling factor: 3.87
8 thread(s) 4222198 cpu cycles (192 % of 1 thread). MT scaling factor: 4.16
on an i7 920 @ 3.8 GHz
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
The point I was making is that you can either benchmark the thread scaling of the processor's microarchitecture, or you can benchmark the thread scaling of the OS's thread scheduler combined with ambient-driven parameters (TDP and temperature) that feed back into the processor's power management policies (which themselves are subservient to the motherboard's BIOS).

I was assuming the OP wanted to interrogate the performance capabilities of the various processor microarchitectures with pure integer workloads. To do that you cannot confound the results with the vagaries of OS schedulers and power-management dependencies.

I mean sure, you can confound the data, but only at the expense of creating conclusions that are questionable at best and probably no better than just taking guesses at the performance ("AMD says it is 0.8x, so that is what I expect"...well then why run the test if you are simply seeking confirmation bias?).

But I'm really not here to crap on the OP's thread; his project is noble and I wish him all the best. It is fundamentally flawed, though, in ways that I recognize because of my own efforts in testing thread scaling, and I'd be the dick here if I noticed this but chose to withhold my knowledge from the OP by not mentioning it.

With power-management enabled the thread-scaling results are going to be local-environment dependent. It will also be BIOS and mobo dependent.

Without affinity locking the thread scaling is going to be OS dependent (scheduler dependent). WinXP is not going to yield the same thread scaling as Win8 (or Linux).

So the question then becomes - if you aren't testing the thread scaling of the microarchitecture in isolation from the other variables that affect thread scaling, then what are you testing the thread scaling of, and is that really what you wanted to test?
 

Lorne

Senior member
Feb 5, 2001
873
1
76
Hmmmm, how to word this.
IDC is correct.
Bright is also correct about how the most commonly used OSes behave, but doesn't see that IDC is pointing out the flaws that the OS itself can cause.
Mrle can then modify the code to run both ways to show the differences between the results, or design the code to always take the better path to show the best results.
This is how scaling in the architecture of both hardware and software is improved.
 

mrle

Member
Mar 27, 2009
33
0
0
From the results with Intel HT-enabled CPUs posted above, I can conclude that the Windows scheduler is usually smart enough to load "real" cores first, and then their HT mirrors. But I also agree that it is wrong to rely on the scheduler doing the best possible thing, so I'll modify the source to explicitly use CPU affinity and the "modules fully loaded last" strategy as proposed by Idontcare.
 

beginner99

Diamond Member
Jun 2, 2009
5,314
1,756
136
To me that scaling looks OK, because you actually have only 4 cores and you are doing only integer stuff. HT is meant for exactly the opposite case, when you have a mixed workload, so that one thread on an HT core uses the integer units and the other uses the FPU.

AMD will get better scaling because it has 8 real cores (at least for integer stuff).
 

beginner99

Diamond Member
Jun 2, 2009
5,314
1,756
136
True, Windows has been HT-aware for years.
 

Lorne

Senior member
Feb 5, 2001
873
1
76
I know this is for the present day, but it would also be nice to see how things improved from older OS schedulers to modern ones, with and without the affinity strategy (NT to Win8).

Where are the AMD scores, people?
 

AnonymouseUser

Diamond Member
May 14, 2003
9,943
107
106
Don't have an FX-8350, but figured I could share my score with a Phenom II X6 @ 3.7GHz

Code:
C:\Users\Mike\Downloads\CRCBench>CRCBench.exe
1 thread(s) 2618653 cpu cycles (100 % of 1 thread). MT scaling factor: 1.00
2 thread(s) 2612558 cpu cycles (99 % of 1 thread). MT scaling factor: 2.00
3 thread(s) 2614747 cpu cycles (99 % of 1 thread). MT scaling factor: 3.00
4 thread(s) 2618850 cpu cycles (100 % of 1 thread). MT scaling factor: 4.00
5 thread(s) 2626673 cpu cycles (100 % of 1 thread). MT scaling factor: 4.98
6 thread(s) 2634472 cpu cycles (100 % of 1 thread). MT scaling factor: 5.96
7 thread(s) 5213969 cpu cycles (199 % of 1 thread). MT scaling factor: 3.52
8 thread(s) 5227017 cpu cycles (199 % of 1 thread). MT scaling factor: 4.01

C:\Users\Mike\Downloads\CRCBench>
 

Lorne

Senior member
Feb 5, 2001
873
1
76
I've been trying to download the file, but I keep getting the wrong file (some iLivid BS). Maybe I'm just too old.
If I have to DL a program to DL a program then F'it.
 

AnonymouseUser

Diamond Member
May 14, 2003
9,943
107
106
Are you using ad-block? If not, you're probably being tricked into clicking the wrong link. Look for the blue box that says: "Click here to start download from sendspace"
 

monstercameron

Diamond Member
Feb 12, 2013
3,818
1
0
That Phenom II X6 result is near-linear scaling...
It's also kinda funny: every CPU except the one he wants.
 

AnonymouseUser

Diamond Member
May 14, 2003
9,943
107
106
Yep, until you hit threads 7 & 8. The i7-920 mentioned earlier did the same up to 4 threads, and still beats out my PhII overall (4.16 to 4.01). But while it does show scaling, it isn't giving a time to completion, so it's only half the story.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
I estimate that the FX-8350 will get an MT scaling factor of more than 7.

I'm not home to test the FX, but I have my mini-ITX Llano with me:

llano3870kmtscaling.jpg
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,598
4,506
75
Decided to try your benchmark on a Q9400. Results:
Code:
$ wine CRCBench.exe 
fixme:heap:HeapSetInformation (nil) 1 (nil) 0
wine: Call from 0x7bc4cdf0 to unimplemented function KERNEL32.dll.CreateThreadpoolWork, aborting
Did I mention the Q9400 is running Linux?