
CMT vs SMT - Bulldozer vs Gulftown Scaling

The end results do matter (that's what benchmarks measure). The thing is, the scaling is very similar for both despite the throughput claims. Perhaps rather than CMT, AMD should be looking at SMT, which is less complex and takes up less die area. 🙂

The odd one out is CLOMP...
CLOMP is the C version of the Livermore OpenMP benchmark developed to measure OpenMP overheads and other performance impacts due to threading in order to influence future system designs. This particular test profile configuration is currently set to look at the OpenMP static schedule speed-up across all available CPU cores using the recommended test configuration.
Bulldozer seems to level off at 4 threads. Could that be a scheduling effect, with the second "core"/thread of each module coming into use beyond that point? :hmm:

Also, there's a weird anomaly for the Gulftown between 6 and 8 threads on some of these tests. 😛
 
I would think that SMT is more complicated to implement.

Why? Register renaming (which all modern x86 CPUs have) takes you halfway there. If you can rename the flags register too, you basically don't have to make any changes to the execution units to support SMT.
 
Full 12 threads 990x, full 8 threads FX-8150
Test              Speedup % (990X, full HT)   Speedup % (FX-8150, full module)
C-Ray                      5.18                       83.35
Smallpt                   38.16                       83.65
GraphicsMagick            39.39                       44.68
GraphicsMagick            46.88                       62.50
7-Zip                     30.82                       91.54
x264                      21.75                       88.44
NAS Parallel              -4.38                       68.60
NAS Parallel              49.06                       86.60
NAS Parallel             -20.92                       59.29
NAS Parallel               9.25                       67.91
NAS Parallel               9.69                       65.28
CLOMP                     25.00                       -2.77
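
For anyone who wants to reproduce or summarise these numbers, here is a minimal sketch. It assumes (this is not stated in the post) that each percentage is the gain of the fully threaded run over the one-thread-per-core or one-thread-per-module run, computed from higher-is-better scores. The names `speedup_percent`, `TABLE` and `column_means` are made up for illustration, and a plain arithmetic mean is only a rough one-number summary.

```python
def speedup_percent(base_score, full_score):
    """Percentage gain of the fully threaded run over the baseline run.

    Assumes both arguments are throughput-style scores (higher is better).
    For time-based results, pass the reciprocals of the run times instead.
    """
    return (full_score / base_score - 1.0) * 100.0


# (990X full HT, FX-8150 full module) speedup percentages, copied from the
# table above in row order.
TABLE = [
    (5.18, 83.35), (38.16, 83.65), (39.39, 44.68), (46.88, 62.50),
    (30.82, 91.54), (21.75, 88.44), (-4.38, 68.60), (49.06, 86.60),
    (-20.92, 59.29), (9.25, 67.91), (9.69, 65.28), (25.00, -2.77),
]


def column_means(rows):
    """Arithmetic mean of each column of the table."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))


if __name__ == "__main__":
    ht_mean, cmt_mean = column_means(TABLE)
    print(f"mean HT gain:  {ht_mean:.2f}%")
    print(f"mean CMT gain: {cmt_mean:.2f}%")
```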

 
Full 12 threads 990x, full 8 threads FX-8150
For scaling, I'm looking at 8 threads on the Bulldozer versus 8 threads on the Gulftown, since we do not have 12 thread Bulldozer sample data (thus how Bulldozer scales beyond 8 threads is unknown, waiting for Interlagos on that one). Deltas are similar on a few of those tests (the ones without the Gulftown anomalies) 🙂
 
I was contrasting HT gains with CMT gains. I must say HT has come a long way since the P4 implementation and shows some solid gains. CMT doesn't have as much room to improve, but it's an interesting approach to running more threads.

Is it possible to disable cores in a 2600K? Someone with access to comparable 2600K and 8150 systems could then sweep from 1 core without HT and 1 module with 1 thread, all the way up to 4 cores with HT and 4 modules with 2 threads per module.
 
For scaling, I'm looking at 8 threads on the Bulldozer versus 8 threads on the Gulftown, since we do not have 12 thread Bulldozer sample data (thus how Bulldozer scales beyond 8 threads is unknown, waiting for Interlagos on that one). Deltas are similar on a few of those tests (the ones without the Gulftown anomalies) 🙂

That's problematic. For eight threads we know that BD is fully loaded, but what does that mean for Gulftown? It might mean that the benchmark is running on four cores with two threads per core, or it could be running on all six cores with only two cores fully loaded via SMT -- or five with three fully loaded. Depending on how the threads are managed you could be looking at a huge difference in scaling.
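
To make the ambiguity concrete, here is a small sketch that enumerates the ways a given number of threads can land on a chip whose cores (or modules) each take up to two threads. The function name `placements` is hypothetical, and the model ignores everything except how many cores end up double-loaded.

```python
def placements(threads, cores):
    """List (double_loaded, single_loaded) core counts that hold `threads`.

    A double-loaded core runs two threads, a single-loaded core runs one.
    Only combinations that fit within `cores` are returned.
    """
    results = []
    for doubled in range(threads // 2 + 1):
        single = threads - 2 * doubled
        if doubled + single <= cores:
            results.append((doubled, single))
    return results


if __name__ == "__main__":
    # Gulftown: 6 cores, 8 threads. Several placements are possible.
    print(placements(8, 6))
    # Bulldozer FX-8150: 4 modules, 8 threads. Only the fully loaded case fits.
    print(placements(8, 4))
```

With 8 threads on 6 cores this gives the three cases described above (2, 3 or 4 double-loaded cores), while 8 threads on 4 modules leaves only the fully loaded case.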

A better comparison for CMT vs. SMT would be one Intel core vs. one BD module, or at least compare products that have the same number of total threads. It can be difficult to tell if the benchmarks are favouring loading up physical cores before logical ones, or the other way around. 6/12 vs 4/8 just compounds the issue. Too much random noise.
 
That's problematic. For eight threads we know that BD is fully loaded, but what does that mean for Gulftown? It might mean that the benchmark is running on four cores with two threads per core, or it could be running on all six cores with only two cores fully loaded via SMT -- or five with three fully loaded. Depending on how the threads are managed you could be looking at a huge difference in scaling.

That's a non-issue. Windows is HT-aware and will always load physical cores first and logical cores second, so with 8 threads it will always load 6 physical cores and 2 logical cores. An older, unpatched version of Windows such as Windows XP may load 4 physical cores and 4 logical cores, but that doesn't happen on newer versions of Windows.
BTW, if you don't trust the Windows scheduler, you can always assign core affinity manually.
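
As a rough sketch of the manual route: the helper below picks logical CPUs "physical first" and builds the matching affinity bitmask. It assumes logical CPUs 2k and 2k+1 are the two HT siblings of core k, which is a common enumeration but not guaranteed, so check the real topology first. All function names here are made up for illustration. The hex form of the mask is the kind of value the Windows `start /affinity` command takes, and `os.sched_setaffinity` is only available on Linux-like systems.

```python
import os


def physical_first_cpus(n_threads, n_cores):
    """Pick logical CPU ids so every core gets one thread before any gets two.

    Assumption: logical CPUs 2k and 2k+1 are the SMT siblings of core k.
    """
    if not 0 <= n_threads <= 2 * n_cores:
        raise ValueError("thread count does not fit on this many cores")
    first = [2 * core for core in range(n_cores)]       # one thread per core
    second = [2 * core + 1 for core in range(n_cores)]  # then the siblings
    return (first + second)[:n_threads]


def affinity_mask(cpus):
    """Bitmask with one bit set per selected logical CPU."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


def pin_current_process(cpus):
    """Pin the calling process to `cpus` where the OS call is available."""
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("os.sched_setaffinity is not available here")
    os.sched_setaffinity(0, set(cpus))


if __name__ == "__main__":
    cpus = physical_first_cpus(8, 6)  # 8 threads on a 6-core HT chip
    print(cpus, format(affinity_mask(cpus), "X"))
```

Swapping in a "packed" list such as `list(range(8))` (four cores, both siblings each, under the same numbering assumption) would let you benchmark the other placement and see how much the scaling figures move.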
 