AMD Zen supports CMT and SMT

mrmt · Apr 4, 2015

MiddleOfTheRoad said:
Hilarious development.

You are well known on this forum for being one of the harshest critics of AMD.

And then it turns out you work in the oil industry -- which is the king of the universe for EPIC FAILS. Colossal, monstrous, amazingly huge screw-ups that massively dwarf just about any other failure (Deepwater Horizon, Exxon Valdez) which end up making AMD's Bulldozer look like an award-winning blockbuster home run success by comparison.

So I just want to thank you for your contribution. You didn't just make my day -- you made my year.

That's the nature of our activity. There aren't many industries as hazardous as ours. Indeed the companies made a lot of mistakes and the successes in our industry don't appear at all. But that doesn't make you less ignorant about us than you are.

But speaking about failure, do you live in a cave without heating, wearing leather rags? Because unless you do, you are very dependent on our industry, as is every citizen of a modern society. This is the measure of success of our industry, along with the billions in profits, salaries and taxes we generate every year.

And in our company's case, we don't buy AMD CPUs at all.

DrMrLordX · Apr 5, 2015

There are a lot of companies that don't buy AMD CPUs. Just look at market share figures for x86 servers. AMD commands . . . what, 2.5% of the server market now?

It's great that someone found a use case where CMT > SMT. Good for them, high fives and all that. Of course, you can always disable HT if it's bringing about erratic task completion. But whatever.

There's still no direct reference to CMT or SMT in the source articles feeding this thread, so as much fun as it is to speculate about some CMT+SMT monstrosity, I think we'd best wait and see what Zen is really about. Which, uh, could take a while.

AtenRa · Apr 5, 2015

Servers = Throughput

CMT = higher Throughput than SMT

Now, mrmt may try to distort reality by claiming that the reason for AMD's low server market share is CMT, but that is far from reality.

CHADBOGA · Apr 5, 2015

MiddleOfTheRoad said:
Technically, he might not be wrong. If you were to compare an entry-level Haswell Celeron G1830 to a Steamroller Athlon X4 860K -- Passmark only separates them by a mere 76 points in single threaded performance. Seems like a fair comparison, too -- since there is only a $20 difference in price between those 2 chips.

It's clearly no contest once you break into the i5 Haswells or above (versus Steamrollers) -- but it really depends on which Haswell he was referring to.... If he was only referring to Celerons or even the low-end Pentiums, then sure.... It is a true statement.

I'd assume he was talking about the best of each processor family, even though it then makes his statement a laughing stock, but past history does suggest that.

AtenRa · Apr 5, 2015

CHADBOGA said:
I'd assume he was talking about the best of each processor family, even though it then makes his statement a laughing stock, but past history does suggest that.

For the same price/segment, Kaveri ST performance is close to Haswell.

For example, A10-7700K/7850K ST performance is close to Core i3 41xx.

Also, since Kaveri doesn't have L3 cache, its ST and MT performance is lower than what the Steamroller architecture can achieve.
And the situation would have been even better if they had used the 20nm SOI GateFirst node, which would have given them even higher frequencies than the 28nm Bulk process.

The problem is not in the architecture; the problem is they didn't have a good process node. Now with 14nm FF they have the process they need to make the CMT architecture stretch its legs.

ShintaiDK · Apr 5, 2015

AtenRa said:
For the same price/segment, Kaveri ST performance is close to Haswell.

For example, A10-7700K/7850K ST performance is close to Core i3 41xx.

Also, since Kaveri doesn't have L3 cache, its ST and MT performance is lower than what the Steamroller architecture can achieve.
And the situation would have been even better if they had used the 20nm SOI GateFirst node, which would have given them even higher frequencies than the 28nm Bulk process.

The problem is not in the architecture; the problem is they didn't have a good process node. Now with 14nm FF they have the process they need to make the CMT architecture stretch its legs.

First of all, your distortion of price also enables the 43xx series.

Secondly, what you write is not even remotely true, even if we just look at Passmark. The 7850K scores 1577, while an i3 4370 scores 2239 and an i3 4160 scores 2071 (they sell the 4170 as well). And those are CPUs with 40W less TDP!

Your misleading gets obvious as soon as you try to spec actual products. The only way your case holds is if you use some obsolete low-end Celeron against your best Kaveri. But even then the G1840 Celeron is still faster, with 1651.

A $40 Celeron beats a $150 Kaveri in ST! (And in 2 threads, for that matter.)
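
To put the Passmark figures quoted above side by side, here is a small sketch that turns them into percentage leads over the A10-7850K. The scores are only the ones given in this post; the dictionary keys and the helper name `lead_over` are made up for illustration.

```python
# Passmark single-thread scores exactly as quoted in the post above.
scores = {
    "A10-7850K": 1577,
    "i3-4370": 2239,
    "i3-4160": 2071,
    "Celeron G1840": 1651,
}


def lead_over(baseline: str, other: str) -> float:
    """Percent by which `other` outscores `baseline` (hypothetical helper)."""
    return (scores[other] / scores[baseline] - 1.0) * 100.0


for chip in ("i3-4370", "i3-4160", "Celeron G1840"):
    print(f"{chip}: {lead_over('A10-7850K', chip):+.1f}% vs A10-7850K")
```

On those numbers the two i3 parts lead by roughly 42% and 31%, and the Celeron by a bit under 5%.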

BSim500 · Apr 5, 2015

AtenRa said:
For the same price/segment, Kaveri ST performance is close to Haswell. For example, A10-7700K/7850K ST performance is close to Core i3 41xx.

I think you mean A10-7850K MT is close to an i3-4xxx in 4-thread MT benchmarks. For ST, in the real world the A10-7850K consistently has around 1/2 to 3/4 of the single core performance (with a +5-10% clock advantage). You can tell when a benchmark is genuinely ST as the higher clocked i3's usually get more fps / points / take less time than some of the lower clocked i5's, and 1x Kaveri core is not remotely "close" at all to 1x Haswell core:-

"Our LAME audio conversion test is single-threaded."
http://media.bestofmicro.com/0/Q/418634/original/lame.png

"Printing a PowerPoint file to PDF happens in one thread, and the results of our benchmark are right in line with what previous metrics tell us to expect"
http://media.bestofmicro.com/0/8/418616/original/acrobat.png

"Super PI is a single threaded benchmark that calculates pi to a specific number of digits"
http://cdn.eteknix.com/wp-content/uploads/2014/02/kaveri_SuperPi.png

http://www.extremetech.com/wp-content/uploads/2014/03/Torchlight.png
http://pclab.pl/zdjecia/artykuly/radek/2013/i3hsw/wykresy/sc2_1920.png
http://pclab.pl/zdjecia/artykuly/radek/2013/i3hsw/wykresy/wot_1920.png
http://pclab.pl/zdjecia/artykuly/radek/2013/i3hsw/wykresy/fsx_1920.png

If 44%, 33%, 65%, 77%, 49%, etc, ST differences are "close", then a $70 Pentium G3258 is "very close indeed" to a $240 FX-9590 in most mixed / moderately threaded software / games...

NostaSeronx · Apr 5, 2015

Sustained Throughput (Best <- Worst);
CSMT <- CMP <- CMT <- SMT

Hope for the fusion of CMT and SMT, not for the separate usage of them.

Dufus · Apr 5, 2015

TheELF said:
This can not happen.
There is a thing called thread migration: each time a thread is being rescheduled it gets assigned to a new core (real or virtual). No matter how many or few threads you run, each and every one will get the same amount of CPU time, so there is no way for them to finish at erratically different times.
Even if they went out of their way to use affinity and force each process/thread to a different core, the scheduler would still give each thread an equal amount of CPU time.

Which OS does this happen on?

TheELF · Apr 5, 2015

Any and every halfway modern OS; Windows has had it since XP SP2 (at least).
http://support.microsoft.com/en-us/kb/896256

When single-threaded workloads run on multiprocessor systems that include dual-core configurations, the workloads may migrate across available CPU cores. This behavior is a natural artifact of how Windows schedules work across available CPU resources.

Dufus · Apr 5, 2015

TheELF said:
This can not happen.
There is a thing called thread migration: each time a thread is being rescheduled it gets assigned to a new core (real or virtual). No matter how many or few threads you run, each and every one will get the same amount of CPU time, so there is no way for them to finish at erratically different times.

TheELF said:
When single-threaded workloads run on multiprocessor systems that include dual-core configurations, the workloads may migrate across available CPU cores. This behavior is a natural artifact of how Windows schedules work across available CPU resources

These 2 statements are not the same, and only the second one is correct. For Windows, normally each application is given a quantum to run, usually 15.6ms, then put to the back of the queue. When it comes around to the front again it would be beneficial to run on the same hardware thread, as there may be remaining filled cache lines from the last time it ran; if not, then the next available core / thread. Windows does not specifically choose a different core / HW thread each time for the process. Higher priority threads get preference, and a thread may be thrown off either to make way for a more critical process or because it is waiting (stalled), and that can happen before its quantum has finished.
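
A toy model may help picture the behaviour described above. This is only a sketch under stated assumptions, not how the Windows scheduler is actually implemented: it is plain round-robin, ignores priorities and stalls, and every name in it (`simulate`, `soft_affinity`, the thread labels) is hypothetical. Work and finish times are counted in whole quanta; the 15.6ms figure is the one quoted in the post.

```python
from collections import deque

QUANTUM_MS = 15.6  # figure quoted in the post; only used for display


def simulate(num_cores, work_quanta, soft_affinity=True):
    """Toy round-robin scheduler (illustrative assumption, not Windows).

    work_quanta maps thread name -> number of quanta of CPU time needed.
    Returns (finish, migrations): the round in which each thread finished
    and how often it was placed on a different core than last time.
    """
    queue = deque(work_quanta)
    remaining = dict(work_quanta)
    last_core = {}
    migrations = {t: 0 for t in work_quanta}
    finish = {}
    rounds = 0
    while queue:
        # Take as many threads from the front of the queue as there are cores.
        batch = [queue.popleft() for _ in range(min(num_cores, len(queue)))]
        free = set(range(num_cores))
        placement = {}
        if soft_affinity:
            # First pass: a thread keeps its previous core if that core is free.
            for t in batch:
                c = last_core.get(t)
                if c in free:
                    placement[t] = c
                    free.discard(c)
        # Second pass: everyone else takes the lowest-numbered free core.
        for t in batch:
            if t not in placement:
                c = min(free)
                free.discard(c)
                placement[t] = c
        rounds += 1
        for t in batch:
            c = placement[t]
            if t in last_core and last_core[t] != c:
                migrations[t] += 1
            last_core[t] = c
            remaining[t] -= 1
            if remaining[t] == 0:
                finish[t] = rounds
            else:
                queue.append(t)  # back of the queue for the next quantum
    return finish, migrations


finish, moved = simulate(2, {"A": 2, "B": 2, "C": 2})
for name in sorted(finish):
    print(f"{name}: done after {finish[name] * QUANTUM_MS:.1f} ms, "
          f"{moved[name]} migration(s)")
```

In this model a lone thread on an otherwise idle machine never moves when `soft_affinity` is on, while three equal threads on two cores still see an occasional migration because the preferred core is sometimes taken.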

el etro · Apr 5, 2015

It would be cool and kinda crazy to see a Jaguar-like core inside a CSMT Zen/K12 core. The processor would schedule light work onto the little core and heavier loads onto the big cores, activating all of the module's resources. CMT and SMT would be working together here, so the gain would show up only in heavily threaded situations. That would make a good Skylake competitor in HPC and server workloads.

BTW, it is more important for AMD to improve the IPC and the per-thread performance scaling of its processors first.

TheELF · Apr 5, 2015

Dufus said:
These 2 statements are not the same, and only the second one is correct. For Windows, normally each application is given a quantum to run, usually 15.6ms, then put to the back of the queue. When it comes around to the front again it would be beneficial to run on the same hardware thread, as there may be remaining filled cache lines from the last time it ran; if not, then the next available core / thread.

Run anything that is single-threaded, look at Task Manager's CPU resource monitor, and you will see for yourself: no matter whether it would be better to always run on the same thread or not, it just won't.

podspi · Apr 5, 2015

Bulldozer was CMT/SMT, wasn't it?

The shared FPU handled threads in SMT fashion, but shared a frontend, hence CMT.

Anyway, it seems to me that CMT and SMT both are trying to do the same thing, in just different ways. CMT tries to increase utilization by having a beefy frontend service multiple weaker cores (who are highly utilized because they are weak). SMT increases utilization by having stronger cores service multiple threads.

I disagree with a lot of people in this thread who claim CMT is inherently flawed, I don't think it is, but it certainly seems harder to pull off than SMT, and might not be as applicable to general use stuff, due to the inherent ST deficit arising from using weaker cores.

ShintaiDK · Apr 5, 2015

podspi said:
Bulldozer was CMT/SMT, wasn't it?

The shared FPU handled threads in SMT fashion, but shared a frontend, hence CMT.

Nope. It's a CMT INT cluster with a shared FPU without SMT.

el etro · Apr 5, 2015

podspi said:
SMT increases utilization by having stronger cores service multiple threads.

And it has far better efficiency per mm². AMD took far too long to copy Intel's approach.

Thankfully AMD is dropping CMT in order to stay with SMT only in Zen. The adoption of AVFS and HDL plus SMT points to Zen being a huge and much denser core, and I personally think this is an awesome approach. Clocks and core count have the biggest impact on power consumption, so the processor's efficiency could surprise in the end.

podspi · Apr 5, 2015

ShintaiDK said:
Nope. It's a CMT INT cluster with a shared FPU without SMT.

I don't think that is accurate, since two threads can use the FPU simultaneously if one of the threads doesn't fully saturate the FPU. That is pretty much the definition of SMT. I'd believe it if the level of sophistication of their implementation isn't up to the level of Intel's, but it is still SMT.

Indeed, the Wikipedia article for Bulldozer notes the FPU implements SMT, and the Wikipedia article for SMT notes Bulldozer includes a partial implementation (floating point only).

Granted, not the best sources but better than an unsubstantiated 'nope'.

pTmdfx · Apr 5, 2015

ShintaiDK said:
Nope. It's a CMT INT cluster with a shared FPU without SMT.

The OOO execution pipeline of FPU is SMT. The decode/renaming front-end fed by the cores' dispatch unit is interleaved. The retirement queue is replicated for each core.

TheELF · Apr 6, 2015

podspi said:
I don't think that is accurate, since two threads can use the FPU simultaneously

Because there are two FPU cores, they are physically separated. If you want to call them SMT because they are exactly the same, no problem; just don't confuse it with Hyper-threading, it's nothing alike.
http://www.anandtech.com/show/6201/amd-details-its-3rd-gen-steamroller-architecture

AtenRa · Apr 6, 2015

TheELF said:
Because there are two FPU cores, they are physically separated. If you want to call them SMT because they are exactly the same, no problem; just don't confuse it with Hyper-threading, it's nothing alike.
http://www.anandtech.com/show/6201/amd-details-its-3rd-gen-steamroller-architecture

There is only a single FPU core, and only a single thread (4 ops) can be fetched per cycle. Once fetched into the FPU, ops from different threads (two) can be executed.

What you call two cores in the FPU are the two FMAC (128-bit) execution units.

TheELF · Apr 6, 2015

AtenRa said:
What you call two cores in the FPU are the two FMAC (128-bit) execution units.

So still two physically separate "things", whatever you want to call them.

naukkis · Apr 6, 2015

TheELF said:
Because there are two FPU cores, they are physically separated. If you want to call them SMT because they are exactly the same, no problem; just don't confuse it with Hyper-threading, it's nothing alike.
http://www.anandtech.com/show/6201/amd-details-its-3rd-gen-steamroller-architecture

It's just like hyperthreading, fetching instructions every other cycle from a different thread if both threads have waiting instructions. The Dozer family has a pure SMT FPU, unlike the integer side, whose front-end is similar but whose execution resources are dedicated per thread.
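
The alternating fetch described above can be pictured with a tiny sketch. It is a toy model of the claim as stated in this thread, not AMD's documented arbitration logic: the function name `fetch_order` and the thread labels are hypothetical, and the width of 4 ops per cycle is taken from the earlier post about the shared FPU.

```python
def fetch_order(pending_a, pending_b, width=4):
    """Toy fetch arbiter (illustrative assumption, not real hardware).

    Two threads, "A" and "B", have pending_a / pending_b ops waiting.
    Each cycle one thread fetches up to `width` ops. The arbiter alternates
    between threads when both have work and otherwise lets the thread that
    still has work fetch every cycle.
    Returns a list of (cycle, thread, ops_fetched) tuples.
    """
    pending = {"A": pending_a, "B": pending_b}
    trace = []
    turn = "A"
    cycle = 0
    while pending["A"] or pending["B"]:
        other = "B" if turn == "A" else "A"
        # Give the slot to the thread whose turn it is, unless it has nothing.
        pick = turn if pending[turn] else other
        n = min(width, pending[pick])
        pending[pick] -= n
        trace.append((cycle, pick, n))
        # Next cycle, the thread that did not just fetch gets first refusal.
        turn = "B" if pick == "A" else "A"
        cycle += 1
    return trace


for cycle, thread, ops in fetch_order(8, 4):
    print(f"cycle {cycle}: thread {thread} fetches {ops} ops")
```

With 8 ops waiting on one thread and 4 on the other, the slots go A, B, A; with only one thread active it gets every cycle.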

naukkis · Apr 6, 2015

TheELF said:
So still two physically separate "things", whatever you want to call them.

They are separated in the same way integer ALUs are. They are just execution units.

Abwx · Apr 6, 2015

TheELF said:
Because there are two FPU cores, they are physically separated. If you want to call them SMT because they are exactly the same, no problem; just don't confuse it with Hyper-threading, it's nothing alike.
http://www.anandtech.com/show/6201/amd-details-its-3rd-gen-steamroller-architecture

There is no such thing as an "FPU core" within a module; that's total nonsense. There are two cores, and each core handles both integer and FP operations.

TheELF · Apr 6, 2015

naukkis said:
It's just like hyperthreading, fetching instructions every other cycle from a different thread if both threads have waiting instructions.

That's not what Hyper-Threading does.
https://software.intel.com/en-us/articles/performance-insights-to-intel-hyper-threading-technology

Figure 3. By giving the processor access to two threads in the same time slice, Intel® HT Technology reduces the level of idle hardware resources, which typically increases efficiency and throughput.
