AMD Zen supports CMT and SMT


AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136


The FPU in the Bulldozer family can fetch up to four ops from a single thread per cycle.
But once the ops are inside the FPU, ops from different threads can be executed simultaneously in the same cycle.


http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/03/47414_15h_sw_opt_guide.pdf

Page 37.

FPU Features Summary and Specifications:
• The FPU can receive up to four ops per cycle. These ops can only be from one thread, but the thread may change every cycle. Likewise the FPU is four wide, capable of issue, execution and completion of four ops each cycle. Once received by the FPU, ops from multiple threads can be executed.
SMT and Hyper-Threading are the same thing. Hyper-Threading is just Intel's registered name for their SMT implementation.

From your own link,

Intel® Hyper-Threading Technology (Intel® HT Technology)¹ is a hardware feature supported in many Intel® architecture-based server and client platforms that enables one processor core to run two software threads simultaneously. Also known as Simultaneous Multi-Threading, Intel HT Technology improves throughput and increases energy efficiency.
 

TheELF

Diamond Member
Dec 22, 2012
4,027
753
126
So one thread on one core sends 2 ops the first cycle, then at the second cycle the second thread on the second core sends 2 ops, and then at the third cycle they get executed at the same time, but in two separate "units"?

With Hyper Threading each core (real+virtual) executes commands from two threads in the same cycle within the same core.

I fail to see how those two things could be considered even close to being the same.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
So one thread on one core sends 2 ops the first cycle, then at the second cycle the second thread on the second core sends 2 ops, and then at the third cycle they get executed at the same time, but in two separate "units"?

With Hyper Threading each core (real+virtual) executes commands from two threads in the same cycle within the same core.

I fail to see how those two things could be considered even close to being the same.

Hell, read more carefully:

Each cycle the FPU can fetch four (4) ops from a single thread.

Once inside the FPU, four (4) ops from different threads can be executed in the same cycle.

There is only a single FPU in each module. That FPU has two 128-bit FMACs, and each FMAC can execute ops from different threads simultaneously. It can also execute a single 256-bit AVX op per cycle.
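
If it helps, here is a little toy simulation in C of exactly that behaviour. It has nothing to do with the real hardware internals, and the "two ready ops per thread" limit is just an assumed stand-in for dependency stalls, but it shows how the FPU can receive ops from only one thread per cycle (alternating) and still issue ops from both threads in the same cycle:

#include <stdio.h>

#define RECEIVE_WIDTH    4  /* ops the FPU can receive per cycle, all from ONE thread    */
#define ISSUE_WIDTH      4  /* total ops the FPU can issue per cycle, from EITHER thread */
#define READY_PER_THREAD 2  /* assumed stand-in for dependency stalls (illustrative only) */

int main(void)
{
    int backlog[2] = { 8, 8 };  /* FP ops still waiting in each thread's front end */
    int queued[2]  = { 0, 0 };  /* ops already sitting in the shared FP scheduler  */

    for (int cycle = 0; backlog[0] + backlog[1] + queued[0] + queued[1] > 0; cycle++) {
        /* Receive: up to four ops, all from a single thread; the supplying thread alternates. */
        int t = cycle % 2;
        int recv = backlog[t] < RECEIVE_WIDTH ? backlog[t] : RECEIVE_WIDTH;
        backlog[t] -= recv;
        queued[t]  += recv;

        /* Issue: up to four ops in total this cycle, drawn from both threads' queued ops. */
        int budget = ISSUE_WIDTH;
        int issued[2] = { 0, 0 };
        for (int i = 0; i < 2; i++) {
            int n = queued[i];
            if (n > READY_PER_THREAD) n = READY_PER_THREAD;
            if (n > budget)           n = budget;
            queued[i] -= n;
            budget    -= n;
            issued[i]  = n;
        }

        printf("cycle %d: received %d op(s) from thread %c, issued %d from A + %d from B\n",
               cycle, recv, 'A' + t, issued[0], issued[1]);
    }
    return 0;
}

Run it and you will see cycles where it issues 2 ops from A and 2 from B even though it only received ops from one of the two threads in that cycle.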
 

naukkis

Golden Member
Jun 5, 2002
1,004
849
136
With Hyper Threading each core (real+virtual) executes commands from two threads in the same cycle within the same core.

Just like the dozer FPU. A Hyper-Threading CPU also has only one instruction decoder, so when executing two threads, instructions are decoded on alternating cycles from thread A and thread B. From Steamroller upwards the dozers have two independent instruction decoders, so the FPU can probably also get instructions from two different threads simultaneously.

FPU instruction execution times are long, up to dozens of cycles, so there are usually tens of instructions in flight from both threads simultaneously.
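
To put rough, purely illustrative numbers on it: if an FP op takes ~20 cycles from issue to completion and up to 4 ops can issue per cycle, then something like 20 × 4 = 80 ops can be in flight at once, which leaves plenty of room for both threads' work to overlap in the execution units.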
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
"Simultaneous multithreading (SMT): Issue multiple instructions from multiple threads in one cycle. The processor must be superscalar to do so."

You can't do that, can you?
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
"Simultaneous multithreading (SMT): Issue multiple instructions from multiple threads in one cycle. The processor must be superscalar to do so."

You can't do that, can you?

Not within the FPU Unit. You can do that per Module.
 

naukkis

Golden Member
Jun 5, 2002
1,004
849
136
"Simultaneous multithreading (SMT): Issue multiple instructions from multiple threads in one cycle. The processor must be superscalar to do so."

You can't do that, can you?

Sure can. Both Intel HT and AMD's pre-Steamroller dozers have just one instruction decoder, so they can decode and dispatch from only one thread per cycle. Execution on both is SMT: they can issue instructions from different threads to the execution units simultaneously.

From Steamroller upwards there are two independent instruction decoders, so a module can decode and issue from two threads simultaneously, but the main point of having two decoders is to increase how many instructions can be decoded per clock cycle.
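
A simplified cycle-by-cycle picture of the difference (illustrative only, ignoring fetch buffers and decode widths):

One shared decoder (Bulldozer/Piledriver): decode A, decode B, decode A, decode B, ... while the execution units keep working on already-issued ops from both threads every cycle.
Two decoders (Steamroller and later): decode A and B every cycle, which mainly raises total decode throughput rather than changing the SMT-style execution.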
 

podspi

Golden Member
Jan 11, 2011
1,982
102
106
"Simultaneous multithreading (SMT): Issue multiple instructions from multiple threads in one cycle. The processor must be superscalar to do so."

You can't do that, can you?

I can, although I think biology/computer analogies lose their usefulness after a while. Discussing wetware is kinda OT though, don't you think? :colbert:

Anyway, assuming you meant "Bulldozer can't do that, can it?" instead of "You can't do that, can you?" (which makes this technical discussion oddly and sadly ... personal), yes, it can. As I stated before, the Bulldozer module has a partial SMT implementation.

It is a weird, weird architecture. I thought it made sense years ago (save space in preparation for HSA) but:

1) It was pretty terrible
2) It still had a huge die for its performance.
and
3) HSA still hasn't taken off.

Doesn't change the fact that it uses SMT, a pretty old technology. It isn't Intel magic, like some people seem to think it is:

While multithreading CPUs have been around since the 1950s, simultaneous multithreading was first researched by IBM in 1968 as part of the ACS-360 project.[1]

Link.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
Wrong. From what you shared yourself: "Once received by the FPU, ops from multiple threads can be executed."

The FPU in Bulldozer can fetch from only a single thread per cycle; in SMT the core can fetch from multiple threads per cycle.

That is the difference here, because the Bulldozer FPU is decoupled from the INT units. Both the Bulldozer FPU and SMT in Intel cores can execute and retire from multiple threads per cycle.
 

Dufus

Senior member
Sep 20, 2010
675
119
101
Run anything that is single-threaded, look at Task Manager's CPU resource monitor and you will see for yourself; no matter whether it would be better or not to always run on the same thread, it just won't.
Here's a 4M SuperPi run with affinity set for all threads and priority set to high. I had to run this several times to get it to switch threads, so you can see it is able to do that, albeit some 40-odd seconds after the start.
(screenshot: 5sfp5.png)

Now it's happy to run on the same thread; however, if an application specifically asks to run on that thread, it can be kicked off onto a different thread to make way. For instance, a simple core temperature monitoring program needs to run on each core in order to read all the cores.

Windows Task Manager's monitor isn't the best way to see what is happening. Here's an example with normal priority showing SuperPi's thread distribution over 2.5 seconds. Which thread the app runs on depends on everything else that is happening during its execution and has nothing to do with Windows purposely trying to distribute it evenly across all threads.
(screenshot: 669hr5.png)
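
For anyone who wants to poke at this themselves, here's a bare-bones sketch (just my idea of what a quick test could look like, not the exact tool used for the screenshots above) that burns CPU on a single thread so you can watch in Task Manager / Resource Monitor where Windows runs it, and optionally pins itself with the Win32 affinity API for comparison:

#include <windows.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* With any command-line argument, pin the process to logical CPU 0 (mask bit 0);
       without one, leave it unpinned so you can watch Windows move it around. */
    if (argc > 1 && !SetProcessAffinityMask(GetCurrentProcess(), 0x1)) {
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }

    /* A single busy thread; watch it in Task Manager / Resource Monitor. */
    volatile unsigned long long x = 0;
    for (unsigned long long i = 0; i < 20000000000ULL; ++i)
        x += i;

    printf("done (%llu)\n", x);
    return 0;
}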


As for Hyper-Threading, what I did find interesting was that running Linpack on a quad-core Haswell with HTT and AVX2, one thread per core at 3 GHz, resulted in a throughput of 166 GFLOPS. If, however, a simple app was run on each of the other 4 threads, using just GPRs and simple MOV instructions in a loop (so no memory accesses or CPU cache usage), Linpack throughput dropped to 98 GFLOPS, a 40% drop in performance.
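
The "filler" app was along these lines, just a sketch (the real filler used register-only MOVs, which a plain C loop can only approximate, and which logical CPU is the HT sibling of which core depends on how your system enumerates them, so the default mask here is only an example):

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* Logical CPU to spin on, given on the command line (default: logical CPU 1).
       Check your own topology to find which logical CPUs are HT siblings. */
    int cpu = argc > 1 ? atoi(argv[1]) : 1;

    if (!SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << cpu)) {
        fprintf(stderr, "SetThreadAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }

    /* Busy loop with no memory traffic beyond the volatile counter; the real
       test used register-only MOVs, so this is only an approximation. */
    volatile unsigned long long spin = 0;
    for (;;)
        spin++;
}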
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
For scientific computing, most researchers seem to prefer all 8 projects running/finishing at a consistent speed (FX-8320e) -- versus 4 projects finishing quickly, with 4 projects dragging along like a slug on the "hyper" threads (Core i7).

If the threads are really being given very different priority there's no reason why the OS can't intermittently swap the two, much like it traditionally preempts threads.

I can tell you that all the tests I've run showed even scaling with hyper threads on a modern Linux kernel. That doesn't mean there's always a performance benefit in doing so, though.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
As for Hyper-Threading, what I did find interesting was that running Linpack on a quad-core Haswell with HTT and AVX2, one thread per core at 3 GHz, resulted in a throughput of 166 GFLOPS. If, however, a simple app was run on each of the other 4 threads, using just GPRs and simple MOV instructions in a loop (so no memory accesses or CPU cache usage), Linpack throughput dropped to 98 GFLOPS, a 40% drop in performance.
That's a real problem. This even happens when those parallel processes' threads have a lower priority, as long as there is no SMT-capable scheduler taking care of that.