Speculation: Ryzen 4000 series/Zen 3


soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
So stopping the ambitious ARM server K12 project in 2015 (the year when A77 and server Neoverse N1 development started) was a very bad move.
A pity they had no magic 8-ball, or you with a DeLorean to go back and tell them, innit?

Hindsight is 20/20 - but again, pish-poshing R&D budgets is ridiculous. They have to justify expenditures to investors, and back then ARM was simply not a priority, certainly not in comparison to x86.

You can get loans, yes, but justifying a whole extra ISA/uArch team back then would not have gone over well, considering that ARM server prospects were nowhere near where they are today.

The fact that they significantly scaled back their properties during that period demonstrates that there is no magic money tree when in debt and on the ropes as they were.

As I have mentioned earlier, the mess with the Vega release shows that back then they lacked the staff, and perhaps even the correct management structure, to handle two concurrent projects of that size and expect a sanguine result from both at the finish line.

Either way, your argument is becoming increasingly pointless, as the Cortex-X program effectively makes such problems moot.

Even without customisation it will be able to compete with others in the market (most of which are already using N1) - with customisation, no doubt there will be SMT variants, and lord knows what else for X2 and beyond.
 
  • Like
Reactions: Tlh97 and Valantar

soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
Also, Fujitsu A64FX development started around 2015, including the SVE vectors. Cancelling K12 during this huge ARM movement was a very bad idea.
In one breath you say K12 stopped in 2015 and then reference a project for a single supercomputer you say started the same year.

You really should try a spot of research - Google is your friend.

SVE was not even announced until August 2016, and it was likely in development at least as early as 2014, given it appears to be based on the ARGON SIMD paper published at the DATE conference that year.

Fujitsu only announced the Post-K project a month before SVE, in July 2016. Link here.

ARGON paper link.

SVE announcement link.

A64FX is certainly impressive, no doubt - but as I mentioned above, it was not just designed for a supercomputer, but rather for one specific supercomputer project alone. Who knows whether Fujitsu will make more for other projects, but it was designed solely for this use.

AMD had enough on their plate worrying about Intel and nVidia in 2015. Do you really think they were paying attention to a possible, as-yet-unannounced supercomputer chip project, before ARM-based servers had gained significant traction, when there were far more immediate problems to worry about?
 
Last edited:

Adonisds

Member
Oct 27, 2019
98
33
51
You are being foolish. There is no SMT4. I have given you proof. Yet you still run your mouth. Shut up and wait for it to come out. Keeping it a secret? What a joke! 6 ALUs, shut up. You don't think Intel or AMD could come up with a 6-ALU design? Yes, we will soon see if Zen 3 has SMT4 or not. When you are proven wrong, will you take a slice of humble pie?
He shows signs of being incapable of admitting it if he turns out to be wrong about SMT4. I have no problem with people who are wrong, but I can't stand people who can't admit their mistakes. We'll see what happens.
 
  • Like
Reactions: Tlh97 and Makaveli

Makaveli

Diamond Member
Feb 8, 2002
4,715
1,049
136
Yes, that accounts for 5 ns, which ends up being a 20 ns total improvement, and that bodes well for the consoles.

The consoles are using Zen 2. Do we know if that has been tweaked to be more like the APU core than the standard Zen 2 core?
 

amd6502

Senior member
Apr 21, 2017
971
360
136
imho, AMD will be killing off SMT rather than increasing it.

=> Kill SMT
=> Switch to double FE + 1C/1T L0i.

Zen3 will launch with "SMT2" but I have been hearing it is actually a VMT2 implementation w/ ST mode being best overall perf/watt.
Zen4 will then drop multithreading on a single core and push for a double-piped front-end and an improved single-threaded pure L0i (no switching between op-cache and L0i).

What is VMT?

I very much doubt that Zen4 would go single-threaded again. It's possible they might have a mode with one main thread plus three background threads.

Assuming the focus is on one main thread, some low-hanging fruit might be clustering to enable sharing of the two FPUs (1 FPU per core). So, for FPU code, SMT2 would happen within core pairs (aka modules, in CMT lingo), which would double the maximum FPU throughput should the neighboring core not be using its FPU.

Another idea (this is sort of far out, considering the 'fusion' vision hasn't materialized and the majority of products still ship without an iGPU): utilize the iGPU for far-ahead speculative FPU calculations, with the potentially useful results stored in the L1 and L2.
 
Last edited:

Saylick

Diamond Member
Sep 10, 2012
3,084
6,184
136
What is VMT?

I'm sure someone could explain it better than I can, but if I'm not mistaken, VMT = virtual multi-threading, or "reverse" SMT as it's colloquially known. Basically, where SMT exposes physical cores to the OS as many virtual cores (from few to many), VMT virtualizes multiple physical cores into one virtual core (from many to few).

SMT feeds wide cores with many threads so that you get higher utilization of resources when you have many light workloads, while VMT takes many narrow cores and gangs them up into a virtual large core to tackle workloads requiring more single-threaded performance.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,683
1,218
136
What is VMT?
In my post, it is vertical multithreading. However, it might not be historical vertical multithreading in implementation.

17h is single-path instruction flow, and it switches between thread A and thread B.
19h could be aimed at a two-path instruction flow that algorithmically prioritizes thread A on path A and thread B on path B, with future models dropping thread B and implementing dual task/process execution without two logical cores - expanding OoO efficiency without duct-taping SMT onto the core.

Bucket A + Bucket B
Bucket A + Bucket A minus N or Bucket A + Bucket A plus N

Hardware OoO + Software SMT is also an option after they kill SMT.
 
Last edited:

amd6502

Senior member
Apr 21, 2017
971
360
136
In my post, it is vertical multithreading. However, it might not be historical vertical multithreading in implementation.

OK, I had to do some research on vertical MT. Supposedly this is a crude predecessor to modern SMT, where each stage can only work on one thread at a time, but a stalled thread allows another thread to wake up and resume. So it's a kind of coarse-grained MT. Supposedly this was the approach in the early days of P4 hyperthreading, as well as Larrabee/Atom HT.

I very much doubt this would happen (maybe with an exception for background threads). I also disagree with the duct-tape analogy. The duct-tape SMT method was already done for the FPU side in the BD/Piledriver family. Zen seems very much designed for SMT2 from the start. I disagree with Richie that Zen1 would have been related to Apple's A7; those Apple cores are single-threaders. It seems funny to think that they took an A7, duct-taped some SMT to it, duct-taped an x86 decoder to it, and also made it run on AMD micro-ops rather than ARMv8.

I agree with Richie that cores are getting wider. I kind of doubt it'd be as wide as EV8. But regardless, whether it is an 8+4-wide or 5+3-wide core, vertical MT would do almost nothing to keep those pipes from being underutilized. I guess the Apple strategy was to not mind all the underutilization, and to probably put most of the units to sleep when idle-like conditions were detected. (I don't think that's a good approach.)

But for a thread that aims to maximize IPC in a modern, very wide core, the amount of branch prediction, look-ahead, and speculative execution required doesn't seem agreeable if maximizing perf/watt is one of the main goals. So I could see that being limited to one main priority thread, while the other threads aim for much less OoO execution and modest IPCs, like Piledriver's.

Or... straight-up SMT4, but the OS would taskset running processes onto the fewest number of cores, maximizing the number of idling cores that can then be put into low-power mode - and then hope that SMT4 quarters the amount of look-ahead and speculative execution per thread, so that it lands in reasonable, energy-efficient territory.
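
For what it's worth, the consolidation half of that idea is already doable from userspace on Linux via CPU affinity. A minimal sketch of just that mechanism (the PIDs and core count below are made up for illustration, and nothing about it is AMD- or SMT4-specific):

Code:
import os

def consolidate(pids, cores_to_use):
    # Restrict every PID to the first `cores_to_use` logical CPUs,
    # the same effect as `taskset -cp` on each process.
    target = set(range(cores_to_use))
    for pid in pids:
        try:
            os.sched_setaffinity(pid, target)
        except (PermissionError, ProcessLookupError):
            pass  # skip processes we may not touch or that already exited

# Hypothetical example: squeeze four background workers onto two logical CPUs,
# leaving the remaining cores free to idle into low-power states.
consolidate(pids=[1234, 1235, 1236, 1237], cores_to_use=2)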
 
Last edited:
  • Like
Reactions: Vattila

Richie Rich

Senior member
Jul 28, 2019
470
229
76
OK, I had to do some research on vertical MT. Supposedly this is a crude predecessor to modern SMT, where each stage can only work on one thread at a time, but a stalled thread allows another thread to wake up and resume. So it's a kind of coarse-grained MT. Supposedly this was the approach in the early days of P4 hyperthreading, as well as Larrabee/Atom HT.

I very much doubt this would happen (maybe with an exception for background threads). I also disagree with the duct-tape analogy. The duct-tape SMT method was already done for the FPU side in the BD/Piledriver family. Zen seems very much designed for SMT2 from the start. I disagree with Richie that Zen1 would have been related to Apple's A7; those Apple cores are single-threaders. It seems funny to think that they took an A7, duct-taped some SMT to it, duct-taped an x86 decoder to it, and also made it run on AMD micro-ops rather than ARMv8.

I agree with Richie that cores are getting wider. I kind of doubt it'd be as wide as EV8. But regardless, whether it is an 8+4-wide or 5+3-wide core, vertical MT would do almost nothing to keep those pipes from being underutilized. I guess the Apple strategy was to not mind all the underutilization, and to probably put most of the units to sleep when idle-like conditions were detected. (I don't think that's a good approach.)

But for a thread that aims to maximize IPC in a modern, very wide core, the amount of branch prediction, look-ahead, and speculative execution required doesn't seem agreeable if maximizing perf/watt is one of the main goals. So I could see that being limited to one main priority thread, while the other threads aim for much less OoO execution and modest IPCs, like Piledriver's.
VMT is effectively working in modern cores already. They divide a single thread into several sub-threads for each back-end port (ALU, LSU, FPU), and the OoO machine can speculatively execute both sides of a branch if needed. I would say there is no need for VMT at the macro level because it's already built into the OoO mechanism.

I never said that Keller brought the A7 design and said 'build this and duct-tape SMT onto it'. But there are some surprising similarities:
  • A7 ..... 4xALU .... 2xBranch shared ..... 192-entry ROB ... 64kB+64kB L1 cache ... 2xAGU
  • Zen .... 4xALU .... 2xBranch shared ..... 192-entry ROB ... 64kB+64kB L1 cache ... 2xAGU
There might be more identical things in the INT core, but A7 info is very limited. Nosta mentioned this similarity a long time ago, and he was right. I'm not a HW engineer, so I don't understand all of Nosta's ideas; however, he sometimes has a good catch (like with early N5 Zen4).

I'm afraid that AMD did reduce Keller's EV8 resurrection, AKA Zen3, into something smaller though. Probably 8xALU -> 6xALU, while keeping SMT4 of course.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,324
1,462
136
I'm sure someone could explain it better than I can, but if I'm not mistaken, VMT = virtual multi-threading, or "reverse" SMT as it's colloquially known. Basically, where SMT exposes physical cores to the OS as many virtual cores (from few to many), VMT virtualizes multiple physical cores into one virtual core (from many to few).

Doing that is not possible. Something in this vein is frequently suggested, as something somebody should just figure out how to do, by people who have absolutely no understanding of how any of this works. The latency on die between two cores is simply too high for "two cores working on a single thread" to ever produce any performance benefit.

Anyone who suggests any kind of "reverse SMT" should instantly be discredited, as they clearly do not understand even the very basics of which limitations matter when doing things in silicon.

If Zen3 does not have SMT4, I will say: "OK, I was wrong, you were right. But it's a missed opportunity to be more advanced than Intel."

Wider SMT is not more advanced. The server market is not currently asking for more SMT. I know for a fact that it has lately become more common for server customers to go the opposite way, and completely disable SMT on machines they purchase. There are two reasons for this:

Firstly, some of the recent security issues hit machines with SMT worse than ones without, and it's disabled for perceived security reasons.

Secondly, it's disabled because disabling it improves the performance/cost of the server. Renewed competition in the CPU space has drastically lowered the part of a server's cost that goes to the CPU. This has left RAM as the largest cost by far, typically near half the cost of the whole server. Every process you run requires the same amount of RAM to do its thing, regardless of how fast it is running. If you double the number of threads but only increase system throughput by 25%, you have just doubled the amount of RAM you need to pay for, for just a quarter extra speed. In a world where RAM is more than 50% of the cost of a new server, this has just worsened your performance/cost by ~16%.
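
A quick sanity check of that last bit of arithmetic, plugging the post's own assumptions (RAM at ~50% of server cost, double the threads for ~25% more throughput) into a trivial Python snippet - the exact figures are illustrative, not measured:

Code:
def perf_per_cost(throughput, ram_fraction, ram_multiplier):
    # Server cost normalised to 1.0 with SMT off; RAM cost scales with thread count.
    cost = (1.0 - ram_fraction) + ram_fraction * ram_multiplier
    return throughput / cost

smt_off = perf_per_cost(throughput=1.00, ram_fraction=0.5, ram_multiplier=1)
smt_on  = perf_per_cost(throughput=1.25, ram_fraction=0.5, ram_multiplier=2)

print(f"perf/cost change with SMT on: {smt_on / smt_off - 1:+.1%}")
# -> about -17%, i.e. roughly the ~16% worsening described above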
 

moinmoin

Diamond Member
Jun 1, 2017
4,934
7,619
136
Doing that is not possible. Something in this vein is frequently suggested, as something somebody should just figure out how to do, by people who have absolutely no understanding of how any of this works. The latency on die between two cores is simply too high for "two cores working on a single thread" to ever produce any performance benefit.

Anyone who suggests any kind of "reverse SMT" should instantly be discredited, as they clearly do not understand even the very basics of which limitations matter when doing things in silicon.
Aww come on, let me dream that dream, even if it's completely unrealistic currently. :sob:
 

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
Doing that is not possible. Something in this vein is frequently suggested, as something somebody should just figure out how to do, by people who have absolutely no understanding of how any of this works. The latency on die between two cores is simply too high for "two cores working on a single thread" to ever produce any performance benefit.

Anyone who suggests any kind of "reverse SMT" should instantly be discredited, as they clearly do not understand even the very basics of which limitations matter when doing things in silicon.

Wider SMT is not more advanced. The server market is not currently asking for more SMT. I know for a fact that it has lately become more common for server customers to go the opposite way, and completely disable SMT on machines they purchase. There are two reasons for this:

Firstly, some of the recent security issues hit machines with SMT worse than ones without, and it's disabled for perceived security reasons.

Secondly, it's disabled because disabling it improves the performance/cost of the server. Renewed competition in the CPU space has drastically lowered the part of a server's cost that goes to the CPU. This has left RAM as the largest cost by far, typically near half the cost of the whole server. Every process you run requires the same amount of RAM to do its thing, regardless of how fast it is running. If you double the number of threads but only increase system throughput by 25%, you have just doubled the amount of RAM you need to pay for, for just a quarter extra speed. In a world where RAM is more than 50% of the cost of a new server, this has just worsened your performance/cost by ~16%.
You also missed another item in the decision tree for not having SMT enabled on servers: licensing costs. Big software vendors are moving from a "per-socket" licensing model to a "per-thread" model, by way of "per-core". If those extra threads cost you as much as the first thread on each core, but only bring an additional 25% of performance to the table, while also increasing operating temperatures and slowing down the performance of individual threads (there is some overhead to running the extra threads), it can make more sense to just deploy a few extra servers with SMT off and come out ahead in the long run. To add to what you were saying about RAM, disabling SMT can also let you run lower-density memory modules in each server, as there is less memory demand from fewer active threads, cutting RAM costs markedly on a per-GB basis.
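
Extending the same back-of-the-envelope model with the per-thread licensing point made here: the hardware and licence prices below are purely hypothetical, and only meant to show how per-thread licensing can tilt the trade-off further, not to reflect any real vendor's pricing.

Code:
def perf_per_total_cost(throughput, hw_cost, threads, licence_per_thread):
    # hw_cost is assumed to already include the RAM effect from the earlier sketch.
    return throughput / (hw_cost + threads * licence_per_thread)

# Hypothetical 64-core box: hardware normalised to 1.0 (SMT off) vs 1.5 (SMT on,
# doubled RAM), with software licensed per schedulable thread.
smt_off = perf_per_total_cost(1.00, hw_cost=1.0, threads=64,  licence_per_thread=0.01)
smt_on  = perf_per_total_cost(1.25, hw_cost=1.5, threads=128, licence_per_thread=0.01)

print(f"perf/total-cost change with SMT on: {smt_on / smt_off - 1:+.1%}")
# per-thread licensing makes the SMT-on case look even worse than RAM alone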
 

maddie

Diamond Member
Jul 18, 2010
4,723
4,628
136
Doing that is not possible. Something in this vein is frequently suggested, as something somebody should just figure out how to do, by people who have absolutely no understanding of how any of this works. The latency on die between two cores is simply too high for "two cores working on a single thread" to ever produce any performance benefit.

Anyone who suggests any kind of "reverse SMT" should instantly be discredited, as they clearly do not understand even the very basics of which limitations matter when doing things in silicon.

Wider SMT is not more advanced. The server market is not currently asking for more SMT. I know for a fact that it has lately become more common for server customers to go the opposite way, and completely disable SMT on machines they purchase. There are two reasons for this:

Firstly, some of the recent security issues hit machines with SMT worse than ones without, and it's disabled for perceived security reasons.

Secondly, it's disabled because disabling it improves the performance/cost of the server. Renewed competition in the CPU space has drastically lowered the part of a server's cost that goes to the CPU. This has left RAM as the largest cost by far, typically near half the cost of the whole server. Every process you run requires the same amount of RAM to do its thing, regardless of how fast it is running. If you double the number of threads but only increase system throughput by 25%, you have just doubled the amount of RAM you need to pay for, for just a quarter extra speed. In a world where RAM is more than 50% of the cost of a new server, this has just worsened your performance/cost by ~16%.
Your last paragraph is one hell of an example of 2nd and 3rd order effects. Who would have clearly seen that?
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
Secondly, it's disabled because disabling it improves the performance/cost of the server. Renewed competition in the CPU space has drastically lowered the part of a server's cost that goes to the CPU. This has left RAM as the largest cost by far, typically near half the cost of the whole server. Every process you run requires the same amount of RAM to do its thing, regardless of how fast it is running. If you double the number of threads but only increase system throughput by 25%, you have just doubled the amount of RAM you need to pay for, for just a quarter extra speed. In a world where RAM is more than 50% of the cost of a new server, this has just worsened your performance/cost by ~16%.
It sounds reasonable in theory. But why is the AWS comparison of Graviton2 vs. Rome done with SMT2 enabled?

Amazon's cloud can disable SMT, but it's requested mainly by customers who are running HPC loads. https://aws.amazon.com/blogs/compute/disabling-intel-hyper-threading-technology-on-amazon-linux/

As far as I can tell, most server systems have SMT on, so you are wrong. However, feel free to provide data showing that most servers run with SMT off. Not to mention that some tasks with low ILP, like SQL, benefit a lot from SMT.

A Zen3 with 8xALU and SMT4 would just need to reduce SMT4 down to SMT2. Disabling SMT completely would leave the core underutilized.
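
As an aside on the SMT on/off toggling being argued about: on a bare-metal Linux box (kernel 4.19 or newer) the generic switch is the kernel's sysfs SMT control file, separate from whatever instance-level method the linked AWS article describes. A minimal sketch, assuming root:

Code:
SMT_CONTROL = "/sys/devices/system/cpu/smt/control"

def smt_state():
    with open(SMT_CONTROL) as f:
        return f.read().strip()   # e.g. "on", "off", "forceoff" or "notsupported"

def set_smt(enabled):
    # Writing here hot-(un)plugs all SMT sibling threads; requires root.
    with open(SMT_CONTROL, "w") as f:
        f.write("on" if enabled else "off")

print("SMT is currently:", smt_state())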
 
  • Like
Reactions: Exist50

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
It sounds reasonable in theory. But why is the AWS comparison of Graviton2 vs. Rome done with SMT2 enabled?

Amazon's cloud can disable SMT, but it's requested mainly by customers who are running HPC loads. https://aws.amazon.com/blogs/compute/disabling-intel-hyper-threading-technology-on-amazon-linux/

As far as I can tell, most server systems have SMT on, so you are wrong. However, feel free to provide data showing that most servers run with SMT off. Not to mention that some tasks with low ILP, like SQL, benefit a lot from SMT.

A Zen3 with 8xALU and SMT4 would just need to reduce SMT4 down to SMT2. Disabling SMT completely would leave the core underutilized.

Why would Amazon care if SMT is off or on with respect to software licensing costs? All they have to provide is the VM instance. Depending on what software foundation they're using, they aren't seeing any higher licensing costs per thread over per core. It's the user of the iron that has to figure out what's best for them. If I'm throwing a software package on the cloud that I have to pay "per-thread" licensing for, then all I care about is which cloud has the best performance per thread and per dollar of CPU time that works with my licensing model. If I'm hosting my own server, and I'm paying for a DB package that is licensed by the thread, then I'm going to be looking for the solution that gives me the lowest cost of performance that can fit in my existing footprint. That's a complicated calculus, as rack space is finite, cooling costs money, and per-thread licensing can be quite expensive. It's not impossible that it makes more sense to get more physical cores in more physical systems, because if my per-thread licensing is crazy high, I had better maximize the performance of each clock cycle that I have to license.

If I'm just hosting iron for other people, then all I have to worry about is my hypervisor, uptime management, and load balancing among systems while providing the highest value per vCPU minute that I offer. SMT makes sense for me.
 
  • Like
Reactions: Tlh97 and Elfear

Richie Rich

Senior member
Jul 28, 2019
470
229
76
Why would Amazon care if SMT is off or on with respect to software licensing costs? All they have to provide is the VM instance. Depending on what software foundation they're using, they aren't seeing any higher licensing costs per thread over per core. It's the user of the iron that has to figure out what's best for them. If I'm throwing a software package on the cloud that I have to pay "per-thread" licensing for, then all I care about is which cloud has the best performance per thread and per dollar of CPU time that works with my licensing model. If I'm hosting my own server, and I'm paying for a DB package that is licensed by the thread, then I'm going to be looking for the solution that gives me the lowest cost of performance that can fit in my existing footprint. That's a complicated calculus, as rack space is finite, cooling costs money, and per-thread licensing can be quite expensive. It's not impossible that it makes more sense to get more physical cores in more physical systems, because if my per-thread licensing is crazy high, I had better maximize the performance of each clock cycle that I have to license.

If I'm just hosting iron for other people, then all I have to worry about is my hypervisor, uptime management, and load balancing among systems while providing the highest value per vCPU minute that I offer. SMT makes sense for me.
SMT has some pros and cons, like any other tech. Please don't just talk - give me proof that more than 50% of servers run with SMT off today. Give numbers, links...
 
  • Like
Reactions: amd6502