Speculation: Ryzen 4000 series/Zen 3


NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
You just don't do that in hardware. There's no point to it.
Windows already gives Intel CPUs hardware control of P-states; it's called Speed Shift. Why wouldn't it be believable for them to extend a hardware "Hyperscheduler" to Windows as well?

-> Most of the currently available solutions to the power saving problem are based on a software routine that requests a power management system to enable the power saving. Further, the corresponding power saving schedules are often created in a static and/or a manual fashion, which is error prone. Moreover, neither the dynamic power saving methods nor the static power saving methods provide an accurate prediction of the power consumption.
-> It is an object of the invention to provide for a hardware task scheduler with an improved power saving efficiency. In order to achieve the object defined above, a hardware task scheduler, a multiprocessing system, and a hardware-based power saving method are provided.
-> The hardware task scheduler may implement scheduling policies supporting heterogeneous multi-core architectures, where each of the processor cores can be multi-threaded or single-threaded.
-> However, if this processor core is overloaded (for instance, in case the processing element is multi-threaded and/or virtualized, in which case other tasks may be assigned to virtual processing elements physically mapped to the processing element, in addition to the task currently running on it), the load balancing method may recommend running the task on some other processor core.
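
As a rough illustration of the scheduling policy those excerpts describe, here is a minimal Python sketch of a power-aware scheduler that consolidates tasks onto already-awake cores and only wakes a gated core once the awake ones are full. The Core class, the SMT width and the "overloaded" rule are my own illustrative assumptions, not anything taken from the patent.

Code:
# Hypothetical sketch of a power-aware task scheduler, loosely modelled
# on the patent excerpts above. Names, the SMT width and the
# "overloaded" rule are illustrative assumptions only.
from dataclasses import dataclass, field

@dataclass
class Core:
    core_id: int
    smt_width: int = 2                       # hardware threads per core
    asleep: bool = True                      # power-gated until needed
    tasks: list = field(default_factory=list)

    def overloaded(self) -> bool:
        # Treat a core as overloaded once every hardware thread is busy.
        return len(self.tasks) >= self.smt_width

def schedule(task, cores):
    """Consolidate work on already-awake cores so the rest stay gated;
    only wake a sleeping core when every awake core is overloaded."""
    awake = [c for c in cores if not c.asleep and not c.overloaded()]
    if awake:
        target = min(awake, key=lambda c: len(c.tasks))
    else:
        sleeping = [c for c in cores if c.asleep]
        target = sleeping[0] if sleeping else min(cores, key=lambda c: len(c.tasks))
        target.asleep = False
    target.tasks.append(task)
    return target

ccx = [Core(i) for i in range(4)]            # one 4-core CCX
for t in ("t0", "t1", "t2", "t3"):
    schedule(t, ccx)
print([(c.core_id, c.asleep, c.tasks) for c in ccx])
# With SMT2, four light tasks land on two cores; the other two stay gated.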



HW scheduling is faster and more power efficient.

AMD Zen Automotive => Hardware Realtime SMP? or Software Realtime SMP?

=> Advanced Technologies Group is a startup group with a mission to advance the future of safer transportation as we look to bring millions of drivers into assisted and automated driving. We develop artificial intelligence perception software for driver-assistance and automated driving, with a focus on implementing efficient deep neural networks on AMD’s automotive-grade processors.

--> That said, it is not surprising that TSMC has already taped out the first chip using its N7+ technology. Furthermore, the company is prepping a specialized version of the process aimed at the automotive industry, which indicates that N7+ is going to be a “long” node.
https://news.synopsys.com/2018-10-0...rade-IP-in-TSMC-7-nm-Process-for-ADAS-Designs
https://news.synopsys.com/2017-09-1...fied-for-TSMCs-Advanced-7-nm-FinFET-Plus-Node
https://news.synopsys.com/2018-04-3...-High-performance-7-nm-FinFET-Plus-Technology
^-- AMD's partner for most of their IP.

Zen1 (K8) -> Zen2 (Greyhound) -> Zen3 (new core); for anyone asking, this is my official position.
 
Last edited:

amd6502

Senior member
Apr 21, 2017
971
360
136
Secondly, the OS could void the tasksetting of high-niceness processes to little cores when the system load drops to a low number (say, a load below the number of physical cores). In these conditions, all software threads get taskset to the main SMT2 cores.

Correction:

Actually I think for the consumer world, it's best to have all process affinity be for the main SMT2 logical cores, but to have lower-niceness processes simply get higher priority for this affinity. This means the small logical cores stay idle most of the time, and during very high system loads, lower-priority (highly niced) threads simply overflow onto the small logical cores. On a 4c SoC this would only happen when the system load exceeds 8.

If this isn't already implemented in Linux, it could easily be done.
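
For what it's worth, here is a rough userspace sketch of that overflow policy in Python (a real implementation would live in the kernel scheduler). The core numbering, the niceness cutoff and the load threshold are assumptions for a hypothetical 4c SoC with extra small logical cores, not anything that exists today.

Code:
# Rough userspace sketch of the niceness/overflow policy described above.
# Assumes a hypothetical 4-core SoC where CPUs 0-7 are the main SMT2
# logical cores and CPUs 8-11 are small overflow logical cores. The
# niceness cutoff, load threshold and polling interval are arbitrary.
import os, time

BIG_CPUS      = set(range(0, 8))   # main SMT2 logical cores
SMALL_CPUS    = set(range(8, 12))  # small/overflow logical cores
NICE_CUTOFF   = 10                 # "highly niced" threshold
OVERFLOW_LOAD = 8                  # only overflow once load exceeds 8

def retune(pid):
    try:
        niceness = os.getpriority(os.PRIO_PROCESS, pid)
        load1, _, _ = os.getloadavg()
        if niceness >= NICE_CUTOFF and load1 > OVERFLOW_LOAD:
            # Background work may spill onto the small logical cores.
            os.sched_setaffinity(pid, BIG_CPUS | SMALL_CPUS)
        else:
            # Default: everything prefers the main SMT2 cores.
            os.sched_setaffinity(pid, BIG_CPUS)
    except OSError:
        pass   # process exited, not ours to touch, or CPUs not present

while True:
    for entry in os.listdir("/proc"):
        if entry.isdigit():
            retune(int(entry))
    time.sleep(5)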
 
Last edited:

moinmoin

Diamond Member
Jun 1, 2017
4,952
7,661
136
And every tom, dick and harry has to chime in as soon as SMT4 is mentioned.
Put up some proof, or just forget about it.
Zen 2 is already as wide as POWER7, which had SMT4. *shrugs*
It is farfetched, but it is not unfounded.

I'm still not sold on SMT4, 8, or anything else. Necessarily. What's the real advantage of 4c/16t over 8c/16t? Fewer transistors? Lower power consumption? And what would be the drawbacks of relying on SMT4? ARM designers went in a completely different direction by just adding a bunch of little extra cores to their SoCs via big.LITTLE/DynamIQ. They were very successful doing so.
SMT is for better utilization of all the chip's resources, which in turn is more power efficient than running the same threads on separate cores.

I honestly think big.LITTLE/DynamIQ of the ARM world doesn't really apply to x86. In the ARM world you have plenty of very power-efficient older cores while new designs become wider and wider. Combining such older cores with bleeding-edge performance cores is the fastest approach to offering both the best performance and the polished power efficiency of older cores everybody knows from ARM. In the x86 world only Intel has something partially comparable with its Atom cores and is finally trying to make use of that with Lakefield. AMD doesn't truly have any low-power high-efficiency cores; instead the Zen cores are both high performance and high efficiency.

Furthermore, the big.LITTLE/DynamIQ approach is a bottom-up one: the primary target is a relatively small number of cores in the mobile space, growing from there. AMD's approach with Zen is top-down: design for the server space and cut down from there. There AMD is already throwing in the kitchen sink wrt the number of cores; adding a couple of little cores to 64 wide cores is inane. Instead SMT offers the opportunity to run many low-utilization threads using already existing resources while keeping as many cores power gated as possible, increasing power efficiency that way.
 
Last edited:

DrMrLordX

Lifer
Apr 27, 2000
21,632
10,845
136
SMT is for better utilization of all the chip's resources, which in turn is more power efficient than running the same threads on separate cores.

But what does this do for power efficiency when resources are generally underutilized, such as when I have a moderately-taxing 4t workload on a 16t chip? What's the power usage going to be on 2c/16t, 4c/16t, 8c/16t, etc.? Seems like having more cores rather than relying on SMT makes power gating much easier in those scenarios.

I honestly think big.LITTLE/DynamIQ of the ARM world doesn't really apply to x86.

You mentioned Lakefield. We'll see if Intel continues in that direction. AMD has their old cat cores which they could update . . . or they could cut down Zen2. I doubt they want to spend the money on that. AMD is mostly ignoring the low end anyway.

In the ARM world you have plenty of very power-efficient older cores while new designs become wider and wider.

That's sort-of true, but not entirely. Take a look at ARM chips combining A76 and A55. A55 is probably very similar to A53 (and some older cores still), but it's also AArch64-compliant. It's not literally an old core from 5+ years ago. There's work being done to at least update the "little" cores.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,675
3,801
136
...Zen1 (K8) -> Zen2 (Greyhound) -> Zen3 (new core); for anyone asking, this is my official position.


What the? It's early and I'm tired, but did you just call Zen K8?? Now Greyhound is interesting as I had never heard of it before.
 

moinmoin

Diamond Member
Jun 1, 2017
4,952
7,661
136
But what does this do for power efficiency when resources are generally underutilized, such as when I have a moderately-taxing 4t workload on a 16t chip? What's the power usage going to be on 2c/16t, 4c/16t, 8c/16t, etc.? Seems like having more cores rather than relying on SMT makes power gating much easier in those scenarios.
How do you think more cores make power gating easier than SMT? Not sure I'm following there.

Take a single CCX. Without SMT and 4 concurrent low utilization threads all four cores of the CCX would fire up. With SMT2 two cores could stay in deep sleep state. With SMT4 this could be increased to three cores staying in deep sleep state.

big.LITTLE usually relies on an imbalanced ratio, with significantly more little cores than big ones. Lakefield relies on a 4-1 ratio. For the above example to give a tangible difference, the ratio would have to be at least 1-1 or better in favor of little cores (SMT2: 2-1, SMT4: 4-1). Considering Zen cores are rather small to begin with, the additional space required for such little core counterparts is likely better spent making the big cores as well as SMT more efficient.
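
To put rough numbers on the CCX example: the number of cores that have to wake for N concurrent low-utilization threads is simply ceil(N / SMT width), and everything else can stay gated. A trivial sketch (the 4-core CCX and the thread count are just the example above, nothing more):

Code:
# Back-of-the-envelope illustration of the CCX example above: how many
# cores must wake for N low-utilization threads at a given SMT width.
from math import ceil

CCX_CORES = 4
THREADS   = 4   # concurrent low-utilization threads from the example

def cores_awake(threads, smt_width):
    return min(CCX_CORES, ceil(threads / smt_width))

for smt in (1, 2, 4):
    awake = cores_awake(THREADS, smt)
    label = "no SMT" if smt == 1 else f"SMT{smt}"
    print(f"{label}: {awake} core(s) awake, {CCX_CORES - awake} gated")
# no SMT: 4 awake, 0 gated; SMT2: 2 awake, 2 gated; SMT4: 1 awake, 3 gated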

or they could cut down Zen2.
Fine grained power gating is essentially cutting down without changing the silicon. Why add additional cores just for that if you can do the same in real time anyway?
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
What the? It's early and I'm tired, but did you just call Zen K8?? Now Greyhound is interesting as I had never heard of it before.
Greyhound is the actual name of the Family 10h cores; Agena, Deneb, Thuban, Llano all use Greyhound.

K7 = Bobcat & Jaguar (The floating-point execution units include a store-convert unit (STC) that drives results to the main-core data cache, a floating-point adder (FPA) that shares roots with the AMD K7 FPA, and a floating-point iterative multiplier derived from the Bobcat FPM design and K7 divide/square-root algorithms.)
K8 = Zen (17h) (Rather than 64-bit (vertical power), it is more about IPC (horizontal power))
Greyhound = Zen2 (17h) // Probably don't look at Agena, instead look at Deneb (Greyhound+) (higher frequency at a lower node + 256-bit FPU)
New core = Zen3 (19h)
Enhanced new core = Zen4 (19h)
Next-gen new core = Zen5 (21h)
Improved new core = ZenX, etc (21h?/22h?) <- 3D-Arch project
 
Last edited:

nicalandia

Diamond Member
Jan 10, 2019
3,330
5,281
136
I suppose I meant additional sources of info on SMT4, rather than just the AdoredTV link (and all the rumors based on that one video).
I have been searching for any AMD patents for 4-way SMT but there are none. I guess they will be using IBM SMT-4, just like they did with IBM SMT-2?
 

amd6502

Senior member
Apr 21, 2017
971
360
136
Take a single CCX. Without SMT and 4 concurrent low utilization threads all four cores of the CCX would fire up. With SMT2 two cores could stay in deep sleep state. With SMT4 this could be increased to three cores staying in deep sleep state.

big.LITTLE usually relies on an imbalanced ratio, with significantly more little cores than big ones. Lakefield relies on a 4-1 ratio. For the above example to give a tangible difference, the ratio would have to be at least 1-1 or better in favor of little cores (SMT2: 2-1, SMT4: 4-1). Considering Zen cores are rather small to begin with, the additional space required for such little core counterparts is likely better spent making the big cores as well as SMT more efficient.


That's a great approach, and the battery life on such a mobile quad core would see amazing gains. I just don't see how they can NOT go SMT4 in the next two generations.


Also, a brilliant way to look at the SMT4 versus ARM approach.



I think of the 7nm node's main strength as being the efficiency improvement. So double up on efficiency with both architecture and fabrication, and the products coming to mobile in the next 1-3 years will make leaps and bounds. (And the great thing about SMT4 or a similar wide-core approach is that performance/IPC can also make gains while making that leap in efficiency.)
 
  • Like
Reactions: DarthKyrie

Ajay

Lifer
Jan 8, 2001
15,454
7,862
136
I have been searching for any AMD patents for 4-way SMT but there are none. I guess they will be using IBM SMT-4, just like they did with IBM SMT-2?
So, aside from the AdoredTV rumor, there is nothing indicating AMD will be moving to SMT4. Zero.
 
  • Like
Reactions: NTMBK

Yotsugi

Golden Member
Oct 16, 2017
1,029
487
106
You would think they would have patented their own SMT, but I am still searching for that patent; when I find it I will post it here.
Not everything is patented.
I don't think there are patents for Navi L1 either.
 

nicalandia

Diamond Member
Jan 10, 2019
3,330
5,281
136
  • Like
Reactions: Yotsugi

DrMrLordX

Lifer
Apr 27, 2000
21,632
10,845
136
How do you think more cores make power gating easier than SMT? Not sure I'm following there.

Take a single CCX. Without SMT and 4 concurrent low utilization threads all four cores of the CCX would fire up. With SMT2 two cores could stay in deep sleep state. With SMT4 this could be increased to three cores staying in deep sleep state.

That assumes the scheduler works that way. If I have a 4c/8t chip (such as a single CCX) and all I have is one demanding thread and one low-utilization thread, do you think the scheduler is going to put them both on the same core? In Win10, the low-utilization thread will probably bounce between three cores, bringing them into and out of sleep constantly while the demanding thread stays on the first core. SMT will probably not see any utilization. In the same scenario on a DynamIQ setup with 4C + 4c, the scheduler can keep up to four low-utilization threads busy while only waking up one of the big cores. Or I can have smaller, narrower cores and just keep two of them awake instead of having two larger, wider cores awake to handle the same two threads.

In the extreme example, let's say I have SMT8 with 1c/8t instead of a CCX. Now if all I have are two threads, regardless of their intensity, I have to wake up the entire beast to do anything. Surely that comes at a power penalty, no?

big.LITTLE usually relies on an imbalanced ratio, with significantly more little cores than big ones.

Not necessarily. Look at the Snapdragon SoCs. And Kirin 980. DynamIQ would let them use an asynchronous arrangement of cores - something that was less possible under big.LITTLE - but they still have an even balance of resources. Kirin 980 is 4 A76 + 4 A55, and so is Snapdragon 855 (though one of the A76 cores in Snapdragon 855 runs at a higher clockspeed than the others).

Fine grained power gating is essentially cutting down without changing the silicon. Why add additional cores just for that if you can do the same in real time anyway?

Compare Intel's Core-Y series to the high-performance mobile SoCs, and look at their power profiles. Intel can't match their idle power consumption, even in generations where the mobile SoCs didn't necessarily have a big process lead as they do today. Power gating can only do so much.
 

moinmoin

Diamond Member
Jun 1, 2017
4,952
7,661
136
That assumes the scheduler works that way. If I have a 4c/8t chip (such as a single CCX) and all I have is one demanding thread and one low-utilization thread, do you think the scheduler is going to put them both on the same core? In Win10, the low-utilization thread will probably bounce between three cores, bringing them into and out of sleep constantly while the demanding thread stays on the first core. SMT will probably not see any utilization. In the same scenario on a DynamIQ setup with 4C + 4c, the scheduler can keep up to four low-utilization threads busy while only waking up one of the big cores. Or I can have smaller, narrower cores and just keep two of them awake instead of having two larger, wider cores awake to handle the same two threads.
That all is purely a software problem though. If a scheduler is theoretically capable of detecting low-utilization threads and keeping them on little cores, it's also theoretically capable of keeping them on fewer cores using SMT. That the Windows scheduler is mindless and braindead has been repeatedly shown, but surely you agree that shouldn't influence hardware design decisions in any way?

In the extreme example, let's say I have SMT8 with 1c/8t instead of a CCX. Now if all I have are two threads, regardless of their intensity, I have to wake up the entire beast to do anything. Surely that comes at a power penalty, no?
Sure, big.LITTLE is better in the mobile space, I wrote as much before. But the context in which I'm talking about all this is server chips with up to 64 cores right now. Unless you are arguing AMD adding 64 little cores to their 64 big ones is a better idea than going SMT4?

Compare Intel's Core-Y series to the high-performance mobile SoCs, and look at their power profiles. Intel can't match their idle power consumption, even in generations where the mobile SoCs didn't necessarily have a big process lead as they do today. Power gating can only do so much.
The idle power consumption is mainly down to the uncore and caches that can't be gated off. That's a separate optimization issue neither big.LITTLE nor SMT can help with.

In Intel's Core-Y series the additional issue is that they are not SoCs, so the further required chipset and controllers (like for Thunderbolt etc.) only add to the power requirement. The Atom chips being SoCs have better idle usage at the cost of worse connectivity.
 
  • Like
Reactions: DarthKyrie

DrMrLordX

Lifer
Apr 27, 2000
21,632
10,845
136
That all is purely a software problem though. If a scheduler is theoretically capable of detecting low-utilization threads and keeping them on little cores, it's also theoretically capable of keeping them on fewer cores using SMT. That the Windows scheduler is mindless and braindead has been repeatedly shown, but surely you agree that shouldn't influence hardware design decisions in any way?

Is any other operating system's scheduler going to do better with an SMT CPU? Also if you try moving threads onto occupied cores, then you have the issue of what happens if the "big" thread runs a slice of code that can utilize all available execution resources (AVX2 or what have you). Now you have the scheduler trying to put a second thread on the CPU when there are no pipeline stalls or other obvious "gaps" where the second thread can execute. Now the scheduler is going to have to move that thread to another core entirely, which is probably why "mindless and braindead" schedulers pick physical cores over logical cores first. Or at least one reason why.

Sure, big.LITTLE is better in the mobile space, I wrote as much before. But the context in which I'm talking about all this is server chips with up to 64 cores right now.

But we are also talking about AMD. Their server core design will be present in all of their products, at least until they grow to the point where they want to maintain separate core designs. I see no clear indicator that AMD will even consider such a strategy on any of their roadmaps. Do we want SMT4 on the desktop? In a server, it's realistic to believe that most of a CPU's resources will be committed most of the time (if not all of the time). So we don't worry so much about when and how a scheduler wakes up a particular core. Zen2 is heading for laptops in Renoir. Presumably, Zen3 will follow the same circuitous path. Does AMD want SMT4 in laptops? I don't think we should rationally consider it possible (or plausible) that AMD will emulate big.LITTLE or DynamIQ in their core designs, but you have to admit, if they did, it would ease the transition to low-end computing devices, far more so than adoption of SMT4 would. Realistically speaking, I think AMD will avoid any change away from SMT2 in the near future. They will keep selling more of the same since it works.

There's also the issue of SMT and VMs. A lot of cloud vendors just disable SMT/HT right out of the gate. AMD has every intention of selling hardware to them, and I do not think that SMT4 will be a big selling point for those buyers. I also question whether a DynamIQ-style asynchronous core arrangement would be useful, since it would complicate the allocation of bare metal assets during creation of a VM.

Unless you are arguing AMD adding 64 little cores to their 64 big ones is a better idea than going SMT4?

I think the answer is c). None of the above. AMD simply doesn't have little cores available to use, so being the frugal sorts that they are, they'll just punt on that question and add more of the same SMT2 cores they already have (with planned updates).

The idle power consumption is mainly down to the uncore and caches that can't be gated off. That's a separate optimization issue neither big.LITTLE nor SMT can help with.

Not entirely true. Some of those challenges are unique to Infinity Fabric. Others are unique to AMD's CCX design. The mobile SoCs can easily gate off lower-level caches since they are not shared (I think the standard DynamIQ design calls for shared L3). So can pretty much anyone else. ARM's DSU has some interesting additional features though, like being able to gate off part or all of a cluster's L3 cache depending on load:

https://www.androidauthority.com/arm-dynamiq-need-to-know-770349/

It still remains to be seen whether any of these power gating features will be attractive outside of the mobile world. Does anyone want a server processor made up of multiple clusters of 1x A76 + 4x A55, or what have you? If so, why? Nobody has made that use case yet. The existing ARM server SoCs appear to have synchronous core configurations. Everything is the same core, at the same clockspeed.

In Intel's Core-Y series the additional issue is that they are not SoCs, so the further required chipset and controllers (like for Thunderbolt etc.) only add to the power requirement. The Atom chips being SoCs have better idle usage at the cost of worse connectivity.

To date, Atom hasn't been competitive either, though. Not in the lower-power mobile space.
 

amd6502

Senior member
Apr 21, 2017
971
360
136
Well, I was able to find said patent, but I am not too versed in CPU architecture to know if it's the real deal or not.

https://patents.google.com/patent/US5944816A/en

The patent was assigned to GlobalFoundries Inc (1996). The Inquirer at the time believed that it was for a possible future HT.

AMD patent could enable hyperthreading
https://www.theinquirer.net/inquirer/news/1029950/amd-patent-enable-hyperthreading


Hyper threatting is way too dangerous. I don't think they would do it, and if they did, people probably would be hesitant to use it.


As for SMT-n it's been floating around CS academia for decades, so I doubt it's patentable.


Maybe a trademark search (e.g. threadripping) might some day give us more to speculate on.
 
Last edited:

naukkis

Senior member
Jun 5, 2002
706
578
136
The Windows scheduler does its job. There's absolutely no point in putting threads on SMT logical cores instead of real cores for power reasons; even without big.LITTLE, low core utilization will keep clock frequency and voltages low and save energy. SMT only makes sense when there's full utilization of cores and using SMT will provide more throughput. And putting other threads on the same core that already runs a high-priority thread instead of on idle cores is just stupid, as it will slow down that high-priority thread.
 

moinmoin

Diamond Member
Jun 1, 2017
4,952
7,661
136
Is any other operating system's scheduler going to do better with an SMT CPU? Also if you try moving threads onto occupied cores, then you have the issue of what happens if the "big" thread runs a slice of code that can utilize all available execution resources (AVX2 or what have you). Now you have the scheduler trying to put a second thread on the CPU when there are no pipeline stalls or other obvious "gaps" where the second thread can execute. Now the scheduler is going to have to move that thread to another core entirely, which is probably why "mindless and braindead" schedulers pick physical cores over logical cores first. Or at least one reason why.
You have exactly the same issue with big.LITTLE. If a scheduler is theoretically capable of detecting high utilization threads and moving them from little to big cores it's also theoretically capable of moving them from SMT shared to dedicated physical cores. It's all a software problem.

But we are also talking about AMD. Their server core design will be present in all of their products, at least until they grow to the point where they want to maintain separate core designs. I see no clear indicator that AMD will even consider such a strategy on any of their roadmaps.
Did you see AMD going with SMT2 before they announced it? Did anybody expect that very first implementation to beat Intel's HT?

Do we want SMT4 on the desktop?
That's completely beside the point. Does the majority of desktop users need AVX2? Most very likely do not.

In a server, it's realistic to believe that most of a CPU's resources will be committed most of the time (if not all of the time).
That's actually wrong unless you are talking about HPC specifically. Servers in general are all about over-provisioning all kinds of resources, being prepared for the worst-case resource usage scenarios.

So we don't worry so much about when and how a scheduler wakes up a particular core.
Patently wrong. The more cores a chip contains in one shared envelope, the more the cores' activity will affect each other. The more cores can be put into a deep sleep state, the more headroom other cores can make use of. And as we know, AMD developed Zen's microcode for Precision Boost (PB) in a way that dynamically makes use of more headroom, so it profits from that already.

Zen2 is heading for laptops in Renoir. Presumably, Zen3 will follow the same circuitous path. Does AMD want SMT4 in laptops? I don't think we should rationally consider it possible (or plausible) that AMD will emulate big.LITTLE or DynamIQ in their core designs, but you have to admit, if they did, it would ease the transition to low-end computing devices, far more so than adoption of SMT4 would. Realistically speaking, I think AMD will avoid any change away from SMT2 in the near future. They will keep selling more of the same since it works.
But in the last two years AMD did the opposite of "selling more of the same since it works". Zen to Zen 2 completely changed the MCM topology. SMT is still very new to AMD, having been introduced only two years ago. Lacking software support didn't prevent AMD from launching any of the Ryzen or Threadripper chips either: the Windows scheduler had serious issues with TR 1's NUMA, then again with TR 2 WX's unbalanced NUMA.

There's also the issue of SMT and VMs. A lot of cloud vendors just disable SMT/HT right out of the gate. AMD has every intention of selling hardware to them, and I do not think that SMT4 will be a big selling point for those buyers. I also question whether a DynamIQ-style asynchronous core arrangement would be useful, since it would complicate the allocation of bare metal assets during creation of a VM.
What is this "allocation of bare metal assets during creation of a VM" you are speaking of? Resource allocation can be changed even after the creation of a VM, just as you can change the PC hardware after installing an OS. That again is purely a software issue.

And disabling SMT/HT for cloud providers is due to them specifically offering resources per single vCPU; you don't want this vCPU resource being a variable that depends on how many concurrent threads are on it. But that doesn't prevent server providers from offering computing resources per CCX (or comparable big.LITTLE blocks) instead, where SMT could be left enabled.
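
As a concrete illustration of "per CCX instead of per vCPU": a host could hand out whole SMT-enabled CCX slices as pinning sets. The topology below (4 cores per CCX, SMT2, sibling threads enumerated at an offset equal to the number of physical cores) is a common Linux enumeration, but it's an assumption here, not something any specific provider does.

Code:
# Hypothetical helper that carves a Zen-style host into per-CCX CPU sets
# a hypervisor could pin whole guests to, keeping SMT enabled inside
# each slice. Topology values below are assumptions: 2 CCXs of 4 cores,
# SMT2, sibling threads offset by the total physical core count.
TOTAL_CORES   = 8      # e.g. one 8-core chiplet = 2 CCXs
CORES_PER_CCX = 4
SMT           = 2

def ccx_cpuset(ccx_index):
    first = ccx_index * CORES_PER_CCX
    physical = set(range(first, first + CORES_PER_CCX))
    siblings = {cpu + TOTAL_CORES for cpu in physical}  # SMT siblings
    return physical | siblings if SMT == 2 else physical

for ccx in range(TOTAL_CORES // CORES_PER_CCX):
    print(f"CCX{ccx}: pin the guest's vCPUs to host CPUs {sorted(ccx_cpuset(ccx))}")
# CCX0 -> [0, 1, 2, 3, 8, 9, 10, 11], CCX1 -> [4, 5, 6, 7, 12, 13, 14, 15]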

I think the answer is c). None of the above. AMD simply doesn't have little cores available to use, so being the frugal sorts that they are, they'll just punt on that question and add more of the same SMT2 cores they already have (with planned updates).
You yourself were arguing for the cat cores before.

Not entirely true. Some of those challenges are unique to Infinity Fabric.
...which is part of the uncore and offers intra chip connectivity that one always needs on any chip...

Others are unique to AMD's CCX design. The mobile SoCs can easily gate off lower-level caches since they are not shared (I think the standard DynamIQ design calls for shared L3). So can pretty much anyone else.
And Zen cores can power gate everything except the shared L3$. (I think I remember the APUs can even power gate the L3$ itself since it's not shared due to its single CCX nature, not sure.)

ARM's DSU has some interesting additional features though, like being able to gate off part or all of a cluster's L3 cache depending on load:
That's a good area for further improvements for AMD there indeed. (Also finding a way to make the shared L3$ globally writable instead of just local slices per core. Making better use of that massive L3$ should give a good performance boost.)

But that's again about the cores which are plenty optimized for power efficiency as is already. The uncore is where most further power efficiency optimizations can be done.

And putting other threads on the same core that already runs a high-priority thread instead of on idle cores is just stupid, as it will slow down that high-priority thread.
Is that what the Windows scheduler does? :D
 
Last edited: