Speculation: Ryzen 4000 series/Zen 3


DrMrLordX

Lifer
Apr 27, 2000
22,885
12,941
136
When Apple can do this, anybody can do this too. It's just a matter of time and resources. Qualcomm, ARM Cortex and any start-up. Nobody needs to beg Intel for an x86 license.

Instead they have to beg ARM for an ARM license. And I'm sorry to say that nobody in the ARM world has caught up with Apple yet. Or even AMD (Zen2). Maybe someday Apple will license one of their core designs to Huawei or . . . whoever, who will then turn around and use a custom interconnect to produce a beastly 64c server processor, with SVE2 tacked-on for good measure. Then AMD will have to worry. As it stands, Rome went a long way towards killing the existing ARM competition. ThunderX2 is looking a lot worse right now. So is Qualcomm's 64c chip (which is embargoed in some large markets anyway).

For now, A-series chips are staying in their cage.

And regarding RISC-V: current RISC-V CPUs are weak in performance because of a lack of development on those CPUs, not because of the instruction set.

ehhhh

https://news.ycombinator.com/item?id=17611047

And that's just one thread of criticisms. I've read other complaints. RISC-V may be of great benefit to markets that want a locally-developed CPU to please some centralized autocratic government. Outside of that, I have little hope for it.

RISC-V is a cheaper and more open version of the ARM instruction set.

Is it? You can license a current-gen ARM core and get all kinds of resource support for matching it to the design rules of a specific existing node for what probably costs less than hiring your own designers to try and figure out how to leverage RISC-V well enough to beat said ARM core. Can you hire a team to beat an A76-based chip on a core-per-core basis? No? I can't either. Maybe if I had enough money, I could poach a few people from Rockchip (or wherever) to throw together a 4xA76 + 4xA55 chip via DynamIQ (also provided by ARM; very nice of them, isn't it?). That's an attainable goal.

Eventually someone may put up the piles of cash necessary to make a truly-open RISC-V design. Then I can copy that design (or iterate upon it) and compete with people schlepping standard ARM designs. But someone's going to lose a bunch of money up front. Just not me!

So if you want to fight in the server market for big money, then you need to develop a powerful CPU from scratch -> the best choice is RISC-V. IMHO that's why it attracts attention.

Name one company even looking at RISC-V to do that. Note that making a state-anointed replacement CPU to x86 and ARM does not count, since that is not "fighting in the server market". That's making homebrew hardware with backdoors specific to your own local intelligence agency (instead of the NSA or whoever).
 

moinmoin

Diamond Member
Jun 1, 2017
5,240
8,454
136
It's a pity to see another competitor leave the market. Are there any rumours about what's going to fill the void? Who's going to take over the fabs etc.
From the public-facing side it does look like TSMC has taken the lion's share of the pie and Samsung is playing second fiddle for high-performance parts (I know they're bigger in memory). Surely the AMD link has a sizable contract timeline, so they can't just stop 14/12nm production for a few years, as all the EPYC parts have to be supported for years. Zeppelin and EPYC 2 I/O dies will be made for years, although if GF really has no customers of note then AMD should be stockpiling everything they can get from old nodes whilst being in a position to negotiate cheaper prices.
AMD guarantees availability of its products for up to 10 years from launch indeed (embedded products are 10 years, PRO and server chips are 5 years I think). I'd think whoever is taking over GloFo's FinFET business would also inherit those contractual obligations (including WSA? Anybody know whether that's transferable?).

Your second quote is not from me but from @amd6502, please fix. ;)
 

Thunder 57

Diamond Member
Aug 19, 2007
4,020
6,734
136
Things have changed a lot since the '90s.
  1. What is the most-sold CPU architecture today? ARM. Thanks to smartphones, IoT, smart TVs, ... it's just everywhere, creating new markets which were not possible with x86. ARM is a huge platform living in parallel to x86 for different markets.
  2. And that's creating a lot of software for ARM too. You have several OSes ready for ARM: Win10, Linux, BSD, Android, plus tons of open-source SW. In the '90s you didn't have an alternative to x86 because of software. Apple and Amiga had better HW but they starved for SW (Amiga died and Apple was saved by SW from Bill Gates' MS). Things have changed, and even a new architecture like RISC-V has a ton of recompiled applications thanks to open-source SW. We're just waiting for a more powerful RISC CPU; everything else is ready.
  3. Actually it's a fight for the survival of x86 in desktops, laptops and servers. Everywhere else RISC is already dominant. It's waiting to reach the critical point, and that point is a powerful ARM CPU flooding the market. Apple has one already but it will stay in their garden only.
  4. Are you happy your x86 CPU is using 4xALUs? Why not, it's powerful... But what about the Apple Vortex core, which uses 6xALUs and is much more powerful than any x86 CPU? This is the point where I start to be unhappy. It shows how x86 CPU development has been held back by lack of competition. As a customer I deserve more than Intel and AMD produce right now; we need more competition. If the Apple Vortex core can have 6xALUs, then AMD Zen3 can have 6xALUs + SMT4 too.

1. And what kind of margins do they get on IoT, smart TV, etc? Chump change compared to the server market.

2. There is definitely more software today. But don't forget Windows NT used to run on RISC, MIPS in particular.
IA-32, x86-64, ARM and Itanium (and historically DEC Alpha, MIPS, and PowerPC)
So the OS was there, but the market still went with x86.

3. Let's wait and see what happens if/when Apple decides to drop Intel in favor of ARM. Then we will have something to talk about. Until then, it's pure speculation.

4. I'm doing just fine with 4 ALUs. There is a limit to ILP. What use are 6 ALUs if half of them sit idle? And who knows what Zen 3 will bring? Maybe you'll get 6 ALUs and SMT4. Competition has only recently re-emerged in the x86 world. SMT4 may be beneficial for servers, but I question how useful it will be on your average desktop.
 

Yotsugi

Golden Member
Oct 16, 2017
1,029
487
106
So the OS was there, but the market still went with x86.
And the market will continue going with x86, since x86 vendors still rock.
Particularly with AMD burying merchant ARM servers and the likes of it.
I'm doing just fine with 4 ALUs. There is a limit to ILP. What use are 6 ALUs if half of them sit idle? And who knows what Zen 3 will bring? Maybe you'll get 6 ALUs and SMT4. Competition has only recently re-emerged in the x86 world. SMT4 may be beneficial for servers, but I question how useful it will be on your average desktop.
SMT4 is a meme and isn't happening on anything Zen, period.
Also, the ILP wall is still pretty far up there; just getting to it is fairly hard in terms of balancing OoO, power and clocks.
 

nicalandia

Diamond Member
Jan 10, 2019
3,331
5,282
136
SMT4 is a meme and isn't happening on anything Zen, period.
That is a bold statement right there, without any facts as to why 4-way SMT is not possible. I hope it ages well, because I will come back here to quote this post when it doesn't.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
That is a bold statement right there, without any facts as to why 4-way SMT is not possible. I hope it ages well, because I will come back here to quote this post when it doesn't.
There's been no supporting evidence for SMT4, except from that nobody on AdoredTV. One freaking "source" and everyone has lost it.
 

DrMrLordX

Lifer
Apr 27, 2000
22,885
12,941
136
SMT2 seems like it's getting the job done already, so why try to strap more threads onto each core? I don't see much evidence that Zen2 in particular is suffering from enough pipeline stalls/cache misses to present major opportunities for performance gains by trying to take on even more threads.
 

mtcn77

Member
Feb 25, 2017
105
22
91
SMT4 is a meme and isn't happening on anything Zen, period.
But all the while, Zen has been the killer inside joke of the industry. We know they will upgrade the branch predictor as well. So, why stop there? SMT4 and the branch predictor have very similar characteristics. Both are basically prefetchers.
 

Thunder 57

Diamond Member
Aug 19, 2007
4,020
6,734
136
That is a bold statement right there, without any facts as to why 4-way SMT is not possible. I hope it ages well, because I will come back here to quote this post when it doesn't.

Of course it's possible, but is it practical? Would there be a net gain in performance? Those are the questions. I'll gladly come back here and say I was wrong if we see SMT4 on Zen 3, because I'm that confident that it will not happen.
 

amd6502

Senior member
Apr 21, 2017
971
360
136
I'm confident that SMT4 will probably be in Zen eventually.


Short-term: SMT4/SMT6/SMT8
Long-term: XMT=SMT32~~SMT64


I agree, eventually. But I don't think it will go beyond SMT4.

I think near term 4-way MT (entailing SMT2 plus two free low IPC threads) is very good low hanging fruit that will have benefits for perf/watt, i.e. mobile 4c/16t APU and server application. It should also have side benefits of secure speculation free threads that ride almost for free and use very low power.
 

DrMrLordX

Lifer
Apr 27, 2000
22,885
12,941
136
I'm still not sold on SMT4, 8, or anything else. Necessarily. What's the real advantage of 4c/16t over 8c/16t? Fewer transistors? Lower power consumption? And what would be the drawbacks of relying on SMT4? ARM designers went in a completely different direction by just adding a bunch of little extra cores to their SoCs via big.LITTLE/DynamIQ. They were very successful doing so.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,811
1,290
136
What's the real advantage of 4c/16t over 8c/16t?
Technically, the only comparison that would work is going from 8c/16t to at least 10c/40t, up to 16c/64t. It should also be noted that it could be dynamic SMT: core x can be 4-threaded, 3-threaded, 2-threaded or single-threaded, depending on what the branch predictor tells the thread predictor or whatever. With the option of custom thread occupation; 1-thread to OS + 3-thread to "user".
8c/16t (no HW control; either SMT-on or SMT-off) -> 10c/10t to 40t (SMT-4, SMT-3, SMT-2, SMT-off via HW control)
 

VirtualLarry

No Lifer
Aug 25, 2001
56,583
10,223
126
With the option of custom thread occupation; 1-thread to OS + 3-thread to "user".
You've obviously never heard of symmetric multi-processing (SMP) OSes... like Windows.

Hint: True SMP OSes don't "reserve cores/threads" or differentiate between OS and applications. Any process thread can execute on any CPU core/thread, unless affinity is specified for some reason, generally NUMA or HT and performance concerns.
 

amd6502

Senior member
Apr 21, 2017
971
360
136
Implementing them needs to have benefits. The benefits of big.LITTLE are clear in the mobile space (power efficiency). What about this "background thread" idea? How common is it, outside of servers, to have a fully loaded core that is stalled on FPU/memory but somehow has free resources for a thread that is ALU/AGU heavy? What is the use case here: mining bitcoin on your web server that is somehow not bound by a response-time SLA, or maybe some cloud BS where you pay less to share a CPU with someone paying cents for "background threads"? For power efficiency it is better to just use the main thread's resources and go to sleep faster, instead of treating your massive core as "in order" and proceeding at P4 speed in 2020.

These "background threads" are better served by big.LITTLE, and I think Apple/Intel/ARM all agree on that. SMT means sharing resources even in some magic dynamic-sharing case; TLB and cache space/bandwidth are as important as any internal CPU queues.


That is a really good point. Race to idle can work well sometimes, and your point of "treating your massive core as [mostly] in order and proceeding at P4 speed in 2020" definitely would need addressing.

I can imagine wide cores need to have some sort of low power mode when low utilization happens for quite a while. So maybe Zen3 can power down the wide core and run in a 2 ALU + 2 AGU mode or similar.

Secondly, the OS could void the tasksetting of high niceness processes to little cores when the system load goes to a low number (say, a load that is below the number of physical cores). In these conditions, all software threads get taskset to the main SMT2 cores.

So, a hardware and software solution is the simplest way and best in my opinion.
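The software half of that idea can be sketched as a small placement policy. This is only an illustration, not a real scheduler: the function name, the niceness cutoff of 10 and the thread-ID lists are all hypothetical, and it assumes each core exposes main SMT2 threads plus extra low-power "small" threads.

```python
from typing import Iterable, Set

# Hypothetical cutoff: processes at or above this niceness count as "background".
NICE_THRESHOLD = 10


def allowed_threads(niceness: int,
                    system_load: float,
                    physical_cores: int,
                    main_threads: Iterable[int],
                    small_threads: Iterable[int]) -> Set[int]:
    """Return the hardware threads a process may be scheduled on."""
    # Light load (fewer runnable tasks than physical cores): drop the
    # pinning so everything runs on the main SMT2 threads.
    if system_load < physical_cores:
        return set(main_threads)
    # Heavier load: high-niceness work is pinned to the small threads.
    if niceness >= NICE_THRESHOLD:
        return set(small_threads)
    return set(main_threads)


# Example: 4 physical cores, threads 0-1 are main SMT2 threads, 2-3 are small.
print(allowed_threads(19, 1.5, 4, [0, 1], [2, 3]))  # light load
print(allowed_threads(19, 6.0, 4, [0, 1], [2, 3]))  # loaded system
```

A real OS would apply the result through its own affinity mechanism; the point here is only the load-dependent switch between the two thread pools.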

It's probably also possible on just a hardware level with virtual cores (letting the chip decide which virtual core gets matched to what physical core). But I think that would be a troublesome way to do it, so it's much better with a software+hardware solution.


NB/edit: I dug up some benches that show two P4-like cores ripping along at a modern-day ~4.5 GHz would materially benefit the bottom-line MT:

versus PD module including L3: https://browser.geekbench.com/geekbench3/compare/7983468?baseline=4120890

vs PD core without L3: https://browser.geekbench.com/geekbench3/compare/8168621?baseline=8585690

It would be optimal to have a fixed-frequency comparison (3 GHz and no boost), but the frequencies above are only ballpark similar and you have to make adjustments in your head, including ignoring the encryption benchmarks, for which the P4 is probably lacking instructions. Riding along at 4+ GHz, a small virtual core would add fairly significantly to the bottom-line multithread score. In fully loaded normal operation for mobile and server (~3 GHz) it would also measurably increase perf/watt.

Geekbench 3 scores: P4 | PD-w/o-L3
Twofish Multicore 3710 | 5305
BZip2 Compress Multicore 3346 | 2966
BZip2 Decompress Multicore 2857 | 4078
JPEG Compress Multicore 4014 | 3806
JPEG Decompress Multicore 4159 | 4832
PNG Compress Multicore 3714 | 3600
PNG Decompress Multicore 4165 | 4390
Sobel Multicore 4428 | 4885
Lua Multicore 2997 | 3722
Dijkstra Multicore 2000 | 2350


Floating Point 1013 | 2020
Floating Point Multicore 2501 | 3136
BlackScholes Multicore 2832 | 3105
Mandelbrot 1114 | 2159
Mandelbrot Multicore 3842 | 4016
Sharpen Filter 900 | 2078
Sharpen Filter Multicore 2187 | 3283
Blur Filter 814 | 1599
Blur Filter Multicore 1811 | 1625
SGEMM 1090 | 2584
SGEMM Multicore 2332 | 4059
DGEMM Multicore 2175 | 3495
SFFT 1078 | 1788
SFFT Multicore 2644 | 2711
DFFT 1132 | 1744
DFFT Multicore 2589 | 2623
N-Body 884 | 1862
N-Body Multicore 2037 | 3171
Ray Trace 1199 | 2675
Ray Trace Multicore 3144 | 4332
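To make the "adjustments in your head" a bit easier, here is a rough sketch that turns the first ten multicore rows above into PD/P4 ratios and a geometric mean. The scores are copied from the table; the function names are made up for illustration, and no clock-speed normalisation is attempted, so treat the result as a ballpark only.

```python
import math

# (test name, P4 score, PD-without-L3 score), copied from the rows above.
MULTICORE_ROWS = [
    ("Twofish", 3710, 5305),
    ("BZip2 Compress", 3346, 2966),
    ("BZip2 Decompress", 2857, 4078),
    ("JPEG Compress", 4014, 3806),
    ("JPEG Decompress", 4159, 4832),
    ("PNG Compress", 3714, 3600),
    ("PNG Decompress", 4165, 4390),
    ("Sobel", 4428, 4885),
    ("Lua", 2997, 3722),
    ("Dijkstra", 2000, 2350),
]


def pd_over_p4_ratios(rows):
    """Map each test name to its PD score divided by its P4 score."""
    return {name: pd / p4 for name, p4, pd in rows}


def geometric_mean(values):
    """Geometric mean, the usual way to average benchmark ratios."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


ratios = pd_over_p4_ratios(MULTICORE_ROWS)
overall = geometric_mean(ratios.values())

for name, ratio in ratios.items():
    print(f"{name:18s} PD/P4 = {ratio:.2f}")
print(f"{'geometric mean':18s} PD/P4 = {overall:.2f}")
```

A ratio below 1.0 means the P4 result is ahead on that test, as with BZip2 Compress.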



If small threads have an average value of even ⅓ of an SMT2 thread, that means MT goes up 33% at the cost of 1 to 2 watts. That's more MT gain than you get from adding a core to a quad-core SoC.
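The arithmetic behind that 33% figure, as a minimal sketch. It assumes two SMT2 threads and two small threads per core, perfect scaling, and that each small thread is worth exactly ⅓ of an SMT2 thread; the function names are hypothetical.

```python
def mt_uplift(smt_threads: int = 2,
              small_threads: int = 2,
              small_thread_value: float = 1 / 3) -> float:
    """Fractional MT gain per core from adding small threads.

    Throughput is counted in units of one SMT2 thread.
    """
    baseline = float(smt_threads)
    with_small = baseline + small_threads * small_thread_value
    return with_small / baseline - 1.0


def extra_core_uplift(cores: int = 4) -> float:
    """Fractional MT gain from adding one core, assuming perfect scaling."""
    return (cores + 1) / cores - 1.0


print(f"small threads: +{mt_uplift():.0%}")
print(f"fifth core:    +{extra_core_uplift():.0%}")
```

Under these assumptions the small threads add one third per core, against one quarter for going from four cores to five.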
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,811
1,290
136
But if they implement some sort of "OS thread + 3 application threads" per core, IN HARDWARE, then that means that it CAN'T RUN WINDOWS... because Windows is SMP. Whoosh!
-> Up until recently, both Xbox One and PlayStation 4 have reserved two entire CPU cores (out of eight available) in order to run the background operating system in parallel with games. <-

With the option of custom thread occupation. All cores can be used by the OS(1-thread) and all cores can be used by the game(3-thread). The OS sees one thread, the developer sees up to three threads, the hardware sees four potential threads, * x cores.

In the case of Windows, CMT/SMT optimization cases;
Two threads or more = reduced performance => push it to another core
Two threads or more = increased performance => keep it on the same core.
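Written out as code, the two cases above amount to a one-line rule. This is purely illustrative: it is not how the Windows scheduler actually decides, and the throughput inputs are hypothetical measurements that a real scheduler would not get handed directly.

```python
def place_extra_thread(throughput_one_thread: float,
                       throughput_shared: float) -> str:
    """Decide where an additional thread should go.

    throughput_one_thread: throughput of the core running a single thread.
    throughput_shared: combined throughput of the same core running two
    or more threads.
    """
    if throughput_shared > throughput_one_thread:
        return "same core"      # sharing increased performance
    return "another core"       # sharing reduced (or did not help) performance


print(place_extra_thread(100.0, 125.0))
print(place_extra_thread(100.0, 90.0))
```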
 

VirtualLarry

No Lifer
Aug 25, 2001
56,583
10,223
126
With the option of custom thread occupation. All cores can be used by the OS(1-thread) and all cores can be used by the game(3-thread). The OS sees one thread, the developer sees up to three threads, the hardware sees four potential threads * x cores.
You just don't do that in hardware. There's no point to it. This sort of stuff is all handled by the OS scheduler, which can use processor core/hardware thread affinity masks to limit OS threads to one thread per 4-threaded core, for example.

Edit: My point was, that if they did that in hardware, it would BREAK WINDOWS on those cores, so that they wouldn't be useful for creating Windows-oriented PCs.
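For anyone wondering what such an affinity mask would look like, here is a minimal sketch. It assumes a hypothetical numbering where core c owns logical CPUs c*4 through c*4+3; real systems enumerate SMT siblings differently, so the layout and the helper names are assumptions.

```python
from typing import List


def one_thread_per_core_mask(num_cores: int, threads_per_core: int = 4) -> int:
    """Bitmask that selects hardware thread 0 of every core.

    Assumes core c owns logical CPUs c*threads_per_core up to
    c*threads_per_core + threads_per_core - 1 (hypothetical layout).
    """
    mask = 0
    for core in range(num_cores):
        mask |= 1 << (core * threads_per_core)
    return mask


def cpus_in_mask(mask: int) -> List[int]:
    """List the logical CPU numbers whose bit is set in the mask."""
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


mask = one_thread_per_core_mask(num_cores=2)
print(bin(mask), cpus_in_mask(mask))
```

On Linux, a CPU list like this could be handed to `os.sched_setaffinity`; Windows takes a bitmask through its own affinity APIs. Either way the restriction lives in the OS scheduler, not in the hardware.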
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,811
1,290
136
You just don't do that in hardware. There's no point to it.
Windows on Intel already leaves P-state control to the hardware; it's called Speed Shift. Why wouldn't it be believable for them to extend Hyperscheduler to Windows?

-> Most of the current available solutions to the power saving problem are based on a software routine that requests a power management system to enable the power saving. Further, the corresponding power saving schedules are often created in a static and/or a manual fashion, which is error prone. Moreover, neither the dynamic power saving methods nor the static power saving methods do provide an accurate prediction of the power consumption.
-> It is an object of the invention to provide for a hardware task scheduler with an improved power saving efficiency. In order to achieve the object defined above, a hardware task scheduler, a multiprocessing system, and a hardware-based power saving method are provided.
-> The hardware task scheduler may implement scheduling policies supporting heterogeneous multi-core architectures, where each of the processor cores can be multi-threaded or single-threaded.
-> However, if this processor core is overloaded (for instance, in case that the processing element is multi-threaded and/or virtualized, in which case other tasks may be assigned to virtual processing elements physically mapped to the processing element, in addition to task currently running on it), the load balancing method may recommend running the task on some another processor core.


HW scheduling is faster and more power efficient.

AMD Zen Automotive => Hardware Realtime SMP? or Software Realtime SMP?

=> Advanced Technologies Group is a startup group with a mission to advance the future of safer transportation as we look to bring millions of drivers into assisted and automated driving. We develop artificial intelligence perception software for driver-assistance and automated driving, with a focus on implementing efficient deep neural networks on AMD’s automotive-grade processors.

--> That said, it is not surprising that TSMC has already taped out the first chip using its N7+ technology. Furthermore, the company is prepping a specialized version of the process aimed at the automotive industry, which indicates that N7+ is going to be a “long” node.
https://news.synopsys.com/2018-10-0...rade-IP-in-TSMC-7-nm-Process-for-ADAS-Designs
https://news.synopsys.com/2017-09-1...fied-for-TSMCs-Advanced-7-nm-FinFET-Plus-Node
https://news.synopsys.com/2018-04-3...-High-performance-7-nm-FinFET-Plus-Technology
^-- AMD partner for most of their IP.

Zen1(K8) -> Zen2(Greyhound) -> Zen3(New core). For anyone asking, this is my official position.
 