Speculation: Ryzen 4000 series/Zen 3


Yotsugi

Golden Member
Oct 16, 2017
1,029
487
106
Of course FLOPS is compute
If only.
I was talking about hardware AND software
No, it's mostly about hardware.
because CUDA had essentially created a software monopoly in the GPGPU market.
CUDA had humble beginnings.
And would be worth nothing if nV didn't seed the capable hardware into academia all these years ago.

nV was there first and actively iterated on their h/w.
 

soresu

Platinum Member
Dec 19, 2014
2,662
1,862
136
CUDA had humble beginnings.
I've not been around that long, but long enough to know humble and nVidia are not things that belong in the same sentence, in any context - maybe before the fall of 3dfx, I barely even noticed that they existed back then.

Just because they sent hardware to academia doesn't mean their aim was not a monopoly then. If anything, it simply shows some premeditation towards creating such a monopoly. GPGPU was a pretty nascent field back then, so academia would be the perfect place to get a user base hooked on a specific brand before they got hired by companies or started their own.
nV was there first and actively iterated on their h/w.
Err... ATI had something of their own too, Brook I believe it was called or something similar sounding, and of course ATI/AMD iterated on their hardware too, what a silly thing to say - Terascale1 was not great for compute, but it was clearly a step forward over their previous architecture.

If anything I would say that AMD buying ATI may well have caused them to stumble somewhat at such an early stage in GPGPU development, not that the HD 2000 misfire did them any good for that matter.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
CUDA had humble beginnings.
And would be worth nothing if nV didn't seed the capable hardware into academia all these years ago.

nV was there first and actively iterated on their h/w.

Close, but even though CUDA wasn't always as fully featured as it is now, it still laid the foundation for the right programming model, and thus it ended up being the right tool for academia as well ...

Academia does not want to keep kernel code in separate files; they want host and kernel code together, hence the "single-source" model where both live in a single file! Everybody wants the ability to execute C++ templates on their GPUs ...
 

Yotsugi

Golden Member
Oct 16, 2017
1,029
487
106
but long enough to know humble and nVidia are not things that belong in the same sentence
I didn't say nV was humble.
Just that CUDA had very humble beginnings (and it's not like nV DC biz was actually making money before the ML boom).
Just because they sent hardware to academia doesn't mean their aim was not a monopoly then
Who cares what their intent was, they had the more capable hardware and they seeded it.
and of course ATI/AMD iterated on their hardware too, what a silly thing to say
Not in a way nV iterated on their GPGPU capabilities.

I don't like nV, but credit where credit is due: they did a lot during GPGPU's origins.
 

soresu

Platinum Member
Dec 19, 2014
2,662
1,862
136
they had the more capable hardware
Says who?

If CUDA got the most exposure in academia - who, let's face it, were the only ones doing serious GPGPU back then - how can you even be sure that AMD/ATI hardware was fairly tested against it?

HD 2000 was a wash certainly, but the succeeding products were not, and I don't remember seeing a great deal of CUDA type work in the mainstream during the 8800-9800-180 rinse and repeat shrink Tesla uArch era - not exactly blowing my skirt up on the HW innovation front with so many shrinks of basically the same chip....

Again, this is a reason why nVidia purposefully lagged on their OpenCL implementation, it made directly comparing with AMD's products on compute a very difficult proposition, for the same reasons that AMD has to slog it on their own porting new versions of TensorFlow, or Octane on AMD platforms never materialised properly until now (they're going Vulkan compute and RT).
Not in a way nV iterated on their GPGPU capabilities.
You keep saying this, can you elaborate rather than just sounding mysterious.

Now I will grant you that Fermi was a serious iteration on Tesla, but AMD followed with a significant iteration on Terascale (VLIW4), and not so long after with GCN - I don't see nVidia's rate of hardware iteration being better here unless you count a shrink as an iteration.

I'm not saying CUDA itself didn't see activity during that era, I'm pretty sure it had a ton of development, but the pre Fermi uArch changes not so much.
 

soresu

Platinum Member
Dec 19, 2014
2,662
1,862
136
I don't like nV, but credit where credit is due: they did a lot during GPGPU's origins.
They also created a monopoly on GPGPU through CUDA - they created a walled garden yay!

For years I have wondered if Neil Trevett's only reason for being at Khronos is to hamper OpenCL, because nVidia certainly has no positive interest in developing it.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
They also created a monopoly on GPGPU through CUDA - they created a walled garden yay!

For years I have wondered if Neil Trevett's only reason for being at Khronos is to hamper OpenCL, because nVidia certainly has no positive interest in developing it.

It wasn't solely Nvidia at fault for hampering OpenCL ...

Mobile vendors don't care about professional compute so their driver/compiler implementations often failed in more complex scenarios so the likes of ARM/Qualcomm are arguably worse than Nvidia in terms of OpenCL support ...

AMD gave up on OpenCL after producing 3 different driver stacks (Orca/PAL/ROCm) and nobody in the HSA Foundation wanted what they were having as well so they moved on to HIP which will eventually give them PyTorch and Tensorflow support. I hear it from former AMD employees all the time in the ways they admire CUDA ...

Nobody also wanted to work on Mesa's clover stack so not even the community cares if there even is one ...

Face it, Nvidia isn't the only issue when nobody else cares about OpenCL aside from Intel ...
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Slower than L3 for sure, but a damn sight lower latency than DDR mounted on the motherboard, not to mention lower power draw too as the distance is so much less to travel - if they ever mount HBM on the CCX chiplets it would make a killer L4.

I don’t think it actually would make a good L4; it is still DRAM and it is quite slow to read. It is pretty unclear where AMD is going next. I don’t know if it is really worth it to put cpu chiplets on an interposer. The cpu chips in Rome are around 600 square mm by themselves. It is over 400 more for the IO die. That would be a very big and expensive interposer possibly without much benefit.
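A rough back-of-envelope for the area argument above, using only the approximate figures from this post (about 600 mm² of CPU chiplets, a bit over 400 mm² of IO die). The reticle figure is an assumption brought in for scale (a 26 mm x 33 mm exposure field), and spacing between dies is ignored, so treat the result as a lower bound rather than a real floorplan.

```python
# Back-of-envelope interposer area for a Rome-like package.
# Figures are the approximate ones quoted in the post, not measured values.
CPU_CHIPLETS_MM2 = 600.0   # all CPU chiplets combined (approx., from the post)
IO_DIE_MM2 = 400.0         # IO die (the post says "over 400")

# Assumption: a single lithography exposure field of 26 mm x 33 mm.
RETICLE_LIMIT_MM2 = 26.0 * 33.0


def min_interposer_area(die_areas_mm2):
    """Lower bound: the interposer must at least cover every die placed on it."""
    return sum(die_areas_mm2)


full_package = min_interposer_area([CPU_CHIPLETS_MM2, IO_DIE_MM2])
io_only = min_interposer_area([IO_DIE_MM2])

print(f"CPU + IO on one interposer: >= {full_package:.0f} mm^2 "
      f"({full_package / RETICLE_LIMIT_MM2:.2f}x one reticle field)")
print(f"IO die only:                >= {io_only:.0f} mm^2 "
      f"({io_only / RETICLE_LIMIT_MM2:.2f}x one reticle field)")
```

Under these assumptions the everything-on-interposer case already exceeds a single exposure field, while an IO-only interposer fits comfortably inside one.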

I could see them going with an interposer for the IO die though; an active interposer would make a huge amount of sense. They could place all of the larger, higher power transistors needed to drive external interfaces into the interposer and then stack 7 or 5 nm chiplets on top for the logic. Interposers add a lot of flexibility, so it is hard to predict what configuration they would choose. The IO interposer may actually be smaller than the current IO die since it would have twice the effective die area. They may be limited by the number of pads required for all of their interfaces though. It has a huge number of signal pins that may limit how small it can be.

I am assuming that the current IO die arranges the memory controllers into four 128-bit controllers. The infinity fabric is now 256-bit read and 128-bit write, so they need a 128-bit memory controller operating at DDR rate (effectively 256-bit per clock) to supply it. I could see them using four separate 7 or 5 nm memory controller chiplets with a huge amount of SRAM cache on each one. They could connect to each other with very wide paths at low clocks to save power. Cache scales very well, so it could be a very large cache.
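The width matching in that paragraph can be checked with a few lines of arithmetic. This is only a sketch: it assumes the fabric and the memory controller run at the same clock, and the DDR4-3200 speed grade is my own illustrative choice, since the post does not name one.

```python
# Sanity check of the bandwidth matching described above.
# Assumption: fabric and memory controller run at the same clock.

def bits_per_clock(bus_width_bits: int, transfers_per_clock: int) -> int:
    """Bits moved per clock for a bus of the given width."""
    return bus_width_bits * transfers_per_clock


def peak_mb_per_s(bus_width_bits: int, mega_transfers_per_s: int) -> int:
    """Peak bandwidth in MB/s: bytes per transfer times MT/s."""
    return bus_width_bits // 8 * mega_transfers_per_s


FABRIC_READ_BITS = 256    # per clock, from the post

# One 128-bit controller at double data rate: two transfers per clock.
controller_bits = bits_per_clock(128, 2)
print(controller_bits, "bits/clock from one controller;",
      "matches fabric read width:", controller_bits == FABRIC_READ_BITS)

# Assumption: DDR4-3200 (3200 MT/s); the post does not name a speed grade.
per_controller = peak_mb_per_s(128, 3200)
total = 4 * per_controller   # four controllers, as guessed in the post
print(f"{per_controller / 1000:.1f} GB/s per controller, "
      f"{total / 1000:.1f} GB/s across four")
```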

They would also need another few chiplets for the IO and fabric logic but that isn’t as important as the memory latency, so it doesn’t need to be as tightly coupled. This may be where the 15 chiplet rumors come from. They may have room on the Epyc package for 2 more cpu die, bringing the total up to 80 cores.

This would require a huge amount of redesign of just about everything though, so something like this may not be coming until Zen 4 or 5. I would guess at least Zen 4. They may do something like this with the switch to DDR5. I suspect Zen 3 will have a similar IO die; it is already massive overkill on the IO. I think they will probably focus on core improvements. They may tweak the IO die design for better latency. They could even shrink it and add some L4 cache, although I don’t know if 7 nm makes sense. A shrink could also make room for more cpu die, although IO doesn’t scale well.

For core improvements, I don’t think SMT 4 is that outlandish. Zen was designed with server in mind. It had a giant infinity fabric switch and 4 IFOP links that were completely wasted in the consumer space. Zen 2 gets rid of that so the consumer design doesn’t have all of the extra server components wasting die area. You generally wouldn’t want SMT 4 for the consumer space, so if they do implement it, it may be not supported on consumer parts or disabled by default.

What enthusiasts don’t seem to get is that for a lot of servers, all of that AVX hardware is completely wasted. A lot of servers perform almost zero FP operations. A lot of server code is very branch intensive with hard to predict branches. They also have large memory footprints that reduce the effectiveness of caches. Server code often achieves an IPC of much less than one. I have profiled code with processor counter monitor and even seemingly compute heavy code often only achieves an IPC of around 1.

Certain types of server code can make good use of SMT since it can’t achieve very high IPC anyway, so you can run some extra threads and get much more throughput. It is a good way of sharing things like the FP units that go mostly unused in many servers. Earlier processors have done 4 and 8 way SMT. The SPARC T-series processors went up to 8-way a long time ago. The issue with those was that they were made on 65 nm for the T2 with 8 cores / 8 threads per core. That is maybe a couple hundred million transistors, but it performed well for some specific applications. Zen 2 is already close to 4 billion transistors just on the cpu die. With 7 nm+ or 5 nm, they can afford to duplicate or intelligently share a lot of resources that would not have been possible with earlier implementations. Also, for those that don’t know, SMT is often not useful for HPC applications. It depends on the application, but many HPC applications are compute intensive enough that SMT can hurt performance. It is often disabled on HPC machines.

If they don’t increase the actual core count with Zen 3, then doubling the number of logical cores could be a good substitute if they throw enough hardware at it. I wouldn’t mind having a 512-thread machine for compiling code, even if it is *only* 128 physical cores.
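To make the SMT argument concrete, here is a deliberately crude toy model, not a simulation of any real core. It assumes thread IPCs simply add up until they hit a fixed core width; the width of 4 and the IPC of 3.0 for a dense kernel are hypothetical values. It ignores cache contention, which is exactly why it can only show SMT failing to help dense code, not actively hurting it.

```python
# Toy model of SMT throughput on a single core. This is an idealized upper
# bound: it ignores cache contention and any per-thread slowdown from sharing.

def smt_core_ipc(per_thread_ipc: float, threads_per_core: int,
                 core_width: float) -> float:
    """Aggregate IPC: thread IPCs add up until the core's width caps them."""
    return min(core_width, per_thread_ipc * threads_per_core)


def logical_threads(physical_cores: int, threads_per_core: int) -> int:
    """Logical thread count exposed to the OS."""
    return physical_cores * threads_per_core


CORE_WIDTH = 4.0    # assumed sustainable instructions per cycle (hypothetical)
SERVER_IPC = 0.5    # "much less than one", per the post
DENSE_IPC = 3.0     # hypothetical compute-heavy HPC kernel

for name, ipc in (("branchy server code", SERVER_IPC),
                  ("dense HPC kernel", DENSE_IPC)):
    for smt in (1, 2, 4):
        total = smt_core_ipc(ipc, smt, CORE_WIDTH)
        print(f"{name:20s} SMT{smt}: core IPC {total:.1f}")

print(logical_threads(128, 4), "logical threads on 128 physical cores at SMT4")
```

In this model the low-IPC server workload scales linearly up to SMT4, while the dense kernel saturates the core at SMT2 and gains nothing further.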

I kind of doubt that we will see AVX512. I always saw AVX512 as intel’s kludge to try to make their cpus perform like gpus. From what I have heard about Xeon Phi from HPC people, it didn’t really work that well and intel is now designing their own gpu anyway. If you have something that can really take advantage of 512-bit vectors, then you probably should look at running it on a gpu. They may go up to 4 full 256-bit FMA units, but increasing the width again also requires increasing all of the interconnect to feed the new units. The die area may not be the issue. The interconnect may just burn too much power.

I guess I kind of expect some bigger core updates with Zen 3 and the IO not changing that much. The big IO changes may come with Zen 4 with DDR5 and, I guess, possibly pci-express 5.0. They could certainly tweak the IO die with Zen 3, but I don’t think there will be radical changes. Moving to an interposer requires completely redesigning everything to take advantage of wide internal paths; it seems too soon for that. They will need to redesign for DDR5 anyway. I don’t expect HBM because it would probably require an interposer and therefore a complete redesign. It also just isn’t a good cache.
 

Atari2600

Golden Member
Nov 22, 2016
1,409
1,655
136
For core improvements, I don’t think SMT 4 is that outlandish. Zen was designed with server in mind. It had a giant infinity fabric switch and 4 IFOP links that were completely wasted in the consumer space. Zen 2 gets rid of that so the consumer design doesn’t have all of the extra server components wasting die area. You generally wouldn’t want SMT 4 for the consumer space, so if they do implement it, it may be not supported on consumer parts or disabled by default.

What enthusiasts don’t seem to get is that for a lot of servers, all of that AVX hardware is completely wasted. A lot of servers perform almost zero FP operations. A lot of server code is very branch intensive with hard to predict branches. They also have large memory footprints that reduce the effectiveness of caches. Server code often achieves an IPC of much less than one. I have profiled code with processor counter monitor and even seemingly compute heavy code often only achieves an IPC of around 1.

Certain types of server code can make good use of SMT since it can’t achieve very high IPC anyway, so you can run some extra threads and get much more throughput. It is a good way of sharing things like the FP units that go mostly unused in many servers. Earlier processors have done 4 and 8 way SMT. The SPARC T-series processors went up to 8-way a long time ago. The issue with those was that they were made on 65 nm for the T2 with 8 cores / 8 threads per core. That is maybe a couple hundred million transistors, but it performed well for some specific applications. Zen 2 is already close to 4 billion transistors just on the cpu die. With 7 nm+ or 5 nm, they can afford to duplicate or intelligently share a lot of resources that would not have been possible with earlier implementations.

Careful now.

Following along that train of thought, Bulldozer and derivatives should have excelled in the server marketplace.

We know that they obviously didn't, so I don't think sharing FP units among too many threads is something AMD will go back to in a rush. Perhaps that was due to a crap front end that was insufficient for even the weak pipes it had or there were other bottlenecks - and the high level design is carrying the blame.
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
Careful now.

Following along that train of thought, Bulldozer and derivatives should have excelled in the server marketplace.

We know that they obviously didn't, so I don't think sharing FP units among too many threads is something AMD will go back to in a rush. Perhaps that was due to a crap front end that was insufficient for even the weak pipes it had or there were other bottlenecks - and the high level design is carrying the blame.
IMHO Bulldozer was crap because of its 2xALU core (K10 had 3xALUs). The shared FPU was a great thing in the BD design (despite being very weak, it delivered fair performance). Zen2 still has a quite similar FPU design to this day, just more powerful (4x 256-bit pipes).

Knowing that Jim Keller is developing the super-wide Golden Cove core at Intel (probably 6xALU), I assume Intel hired him to develop an answer to Zen3. This supports the rumor that Zen3 is a wider 6xALU core. SMT4 is also more effective on a wider core. All the puzzle pieces fit together.

I guess I kind of expect some bigger core updates with Zen 3 and the IO not changing that much.
I agree. There is no reason to change IO die for desktop. Maybe new revision for improved stability if needed.
 
  • Like
Reactions: darkswordsman17
Mar 11, 2004
23,075
5,557
146
IMHO Bulldozer was crap because of its 2xALU core (K10 had 3xALUs). The shared FPU was a great thing in the BD design (despite being very weak, it delivered fair performance). Zen2 still has a quite similar FPU design to this day, just more powerful (4x 256-bit pipes).

Knowing that Jim Keller is developing the super-wide Golden Cove core at Intel (probably 6xALU), I assume Intel hired him to develop an answer to Zen3. This supports the rumor that Zen3 is a wider 6xALU core. SMT4 is also more effective on a wider core. All the puzzle pieces fit together.


I agree. There is no reason to change IO die for desktop. Maybe new revision for improved stability if needed.

Yeah Bulldozer's issues were more than the shared FPU aspect. That would've been interesting if AMD hadn't botched so much of the rest of the design (and then realized it, leading them to abandon it about as quickly as they could). I could see some return to some of the ideas, especially if they have the rest in place to make it actually successful.

I actually think there is good reason to change the I/O, although it's mostly wishful thinking on my part. Integrate a bit of HBM into the I/O die (just one stack on top of the I/O; keep complexity low, and stacking it removes the need for an interposer) to serve as general cache for the whole system, further helping latency and aiding a unified memory space. It should be especially beneficial for giving a boost to GPU performance (which if nothing else will let them trial some CPU and GPU chiplet designs if they're not ready to go terribly far there).
 

Veradun

Senior member
Jul 29, 2016
564
780
136
I agree. There is no reason to change IO die for desktop. Maybe new revision for improved stability if needed.

I believe IOD will be renewed in 2021 to add DDR5. Meanwhile they'll focus on the compute die.

There's probably space for a Zen3+ in 2021 with the same compute die as Ryzen 4000 (and better binning with process optimization) but with a new platform, AM5. They'll probably shrink the IOD only at that point.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Careful now.

Following along that train of thought, Bulldozer and derivatives should have excelled in the server marketplace.

We know that they obviously didn't, so I don't think sharing FP units among too many threads is something AMD will go back to in a rush. Perhaps that was due to a crap front end that was insufficient for even the weak pipes it had or there were other bottlenecks - and the high level design is carrying the blame.

Zen is already sharing FP units (and a bunch of other units) between two threads. There are huge amounts of server code that execute near zero FP instructions, so it isn’t an issue. Bulldozer was more towards a speed racer design anyway, which is generally terrible on server code. Branch heavy code with hard to predict branches gets hit hard by mispredict penalties. Also, they were trying to do 8-core at 28 nm. A lot of sacrifices probably had to be made. It did perform relatively well on a small subset of well threaded code, but pervasive use of multithreaded code has been slow to take off.

Hardware always has to lead software though. No one is going to optimize their software for 8 core / 16 thread if 99 percent of the market is 4 core or less. Intel (and nvidia in other ways) really held software development back by staying with 4 core for so long. Taking advantage of 4 cores is trivial. Going beyond that takes a little extra work. We could have had 8 cores mainstream at 20 nm easily. If that had happened, multi-threaded game engine development would have been years ahead of where it is now. It is rather ridiculous that game consoles and cell phones actually probably pushed multi-threaded development more than PCs. A lot of phones had 6 or 8 cores and the consoles went with 8 low power cores a long time ago (2013?), yet mainstream PCs were stuck at 4 until Zen.

And I launched into a core count rant again. It annoys me that I am stuck with a 4 core / 4 thread Xeon at work also. I’ll just stop now.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
I believe IOD will be renewed in 2021 to add DDR5. Meanwhile they'll focus on the compute die.

There's probably space for a Zen3+ in 2021 with the same compute die as Ryzen 4000 (and better binning with process optimization) but with a new platform, AM5. They'll probably shrink the IOD only at that point

I could see them doing a shrink to GlobalFoundries' new 12LP+ or something for the IO die. I don’t know if they can switch over to TSMC completely. Technically, they probably could have a generation of compute die that could be used with either DDR4 or DDR5, just paired with a different IO die.

The IO die for desktop and Epyc may diverge. There isn’t much reason to use an interposer for the tiny desktop IO die, but there would be some benefit for the giant Epyc IO die where L4 cache would be desirable. The way all of the CCXs communicate with the IO die makes me think that L4 will be a thing. I initially thought that CCXs on the same die would be able to talk directly to each other, but that doesn’t seem to be the case. They may be making some more specialized or custom components for HPC and other niche markets also.

With how quickly AMD is moving on design iterations, I don’t know if we will see any Zen3+. After ThreadRipper, I would hope that the next part is a monolithic 7 nm (or 7 nm+) APU for mobile. With the way they are doing things, that may be branded as a Ryzen 4000, even though it will probably be Zen 3 based, perhaps with some design tweaks for lower power.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Yeah Bulldozer's issues were more than the shared FPU aspect. That would've been interesting if AMD hadn't botched so much of the rest of the design (and then realized it, leading them to abandon it about as quickly as they could). I could see some return to some of the ideas, especially if they have the rest in place to make it actually successful.

I actually think there is good reason to change the I/O, although it's mostly wishful thinking on my part. Integrate a bit of HBM into the I/O die (just one stack on top of the I/O; keep complexity low, and stacking it removes the need for an interposer) to serve as general cache for the whole system, further helping latency and aiding a unified memory space. It should be especially beneficial for giving a boost to GPU performance (which if nothing else will let them trial some CPU and GPU chiplet designs if they're not ready to go terribly far there).
If you're mounting HBM on top of the IO die, then it essentially is an active interposer and would need to be manufactured like an interposer, with through-silicon vias and such. For a higher end laptop, my preference would be an 8 core APU connected to a small, discrete, HBM based GPU for when more performance is necessary. That seems to make the most sense.
 

DrMrLordX

Lifer
Apr 27, 2000
21,632
10,845
136
Zen is already sharing FP units (and a bunch of other units) between two threads.

SMT and CMT are not the same thing, though. The way Zen/Zen+/Zen2 "share" FP resources between threads does not result in as many slowdowns in fp-heavy code as the way BD did it. If I have relatively unoptimised fp code using something old like . . . SSE3, I can get good performance loading 2t per core on Zen while on BD I get very little improvement moving from 1t per module to 2t per module.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,351
1,537
136
SMT and CMT are not the same thing, though. The way Zen/Zen+/Zen2 "share" FP resources between threads does not result in as many slowdowns in fp-heavy code as the way BD did it. If I have relatively unoptimised fp code using something old like . . . SSE3, I can get good performance loading 2t per core on Zen while on BD I get very little improvement moving from 1t per module to 2t per module.

This, by the way, has nothing to do with the FP units themselves. The limiting factor for FP-heavy code on BD is practically always store throughput. There are always free cycles available at the execution pipes, simply because the write-through L1 and crappy L2 throughput means that even a single thread on a module can typically completely exhaust available stores. And SIMD code typically has more store demand than scalar code. In this way, the shared FPU is the largest red herring in cpu design I've ever seen; people look at the lackluster FP performance, look at how the FPU is shared, and blame that for it. The reality is that the FPU on BD is overprovisioned. It's more powerful than the rest of the module can support, and it's never the thing slowing you down. The shared FPU was fine. It's all the rest that was terrible.
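The "overprovisioned FPU" point can be illustrated with a toy bottleneck model. Every rate in it is a made-up illustrative number, not a measured Bulldozer figure, and the function name is hypothetical. It only shows the shape of the argument: once the shared store path is the smallest limit, adding a second thread to the module changes nothing and the FPU is never the binding constraint.

```python
# Toy bottleneck model for one Bulldozer-style module running an FP loop.
# All rates below are made-up illustrative numbers, not measured BD figures.

def module_iters_per_cycle(threads, per_thread_peak, fp_ops_per_iter,
                           stores_per_iter, fp_ops_per_cycle,
                           stores_per_cycle):
    """Return (loop iterations per cycle for the module, limiting resource)."""
    limits = {
        "thread issue": threads * per_thread_peak,
        "shared FPU": fp_ops_per_cycle / fp_ops_per_iter,
        "store path": stores_per_cycle / stores_per_iter,
    }
    bottleneck = min(limits, key=limits.get)
    return limits[bottleneck], bottleneck


# Hypothetical loop: one FP op and one store per iteration.
loop = dict(per_thread_peak=1.0, fp_ops_per_iter=1.0, stores_per_iter=1.0,
            fp_ops_per_cycle=4.0)

weak_1t = module_iters_per_cycle(threads=1, stores_per_cycle=0.5, **loop)
weak_2t = module_iters_per_cycle(threads=2, stores_per_cycle=0.5, **loop)
healthy_2t = module_iters_per_cycle(threads=2, stores_per_cycle=3.0, **loop)

print("weak store path, 1 thread: ", weak_1t)
print("weak store path, 2 threads:", weak_2t)
print("healthy store path, 2 threads:", healthy_2t)
```

With the weak store path, one and two threads give the same module throughput; with a healthier store path the same shared FPU lets two threads scale.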

In any case, CMT is never coming back. Corporations don't always do the rational thing. Even if CMT was somehow an actually good decision going forwards, AMD will never use it again because the people involved who championed it are gone, and no-one wants to tie their name to failure by picking up the torch for a design element that is now seen as a really bad idea.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
AMD will never use it again because the people involved who championed it are gone
Boxborough cores design team is still up at AMD. They did Steamroller/Excavator, btw. They are also headed by a Bulk/FDSOI guy, not sure if that has changed.

They have three CMT implementations to pick from; Keller's(~1998+), Glew's(~2002+), and Moore's(~2004+) implementation.

If the move from Picasso's RAVEN/[RAVEN2] to Dali's [RAVEN1] brings any design differences that make it a performance successor to RAVEN2, it could open room for the Boxborough core successor to SR/XV to be placed where RAVEN2 was. So, Dali can be RAVEN1 and a SKU w/ the third core from Boxborough.

If I see Stoney again in Ryzen 4000... please, please don't be A9-9435/A6-9235/A6-9225C/A4-9125C/E2-9025C.
 

Ajay

Lifer
Jan 8, 2001
15,454
7,862
136
Knowing that Jim Keller is developing the super-wide Golden Cove core at Intel (probably 6xALU), I assume Intel hired him to develop an answer to Zen3. This supports the rumor that Zen3 is a wider 6xALU core. SMT4 is also more effective on a wider core. All the puzzle pieces fit together.

It seems that Keller was brought in to really up Intel's game on server grade CPUs. I suspect that the server group will, finally, be driving core design and the client group will mainly be tweaking the design for desktop/laptop thermal envelopes. Keller, at this point, is really a systems guy (design, prototyping (FPGA), implementation, and QA). I'm guessing that he really wants to get back to his roots in design.

I would be surprised if Zen3 goes wider. It appears that AMD's focus for Zen3 is power efficiency - going wider would blow that up, I think.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,675
3,801
136
I would be surprised if Zen3 goes wider. It appears that AMD's focus for Zen3 is power efficiency - going wider would blow that up, I think.

I've read that regarding power and Zen 3. I hope they go wider; after all, Intel is. Zen is already incredibly efficient; we have 65W 16 core CPUs on the way. They should be able to go wider and maintain power by going to 12nm+ for the I/O die and 7nm+ for the chiplets.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
Zen3 on the device/node level is focusing on higher transistor density and increased power efficiency. However, Family 19h(Zen3) can potentially be another Family 17h(Zen-Zen2) w/ another >40% IPC boost w/ same energy level. Higher density/efficiency can mean larger caches, more ALUs, more FPUs, larger queues, etc.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,675
3,801
136
Zen3 on the device/node level is focusing on higher transistor density and increased power efficiency. However, Family 19h(Zen3) can potentially be another Family 17h(Zen-Zen2) w/ another >40% IPC boost w/ same energy level.

I don't think AMD catches lightning in a bottle twice in such rapid succession. There isn't much in the way of rumors ATM, but I would expect another 15%, maybe 20%. Once we start hearing more, I will gladly adjust those numbers up or down.
 

Ajay

Lifer
Jan 8, 2001
15,454
7,862
136
I don't think AMD catches lightning in a bottle twice in such rapid succession. There isn't much in the way of rumors ATM, but I would expect another 15%, maybe 20%. Once we start hearing more, I will gladly adjust those numbers up or down.

TSMC's 7+ node allows for a ~10% decrease in power and a ~15% decrease in area for the same implementation, due solely to different transistor characteristics (which may have only been achievable via EUV, not sure). Anyway, more execution resources (at peak utilization), larger cache sizes and such eat more power. I think on some Intel designs, features could only be added if they gave a 2% performance boost for the cost of 1% more power use. That could be what AMD have done - and it surely wasn't easy.
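As a sketch of how such a rule could be applied, the snippet below filters a list of candidate features by the 2%-performance-per-1%-power ratio and then fits the survivors into a power budget. The feature names and numbers are entirely hypothetical, gains and costs are treated as simply additive, and the 10% budget is just the node's quoted power saving reused as headroom; none of this reflects an actual AMD or Intel process.

```python
# Sketch of a "2% performance per 1% power" feature gate.
# Feature names and numbers are hypothetical; gains/costs are treated as additive.

def passes_rule(perf_gain_pct: float, power_cost_pct: float,
                ratio: float = 2.0) -> bool:
    """A feature qualifies if its gain is at least `ratio` times its power cost."""
    return perf_gain_pct >= ratio * power_cost_pct


def pick_features(candidates, power_budget_pct: float):
    """Walk candidates in order, keeping qualifying ones that fit the budget."""
    chosen, perf, power = [], 0.0, 0.0
    for name, gain, cost in candidates:
        if passes_rule(gain, cost) and power + cost <= power_budget_pct:
            chosen.append(name)
            perf += gain
            power += cost
    return chosen, perf, power


# (name, performance gain %, power cost %) -- all made up for illustration.
candidates = [
    ("feature A", 4.0, 1.5),
    ("feature B", 1.0, 1.0),   # fails: only 1% gain per 1% power
    ("feature C", 6.0, 3.0),
    ("feature D", 5.0, 2.0),
    ("feature E", 8.0, 5.0),   # fails: 8% gain is less than 2 x 5%
]

# Assumption: spend roughly the node's ~10% power saving as headroom.
chosen, perf, power = pick_features(candidates, power_budget_pct=10.0)
print(chosen, f"perf +{perf:.1f}%", f"power +{power:.1f}%")
```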