Speculation: Ryzen 4000 series/Zen 3

french toast · Apr 22, 2020

moinmoin said:
While I totally agree this is what is going to happen, I personally wish the control over SMT would move into the processor, making it decide itself which amount of logical threads are most efficient for handling a given workload.

(And to be honest I'm fed up of discussing Windows as the obstacle to progress in CPU features.)

That sounds interesting.. A bit like selective 4wd in cars these days?
There would probably be some use cases that trip it up, just like there are efficiency downsides with the extra gubbins those fancy 4wd's carry even in 2wd mode over a pure 2wd vehicle... whilst never being quite as capable as a proper full-time 4wd.

Seriously unlikely for Ryzen anytime soon... Epyc? That makes a lot more sense, and as this is a new uarch I wouldn't be surprised if they increase the core resources now to allow for SMT4 on 5nm for Epyc. With a wider core and the lower-clocked, throughput-oriented nature of datacentres, server CPUs are ripe for something like this at the right time.

Veradun · Apr 22, 2020

DannyH246 said:
Does anyone have any thoughts on fabrication options for the IO die moving forward? Currently it is manufactured on 14nm at GloFo. Will the Zen 3 IO die be manufactured on GloFo's enhanced 12nm process? Or TSMC's 7nm?
Another question I had: would AMD ever consider GloFo's FD-SOI process for any future IO die?

I was thinking about this the other day. Is it possible to go 12FDX? Does it have the requirements to be used for the motherchip?

DisEnchantment · Apr 22, 2020

Gideon said:
The only clients that have workflows that actually benefit from SMT-4 would know how to enable it in BIOS and would mostly be running linux anyway. Enabling SMT-4 out of the box on consumer chips IMO just seems dumb.

For single-socket systems I doubt SMT4 is going to bring much of a gain. If AMD keeps improving their prediction, fetch, cache and load/store systems, and minimizes thread stalls overall, you can be sure the gains from SMT4 will diminish greatly.
Patents indicate they are working hard on minimizing mispredictions, thread stalls, etc.:

Speculative DRAM memory read into L3
Speculative DRAM page activation
Cache control aware IMC
AGEN Bypass
Cache Bypass
BTB compression
Load/store combine
Unified AGU queue
Early return address prediction
...

From the X3D architecture it looks to me like AMD is bringing data/memory closer to the CPU than ever, making SMT4 more niche, considering that their entire goal is to minimize thread stalling by increasing data locality.
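To put rough numbers on the argument above, here is a toy model. It is not from the post or from any AMD material. It assumes each hardware thread is stalled with the same independent probability in any given cycle, and that the core does useful work whenever at least one thread is not stalled. It ignores contention for shared resources, so it can only overstate what SMT delivers. The function names are made up for illustration.

```python
def utilization(stall_prob: float, threads: int) -> float:
    """Fraction of cycles in which at least one thread can issue work.

    Toy assumption: every thread stalls independently with probability
    stall_prob, so all threads stall together with probability
    stall_prob ** threads.
    """
    if not 0.0 <= stall_prob < 1.0:
        raise ValueError("stall_prob must be in [0, 1)")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return 1.0 - stall_prob ** threads


def smt_gain(stall_prob: float, threads: int, baseline: int = 1) -> float:
    """Throughput ratio of `threads`-way SMT over `baseline`-way SMT."""
    return utilization(stall_prob, threads) / utilization(stall_prob, baseline)


if __name__ == "__main__":
    # Compare SMT4 against SMT2 as the per-thread stall probability shrinks.
    for p in (0.5, 0.3, 0.2, 0.1):
        print(f"stall={p:.1f}  SMT4 over SMT2: {smt_gain(p, 4, 2):.3f}x")
```

In this model the SMT4-over-SMT2 ratio works out to 1 + p², so halving the stall probability cuts the extra gain to a quarter. That is the shape of the claim: the better the core hides stalls on its own, the less the third and fourth threads have left to fill.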

NostaSeronx · Apr 22, 2020

Veradun said:
Is it possible to go 12FDX?

Not yet, there haven't been any tapeout/signoff runs or MPWs yet. So, 12FDX is effectively non-existent. Also, 12LP+ delayed the initial 12FDX from running at Malta.

However, 12FDX appears to be set to use newer things:
FD-2D Smartcut 2.0 wafers from SOITEC (starts at 5nm SOI thickness vs 12nm SOI thickness)
A bunch of new materials/process steps will be added as well, which is meant to push it towards the Networking/Computing/Server market solutions.

Ocean12:
"In building its FDX technology offer, GLOBALFOUNDRIES has substrate requirements that Soitec will develop and manufacture through the installation of a new pilot line. It is a first objective of this task to develop high quality SOI substrates to enable satisfactory yield and performance for 22FD, Next Gen 22FD, 12FD node and beyond circuits."

"We have to highlight the important role of GLOBALFOUNDRIES in the qualification of SOITEC SOI substrates pilot line. The sampling of the substrates will be provided to GLOBALFOUNDRIES to be implemented in their next Gen 22FDX and 12FDX pilot line. The incoming material check, inline parameters, defect checks as well as circuit yield data obtained at GLOBALFOUNDRIES will provide an important feedback to define substrate characteristics."

"OCEAN12, considering Next Gen 22FD & 12FD technologies, updated BOX electrical properties specifications, anticipated to support RBB / FBB extended use, will induce needs for additional development on BOX quality metrology & performance. The objective is to reach for this substrate generation a quality comparable to state of the art Gate Oxide."

^== SiO2 to HfO2 level of development objective on substrate oxide as well.

22FDX-NextGen (Mobility, Industrial, Space, but not Computing) and 12FDX (Multi-Market; Industrial/Mobility/Computing/Space)

moinmoin · Apr 22, 2020

amrnuke said:
That would be really interesting to let SMT be an on-the-fly switch, perhaps even on a per-core basis.

I was thinking per-core (or rather, per-process) indeed.

french toast said:
Seriously unlikely for Ryzen anytime soon... Epyc? That makes a lot more sense, and as this is a new uarch I wouldn't be surprised if they increase the core resources now to allow for SMT4 on 5nm for Epyc. With a wider core and the lower-clocked, throughput-oriented nature of datacentres, server CPUs are ripe for something like this at the right time.

Yes, it's clearly wishful thinking on my part. It doesn't even need to be for SMT-4, even for SMT-2 it would be an improvement in all the cases that run better with SMT disabled. And that would be useful even in Ryzen.

I guess my general thinking is that AMD managed to automate the boost behavior of their chips beyond the hard-coded tables used until the previous gens. As @DisEnchantment notes, they appear to be working hard on improving the whole data management to make predictions and caches work more efficiently, which includes a lot of automation logic. Handling something like SMT (at whatever size) in an automatic fashion would fit well in that line of progress.

Geranium · Apr 22, 2020

amrnuke said:
No

No

No

No

No

😛

Then the scores are not eligible to be compared with each other.

DrMrLordX · Apr 22, 2020

Richie Rich said:
That rumor about samples with disabled SMT

Got anything else to talk about? Just curious.

Markfw · Apr 22, 2020

Geranium said:
Then the scores are not eligible to be compared with each other.

Exactly. Except Richie Rich will argue that till the cows come home. Look at his sig! Your post 2523 is where you are disputing his post, and I agree with you. You can't compare apples to steaks.

Thibsie · Apr 22, 2020

Markfw said:
Exactly. Except Richie Rich will argue that till the cows come home. Look at his sig! Your post 2523 is where you are disputing his post, and I agree with you. You can't compare apples to steaks.

[Fun]
But cows and steaks are very closely related right?
[/fun]

Thunder 57 · Apr 22, 2020

Markfw said:
Exactly. Except Richie Rich will argue that till the cows come home. Look at his sig! Your post 2523 is where you are disputing his post, and I agree with you. You can't compare apples to steaks.

Mark, when you reference a post number, could you kindly link them? It makes it so much easier for us than having to go back page(s) and find it.

Example. That will bring you right to 2523.

Also, I'm sure you know how I feel about Richie Rich. SMT4! Zen 3 uses Jim Keller's (a god apparently) K12 with 6ALUs!. But at the same time x86 is garbage!

Not saying he is an idiot, just that his beliefs are misguided. And he doesn't seem to be open to discussion since he seems so certain in many things.

Markfw · Apr 22, 2020

Thunder 57 said:
Mark, when you reference a post number, could you kindly link them? It makes it so much easier for us than having to go back page(s) and find it.

Example. That will bring you right to 2523.

Also, I'm sure you know how I feel about Richie Rich. SMT4! Zen 3 uses Jim Keller's (a god apparently) K12 with 6ALUs!. But at the same time x86 is garbage!

Not saying he is an idiot, just that his beliefs are misguided. And he doesn't seem to be open to discussion since he seems so certain in many things.

Believe it or not, I have been on here 20 years, and did not know how to do that. But it's the icon 2 to the right of the post number, correct?

DrMrLordX · Apr 22, 2020

The post # itself is a direct link to the post.

Thunder 57 · Apr 22, 2020

Yup, just hover over the post number, right click, and copy link. Then you can use it in your own post to allow quick access.

I doubt it's been possible for 20 years, but forums have improved and this was likely a nice little addition at some point. It's never too late to learn something new 🙂 .

NobleX13 · Apr 22, 2020

These rumored changes to the CCX design and higher IPC definitely have me intrigued. I am "slumming it" on a Ryzen 5 1600 right now. Holding out for the 4000-series launch.

Richie Rich · Apr 23, 2020

moinmoin said:
Yes, it's clearly wishful thinking on my part. It doesn't even need to be for SMT-4, even for SMT-2 it would be an improvement in all the cases that run better with SMT disabled. And that would be useful even in Ryzen.

Yes, I agree. The OS scheduler can control the SMT mode in software. If the scheduler loads only one thread per physical core (and keeps the other virtual cores empty), the core behaves like SMT-off even if the CPU has SMT4 on, or whatever number it is capable of. If the scheduler loads two threads per core, it behaves like SMT2, and so on. In theory it is possible to set the SMT mode by configuring the OS scheduler. And even more: you can mix different SMT modes for different cores in the same CPU (or across multiple CPUs in a server). For example, imagine you have a 12-core Ryzen 4900 with SMT4: you could have a game use 8 cores with SMT2 or SMT-off (whichever gives you the best gaming performance) and run a Blender render in the background on the remaining 4 cores using full SMT4 (so 16 threads), purely in software via the OS scheduler. Today's stupid scheduler will fill the second thread with the low-priority Blender process even if you set the game's thread priority to very high, cutting performance in half no matter what priority you set (and with SMT4 it would fall to 1/4).

The problem is not a higher number of virtual cores per physical core. The problem is that today's OS scheduler really fails to manage thread performance on a physical core. It's a pure software failure, and BIOS SMT-off is just a simple workaround. Sadly, a single-core CPU was able to control thread performance via process priority much better than modern SMT systems (Folding@Home in the background with low priority set didn't hurt games at all). It's kind of a mystery why the OS scheduler cannot clear the rest of the virtual cores to maximize performance for the process with higher priority. This would solve all problems related to SMT.
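A minimal sketch of the 12-core example, assuming a made-up enumeration in which core c owns logical CPUs 4c to 4c+3. Real systems number sibling threads differently (on Linux the mapping can be read from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list), and the helper names here are hypothetical. The only real API used is os.sched_setaffinity, which exists on Linux and is skipped elsewhere.

```python
import os


def hypothetical_topology(cores: int, smt: int) -> dict[int, list[int]]:
    """Map each physical core to its logical CPUs (assumed numbering)."""
    return {core: [core * smt + t for t in range(smt)] for core in range(cores)}


def pick_cpus(topology: dict[int, list[int]], cores, threads_per_core: int) -> list[int]:
    """Choose the first threads_per_core sibling threads on each given core."""
    chosen = []
    for core in cores:
        siblings = topology[core]
        if threads_per_core > len(siblings):
            raise ValueError(f"core {core} has only {len(siblings)} threads")
        chosen.extend(siblings[:threads_per_core])
    return chosen


def pin(pid: int, cpus: list[int]) -> bool:
    """Restrict a process to the given logical CPUs where the OS supports it."""
    if hasattr(os, "sched_setaffinity"):  # Linux only
        os.sched_setaffinity(pid, set(cpus))
        return True
    return False


topology = hypothetical_topology(cores=12, smt=4)
game_cpus = pick_cpus(topology, range(0, 8), threads_per_core=1)     # "SMT-off" on 8 cores
render_cpus = pick_cpus(topology, range(8, 12), threads_per_core=4)  # full SMT4 on 4 cores
print(len(game_cpus), "CPUs for the game,", len(render_cpus), "for the render")
```

Note that affinity only limits where a given process may run. It does not keep other processes off the idle sibling threads of the game's cores, so every competing process would need its own mask for the "SMT-off" cores to stay clean.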

Atari2600 · Apr 23, 2020

moinmoin said:
I personally wish the control over SMT would move into the processor, making it decide itself which amount of logical threads are most efficient for handling a given workload.

(And to be honest I'm fed up of discussing Windows as the obstacle to progress in CPU features.)

But that would require the CPU to know everything about the workload - and not just the micro op instructions.

If there was a software means of over-riding, or setting preferences to the CPU, then yeah - but if you have a problem that is only embarrassingly parallel if you have carefully formed the memory bounds of the problem, then you need the scheduler/CPU to respect that.

An example would be CFD - each process will be assigned (as much as is possible) a continuous block of adjacent cells to process calculations for. This reduces communication between processes - i.e. communication of information between adjacent cells across different processes - and that reduction is seen all the way from DRAM through to L1 cache. It can result in significant efficiency savings.

I'm not saying it couldn't happen. I'm not saying it shouldn't happen. I'm saying it would need significant thought perhaps beyond what you originally envisage.
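The CFD point can be illustrated with a small sketch. It assumes a 1D row of cells where each cell only exchanges data with its immediate neighbours, and it compares a contiguous block assignment with a round-robin one. The function names and the 16-cell, 4-process sizes are made up for illustration; real solvers partition 2D/3D meshes with dedicated tools.

```python
def block_owner(cell: int, n_cells: int, n_procs: int) -> int:
    """Contiguous blocks: neighbouring cells mostly share a process."""
    return cell * n_procs // n_cells


def round_robin_owner(cell: int, n_cells: int, n_procs: int) -> int:
    """Interleaved assignment: every neighbour lives on another process."""
    return cell % n_procs


def cross_process_links(n_cells: int, n_procs: int, owner) -> int:
    """Count neighbour pairs (i, i+1) whose cells belong to different processes."""
    return sum(
        1
        for cell in range(n_cells - 1)
        if owner(cell, n_cells, n_procs) != owner(cell + 1, n_cells, n_procs)
    )


if __name__ == "__main__":
    for name, owner in (("block", block_owner), ("round-robin", round_robin_owner)):
        links = cross_process_links(16, 4, owner)
        print(f"{name:12s} cross-process neighbour links: {links}")
```

With 16 cells on 4 processes, the block layout leaves 3 neighbour pairs that cross a process boundary, while the interleaved layout leaves all 15. A scheduler or CPU that reshuffles threads without knowing about this layout could undo the locality the programmer built in.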

Thunder 57 · Apr 23, 2020

Richie Rich said:
Yes, I agree. The OS scheduler can control the SMT mode in software. If the scheduler loads only one thread per physical core (and keeps the other virtual cores empty), the core behaves like SMT-off even if the CPU has SMT4 on, or whatever number it is capable of. If the scheduler loads two threads per core, it behaves like SMT2, and so on. In theory it is possible to set the SMT mode by configuring the OS scheduler. And even more: you can mix different SMT modes for different cores in the same CPU (or across multiple CPUs in a server). For example, imagine you have a 12-core Ryzen 4900 with SMT4: you could have a game use 8 cores with SMT2 or SMT-off (whichever gives you the best gaming performance) and run a Blender render in the background on the remaining 4 cores using full SMT4 (so 16 threads), purely in software via the OS scheduler. Today's stupid scheduler will fill the second thread with the low-priority Blender process even if you set the game's thread priority to very high, cutting performance in half no matter what priority you set (and with SMT4 it would fall to 1/4).

The problem is not a higher number of virtual cores per physical core. The problem is that today's OS scheduler really fails to manage thread performance on a physical core. It's a pure software failure, and BIOS SMT-off is just a simple workaround. Sadly, a single-core CPU was able to control thread performance via process priority much better than modern SMT systems (Folding@Home in the background with low priority set didn't hurt games at all). It's kind of a mystery why the OS scheduler cannot clear the rest of the virtual cores to maximize performance for the process with higher priority. This would solve all problems related to SMT.

This SMT4 stuff is way past getting old. There will be no SMT4 in Zen 3. Get over it.

Thibsie · Apr 23, 2020

Thunder 57 said:
This SMT4 stuff is way past getting old. There will be no SMT4 in Zen 3. Get over it.

You're wasting energy IMO

Thunder 57 · Apr 23, 2020

Thibsie said:
You're wasting energy IMO

I don't know, it must burn a few calories to type how wrong he always is. But I get your point. More like wasting time.

moinmoin · Apr 23, 2020

Atari2600 said:
But that would require the CPU to know everything about the workload - and not just the micro op instructions.

I think thanks to the huge caches AMD is slowly getting there. A lot of the prediction, fetch, cache and load/store optimizations that @DisEnchantment keeps posting patents about are already looking beyond the constraints of single cores, optimizing the data management (and reducing stalls as a result) on what is essentially the network level (IF is technically not yet that, but getting there). Better adaptation to workloads is a central part of such optimization, which requires knowledge of them.

In some ways in Zen chips the central brain already is no longer the CPU cores but the SCF. I fully expect AMD to expand the latter's role further and further.

Richie Rich · Apr 23, 2020

Geranium said:
Only 1.83x improvement with 4 to 8 times the L2 and Mediatek-like "optimization"!!😕
Apple's ARM chip has 4MB L2 per core compared to 512KB and 1MB per core on AMD64 chips. The whole benchmark could fit in Apple's L2 cache.
Also, was the benchmark compiled with the same compiler?? Same OS?? Same storage and RAM size and speed??

Edit: I was replying to the SPECint2006 benchmark. Looks like it is from his signature.

Did God prohibit AMD and Intel from using the same size of L2 cache as Apple? NO. Core2Duo was using a big shared L2 cache 10 years ago. So you cry well, but on the wrong shoulder here. You should write a complaint email to Apple headquarters asking them to stop developing such powerful cores, because your ego cannot digest that your brand-new x86 looks like garbage in comparison to Apple's uarch. Well, the problem is that you should have complained 5 years ago, because the very old Apple A9 Twister core from 2015 already had 7% higher IPC than today's 9900K Coffee Lake and Zen 2 😎

Stop with the insults/confrontational postings.
Take some more time off for reflection.

AT Mod Usandthem

Thibsie · Apr 23, 2020

I think this is all going too far isn't it ?
Is ignoring the only possibility ?

amrnuke · Apr 23, 2020

moinmoin said:
In some ways in Zen chips the central brain already is no longer the CPU cores but the SCF. I fully expect AMD to expand the latter's role further and further.

Interesting, since it seems like a lot of the gains we have seen have been so narrow - and doesn't it seem likely that we are on the very top of a curve on extracting IPC from a core refinement standpoint? Now we have to focus on the surrounding stuff.

Like a car, the core is the engine but there is so much more.

Right now, Intel is like Dodge with their Challenger - keep adding horsepower to an ancient-looking design. There is no such thing as "handling" or "rear visibility" and anyone who thinks those are a thing is not living in the "real world". AMD is like Ford with the Mustang - great power though less than the Challenger... but... it's actually faster to 60 than the power-oriented Challenger? What, did they put real tires on it? And it can turn?

Atari2600 · Apr 23, 2020

Thibsie said:
I think this is all going too far isn't it ?
Is ignoring the only possibility ?

Long since put him on my ignore list.

Not adding anything interesting ===> ignore list.

DrMrLordX · Apr 23, 2020

Richie Rich said:
Did God prohibit AMD and Intel from using the same size of L2 cache as Apple?

Yes? Physics is a bitch. L2 takes more die space than L3, and as you may have noticed, having a lot of L3 with good prefetch units can do a lot to improve the performance of multicore CPUs in parallel workloads with lots of intercore communication, which is one sort of workload for which Intel and AMD have optimized their CPUs. Compare that situation to Apple, who exclusively use their A-series SoCs in phones and tablets where bursty, single-threaded (or sparsely-threaded) applications predominate. There you have less likelihood of core->core writes, meaning maintaining cache coherency is less important (and therefore, shared L3 is less important). So Apple chose to spend a lot of die area on L2 that could have been spent elsewhere, or that could have been left unspent (driving higher yields and/or lower costs per die). Apple has the freedom to charge insane amounts of money for their hardware, and they don't have any OEMs telling them to trim costs, since they provide all their own SoCs for their own designs from top to bottom.

Core2Duo was using a big shared L2 cache 10 years ago.

Conroe only had two cores. Cache coherency on that generation of CPU wasn't that big of a deal. With shared L2, you didn't even have to think about which core had which data in its cache since it was in a shared L2 and since Intel was using an inclusive cache hierarchy; e.g. if your CPU couldn't find the data in L1d on Core 0 but it was in L1d of Core 1, it was guaranteed to be in the L2, so you wouldn't have to do any core->core communication to read that data into the L1d of Core 0.
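The inclusion property described above can be sketched with a toy model. This is not how Conroe was implemented: it assumes read-only traffic (no modified lines or coherence states), tiny LRU caches with made-up sizes, and a hypothetical class name. It only shows the invariant that anything held in either L1 is also in the shared L2, so a miss on one core can be served from L2 without asking the other core.

```python
from collections import OrderedDict


class InclusiveHierarchy:
    """Two private L1 caches under one shared, inclusive L2 (reads only)."""

    def __init__(self, cores: int = 2, l1_lines: int = 2, l2_lines: int = 4):
        self.l1 = [OrderedDict() for _ in range(cores)]
        self.l2 = OrderedDict()
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines

    def read(self, core: int, addr: int) -> str:
        """Return which level served the read: 'L1', 'L2' or 'DRAM'."""
        if addr in self.l1[core]:
            self.l1[core].move_to_end(addr)  # refresh LRU position
            return "L1"
        if addr in self.l2:
            source = "L2"
            self.l2.move_to_end(addr)
        else:
            source = "DRAM"
            self._fill_l2(addr)
        self._fill_l1(core, addr)
        return source

    def _fill_l2(self, addr: int) -> None:
        if len(self.l2) >= self.l2_lines:
            victim, _ = self.l2.popitem(last=False)
            # Back-invalidate: a line leaving L2 must leave every L1 too,
            # otherwise the hierarchy would stop being inclusive.
            for l1 in self.l1:
                l1.pop(victim, None)
        self.l2[addr] = True

    def _fill_l1(self, core: int, addr: int) -> None:
        l1 = self.l1[core]
        if len(l1) >= self.l1_lines:
            l1.popitem(last=False)  # evict the least recently used line
        l1[addr] = True

    def is_inclusive(self) -> bool:
        return all(addr in self.l2 for l1 in self.l1 for addr in l1)


if __name__ == "__main__":
    h = InclusiveHierarchy()
    print(h.read(1, 0x40))  # core 1 brings the line in from memory
    print(h.read(0, 0x40))  # core 0 misses L1 but finds the line in shared L2
```

The back-invalidation in `_fill_l2` is the price of inclusion: L2 capacity is partly spent duplicating L1 contents, and evicting from L2 can throw out lines an L1 was still using.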
