Speculation: Ryzen 4000 series/Zen 3

NostaSeronx · Oct 13, 2019

NTMBK said:
Why the hell are we still discussing SMT4 in Zen 3?

Well...

Then Milan was designed to further erase any asterisks that remain, so in thinking about it, in the original strategy, Milan was where we expected to be back to IPC (or better) parity across all workloads.

- https://www.anandtech.com/show/14568/an-interview-with-amds-forrest-norrod-naples-rome-milan-genoa

Zen 2's FPU, split into hi (128-bit) and lo (128-bit) halves, has FP pipes MUL0/MUL1/MUL2/MUL3 and ADD0/ADD1/ADD2/ADD3, and all ports can do VMISC/FMISC. That is enough to support AVX512.

Now SMT4 comes into play, as there are 4x FMUL + 4x FADD. Each thread thus gets 1x FMUL + 1x FADD. Since the majority of legacy code is 128-bit, and there are workloads that can't be scaled to 256-bit, it makes more sense to support 128-bit units over 256-bit/512-bit units.

SMT4, going by AMD's hiring of researchers, is a constrained quantity; add to that the dynamic nature of the new SMT model, i.e. at prefetch or dispatch it goes 1T/2T/3T/4T on demand, etc. It is more effective than any previous x86 SMT version.

No value added to Milan = Intel regains their lead. Icelake-SP(XCC*2) w/ SunnycoveX(10nm++ core) isn't a low-volume product, nor does it have fewer than 64 cores. It's a >72 core monster that easily replaces Xeon Phi. There is also the ultra-secret Icelake-MDFI w/ mesh chiplets (w/ L4 depot cache).

Thunder 57 · Oct 13, 2019

Zen 3 IPC Gains Are 'Greater' Than 8 Percent - EXCLUSIVE

AMD's Zen 2 architecture feels like its only just arrived, but there's already a lot of excitement as to what we'll be seeing with Zen 3, which is schedule

www.redgamingtech.com

Obviously it's just a rumor, but that sounds believable. Certainly more believable than 30-40% IPC gains that have been stated by some.

soresu · Oct 14, 2019

NostaSeronx said:
No value added to Milan = Intel regains their lead.

They recently made a statement about Milan.

They said it will have superior perf/watt to IceLake, not superior raw performance.

I expect modest IPC gains with some power efficiency gains too, better to expect that and be surprised by more if it comes.

Yotsugi · Oct 14, 2019

soresu said:
They said it will have superior perf/watt to IceLake, not superior raw performance.

It will have superior everything.

soresu said:
I expect modest IPC gains with some power efficiency gains too

It's upper teens for IPC and some clocks to boot, the silicon is already up and running, like, Windows.

maddie · Oct 14, 2019

soresu said:
They recently made a statement about Milan.

They said it will have superior perf/watt to IceLake, not superior raw performance.

I expect modest IPC gains with some power efficiency gains too, better to expect that and be surprised by more if it comes.

If you continue with your line of reasoning, then you're implying that power drops are coming. I take it to mean that the increased perf/W will translate to higher performance as I really don't see them lowering the TDP ratings. Do you?

soresu · Oct 14, 2019

maddie said:
If you continue with your line of reasoning, then you're implying that power drops are coming. I take it to mean that the increased perf/W will translate to higher performance as I really don't see them lowering the TDP ratings. Do you?

That meant (in my parlance anyways) modest IPC/clock gains and modest power drops, but not a great amount of either given the process change is fairly meagre.

Others have stated otherwise, and some have stated in a rather overly optimistic way that a change to 6 wide is coming - but I prefer to expect less and receive more (if indeed there is more), it's better that way.

Of course I'm just as happy to get a regular 20% IPC bump per gen, but even ARM can't do that all the time - A73 case in point.

Having said that, does anyone have a concrete figure for the Cortex A57 -> A72 IPC improvement?

DisEnchantment · Oct 14, 2019

Yotsugi said:
It's upper teens for IPC and some clocks to boot, the silicon is already up and running, like, Windows.

In the video (around the 104 second mark) they said they are already sampling the chip.

Yotsugi · Oct 14, 2019

DisEnchantment said:
In the video (around the 104 second mark) they said they are already sampling the chip.

True that, since the first Si is Q2-ish.

soresu · Oct 14, 2019

maddie said:
as I really don't see them lowering the TDP ratings. Do you?

Depends on the segment, the 2700E may have been a part which was nigh on impossible to lay your hands on but it was a significant TDP drop for a meagre clock decrease at 45W.

Also at 14nm Zen there was only a single 65W 8 core product (1700), now we have several at 7nm, wasn't there a 12 core 65W too?

In the APU segment I would definitely expect a sub 15W TDP SKU, they have more than enough efficiency now to achieve a very good performer at 10W or below, especially if Navi performs as efficiently as you might hope at lower clockspeeds.

maddie · Oct 14, 2019

soresu said:
Depends on the segment, the 2700E may have been a part which was nigh on impossible to lay your hands on but it was a significant TDP drop for a meagre clock decrease at 45W.

Also at 14nm Zen there was only a single 65W 8 core product (1700), now we have several at 7nm, wasn't there a 12 core 65W too?

In the APU segment I would definitely expect a sub 15W TDP SKU, they have more than enough efficiency now to achieve a very good performer at 10W or below, especially if Navi performs as efficiently as you might hope at lower clockspeeds.

Doesn't matter. Once they keep the 95W or any other existing rating then increased perf/W is really increased perf.
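
A rough way to see this point, as a sketch with made-up numbers: if sustained performance is modelled as perf/W times the power budget, then holding the rating constant turns any perf/W gain into the same relative performance gain. The 10% efficiency figure and the normalised baseline below are purely illustrative assumptions, not claims about Zen 3, and the model ignores that perf/W itself changes with the operating point.

```python
def performance(perf_per_watt: float, tdp_watts: float) -> float:
    """Toy model: sustained performance = efficiency x power budget."""
    return perf_per_watt * tdp_watts


# Hypothetical baseline: efficiency normalised to 1.0 at a 95 W rating.
base = performance(1.0, 95.0)

# Same 95 W rating, efficiency up by an illustrative 10%.
same_tdp = performance(1.10, 95.0)

# Alternative: keep performance flat and spend the gain on lower power.
flat_perf_watts = 95.0 / 1.10

print(f"perf gain at fixed TDP: {same_tdp / base - 1:.1%}")
print(f"power needed for baseline perf: {flat_perf_watts:.1f} W")
```

Either outcome is the same perf/W improvement; which one a buyer sees depends on whether the rating is kept or lowered.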

tomatosummit · Oct 14, 2019

maddie said:
Doesn't matter. Once they keep the 95W or any other existing rating then increased perf/W is really increased perf.

This
Perfect example was the R7 1700 to the R7 2700. Although it was amplified by the better boost algorithms, the 2700 maintains higher clock speeds at the same 65W TDP thanks to the 12ff improvements.

soresu · Oct 14, 2019

maddie said:
Doesn't matter. Once they keep the 95W or any other existing rating then increased perf/W is really increased perf.

Not really how I see it, but I have offline CG rendering goggles on.

I will always prefer more cores at the same power rather than a few hundred mhz on the same number of cores.

If a 65W 16 core model comes out, I would probably buy it even if it costs more than the higher clocked model at 95W-105W.

Saylick · Oct 14, 2019

Yotsugi said:
It will have superior everything.

It's upper teens for IPC and some clocks to boot, the silicon is already up and running, like, Windows.

Out of curiosity, how do you know the IPC gains are upper teens? Is this just speculation?

Thunder 57 · Oct 14, 2019

Saylick said:
Out of curiosity, how do you know the IPC gains are upper teens? Is this just speculation?

No one knows for sure and if they did, they certainly wouldn't be saying it.

Yotsugi · Oct 14, 2019

Thunder 57 said:
No one knows for sure and if they did, they certainly wouldn't be saying it.

Nah, the cat's outta the bag already in China.

itsmydamnation · Oct 14, 2019

if Zen3 has the same amount or more development effort then high teens doesn't seem unreasonable, they did what 13-15%ish while also doubling datapath width with Zen2.

Saylick · Oct 14, 2019

Yotsugi said:
Nah, the cat's outta the bag already in China.

I haven't been keeping up too frequently in this thread so sorry if it's already been posted, but can you provide the source (assuming it's on Chiphell or some Chinese forum) of this info?

Also, "upper teen" IPC improvement implies the jump from Zen 2 to Zen 3 is as large, if not larger, than from Zen+ to Zen 2. I'd really like to see some proof because that's a BIG jump in IPC.

Saylick · Oct 14, 2019

itsmydamnation said:
if Zen3 has the same amount or more development effort then high teens doesn't seem unreasonable, they did what 13-15%ish while also doubling datapath width with Zen2.

I have an easier time understanding the 15% IPC gains in Zen 2 because the larger mop cache and improved predictor are items that I've seen in the past that directly improve IPC. What I'm curious about is what else is there to do that would also give another 15% IPC on top of the 15% that Zen 2 brought. Larger registers? Another L/D unit? More ALUs?

Yotsugi · Oct 14, 2019

Saylick said:
but can you provide the source (assuming it's on Chiphell or some Chinese forum) of this info?

Later, when I dig it out of Twitter DMs.

Saylick said:
Also, "upper teen" IPC improvement implies the jump from Zen 2 to Zen 3 is as large, if not larger, than from Zen+ to Zen 2

What's so special about that? Every numbered core is a tock.

Saylick said:
What I'm curious about is what else is there to do that would also give another 15% IPC on top of the 15% that Zen 2 brought

You'll see.

itsmydamnation · Oct 14, 2019

Saylick said:
I have an easier time understanding the 15% IPC gains in Zen 2 because the larger mop cache and improved predictor are items that I've seen in the past that directly improve IPC. What I'm curious about is what else is there to do that would also give another 15% IPC on top of the 15% that Zen 2 brought. Larger registers? Another L/D unit? More ALUs?

The answer's easy: keep the core fed, so bigger better front end, yes more PRF, more dispatch/retire, increase the OOOe window. The known L3 cache change can make a big IPC difference to various different workloads.

In terms of ALU's still waiting for this mythical single thread integer workload that doesn't do any load or store and has an IPC of >4 with heaps of ILP just lying around waiting for more ALU's.

Saylick · Oct 14, 2019

Yotsugi said:
Later, when I dig it out of Twitter DMs.

I look forward to it.

Yotsugi said:
What's so special about that? Every numbered core is a tock.

That's a fair point, but then again, Intel has used 10% IPC gains as a tock and that's considered a respectable IPC gain for an architectural improvement. 15% on top of another 15% is a fresh change of pace given the incremental improvements we've seen from Intel in the last few years.
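
For scale, per-generation IPC gains compound multiplicatively rather than add. A minimal sketch, taking the round 15% per generation used in this thread as an assumption (the actual Zen 3 figure is not known here); `compound_ipc` is just an illustrative helper name:

```python
from math import prod


def compound_ipc(gains):
    """Combine per-generation IPC gains (given as fractions) multiplicatively."""
    return prod(1.0 + g for g in gains) - 1.0


# 15% followed by another 15%, the round numbers used in this thread.
two_gens = compound_ipc([0.15, 0.15])

# For comparison: two generations at the 10% mentioned for an Intel tock.
two_tocks = compound_ipc([0.10, 0.10])

print(f"15% then 15%: {two_gens:.1%} over the starting core")
print(f"10% then 10%: {two_tocks:.1%} over the starting core")
```

So two 15% steps land a bit above 30% cumulative, against roughly 21% for two 10% steps.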

itsmydamnation said:
The answer's easy: keep the core fed, so bigger better front end, yes more PRF, more dispatch/retire, increase the OOOe window. The known L3 cache change can make a big IPC difference to various different workloads.

In terms of ALU's still waiting for this mythical single thread integer workload that doesn't do any load or store and has an IPC of >4 with heaps of ILP just lying around waiting for more ALU's.

Hahaha, fair enough. That's like me asking, "How do you make a new Corvette faster than the generation before it?", and you replying, "Well, you can make it have more horsepower, fatter tires, better weight distribution, and more downforce." I mean, you'd be right because it's true, but it's the smaller details and reasoning behind certain design decisions that I think that are more interesting.

itsmydamnation · Oct 14, 2019

Saylick said:
Hahaha, fair enough. That's like me asking, "How do you make a new Corvette faster than the generation before it?", and you replying, "Well, you can make it have more horsepower, fatter tires, better weight distribution, and more downforce." I mean, you'd be right because it's true, but it's the smaller details and reasoning behind certain design decisions that I think that are more interesting.

well of course,

have a look at the patent for how they made the 3rd AGU work, it's quite interesting. It makes me wonder whether the reason Apple has 4 cycle latency on simple ALU ops is that they are doing the same kind of thing AMD did for the AGUs on the ALUs as well. If they did something like that they probably wouldn't have to do any kind of internal clustering of ALUs.

Cardyak · Oct 14, 2019

There’s loads of potential for further increases, and that’s without radical redesigns needed.

Just some basic stuff off the top of my head

- More execution units (Doesn’t have to be ALU, can be AGU, LEA, FPU, etc...)
- Larger Caches
- Increased ROB and Memory, Scheduler Buffers
- More ports to dispatch instructions to execution units and reduce back-end bottlenecks

amd6502 · Oct 14, 2019

itsmydamnation said:
In terms of ALU's still waiting for this mythical single thread integer workload that doesn't do any load or store and has an IPC of >4 with heaps of ILP just lying around waiting for more ALU's.

You're seeing it all the time. It's just that it's only (likely) a smaller percentage of the code.

4ALU is quite good already and got zen the 40%+ ipc gain.

The potential single-thread gains from 4ALU to 5ALU (or 6ALU) are going to be much less. But here, even a ~5% IPC increase is going to count a lot. And for (SMT2) multithread IPC gains, it's bound to be double digits.

The slight downside is more idle pipes means it would need gating to avoid losing efficiency. Or a 4-way MT scheme; SMT2+?

itsmydamnation · Oct 15, 2019

amd6502 said:
You're seeing it all the time. It's just that it's only (likely) a smaller percentage of the code.

4ALU is quite good already and got zen the 40%+ ipc gain.

The potential single-thread gains from 4ALU to 5ALU (or 6ALU) are going to be much less. But here, even a ~5% IPC increase is going to count a lot. And for (SMT2) multithread IPC gains, it's bound to be double digits.

The slight downside is more idle pipes means it would need gating to avoid losing efficiency. Or a 4-way MT scheme; SMT2+?

This is just hand waving, nothing of value. For example, you're only waiting one extra cycle to execute going from 4 wide Zen to 6 wide A12, and Zen has lower latency for simple ALU ops. So unless you have sustained 6 ALU ops a cycle over many cycles back to back you're not gaining anything, yet A12 has an IPC advantage. Why is that? How do you propose to load or store a damn thing while sustaining 6 ALU ops? Zen2 doesn't even have enough issue width right now to sustain the 4 ALUs + 3 AGUs. As I have already proved for SPEC int (something you moar ALU guys have yet to do), x86 instructions with memory ops make up a very large share of instructions.

going to 4 alu's alone did not get anywhere near 40% gain, if you want to be specific and correct, bulldozer could already do 4 ALU ops in a core in a single cycle. ( not that it would practically happen or you would want to) .

let's be clear here:
much improved L1I cache ( no more aliasing)
much improved L1D cache
improved instruction fetch/increased fetch
adding of a UOP cache
significantly improved cache hierarchy
dedicated hardware for stack handling ( store to load forwarding at the frontend of the pipeline)
increased instruction dispatch
significantly increased PRF (96 to 168)
improved branch predictors
improved prefetch
improved store forwarding
increased ALU's to 4.

All those things got 40% performance uplift, not 4 ALUs FFS.

if you go back and look the initial thoughts of people like David Kanter were that Zen's 4:2 ALU:AGU configuration was sub optimal and 3:3 would have been better. Zen 2 comes along and makes it 4:3.... funny that.......

So instead of BS handwaving show me the money! Show me the SPEC int workload that has cycle over cycle over cycle the need to issue 6 ALU instructions while not loading or storing a thing.
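
To put rough numbers on that argument, here is a back-of-the-envelope bound, as a sketch only. It assumes sustained throughput is capped by whichever runs out first: dispatch width, ALU ports, or AGU ports. The 60/40 instruction mix and the width of 6 are hypothetical inputs chosen for illustration, not measurements, and the model ignores dependencies, cache misses and branch mispredicts entirely.

```python
def ipc_upper_bound(width, n_alu, n_agu, alu_frac, mem_frac):
    """Upper bound on sustained ops/cycle from structural limits alone.

    alu_frac / mem_frac: fraction of ops needing an ALU / an AGU.
    The fractions may sum to more than 1, since an x86 instruction
    with a memory operand can need both.
    """
    return min(width, n_alu / alu_frac, n_agu / mem_frac)


# Hypothetical mix, NOT measured: 60% of ops use an ALU, 40% touch memory.
MIX = dict(alu_frac=0.6, mem_frac=0.4)

for label, n_alu, n_agu in [("4 ALU + 2 AGU", 4, 2),
                            ("4 ALU + 3 AGU", 4, 3),
                            ("6 ALU + 3 AGU", 6, 3)]:
    bound = ipc_upper_bound(width=6, n_alu=n_alu, n_agu=n_agu, **MIX)
    print(f"{label}: <= {bound:.2f} ops/cycle")
```

With this made-up mix the third AGU lifts the bound, while going from 4 to 6 ALUs changes nothing because the width limit is hit first. A different mix would give different numbers, which is why a measured workload matters.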

im just going to quote agner:

Bottlenecks in AMD Ryzen

The throughput of each core in the Ryzen is higher than on any previous AMD or Intel x86 processor, except for 256-bit vector instructions. Loops that fit into the µop cache can have a throughput of five instructions or six µops per clock cycle. Code that does not fit into the µop cache can have a throughput of four instructions or six µops or approximately 16 bytes of code per clock cycle, whichever is smaller. The 16 bytes fetch rate is a likely bottleneck for CPU intensive code with large loops.

funny how there's nothing about ALU bottlenecks in his "optimization guide for assembly programmers and compiler makers".
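
The limits in that quote can be written down directly. A small sketch, using only the figures quoted above (five instructions or six µops from the µop cache; four instructions, six µops or about 16 bytes otherwise); the function name and the example instruction lengths are illustrative assumptions:

```python
def frontend_ipc(avg_instr_bytes, uops_per_instr, in_uop_cache):
    """Instructions/cycle limit implied by the quoted Ryzen figures."""
    uop_limit = 6 / uops_per_instr          # six µops per cycle
    if in_uop_cache:
        return min(5, uop_limit)            # five instructions per cycle
    fetch_limit = 16 / avg_instr_bytes      # ~16 bytes of code per cycle
    return min(4, uop_limit, fetch_limit)


# Hypothetical large loop: 5-byte average instructions, 1 µop each.
print(frontend_ipc(5.0, 1.0, in_uop_cache=False))

# Same code if it fits in the µop cache.
print(frontend_ipc(5.0, 1.0, in_uop_cache=True))
```

For the assumed 5-byte average, the 16-byte fetch rate caps a large loop below four instructions per cycle before any execution port is involved.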
