
AMD Ryzen Gen 2 Set For Q2 2018

So is there any solid information at all on AMD possibly releasing new chipsets next year, or is it just unsubstantiated rumor at this point? It seems to have come from wccftech only.

I'm asking because I want to upgrade my computer this week, and it's either buying a budget/work computer (AM4/1700, B350 probably) with the knowledge that I may upgrade again next year if I get enough work out of it, or trying to anticipate growing within the platform and getting an X399 board, likely the Designare, gambling that Thunderbolt shows up (and then probably a 1920X, or maybe a 1900X)...
 
IIRC... Ryzen's L1 cache had like half the bandwidth of Intel's.
Aye, it suffers from much lower L1 bandwidth, and slightly higher L2 latency.
That said, it has much higher L3 bandwidth, slightly lower L3 latency, and slightly higher L2 bandwidth.

L2 latency was already reduced in EPYC and Threadripper despite being the same die; it's now 12 clks, just like Intel. So increasing L1 bandwidth would be the focus in the cache system, if Intel were the benchmark for cache systems.

Overall though, I think the biggest problem is memory latency, especially in-page random latency. This is where AMD is REALLY behind.
 
Aye, it suffers from much lower L1 bandwidth, and slightly higher L2 latency.
That said, it has much higher L3 bandwidth, slightly lower L3 latency, and slightly higher L2 bandwidth.

The L3 latency claim depends on whether you are comparing to a consumer core part or a server-focused part. Kabylake/Coffeelake has lower L3 latency than Ryzen, while Skylake-X has higher latency. That is due to ring vs. mesh.

The reason L1 bandwidth is double on Intel is that they did it for AVX2. Skylake-X doubles it again because it supports AVX-512. The bandwidth itself isn't a big deal for Ryzen, as its FP throughput is half that of AVX2-enabled Intel chips.

(For bandwidth, the numbers have to be normalized per core, as most benchmarks combine the cores. An 8-core part will show double the bandwidth of a 4-core part despite having the same architecture.)
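To make that normalization concrete, here is a minimal sketch (the bandwidth figures are made up purely for illustration):

```python
def per_core_bandwidth(aggregate_gbps, cores):
    """Normalize an aggregate cache-bandwidth figure to a per-core number."""
    return aggregate_gbps / cores

# Hypothetical figures: an 8-core part reporting twice the aggregate
# bandwidth of a 4-core part has identical per-core bandwidth.
assert per_core_bandwidth(800.0, 8) == per_core_bandwidth(400.0, 4) == 100.0
```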
 
It's definitely possible. Supposedly RR already has significantly reduced L2 latency: https://mobile.twitter.com/InstLatX64/status/941279542416760833?prefetchTimestamp=1513805844807
Zen CCX has always been listed as 12 cycles, so it could have been bad measuring software (I read the software optimization guide the day it became available), or it could have been early firmware/cache behavior/rules. But fundamentally nothing has changed in the cache design.

look at the data of something like this:
http://www.7-cpu.com/cpu/Zen.html

They list the headline figure as 17 cycles, but look at the data.

the AMD software optimization guide says:
2.6.3 L2 Cache
The AMD Family 17h processor implements a unified 8-way set associative write-back L2 cache per core. This on-die L2 cache is inclusive of the L1 caches in the core. The L2 cache size is 512 Kbytes with a variable load-to-use latency of no less than 12 cycles. The L2 to L1 data path is 32 bytes wide.
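As a sanity check, the geometry in that paragraph works out as follows (assuming the 64-byte cache lines Family 17h uses):

```python
# L2 parameters from the AMD Family 17h software optimization guide
size_bytes = 512 * 1024   # 512 Kbytes
ways = 8                  # 8-way set associative
line_bytes = 64           # assumed 64-byte cache lines

sets = size_bytes // (ways * line_bytes)
assert sets == 1024       # 1024 sets of 8 ways, 64 bytes per line
```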

Secondly, cache performance did improve in the construction-core era, and by a non-trivial amount. Piledriver had the same latencies as far as I know, but the point of PD was to get BD's power usage in check, not really to go after performance (other than higher clocks due to lower power). The first design aimed at doing that was Steamroller, which had better latencies: L2 latency was down slightly, but write performance in particular was much better:

https://www.extremetech.com/computi...roller-digging-deep-into-amds-next-gen-core/2
No, they didn't.

1. You can reduce your L2 time when you don't have to check an exclusive L3.
2. They made changes to the L1D-to-L2 write-combine buffer to improve moving data into the L2, but not the caches themselves.
3. Even by reducing the L2 cache size by half in EV, they only shaved about 2 cycles off it.

So there may be some cache tweaks, but I don't think there will be anything too significant until Zen 2, unless of course the SR/RR L2 numbers in that tweet are indeed correct. I think the IPC increase will be pretty minimal, with most of the gains coming from higher clocks supposedly allowed by 12nm LP.
They won't be doing anything outside of firmware/layout; that way they keep development costs low and reduce verification time/effort.


IIRC... Ryzen's L1 cache had like half the bandwidth of Intel's.
That's because it has half the load/store width (128 bits vs. 256 bits); outside of AVX it makes zero difference.
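The arithmetic behind that, assuming two load ports per core (which holds for both Zen and Skylake):

```python
def l1_load_bandwidth(load_width_bits, load_ports=2):
    """Peak L1D load bandwidth in bytes per cycle."""
    return load_width_bits // 8 * load_ports

assert l1_load_bandwidth(128) == 32   # Zen: two 128-bit loads per cycle
assert l1_load_bandwidth(256) == 64   # Skylake (AVX2): two 256-bit loads per cycle
```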
 
Zen CCX has always been listed as 12 cycles, so it could have been bad measuring software... fundamentally nothing has changed in the cache design.
It's not an error in measurement with early software. Measuring right now shows Ryzen at 17 clks L2 latency and Threadripper/Raven/EPYC at 12 clks.

At 4 GHz:
Ryzen: 4.3 ns
Threadripper: 3 ns
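Those nanosecond figures are just cycles divided by clock frequency; a quick check (17 cycles at 4 GHz is exactly 4.25 ns, quoted as roughly 4.3 ns above):

```python
def cycles_to_ns(cycles, ghz):
    """Convert a latency in clock cycles to nanoseconds at a given frequency."""
    return cycles / ghz

assert cycles_to_ns(17, 4.0) == 4.25  # Ryzen L2, quoted as ~4.3 ns
assert cycles_to_ns(12, 4.0) == 3.0   # Threadripper/Raven/EPYC L2
```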
 
Zen CCX has always been listed as 12 cycles, so it could have been bad measuring software... fundamentally nothing has changed in the cache design.

I agree with you on the first part. Otherwise we would have seen RR have higher IPC. The rest I question. Are you saying that Joel at Extremetech is wrong? He is pretty good at knowing his stuff. Also, to the best of my knowledge, the L3 on the construction cores, as well as on Zen and Skylake-X, is non-inclusive, not exclusive.
 
I agree with you on the first part. Otherwise we would have seen RR have higher IPC. The rest I question. Are you saying that Joel at Extremetech is wrong?
I'm not saying anyone is wrong; I'm saying that fundamentally nothing in the cache system has changed in Zen so far. That doesn't exclude bug or firmware fixes.

He is pretty good at knowing his stuff. Also, to the best of my knowledge, the L3 on the construction cores, as well as Zen and Skylake-X are non-inclusive, not exclusive.
If you're talking about the BD piece, it covers throughput, not latency, which has almost zero effect on "IPC" outside of SIMD-based tests. I'm only talking about latency; in terms of throughput the bytes/cycle in/out didn't change, but they had real problems servicing two L1Ds at the same time.

The L3 of BD is "mostly" exclusive; the processor could mark specific cache lines inclusive (I believe they can do the same with Zen's L3), but they never went into detail about it. Intel's cache protocol/policy is completely different to AMD's: Intel's L3 is inclusive but its L2 isn't. AMD has almost always had an inclusive L2 but a "mostly" exclusive L3.

Also, to you, what's the difference between non-inclusive and exclusive? To me they are the exact same thing: the higher-level cache doesn't guarantee to hold the lines of the cache below it.
 
It's not an error in measurement with early software. Measuring right now shows Ryzen at 17 clks L2 latency and Threadripper/Raven/EPYC at 12 clks.

At 4 GHz:
Ryzen: 4.3 ns
Threadripper: 3 ns
Did you spend three seconds to even look at the AMD reference or the link I posted?
Code:
 Size      Latency   Increase

  32 K        4
  64 K       11          7
 128 K       14          3
 256 K       16          2
 512 K       17          1

The AMD Family 17h processor implements a unified 8-way set associative write-back L2 cache per core. This on-die L2 cache is inclusive of the L1 caches in the core. The L2 cache size is 512 Kbytes with a variable load-to-use latency of no less than 12 cycles. The L2 to L1 data path is 32 bytes wide.

See how the data lines up (almost) with what AMD has said in their own reference manual. So what's the cache latency? It depends on how you want to count it.

You people need to learn to separate the things that can be fixed with physical layout + firmware from the things that would require top-to-bottom changes to the cache system. The latter won't be happening, and the former won't do much outside microbenchmarks...
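For reference, latency tables like the 7-cpu one are typically produced by walking a randomized pointer chain, so every load depends on the previous one and the prefetchers can't hide the latency. A sketch of that access pattern (illustrative only; real benchmarks do this in C with cycle counters):

```python
import random

def make_chain(num_lines):
    """Build a single random cycle over `num_lines` cache-line slots."""
    order = list(range(num_lines))
    random.shuffle(order)
    chain = [0] * num_lines
    for i in range(num_lines):
        chain[order[i]] = order[(i + 1) % num_lines]
    return chain

def walk(chain, steps, start=0):
    """Dependent loads: each step's address comes from the previous load."""
    idx = start
    for _ in range(steps):
        idx = chain[idx]
    return idx

# 512 KB working set of 64-byte lines -- the size where the table reads 17.
chain = make_chain(512 * 1024 // 64)
# The chain is one cycle that visits every line exactly once:
assert walk(chain, len(chain)) == 0
```

Timing `walk` per step at each working-set size is what yields a latency-vs-size table like the one above.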
 
See how the data lines up (almost) with what AMD has said in their own reference manual. So what's the cache latency? It depends on how you want to count it.
How does it line up?

17 cycles is not the claimed 12 cycles.

Ryzen is the only Zen-based CPU out now that doesn't line up with the 12-cycle L2, but it's still the case.
 
Yes, I will happily correct myself to "no less than 12" (just like AMD's wording). But it doesn't change my fundamental point, now does it?


I doubt cache latencies are going to be reduced at all: 1) they are already very good; 2) the way caches operate and interrelate with each other (tags, etc.) is complex, and changes have to be made from a holistic point of view. Look at the CON cores: the fundamental cache performance didn't change in five-odd years.

The key word being fundamental: TR isn't exceeding Ryzen from the perspective that they can both deliver data from the L2 with a load-to-use latency of 12 cycles. TR is handling edge cases better, but nothing has fundamentally changed. Do you think they went all the way up to the highest levels of design abstraction to make changes that then flow all the way down the design stack, just to make certain L2 use cases respond a few cycles earlier? All in a minor stepping?

Now actually show the massive IPC gain from it, because, you know, with 4K page sizes even regular Ryzen returns them as expected.
 
Yes, I will happily correct myself to "no less than 12" (just like AMD's wording). But it doesn't change my fundamental point, now does it?
Which is what, exactly? I am quite confused about what you're trying to say.

https://www.reddit.com/r/amd/comments/7k7gif/_/dre5p10

Threadripper has lower latencies than Ryzen, despite being the same silicon, even clock normalized. There's quite a bit of tweaking that can be done in microcode and immutable firmware.
 
Threadripper has lower latencies than Ryzen, despite being the same silicon, even clock normalized. There's quite a bit of tweaking that can be done in microcode and immutable firmware.
I edited my post as you replied, but that is exactly my point. The cache system isn't going to change; there isn't going to be a massive improvement in cache performance outside of target cases that aren't likely to deliver massive general performance improvements.
 
The L3 of BD is "mostly" exclusive... AMD has almost always had an inclusive L2 but a "mostly" exclusive L3.

Also, to you, what's the difference between non-inclusive and exclusive? To me they are the exact same thing: the higher-level cache doesn't guarantee to hold the lines of the cache below it.

Agreed, the construction cores' cache was very much flawed. I also agree cache latency is more important than bandwidth, though they both matter. Intel's L3 is inclusive in most of their CPUs, but Skylake-X changes that. Also, I do believe you meant AMD has almost always been exclusive, while Intel has been inclusive.

There is also a difference between exclusive and non-inclusive. From the Anandtech Skylake-X review:

"A non-inclusive cache is somewhat between the two, and is different to an exclusive cache: in this context, when a data line is present in the L2, it does not immediately go into L3. If the value in L2 is modified or evicted, the data then moves into L3, storing an older copy. (The reason it is not called an exclusive cache is because the data can be re-read from L3 to L2 and still remain in the L3)."
 
"A non-inclusive cache is somewhat between the two, and is different to an exclusive cache: in this context, when a data line is present in the L2, it does not immediately go into L3. If the value in L2 is modified or evicted, the data then moves into L3, storing an older copy. (The reason it is not called an exclusive cache is because the data can be re-read from L3 to L2 and still remain in the L3)."
The "problem" with that is we are really talking about coherency policy at that point and we dont even know how AMD's new coherency protocol works other then it has 7 states and its probably a further iteration of MOSEI. By that logic above you could probably also call zens L3 non-inclusive as well, but for both of them it will depend on how the line is flagged within the coherency protocol as to weather it can be read and in most complex cases its a flush to memory just to be safe.
 
I edited my post as you replied, but that is exactly my point. The cache system isn't going to change; there isn't going to be a massive improvement in cache performance outside of target cases that aren't likely to deliver massive general performance improvements.
Oh, I don't think we were arguing massive improvements. That would require for Zen's cache system to be bad in the first place. It is still a very good first iteration.



However, improvements can be made. Far Cry Primal is a particularly abysmal title for Ryzen, yet under the right configuration, Threadripper can catch up to Intel's systems.

This is an edge case where the lower L2 latency is likely helping, in addition to the lower L3 and memory latency I've shown in the Reddit thread. A 1900X in game mode is essentially 4C/8T with access to one dual-channel controller, which makes it comparable to, say, a 1500X in setup. A 1920X in game mode is more like a 1600X.

While neither is in this chart, Far Cry Primal does not scale with cores, like, at all. So the 1800X acts as a proxy for them.

Threadripper already has measurably better IPC than Ryzen in specific scenarios if you set it up right so you don't get trashed by inter-die latency.
 
What "lower L2 latency"?
The SOG specifies that all Family 17h CPUs (which includes Raven and Pinnacle Ridge as well) have L2 latency of "no less than 12 cycles".

Given that on Zeppelin the caches have higher Fmax than the actual CPU cores (due to a process limit) it might be wise to lower the cache latency.
However doing so would pretty much imply that significantly higher Fmax shouldn't be expected either.
 
What "lower L2 latency"?
The SOG specifies that all Family 17h CPUs (which includes Raven and Pinnacle Ridge as well) have L2 latency of "no less than 12 cycles".

Given that on Zeppelin the caches have higher Fmax than the actual CPU cores (due to a process limit) it might be wise to lower the cache latency.
However doing so would pretty much imply that significantly higher Fmax shouldn't be expected either.
While it specifies no less than 12 cycles, that holds for all parts other than desktop Ryzen, which for some reason has an L2 latency of 17 cycles.

Raven, EPYC, and Threadripper all have 12 cycles of latency.
 
While it specifies no less than 12 cycles, that holds for all parts other than desktop Ryzen, which for some reason has an L2 latency of 17 cycles.

Raven, EPYC, and Threadripper all have 12 cycles of latency.

You do realize that it is impossible to have different L2 latency on identical silicon (AM4 Ryzen and TR) and identical microcode, with the socket being the only difference?
 
You do realize that it is impossible to have different L2 latency on identical silicon (AM4 Ryzen and TR) and identical microcode, with the socket being the only difference?
Well, yes, I'm fairly certain it's down to microcode. I'm not saying it's a hardware difference.
 
@Stilt
Not that I'm doubting you, but what do you think is the reason for using a different revision for EPYC versus Ryzen/Threadripper? I hear that from a lot of people and I can't figure out why AMD would do it. If there were bugs that needed to be fixed and they fixed them with a revision before launching EPYC, doesn't it make sense to migrate Ryzen/Threadripper to the new revision as well? Have you heard why they are using a different revision for EPYC?
 
While it specifies no less than 12 cycles, that holds for all parts other than desktop Ryzen, which for some reason has an L2 latency of 17 cycles.

Not that I have a dog in this fight, but isn't 17 no less than 12?
 
@Stilt
Not that I'm doubting you, but what do you think is the reason for using a different revision for EPYC versus Ryzen/Threadripper?

I would imagine that Ryzen and Threadripper had already been taped out, so it was "too late" to change those. Better to sell them off and update later.
 
@Stilt
Not that I'm doubting you, but what do you think is the reason for using a different revision for EPYC versus Ryzen/Threadripper?

Two different theories: A) the "ZP-B2" stepping is just a marketing trick to create the illusion that there is an actual difference between the consumer (B1) and server silicon, or B) ZP-B1 contains some sort of xGMI-related errata, which had to be fixed for the server parts (due to the 2P support).

I'm pretty sure it's theory A), since ZP-B1 die stockpiles should have been exhausted long ago, and there is no point whatsoever in producing two different revisions at the same time. Yet we've not seen any AM4 ZP-B2 parts in the wild.
 