Ryzen: Strictly technical

moinmoin · Aug 21, 2017

Timur Born said:
Surprise, surprise. Turns out that CCX bottlenecks can be objectively measured as increased CPU load!

Nice stats, though is the above meant to be read as sarcasm? Since CPU waiting for IO is always part of the overall CPU load, the inter-CCX bottleneck is just one place where IO can lead to the core waiting for data. Under Linux the CPU load spent for waiting for IO is commonly available as separate stat.

CHADBOGA · Aug 21, 2017

.vodka said:
Nah, it wasn't poor old juan who single handedly made all of this happen. He sure ranted pages and pages on this, as he does on everything AMD. I wonder why isn't he here to spread his word.

I have done my best to bring the Internet Strongman to Anandtech forums.

I stated at least twice on RWT forums that these forums are a proving ground for which he can test his insights and he just ignored me.

I sent him two tweets from different accounts on Twitter, saying he should come to these forums and he blocked me on both occasions.

Timur Born · Aug 23, 2017

I took a much closer look at those "freezes" again that happen for a few seconds under certain workloads, usually CPU stress tests, but I see it happen under real practical workload. I use to call them stalls, but they are the same thing that other people call freezes for what we talk about. Even if only a single logical core is (over)loaded with certain workload then both the GUI/graphic output and keyboard/mouse input are suspended anywhere between a short blip and several seconds (I measured up to around 4). Additionally some but not all processing seems to stop during that time, which I reported wrongly before when I claimed that the whole system stops.

First of all, both the suspension of graphic output and the partly ongoing of background processing can be measured! It's important to note that time-counters seem to roll on regardless of any stalls, which in turn allows software to keep measuring average CPU load, CPU cycles (+delta) and context switches (+delta) and frames per second. The CPU cycles Delta is especially interesting, because it tells us whether a program interrupts processing during a stall or not.

What Delta means is that when you measure over the span of a second then the Delta is the amount of cycles that happened during that second. If a program interrupted its processing during a stall then the CPU cycles Delta decreases on the very next tick right after a stall. If a program kept processing uninterrupted then the Delta increases on the very next tick right after the stall.

Programs that interrupt their processing include: WinRar, 7-Zip, Foobar2000, Firefox (Youtube HTML video output), Furmark
Programs that do not interrupt their processing include: Ableton Live, MediaPlayerClassic, HWinfo

Both WinRar's and 7-Zip's benchmark throughput drop considerably during stalls. WinRar seems to especially dislike my Reaper based workload, or rather the other way around, as WinRar interrupts Reaper even more.

Audio and Video are two very special cases that justify some extra explanation.

Audio:

Audio drivers do not stall, regardless of the audio buffer size being used. I ran a RME Babyface USB ASIO drivers (isochronous USB transfer) at less than 2 ms buffer size without interruption whatsoever while my mouse kept stalling over the very same USB port + hub. If the application keeps processing data (Ableton Live) you can run input audio from the USB audio interface to the application and then back to the audio interface completely uninterrupted while the rest of your system is nearly unusable.

If the application does interrupt its processing during a stall then the size of the application's own audio buffer decides over whether your audio stream gets interrupted or not. For example, if you set Foobar's own audio buffer to a size larger than the stalls (bigger than 4 seconds is good) then you get no audio interruptions. And if stalls are shorter than what Firefox buffers for Youtube playback then you get no interruption. This is because with larger buffers the audio data has already been processed before the stall happens and it seems that the program parts that just shovel the data to the audio driver do not get interrupted during a stall.

Video:

Video playback and graphic output always get interrupted, which in turn results in GPU load and Video Engine load dropping considerably, including the GPU frequency and temperature dropping. Since timers keep rolling the video/graphic output will jump forward to match the new time-frame (a 5 second stall means that the video jumps 5 seconds forward). For Youtube videos in Firefox this is a true jump, for videos in MediaPlayerClassic this is a fast forward that shortly increases the frame-rate (while maintaining the refresh-rate, because disabling VSync doesn't seem to work properly). Both Firefox Youtube and Furmark also see their average CPU load drop because of the stalls that interrupt their processing, even though they do maintain their time-lines in form of a straight jump.

Interestingly HWInfo's graph display works similarly to how Firefox Youtube playback and Furmark works. The graph will do a full jump to the new time-frame instantly after a stall instead of drawing the in-between measurements. If you mouse-over the graph you can see several seconds being absent corresponding to the stall time.

moinmoin · Aug 27, 2017

At Hot Chips AMD gave more information about IF links between dies. Every Zeppelin die has 4 IF links, of which even in Epyc only 3 are used based on positioning on the package to keep trace lengths short.

They also stated that the cost of the MCM approach is 59% the cost of a hypothetical monolithic Epyc chip, including a 10% area overhead for the MCM dies.

Initial reporting: https://www.servethehome.com/amd-epyc-infinity-fabric-update-mcm-cost-savings/
More and better slides: http://www.tomshardware.com/news/amd-threadripper-epyc-mcm-cost,35306.html
Though I'd appreciate if anybody shares a more complete set of the slides if available.

deadhand · Oct 3, 2017

Hi everyone,

Just to update on a post I made earlier this year:
https://forums.anandtech.com/threads/ryzen-strictly-technical.2500572/page-9#post-38776310

I took a crack at solving this issue and the problem was indeed due to false sharing. CS: Go uses a very similar lightmap baker to the SDK 2013 branch of Valve's Source engine, so it's evident that the issue is present there as well.

Here are the results of the false sharing removed vs. original:

(Note: CPU is a Ryzen Threadripper - affinity masks are set to 8 threads, 16 threads, and 32 threads respectively to simulate 1 CCX, 2 CCX, and 4 CCX processors.)
The negative scaling I experienced with the dual Xeon tests (machine I previously used) is also eliminated (not shown below).

Lower is better!

Here is the pull request with the fix: (Literally two lines of code)
https://github.com/ValveSoftware/source-sdk-2013/pull/436

It seems that AMD CPUs, particularly AMD FX CPUs but also Ryzen are much more susceptible to the effects of false sharing than Intel CPUs.

Also, I'd like to point out that 'vrad' has been used to show off the benefits of multi-core processors in the past and as a benchmark:

https://www.anandtech.com/show/2489/11

It's very likely that this issue was present, even back then.

EDIT: Also, the comparatively poor scaling from 16 to 32 threads is likely due to SMT and Amdahl's law (or, rather, that the rest of the code that used to take a comparatively small portion of the time now takes a relatively long time and seems to also have some scaling issues.)

The speedup in the fixed section of the code is likely a fair bit higher than what's shown here.

tamz_msc · Oct 6, 2017

deadhand said:
Hi everyone,

Just to update on a post I made earlier this year:
https://forums.anandtech.com/threads/ryzen-strictly-technical.2500572/page-9#post-38776310

I took a crack at solving this issue and the problem was indeed due to false sharing. CS: Go uses a very similar lightmap baker to the SDK 2013 branch of Valve's Source engine, so it's evident that the issue is present there as well.

Here are the results of the false sharing removed vs. original:

(Note: CPU is a Ryzen Threadripper - affinity masks are set to 8 threads, 16 threads, and 32 threads respectively to simulate 1 CCX, 2 CCX, and 4 CCX processors.)
The negative scaling I experienced with the dual Xeon tests (machine I previously used) is also eliminated (not shown below).

Lower is better!

Here is the pull request with the fix: (Literally two lines of code)
https://github.com/ValveSoftware/source-sdk-2013/pull/436

It seems that AMD CPUs, particularly AMD FX CPUs but also Ryzen are much more susceptible to the effects of false sharing than Intel CPUs.

Also, I'd like to point out that 'vrad' has been used to show off the benefits of multi-core processors in the past and as a benchmark:

https://www.anandtech.com/show/2489/11

It's very likely that this issue was present, even back then.

EDIT: Also, the comparatively poor scaling from 16 to 32 threads is likely due to SMT and Amdahl's law (or, rather, that the rest of the code that used to take a comparatively small portion of the time now takes a relatively long time and seems to also have some scaling issues.)

The speedup in the fixed section of the code is likely a fair bit higher than what's shown here.

Very nice work! One question though - was this done with NUMA disabled or enabled? Are there any performance difference, if any, between the two. Interested to know since last time you commented that this is a memory bound situation.

CatMerc · Oct 6, 2017

@The Stilt I have a question. Let's say I buy an AM4 motherboard and a Summit Ridge CPU now, and then later replace that Summit Ridge with a Pinnacle Ridge.
Does the chipset affect memory latency at all? I'd be completely satisfied with X370 I/O wise, but if waiting for X470 would mean getting lower memory latency I'd do it.

Basically I'm asking if pairing Pinnacle Ridge with X470 would give performance advantages over X370.

LightningZ71 · Oct 6, 2017

From what little seems to be leaking onto the internet, it seems that the biggest change between X370 and X470 will be the upgrade of the chipset<=>Processor link to PCI-E 4 from the existing version 3. This will likely result in a bit less bandwidth contention between all of the chipset connected devices for processor and DMA transactions. I suspect that there may be some other modernization of the USB 3.X vX interfaces too, but nothing drastic. Memory latency between the processor and the RAM will be almost entirely based on the supported speeds and the DDR controller on the chipset, which may improve due to minor tweaks during the generation change. That wouldn't be any different between chipset generations. The only other possible difference would be any modifications that the mobo makers may make to the board layouts themselves to assist in running the DRAM faster, but, I feel that there is very little that can be done on that front.

The Stilt · Oct 6, 2017

CatMerc said:
@The Stilt I have a question. Let's say I buy an AM4 motherboard and a Summit Ridge CPU now, and then later replace that Summit Ridge with a Pinnacle Ridge.
Does the chipset affect memory latency at all? I'd be completely satisfied with X370 I/O wise, but if waiting for X470 would mean getting lower memory latency I'd do it.

Basically I'm asking if pairing Pinnacle Ridge with X470 would give performance advantages over X370.

No.
The chipset just provides additional IO to the CPU, similar to e.g. PCI-E to USB3 peripheral controller would.
The chipset isn't required for the CPU to function and on Crosshair VI Hero you can actually completely disable the external PCH.

IRobot23 · Oct 7, 2017

Could AMD improve DDR4 latency with pinnacle?

PhonakV30 · Oct 7, 2017

IRobot23 said:
Could AMD improve DDR4 latency with pinnacle?

I think depend on Infinity Fabric?

deadhand · Oct 7, 2017

tamz_msc said:
Very nice work! One question though - was this done with NUMA disabled or enabled? Are there any performance difference, if any, between the two. Interested to know since last time you commented that this is a memory bound situation.

The problem was essentially what I described here: (except it's technically not really 'false sharing' in this case, though the effects are identical)
https://forums.anandtech.com/threads/ryzen-strictly-technical.2500572/page-23#post-38790480

I initially thought it might have been bandwidth related but after reducing the scene size to something extremely small (less than a meg) and increasing sample counts to prevent sub 1-second execution time on my test scene (in this application scene size controls sample counts, but that sample count can be increased by a user defined factor), I came to the conclusion that there was no way any bandwidth issues were related, as the throughput was not improving at all and the thread scaling wasn't either (in terms of light sampling - the number of samples per second were essentially identical regardless of scene size when offset by the sample count scaling).

You can think of false sharing as being a 'serialized portion' of code in Amdahl's law, where it gets worse with more threads. When threads are false sharing between CCX's (or between sockets), the latency in the cache update is worse (than within a CCX) and thus threads waiting for the cache line to go back into a shared state end up waiting longer than if they were within a single CCX.

The most interesting thing about this to me is that the effects of false sharing seem much less significant on something like an i7-7700k than AMD CPUs, even with a hypothetical single CCX Ryzen quad core. The speedup of my fix on this program was relatively minor on an i7-7700k (~38% vs. the hypothetical single CCX Ryzen @ 56%).

This was of course a really extreme case, but it might be responsible for some thread scaling issues in some other applications, and it's quite a fixable problem (if it's found... that's the hard part)

CatMerc · Oct 7, 2017

deadhand said:
The problem was essentially what I described here: (except it's technically not really 'false sharing' in this case, though the effects are identical)
https://forums.anandtech.com/threads/ryzen-strictly-technical.2500572/page-23#post-38790480

I initially thought it might have been bandwidth related but after reducing the scene size to something extremely small (less than a meg) and increasing sample counts to prevent sub 1-second execution time on my test scene (in this application scene size controls sample counts, but that sample count can be increased by a user defined factor), I came to the conclusion that there was no way any bandwidth issues were related, as the throughput was not improving at all and the thread scaling wasn't either (in terms of light sampling - the number of samples per second were essentially identical regardless of scene size when offset by the sample count scaling).

You can think of false sharing as being a 'serialized portion' of code in Amdahl's law, where it gets worse with more threads. When threads are false sharing between CCX's (or between sockets), the latency in the cache update is worse (than within a CCX) and thus threads waiting for the cache line to go back into a shared state end up waiting longer than if they were within a single CCX.

The most interesting thing about this to me is that the effects of false sharing seem much less significant on something like an i7-7700k than AMD CPUs, even with a hypothetical single CCX Ryzen quad core. The speedup of my fix on this program was relatively minor on an i7-7700k (~38% vs. the hypothetical single CCX Ryzen @ 56%).

This was of course a really extreme case, but it might be responsible for some thread scaling issues in some other applications, and it's quite a fixable problem (if it's found... that's the hard part)

Wouldn't the lower L3 latency on Skylake account for having a lesser penalty from false sharing?

tamz_msc · Oct 7, 2017

L3 being inclusive might also be a factor in favor of Skylake.

deadhand · Oct 7, 2017

tamz_msc said:
L3 being inclusive might also be a factor in favor of Skylake.

Since Ryzen's L3 stores shadow tags to data in the L2 caches of the cores within a CCX, it might be argued that it sort of behaves a little bit like an inclusive cache. (If there's an L3 miss it checks the shadow tags to see if the data might be in another core's L2 cache before (accessing memory?), from what I understand)

CatMerc said:
Wouldn't the lower L3 latency on Skylake account for having a lesser penalty from false sharing?

I think so, yes, but the inter-CCX issue seems to make it much worse.

IRobot23 · Oct 8, 2017

Does anyone knows how will 12nm effect Ryzen pinnacle ridge?

DrMrLordX · Oct 8, 2017

No clue. It depends on whether it will be a straight die shrink, or whether Pinnacle Ridge will have any fundamental design changes vs Summit Ridge.

maddie · Oct 8, 2017

IRobot23 said:
Does anyone knows how will 12nm effect Ryzen pinnacle ridge?

AMD using 12nm has the opportunity to either lower power usage, increase clocks or a combination of the two. As the present ecosystem of motherboards, coolers, etc are configured for present wattage CPUs, namely 65W & 95W, I believe that they will go for a clock increase as they are already power efficient but need to close the performance ST gap with Intel on the client side.

Realize that 12nm should use less power if clocks are the same as present, so they can also have lower power server products in non highly stressed server farms, using the same die clocked as at present.

All of this is separate to any improvements in the die layout to improve IPC.

AFAIK, density is minimally improved.

DrMrLordX · Oct 8, 2017

If I recall correctly, AMD won't update their Epyc lineup on 12nm anyway (or if they do, it comes later than Pinnacle Ridge), so I'm not sure that they're going to make many moves aimed at server farms.

IRobot23 · Oct 9, 2017

maddie said:
AMD using 12nm has the opportunity to either lower power usage, increase clocks or a combination of the two. As the present ecosystem of motherboards, coolers, etc are configured for present wattage CPUs, namely 65W & 95W, I believe that they will go for a clock increase as they are already power efficient but need to close the performance ST gap with Intel on the client side.

Realize that 12nm should use less power if clocks are the same as present, so they can also have lower power server products in non highly stressed server farms, using the same die clocked as at present.

All of this is separate to any improvements in the die layout to improve IPC.

AFAIK, density is minimally improved.

Thanks for reply.
Since DF speeds is biggest problem to with ryzen memory latency. Some say that DF speed should be at 1:1 (DDR4 MT/s), other say that locked around 2GHz should be enough. Since I do not know much about ryzen DF except speed and bandwidth and basically it is new NB.

Would certainly like to see Ryzen core clock = DF (1200MHz with 3200MHz DDR4) vs i7 8700K 1200MHz + NB 1200MHz with DDR4 3200. Anyone did that kind of test? To see how well does ryzen CCX technology at lower clocks?

Could AMD simply improve DF bandwidth? More B/Cycle?

moinmoin · Oct 9, 2017

DrMrLordX said:
No clue. It depends on whether it will be a straight die shrink, or whether Pinnacle Ridge will have any fundamental design changes vs Summit Ridge.

DrMrLordX said:
If I recall correctly, AMD won't update their Epyc lineup on 12nm anyway (or if they do, it comes later than Pinnacle Ridge), so I'm not sure that they're going to make many moves aimed at server farms.

Indeed, Epyc/Threadripper will be updated with Zen 2 again. So for 12LP/Zen+/Pinnacle Ridge (likely also Raven Ridge, but the silence there is deafening) they will remove the now superfluous parts of the uncore, like the multi-die multi-sockets interfaces etc. I'd expect some low hanging fruits fixes on the cores and IMC as well, but there the actual design improvements should come with Zen 2.

raghu78 · Oct 9, 2017

IRobot23 said:
Thanks for reply.
Since DF speeds is biggest problem to with ryzen memory latency. Some say that DF speed should be at 1:1 (DDR4 MT/s), other say that locked around 2GHz should be enough. Since I do not know much about ryzen DF except speed and bandwidth and basically it is new NB.

Would certainly like to see Ryzen core clock = DF (1200MHz with 3200MHz DDR4) vs i7 8700K 1200MHz + NB 1200MHz with DDR4 3200. Anyone did that kind of test? To see how well does ryzen CCX technology at lower clocks?

Could AMD simply improve DF bandwidth? More B/Cycle?

AMD could definitely improve DF speeds but they have to balance the power increase with performance gains. I think there is a chance that Pinnacle Ridge might support DDR4 4000+ speeds and thus have higher DF speeds. I would like to see a fixed DF speed of 2-2.4 Ghz on Pinnacle Ridge if its possible. With 7nm Zen 2 AMD could push the DF speeds in the 3 Ghz range. AMD has quite a few levers to work with to improve performance.

IRobot23 · Oct 10, 2017

Well, DF has good bandwidth for those clocks already. Since its also meant for server market I assume that they are going to stick for low power. For desktop they could optimized higher clocks.
1600MHz+ would be "killer".

raghu78 · Oct 10, 2017

IRobot23 said:
Well, DF has good bandwidth for those clocks already. Since its also meant for server market I assume that they are going to stick for low power. For desktop they could optimized higher clocks.
1600MHz+ would be "killer".

https://semiaccurate.com/2017/05/17/amds-details-epyc-server-ambitions/

EPYC will remain on 14LPP . Ryzen and most likely ThreadRipper will get a update on 12LP as it really needs higher clock frequency to compete with Coffeelake. Since Pinnacle Ridge dies are desktop only AMD could tweak fabric speeds specifically for reducing memory latency and improving gaming performance.

DrMrLordX · Oct 10, 2017

Improving DF speeds will also improve interCCX communication (thread jumping) and reduce the effects of "false sharing". Which would be nice for those titles that inexplicably run like crap on Ryzen.

Ryzen: Strictly technical

Diamond Member

Platinum Member

Senior member

Diamond Member

Junior Member

Diamond Member

Golden Member

Platinum Member

Golden Member

Senior member

Senior member

Junior Member

Golden Member

Diamond Member

Junior Member

Senior member

Lifer

Diamond Member

Lifer

Senior member

Diamond Member

Diamond Member

Senior member

Diamond Member

Lifer