Ryzen: Strictly technical


ryzenmaster

Member
Mar 19, 2017
Not sure if this has been done before, but I got bored and decided to do a bit of benchmarking of memory bus congestion on a second-gen Ryzen 2700X.

The setup is simple: I have n threads, each allocating a 64MB buffer. The threads are synchronized so that they all start an in-memory copy of their 64MB buffer simultaneously, temporarily generating load. I then time how long it takes each thread to copy its region of memory and take the mean over multiple runs.
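Roughly, the harness looks like this (a simplified C++ sketch, not my exact code; thread pinning for the CCX runs and the multi-run averaging are left out):

```cpp
// n threads each copy a private 64MB buffer at the same time; each copy is timed.
#include <atomic>
#include <chrono>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <thread>
#include <vector>

int main(int argc, char** argv) {
    const int n = (argc > 1) ? std::atoi(argv[1]) : 8;  // thread count
    const size_t kBufSize = 64u * 1024 * 1024;          // 64MB per thread
    std::atomic<int> ready{0};
    std::atomic<bool> go{false};
    std::vector<double> seconds(n);
    std::vector<std::thread> threads;

    for (int i = 0; i < n; ++i) {
        threads.emplace_back([&, i] {
            std::vector<char> src(kBufSize, 1), dst(kBufSize, 0);
            ready.fetch_add(1);
            while (!go.load()) { /* spin until every thread has allocated */ }
            auto t0 = std::chrono::steady_clock::now();
            std::memcpy(dst.data(), src.data(), kBufSize);  // the timed copy
            auto t1 = std::chrono::steady_clock::now();
            seconds[i] = std::chrono::duration<double>(t1 - t0).count();
        });
    }
    while (ready.load() < n) { /* wait for all threads to finish setup */ }
    go.store(true);  // release all threads at once
    for (auto& t : threads) t.join();

    double sum = 0;
    for (double s : seconds) sum += s;
    std::cout << "mean copy time: " << (sum / n) * 1e3 << " ms\n";
}
```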

[Chart: mean 64MB copy time vs. thread count at DDR4-2133, 2666 and 2933, CL12]


As you can see from the chart, I tested up to 8 threads at speeds of 2133, 2666 and 2933 at CL12. I also tested CL16, but it hardly makes a difference here. Another thing I tested was whether it matters if I keep the load on a single CCX or balance it between the two; that didn't really seem to make a difference either.

So what can we take away from this? Well, for most of you it won't come as a surprise that with more cores you need faster RAM to keep feeding them under load. I guess the most important takeaway is to avoid 2133MHz RAM even on quad-core Ryzens.

tl;dr: buy fast RAM.
 

Shivansps

Diamond Member
Sep 11, 2013
That is what has me worried when people mention the possibility of an 8-core CCX, or a single die with four 4-core CCXs and two memory controllers. On top of the latency, I don't think there is fast enough memory in the world to feed that.
 

Despoiler

Golden Member
Nov 10, 2007
That is what has me worried when people mention the possibility of an 8-core CCX, or a single die with four 4-core CCXs and two memory controllers. On top of the latency, I don't think there is fast enough memory in the world to feed that.

Maybe AMD's plans are to align Zen 2 with DDR5 availability.
 

LightningZ71

Golden Member
Mar 10, 2017
Not sure if this has been done before, but I got bored and decided to do a bit of benchmarking of memory bus congestion on a second-gen Ryzen 2700X.

The setup is simple: I have n threads, each allocating a 64MB buffer. The threads are synchronized so that they all start an in-memory copy of their 64MB buffer simultaneously, temporarily generating load. I then time how long it takes each thread to copy its region of memory and take the mean over multiple runs.

[Chart: mean 64MB copy time vs. thread count at DDR4-2133, 2666 and 2933, CL12]

As you can see from the chart, I tested up to 8 threads at speeds of 2133, 2666 and 2933 at CL12. I also tested CL16, but it hardly makes a difference here. Another thing I tested was whether it matters if I keep the load on a single CCX or balance it between the two; that didn't really seem to make a difference either.

So what can we take away from this? Well, for most of you it won't come as a surprise that with more cores you need faster RAM to keep feeding them under load. I guess the most important takeaway is to avoid 2133MHz RAM even on quad-core Ryzens.

tl;dr: buy fast RAM.

While I respect your work on this test, I must point out that this is assuredly a worst-case scenario for the processor. The working set for each thread exceeds the L3 cache, and all threads are basically running non-stop memory accesses. Even in trivial real-world cases, there is work being done on the data once it's loaded, and then it is stored or reused. So while your test does show that in worst-case scenarios there will absolutely be memory bandwidth contention, it isn't strictly representative of real-world loads.

In addition, once you start adding in the delay of reads and writes alternating on the memory bus, you start to expose access latency as a performance determinant. That difference between CL12 and CL16 will then become relevant. In more practical tests, it has been shown that the performance uplift from faster RAM while holding the memory access latency constant (in nanoseconds, not cycles) is sublinear. This means the processor is not heavily memory bandwidth bound (though it is still influenced by it).
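To make the compute-vs-bandwidth point concrete, here is a toy sketch (my own illustration, not anyone's published benchmark; the constants are arbitrary). It sweeps the same 64MB working set while varying the multiply-adds done per element: with zero work per element it behaves much like the OP's copy test, and as per-element work grows the runtime becomes compute-bound, so RAM speed and contention matter progressively less.

```cpp
#include <chrono>
#include <iostream>
#include <vector>

// Touch every element once, doing `work` multiply-adds per element.
double process(std::vector<double>& data, int work) {
    double acc = 0.0;
    for (double x : data) {
        double v = x;
        for (int w = 0; w < work; ++w) v = v * 1.000001 + 0.5;  // compute per byte loaded
        acc += v;
    }
    return acc;
}

int main() {
    std::vector<double> data(8u * 1024 * 1024, 1.0);  // 64MB working set, larger than L3
    for (int work : {0, 4, 16, 64}) {
        auto t0 = std::chrono::steady_clock::now();
        volatile double sink = process(data, work);  // volatile keeps the work from being optimized out
        auto t1 = std::chrono::steady_clock::now();
        (void)sink;
        std::cout << work << " FMAs/element: "
                  << std::chrono::duration<double>(t1 - t0).count() * 1e3 << " ms\n";
    }
}
```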

In any case, though, more bandwidth and lower latencies are good. It just matters what the actual program is really doing.
 

Cerb

Elite Member
Aug 26, 2000
That is what has me worried when people mention the possibility of an 8-core CCX, or a single die with four 4-core CCXs and two memory controllers. On top of the latency, I don't think there is fast enough memory in the world to feed that.
Not that big a deal, really. In RAM-limited scenarios, HEDT will perform much faster; 16C32T on a mainstream desktop wouldn't obsolete 16C32T on HEDT. Likewise, Intel isn't going to be too far behind if they keep bumping up core counts.
Maybe AMD's plans are to align Zen 2 with DDR5 availability.
Zen 2 looks to be coming out way before that, on DDR4. But a lot of people will have no issues. Go find reviews of Intel's HEDT chips running with just dual-channel RAM, for instance. Scenarios where 3-4 channels come in handy certainly exist, but most of the time two are fine, given that people are usually working in one program with a lot of memory shared between threads, or the program at hand can churn on the cache for a bit in each thread. Even when scaling due to RAM becomes an issue, a $150-300 12-core or 16-core without the more expensive HEDT board would likely still be a good value for gaming, workstation, and distributed computing use, without cannibalizing the niche but high-profit-margin HEDT market.
 

ryzenmaster

Member
Mar 19, 2017
While I respect your work on this test, I must point out that this is assuredly a worst-case scenario for the processor. The working set for each thread exceeds the L3 cache, and all threads are basically running non-stop memory accesses. Even in trivial real-world cases, there is work being done on the data once it's loaded, and then it is stored or reused. So while your test does show that in worst-case scenarios there will absolutely be memory bandwidth contention, it isn't strictly representative of real-world loads.

In addition, once you start adding in the delay of reads and writes alternating on the memory bus, you start to expose access latency as a performance determinant. That difference between CL12 and CL16 will then become relevant. In more practical tests, it has been shown that the performance uplift from faster RAM while holding the memory access latency constant (in nanoseconds, not cycles) is sublinear. This means the processor is not heavily memory bandwidth bound (though it is still influenced by it).

In any case, though, more bandwidth and lower latencies are good. It just matters what the actual program is really doing.

Absolutely, it needs to be made clear this isn't a typical workload; the point was specifically to test for memory congestion. As contention goes up with the number of cores, though, it brings up an interesting question: how will AMD deal with it in future Zen iterations? If they are going to increase the core count on mainstream chips, then I suspect they're going to increase L3 size as well. It's either that or add more memory channels... which of course brings us to the upcoming 32-core Threadripper. If they are releasing a quad-die TR with only quad-channel RAM, it's not going to scale anywhere near as well as EPYC. Not just because only two of the dies have local memory access, but also due to increased memory contention.

One real-world scenario where you would expect lots of memory traffic is gaming. Copying data to/from the GPU is bound to keep the channels busy. I would really love to see some tests on how much of the performance improvement from faster RAM in gaming is due to increased IF speeds and how much is due to memory contention being compensated for by raw bandwidth. Here's hoping someone more familiar with the subject will pick it up.
 

moinmoin

Diamond Member
Jun 1, 2017
If they are going to increase the core count on mainstream chips, then I suspect they're going to increase L3 size as well.
Unless they re-balance the cache sizes in Zen 2 (like adding an L4$), I'd expect 8MB more L3$ with every additional CCX.
 
May 11, 2008
I am happy.
This was at 2933MHz, testing with Intel's Memory Latency Checker.
Intel(R) Memory Latency Checker - v3.5
Measuring idle latencies (in ns)...
Memory node 0 (Socket 0): 89.2

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads : 44364.0
3:1 Reads-Writes : 34486.9
2:1 Reads-Writes : 32615.3
1:1 Reads-Writes : 24145.2
Stream-triad like: 37302.3

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Memory node 0 (Socket 0): 44400.8

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject Delay    Latency (ns)    Bandwidth (MB/sec)
==================================================
00000 134.76 43856.9
00002 134.61 43854.5
00008 131.12 43876.3
00015 128.84 43855.2
00050 105.81 43521.7
00100 91.85 33567.1
00200 87.24 20049.5
00300 84.10 14393.7
00400 82.17 11326.2
00500 80.60 9389.6
00700 79.21 7078.9
01000 77.92 5283.9
01300 77.27 4292.9
01700 76.79 3498.2
02500 76.14 2667.5
03500 75.94 2152.9
05000 75.39 1768.7
09000 74.66 1369.6
20000 74.51 1090.0

Measuring cache-to-cache transfer latency (in ns)...
Using small pages for allocating buffers
Local Socket L2->L2 HIT latency 20.5
Local Socket L2->L2 HITM latency 34.8

Now it is all, of course, improved a bit when running at 3200MHz.
But I do not understand that first idle latency number.

Intel(R) Memory Latency Checker - v3.5
Measuring idle latencies (in ns)...
Memory node 0 (Socket 0): 90.3

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads : 46938.7
3:1 Reads-Writes : 34587.3
2:1 Reads-Writes : 30554.1
1:1 Reads-Writes : 25058.1
Stream-triad like: 38166.4

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Memory node 0 (Socket 0): 47540.2

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject Delay    Latency (ns)    Bandwidth (MB/sec)
==================================================
00000 126.47 46703.5
00002 125.72 46766.6
00008 122.28 46786.5
00015 120.17 46780.9
00050 97.33 46057.0
00100 86.33 33382.5
00200 81.87 19985.4
00300 79.16 14391.4
00400 77.38 11343.7
00500 76.27 9413.3
00700 74.76 7114.2
01000 73.81 5320.4
01300 73.18 4330.3
01700 72.76 3541.3
02500 72.22 2709.3
03500 71.92 2196.7
05000 71.54 1812.0
09000 71.10 1411.9
20000 70.60 1137.0

Measuring cache-to-cache transfer latency (in ns)...
Using small pages for allocating buffers
Local Socket L2->L2 HIT latency 20.6
Local Socket L2->L2 HITM latency 31.7

I do wish there was an el cheapo version of AIDA64.
And I wish AMD would come out with a memory latency checker of their own.
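For what it's worth, that first idle number is normally obtained with a dependent pointer chase, so a bare-bones version is only a few lines. A sketch of the idea (my illustration, not Intel's actual implementation; with small pages the result will read a bit higher than MLC's because TLB misses are included):

```cpp
#include <chrono>
#include <iostream>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main() {
    const size_t kEntries = 16u * 1024 * 1024;  // 16M size_t entries = 128MB, well past L3
    std::vector<size_t> next(kEntries);
    std::iota(next.begin(), next.end(), size_t{0});

    // Sattolo's algorithm builds a single-cycle permutation, so the chase
    // walks the whole buffer instead of getting stuck in a short loop.
    std::mt19937_64 rng{42};
    for (size_t i = kEntries - 1; i > 0; --i) {
        std::uniform_int_distribution<size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }

    const size_t kLoads = 20'000'000;
    size_t idx = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < kLoads; ++i) idx = next[idx];  // each load depends on the last
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::cout << "approx idle latency: " << ns / kLoads
              << " ns (checksum " << idx << ")\n";  // print idx so the loop isn't optimized away
}
```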
 

Hans Gruber

Platinum Member
Dec 23, 2006
I was thinking that overclocking RAM, or getting speeds of 3200MHz or more, was motherboard-dependent. I'm not sure what memory sticks I have, but 3200MHz became a non-issue with motherboard BIOS revisions.
 

Jan Olšan

Senior member
Jan 12, 2017
Apparently Lisa Su confirmed in the conference call for the Q2 results that they are making Epyc 2 at TSMC. Crazy...

I wonder if this means separate dies for Epyc and Ryzen, because AMD still needs to meet its mandatory wafer orders from GloFo, and they won't be able to if they don't mass-produce AM4 Ryzens there.

So, Matt, on your first question relative to the manufacturing of the second generation of EPYC, so as I said earlier, we are working with both the TSMC and GLOBALFOUNDRIES in 7-nanometer. As for the 7-nanometer Rome that we're currently sampling, that's being manufactured at TSMC.

Rome = Epyc 2

https://seekingalpha.com/article/41...-amd-q2-2018-results-earnings-call-transcript
 

NTMBK

Lifer
Nov 14, 2011
Apparently Lisa Su confirmed in the conference call for the Q2 results that they are making Epyc 2 at TSMC. Crazy...

I wonder if this means separate dies for Epyc and Ryzen, because AMD still needs to meet its mandatory wafer orders from GloFo, and they won't be able to if they don't mass-produce AM4 Ryzens there.

Potentially... though they might meet the quota with next-gen console APUs and GPUs.
 

Shivansps

Diamond Member
Sep 11, 2013
I have a question about Ryzen 1000-series turbo. Today I ran a test using a fully single-threaded game: an R5 2600 on a Gigabyte A320 board ran the game while maintaining 3900MHz on the core that was running it. In contrast, this is my 1700 on a Gigabyte AB350-Gaming running the same game.

[Screenshot: per-core clocks on the 1700 while running the same game]


I don't remember my 1700 ever boosting past 3200MHz. From what I've seen in tests of the 2200G, 2400G and 2600, turbo seems to be working really well on the 2000 series, but is it really this bad on the 1000 series?
 

moinmoin

Diamond Member
Jun 1, 2017
Turbo on the 1xxx series only kicks in when enough cores are completely idle. Lifting that limitation is indeed a major improvement in the 2xxx series.
 

virpz

Junior Member
Sep 11, 2014
I am a little late here, but I tried the "FIT" consultation described here by The Stilt. For that I disabled C-states, CPB and PBO. Then I tried it again, this time with CPB and PBO enabled.
I took notes of the maximum sustained Vcore and clock by monitoring the clock and "VDDCR_CPU SVI2 TFN" in HWiNFO64. Anything else I need to enable/disable?

CPB/PBO disabled:

Multi-threaded load
Vcore: 1.231V
Clock: 3974MHz

Single-Threaded load
Vcore: 1.300V
Clock: 4100MHz

CPB/PBO enabled:

Multi-threaded load
Vcore: 1.369V
Clock: 4174MHz

Single-Threaded load
Vcore: 1.419V
Clock: 4324MHz


Anyone else want to try this and share their results?
 

virpz

Junior Member
Sep 11, 2014
@The Stilt, can you or anybody else confirm whether AMD has removed the PBO adjustments (PPT, TDC, EDC) in AGESA 1.0.0.4?

Thanks!
 

Jjoshua2

Senior member
Mar 24, 2006
[Screenshot: memory timings at 3466MHz CL14]

I know these are pretty good timings, but what should I work on next? This is a 1950X with 3200 CAS 14 RAM, but it seems to be memory-stable at pretty much whatever I give it. I can't run these timings at 3600 or use CAS 13 here, but what can I do with the subtimings from here? Are these reasonable? Which ones can I try to lower? Maybe just tRFC? I think tRC and tRAS are already as low as they should be according to the formulas.
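By "formulas" I mean the rules of thumb that usually get passed around: tRC >= tRAS + tRP, and tRAS >= tCL + tRCD plus a small burst margin. A quick sanity-check sketch (hypothetical primaries, not copied from my screenshot; treat the margin as a convention, not gospel):

```cpp
#include <iostream>

int main() {
    const int tCL = 14, tRCD = 14, tRP = 14;        // hypothetical primaries for a 3466 C14 setup
    const int burstMargin = 2;                      // commonly quoted slack, 2-4 cycles
    const int tRAS_min = tCL + tRCD + burstMargin;  // 30 with these numbers
    const int tRC_min = tRAS_min + tRP;             // 44 with these numbers
    std::cout << "tRAS >= " << tRAS_min << ", tRC >= " << tRC_min << "\n";
}
```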
 

gupsterg

Junior Member
Mar 4, 2017
I reckon you'd see strongly diminishing returns on any further time/effort spent improving on what you already have.

I've opted for 3400MHz C15 1T on my TR + ZE, as it uses SOC 1.05V / VDIMM 1.35V and has passed many hours of usage/stability testing. Screenshot link: ProcODT 53, CAD bus timings 0/0, 0/0, 0/0, RTT Off/Off/48, CAD bus drive strengths 24, 24, 24, 24.
 

cdimauro

Member
Sep 14, 2016
This is by no means a full-blown review. It just provides some of the more in-depth information, along with some test results.
I know it's been more than two years now, but I would be grateful if you could add more information about the test configurations.

Test configuration for the Ryzen 1800X platform
Motherboard: Crosshair VI Hero
OS: Windows 10
Memory: 2666MHz, but which model(s)?
HD/SSD: which one?

What are the configurations for the other systems (Excavator, Haswell, Kabylake)?

BlackScholes: which implementation was used? Repository? Compilation options?
embree: which compilation options?
gcc 6.3: which compilation options were used for compiling ffmpeg?
linpack: which implementation was used? Repository? Compilation options?
mcrt: which implementation was used? Repository? Compilation options?

Thanks
 

Ajay

Lifer
Jan 8, 2001
I know it's been more than two years now, but I would be grateful if you could add more information about the test configurations.

Test configuration for the Ryzen 1800X platform
Motherboard: Crosshair VI Hero
OS: Windows 10
Memory: 2666MHz, but which model(s)?
HD/SSD: which one?

What are the configurations for the other systems (Excavator, Haswell, Kabylake)?

BlackScholes: which implementation was used? Repository? Compilation options?
embree: which compilation options?
gcc 6.3: which compilation options were used for compiling ffmpeg?
linpack: which implementation was used? Repository? Compilation options?
mcrt: which implementation was used? Repository? Compilation options?

Thanks
The Stilt is no more: "This account has been deactivated."
 