Ryzen: Strictly technical


ryzenmaster

Member
Mar 19, 2017
Not sure if this has been done before, but I got bored and decided to do a bit of benchmarking of memory bus congestion on a second-gen Ryzen 2700X.

The setup is simple: I have n threads, each allocating a 64MB buffer. The threads are synchronized so that they all start an in-memory copy of their 64MB buffer simultaneously, temporarily generating load. I then time how long it takes each thread to copy its region of memory and take the mean over multiple runs.
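Roughly, the harness looks like this (a simplified C++ sketch, not my exact code; thread pinning for the CCX runs and the multi-run averaging are left out):

```cpp
// n threads each copy a private 64MB buffer at the same time; each copy is timed.
#include <atomic>
#include <chrono>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <thread>
#include <vector>

int main(int argc, char** argv) {
    const int n = (argc > 1) ? std::atoi(argv[1]) : 8;  // thread count
    const size_t kBufSize = 64u * 1024 * 1024;          // 64MB per thread
    std::atomic<int> ready{0};
    std::atomic<bool> go{false};
    std::vector<double> seconds(n);
    std::vector<std::thread> threads;

    for (int i = 0; i < n; ++i) {
        threads.emplace_back([&, i] {
            std::vector<char> src(kBufSize, 1), dst(kBufSize, 0);
            ready.fetch_add(1);
            while (!go.load()) { /* spin until every thread has allocated */ }
            auto t0 = std::chrono::steady_clock::now();
            std::memcpy(dst.data(), src.data(), kBufSize);  // the timed copy
            auto t1 = std::chrono::steady_clock::now();
            seconds[i] = std::chrono::duration<double>(t1 - t0).count();
        });
    }
    while (ready.load() < n) { /* wait for all threads to finish setup */ }
    go.store(true);  // release all threads at once
    for (auto& t : threads) t.join();

    double sum = 0;
    for (double s : seconds) sum += s;
    std::cout << "mean copy time: " << (sum / n) * 1e3 << " ms\n";
}
```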

[Chart: mean 64MB copy time vs. thread count at DDR4-2133, 2666 and 2933, CL12]


As you can see from the chart, I tested up to 8 threads at speeds of 2133, 2666 and 2933 at CL12. I also tested CL16, but it hardly makes a difference here. Another thing I tested was whether it matters if I keep the load on a single CCX or balance it between the two; that didn't really seem to make a difference either.

So what can we take away from this? Well, for most of you it won't come as a surprise that with more cores you need faster RAM to keep feeding them under load. I guess the most important takeaway is to avoid 2133MHz RAM even on quad-core Ryzens.

tl;dr: buy fast RAM.
 

Shivansps

Diamond Member
Sep 11, 2013
That is what has me worried when people mention the possibility of an 8-core CCX, or a single die with four 4-core CCXs and two memory controllers. On top of the latency, I don't think there is fast enough memory in the world to feed that.
 

Despoiler

Golden Member
Nov 10, 2007
That is what has me worried when people mention the possibility of an 8-core CCX, or a single die with four 4-core CCXs and two memory controllers. On top of the latency, I don't think there is fast enough memory in the world to feed that.

Maybe AMD's plans are to align Zen 2 with DDR5 availability.
 

LightningZ71

Golden Member
Mar 10, 2017
Not sure if this has been done before, but I got bored and decided to do a bit of benchmarking of memory bus congestion on a second-gen Ryzen 2700X.

The setup is simple: I have n threads, each allocating a 64MB buffer. The threads are synchronized so that they all start an in-memory copy of their 64MB buffer simultaneously, temporarily generating load. I then time how long it takes each thread to copy its region of memory and take the mean over multiple runs.

[Chart: mean 64MB copy time vs. thread count at DDR4-2133, 2666 and 2933, CL12]

As you can see from the chart, I tested up to 8 threads at speeds of 2133, 2666 and 2933 at CL12. I also tested CL16, but it hardly makes a difference here. Another thing I tested was whether it matters if I keep the load on a single CCX or balance it between the two; that didn't really seem to make a difference either.

So what can we take away from this? Well, for most of you it won't come as a surprise that with more cores you need faster RAM to keep feeding them under load. I guess the most important takeaway is to avoid 2133MHz RAM even on quad-core Ryzens.

tl;dr: buy fast RAM.

While I respect your work on this test, I must point out that this is assuredly a worst-case scenario for the processor. The working set for each thread exceeds the L3 cache, and all threads are basically running non-stop memory accesses. Even in trivial real-world cases, there is work being done on the data once it's loaded, and then it is stored or reused. So while your test does show that in worst-case scenarios there will absolutely be memory bandwidth contention, it isn't strictly representative of real-world loads.

In addition, once you start adding in the delay of reads and writes alternating on the memory bus, you start to expose access latency as a performance determinant. That difference between CL12 and CL16 will then become relevant. In more practical tests, it has been shown that the performance uplift from faster RAM while holding the memory access latency constant (in nanoseconds, not cycles) is sublinear. This means the processor is not heavily memory bandwidth bound (though it is still influenced by it).
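To make the compute-vs-bandwidth point concrete, here is a toy sketch (my own illustration, not anyone's published benchmark; the constants are arbitrary). It sweeps the same 64MB working set while varying the multiply-adds done per element: with zero work per element it behaves much like the OP's copy test, and as per-element work grows the runtime becomes compute-bound, so RAM speed and contention matter progressively less.

```cpp
#include <chrono>
#include <iostream>
#include <vector>

// Touch every element once, doing `work` multiply-adds per element.
double process(std::vector<double>& data, int work) {
    double acc = 0.0;
    for (double x : data) {
        double v = x;
        for (int w = 0; w < work; ++w) v = v * 1.000001 + 0.5;  // compute per byte loaded
        acc += v;
    }
    return acc;
}

int main() {
    std::vector<double> data(8u * 1024 * 1024, 1.0);  // 64MB working set, larger than L3
    for (int work : {0, 4, 16, 64}) {
        auto t0 = std::chrono::steady_clock::now();
        volatile double sink = process(data, work);  // volatile keeps the work from being optimized out
        auto t1 = std::chrono::steady_clock::now();
        (void)sink;
        std::cout << work << " FMAs/element: "
                  << std::chrono::duration<double>(t1 - t0).count() * 1e3 << " ms\n";
    }
}
```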

In any case, though, more bandwidth and lower latencies are good. It just matters what the actual program is really doing.
 

Cerb

Elite Member
Aug 26, 2000
That is what has me worried when people mention the possibility of an 8-core CCX, or a single die with four 4-core CCXs and two memory controllers. On top of the latency, I don't think there is fast enough memory in the world to feed that.
Not that big a deal, really. In RAM-limited scenarios, HEDT will perform much faster; 16C32T on a mainstream desktop wouldn't obsolete 16C32T on HEDT. Likewise, Intel isn't going to be too far behind if they keep bumping up core counts.
Maybe AMD's plans are to align Zen 2 with DDR5 availability.
Zen 2 looks to be coming out way before that, on DDR4. But a lot of people will have no issues. Go find reviews of Intel's HEDT chips running with just dual-channel RAM, for instance. Scenarios where 3-4 channels come in handy certainly exist, but most of the time two are fine, given that people are usually working in one program with a lot of memory shared between threads, or the program at hand can churn on the cache for a bit in each thread. Even when scaling due to RAM becomes an issue, a $150-300 12-core or 16-core without the more expensive HEDT board would likely still be a good value for gaming, workstation, and distributed computing use, without cannibalizing the niche but high-profit-margin HEDT market.
 

ryzenmaster

Member
Mar 19, 2017
While I respect your work on this test, I must point out that this is assuredly a worst-case scenario for the processor. The working set for each thread exceeds the L3 cache, and all threads are basically running non-stop memory accesses. Even in trivial real-world cases, there is work being done on the data once it's loaded, and then it is stored or reused. So while your test does show that in worst-case scenarios there will absolutely be memory bandwidth contention, it isn't strictly representative of real-world loads.

In addition, once you start adding in the delay of reads and writes alternating on the memory bus, you start to expose access latency as a performance determinant. That difference between CL12 and CL16 will then become relevant. In more practical tests, it has been shown that the performance uplift from faster RAM while holding the memory access latency constant (in nanoseconds, not cycles) is sublinear. This means the processor is not heavily memory bandwidth bound (though it is still influenced by it).

In any case, though, more bandwidth and lower latencies are good. It just matters what the actual program is really doing.

Absolutely, it needs to be made clear this isn't a typical workload; the point was specifically to test for memory congestion. As contention goes up with the number of cores, though, it brings up an interesting question: how will AMD deal with it in future Zen iterations? If they are going to increase the core count on mainstream chips, then I suspect they're going to increase L3 size as well. It's either that or add more memory channels... which of course brings us to the upcoming 32-core Threadripper. If they are releasing a quad-die TR with only quad-channel RAM, it's not going to scale anywhere near as well as EPYC. Not just because only two of the dies have local memory access, but also due to increased memory contention.

One real-world scenario where you would expect lots of memory traffic is gaming. Copying data to/from the GPU is bound to keep the channels busy. I would really love to see some tests on how much of the performance improvement from faster RAM in gaming is due to increased IF speeds and how much is due to memory contention being compensated for by raw bandwidth. Here's hoping someone more familiar with the subject will pick it up.
 

moinmoin

Diamond Member
Jun 1, 2017
If they are going to increase the core count on mainstream chips, then I suspect they're going to increase L3 size as well.
Unless they re-balance the cache sizes in Zen 2 (like adding an L4$), I'd expect 8MB more L3$ with every additional CCX.
 
May 11, 2008
I am happy.
This was at 2933MHz, testing with Intel's Memory Latency Checker.
Intel(R) Memory Latency Checker - v3.5
Measuring idle latencies (in ns)...
Memory node 0 (Socket 0): 89.2

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads : 44364.0
3:1 Reads-Writes : 34486.9
2:1 Reads-Writes : 32615.3
1:1 Reads-Writes : 24145.2
Stream-triad like: 37302.3

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Memory node 0 (Socket 0): 44400.8

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject Delay    Latency (ns)    Bandwidth (MB/sec)
==================================================
00000 134.76 43856.9
00002 134.61 43854.5
00008 131.12 43876.3
00015 128.84 43855.2
00050 105.81 43521.7
00100 91.85 33567.1
00200 87.24 20049.5
00300 84.10 14393.7
00400 82.17 11326.2
00500 80.60 9389.6
00700 79.21 7078.9
01000 77.92 5283.9
01300 77.27 4292.9
01700 76.79 3498.2
02500 76.14 2667.5
03500 75.94 2152.9
05000 75.39 1768.7
09000 74.66 1369.6
20000 74.51 1090.0

Measuring cache-to-cache transfer latency (in ns)...
Using small pages for allocating buffers
Local Socket L2->L2 HIT latency 20.5
Local Socket L2->L2 HITM latency 34.8

Now it is all, of course, improved a bit when running at 3200MHz.
But I do not understand that first idle latency number.

Intel(R) Memory Latency Checker - v3.5
Measuring idle latencies (in ns)...
Memory node 0 (Socket 0): 90.3

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads : 46938.7
3:1 Reads-Writes : 34587.3
2:1 Reads-Writes : 30554.1
1:1 Reads-Writes : 25058.1
Stream-triad like: 38166.4

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Memory node 0 (Socket 0): 47540.2

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject Delay    Latency (ns)    Bandwidth (MB/sec)
==================================================
00000 126.47 46703.5
00002 125.72 46766.6
00008 122.28 46786.5
00015 120.17 46780.9
00050 97.33 46057.0
00100 86.33 33382.5
00200 81.87 19985.4
00300 79.16 14391.4
00400 77.38 11343.7
00500 76.27 9413.3
00700 74.76 7114.2
01000 73.81 5320.4
01300 73.18 4330.3
01700 72.76 3541.3
02500 72.22 2709.3
03500 71.92 2196.7
05000 71.54 1812.0
09000 71.10 1411.9
20000 70.60 1137.0

Measuring cache-to-cache transfer latency (in ns)...
Using small pages for allocating buffers
Local Socket L2->L2 HIT latency 20.6
Local Socket L2->L2 HITM latency 31.7

I do wish there was an el cheapo version of AIDA64.
And I wish AMD would come out with a memory latency checker of their own.
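For what it's worth, that first idle number is normally obtained with a dependent pointer chase, so a bare-bones version is only a few lines. A sketch of the idea (my illustration, not Intel's actual implementation; with small pages the result will read a bit higher than MLC's because TLB misses are included):

```cpp
#include <chrono>
#include <iostream>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main() {
    const size_t kEntries = 16u * 1024 * 1024;  // 16M size_t entries = 128MB, well past L3
    std::vector<size_t> next(kEntries);
    std::iota(next.begin(), next.end(), size_t{0});

    // Sattolo's algorithm builds a single-cycle permutation, so the chase
    // walks the whole buffer instead of getting stuck in a short loop.
    std::mt19937_64 rng{42};
    for (size_t i = kEntries - 1; i > 0; --i) {
        std::uniform_int_distribution<size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }

    const size_t kLoads = 20'000'000;
    size_t idx = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < kLoads; ++i) idx = next[idx];  // each load depends on the last
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::cout << "approx idle latency: " << ns / kLoads
              << " ns (checksum " << idx << ")\n";  // print idx so the loop isn't optimized away
}
```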
 

Hans Gruber

Platinum Member
Dec 23, 2006
I was thinking that overclocking RAM, or getting speeds of 3200MHz or more, was motherboard-dependent. I'm not sure what memory sticks I have, but 3200MHz became a non-issue with motherboard BIOS revisions.
 

Jan Olšan

Senior member
Jan 12, 2017
Apparently Lisa Su confirmed in the conference call for the Q2 results that they are making Epyc 2 at TSMC. Crazy...

I wonder if this means separate dies for Epyc and Ryzen, because AMD still needs to meet its mandatory wafer orders from GloFo, and they won't be able to if they don't mass-produce AM4 Ryzens there.

So, Matt, on your first question relative to the manufacturing of the second generation of EPYC, so as I said earlier, we are working with both the TSMC and GLOBALFOUNDRIES in 7-nanometer. As for the 7-nanometer Rome that we're currently sampling, that's being manufactured at TSMC.

Rome = Epyc 2

https://seekingalpha.com/article/41...-amd-q2-2018-results-earnings-call-transcript
 

NTMBK

Lifer
Nov 14, 2011
Apparently Lisa Su confirmed in the conference call for the Q2 results that they are making Epyc 2 at TSMC. Crazy...

I wonder if this means separate dies for Epyc and Ryzen, because AMD still needs to meet its mandatory wafer orders from GloFo, and they won't be able to if they don't mass-produce AM4 Ryzens there.

Potentially... though they might meet the quota with next-gen console APUs and GPUs.
 

Shivansps

Diamond Member
Sep 11, 2013
I have a question about Ryzen 1000-series turbo. Today I ran a test using a fully single-threaded game: an R5 2600 on a Gigabyte A320 board ran the game while maintaining 3900MHz on the core that was running it. In contrast, this is my 1700 on a Gigabyte AB350-Gaming running the same game.

[Screenshot: per-core clocks on the 1700 while running the same game]


I don't remember my 1700 ever boosting past 3200MHz. From what I've seen in tests of the 2200G, 2400G and 2600, turbo seems to be working really well on the 2000 series, but is it really this bad on the 1000 series?
 

moinmoin

Diamond Member
Jun 1, 2017
Turbo on the 1xxx series only kicks in when enough cores are completely idle. Lifting that limitation is indeed a major improvement in the 2xxx series.
 

virpz

Junior Member
Sep 11, 2014
I am a little late here, but I tried the "FIT" consultation described here by The Stilt. For that I disabled C-states, CPB and PBO. Then I tried it again, this time with CPB and PBO enabled.
I took notes of the maximum sustained Vcore and clock by monitoring the clock and "VDDCR_CPU SVI2 TFN" in HWiNFO64. Anything else I need to enable/disable?

CPB/PBO disabled:

Multi-threaded load
Vcore: 1.231V
Clock: 3974MHz

Single-Threaded load
Vcore: 1.300V
Clock: 4100MHz

CPB/PBO enabled:

Multi-threaded load
Vcore: 1.369V
Clock: 4174MHz

Single-Threaded load
Vcore: 1.419V
Clock: 4324MHz


Anyone else want to try this and share their results?
 

virpz

Junior Member
Sep 11, 2014
@The Stilt, can you or anybody else confirm whether AMD has removed the PBO adjustments (PPT, TDC, EDC) in AGESA 1.0.0.4?

Thanks!
 

Jjoshua2

Senior member
Mar 24, 2006
[Screenshot: memory timings at 3466MHz CL14]

I know these are pretty good timings, but what should I work on next? This is a 1950X with 3200 CAS 14 RAM, but it seems to be memory-stable at pretty much whatever I give it. I can't run these timings at 3600 or use CAS 13 here, but what can I do with the subtimings from here? Are these reasonable? Which ones can I try to lower? Maybe just tRFC? I think tRC and tRAS are already as low as they should be according to the formulas.
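By "formulas" I mean the rules of thumb that usually get passed around: tRC >= tRAS + tRP, and tRAS >= tCL + tRCD plus a small burst margin. A quick sanity-check sketch (hypothetical primaries, not copied from my screenshot; treat the margin as a convention, not gospel):

```cpp
#include <iostream>

int main() {
    const int tCL = 14, tRCD = 14, tRP = 14;        // hypothetical primaries for a 3466 C14 setup
    const int burstMargin = 2;                      // commonly quoted slack, 2-4 cycles
    const int tRAS_min = tCL + tRCD + burstMargin;  // 30 with these numbers
    const int tRC_min = tRAS_min + tRP;             // 44 with these numbers
    std::cout << "tRAS >= " << tRAS_min << ", tRC >= " << tRC_min << "\n";
}
```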
 

gupsterg

Junior Member
Mar 4, 2017
I reckon you'd see strongly diminishing returns on any further time/effort spent improving on what you already have.

I've opted for 3400MHz C15 1T on my TR + ZE, as it uses SOC 1.05V / VDIMM 1.35V and has passed many hours of usage/stability testing. Screenshot link: ProcODT 53, CAD bus timings 0/0, 0/0, 0/0, RTT Off/Off/48, CAD bus drive strengths 24, 24, 24, 24.
 

cdimauro

Member
Sep 14, 2016
This is by no means a full-blown review. It just provides some of the more in-depth information, along with some test results.
I know it's been more than two years now, but I would be grateful if you could add more information about the test configurations.

Test configuration for the Ryzen 1800X platform
Motherboard: Crosshair VI Hero
OS: Windows 10
Memory: 2666MHz, but which model(s)?
HD/SSD: which one?

What are the configurations for the other systems (Excavator, Haswell, Kabylake)?

BlackScholes: which implementation was used? Repository? Compilation options?
embree: which compilation options?
gcc 6.3: which compilation options were used for compiling ffmpeg?
linpack: which implementation was used? Repository? Compilation options?
mcrt: which implementation was used? Repository? Compilation options?

Thanks
 

Ajay

Lifer
Jan 8, 2001
I know it's been more than two years now, but I would be grateful if you could add more information about the test configurations.

Test configuration for the Ryzen 1800X platform
Motherboard: Crosshair VI Hero
OS: Windows 10
Memory: 2666MHz, but which model(s)?
HD/SSD: which one?

What are the configurations for the other systems (Excavator, Haswell, Kabylake)?

BlackScholes: which implementation was used? Repository? Compilation options?
embree: which compilation options?
gcc 6.3: which compilation options were used for compiling ffmpeg?
linpack: which implementation was used? Repository? Compilation options?
mcrt: which implementation was used? Repository? Compilation options?

Thanks
The Stilt is no more: "This account has been deactivated."
 