Ryzen: Strictly technical

Status
Not open for further replies.

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Hopefully AMD can fix this in "Pinnacle Ridge". If they can spare the resources, it might be a good idea for them to design their own IMC instead of using one licensed from Rambus.

I'm pretty certain the DRAM IP isn't supplied by Rambus.
That's because the IPs used in Steamroller and Excavator weren't, and Zeppelin has an almost identical controller structure (at the interface register level) to what SR and XV had.
 

HurleyBird

Platinum Member
Apr 22, 2003
2,684
1,267
136
Techreport ran an interesting article showing histograms of frame times with the 1800X and 7700. An interesting thing is that Ryzen's histograms look bimodal versus the unimodal results from the 7700.

I'm sure there are a few things that could be causing the bimodal presentation, although my money is on the scheduler. If it's possible to get rid of the first spike on the histograms, game performance should improve a lot.
 

lopri

Elite Member
Jul 27, 2002
13,209
594
126
^ I have no idea how he can draw this conclusion:

That's interesting, we think. The Core i7-7700K produces a nice, fat dromedarian hump with most of its frames clustered to the right of the chart, while the bactrian Ryzen 7 1800X exhibits a seemingly-more-bimodal distribution. Not only is the Core i7-7700K faster, but its frame delivery is more consistent—and our frame-time data bears that out.

From these graphs:

gtav-ryzen-normalized2.png
gtav-7700K-normalized2.png

At least not without another axis showing timeline/frame numbers. It seems more like a game-dependent result, and it is not uncommon for a CPU to produce higher or lower FPS at a certain level or scene in a game. That alleged "bimodality" occurred only at one data point (80~83 FPS) out of 20 data points, and I do not know if you can call that "bimodal."

P.S. I had no idea game FPS tally was supposed to follow a Bell curve. Game developers do not care much once their bottom line is achieved. (e.g. 1080p/60FPS)
 

Elixer

Lifer
May 7, 2002
10,376
762
126
Certainly Linux is not perfect, it has a scheduler, too. Time will tell if some patches can be applied to inform the scheduler of Ryzen's new architecture.
On Linux, the scheduler changes started with https://git.kernel.org/cgit/linux/k.../?id=79a8b9aa388b0620cc1d525d7c0f0d9a8a85e08e
Then they add the Zen specific one the same day, https://git.kernel.org/cgit/linux/k.../?id=08b259631b5a1d912af4832847b5642f377d9101

...
That problem stems most likely from the fact that the CU threads share resources within one CU and when we schedule to a thread of a different compute unit, this incurs latency due to migrating the working set to a different CU through the caches.
When the thread siblings mask mirrors that aspect of the CUs and threads, the scheduler pays attention to it and tries to schedule within one CU first. Which takes care of the latency, of course.
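For anyone who wants to see what the kernel currently reports as thread siblings, here is a minimal sketch. It assumes a Linux box that exposes the usual sysfs topology files; the helper names `parse_cpu_list` and `thread_siblings` are made up for illustration and are not part of the patches above.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0-1' or '0,8' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def thread_siblings(cpu):
    """Return the logical CPUs the kernel lists as thread siblings of `cpu` (Linux only)."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list")
    return parse_cpu_list(path.read_text())
```

Calling `thread_siblings(0)` on a Linux machine returns the logical CPUs that share a core with CPU 0, which is the grouping the scheduler tries to stay within first.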
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
P.S. I had no idea game FPS tally was supposed to follow a Bell curve. Game developers do not care much once their bottom line is achieved. (e.g. 1080p/60FPS)
1. Frametimes are affected by so many independent factors, some of which are pseudo-random, that a normal distribution can act as a good approximation.
2. Perfect FPS tally would be 1 FPS value on entire run, so yes, perfect FPS tally would be a bell curve with little deviation.
That alleged "bimodality" occurred only at one data point (80~83 FPS) out of 20 data points, and I do not know if you can call that "bimodal."
It is not a data point but a bucket to throw frametimes in. If you ask me, it would be better if the distribution was displayed for frametimes, because right now the leftmost buckets are many times larger than the rightmost.
 

zir_blazer

Golden Member
Jun 6, 2013
1,164
406
136
2x x16 PCie lanes per Zeppelin die.
All of them are usable, however 8 of them are consumed by the eFCH (Promontory), when present.
Wait, I'm missing something here. I have been arguing a lot about this, going through any Block Diagram I can come across from AM4 platforms, and I thought it was 24.

16 PCIe Lanes for a main PCIe 16x Slot, which on X300 or X370 based Motherboards you can bifurcate as 8x/8x (Sold as a "Chipset feature")
4 PCIe Lanes, which can be configured (Link is quite old, from 2015, but can't find better) as 4 General Purpose (Unless U.2 and M.2 are special, which as far as I know they aren't, just generic PCIe), or 2 SATA-III + 2 PCIe Lanes. Theoretically, a single SATA Express should consume all 4 Lanes, as it should be the same 2 SATA-III + 2 PCIe Lanes configuration, and not SATA Express + 2 GP as in the link, which would mean 6 Lanes (Unless there is some black magic going on there, or the SATAe is crippled in some way).
4 PCIe Lanes for Chipset (Promontory), which can also be GP if used in X300
Where are the other 8? I don't recall any mention of Promontory using 8 either, only 4 (Including this Block Diagram of the ASUS Crosshair VI Hero). Do the 4 integrated USB 3.0 ports consume Lanes like FlexIO does on Intel Chipsets (And should be fixed that way in AM4 platforms)? Can't really get to 32.

The other way I take it: Zeppelin does have 32 Lanes, but AM4 is not using the full Zeppelin I/O because of a late change in plans for Ryzen when Socket AM4 was already set in stone.


About Promontory, how many configuration possibilities does it actually have? I was looking at this, where it says that X370 has 8 PCIe 2.0 Lanes and 2 SATAe, and in footnote 1 it says that each SATAe can be used either as 2 SATA-III or 2 PCIe 3.0 Lanes. It also says that it can be combined with 2 GP Lanes to make a 4x Slot, but would it be 3.0 or 2.0? Can I get 2 4x PCIe 3.0 Slots from that? What would be the max PCIe Lanes/Slots configuration with no SATAe? 8 PCIe 2.0 (4x/4x) + 4 PCIe 3.0 (4x) (And only 4 SATA-III)?

I think that part of the confusion with SATAe is because it seems that AMD is using just two configurable PCIe Lanes for it, not 2 PCIe Lanes + 2 SATA-III, which would otherwise be worth 4.
 

antihelten

Golden Member
Feb 2, 2012
1,764
274
126
That alleged "bimodality" occurred only at one data point (80~83 FPS) out of 20 data points, and I do not know if you can call that "bimodal."

To be fair the bimodal nature of the distribution is far more clear in Crysis 3:
cry3-1800x-normalized2.png

It is however worth noting that the 7700K also has a bimodal distribution here (albeit significantly less pronounced):
cry3-7700K-normalized.png

But as lolfail9001 said above, this really needs to be done with frametimes instead of frame rates. For instance the 4-8 FPS bin represents a span in frametime of 125 ms (from 250 ms for 4 FPS to 125 ms for 8 FPS), whereas say the 196-200 represents only a span of 0.1 ms (from 5.1 ms for 196 FPS to 5 ms for 200 FPS). So over three orders of magnitude difference in bin size.
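The bin-size arithmetic above can be checked with a few lines of Python. This is only a sketch; the function name `fps_bin_to_frametime_span_ms` is made up for illustration, and the two bins are the ones mentioned in the post.

```python
def fps_bin_to_frametime_span_ms(fps_low, fps_high):
    """Return (slowest_ms, fastest_ms, width_ms) for an FPS histogram bin."""
    slowest_ms = 1000.0 / fps_low   # lower FPS means a longer frame time
    fastest_ms = 1000.0 / fps_high
    return slowest_ms, fastest_ms, slowest_ms - fastest_ms


slow_bin = fps_bin_to_frametime_span_ms(4, 8)      # 250 ms down to 125 ms
fast_bin = fps_bin_to_frametime_span_ms(196, 200)  # about 5.1 ms down to 5 ms
ratio = slow_bin[2] / fast_bin[2]                  # how much wider the slow bin is
```

The ratio comes out above 1,000, which is where the "over three orders of magnitude" figure comes from.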
 

guachi

Senior member
Nov 16, 2010
761
415
136
I find those histograms incredibly useful.

I think all reviewers should no longer use average fps. I think they should use median (aka 50th percentile) and other percentiles like 1% or even .1%.

Converting the histograms above gives the following for Crysis 3:

1800X
.1% 58 fps
1% 74 fps
10% 98 fps
25% 114 fps
50% 126 fps
75% 138 fps
90% 150 fps
99% 194 fps

7700K
.1% 74 fps
1% 86 fps
10% 106 fps
25% 118 fps
50% 126 fps
75% 142 fps
90% 154 fps
99% 202 fps

In other words, Ryzen has a problem at the lower end. Its numbers at the 25th percentile and up are fine. In fact, both have the same median. It's between .1% and 25% that the problem exists.

GTA V

1800X
.1% 70 fps
1% 74 fps
10% 82 fps
25% 82 fps
50% 90 fps
75% 98 fps
90% 106 fps
99% 126 fps

7700K
.1% 82 fps
1% 90 fps
10% 102 fps
25% 106 fps
50% 110 fps
75% 118 fps
90% 126 fps
99% 138 fps

Here, the 1800X is consistently about 80-85% as fast as the 7700K and the results are far more clustered for both chips.

Personally, I'd find this kind of numbers (.1%, 1%, and 50%, plus an included histogram) more useful than what we get now. And with Excel, it's not hard to do.

EDIT: I converted the fps to frame time to get more accurate results as antihelten suggested. Then converted back to fps for the post.
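The same kind of summary can be produced outside Excel. Below is a minimal sketch that assumes you have raw per-frame times in milliseconds and uses a simple nearest-rank percentile over frames; the function name `fps_percentiles` and the sample data are hypothetical, and this is not necessarily the exact method used for the tables above.

```python
import math


def fps_percentiles(frame_times_ms, percentiles=(0.1, 1, 10, 25, 50, 75, 90, 99)):
    """Nearest-rank percentiles of per-frame FPS, computed from frame times in ms."""
    if not frame_times_ms:
        raise ValueError("need at least one frame time")
    fps = sorted(1000.0 / t for t in frame_times_ms)  # slowest frames first
    n = len(fps)
    result = {}
    for p in percentiles:
        rank = max(1, math.ceil(p / 100.0 * n))  # nearest-rank, 1-based
        result[p] = fps[rank - 1]
    return result


# Hypothetical capture: mostly 8 ms frames with a few slow ones.
sample_ms = [8.0] * 95 + [12.5] * 4 + [20.0]
summary = fps_percentiles(sample_ms)
```

With this made-up sample the median is 125 fps while the .1% figure drops to 50 fps, the sort of low-end gap an average would hide.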
 

dfk7677

Member
Sep 6, 2007
64
21
81
I don't know if this is useful or not but I will post it. This is a histogram of the frametimes in Battlefield 1 multiplayer, using Cam 1 of Spectator, in a full Amiens (map) 64 player Conquest server, using a 4590 (3.5GHz MC turbo), 8GB DDR3-1600. The CPU usage was always 98%+ and the GPU usage <80%. All GFX settings set to low or off @1080p.

If anyone owning a Ryzen CPU could do the same it would be great.
 

unseenmorbidity

Golden Member
Nov 27, 2016
1,395
967
96
A Diagram of Ryzen’s Clock Domains
For the past few days, there have been talks around tech forums about Ryzen’s clock domains. Specifically about how some parts of the internal fabric of Ryzen run at memory clockspeed, or half effective transfer speed. Thanks to hardware.fr, we now have a diagram of the clock domains in Ryzen.





From the above diagram, we can conclude there are three major clock domains in Ryzen:

  • CClk (Core Clock): The speed advertised on the box, and the speed at which the cores and cache run
  • LClk (Data Launch Clock): A fixed frequency at which the IO Hub Controller (PCI-E and friends) operates
  • MemClk (Memory Clock): The memory clock speed, half of the number you see advertised (roughly 1.33GHz for DDR4-2667, as an example)

As can be seen in the diagram, the internal Data Fabric, part of the Infinity Fabric advertised by AMD, runs in the MemClk domain.

The Data Fabric is responsible for the core’s communication with the memory controller, and more importantly, inter-CCX communication. As previously explained, AMD’s Ryzen is built in modular blocks called CCXs, each containing four cores and its own bank of L3 cache. An 8 core chip like Ryzen contains two of these. In order for CCX to CCX communication to take place, such as when a core from CCX 0 attempts to access data in the L3 cache of CCX 1, it has to do so through the Data Fabric. Assuming a standard 2667MT/s DDR4 kit, the Data Fabric has a bandwidth of 41.6GB/s in a single direction, or 83.2GB/s when transferring in both directions. This bandwidth has to be shared between inter-CCX communication and DRAM access, quickly creating data contention whenever a lot of data is being transferred from CCX to CCX at the same time as reading from or writing to memory.



Memory Scaling
To put things into perspective, communication to and from cores to L3 cache inside the same CCX happens at around 200GB/s. Combine this with the massive difference in latency you would expect from having to jump through extra hoops to reach the other CCX, and the communication speed between CCXs is simply not anywhere near intra-CCX communication speeds. This is why changing the Windows scheduler to keep data from a single thread in the same CCX is so important, as otherwise you incur a performance penalty, as observed in gaming performance tests.

As observed in multiple tests, Ryzen appears to scale quite well with memory speeds, and this knowledge sheds light on why. It’s not necessarily the increased speeds of the DDR4 kits themselves, though that certainly helps. Rather, it’s the increased internal bandwidth for inter-CCX communication, which alleviates some of the performance issues when threads have to communicate between CCXs.

Due to this, if you’re picking up a Ryzen system, it’s highly recommended to get a decently fast memory kit, as it will help performance more than you would otherwise expect.



Why is it built this way?
Well, the answer is quite simple. AMD, as a semi-custom company, would like to have modularity in their designs. Instead of building a large monolithic 8 core complex that would be relatively hard to scale, they’ve created a quad core building block, from which they can scale core counts as they see fit. This reduces design costs and time to market for their customers.

There is a trade off, as detailed above. However, as seen in the impressive performance of Ryzen, it’s not too hampering. As soon as the Windows scheduler gets updated to minimize movement of data between CCXs, and developers release patches in the few games where that will not be enough to solve issues, the gaming performance should be similar to the performance of a 6900K, just like with professional workloads.

https://thetechaltar.com/amd-ryzen-clock-domains-detailed/
 

CatMerc

Golden Member
Jul 16, 2016
1,114
1,149
136
I wrote that \o/
 

gupsterg

Junior Member
Mar 4, 2017
9
3
51
@The Stilt

So, putting aside BCLK dropping the PCI-E gen, etc.: raising BCLK would mean the CPU stays in "Normal mode" if the multiplier is not raised past stock. So "normal" operation occurs and PB/XFR works, but then these constraints would apply as they do in the "out-of-box" scenario, and not be lifted, since "OC mode" has not been enabled?

R7 1700

In a heavily-multithreaded “all cores boost” scenario, this user-focused performance tuning permits the 1700 to ramp peak power draw up to its fused package power limit of approximately 90W electrical (note: AM4 reference power limit is 128W). Precision Boost and/or XFR will level off at 72.3tCase°C or ~90W of electrical power (whichever comes first).

R7 1700X/1800X

In a heavily-multithreaded “all cores boost” scenario, this user-focused performance tuning permits the 1700X/1800X to ramp peak power draw up to the AMD Socket AM4 reference limit of 128W. Precision Boost and/or XFR will level off at 60tCase°C or 128W of electrical power (whichever comes first).
 

bluepx

Junior Member
Mar 9, 2017
1
1
41
The ASRock X370 Taichi has reached reviewers and allows changing the P-states from the BIOS, so no need to mess around with BCLK. Will this keep the negative voltage offsets active?

Custom_pstate.jpg


I'm not sure what FID/DID/VID are or why the freq and voltage are greyed out though.

Link to the (light) Taichi review: https://www.youtube.com/watch?v=VjGWqTFultI (no benchmarks)
 

iBoMbY

Member
Nov 23, 2016
175
103
86
You can do that on the C6H as well, only when you do it, the P-states are not used (it seems). I set Pstate0 to 4 GHz, then it was running at Pstate1 with 3.6 GHz all the time ...
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
You can do that on the C6H as well, only when you do it, the P-states are not used (it seems). I set Pstate0 to 4 GHz, then it was running at Pstate1 with 3.6 GHz all the time ...

Try setting the "Balanced" power profile. I don't think Precision Boost obeys the custom P-states.

I anticipate Precision Boost/XFR and all that jazz to be worthless. Custom P-states and incurring the 30ms frequency shift penalty will probably be worth it, IMHO, if you can then set a higher all core turbo.
 

iBoMbY

Member
Nov 23, 2016
175
103
86
It says 32B/cycle right there...

Yeah, okay.

What I don't get then is why they didn't see fit to make the memory controller much beefier, when you can get 1 GB/s extra for every 32 MHz (64 at DDR)?

DDR4-2666: 41.7 GB/s
DDR4-3000: 46.9 GB/s
DDR4-3200: 50.0 GB/s
DDR4-3400: 53.1 GB/s
DDR4-3600: 56.3 GB/s
DDR4-4000: 62.5 GB/s
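For reference, the list above can be reproduced with a short sketch. It assumes the 32B/cycle figure from the diagram, MemClk at half the DDR4 rating, and MB/s divided by 1024 to get "GB/s" (the convention that matches these numbers); the function name `fabric_bandwidth_gb_s` is made up for illustration.

```python
BYTES_PER_CYCLE = 32  # "32B/cycle", per the diagram discussed above


def fabric_bandwidth_gb_s(ddr_rating_mt_s):
    """One-direction Data Fabric bandwidth for a given DDR4 rating.

    MemClk is half the advertised transfer rate. The result divides
    MB/s by 1024, which is what reproduces the figures in the list.
    """
    memclk_mhz = ddr_rating_mt_s / 2
    return memclk_mhz * BYTES_PER_CYCLE / 1024


for rating in (2666, 3000, 3200, 3400, 3600, 4000):
    print(f"DDR4-{rating}: {fabric_bandwidth_gb_s(rating):.2f} GB/s")
```

Every extra 32 MHz of MemClk (64 on the DDR rating) adds 32 x 32 = 1024 MB/s, which is the "1 GB/s extra" in the question.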
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Yeah, okay.

What I don't get then is why they didn't see fit to make the memory controller much beefier, when you can get 1 GB/s extra for every 32 MHz (64 at DDR)?

DDR4-2666: 41.7 GB/s
DDR4-3000: 46.9 GB/s
DDR4-3200: 50.0 GB/s
DDR4-3400: 53.1 GB/s
DDR4-3600: 56.3 GB/s
DDR4-4000: 62.5 GB/s

AMD's first high performance DDR4 IMC... how good was Intel's first DDR4 IMC? Oh, right, you lost performance with higher frequencies. Ouch.
 

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
What I don't get then is why they didn't see fit to make the memory controller much beefier, when you can get 1 GB/s extra for every 32 MHz (64 at DDR)?
Well, to begin with, using DDR4-2933 or higher would be troublesome in servers or mobile, so making the controller handle that with ease already looks like an unnecessary struggle. Enthusiasts are a minority; you always have to bear that in mind.
AMD's first high performance DDR4 IMC... how good was Intel's first DDR4 IMC? Oh, right, you lost performance with higher frequencies. Ouch.
Wait, Haswell-E lost performance with higher DDR4 freqs? Curious...
 

Rifter

Lifer
Oct 9, 1999
11,522
751
126
Yep:

http://www.anandtech.com/show/8959/...-3200-with-gskill-corsair-adata-and-crucial/4

Not consistently, mind you, but often enough DDR4-3000 CL15 performed identically to - or worse than - DDR4-2133 CL15.

Yeah, even with the memory issues AMD is still doing better than Intel's first at-bat with DDR4.

The big question I have is how high we can push the memory controller. What will the final stable speeds be? 3200, 3400, 3600? Even 4000? I really want to know lol.
 