Ryzen: Strictly technical

The Stilt · Mar 8, 2017

JDG1980 said:
Hopefully AMD can fix this in "Pinnacle Ridge". If they can spare the resources, it might be a good idea for them to design their own IMC instead of using one licensed from Rambus.

I'm pretty certain the DRAM IP isn't supplied by Rambus.
That's because the IP used in Steamroller and Excavator wasn't, and Zeppelin has an almost identical controller structure (at the interface register level) to the one SR and XV had.

iBoMbY · Mar 9, 2017

The Stilt said:
I'm pretty certain the DRAM IP isn't supplied by Rambus.
That's because the IP used in Steamroller and Excavator wasn't, and Zeppelin has an almost identical controller structure (at the interface register level) to the one SR and XV had.

As I said some time ago, to me it is very likely they are still working with Synopsys IPs.

HurleyBird · Mar 9, 2017

Techreport ran an interesting article showing histograms of frame times with the 1800X and 7700. An interesting thing is that Ryzen's histograms look bimodal versus the unimodal results from the 7700.

I'm sure there are a few things that could be causing the bimodal presentation, although my money is on the scheduler. If it's possible to get rid of the first spike on the histograms, game performance should improve a lot.

lopri · Mar 9, 2017

^ I have no idea how he can draw this conclusion:

That's interesting, we think. The Core i7-7700K produces a nice, fat dromedarian hump with most of its frames clustered to the right of the chart, while the bactrian Ryzen 7 1800X exhibits a seemingly-more-bimodal distribution. Not only is the Core i7-7700K faster, but its frame delivery is more consistent—and our frame-time data bears that out.

From these graphs:

At least not without another axis showing timeline/frame numbers. It seems more like a game-dependent result, and it is not uncommon for a CPU to produce higher or lower FPS in a certain level or scene of a game. That alleged "bimodality" occurred only at one data point (80~83 FPS) out of 20 data points, and I do not know if you can call that "bimodal."

P.S. I had no idea game FPS tally was supposed to follow a Bell curve. Game developers do not care much once their bottom line is achieved. (e.g. 1080p/60FPS)

Elixer · Mar 9, 2017

crashtech said:
Certainly Linux is not perfect, it has a scheduler, too. Time will tell if some patches can be applied to inform the scheduler of Ryzen's new architecture.

On Linux, the scheduler changes started with https://git.kernel.org/cgit/linux/k.../?id=79a8b9aa388b0620cc1d525d7c0f0d9a8a85e08e
Then they added the Zen-specific one the same day, https://git.kernel.org/cgit/linux/k.../?id=08b259631b5a1d912af4832847b5642f377d9101

...
That problem stems most likely from the fact that the CU threads share resources within one CU and when we schedule to a thread of a different compute unit, this incurs latency due to migrating the working set to a different CU through the caches.
When the thread siblings mask mirrors that aspect of the CUs and threads, the scheduler pays attention to it and tries to schedule within one CU first. Which takes care of the latency, of course.

looncraz · Mar 9, 2017

Elixer said:
On Linux, the scheduler changes started with https://git.kernel.org/cgit/linux/k.../?id=79a8b9aa388b0620cc1d525d7c0f0d9a8a85e08e
Then they added the Zen-specific one the same day, https://git.kernel.org/cgit/linux/k.../?id=08b259631b5a1d912af4832847b5642f377d9101

And they had already added support for multi-level LLC and will set it at the CCX level or node level appropriately.
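
For anyone who wants to see what topology the kernel actually exposes on their own box, it is readable from sysfs. A minimal Python sketch, assuming a Linux system with the usual `topology/thread_siblings_list` and `cache/index3/shared_cpu_list` files under `/sys/devices/system/cpu/cpuN/`. The helper names are made up for illustration and are not from the kernel patches, and treating `index3` as the L3 is an assumption that holds on typical x86 parts:

```python
import glob
import os


def parse_cpu_list(text):
    """Turn a sysfs CPU list such as '0-3,8-11' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def group_cpus(relpath, root="/sys/devices/system/cpu"):
    """Group logical CPUs by the CPU list found in cpuN/<relpath>.

    Returns a sorted list of tuples, one tuple per sharing group.
    Returns an empty list when the files are missing (e.g. not Linux).
    """
    groups = set()
    for path in glob.glob(os.path.join(root, "cpu[0-9]*", relpath)):
        with open(path) as handle:
            groups.add(tuple(parse_cpu_list(handle.read())))
    return sorted(groups)


if __name__ == "__main__":
    # sysfs view of the thread siblings mask mentioned in the quoted commit message.
    print("thread siblings:", group_cpus("topology/thread_siblings_list"))
    # CPUs sharing a last-level cache; index3 is assumed to be the L3 here.
    print("shared L3:", group_cpus("cache/index3/shared_cpu_list"))
```

If the LLC is reported at the CCX level, the second line should show one group per CCX rather than one group for the whole package.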

lolfail9001 · Mar 9, 2017

lopri said:
P.S. I had no idea game FPS tally was supposed to follow a Bell curve. Game developers do not care much once their bottom line is achieved. (e.g. 1080p/60FPS)

1. Frametimes are affected by so many independent factors, some of which are pseudo-random, that a normal distribution can act as a good approximation.
2. A perfect FPS tally would be a single FPS value over the entire run, so yes, a perfect FPS tally would be a bell curve with very little deviation.

lopri said:
That alleged "bimodality" occurred only at one data point (80~83 FPS) out of 20 data points, and I do not know if you can call that "bimodal."

It is not a data point but a bucket to throw frametimes in. If you ask me, it would be better if the distribution were displayed for frametimes, because right now the leftmost buckets are many times larger than the rightmost.

zir_blazer · Mar 9, 2017

The Stilt said:
2x x16 PCie lanes per Zeppelin die.
All of them are usable, however 8 of them are consumed by the eFCH (Promontory), when present.

Wait, I'm missing something here. I have been arguing a lot about this, going through any Block Diagram I can come across from AM4 platforms, and I thought it was 24.

16 PCIe Lanes for a main PCIe 16x Slot, which on X300 or X370 based Motherboards you can bifurcate as 8x/8x (Sold as a "Chipset feature")
4 PCIe Lanes, which can be configured (the link is quite old, from 2015, but I can't find a better one) as 4 General Purpose (unless U.2 and M.2 are special, which as far as I know they aren't, just generic PCIe), or 2 SATA-III + 2 PCIe Lanes. Theoretically, a single SATA Express should consume all 4 Lanes, as it should be the same 2 SATA-III + 2 PCIe Lanes configuration, and not SATA Express + 2 GP as in the link, which would mean 6 Lanes (unless there is some black magic going on there, or the SATAe is crippled in some way).
4 PCIe Lanes for Chipset (Promontory), which can also be GP if used in X300
Where are the other 8? I don't recall any mention of Promontory using 8 either, only 4 (including in this Block Diagram of the ASUS Crosshair VI Hero). Do the 4 integrated USB 3.0 ports consume Lanes, as with FlexIO on Intel Chipsets (and would they be fixed that way on AM4 platforms)? I can't really get to 32.

The other way I take it: Zeppelin does have 32 Lanes, but AM4 is not using the full Zeppelin I/O because of a late change in plans for Ryzen when Socket AM4 was already set in stone.

About Promontory, how many configuration possibilities does it actually have? I was looking at this, where it says that X370 has 8 PCIe 2.0 Lanes and 2 SATAe, and in footnote 1 it says that each SATAe can be used either as 2 SATA-III or 2 PCIe 3.0 Lanes. It also says that it can be combined with 2 GP Lanes to make a 4x Slot, but would that be 3.0 or 2.0? Can I get 2 4x PCIe 3.0 Slots from that? What would be the max PCIe Lanes/Slots configuration with no SATAe? 8 PCIe 2.0 (4x/4x) + 4 PCIe 3.0 (4x) (and only 4 SATA-III)?

I think that part of the confusion with SATAe is because it seems that AMD is using just two configurable PCIe Lanes for it, not 2 PCIe Lanes + 2 SATA-III, which would otherwise be worth 4.

antihelten · Mar 9, 2017

lopri said:
That alleged "bimodality" occurred only at one data point (80~83 FPS) out of 20 data points, and I do not know if you can call that "bimodal."

To be fair the bimodal nature of the distribution is far more clear in Crysis 3:

It is however worth noting that the 7700K also has a bimodal distribution here (albeit significantly less pronounced):

But as lolfail9001 said above, this really needs to be done with frametimes instead of frame rates. For instance, the 4-8 FPS bin represents a span in frametime of 125 ms (from 250 ms for 4 FPS to 125 ms for 8 FPS), whereas, say, the 196-200 FPS bin represents a span of only 0.1 ms (from 5.1 ms for 196 FPS to 5 ms for 200 FPS). That is over three orders of magnitude of difference in bin size.
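
The arithmetic is easy to check. A short Python sketch using only the two bins quoted above; the function name is just for illustration:

```python
def bin_width_ms(fps_low, fps_high):
    """Width of an FPS bin expressed in frame time (milliseconds)."""
    # Frame time in ms is 1000 / FPS, so the slow edge gives the larger value.
    return 1000.0 / fps_low - 1000.0 / fps_high


slow_bin = bin_width_ms(4, 8)       # 250 ms - 125 ms = 125 ms
fast_bin = bin_width_ms(196, 200)   # about 5.10 ms - 5.00 ms, roughly 0.1 ms

print(f"4-8 FPS bin:     {slow_bin:.3f} ms")
print(f"196-200 FPS bin: {fast_bin:.3f} ms")
print(f"ratio:           {slow_bin / fast_bin:.0f}x")
```

The ratio comes out a little above 1,200, which is where the "three orders of magnitude" comes from.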

guachi · Mar 9, 2017

I find those histograms incredibly useful.

I think all reviewers should no longer use average fps. I think they should use median (aka 50th percentile) and other percentiles like 1% or even .1%.

Converting the histograms above gives the following for Crysis 3:

1800X
.1% 58 fps
1% 74 fps
10% 98 fps
25% 114 fps
50% 126 fps
75% 138 fps
90% 150 fps
99% 194 fps

7700K
.1% 74 fps
1% 86 fps
10% 106 fps
25% 118 fps
50% 126 fps
75% 142 fps
90% 154 fps
99% 202 fps

In other words, Ryzen has a problem at the lower end. Its numbers at the 25th percentile and up are fine. In fact, both have the same median. It's between .1% and 25% that the problem exists.

GTA V

1800X
.1% 70 fps
1% 74 fps
10% 82 fps
25% 82 fps
50% 90 fps
75% 98 fps
90% 106 fps
99% 126 fps

7700K
.1% 82 fps
1% 90 fps
10% 102 fps
25% 106 fps
50% 110 fps
75% 118 fps
90% 126 fps
99% 138 fps

Here, the 1800X is consistently about 80-85% as fast as the 7700K and the results are far more clustered for both chips.

Personally, I'd find these kinds of numbers, with .1%, 1%, and 50% and an included histogram, more useful than what we get now. And with Excel, it's not hard to do.

EDIT: I converted the fps to frame time to get more accurate results as antihelten suggested. Then converted back to fps for the post.
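
For reference, a table like this can also be produced straight from a frame time log instead of from a binned histogram. A minimal Python sketch using only the standard library; the frame times below are made up for illustration and are not Tech Report's data, and the nearest-rank percentile rule is just one reasonable choice among several:

```python
import math


def fps_percentiles(frame_times_ms, percentiles):
    """Return {percentile: fps}, where the p-th percentile is the FPS value
    that p percent of frames fall at or below (low percentiles = slow frames)."""
    # Sort slowest frame first so index 0 is the worst frame in the run.
    ordered = sorted(frame_times_ms, reverse=True)
    count = len(ordered)
    result = {}
    for p in percentiles:
        # Nearest-rank method: smallest rank covering p percent of the frames.
        rank = max(1, math.ceil(p * count / 100))
        result[p] = 1000.0 / ordered[rank - 1]
    return result


# Hypothetical frame times in milliseconds, for illustration only.
sample = [8.0, 8.0, 8.3, 8.3, 9.1, 10.0, 10.0, 12.5, 12.5, 20.0]

for p, fps in fps_percentiles(sample, [10, 50, 90]).items():
    print(f"{p}%: {fps:.0f} fps")
```

Working in frame times and converting to FPS only at the end avoids the uneven bin widths discussed above.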

dfk7677 · Mar 9, 2017

I don't know if this is useful or not but I will post it. This is a histogram of the frametimes in Battlefield 1 multiplayer, using Cam 1 of Spectator, in a full Amiens (map) 64 player Conquest server, using a 4590 (3.5GHz MC turbo), 8GB DDR3-1600. The CPU usage was always 98%+ and the GPU usage <80%. All GFX settings set to low or off @1080p.

If anyone owning a Ryzen CPU could do the same it would be great.

unseenmorbidity · Mar 9, 2017

A Diagram of Ryzen’s Clock Domains
For the past few days, there has been talk around tech forums about Ryzen’s clock domains, specifically about how some parts of Ryzen’s internal fabric run at the memory clock speed, or half the effective transfer rate. Thanks to hardware.fr, we now have a diagram of the clock domains in Ryzen.

From the above diagram, we can conclude there are three major clock domains in Ryzen:

CClk (Core Clock): The speed advertised on the box, and the speed at which the cores and cache run
LClk (Data Launch Clock): A fixed frequency at which the IO Hub Controller (PCI-E and friends) operates
MemClk (Memory Clock): The memory clock speed, half of the number you see advertised (1.3GHz for 2667, as an example)

As can be seen in the diagram, the internal Data Fabric, part of the Infinity Fabric advertised by AMD, runs in the MemClk domain.

The Data Fabric is responsible for the core’s communication with the memory controller, and more importantly, inter-CCX communication. As previously explained, AMD’s Ryzen is built in modular blocks called CCXs, each containing four cores and its own bank of L3 cache. An 8 core chip like Ryzen contains two of these. In order for CCX to CCX communication to take place, such as when a core from CCX 0 attempts to access data in the L3 cache of CCX 1, it has to do so through the Data Fabric. Assuming a standard 2667MT/s DDR4 kit, the Data Fabric has a bandwidth of 41.6GB/s in a single direction, or 83.2GB/s when transferring in both directions. This bandwidth has to be shared between inter-CCX communication and DRAM access, quickly creating contention whenever a lot of data is being transferred from CCX to CCX at the same time as reading from or writing to memory.

Memory Scaling
To put things into perspective, communication to and from cores to L3 cache inside the same CCX happens at around 200GB/s. Combine this with the massive difference in latency you would expect from having to jump through extra hoops to reach the other CCX, and the communication speed between CCXs is simply nowhere near intra-CCX communication speeds. This is why changing the Windows scheduler to keep data from a single thread in the same CCX is so important, as otherwise you incur a performance penalty, as observed in gaming performance tests.

As observed in multiple tests, Ryzen appears to scale quite well with memory speeds, and this knowledge sheds light on why. It’s not necessarily the increased speed of the DDR4 kits themselves, though that certainly helps. Rather, it’s the increased internal bandwidth for inter-CCX communication, which alleviates some of the performance issues when threads have to communicate between CCXs.

Due to this, if you’re picking up a Ryzen system, it’s highly recommended to get a decently fast memory kit, as it will help performance more than you would otherwise expect.

Why is it built this way?
Well, the answer is quite simple. AMD, as a semi-custom company, would like to have modularity in their designs. Instead of building a large monolithic 8 core complex that would be relatively hard to scale, they’ve created a quad core building block, from which they can scale core counts as they see fit. This reduces design costs and time to market for their customers.

There is a trade off, as detailed above. However, as seen in the impressive performance of Ryzen, it’s not too hampering. As soon as the Windows scheduler gets updated to minimize movement of data between CCXs, and developers release patches in the few games where that will not be enough to solve issues, the gaming performance should be similar to the performance of a 6900K, just like with professional workloads.

https://thetechaltar.com/amd-ryzen-clock-domains-detailed/
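
The idea of keeping a workload inside one CCX can also be tried from user space without waiting for a scheduler update. A minimal Python sketch for Linux, assuming the logical CPU numbers of one CCX are already known. The set `{0, 1, 2, 3}` below is a placeholder, since the real numbering depends on the system and on SMT, and `pin_to_cpus` is a hypothetical helper name:

```python
import os


def pin_to_cpus(cpus, pid=0):
    """Restrict a process (0 = the calling process) to the given logical CPUs.

    Only CPUs the process is currently allowed to use are kept. Returns the
    resulting affinity set, or None if the platform lacks sched_setaffinity.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # e.g. Windows or macOS; this call is Linux-specific
    allowed = os.sched_getaffinity(pid)
    target = set(cpus) & allowed
    if not target:
        raise ValueError("none of the requested CPUs are available")
    os.sched_setaffinity(pid, target)
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    try:
        # Placeholder CPU numbers for one CCX; check the real topology first.
        print("now running on:", pin_to_cpus({0, 1, 2, 3}))
    except ValueError as err:
        print("could not pin:", err)
```

Child processes inherit the affinity, so a launcher script can pin itself and then start the game or benchmark to compare results against an unpinned run.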

CatMerc · Mar 9, 2017

unseenmorbidity said:
A Diagram of Ryzen’s Clock Domains
https://thetechaltar.com/amd-ryzen-clock-domains-detailed/

I wrote that \o/

gupsterg · Mar 9, 2017

@The Stilt

So putting aside BCLK dropping PCI-E gen, etc. Raising BCLK would mean CPU stays in "Normal mode" if multiplier not raised past stock. So "normal" operation occurs plus PB/XFR works, but then these constraints would apply as they are for "out-of-box" scenario and not lifted as "OC mode" has not been enabled?

R7 1700

In a heavily-multithreaded “all cores boost” scenario, this user-focused performance tuning permits the 1700 to ramp peak power draw up to its fused package power limit of approximately 90W electrical (note: AM4 reference power limit is 128W). Precision Boost and/or XFR will level off at 72.3tCase°C or ~90W of electrical power (whichever comes first).

R7 1700X/1800X

In a heavily-multithreaded “all cores boost” scenario, this user-focused performance tuning permits the 1700X/1800X to ramp peak power draw up to the AMD Socket AM4 reference limit of 128W. Precision Boost and/or XFR will level off at 60tCase°C or 128W of electrical power (whichever comes first).

bluepx · Mar 9, 2017

Asrock X370 Taichi has reached reviewers and allows changing the P-states from the BIOS, so no need to mess around with BCLK. Will this keep the negative voltage offsets active?

I'm not sure what FID/DID/VID are or why the freq and voltage are greyed out though.

Link to the (light) Taichi review: https://www.youtube.com/watch?v=VjGWqTFultI (no benchmarks)

iBoMbY · Mar 9, 2017

You can do that on the C6H as well, only when you do it, the P-states are not used (it seems). I set Pstate0 to 4 GHz, then it was running at Pstate1 with 3.6 GHz all the time ...

DisEnchantment · Mar 9, 2017

nvm, I did not read few posts above

iBoMbY · Mar 9, 2017

CatMerc said:
I wrote that \o/

And how exactly do you know how many Infinity Fabric Links AMD is using?

looncraz · Mar 9, 2017

iBoMbY said:
You can do that on the C6H as well, only when you do it, the P-states are not used (it seems). I set Pstate0 to 4 GHz, then it was running at Pstate1 with 3.6 GHz all the time ...

Try setting the "Balanced" power profile. I don't think Precision Boost obeys the custom P-states.

I anticipate Precision Boost/XFR and all that jazz to be worthless. Custom P-states and incurring the 30ms frequency shift penalty will probably be worth it, IMHO, if you can then set a higher all core turbo.

CatMerc · Mar 9, 2017

iBoMbY said:
And how exactly do you know how many Infinity Fabric Links AMD is using?

It says 32B/cycle right there...

iBoMbY · Mar 9, 2017

CatMerc said:
It says 32B/cycle right there...

Yeah, okay.

What I don't get then is why they didn't see fit to make the memory controller much beefier, when you can get 1 GB/s extra for every 32 MHz (64 at DDR)?

DDR4-2666: 41.7 GB/s
DDR4-3000: 46.9 GB/s
DDR4-3200: 50.0 GB/s
DDR4-3400: 53.1 GB/s
DDR4-3600: 56.3 GB/s
DDR4-4000: 62.5 GB/s
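
For what it's worth, the table can be reproduced from the 32B/cycle figure. A short Python sketch; it assumes the fabric clock equals MemClk (half the DDR4 transfer rate, as in the article) and that "GB/s" in the table means MB/s divided by 1024, which is the convention that matches the posted numbers. In strict decimal units the figures come out about 2.4% higher (51.2 GB/s at DDR4-3200):

```python
BYTES_PER_CYCLE = 32  # fabric width read off the clock domain diagram


def fabric_bandwidth(ddr_rate_mts):
    """Return (table-style GB/s, decimal GB/s) for a DDR4 transfer rate in MT/s."""
    memclk_mhz = ddr_rate_mts / 2            # DDR: two transfers per clock
    mb_per_s = memclk_mhz * BYTES_PER_CYCLE  # MHz * bytes per cycle = MB/s
    return mb_per_s / 1024, mb_per_s / 1000


for rate in (2666, 3000, 3200, 3400, 3600, 4000):
    table_style, decimal = fabric_bandwidth(rate)
    print(f"DDR4-{rate}: {table_style:.2f} GB/s (decimal: {decimal:.2f} GB/s)")
```

Rounded to one decimal place, the first column gives the posted values. The "1 GB/s for every 32 MHz" rule follows from the same convention: 32 MHz times 32 bytes is 1024 MB/s.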

looncraz · Mar 9, 2017

iBoMbY said:
Yeah, okay.

What I don't get then is why they didn't see fit to make the memory controller much beefier, when you can get 1 GB/s extra for every 32 MHz (64 at DDR)?

DDR4-2666: 41.7 GB/s
DDR4-3000: 46.9 GB/s
DDR4-3200: 50.0 GB/s
DDR4-3400: 53.1 GB/s
DDR4-3600: 56.3 GB/s
DDR4-4000: 62.5 GB/s

AMD's first high performance DDR4 IMC... how good was Intel's first DDR4 IMC? Oh, right, you lost performance with higher frequencies. Ouch.

lolfail9001 · Mar 9, 2017

iBoMbY said:
What I don't get then is why they didn't see fit to make the memory controller much beefier, when you can get 1 GB/s extra for every 32 MHz (64 at DDR)?

Well, to begin with, using DDR4-2933 or higher would be troublesome in servers or mobile, so making the controller handle that with ease already looks like an unnecessary struggle. Enthusiasts are a minority; you always have to bear that in mind.

looncraz said:
AMD's first high performance DDR4 IMC... how good was Intel's first DDR4 IMC? Oh, right, you lost performance with higher frequencies. Ouch.

Wait, Haswell-E lost performance with higher DDR4 freqs? Curious...

looncraz · Mar 9, 2017

lolfail9001 said:
Wait, Haswell-E lost performance with higher DDR4 freqs? Curious...

Yep:

http://www.anandtech.com/show/8959/...-3200-with-gskill-corsair-adata-and-crucial/4

Not consistently, mind you, but often enough DDR4-3000 CL15 performed identically to - or worse than - DDR4-2133 CL15.

Rifter · Mar 9, 2017

looncraz said:
Yep:

http://www.anandtech.com/show/8959/...-3200-with-gskill-corsair-adata-and-crucial/4

Not consistently, mind you, but often enough DDR4-3000 CL15 performed identically to - or worse than - DDR4-2133 CL15.

Yeah, even with the memory issues AMD is still doing better than Intel's first at-bat with DDR4.

The big question I have is how high we can push the memory controller. What will the final stable speeds be? 3200, 3400, 3600? Even 4000? I really want to know lol.
