Ryzen: Strictly technical

IRobot23 · Apr 27, 2018

mat9v said:
There is no way to test that, because the BIOS setting that switches between 2+2 and 4+0 is ignored. You could conceivably test it by forcing a program to use specific cores, but that would require an R7 CPU to even try, and the results would be tainted. You can't test 2+2 / 4+0 on a 1500X anyway - bad example.

Yes you can. You disable cores on R7 1700.

IRobot23 · Apr 27, 2018

mat9v said:
And increased L2 cache sizes - in all, small decrease. You ask, why would I be left with cross-CCX traffic - because some programs (games) use more cores and while Windows is aware of CCX topology, when the game requests 6 threads and they can't be on 4 cores because of load balancing, it transfers 2 threads to another CCX. If the game is not CCX aware then the threads transferred may be the ones that require a lot of communication hence the latency of crossing IF causing problems. 6 cores per CCX would work much better in such cases and remember that there is no reason AMD could not add more L3 cache to compensate for more cores in CCX. In fact there were rumors that 7nm based ZEN will have 64MB of cache per module (and 256MB per CPU in biggest EPYC) so maybe there is something to that.

That is not true. You can effectively use more than 8* threads without a cross-CCX latency penalty.

mat9v · Apr 27, 2018

IRobot23 said:
Yes you can. You disable cores on R7 1700.

Nope, even on an R7 you can't get a 4+0 config. I know, I have tried on a few boards. You only get 2+2 even if you select 4+0. A core-to-core ping shows that it is not working.

Yes, I can use more than 4 threads, but not more than 4 cores. Windows prefers physical cores to SMT threads: it won't put a 5th thread on the SMT sibling of a core already in use if there are free cores on another CCX. This is how Windows load balancing works. So 5 heavy threads result in two CCXs being used, unless you force the application to use only certain cores.
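The placement rule described above can be sketched as a toy model in Python. This is only an illustration of the claim (free physical cores are filled before any SMT sibling is used), not the real Windows scheduler. The topology constants and the function names `place_threads` and `ccxs_used` are assumptions made for the example, and filling CCX 0 before CCX 1 is a simplification.

```python
# Toy model of the placement rule described in the post, NOT the real
# Windows scheduler. Assumed topology: 2 CCXs x 4 cores x 2 SMT threads.
CCX_COUNT = 2
CORES_PER_CCX = 4
THREADS_PER_CORE = 2


def place_threads(n_threads):
    """Return one (ccx, core, smt_slot) tuple per heavy thread.

    Modeled rule: use every free physical core first (across both CCXs),
    and only then start using the second SMT slot of each core.
    """
    total_cores = CCX_COUNT * CORES_PER_CCX
    if n_threads > total_cores * THREADS_PER_CORE:
        raise ValueError("more threads than logical processors")
    placement = []
    for i in range(n_threads):
        # Threads 0..7 land on SMT slot 0, threads 8..15 on SMT slot 1.
        smt_slot, core_index = divmod(i, total_cores)
        ccx, core = divmod(core_index, CORES_PER_CCX)
        placement.append((ccx, core, smt_slot))
    return placement


def ccxs_used(n_threads):
    """Set of CCX indices touched by n_threads heavy threads."""
    return {ccx for ccx, _, _ in place_threads(n_threads)}


if __name__ == "__main__":
    for n in (4, 5, 6):
        print(n, "threads ->", len(ccxs_used(n)), "CCX(s) in use")
```

Under this model the 5th heavy thread is the first one to land on the second CCX, which is the situation the post describes.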

IRobot23 · Apr 27, 2018

mat9v said:
Nope, even on an R7 you can't get a 4+0 config. I know, I have tried on a few boards. You only get 2+2 even if you select 4+0. A core-to-core ping shows that it is not working.

I am a bit confused. There was a big difference at launch in some games between 2+2 (16MB of L3) and 4+0 (8MB of L3), but then they fixed it. Consoles have the same setup, with 4 threads per CCX, and they effectively use over 6 threads.

mat9v · Apr 27, 2018

IRobot23 said:
I am a bit confused. There was a big difference at launch in some games between 2+2 (16MB of L3) and 4+0 (8MB of L3), but then they fixed it. Consoles have the same setup, with 4 threads per CCX, and they effectively use over 6 threads.

I have no idea what you are implying - 4+0 (8MB of L3) as in the Ryzen 1200/1300X/1400 is in fact 2+2, it is not 4+0. They are Ryzen cores that have errors in the "lower" or "upper" half of the L3 cache (look at the CCX schematics to see how that works). And yes, the differences in games are there, but they are due to cache size and not to how the active cores are distributed between CCXs. I remember AMD stating that they are not selling CPUs in a 4+0 organisation.

William Gaatjes · Apr 28, 2018

This got me wondering, reading this old article.
https://arstechnica.com/gadgets/2017/03/amds-moment-of-zen-finally-an-architecture-that-can-compete/

The design of the core was also extensively optimized to use less power. Integrated circuits are built of a variety of standard units such as NAND and NOT logic gates, flipflops, and even more complex elements such as half and full adders. For each of these components (called standard cells), a range of designs is possible with different trade-offs between performance, size, and power consumption.

[Figure: Fast flipflops, on the left, are large and power-hungry. Most of Zen uses slower, more efficient ones, on the right. Image: AMD]
AMD built a large library of standard cells with different characteristics. For example, it has five different flipflop designs. The fastest is twice as fast as the slowest, but it takes about 80 percent more space and uses more than twice as much power. Armed with this library, Zen was optimized to use the smaller, slower, more efficient parts where it can and the faster, larger, high-performance parts when it must. In Zen, the high-performance design is used for fewer than 10 percent of the flipflops, with the efficient one used about 60 percent of the time.

It is just a weird idea and I am unsure whether it is relevant at all.
But it makes sense that they continue down this road, because having so much choice is a great way to solve design issues.
7nm for Zen 2 could mean that, because of the possible power savings of the 7nm process, AMD might use more of the faster flops (which are bigger and more power-hungry) in the parts of the design that are propagation-time limited, so that it is easier to reach higher clocks on top of architectural advancements. And the 7nm process would perhaps alleviate the size problem and the increased power consumption.

Abwx · Apr 28, 2018

William Gaatjes said:
7nm for Zen 2 could mean that, because of the possible power savings of the 7nm process, AMD might use more of the faster flops (which are bigger and more power-hungry) in the parts of the design that are propagation-time limited, so that it is easier to reach higher clocks on top of architectural advancements. And the 7nm process would perhaps alleviate the size problem and the increased power consumption.

They don't need to change anything, because the characteristics of the 7LP process transistors will be extended homogeneously to the basic cells...

William Gaatjes · Apr 28, 2018

Abwx said:
They don't need to change anything, because the characteristics of the 7LP process transistors will be extended homogeneously to the basic cells...

Would it then not be the same as Zen+? A shrink and that is it?

Abwx · Apr 28, 2018

William Gaatjes said:
Would it then not be the same as Zen+? A shrink and that is it?

It means that the cell speed will increase accordingly: if 7LP is, say, 20% faster at 70% of the power, then all cells will see this improvement, so there is no need to use faster cells (from the library) at a given place.
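The arithmetic behind this argument can be sketched in Python. The 20% / 70% process figures are the hypothetical ones from the post, and the 2x speed ratio between the fastest and slowest flipflop comes from the quoted article. The power ratio ("more than twice") is rounded to exactly 2.0 here as an assumption, and the names `CELLS`, `scale_library` and `speed_ratio` are made up for the example.

```python
# Relative cell figures, normalized to the slowest flipflop.
# The 2x speed ratio is from the quoted article; the power ratio there is
# "more than twice", rounded to 2.0 here purely as an assumption.
CELLS = {
    "slow_flop": {"speed": 1.0, "power": 1.0},
    "fast_flop": {"speed": 2.0, "power": 2.0},
}


def scale_library(cells, speed_gain, power_factor):
    """Apply one uniform process improvement to every cell in the library."""
    return {
        name: {
            "speed": cell["speed"] * speed_gain,
            "power": cell["power"] * power_factor,
        }
        for name, cell in cells.items()
    }


def speed_ratio(cells):
    """Speed of the fast flipflop relative to the slow one."""
    return cells["fast_flop"]["speed"] / cells["slow_flop"]["speed"]


# Hypothetical "20% faster at 70% of the power" process from the post.
scaled = scale_library(CELLS, speed_gain=1.2, power_factor=0.7)

if __name__ == "__main__":
    for name, cell in scaled.items():
        print(name, cell)
    print("fast/slow speed ratio before:", speed_ratio(CELLS))
    print("fast/slow speed ratio after: ", speed_ratio(scaled))
```

Because the same factor is applied to every cell, the ratio between fast and slow cells is unchanged, which is the point being made: the relative timing between attached parts stays the same, so the existing cell choices remain valid.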

IRobot23 · Apr 28, 2018

mat9v said:
I have no idea what you are implying - 4+0 (8MB of L3) as in the Ryzen 1200/1300X/1400 is in fact 2+2, it is not 4+0. They are Ryzen cores that have errors in the "lower" or "upper" half of the L3 cache (look at the CCX schematics to see how that works). And yes, the differences in games are there, but they are due to cache size and not to how the active cores are distributed between CCXs. I remember AMD stating that they are not selling CPUs in a 4+0 organisation.

To be clear:

If I select a 4+0 (x+0) configuration I have only 8MB of L3 cache, so basically I can disable 1 CCX. With 2+2 I will get 16MB of L3$.

I am going to do a 1+1 vs 2+0 comparison later today... (BF1, BF4)

I think CCX is a non-issue here... the problem with gaming is IF latency, since IF connects everything (DRAM).

FYI, you are the only one saying that a CCX is impossible to disable/isolate.

William Gaatjes · Apr 28, 2018

Abwx said:
It means that the cell speed will increase accordingly: if 7LP is, say, 20% faster at 70% of the power, then all cells will see this improvement, so there is no need to use faster cells (from the library) at a given place.

Yes of course. I understand that. I was thinking more along the lines that, if the process already brings power savings, it might also be useful to change the design a bit where possible by using different flipflops, if that helps reach higher clocks without needing to increase the voltage a lot (if applicable), as we see now. I mean, perhaps there is also a part of the design that just is not able to reach high clocks.
And that makes sense, since AMD uses a lot of different cell libraries for all the internal logic to save as much power as possible for an efficient design, meaning they have used the slowest flipflops wherever possible.

Abwx · Apr 28, 2018

William Gaatjes said:
Yes of course. I understand that. I was thinking more along the lines that, if the process already brings power savings, it might also be useful to change the design a bit where possible by using different flipflops, if that helps reach higher clocks without needing to increase the voltage a lot (if applicable), as we see now. I mean, perhaps there is also a part of the design that just is not able to reach high clocks.
And that makes sense, since AMD uses a lot of different cell libraries for all the internal logic to save as much power as possible for an efficient design, meaning they have used the slowest flipflops wherever possible.

You have to consider that it's the speed ratio between two attached parts that matters: if speed is increased at a given place, this will require proportionally faster circuitry elsewhere in the pipeline.

One place where it's beneficial is the IMC, as was done in Zen+, because one end of this circuit is connected to an external circuit (the RAM), which is clocked at a lower frequency than the IMC as seen from its other end. This allows lower latency, since the limitation is actually internal (a given number of cycles of delay for any request from the L3).

el etro · Apr 28, 2018

According to the GF graph 7LP-mobile should offer 30% more FMAX at 80% of the power.

William Gaatjes · Apr 28, 2018

Abwx said:
You have to consider that it's the speed ratio between two attached parts that matters: if speed is increased at a given place, this will require proportionally faster circuitry elsewhere in the pipeline.

One place where it's beneficial is the IMC, as was done in Zen+, because one end of this circuit is connected to an external circuit (the RAM), which is clocked at a lower frequency than the IMC as seen from its other end.

Aha, that makes me think of synchronization circuits and asynchronous FIFOs to buffer data inside the IMC between 2 clock domains.
If the RAM speed goes up towards DDR4-4000, faster flipflops are needed.
But I read that nothing was done to the design of Zen+, that it is a carbon copy of the die used for Threadripper. Now I am confused.

I was thinking more of the cores and cache. But it makes sense that everything has to run faster, or the bottleneck just shifts to another part of the design.
That makes it a real challenge.

IRobot23 · Apr 28, 2018

Okay, I did the test in BF1 and BF4. Since BF1 is a bit broken for 2C/4T, don't mind the graph. Both run pretty badly.

The C6H has some problems when enabling different cores: it apparently doesn't do a cold boot, and the settings stayed the same. So after restarting I needed to shut down and power up manually to get the different configurations, 2+0 (8MB of L3$) vs 1+1 (16MB of L3$), and then back to 8C/16T.

I ran custom settings in BF1 and ultra settings (except MSAA off and effects on high) in BF4.

I would like to add that in an empty map this setup would be faster than an FX 8350 at 4.5GHz, at least in BF4 (Mantle). I am not sure about BF1, because FPS is all over the place... even in an empty map.

Why did I have different settings on the GPU? Because of the manual cold boot.

So I don't know why more cores per CCX is a good idea.. AMD could simply do 1 CCX with 8 cores in it for the Ryzen AM4 platform. L3 cache latency would be higher, and it would probably need more power to maintain a good average latency. The good thing is that each core would have access to 16MB of L3, but evidently games simply don't take advantage of it.

The main problem is DRAM latency, caused by the low-speed IF. This makes it hard for Ryzen to compete in the high-FPS range, and it will never catch an i7 8700K at 5GHz with a 5GHz LLC/ring.

The Stilt · Apr 28, 2018

Downcoring is working fine in the most recent AGESA.
In 4+0 mode the latency between C0 - C1/C2/C3 is 26ns, meaning they are located in the same CCX.
If the cores were located in different CCXs, the latency would be > 100ns higher.

Each core can access a L3 in every CCX, however if the L3 is located in a different CCX then you'll pay the SDF latency penalty.
Up to 8MB always has fast latency for every core, on SKUs which have the full L3 available.
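The check described above (same-CCX pairs at about 26ns, cross-CCX pairs more than 100ns higher) can be sketched in Python. Only the 26ns and 100ns figures come from the post. The sample readings, the midpoint threshold and the names `classify` and `downcore_is_single_ccx` are hypothetical, chosen for illustration; they are not measured results.

```python
SAME_CCX_NS = 26.0          # same-CCX core-to-core latency quoted in the post
CROSS_CCX_EXTRA_NS = 100.0  # cross-CCX is "more than 100ns higher" per the post

# Assumption: put the decision threshold halfway between the two regimes.
THRESHOLD_NS = SAME_CCX_NS + CROSS_CCX_EXTRA_NS / 2


def classify(latency_ns, threshold_ns=THRESHOLD_NS):
    """Label a core-to-core latency reading as same-CCX or cross-CCX."""
    return "same CCX" if latency_ns < threshold_ns else "cross CCX"


def downcore_is_single_ccx(latencies_from_c0):
    """True if every core reachable from C0 looks like it shares C0's CCX."""
    return all(classify(ns) == "same CCX" for ns in latencies_from_c0.values())


# Hypothetical readings from C0 to the other active cores, for illustration.
looks_like_4_plus_0 = {"C1": 26.0, "C2": 26.5, "C3": 25.8}
looks_like_2_plus_2 = {"C1": 26.0, "C2": 131.0, "C3": 130.2}

if __name__ == "__main__":
    for name, readings in (("config A", looks_like_4_plus_0),
                           ("config B", looks_like_2_plus_2)):
        labels = {core: classify(ns) for core, ns in readings.items()}
        print(name, labels, "single CCX:", downcore_is_single_ccx(readings))
```

Any real verification would use latencies measured with a core-to-core ping tool on the machine in question; the gap between the two regimes is large enough that the exact threshold should matter little.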

mat9v · Apr 28, 2018

The Stilt said:
Downcoring is working fine in the most recent AGESA.
In 4+0 mode the latency between C0 - C1/C2/C3 is 26ns, meaning they are located in the same CCX.
If the cores were located in different CCXs, the latency would be > 100ns higher.

Each core can access a L3 in every CCX, however if the L3 is located in a different CCX then you'll pay the SDF latency penalty.
Up to 8MB always has fast latency for every core, on SKUs which have the full L3 available.

So it seems my error was in not completely powering down my PC between changes. How stupid of me...

IRobot23 · Apr 28, 2018

mat9v said:
So it seems my error was in not completely powering down my PC between changes. How stupid of me...

As you can see, there is near zero penalty.

mat9v · Apr 29, 2018

IRobot23 said:
As you can see, there is near zero penalty.

So it seems. I will test it myself in various games in 2+2 and 4+0 configs, and then in Windows with forced core affinity with 8 cores active.

IRobot23 · Apr 29, 2018

mat9v said:
So it seems. I will test it myself in various games in 2+2 and 4+0 configs, and then in Windows with forced core affinity with 8 cores active.

Why force it? There is zero difference in a bench where the CPU is at 100%. I don't see the problem.

Game engines are designed to work that way, and this is simply how consoles are designed. Only 2MB of LLC... You don't need the LLC to be shared. Every core has 2MB of its own LLC and can access up to 8MB of LLC.

mat9v · Apr 29, 2018

IRobot23 said:
Why force it? There is zero difference in a bench where the CPU is at 100%. I don't see the problem.
Game engines are designed to work that way, and this is simply how consoles are designed. Only 2MB of LLC... You don't need the LLC to be shared. Every core has 2MB of its own LLC and can access up to 8MB of LLC.

What does console design have to do with it? Console CPUs do not have CCXs. I was talking about forcing games to run on 4 cores belonging to one CCX, or on 2+2 from different CCXs, in Windows by pinning them to cores. You can test any configuration that way; it's just that you can't simulate a real 4-core CPU, because system processes would run on the free cores and would not load the cores used by the game, as they would in a real-life scenario (when using a physical 4-core CPU).
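The pinning described above needs an affinity mask for each configuration, which can be sketched in Python. The logical-processor numbering is an assumption (logical CPUs 2k and 2k+1 are the SMT siblings of core k, cores 0-3 sit in CCX 0 and cores 4-7 in CCX 1) and should be checked on the actual system. The function name `affinity_mask` is made up for the example.

```python
# Assumed numbering (verify on the real machine): logical CPUs 2k and 2k+1
# are the two SMT threads of physical core k; cores 0-3 are CCX 0 and
# cores 4-7 are CCX 1.
CORES_PER_CCX = 4
THREADS_PER_CORE = 2


def affinity_mask(cores_per_ccx, use_smt=False):
    """Build a bitmask of logical CPUs for a config such as (4, 0) or (2, 2).

    cores_per_ccx[i] is how many physical cores to use on CCX i.
    With use_smt=False only the first logical CPU of each core is selected.
    """
    mask = 0
    for ccx, n_cores in enumerate(cores_per_ccx):
        if not 0 <= n_cores <= CORES_PER_CCX:
            raise ValueError("a CCX has at most %d cores" % CORES_PER_CCX)
        for core in range(n_cores):
            first_logical = (ccx * CORES_PER_CCX + core) * THREADS_PER_CORE
            mask |= 1 << first_logical
            if use_smt:
                mask |= 1 << (first_logical + 1)
    return mask


if __name__ == "__main__":
    for config in ((4, 0), (2, 2), (1, 1)):
        print(config, "no SMT:", hex(affinity_mask(config)),
              "with SMT:", hex(affinity_mask(config, use_smt=True)))
```

A mask like this could then be applied through Task Manager's affinity dialog or, for example, the `start /affinity` option in cmd, which takes the mask as a hexadecimal number. As the post notes, this does not reproduce a real 4-core CPU, since system processes can still run on the cores left outside the mask.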

IRobot23 · Apr 29, 2018

mat9v said:
What does console design have to do with it? Console CPUs do not have CCXs. I was talking about forcing games to run on 4 cores belonging to one CCX, or on 2+2 from different CCXs, in Windows by pinning them to cores. You can test any configuration that way; it's just that you can't simulate a real 4-core CPU, because system processes would run on the free cores and would not load the cores used by the game, as they would in a real-life scenario (when using a physical 4-core CPU).

Consoles do have "ccx" and latency is even worse.

My test was 1+1 (SMT ON). So, if CCX latency is a problem this would be worst case scenario.

mat9v · Apr 29, 2018

IRobot23 said:
Consoles do have "ccx" and latency is even worse.

My test was 1+1 (SMT ON). So, if CCX latency is a problem this would be worst case scenario.

No, consoles do not have CCX, they are based on Jaguar core, if anything they have 2-core complexes, four of them.
You are right, you were testing worst case scenario 🙂 Now I have to see that for myself on my PC 🙂 so just have to find some time to do the tests.

Wall Street · Apr 29, 2018

mat9v said:
No, consoles do not have CCX, they are based on Jaguar core, if anything they have 2-core complexes, four of them.

https://www.anandtech.com/show/6976...wering-xbox-one-playstation-4-kabini-temash/4

"Both designs incorporate two quad-core Jaguar modules"

Consoles do basically have 2 quad-core CCX.

mat9v · Apr 29, 2018

Wall Street said:
https://www.anandtech.com/show/6976...wering-xbox-one-playstation-4-kabini-temash/4

"Both designs incorporate two quad-core Jaguar modules"

Consoles do basically have 2 quad-core CCX.

They are 4 two-core complexes (grouped in pairs), all connected by L2 cache, which is a long way from CCXs connected by Infinity Fabric, even if "communication between them is not ideal".
But OK, you have a point there that it resembles the CCXs in Ryzen 🙂
