
Ryzen: Strictly technical

Status
Not open for further replies.

IRobot23

Senior member
Jul 3, 2017
601
183
76
There is no way to test this because the BIOS setting for 2+2 versus 4+0 is ignored. You could conceivably test it by forcing a program to use specific cores, but that would require an R7 CPU to even try, and the results would be tainted. You can't test 2+2 / 4+0 on a 1500X anyway, so it is a bad example.
Yes, you can. You disable cores on an R7 1700.
 

IRobot23

Senior member
Jul 3, 2017
601
183
76
And increased L2 cache sizes - in all, a small decrease. You ask why I would be left with cross-CCX traffic: because some programs (games) use more cores, and while Windows is aware of CCX topology, when a game requests 6 threads and they can't all be on 4 cores because of load balancing, it transfers 2 threads to another CCX. If the game is not CCX-aware, the threads transferred may be the ones that require a lot of communication, so the latency of crossing IF causes problems. 6 cores per CCX would work much better in such cases, and remember that there is no reason AMD could not add more L3 cache to compensate for more cores per CCX. In fact, there were rumors that 7nm-based Zen will have 64MB of cache per module (and 256MB per CPU in the biggest EPYC), so maybe there is something to that.

That is not true... you can effectively use more than 8* threads without a cross-CCX latency penalty.
 

mat9v

Member
Mar 17, 2017
25
0
11
Yes, you can. You disable cores on an R7 1700.
Nope, even on an R7 you can't get a 4+0 config. I know, I have tried on a few boards. You only get 2+2 even if you select 4+0. A core-to-core ping shows that it is not working.

Yes, I can use more than 4 threads, but not more than 4 cores. Windows prefers cores to threads: it won't put a 5th thread on the SMT sibling of a core already in use if it has free cores on another CCX. This is how Windows load balancing works. So 5 heavy threads result in two CCXs being used, unless you force the application to use only certain cores.
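
To make the load-balancing argument concrete, here is a toy model of the placement rule as described in this post. It is only a sketch, not the real Windows scheduler: the function names, the "fill CCX0's cores first" ordering, and the 2 x 4 core layout with 2-way SMT are all assumptions made for illustration.

```python
def place_threads(n_threads, ccx_count=2, cores_per_ccx=4, smt_per_core=2):
    """Toy placement: every heavy thread gets an idle physical core first;
    SMT siblings are only used once all physical cores are busy."""
    total_cores = ccx_count * cores_per_ccx
    if n_threads > total_cores * smt_per_core:
        raise ValueError("more threads than logical CPUs")
    placements = []
    for t in range(n_threads):
        core = t % total_cores       # physical core index; CCX0's cores come first
        smt_slot = t // total_cores  # 0 = first thread on the core, 1 = its SMT sibling
        placements.append((core // cores_per_ccx, core, smt_slot))
    return placements


def ccxs_used(placements):
    """Sorted list of the CCX indices that ended up with at least one thread."""
    return sorted({ccx for ccx, _core, _slot in placements})


if __name__ == "__main__":
    for n in (4, 5, 6):
        print(n, "threads -> CCXs used:", ccxs_used(place_threads(n)))
```

Under these assumptions, 4 threads stay on one CCX while the 5th already lands on the second CCX, which is the cross-CCX case the post is describing.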
 

IRobot23

Senior member
Jul 3, 2017
601
183
76
Nope, even on an R7 you can't get a 4+0 config. I know, I have tried on a few boards. You only get 2+2 even if you select 4+0. A core-to-core ping shows that it is not working.
I am a bit confused. There was a big difference at launch in some games between 2+2 (16MB of L3) and 4+0 (8MB of L3), but then they fixed it. Consoles have the same setup with 4 threads per CCX, and they effectively use over 6 threads.
 

mat9v

Member
Mar 17, 2017
25
0
11
I am a bit confused. There was a big difference at launch in some games between 2+2 (16MB of L3) and 4+0 (8MB of L3), but then they fixed it. Consoles have the same setup with 4 threads per CCX, and they effectively use over 6 threads.
I have no idea what you are implying. The 4+0 (8MB of L3) you describe, as in the Ryzen 1200/1300X/1400, is in fact 2+2, not 4+0. Those are Ryzen parts that have errors in the "lower" or "upper" half of the L3 cache (look at the CCX schematics to see how that works). And yes, there are differences in games, but they are due to cache size and not to how active cores are distributed between CCXs. I remember AMD stating that they are not selling CPUs in a 4+0 organisation.
 
May 11, 2008
18,310
829
126
This got me wondering, reading this old article.
https://arstechnica.com/gadgets/2017/03/amds-moment-of-zen-finally-an-architecture-that-can-compete/

The design of the core was also extensively optimized to use less power. Integrated circuits are built of a variety of standard units such as NAND and NOT logic gates, flipflops, and even more complex elements such as half and full adders. For each of these components (called standard cells), a range of designs is possible with different trade-offs between performance, size, and power consumption.


[Figure: Fast flipflops, on the left, are large and power-hungry. Most of Zen uses slower, more efficient ones, on the right. Image: AMD]
AMD built a large library of standard cells with different characteristics. For example, it has five different flipflop designs. The fastest is twice as fast as the slowest, but it takes about 80 percent more space and uses more than twice as much power. Armed with this library, Zen was optimized to use the smaller, slower, more efficient parts where it can and the faster, larger, high-performance parts when it must. In Zen, the high-performance design is used for fewer than 10 percent of the flipflops, with the efficient one used about 60 percent of the time.
It is just a weird idea, and I am unsure if it is relevant at all.
But it makes sense that they continue down this road, because having so much choice is a great way to solve design issues.
7nm for Zen 2 could mean that, because of the possible power savings of the 7nm process, AMD might use more of the faster flops (bigger and more power-hungry) in the parts of the design that are propagation-time limited, so that it is easier to reach higher clocks on top of architectural advancements. The 7nm process would perhaps alleviate the size problem and the increased power consumption.
 

Abwx

Diamond Member
Apr 2, 2011
9,117
902
126
7nm for Zen 2 could mean that, because of the possible power savings of the 7nm process, AMD might use more of the faster flops (bigger and more power-hungry) in the parts of the design that are propagation-time limited, so that it is easier to reach higher clocks on top of architectural advancements. The 7nm process would perhaps alleviate the size problem and the increased power consumption.
They don't need to change anything, because the characteristics of the 7LP process transistors will be homogeneously extended to the basic cells...
 

Abwx

Diamond Member
Apr 2, 2011
9,117
902
126
Would it then not be the same as Zen+? A shrink and that is it?
It means that the cells' speed will increase accordingly. If 7LP is, say, 20% faster at 70% of the power, then all cells will see this improvement, so there is no need to use faster cells (from the library) at a given place.
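
The uniform-scaling argument can be written out as a small worked example. This is a sketch only: the cell names and the normalised numbers are assumptions. The 2x speed ratio comes from the article quoted earlier in the thread (whose "more than twice as much power" is taken as exactly 2x here, a lower bound), and the 20% / 70% figures are the "say" values from this post, not measured 7LP data.

```python
# Relative figures, normalised to the slow, efficient flipflop.
CELLS = {
    "efficient_flop": {"speed": 1.0, "power": 1.0},
    "fast_flop": {"speed": 2.0, "power": 2.0},  # power: lower bound from the article
}


def scale_library(cells, speed_gain, power_factor):
    """Apply the same process gain to every cell in the library."""
    return {
        name: {"speed": c["speed"] * speed_gain, "power": c["power"] * power_factor}
        for name, c in cells.items()
    }


if __name__ == "__main__":
    scaled = scale_library(CELLS, speed_gain=1.2, power_factor=0.7)
    for name, c in scaled.items():
        print(f"{name}: speed x{c['speed']:.2f}, power x{c['power']:.2f}")
```

Every cell moves by the same factor, so the ratio between the fast and the efficient flipflop is unchanged. The slow cell simply becomes a faster slow cell, which is why no cell needs to be swapped just because of the shrink.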
 

IRobot23

Senior member
Jul 3, 2017
601
183
76
I have no idea what you are implying. The 4+0 (8MB of L3) you describe, as in the Ryzen 1200/1300X/1400, is in fact 2+2, not 4+0. Those are Ryzen parts that have errors in the "lower" or "upper" half of the L3 cache (look at the CCX schematics to see how that works). And yes, there are differences in games, but they are due to cache size and not to how active cores are distributed between CCXs. I remember AMD stating that they are not selling CPUs in a 4+0 organisation.
To be clear:

If I select a 4+0 (x+0) configuration, I have only 8MB of L3 cache, so basically I can disable 1 CCX. With 2+2 I get 16MB of L3$.

I am going to do a 1+1 vs 2+0 comparison later today... (BF1, BF4)

I think the CCX is a non-issue here... The problem with gaming is IF latency, since IF connects everything (including DRAM).

FYI, you are the only one saying that a CCX is impossible to disable/isolate.
 
May 11, 2008
18,310
829
126
It means that the cells' speed will increase accordingly. If 7LP is, say, 20% faster at 70% of the power, then all cells will see this improvement, so there is no need to use faster cells (from the library) at a given place.
Yes, of course, I understand that. I was thinking more along the lines that if the process already brings power savings, it might also be useful to change the design a bit where possible by using different flipflops, if that helps reach higher clocks without needing to increase the voltage a lot (if applicable), as we see now. I mean, perhaps there is also a part of the design that just is not able to reach high clocks.
And that makes sense, since AMD uses a lot of different cell libraries for all the internal logic to save as much power as possible for an efficient design, meaning they have used the slowest flipflops wherever possible.
 

Abwx

Diamond Member
Apr 2, 2011
9,117
902
126
Yes, of course, I understand that. I was thinking more along the lines that if the process already brings power savings, it might also be useful to change the design a bit where possible by using different flipflops, if that helps reach higher clocks without needing to increase the voltage a lot (if applicable), as we see now. I mean, perhaps there is also a part of the design that just is not able to reach high clocks.
And that makes sense, since AMD uses a lot of different cell libraries for all the internal logic to save as much power as possible for an efficient design, meaning they have used the slowest flipflops wherever possible.
You have to consider that it's the speed ratio between two attached parts that matters. If speed is increased at a given place, this will require proportionally sped-up circuitry elsewhere in the pipeline.

One place where it's beneficial is the IMC, as was done in Zen+, because one end of this circuit is connected to an external circuit (the RAM) which is clocked at a lower frequency than the IMC as seen from its other end. This allows lower latency, since the limitation is actually internal (a given number of cycles of delay for any request from the L3).
 

el etro

Golden Member
Jul 21, 2013
1,581
14
81
According to the GF graph 7LP-mobile should offer 30% more FMAX at 80% of the power.
 
May 11, 2008
18,310
829
126
You have to consider that it's the speed ratio between two attached parts that matters. If speed is increased at a given place, this will require proportionally sped-up circuitry elsewhere in the pipeline.

One place where it's beneficial is the IMC, as was done in Zen+, because one end of this circuit is connected to an external circuit (the RAM) which is clocked at a lower frequency than the IMC as seen from its other end.
Aha, that makes me think of synchronization circuits and asynchronous FIFOs to buffer data inside the IMC between 2 clock domains.
If the RAM speed goes up towards DDR4-4000, faster flipflops are needed.
But I read that nothing was done to the design of Zen+, that it is a carbon copy of the die used for Threadripper. Now I am confused.

I was more thinking of the cores and cache. But it makes sense that everything has to run faster or the bottleneck just shifts to another part of the design.
That makes it a real challenge.
 

IRobot23

Senior member
Jul 3, 2017
601
183
76
Okay, I did the test in BF1 and BF4. Since BF1 is a bit broken for 2C/4T, don't mind the graph. Both run pretty badly.


The C6H has some problems when enabling different cores: it obviously doesn't do a cold boot, and the settings stayed the same. So after restarting I needed to shut down and power up manually to get the different configurations, 2+0 (8MB of L3$) vs 1+1 (16MB of L3$), and then back to 8C/16T.

I ran custom settings in BF1 and ultra settings (except MSAA off and effects on high) in BF4.

I would like to add that on an empty map this setup would be faster than an FX 8350 at 4.5GHz, at least in BF4 (Mantle). I am not sure about BF1, because the FPS are all over the place... even on an empty map.

Why did I have different settings on the GPU? Because of the manual cold boot.

So I don't know why more cores per CCX is a good idea... AMD could simply do 1 CCX with 8 cores in it for the Ryzen AM4 platform. L3 cache latency would be higher, and it would probably need more power to maintain a good average latency. The good thing is that each core would have access to 16MB of L3, but as is obvious, games simply don't take advantage of it.

The main problem is DRAM latency, caused by the low-speed IF. This makes it hard for Ryzen to compete in the high-FPS range, and it will never catch an i7 8700K at 5GHz with a 5GHz LLC/ring.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Downcoring is working fine in the most recent AGESA.
In 4+0 mode the latency between C0 - C1/C2/C3 is 26ns, meaning they are located in the same CCX.
If the cores were located in different CCXs, the latency would be > 100ns higher.

Each core can access an L3 in every CCX; however, if the L3 is located in a different CCX then you'll pay the SDF latency penalty.
<= 8MB always has fast latency for every core, on SKUs which have the full L3 available.
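
As a rough way to read a core-to-core ping result against these numbers, the check can be sketched as below. Only the 26 ns and "> 100 ns higher" figures come from this post. The function names, the halfway cutoff, and the sample readings are made up for illustration and are not real measurements.

```python
SAME_CCX_NS = 26.0          # C0 to C1/C2/C3 in 4+0 mode, as quoted in the post
CROSS_CCX_EXTRA_NS = 100.0  # the post says a cross-CCX hop would be more than 100 ns higher


def shares_ccx(latency_ns, baseline_ns=SAME_CCX_NS, extra_ns=CROSS_CCX_EXTRA_NS):
    """True if a core-to-core latency looks like a same-CCX hop.

    The cutoff halfway between the two regimes is an arbitrary choice.
    """
    return latency_ns < baseline_ns + extra_ns / 2


def cores_on_c0_ccx(latency_from_c0):
    """Cores whose ping from C0 looks like it stayed inside C0's CCX."""
    return sorted(core for core, ns in latency_from_c0.items() if shares_ccx(ns))


if __name__ == "__main__":
    # Hypothetical readings, keyed by core index.
    working_4_0 = {1: 26.0, 2: 26.0, 3: 26.0}
    setting_ignored = {1: 26.0, 2: 140.0, 3: 140.0}
    print("4+0 applied:", cores_on_c0_ccx(working_4_0))
    print("still 2+2:  ", cores_on_c0_ccx(setting_ignored))
```

If all three remaining cores come back in the same-CCX group, the downcore setting was applied as 4+0. If only one does, the board is still running 2+2.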
 

mat9v

Member
Mar 17, 2017
25
0
11
Downcoring is working fine in the most recent AGESA.
In 4+0 mode the latency between C0 - C1/C2/C3 is 26ns, meaning they are located in the same CCX.
If the cores were located in different CCXs, the latency would be > 100ns higher.

Each core can access an L3 in every CCX; however, if the L3 is located in a different CCX then you'll pay the SDF latency penalty.
<= 8MB always has fast latency for every core, on SKUs which have the full L3 available.
So it seems my error was in not completely powering down my PC between changes. How stupid of me...
 

IRobot23

Senior member
Jul 3, 2017
601
183
76
So it seems. I will test it myself in various games in 2+2 and 4+0 configs, then in Windows with forced core association with 8 cores active.
Why force it? There is zero difference in benchmarks where the CPU is at 100%. I don't see the problem.

Game engines are designed to work that way, and this is simply how consoles are designed. Only 2MB of LLC... You don't need the LLC to be shared. Every core has 2MB of its own LLC and can access up to 8MB of LLC.
 

mat9v

Member
Mar 17, 2017
25
0
11
Why force it? There is zero difference in benchmarks where the CPU is at 100%. I don't see the problem.
Game engines are designed to work that way, and this is simply how consoles are designed. Only 2MB of LLC... You don't need the LLC to be shared. Every core has 2MB of its own LLC and can access up to 8MB of LLC.
What does console design have to do with it? Console CPUs do not have CCXs. I was talking about forcing games to run on 4 cores belonging to one CCX, or on 2+2 from different CCXs, in Windows by pinning them to cores. You can test any configuration that way. It's just that you can't simulate a real 4-core CPU, because system processes would run on the free cores and would not impact the cores used by the game, as they would in a real-life scenario (when using a physical 4-core CPU).
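
For the pinning approach, the affinity masks for the two layouts could be worked out roughly like this. It is a sketch under assumptions that are not from the thread: cores 0-3 are taken to sit on CCX0 and cores 4-7 on CCX1, and with SMT on, core c is taken to own logical CPUs 2*c and 2*c + 1. The helper names are hypothetical, and the numbering should be checked on the actual machine before trusting the masks.

```python
def logical_cpus(core_ids, smt_enabled=True, include_siblings=False):
    """Map physical core indices to logical CPU indices.

    Assumed numbering: with SMT on, core c owns logical CPUs 2*c and 2*c + 1;
    with SMT off, core c is simply logical CPU c.
    """
    cpus = []
    for core in core_ids:
        if smt_enabled:
            cpus.append(2 * core)
            if include_siblings:
                cpus.append(2 * core + 1)
        else:
            cpus.append(core)
    return cpus


def to_mask(cpus):
    """Fold a list of logical CPU indices into an affinity bitmask."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


# Assumed layout: cores 0-3 on CCX0, cores 4-7 on CCX1.
CONFIGS = {
    "4+0": [0, 1, 2, 3],
    "2+2": [0, 1, 4, 5],
}

if __name__ == "__main__":
    for name, cores in CONFIGS.items():
        print(name, hex(to_mask(logical_cpus(cores))))
```

With these assumptions the masks come out as 0x55 for 4+0 and 0x505 for 2+2 (one thread per core, SMT siblings left idle). A mask like this is the kind of value that `start /affinity` expects in hex, and the CPU list itself is what Task Manager's affinity dialog shows.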
 

IRobot23

Senior member
Jul 3, 2017
601
183
76
What does console design have to do with it? Console CPUs do not have CCXs. I was talking about forcing games to run on 4 cores belonging to one CCX, or on 2+2 from different CCXs, in Windows by pinning them to cores. You can test any configuration that way. It's just that you can't simulate a real 4-core CPU, because system processes would run on the free cores and would not impact the cores used by the game, as they would in a real-life scenario (when using a physical 4-core CPU).
Consoles do have a "CCX", and the latency is even worse.

My test was 1+1 (SMT on). So, if CCX latency is a problem, this would be the worst-case scenario.
 

mat9v

Member
Mar 17, 2017
25
0
11
Consoles do have a "CCX", and the latency is even worse.

My test was 1+1 (SMT on). So, if CCX latency is a problem, this would be the worst-case scenario.
No, consoles do not have a CCX. They are based on the Jaguar core; if anything, they have 2-core complexes, four of them.
You are right, you were testing the worst-case scenario :) Now I have to see that for myself on my PC :) so I just have to find some time to do the tests.
 
