Ryzen: Strictly technical


iBoMbY

Member
Nov 23, 2016
175
103
86
We could use Process Lasso to bind all those background tasks and programs that are normally running to cores on CCX1 instead of CCX0, giving the game threads a much better chance of remaining on CCX0 where they belong, because the Windows scheduler will see the cores on CCX1 running more tasks and will schedule the game onto CCX0. It would only help, it is not a full solution, but I think it is worth trying.

You could make that easier by setting up two processor groups using bcdedit. You could then assign everything from the system to group 1, and the game to group 0. Most applications are not aware of processor groups and couldn't use the other group on their own.
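
For the Process Lasso-style approach in the quoted post, the same thing can also be scripted against the plain Win32 affinity API. A minimal sketch, assuming (without any guarantee) that logical processors 8-15 map to CCX1 on an 8C/16T part with SMT enabled:

/* pinccx: bind an already-running process to the logical CPUs of one CCX.
   The 0xFF00 mask (logical CPUs 8-15 = assumed CCX1) is an assumption,
   not something the OS reports here. Usage: pinccx <pid> */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: pinccx <pid>\n"); return 1; }
    DWORD pid = (DWORD)strtoul(argv[1], NULL, 10);

    HANDLE h = OpenProcess(PROCESS_SET_INFORMATION | PROCESS_QUERY_INFORMATION,
                           FALSE, pid);
    if (!h) { fprintf(stderr, "OpenProcess failed: %lu\n", GetLastError()); return 1; }

    DWORD_PTR ccx1 = 0xFF00;                /* bits 8..15 -> logical CPUs 8-15 */
    if (!SetProcessAffinityMask(h, ccx1))
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n", GetLastError());

    CloseHandle(h);
    return 0;
}

Run it against the PIDs of the background stuff (e.g. pinccx 4312, with 4312 standing in for whatever PID you want to move), and the scheduler should then prefer the now-idle CCX0 for the game, as described above.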
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
While I agree with what you're saying, the peculiarities of Ryzen 7's latency issues are not fully understood yet. Going back to the Hardware.fr investigation, they speculate:

The cache issues are even more telling: they designed a benchmark that tests different sizes sequentially, and the results for sizes >= 6 MB are particularly problematic. In this case, they say, it is because sequential access disables the other CCX, and in this benchmark the chip behaves as if it only has 8 MB of L3.
We will surely learn more over time; in particular, I'm going to do some microbenchmarking.

The 2nd L3 is of no help. As a victim cache it would only hold evicted data from cores of the 2nd CCX.
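
For anyone curious what such a microbenchmark looks like, here is a minimal pointer-chase sketch (illustrative only; the 16 MB working set and iteration count are arbitrary, and on Windows clock_gettime would be swapped for QueryPerformanceCounter):

/* Pointer-chase latency sketch: a chain of dependent loads over a random
   permutation, so prefetchers cannot hide the miss latency. A 16 MB
   working set exceeds one CCX's 8 MB L3 and spills into DRAM. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (16u * 1024 * 1024 / sizeof(void *))      /* ~2M pointers = 16 MB */

static unsigned long long s = 88172645463325252ULL; /* xorshift64 state */
static size_t xrand(void) { s ^= s << 13; s ^= s >> 7; s ^= s << 17; return (size_t)s; }

int main(void)
{
    void **buf = malloc(N * sizeof(void *));
    size_t *idx = malloc(N * sizeof(size_t));
    if (!buf || !idx) return 1;

    for (size_t i = 0; i < N; i++) idx[i] = i;
    for (size_t i = N - 1; i > 0; i--) {             /* Fisher-Yates shuffle */
        size_t j = xrand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < N; i++)                   /* one big random cycle */
        buf[idx[i]] = &buf[idx[(i + 1) % N]];

    struct timespec t0, t1;
    void **p = &buf[idx[0]];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < 8 * N; i++)               /* dependent loads */
        p = (void **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    /* printing p keeps the chain from being optimized away */
    printf("avg load-to-use latency: %.1f ns (%p)\n", ns / (8.0 * N), (void *)p);
    free(idx); free(buf);
    return 0;
}

Pinning it to a single core (e.g. with start /affinity) versus letting it wander between CCXs should already show whether thread migration is part of the picture.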
 
  • Like
Reactions: looncraz

CatMerc

Golden Member
Jul 16, 2016
1,114
1,149
136
I do wonder what the effect of having the DF and RAM speed at 1:1 would be.
A way to extrapolate it would be to run memory at, say, 1066 MT/s, measure latency, and then run memory at 2133 MT/s and measure again.
If the number of cycles it takes for data to travel across the DF is identical at the higher frequency, then we can extrapolate what a 1:1 clock would do.

Sadly, I do not have a Ryzen sample, nor the know-how to build a program to test it.
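
For whoever does have a sample: the extrapolation itself is just two measurements and a linear solve. A sketch with made-up placeholder numbers (it assumes the DRAM-side part of the latency stays the same between the two runs, and that the DF clock tracks MEMCLK, i.e. half the MT/s rate):

/* Two-point extrapolation of memory latency vs. data-fabric clock.
   Model: latency = dram_part + df_cycles / f_df, with df_cycles fixed.
   All numbers below are placeholders, not measurements. */
#include <stdio.h>

int main(void)
{
    double f1 = 533e6,  lat1 = 110e-9;   /* 1066 MT/s: DF at 533 MHz, 110 ns */
    double f2 = 1066e6, lat2 =  95e-9;   /* 2133 MT/s: DF at 1066 MHz, 95 ns */

    double df_cycles = (lat1 - lat2) / (1.0 / f1 - 1.0 / f2);
    double dram_part = lat1 - df_cycles / f1;

    /* Project a "1:1" DF clock, here taken as the 2133 MT/s transfer rate;
       substitute a core clock if that is the ratio you mean. */
    double f_target = 2133e6;
    printf("DF cycles: %.1f, DRAM part: %.1f ns\n", df_cycles, dram_part * 1e9);
    printf("projected latency at %.0f MHz DF: %.1f ns\n",
           f_target / 1e6, (dram_part + df_cycles / f_target) * 1e9);
    return 0;
}

With those placeholder numbers it works out to about 16 DF cycles and roughly 87 ns at a 2133 MHz DF clock, purely an illustration of the method, not a prediction.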
 

mat9v

Member
Mar 17, 2017
25
0
66
You could make that easier by setting up two processor groups using bcdedit. You could then assign everything from the system to group 1, and the game to group 0. Most applications are not aware of processor groups and couldn't use the other group on their own.

I don't think that would work, because the Windows scheduler would place all those processes on CCX0 or CCX1 as it sees fit, according to how busy the cores are.
If I bind all those less important processes to cores on CCX1, it leaves CCX0 less busy and it is more probable that my game will run on cores from CCX0.
Not certain, of course, because once the game starts, CCX0 will be much busier.
It is easier than introducing the problems that come with NUMA (games not spanning NUMA processor groups, for example) and more likely to work the way we want.
By using groups in bcdedit you did mean forcing NUMA, right?
 

iBoMbY

Member
Nov 23, 2016
175
103
86
By using groups in bcdedit you did mean forcing NUMA, right?

I mean forcing Processor Groups, which is the only option Microsoft offers. Processor Groups are a higher order than NUMA nodes, and NUMA nodes are implicitly forced by this, because NUMA nodes cannot span two Processor Groups. If you have more than one Processor Group, the Task Manager gets a new Affinity setting for Processor Groups, and a normal Process can only use Logical Cores assigned to its Processor Group.
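
For anyone who wants to experiment: the relevant boot option is along the lines of bcdedit /set groupsize 8 (it is intended for testing multi-group driver support, so treat it as a test knob), and a group-aware program can inspect the groups and move its own threads across them. A rough sketch, assuming the documented Windows 7+ group APIs behave as described:

/* List processor groups and pin the calling thread to group 0.
   Sketch only; a group-unaware application never does this, which is
   exactly why it stays confined to its assigned group. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    WORD groups = GetActiveProcessorGroupCount();
    printf("active processor groups: %u\n", (unsigned)groups);
    for (WORD g = 0; g < groups; g++)
        printf("  group %u: %lu logical processors\n",
               (unsigned)g, GetActiveProcessorCount(g));

    DWORD n = GetActiveProcessorCount(0);
    GROUP_AFFINITY ga = { 0 };
    ga.Group = 0;
    ga.Mask  = (n >= 64) ? ~(KAFFINITY)0 : (((KAFFINITY)1 << n) - 1);
    if (!SetThreadGroupAffinity(GetCurrentThread(), &ga, NULL))
        printf("SetThreadGroupAffinity failed: %lu\n", GetLastError());
    return 0;
}

A group-unaware game started in group 0 would then only ever see that group's logical cores, which is the whole point of the trick.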
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
You could make that easier by setting up two processor groups using bcdedit. You could then assign everything from the system to group 1, and the game to group 0. Most applications are not aware of processor groups and couldn't use the other group on their own.

This is broken on Windows with Ryzen. It simply disables half the cores.

I had some indication, however, that this is AMD's own doing - this is how they disable cores when you select how many to disable in Ryzen Master. A very dumb solution.

Ryzen Master, in fact, mostly works in very idiotic ways - though fairly safe ways. Uninstall it and throw it away.

I didn't get a chance to test without it installed to see if the problem goes away, but I know the affinity bugs are present without it.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Wahoo!

A little trip through the oven and I'm back in action!

EDIT: Scratch that, bad caps - worked because of an overnight bleed-down, but froze and stopped working in the same way after a few minutes.

RMA it is :-(
 

CatMerc

Golden Member
Jul 16, 2016
1,114
1,149
136
Wahoo!

A little trip through the oven and I'm back in action!
My ancestors would be pleased to hear humans have evolved to survive ovens. :D

On a more serious note, could you perhaps test this?
I do wonder what the effect of having the DF and RAM speed at 1:1 would be.
A way to extrapolate it would be to run memory at, say, 1066 MT/s, measure latency, and then run memory at 2133 MT/s and measure again.
If the number of cycles it takes for data to travel across the DF is identical at the higher frequency, then we can extrapolate what a 1:1 clock would do.

Sadly, I do not have a Ryzen sample, nor the know-how to build a program to test it.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
My ancestors would be pleased to hear humans have evolved to survive ovens. :D

On a more serious note, could you perhaps test this?

I'm a Texan, I can survive an oven just fine ;-)

Sadly, my early excitement over having remedied my problem was premature - the board is toast. As soon as the system is running for a good while, no matter what it is doing, it freezes and then won't respond to power on triggers at all. If I drain down the caps it will come back up.

It is rock solid until the moment it freezes - and I can hear some good coil whine from the board when that happens. Not going to chance running it any more - RMAing it with NewEgg.
 
  • Like
Reactions: Drazick and CatMerc

unseenmorbidity

Golden Member
Nov 27, 2016
1,395
967
96
Wahoo!

A little trip through the oven and I'm back in action!

EDIT: Scratch that, bad caps - worked because of an overnight bleed-down, but froze and stopped working in the same way after a few minutes.

RMA it is :-(
What did you blow up?
 

CatMerc

Golden Member
Jul 16, 2016
1,114
1,149
136
I'm a Texan, I can survive an oven just fine ;-)

Sadly, my early excitement over having remedied my problem was premature - the board is toast. As soon as the system is running for a good while, no matter what it is doing, it freezes and then won't respond to power on triggers at all. If I drain down the caps it will come back up.

It is rock solid until the moment it freezes - and I can hear some good coil whine from the board when that happens. Not going to chance running it any more - RMAing it with NewEgg.
And that, ladies and gentlemen, is a technological rooster block.
 
  • Like
Reactions: looncraz

looncraz

Senior member
Sep 12, 2011
722
1,651
136
What did you blow up?

I didn't "blow up" anything; it appears the standby power that should actually be going to the CPU is not present. If I knew the pinout for the CPU I could verify that... and possibly would just run a bodge wire (I'm old-school).

The magic smoke, as they say, is still contained within the board. But it takes a good few hours of no power going to the board before it will even attempt to power on.

This is the second time I've had this nearly exact type of failure with a board in the last couple of years - I wish I could point to something I'm doing to cause it; I'd not hesitate to buy another board and chalk it up to a learning experience. I'd have the replacement board in just a couple of days - now I have to wait a week+.
 
  • Like
Reactions: Drazick

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
This really isn't a Microsoft issue. This is an evolutionary issue - multi-core CPUs weren't really a thing until 2005. Microsoft added GetLogicalProcessorInformation (a terrible API, but useful) in XP SP3 as well as improving threading throughout its own products. They couldn't really do much more.

There's no real way for a kernel scheduler to fully accommodate Ryzen's design without the application being involved or without creating per-application profiles... it can help out, but it will never do better than an application developer's own optimizations.

I don't think it is entirely a case of MS not being able to make changes, but rather that they chose to make them in power management instead of the scheduler. Large ISVs were already rolling their own optimizations on top of a more predictable, lightweight scheduler. A statistical scheduler that builds app profiles, or a neural-net-based scheduler, would be more time-consuming and would force developers of system software to rewrite their code (which would cause a lot of backlash). I'm sure embedded developers don't want a smart scheduler at all. It will be interesting to see what direction Linux takes.

Because of these 'golden handcuffs', MS is stuck and had to add advanced processor features somewhere else. It does suck, though. AMD will have to fix these issues in hardware. Seems like Intel wasn't too dumb when they went for larger monolithic structures. For the best performance, AMD may have to do the same (except for larger-core-count server processors, where even Intel needed to split into multi-core clusters connected by high-speed data channels).
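
As a side note on GetLogicalProcessorInformation: the newer Ex variant already exposes the cache-sharing topology a CCX-aware scheduler or game engine would need; it just isn't acted on by default. A minimal enumeration sketch (on a Ryzen 7 it should report two separate 8 MB L3 slices, one per CCX):

/* Enumerate L3 caches and which logical processors share each one. */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    DWORD len = 0;
    GetLogicalProcessorInformationEx(RelationCache, NULL, &len);   /* query size */
    char *buf = malloc(len);
    if (!buf || !GetLogicalProcessorInformationEx(
            RelationCache, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buf, &len)) {
        fprintf(stderr, "GetLogicalProcessorInformationEx failed: %lu\n", GetLastError());
        return 1;
    }
    for (char *p = buf; p < buf + len; ) {
        PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX info =
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)p;
        if (info->Cache.Level == 3)
            printf("L3: %lu KB, shared by logical CPU mask 0x%llx (group %u)\n",
                   info->Cache.CacheSize / 1024,
                   (unsigned long long)info->Cache.GroupMask.Mask,
                   (unsigned)info->Cache.GroupMask.Group);
        p += info->Size;
    }
    free(buf);
    return 0;
}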
 
  • Like
Reactions: redwoodz

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
Uhhh, I wonder why they went with it.

Most likely cost and TTM. Now AMD has a structure, the CCX, which can be used across its whole product line with a lot of reuse and just two dies (Raven Ridge and Summit Ridge).
 

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
Most likely cost and TTM. Now AMD has a structure, the CCX, which can be used across its whole product line with a lot of reuse and just two dies (Raven Ridge and Summit Ridge).
I meant: why did AMD go with such an abuse of Windows' behavior?
 

coffeemonster

Senior member
Apr 18, 2015
241
86
101
So I have noticed a performance-loss issue with my system after it has been running for more than a day or two. In single-thread-heavy games, core 0 sits at 90+% with less load spread across the other cores. I lose around 12-20 FPS depending on the game/location. The palemoon browser lags noticeably too. Restarting alleviates the issue, and core activity goes back to being more evenly distributed, with the highest core at 75-80% in the same games.
The only setting in UEFI I have changed from default is SMT disabled.
 
  • Like
Reactions: Kromaatikse

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
I don't see how a 16C32T chip solves the inter-CCX data fabric latency issue from a purely HW design standpoint, unless it's a monolithic block like Intel. Unless they aren't using the 4-core CCX blocks?

How does a silicon revision solve what is fundamentally an interconnectivity issue?
Intel doesn't have a monolithic block with some data hyperloop between each core and the IMC either. There is a ring bus.

I'd wait for further data points before putting all the blame on the data fabric latency. On Reddit I saw similar comments, suggesting the CCX arrangement would be comparable to Intel's old separate quad-core dies connected via FSB. It's actually not that bad. And the ring-bus-based designs also don't have direct connections between each core and the memory controller. In all those cases, the access requests and returning data (or store addresses and data) have to pass one or more hops/ring-bus stops to get to the UMC/IMC, and again on the way back to the core.

That's what I assume to be happening:
CCX mem access:
Core -> check L1 tags (LSU) -> check L2 tags (L2 IF) -> check L3 tags (L3 IF/CCX XBar) -> [clock domain crossing] send request to DF (router or XBar?) -> (1+ hops?) -> UMC -> access DRAMs -> data received at UMC -> transmit 64B line to DF (2 cycles) -> (1+ hops) -> receive data at CCX [clock domain crossing] -> move data to requesting core.
Intel ring bus mem access:
Core -> check L1 tags (LSU) -> check L2 tags (L2 IF) -> check local L3 tags (L3 IF) -> [clock domain crossing to the core-clock-like ring bus clock] send request via ring bus -> (1 to n hops) -> IMC -> access DRAMs -> data received at IMC -> transmit 64B line onto the ring bus -> (1 to n hops) -> receive data at core

So the UMC accesses via DF should add at least one hop (no direct connection), or 0.5 to 0.9 ns per direction (address, then data) depending on DF clock.
On Intel's 8C ring bus SoCs the avg. distance should be 2.5 hops (1 to 4 hops per 4 core half), but at clocks as high as core clocks -> 0.6 to 0.8 ns per direction.
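
To sanity-check those per-direction figures, the arithmetic is just hops divided by the clock. A tiny sketch (the clock ranges are my assumptions: DF at a 1066-1866 MHz MEMCLK, the ring bus near a 3-4 GHz core clock):

/* Per-direction hop latency = hops / clock; clock ranges are assumptions. */
#include <stdio.h>

static void hop_time(const char *name, double hops, double f_lo, double f_hi)
{
    printf("%-18s %.2f to %.2f ns per direction\n",
           name, hops / f_hi * 1e9, hops / f_lo * 1e9);
}

int main(void)
{
    hop_time("DF, 1 extra hop:", 1.0, 1.066e9, 1.866e9);   /* ~0.5-0.9 ns */
    hop_time("ring, ~2.5 stops:", 2.5, 3.0e9, 4.0e9);       /* ~0.6-0.8 ns */
    return 0;
}

Which lands at roughly 0.5-0.9 ns for one DF hop and 0.6-0.8 ns for ~2.5 ring stops, matching the numbers above.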
 