Ryzen: Strictly technical


iBoMbY

Member
Nov 23, 2016
175
103
86
We could use Process Lasso to bind all those background tasks and programs that are normally running to cores on CCX1 instead of CCX0, giving the game threads a much better chance of remaining on CCX0 where they belong, because the Windows scheduler will see the cores on CCX1 running more tasks and will schedule the game onto CCX0. It would only help, it is not a full solution, but I think it is worth trying.

You could make that easier by setting up two processor groups using bcdedit. You could then assign everything from the system to group 1, and the game to group 0. Most applications are not aware of processor groups and couldn't use the other group on their own.
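
For the Process Lasso-style approach in the quoted post, the same thing can also be scripted against the plain Win32 affinity API. A minimal sketch, assuming (without any guarantee) that logical processors 8-15 map to CCX1 on an 8C/16T part with SMT enabled:

/* pinccx: bind an already-running process to the logical CPUs of one CCX.
   The 0xFF00 mask (logical CPUs 8-15 = assumed CCX1) is an assumption,
   not something the OS reports here. Usage: pinccx <pid> */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: pinccx <pid>\n"); return 1; }
    DWORD pid = (DWORD)strtoul(argv[1], NULL, 10);

    HANDLE h = OpenProcess(PROCESS_SET_INFORMATION | PROCESS_QUERY_INFORMATION,
                           FALSE, pid);
    if (!h) { fprintf(stderr, "OpenProcess failed: %lu\n", GetLastError()); return 1; }

    DWORD_PTR ccx1 = 0xFF00;                /* bits 8..15 -> logical CPUs 8-15 */
    if (!SetProcessAffinityMask(h, ccx1))
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n", GetLastError());

    CloseHandle(h);
    return 0;
}

Run it against the PIDs of the background stuff (e.g. pinccx 4312, with 4312 standing in for whatever PID you want to move), and the scheduler should then prefer the now-idle CCX0 for the game, as described above.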
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
While I agree with what you're saying, the peculiarities of Ryzen 7's latency issues are not fully understood yet. Going back to the Hardware.fr investigation, they speculate:

The cache issues are even more telling: they designed a benchmark that tests different sizes sequentially, and the results for sizes >= 6 MB are particularly problematic. In this case, they say, it is because sequential access disables the other CCX, and in this benchmark the chip behaves as if it only has 8 MB of L3.
We will surely learn more over time; in particular, I'm going to do some microbenchmarking.

The 2nd L3 is of no help. As a victim cache it would only hold evicted data from cores of the 2nd CCX.
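
For anyone curious what such a microbenchmark looks like, here is a minimal pointer-chase sketch (illustrative only; the 16 MB working set and iteration count are arbitrary, and on Windows clock_gettime would be swapped for QueryPerformanceCounter):

/* Pointer-chase latency sketch: a chain of dependent loads over a random
   permutation, so prefetchers cannot hide the miss latency. A 16 MB
   working set exceeds one CCX's 8 MB L3 and spills into DRAM. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (16u * 1024 * 1024 / sizeof(void *))      /* ~2M pointers = 16 MB */

static unsigned long long s = 88172645463325252ULL; /* xorshift64 state */
static size_t xrand(void) { s ^= s << 13; s ^= s >> 7; s ^= s << 17; return (size_t)s; }

int main(void)
{
    void **buf = malloc(N * sizeof(void *));
    size_t *idx = malloc(N * sizeof(size_t));
    if (!buf || !idx) return 1;

    for (size_t i = 0; i < N; i++) idx[i] = i;
    for (size_t i = N - 1; i > 0; i--) {             /* Fisher-Yates shuffle */
        size_t j = xrand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < N; i++)                   /* one big random cycle */
        buf[idx[i]] = &buf[idx[(i + 1) % N]];

    struct timespec t0, t1;
    void **p = &buf[idx[0]];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < 8 * N; i++)               /* dependent loads */
        p = (void **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    /* printing p keeps the chain from being optimized away */
    printf("avg load-to-use latency: %.1f ns (%p)\n", ns / (8.0 * N), (void *)p);
    free(idx); free(buf);
    return 0;
}

Pinning it to a single core (e.g. with start /affinity) versus letting it wander between CCXs should already show whether thread migration is part of the picture.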
 
  • Like
Reactions: looncraz

CatMerc

Golden Member
Jul 16, 2016
1,114
1,149
136
I do wonder what the effect of having the DF and RAM speed at 1:1 would be.
A way to extrapolate it would be to run memory at, say, 1066 MT/s, measure latency, and then run memory at 2133 MT/s and measure again.
If the number of cycles it takes for data to travel across the DF is identical at the higher frequency, then we can extrapolate what a 1:1 clock would do.

Sadly, I do not have a Ryzen sample, nor the know-how to build a program to test it.
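
For whoever does have a sample: the extrapolation itself is just two measurements and a linear solve. A sketch with made-up placeholder numbers (it assumes the DRAM-side part of the latency stays the same between the two runs, and that the DF clock tracks MEMCLK, i.e. half the MT/s rate):

/* Two-point extrapolation of memory latency vs. data-fabric clock.
   Model: latency = dram_part + df_cycles / f_df, with df_cycles fixed.
   All numbers below are placeholders, not measurements. */
#include <stdio.h>

int main(void)
{
    double f1 = 533e6,  lat1 = 110e-9;   /* 1066 MT/s: DF at 533 MHz, 110 ns */
    double f2 = 1066e6, lat2 =  95e-9;   /* 2133 MT/s: DF at 1066 MHz, 95 ns */

    double df_cycles = (lat1 - lat2) / (1.0 / f1 - 1.0 / f2);
    double dram_part = lat1 - df_cycles / f1;

    /* Project a "1:1" DF clock, here taken as the 2133 MT/s transfer rate;
       substitute a core clock if that is the ratio you mean. */
    double f_target = 2133e6;
    printf("DF cycles: %.1f, DRAM part: %.1f ns\n", df_cycles, dram_part * 1e9);
    printf("projected latency at %.0f MHz DF: %.1f ns\n",
           f_target / 1e6, (dram_part + df_cycles / f_target) * 1e9);
    return 0;
}

With those placeholder numbers it works out to about 16 DF cycles and roughly 87 ns at a 2133 MHz DF clock, purely an illustration of the method, not a prediction.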
 

mat9v

Member
Mar 17, 2017
25
0
66
You could make that easier by setting up two processor groups using bcdedit. You could then assign everything from the system to group 1, and the game to group 0. Most applications are not aware of processor groups and couldn't use the other group on their own.

I don't think that would work, because the Windows scheduler would place all those processes on CCX0 or CCX1 as it sees fit, according to how busy the cores are.
If I bind all those less important processes to cores on CCX1, it leaves CCX0 less busy and it is more probable that my game will run on cores from CCX0.
Not certain, of course, because once the game starts, CCX0 will be much busier.
It is easier than introducing the problems that come with NUMA (games not spanning NUMA processor groups, for example) and more likely to work the way we want.
By using groups in bcdedit you did mean forcing NUMA, right?
 

iBoMbY

Member
Nov 23, 2016
175
103
86
By using groups in bcdedit you did mean forcing NUMA, right?

I mean forcing Processor Groups, which is the only option Microsoft offers. Processor Groups are a higher order than NUMA nodes, and NUMA nodes are implicitly forced by this, because NUMA nodes cannot span two Processor Groups. If you have more than one Processor Group, the Task Manager gets a new Affinity setting for Processor Groups, and a normal Process can only use Logical Cores assigned to its Processor Group.
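
For anyone who wants to experiment: the relevant boot option is along the lines of bcdedit /set groupsize 8 (it is intended for testing multi-group driver support, so treat it as a test knob), and a group-aware program can inspect the groups and move its own threads across them. A rough sketch, assuming the documented Windows 7+ group APIs behave as described:

/* List processor groups and pin the calling thread to group 0.
   Sketch only; a group-unaware application never does this, which is
   exactly why it stays confined to its assigned group. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    WORD groups = GetActiveProcessorGroupCount();
    printf("active processor groups: %u\n", (unsigned)groups);
    for (WORD g = 0; g < groups; g++)
        printf("  group %u: %lu logical processors\n",
               (unsigned)g, GetActiveProcessorCount(g));

    DWORD n = GetActiveProcessorCount(0);
    GROUP_AFFINITY ga = { 0 };
    ga.Group = 0;
    ga.Mask  = (n >= 64) ? ~(KAFFINITY)0 : (((KAFFINITY)1 << n) - 1);
    if (!SetThreadGroupAffinity(GetCurrentThread(), &ga, NULL))
        printf("SetThreadGroupAffinity failed: %lu\n", GetLastError());
    return 0;
}

A group-unaware game started in group 0 would then only ever see that group's logical cores, which is the whole point of the trick.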
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
You could make that easier by setting up two processor groups using bcdedit. You could then assign everything from the system to group 1, and the game to group 0. Most applications are not aware of processor groups and couldn't use the other group on their own.

This is broken on Windows with Ryzen. It simply disables half the cores.

I had some indication, however, that this is AMD's own doing - this is how they disable cores when you select how many to disable in Ryzen Master. A very dumb solution.

Ryzen Master, in fact, mostly works in very idiotic ways - though fairly safe ways. Uninstall it and throw it away.

I didn't get a chance to test without it installed to see if the problem goes away, but I know the affinity bugs are present without it.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Wahoo!

A little trip through the oven and I'm back in action!

EDIT: Scratch that, bad caps - worked because of an overnight bleed-down, but froze and stopped working in the same way after a few minutes.

RMA it is :-(
 

CatMerc

Golden Member
Jul 16, 2016
1,114
1,149
136
Wahoo!

A little trip through the oven and I'm back in action!
My ancestors would be pleased to hear humans have evolved to survive ovens. :D

On a more serious note, could you perhaps test this?
I do wonder what the effect of having the DF and RAM speed at 1:1 would be.
A way to extrapolate it would be to run memory at, say, 1066 MT/s, measure latency, and then run memory at 2133 MT/s and measure again.
If the number of cycles it takes for data to travel across the DF is identical at the higher frequency, then we can extrapolate what a 1:1 clock would do.

Sadly, I do not have a Ryzen sample, nor the know-how to build a program to test it.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
My ancestors would be pleased to hear humans have evolved to survive ovens. :D

On a more serious note, could you perhaps test this?

I'm a Texan, I can survive an oven just fine ;-)

Sadly, my early excitement over having remedied my problem was premature - the board is toast. As soon as the system is running for a good while, no matter what it is doing, it freezes and then won't respond to power on triggers at all. If I drain down the caps it will come back up.

It is rock solid until the moment it freezes - and I can hear some good coil whine from the board when that happens. Not going to chance running it any more - RMAing it with NewEgg.
 
  • Like
Reactions: Drazick and CatMerc

unseenmorbidity

Golden Member
Nov 27, 2016
1,395
967
96
Wahoo!

A little trip through the oven and I'm back in action!

EDIT: Scratch that, bad caps - worked because of an overnight bleed-down, but froze and stopped working in the same way after a few minutes.

RMA it is :-(
What did you blow up?
 

CatMerc

Golden Member
Jul 16, 2016
1,114
1,149
136
I'm a Texan, I can survive an oven just fine ;-)

Sadly, my early excitement over having remedied my problem was premature - the board is toast. As soon as the system is running for a good while, no matter what it is doing, it freezes and then won't respond to power on triggers at all. If I drain down the caps it will come back up.

It is rock solid until the moment it freezes - and I can hear some good coil whine from the board when that happens. Not going to chance running it any more - RMAing it with NewEgg.
And that, ladies and gentlemen, is a technological rooster block.
 
  • Like
Reactions: looncraz

looncraz

Senior member
Sep 12, 2011
722
1,651
136
What did you blow up?

I didn't "blow up" anything; it appears the standby power that should actually be going to the CPU is not present. If I knew the pinout for the CPU I could verify that... and possibly would just run a bodge wire (I'm old-school).

The magic smoke, as they say, is still contained within the board. But it takes a good few hours of no power going to the board before it will even attempt to power on.

This is the second time I've had this nearly exact type of failure with a board in the last couple of years - I wish I could point to something I'm doing to cause it; I'd not hesitate to buy another board and chalk it up to a learning experience. I'd have the replacement board in just a couple of days - now I have to wait a week+.
 
  • Like
Reactions: Drazick

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
This really isn't a Microsoft issue. This is an evolutionary issue - multi-core CPUs weren't really a thing until 2005. Microsoft added GetLogicalProcessorInformation (a terrible API, but useful) in XP SP3 as well as improving threading throughout its own products. They couldn't really do much more.

There's no real way for a kernel scheduler to fully accommodate Ryzen's design without the application being involved or without creating per-application profiles... it can help out, but it will never do better than an application developer's own optimizations.

I don't think it is entirely a case of MS not being able to make changes, but rather that they chose to make them in power management instead of the scheduler. Large ISVs were already rolling their own optimizations on top of a more predictable, lightweight scheduler. A statistical scheduler that builds app profiles, or a neural-net-based scheduler, would be more time-consuming and would force developers of system software to rewrite their code (which would cause a lot of backlash). I'm sure embedded developers don't want a smart scheduler at all. It will be interesting to see what direction Linux takes.

Because of these 'golden handcuffs', MS is stuck and had to add advanced processor features somewhere else. It does suck, though. AMD will have to fix these issues in hardware. Seems like Intel wasn't too dumb when they went for larger monolithic structures. For the best performance, AMD may have to do the same (except for larger-core-count server processors, where even Intel needed to split into multi-core clusters connected by high-speed data channels).
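
As a side note on GetLogicalProcessorInformation: the newer Ex variant already exposes the cache-sharing topology a CCX-aware scheduler or game engine would need; it just isn't acted on by default. A minimal enumeration sketch (on a Ryzen 7 it should report two separate 8 MB L3 slices, one per CCX):

/* Enumerate L3 caches and which logical processors share each one. */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    DWORD len = 0;
    GetLogicalProcessorInformationEx(RelationCache, NULL, &len);   /* query size */
    char *buf = malloc(len);
    if (!buf || !GetLogicalProcessorInformationEx(
            RelationCache, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buf, &len)) {
        fprintf(stderr, "GetLogicalProcessorInformationEx failed: %lu\n", GetLastError());
        return 1;
    }
    for (char *p = buf; p < buf + len; ) {
        PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX info =
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)p;
        if (info->Cache.Level == 3)
            printf("L3: %lu KB, shared by logical CPU mask 0x%llx (group %u)\n",
                   info->Cache.CacheSize / 1024,
                   (unsigned long long)info->Cache.GroupMask.Mask,
                   (unsigned)info->Cache.GroupMask.Group);
        p += info->Size;
    }
    free(buf);
    return 0;
}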
 
  • Like
Reactions: redwoodz

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
Uhhh, I wonder why they went with it.

Most likely cost and TTM. Now AMD has a structure, the CCX, which can be used across its whole product line with a lot of reuse and just two dies (Raven Ridge and Summit Ridge).
 

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
Most likely cost and TTM. Now AMD has a structure, the CCX, which can be used across its whole product line with a lot of reuse and just two dies (Raven Ridge and Summit Ridge).
I meant: why did AMD go with such an abuse of Windows' behavior?
 

coffeemonster

Senior member
Apr 18, 2015
241
86
101
So I have noticed a performance-loss issue with my system after it has been running for more than a day or two. In single-thread-heavy games, core 0 sits at 90+% with less load spread across the other cores. I lose around 12-20 FPS depending on the game/location. The palemoon browser lags noticeably too. Restarting alleviates the issue, and core activity goes back to being more evenly distributed, with the highest core at 75-80% in the same games.
The only setting in UEFI I have changed from default is SMT disabled.
 
  • Like
Reactions: Kromaatikse

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
I don't see how a 16C32T chip solves the inter-CCX data fabric latency issue from a purely HW design standpoint, unless it's a monolithic block like Intel. Unless they aren't using the 4-core CCX blocks?

How does a silicon revision solve what is fundamentally an interconnectivity issue?
Intel doesn't have a monolithic block with some data hyperloop between each core and the IMC either. There is a ring bus.

I'd wait for further data points before putting all the blame on the data fabric latency. On Reddit I saw similar comments, suggesting the CCX arrangement would be comparable to Intel's old separate quad-core dies connected via FSB. It's actually not that bad. And the ring-bus-based designs also don't have direct connections between each core and the memory controller. In all those cases, the access requests and returning data (or store addresses and data) have to pass one or more hops/ring-bus stops to get to the UMC/IMC, and again on the way back to the core.

That's what I assume to be happening:
CCX mem access:
Core -> check L1 tags (LSU) -> check L2 tags (L2 IF) -> check L3 tags (L3 IF/CCX XBar) -> [clock domain crossing] send request to DF (router or XBar?) -> (1+ hops?) -> UMC -> access DRAMs -> data received at UMC -> transmit 64B line to DF (2 cycles) -> (1+ hops) -> receive data at CCX [clock domain crossing] -> move data to requesting core.
Intel ring bus mem access:
Core -> check L1 tags (LSU) -> check L2 tags (L2 IF) -> check local L3 tags (L3 IF) -> [clock domain crossing to the core-clock-like ring bus clock] send request via ring bus -> (1 to n hops) -> IMC -> access DRAMs -> data received at IMC -> transmit 64B line onto the ring bus -> (1 to n hops) -> receive data at core

So the UMC accesses via DF should add at least one hop (no direct connection), or 0.5 to 0.9 ns per direction (address, then data) depending on DF clock.
On Intel's 8C ring bus SoCs the avg. distance should be 2.5 hops (1 to 4 hops per 4 core half), but at clocks as high as core clocks -> 0.6 to 0.8 ns per direction.
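
To sanity-check those per-direction figures, the arithmetic is just hops divided by the clock. A tiny sketch (the clock ranges are my assumptions: DF at a 1066-1866 MHz MEMCLK, the ring bus near a 3-4 GHz core clock):

/* Per-direction hop latency = hops / clock; clock ranges are assumptions. */
#include <stdio.h>

static void hop_time(const char *name, double hops, double f_lo, double f_hi)
{
    printf("%-18s %.2f to %.2f ns per direction\n",
           name, hops / f_hi * 1e9, hops / f_lo * 1e9);
}

int main(void)
{
    hop_time("DF, 1 extra hop:", 1.0, 1.066e9, 1.866e9);   /* ~0.5-0.9 ns */
    hop_time("ring, ~2.5 stops:", 2.5, 3.0e9, 4.0e9);       /* ~0.6-0.8 ns */
    return 0;
}

Which lands at roughly 0.5-0.9 ns for one DF hop and 0.6-0.8 ns for ~2.5 ring stops, matching the numbers above.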
 