Ryzen: Strictly technical

powerrush · Mar 12, 2017

Group Size = 8 plus Core parking won't work. Just test it, if you make groupsize = 8 and parking min cores 50% then half of the threads of each CCX will be stopped until some workload will require additional threads from each CCX. 4 threads from the first CCX will be unparked , and 4 threads from the second CCX will be unparked at the same time. So groupsize is not the way to go. Sorry for my bad english.

powerrush · Mar 12, 2017

I think that the best way is 50% min cores unparked, and 50% parked (until workloads requires unpark more cores) and core overutilization threshold at 100% with high performance energy plan modified....

Bacon1 · Mar 13, 2017

malventano said:
With the addition of LFC to FreeSync panels that have a sufficient FPS range to support it, the playing field is mostly equal. The only real difference I've seen anymore is that most of the FreeSync panels still don't get overdrive as good as it could be (example 1 2 3), particularly when operating in the VRR range. G-Sync still does a better job at overdrive regardless of refresh rate. So long as you are ok with possibly imperfect overdrive, and you ensure the panel supports LFC, there's no longer a reason to have to jump ship on your GPU to get the panel that does what you want it to.

I guess the answer to your question is that there isn't a big reason to repeat the test, unless you can think of something more specific you were hoping to see from such a test?

Thank you for the reply

Well because no one has tested LFC as far as I know, and anyone viewing your old videos would never know it existed. I mean even most of the articles you linked were prior to LFC, and even in the latest (Dec 2015) it said this:

. This is a big change for AMD and for its monitor partners as it means that FreeSync is now very close to matching the quality and experience that you get from G-Sync.

What is still missing? What isn't the same quality?

Why did that monitor only get silver?

malventano · Mar 13, 2017

Bacon1 said:
Thank you for the reply

What is still missing? What isn't the same quality?

Why did that monitor only get silver?

Overdrive is not as 'tuned' to a particular panel on FreeSync as it is on G-Sync because the scaler/TCON alone has relatively simple logic as compared to a G-Sync module.

Ajay · Mar 13, 2017

Geez, can we take the Graphics discussion to VC&G : https://forums.anandtech.com/forums/video-cards-and-graphics.8/

Kromaatikse · Mar 13, 2017

deadhand said:
https://forums.anandtech.com/threads/ryzen-strictly-technical.2500572/page-11#post-38776963

There is a fairly severe latency bottleneck when accessing L2 caches on the opposite quad core module on the PS4. On the PS4 die there is a fairly large physical gap between the quad core modules.
Or do you mean in terms of programming for the PS4?

To rephrase: the PS4 has two major advantages with respect to using its hardware efficiently. It is a fixed target which can be specifically optimised for, including accounting for topology quirks such as quad-grouped LLCs. And it doesn't run Windows, so game devs aren't constantly swimming upstream when trying to make things work efficiently.

Of course, the hardware is much more limited than a gaming PC overall.

looncraz · Mar 13, 2017

malventano said:
You mean without LFC? It's painfully noticable once you dip below the lower VRR limit of the panel. Not as much of an issue if that number is as low as 30 Hz, but some panels have bottom ends that are far higher than that.

Yes, but that's not a FreeSync issue... that's just a panel choice issue.

FreeSync (technically, Adaptive Sync) is just a standard by which the video card can communicate to the monitor to trigger refreshes. GSync is little more than that, just within its very narrow, and expensive, ecosystem.

Ghosting isn't an inherent issue, neither are there framerate or performance limitations worth mentioning.

Companies aren't going to put a low quality panel with GSync - that'd be suicide.. why pay nVidia $80 for their logic board to use in your monitor if you're going to pair it with a cheap panel?

Adaptive sync is, literally, free to implement. Nothing more than setting the framerate tables should be all that's needed. One tech, twenty minutes or less, done. Bonus feature value-add. You can put that on any of your monitors - all of them, in fact, provided the panel has enough of a frequency range... heck, just 30~60FPS can be enough for most people.

iBoMbY · Mar 13, 2017

malventano said:
groupsize tweaks work as expected under Win 10, but do note that any given (non NUMA-aware) application will be restricted to half of the total logical cores (one NUMA node).

The problem is the "Processor group" split, not the "NUMA group" split. If you could create two NUMA groups, without creating two Processor groups, it might give a better result, but it doesn't look like Microsoft is offering that option.

Kromaatikse · Mar 13, 2017

I was just reminded of the big difference between Windows' process scheduler and a sane one, which fully explains why it migrates threads so often.

For reference, this easily findable book chapter explains how several different types of multiprocessor scheduler work. Pay particular attention to the "work stealing" balancing algorithm; it runs on an idle or lightly-loaded CPU, and looks for CPUs with greater load than itself. An alternative approach is for a heavily-loaded CPU to look for CPUs with *less* load than itself, in order to *give* them some of its excess work - this works better in cases where idle CPUs are not periodically woken (which is more power efficient).

Whichever approach Windows uses, it constantly attempts to move threads to less-loaded CPUs - even when it is the *only* runnable thread on its original CPU - and it counts the thread's own past load against its current CPU. This is inhibited only by the parking and affinity masks (which are clearly bolted-on afterthoughts), and makes no allowance whatsoever for SMT, NUMA, cache affinity, or the cost of context switches. The book chapter I linked doesn't mention SMT or NUMA (it may be a relatively old book, in which those concepts were not yet widespread), but it *does* talk about the other two factors as being key for efficiency.

This *should* be very easy for Microsoft to fix, if they can be bothered. Simply make any thread meeting all of the following criteria ineligible for migration:

It is the only thread currently in its CPU's run queue.
It currently satisfies its own affinity mask, if any.
Its CPU is not parked.
It shares the same LLC as all other threads in the same process.

This would make the precise behaviour of the core-parking algorithm much less important for enforcing short-term performance and efficiency goals. A useful additional parameter to the latter would then be an optimisation target, taking the following values:

Execution resources - the current behaviour, preferentially unparking just one thread per physical core.
Cache affinity - as above, but only within each LLC block. When all cores are unparked in one LLC, begin on the next.
Power efficiency - always unpark all virtual cores in the same physical core before proceeding to another physical core. Also unpark all physical cores in one LLC before proceeding to the next.

Well, we can dream.

Chl Pixo · Mar 13, 2017

All the various info I get about Ryzen points me to believe the biggest weakness is need support wide variety of RAM speeds.
Support needs to start from 800Mhz for 1600 modules which cripples IF to 22GB/s and horrible latency.
Highest I have seen was 1.6Ghz for 3200 modules which would put IF to 44GB/s and halves the latency.

Posts in this this thread mention clock multiplier for IF which could solve many problems.
Why its not in use may be related to max clock IF can operate on, multipliers it can use + marketing.
People would be pissed if their fast RAM would make IF slower with bigger latency and lower FPS in games.
Not supporting high speed RAM would make Ryzen look bad.

For servers/workstation its not a problem as max speed of RAM would be 1.2Ghz or whatever AMD says its max.
And they would pay for quadchannel to have good memory speed.
For them set the multiplier to 2 and have the go IF speed and latency.

I also saw die shot showing R7s having 32 pcie lanes. 8 could be disabled as to not create problems in IF with low speed RAMs.
Theoretical max 23.6GB/s for PCIe3 x24 is much closer to the lowest IF 22GB/s.
PCIe3 x32 hits 31.5GB/s and that would need 2133 modules min.

This is just my theorycrafting.
I will know for sure once the Naples comes out.

Kromaatikse · Mar 13, 2017

There are apparently some internal clock multipliers and other subtimings which are not yet fully available through AGESA - that is, the library supplied by AMD to BIOS vendors. It could well be that support may improve over time, allowing better control of DRAM and IF frequencies than at present.

bjt2 · Mar 13, 2017

Kromaatikse said:
There are apparently some internal clock multipliers and other subtimings which are not yet fully available through AGESA - that is, the library supplied by AMD to BIOS vendors. It could well be that support may improve over time, allowing better control of DRAM and IF frequencies than at present.

Some day ago, AFTER the release of Zen, there was not yet the BIOS and Kernel developer guide on the AMD website. I haven't checked lately, but i will not be surprised if it is not still there...
I think that they are finalizing AGESA code and when the BKDG will be published, we could read all the knobs available to BIOS and other routines...

CatMerc · Mar 13, 2017

https://twitter.com/BitsAndChipsEng/status/841101209821990912

Maybe they do know something?

unseenmorbidity · Mar 13, 2017

1800x CCX (4+0)vs CCX (2+2) + latency test -Hardware.fr hardware.fr

roybotnik · Mar 13, 2017

Kromaatikse said:
In that video, I noticed him calling all the even-numbered cores "physical" and the odd-numbered ones "virtual". This is a misconception of SMT.

Sorry, I do know how SMT works, this is just my first time recording a video like this so the words came out a little weird

. I tried both even/odd OS CPUs just to illustrate that it made no difference and to give a good baseline. I am planning to look at doing the same sort of thing with 3dmark's API overhead benchmark since the app I was using is DX9. Either way it shows what happens when some threads are on different CCXs though.

Kromaatikse · Mar 13, 2017

Well, I'm certainly looking forward to how the next set of BIOS updates and software patches improves stuff.

Pookums · Mar 13, 2017

unseenmorbidity said:
1800x CCX (4+0)vs CCX (2+2) + latency test -Hardware.fr hardware.fr

Does anyone know if its possible to limit L3 cache to 4MB for one Application? It looks like using more than 4MB quickly causes performance degradation and even with a CCX turned off it suffers. Ryzen is taking as long to utilize the last batch of L3 as it is to access the DRAM itself based on those findings, thus making it pointless to use for a single heirarchy. Better to jump right to DRAM instead.

Perhaps turning off a CCX and limiting it to 4MB L3 will improve performance even further than simply switching off a single CCX.

I've also noticed in previous Comprehensive reviews, that 3200mb DRAM was slower than 2667 in a number of gaming applications at the moment with identical CL timings. The bios issues, even with BCLK improvements seem to present another unknown regarding this aspect of Ryzen performance. If this is the case 2667 cl 12 might be the best performance memory for ryzen.

Is there anyone able to do all of this with a ryzen processor? Create a benchmark with 2+2 CCX and 4+0 CCX with moderate memory timings, and comparing it to 4+0 CCX with 2667 cl 12 memory and if possible limiting L3 access to 4MB for a single application(assuming this is possible).

If it works and shows significant improvement, It would give an idea of its maximum gaming performance with all cores once all issues are resolved.

Kromaatikse · Mar 13, 2017

Pookums said:
Does anyone know if its possible to limit L3 cache to 4MB for one Application? It looks like using more than 4MB quickly causes performance degradation and even with a CCX turned off it suffers. Ryzen is taking as long to utilize the last batch of L3 as it is to access the DRAM itself based on those findings, thus making it pointless to use for a single heirarchy. Better to jump right to DRAM instead.

No - and that isn't how it works. If you're seeing latency as long as DRAM access, that means you're accessing DRAM already. It might, for example, take a bit of pressure over time to completely fill the L3 with data and bring the latency down over a full 8MB working set. Reducing the L3 cache to 4MB (if there's a mechanism for that, which is by no means certain) would just move the contention point in to 3MB or 2MB, leaving you worse off overall.

What I think is actually happening in the French results is that some other process than the benchmark is also putting stuff in the L3 cache, and thus interfering with the test.

powerrush · Mar 13, 2017

Someone did the test? Groupsize wont work because Windows will use one CCX or another randomly, sometimes for a four thread load will assign two threads from one CCX and two threads from another CCX. So the best solution until Microsoft launches a official patch will be the using of core parking at 50% of minimum cores unparked, and 100% core overutilization threshold.

Kromaatikse · Mar 13, 2017

powerrush said:
core parking at 50% of minimum cores unparked

That won't work either, because it'll leave all eight physical cores unparked (across both CCXes) with one virtual core each, even with a 1T workload. Win10 actually does a tolerable job of 1T with its default core-parking settings.

What you need to do is assign CPU affinity on a per-process basis to critical processes (ie. games). It's possible to use Process Lasso or Prio to automate this to some degree. I imagine several of the bigger game devs are planning to assign CPU affinity themselves.

Whatever707 · Mar 13, 2017

I suggest you take a look at those 2 videos for further win7 vs win10 comparison data:

https://www.youtube.com/watch?v=U9DE83lMVio
https://www.youtube.com/watch?v=XAXS8rYwGzg

innociv · Mar 13, 2017

I've been watching this discussion for a long time but couldn't bother to log in. But people keep glossing over something over and over that I'm surprised no one has pointed out:

The reason why disabling SMT seems to improve performance in Windows10 almost surely has to do with increasing the affinity at which Windows10 thread scheduler will move threads around.

Games have very uneven loads on their threads compared to most workstation tasks that load up threads heavily from start to finish on a task. When you disable SMT, you keep the 8 threads loaded more heavily than 16 in many cases.
With SMT enabled, it seems the performance drops have to do with creating situations where the scheduler more often moves threads across the CCX rather than AMD's SMT being in any way inherently worse for gaming despite the big performance increases it gives in other tasks. It also moves threads off of a turbo'd core and it takes some ms for the turbo to ramp back up on the new core.

It is weird that AMD isn't calling out MS on this issue, though.
Yes, some developers do need to specifically optimize their games for Ryzen, but the baseline performance without such optimizations is worse than it should be.
A game that only runs 4-8 threads or even more should not necessarily need Ryzen optimizations if the scheduler would just run all those threads on one CCX and run other application(s) threads on the other.
The expectation that every developer needs to patch their games to manage threads to keep them all on one CCX if the threads are interdependent is a crazy expectation. What Windows should be doing by default is keeping an applications threads on the same CCX unless all 8 threads are overloaded or if the application specifically requests thread(s) to be managed on another CCX. Far more games will work like the former than the later!
Only in the case of heavily CPU intensive games that need more power should developers need to specifically optimize cross-CCX, which conversely is not something the Windows scheduler would need to handle.

Kromaatikse · Mar 13, 2017

In other words, your theory is that ordinary inter-thread communication across CCXes would be less expensive than constantly migrating entire threads between CCXes. There's probably some logic in that.

The fix would therefore be very simple on Microsoft's end - as I have suggested, to stop migrating threads unnecessarily.

looncraz · Mar 13, 2017

unseenmorbidity said:
1800x CCX (4+0)vs CCX (2+2) + latency test -Hardware.fr hardware.fr

I would like to see those tests with SMT disabled. I'm wondering if AMD is partitioning the L3 with SMT enabled, because it acts like there is only 4MB of L3...

imported_jjj · Mar 13, 2017

looncraz said:
I would like to see those tests with SMT disabled. I'm wondering if AMD is partitioning the L3 with SMT enabled, because it acts like there is only 4MB of L3...

It is disabled.
"Comme pour au dessus, nous réalisons les tests à 3 GHz, le SMT est désactivé pour limiter la variabilité."

3GHz and SMT disabled

Ryzen: Strictly technical

Junior Member

Junior Member

Diamond Member

Junior Member

Lifer

Member

Senior member

Member

Member

Junior Member

Member

Senior member

Golden Member

Golden Member

Junior Member

Member

Member

Member

Junior Member

Member

Junior Member

Member

Member

Senior member

Senior member