Ryzen: Strictly technical


Mockingbird

Senior member
Feb 12, 2017
733
741
106
A question though: would disabling a complete CCX actually help reduce the penalty? Could someone with a Ryzen chip please try running the gaming benchmarks with one CCX disabled and SMT disabled?

[Four benchmark charts attached as images; source linked below.]


http://www.pcgameshardware.de/Ryzen-7-1800X-CPU-265804/Tests/Test-Review-1222033/
 

roybotnik

Junior Member
Mar 4, 2017
6
9
36
I ran the draw call benchmark on my Ryzen 1800X (OC'd to 4GHz) and averaged 17 FPS. I've been messing with this system for hours through numerous reboots, and at some point I was getting 14, but I'm pretty sure I was at 4GHz then as well. The BIOS for the Asus Crosshair is really not in a good state. My voltages have been screwy, and I've given up on getting my 3200MHz RAM working past 2666MHz for now.

Link: https://forums.anandtech.com/thread...call-performance.2499609/page-2#post-38776423
 

R0H1T

Platinum Member
Jan 12, 2013
2,582
162
106
AMD had to know this, just based on the design. No easy way to fix it without sort of breaking Windows. I think Linux is smarter about threads and data locality - probably why it performs so well with Ryzen (I haven't looked at Linux scheduling in a long time, so AFAIK). Even within one CCX this will be an issue, since prefetches are to the private caches.

I've been using Process Lasso for years to deal with this issue in some high-performance programs (and to manipulate priority levels). It sounds like MS may be waiting for the April launch of Redstone 2 ("Creators Update") to fix this within the Windows scheduler when a Zeppelin CPU is detected. I would guess that the fix may already be in the latest Windows "Fast Ring" updates or will be in the next couple of weeks.
I see another Process Lasso fan. MS should really be ashamed of themselves, seeing how they can't seem to do half of what Bitsum achieved with their software. Then again, it's Windows (10) and it just works.
 

starheap

Junior Member
Mar 4, 2017
5
0
1
I ran the draw call benchmark on my Ryzen 1800X (OC'd to 4GHz) and averaged 17 FPS. I've been messing with this system for hours through numerous reboots, and at some point I was getting 14, but I'm pretty sure I was at 4GHz then as well. The BIOS for the Asus Crosshair is really not in a good state. My voltages have been screwy, and I've given up on getting my 3200MHz RAM working past 2666MHz for now.

Link: https://forums.anandtech.com/thread...call-performance.2499609/page-2#post-38776423
You are not the only one with... problems. I also have an 1800X; I can't get my memory past 2133MHz unless I do a convoluted process that involves swapping back and forth between two sets of RAM (then I can set the memory up with Ryzen Master to get 2666MHz). I have a Gigabyte Gaming 5...

I've had 3 sets in this system, all with similar results (one 4-stick kit I returned to Newegg because half of the DIMMs were defective due to bad packaging, confirmed with a Skylake system). I think I'm going to order a cheap B350 motherboard and a cheap sub-3000MHz 16GB RAM kit (something on the supported memory list). This way I'll be able to better narrow down the problem. I bought the CPU + board at Micro Center, so if it's my CPU or board I'll be exchanging them. If I have to exchange my CPU, I'll probably get a 1700 or 1700X...
 

Kromaatikse

Member
Mar 4, 2017
83
169
56
Certainly Linux is not perfect; it has a scheduler, too. Time will tell if some patches can be applied to inform the scheduler about Ryzen's new architecture.
Linux has a *very good* scheduler. It already avoids needlessly moving threads around the way Windows does, and it's already aware of a wide variety of topologies (NUMA and otherwise). The kernel devs responsible are almost certainly fine-tuning it for Ryzen as we speak, but it is already reasonable.
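
As an aside, the topology information the Linux scheduler (and tools like hwloc) relies on is exposed through sysfs, so it's easy to check which logical CPUs share an L3 and therefore sit on the same CCX. A minimal sketch, assuming the standard /sys/devices/system/cpu/.../cache/index3 layout (index3 is normally the L3); it only inspects the topology, it is not the scheduler itself:

```python
#!/usr/bin/env python3
# Sketch: list which logical CPUs share an L3 cache (one group per CCX on Zen),
# using the standard Linux sysfs topology files.
import glob
import os

def l3_groups():
    groups = {}
    for cpu_dir in glob.glob("/sys/devices/system/cpu/cpu[0-9]*"):
        shared = os.path.join(cpu_dir, "cache", "index3", "shared_cpu_list")
        if not os.path.exists(shared):      # no L3 info exposed for this CPU
            continue
        cpu = int(os.path.basename(cpu_dir)[3:])
        with open(shared) as f:
            key = f.read().strip()          # e.g. "0-7" for the first CCX
        groups.setdefault(key, []).append(cpu)
    return groups

if __name__ == "__main__":
    for shared_list, cpus in sorted(l3_groups().items()):
        print(f"CPUs sharing one L3 ({shared_list}): {sorted(cpus)}")
    # A latency-sensitive process could then pin itself to a single group, e.g.:
    # os.sched_setaffinity(0, {0, 1, 2, 3, 4, 5, 6, 7})
```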
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Hmm, all because of Windows' odd behavior vis-à-vis thread allocation. Windows NT, at least back to 3.5, switches threads between cores (then CPUs) according to some algorithm (it always looked random to me). I remember a spat on COMP.ARCH between some server dude and Dave Cutler over this on a dual-CPU system - Task Manager showed exactly 50% utilization on each CPU when running a single-threaded process. AMD, apparently, couldn't afford to design and implement two separate CPUs for client and server (with both being monolithic).

re: 1) AMD had to know this, just based on the design. No easy way to fix it without sort of breaking Windows. I think Linux is smarter about threads and data locality - probably why it performs so well with Ryzen (I haven't looked at Linux scheduling in a long time, so AFAIK). Even within one CCX this will be an issue, since prefetches are to the private caches.

re: 2) As The Stilt pointed out, a ring 0 program can change this via core parking, or the /affinity switch can be used from the command prompt. I've been using Process Lasso for years to deal with this issue in some high-performance programs (and to manipulate priority levels). It sounds like MS may be waiting for the April launch of Redstone 2 ("Creators Update") to fix this within the Windows scheduler when a Zeppelin CPU is detected. I would guess that the fix may already be in the latest Windows "Fast Ring" updates or will be in the next couple of weeks.

I do wonder if AMD is planning to create a monolithic design for 7nm. I don't know if their fabric can support >12 cores with a single shared L3$ or if they can even afford to do that (since it must already be in development). If AMD is successful and is able to spend more on CPU development - I think we'll see something even more impressive than Zen (well, in absolute terms).

You can actually set the CPU group size using bcdedit and force NUMA on in Windows, but this will limit most individual processes to just one CCX, rather than only the ones that have a problem.

The scheduler just needs to get smart enough to keep threads on the same CCX on which their data has likely remained.

Games, meanwhile, need to manually set affinities for some of their threads on Ryzen for the best results. Locking a control-loop thread to a single core will allow that core's "AI" to adapt, and a few bonus points to be gained.
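
To illustrate what that manual pinning looks like in practice, here is a minimal sketch using the psutil package, one way to do by script what Process Lasso or `start /affinity` does for you. The CCX-to-logical-CPU mapping below (CPUs 0-7 = CCX0 on an 8C/16T part) is an assumption that should be verified on the actual system, and the PID is just a placeholder:

```python
# Sketch: restrict an already-running process to the logical CPUs of one CCX on
# Windows. Requires the third-party psutil package.
import psutil

CCX0_LOGICAL_CPUS = list(range(0, 8))   # assumed mapping: CPUs 0-7 = CCX0 (cores 0-3 + SMT siblings)

def pin_to_ccx0(pid: int) -> None:
    proc = psutil.Process(pid)
    proc.cpu_affinity(CCX0_LOGICAL_CPUS)          # the scheduler may now only use CCX0
    print(f"{proc.name()} ({pid}) limited to CPUs {proc.cpu_affinity()}")

if __name__ == "__main__":
    pin_to_ccx0(1234)   # placeholder PID of e.g. a game's main process
```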

As for what happens on 7nm... faster fabric, more CCXes, problem largely resolved. Maybe an L4.
 

rvborgh

Member
Apr 16, 2014
195
94
101
Interesting to read this...

I had this same issue on my 48-core overclocked quad-Opteron setup... I had a game that had a max of 6 threads... 1 heavy and the rest light. Performance was alright until I got to about 1800 bots... After much tuning... and trying different things... I found that using Process Lasso and tying the game to the fast cores on the processor in the first socket (4 out of the 12 K10 cores per socket run fast on my system)... upped the frame rate by about 80-90 FPS at the start of the game (from the low-to-mid 200s to 335)... and I could run the game with acceptable FPS up to around 2500 bots.

I really recommend folks try Process Lasso on their Ryzens... as it seems they have the same Windows scheduling issue.
 

Evil Azrael

Junior Member
Mar 4, 2017
2
0
1
@The Stilt,
So far you seem to be the only one to get Win7 running; could you write a few words about how you managed to do this? My own attempts with my existing installation or with the Win7 installation DVD (USB & SATA drive) had not much success; Win7 always hangs during booting.

Could you please tell us how you managed to get Win7 running?
 

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
Anyway, I thought the Phoronix data for Dota 2 Vulkan was pretty curious in regards to core scaling (SMT is enabled everywhere).
http://www.phoronix.com/scan.php?page=article&item=amd-ryzen-cores&num=2

The Infinity Fabric is at 512 GB/s in Vega, but it's unclear at what latency.
Uhm, they made up a fancy name for the memory controller?


The scheduler just needs to get smart enough to keep threads on the same CCX on which their data has likely remained.
Why would you do thread shuffling in an age when the main issue, most of the time, is cache/memory access in the first place? Just curious.

P.S. At this point I am confident that Himeno and Prime95 have the same issue: for some reason they use the K10 codepath without any AVX. For Prime95 I am confident because I have seen a screenshot that seems to claim as much; for Himeno it is the only viable explanation left.
 

Hi-Fi Man

Senior member
Oct 19, 2013
601
120
106
This inter-CCX communication issue reminds me of the Core 2 Quads, just without the MCM and FSB. I wonder how Windows dealt with that; was XP ever patched to deal with it? Is the right solution to treat each CCX as a NUMA node?
 

KTE

Senior member
May 26, 2016
478
130
76
@The Stilt... Sorry if I've missed any of this. I can't stand Win 10, so I don't like to use it except on a tablet.

Does parking a Core with Ryzen occur in logical pairs or one at a time?

What conditions have to be present in Win10 for the Core to be parked?

Are loads juggled about in Win 10? And seriously, why would they be if Cores are supposed to be idle->parked for efficiency? That makes no sense.

Core Park Manager and other tools can monitor and disable this feature, but so can the High Performance setting or changing min/max power values.
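
For reference, disabling parking without third-party tools comes down to raising the "minimum unparked cores" value on the active power plan. A rough sketch using powercfg's documented aliases, wrapped in Python only so the steps can be commented (run elevated; not verified on Ryzen specifically):

```python
# Sketch: force the "minimum unparked cores" setting to 100% on the active power
# plan, which effectively disables core parking (similar to what the High
# Performance profile does).
import subprocess

def powercfg(*args: str) -> None:
    cmd = ["powercfg", *args]
    print("> " + " ".join(cmd))
    subprocess.run(cmd, check=True)

def disable_core_parking() -> None:
    # CPMINCORES = processor performance core parking minimum cores (percent)
    powercfg("/setacvalueindex", "SCHEME_CURRENT", "SUB_PROCESSOR", "CPMINCORES", "100")
    powercfg("/setdcvalueindex", "SCHEME_CURRENT", "SUB_PROCESSOR", "CPMINCORES", "100")
    powercfg("/setactive", "SCHEME_CURRENT")   # re-apply the plan so the change takes effect

if __name__ == "__main__":
    disable_core_parking()
```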

What's the DRAM/CCX interconnect bandwidth/latency with a few parked cores? I.e., does it change?

Phenoms changed drastically.

What did you attain for L1, L2 and L3 access latency? (Have you tried Franck@CPUID's tool?)

What is Ryzen's idle frequency/voltage?

What are idle/load stock CPU temps like? Do the sensors seem accurate? (compared to K10 gen)

Can you monitor core throttling? Have you tried loading stock Ryzen to see if it throttles?

Trying to explain performance issues...

Two questions for @The Stilt

1. Since the frequency of the data fabric is fixed to the memory frequency at a ratio of 1:2, does this mean that using faster memory would result in much faster performance, and in what tasks would having a faster fabric frequency be most beneficial?

It seems that fixing the data fabric frequency to the memory frequency imposes a significant restriction on the data fabric. In the future, could the data fabric frequency be decoupled from the frequency of the memory controller, or perhaps could the ratio be changed for a higher data fabric frequency?

1. In every ideal arch, these areas are on separate power planes and clock domains... Completely separate voltage islands.

This is a major shortsightedness by AMD with any 'IMC problems'. Phenom suffered due to IMC clocking and power shenanigans. That was 2006.

Having a linked CCX power plane is even still backwards.

It's entirely possible that clocking might be impaired more by their CCX design than purely by the 14nm LPP process. It was like that with Phenom's IMC/L3 previously.

AMD implemented decoupling characteristics back in 2007/2008 silicon and saved a ton of power and performance.

It is never a case of just decoupling alone and it "just works", though. DRAM<->L3<->fabric is very tricky to get right. Clocks and power generally pose major sync and timing issues that must be handled to avoid corrupted data. Ryzen's implementation is simply symptomatic of a quick and easy job done under time constraints. It's obvious they couldn't afford to spend as much time as they needed on it.

Going forward, it would be the first area they would look to change.

2. These 'Windows issues' are AMD issues. We had them with Phenom, for Christ's sake! When the workload bounced with CnQ active, performance sucked and stuttered. AMD would have KNOWN about these issues during 'design considerations' 5 years back. It's AMD who has to adapt and get these fixed. Borked chip releases only destroy their own image and income.

Secondly, Core Parking has to be implemented by AMD. If you don't have a working driver for your test OS, why allow this?

Telling reviewers to switch to the High Performance profile, which doesn't park the cores, is just a band-aid at best and very misleading about your real-world performance. Ryzen obviously has issues sleeping and waking the cores. Again, a Phenom issue.

3. Having two CCXes with low interconnect bandwidth/high latency is an even bigger flaw, but this again will not be by design. This is going to pose a huge problem for server workloads unless fixed. Forget HPC altogether.

4. Now you see why Server was not launched. AMD chose their smallest, least risk market to troubleshoot the chip.

5. Intellectual prediction won't magically gain 10-15% performance -- this is wishful thinking, like the pre-release hype, every bit of which turned out wrong.


Hmm, all because of Windows' odd behavior vis-à-vis thread allocation. Windows NT, at least back to 3.5, switches threads between cores (then CPUs) according to some algorithm (it always looked random to me). I remember a spat on COMP.ARCH between some server dude and Dave Cutler over this on a dual-CPU system - Task Manager showed exactly 50% utilization on each CPU when running a single-threaded process. AMD, apparently, couldn't afford to design and implement two separate CPUs for client and server (with both being monolithic).

re: 2) As The Stilt pointed out, a ring 0 program can change this via core parking, or the /affinity switch can be used from the command prompt. I've been using Process Lasso for years to deal with this issue in some high-performance programs (and to manipulate priority levels). It sounds like MS may be waiting for the April launch of Redstone 2 ("Creators Update") to fix this within the Windows scheduler when a Zeppelin CPU is detected. I would guess that the fix may already be in the latest Windows "Fast Ring" updates or will be in the next couple of weeks.
I'm sure K10 had an app that could change the skew when different P-states are entered, and even force certain P-states.

And it would show the separate power plane portions.


Sent from HTC 10
(Opinions are own)
 

Greyguy1948

Member
Nov 29, 2008
156
16
91
All of the designs had their multithreading (SMT/CMT) enabled during the testing since that is what the end-users would be having.

I've verified some of the individual results without the multithreading being enabled on all of them, and the differences fell within the standard deviation of the results (final results being 3RA).
IIRC NBody illustrated a slight improvement on Excavator from having CMT disabled; however, that's pretty much irrelevant since such a configuration is not allowed by default.

Regarding Excavator: can you disable CMT in the BIOS?
Regarding Himeno and NBody: is it likely that another problem size for these benchmarks would give a very different result?
3D Euler seems to be hard to predict - Caselab and CFD are very different.
The 3D Euler test at techreport.com is even worse for Ryzen.
 

FlanK3r

Senior member
Sep 15, 2009
312
37
91
Guys, what's the best temp-monitoring tool for Ryzen? I would like to use Core Temp... if it works correctly. Or HWiNFO sensors? Thx.
 
May 11, 2008
19,300
1,129
126
Computerbase got a 17% boost in avg FPS in Total War: Warhammer when disabling SMT.
At 1080p; not sure with what GPU, as I don't speak German, but a Titan X or GTX 1080.

I guess it is better if I ask this question in this thread.

I was wondering about something. I read that for some situations, like gaming, it is better to disable SMT.
Whether SMT is enabled or disabled, the threads can stall just as often; with SMT enabled, other threads take over the execution time and fill in the gaps. The difference is that when SMT is disabled and a thread stalls, the core stalls as well. This would mean that Ryzen would dissipate less heat with SMT disabled, because there are moments when the core does less work, allowing for higher sustained overclocks and boost clocks.

Does that make any sense, or am I overlooking something?

I mean, with 8 cores, that would be sufficient for a lot of people.
Also, I have read that going for 2666MHz single-sided (single-rank, memory chips on only one side) DDR4 modules is the best option to get high memory speeds.

Is that true?
 

KompuKare

Golden Member
Jul 28, 2009
1,012
923
136
My own attempts with my existing installation or with the Win7 installation DVD (USB & SATA drive) had not much success; Win7 always hangs during booting.

Could you please tell us how you managed to get Win7 running?

I would imagine the same way you'd get it working on any new hardware:
  1. Either slipstream the drivers, or
  2. Have them ready on a USB stick and tell Windows about it at the right moment.
You should be able to get the chipset drivers from most motherboard makers, for example:
https://www.asus.com/uk/Motherboards/ROG-CROSSHAIR-VI-HERO/HelpDesk_Download/
If you've never slipstreamed before, it's probably best to follow a guide, preferably starting with the SP1 ISO from Microsoft:
https://www.microsoft.com/en-gb/software-download/windows7 (that now requires your product key, which is a hassle, especially if you want to make a universal disc)
This is a guide to slipstreaming the Intel RAID drivers:
http://www.win-raid.com/t750f25-Guide-Integration-of-drivers-into-a-Win-image.html
That uses NTLite, which makes it easy, but it is possible to do the same without third-party tools.
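
For anyone who'd rather script the injection step that NTLite automates, it boils down to a few DISM calls. A rough sketch (paths, image index, and driver folder are placeholders; the chipset/USB/storage drivers must already be extracted to .inf form, and boot.wim may need the same treatment if the installer itself hangs):

```python
# Sketch of the driver-injection step that NTLite wraps: mount the Win7 image,
# add the AM4 drivers (.inf packages), then commit. Run from an elevated prompt.
import subprocess

WIM     = r"C:\win7\sources\install.wim"   # copied from the SP1 ISO (placeholder path)
MOUNT   = r"C:\win7\mount"                 # empty folder to mount the image into
DRIVERS = r"C:\win7\drivers"               # extracted .inf driver packages

def dism(*args: str) -> None:
    cmd = ["dism", *args]
    print("> " + " ".join(cmd))
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    dism("/Mount-Wim", f"/WimFile:{WIM}", "/Index:1", f"/MountDir:{MOUNT}")   # index varies per edition
    dism(f"/Image:{MOUNT}", "/Add-Driver", f"/Driver:{DRIVERS}", "/Recurse")
    dism("/Unmount-Wim", f"/MountDir:{MOUNT}", "/Commit")
```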
 

JDG1980

Golden Member
Jul 18, 2013
1,663
570
136
Regarding the ST: 1800X at default, with turbo & XFR enabled scores 162 in Cinebench 15. With the TDP (PPT) limited to 30W the score is 155.

Wow. That's incredibly impressive, especially when compared to results for existing laptops. The only ones that get a higher Cinebench ST score than that are two Clevo desktop replacements that use enthusiast K-series CPUs (and thus have high TDPs). For normal laptop CPUs, the best score recorded was 147.74, on the Dell XPS 15. That laptop uses a Core i7-6700HQ Skylake CPU with a TDP of 45W. And Ryzen can beat that by ~5% at two-thirds the TDP?!
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
@The Stilt

Thank you for the article :).

Any chance you can fill in the x.xGHz ACXFRC for an R7 1700?

For example, for the 1700 SKU the clock configuration is as follows: 3.0GHz all-core frequency (MACF), 3.7GHz single-core frequency (MSCF), x.xGHz maximum all-core XFR ceiling (ACXFRC) and 3.75GHz maximum single-core XFR ceiling (SCXFRC).

The R7 1700 SHOULD have a 3.05GHz ACXFRC (due to being non-X), however there are indications that that's not necessarily the case.
Unfortunately I don't have the data available to check it right now.

Two questions for @The Stilt

1. Since the frequency of the data fabric is fixed to the memory frequency at a ratio of 1:2, does this mean that using faster memory would result in much faster performance, and in what tasks would having a faster fabric frequency be most beneficial?

It seems that fixing the data fabric frequency to the memory frequency imposes a significant restriction on the data fabric. In the future, could the data fabric frequency be decoupled from the frequency of the memory controller, or perhaps could the ratio be changed for a higher data fabric frequency?

2. Since the low clock speed is a result of Samsung's 14nm LPP, and increasing frequency beyond Critical 2 would require significantly higher voltage, would it be beneficial for AMD to instead move production of its high-frequency products to TSMC?

Or rather, should AMD be focused on increasing its IPC at the given frequency?

Based on my own tests, the average performance benefit from higher-than-2400MHz DRAM (i.e. 1200MHz DFICLK) is very marginal in 2D, even with an 8C/16T config. With a smaller core count, even 2133MHz is fine.

There are various interfaces / fabrics inside Zeppelin, and their functionality and relations are not fully known at the moment. So even though the data fabric operates at half the effective MEMCLK, that doesn't necessarily mean that the inter-CCX connections are operating at the speculated width and speed. There are parts of the fabric which are 256-bit wide, for example.
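
To put the 1:2 relationship in concrete numbers: the data fabric clock (DFICLK) equals the real MEMCLK, i.e. half of the DDR4 transfer rate. The per-cycle link width used below is purely an assumed figure for illustration, since, as noted above, the actual inter-CCX width and speed are not confirmed:

```python
# Worked example of the MEMCLK/DFICLK relationship: DDR4 is double data rate, so
# MEMCLK = (MT/s) / 2, and the data fabric clock follows MEMCLK 1:1.
ASSUMED_LINK_WIDTH_BYTES = 32   # ASSUMPTION for illustration only, not confirmed

def dficlk_mhz(dram_mt_s: int) -> float:
    return dram_mt_s / 2

for dram in (2133, 2400, 2666, 3200):
    clk = dficlk_mhz(dram)
    bw_gb_s = clk * 1e6 * ASSUMED_LINK_WIDTH_BYTES / 1e9
    print(f"DDR4-{dram}: DFICLK ~{clk:.0f} MHz, "
          f"~{bw_gb_s:.1f} GB/s per direction at an assumed {ASSUMED_LINK_WIDTH_BYTES} B/cycle")
```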

Moving to another node (such as 16nm FF+) is really a no-brainer, even if it only yields a <10% increase in Fmax. Porting the design will be extremely expensive, however not NEARLY as expensive as trying to increase the IPC of the µarch itself. The results in terms of performance are always guaranteed when increasing the frequency of an existing design, while that's not the case with a modified µarch featuring higher IPC. The modified µarch may not necessarily be able to reach the same speeds as the old one did, so the actual performance may remain the same or even degrade.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
Moving to another node (such as 16nm FF+) is really a no-brainer, even if it only yields a <10% increase in Fmax. Porting the design will be extremely expensive, however not NEARLY as expensive as trying to increase the IPC of the µarch itself. The results in terms of performance are always guaranteed when increasing the frequency of an existing design, while that's not the case with a modified µarch featuring higher IPC. The modified µarch may not necessarily be able to reach the same speeds as the old one did, so the actual performance may remain the same or even degrade.

So does it make more sense, cost- and time-wise, for AMD to use Samsung's 14nm LPU or to fully port their design to TSMC's 16nm FF+?
TSMC's 16FF+ seems to clock higher than Samsung's 14nm LPP/LPC, but isn't LPU supposed to catch up here?
Also, GloFo will skip 10nm, right?
 