Ryzen: Strictly technical


Mockingbird

Senior member
Feb 12, 2017
733
741
106
A question though: would disabling a complete CCX actually help reduce the penalty? Could someone with a Ryzen chip please try running the gaming benchmarks with one CCX disabled and SMT disabled?

[Four benchmark charts attached as images; source linked below.]


http://www.pcgameshardware.de/Ryzen-7-1800X-CPU-265804/Tests/Test-Review-1222033/
 

roybotnik

Junior Member
Mar 4, 2017
6
9
36
I ran the draw call benchmark on my Ryzen 1800X (OC'd to 4GHz) and averaged 17 FPS. I've been messing with this system for hours through numerous reboots, and at some point I was getting 14, but I'm pretty sure I was at 4GHz then as well. The BIOS for the Asus Crosshair is really not in a good state. My voltages have been screwy, and I've given up on getting my 3200MHz RAM working past 2666MHz for now.

Link: https://forums.anandtech.com/thread...call-performance.2499609/page-2#post-38776423
 

R0H1T

Platinum Member
Jan 12, 2013
2,582
162
106
AMD had to know this, just based on the design. No easy way to fix it without sort of breaking Windows. I think Linux is smarter about threads and data locality - probably why it performs so well with Ryzen (I haven't looked at Linux scheduling in a long time, so AFAIK). Even within one CCX this will be an issue, since prefetches are to the private caches.

I've been using Process Lasso for years to deal with this issue in some high-performance programs (and to manipulate priority levels). It sounds like MS may be waiting for the April launch of Redstone 2 ("Creators Update") to fix this within the Windows scheduler when a Zeppelin CPU is detected. I would guess that the fix may already be in the latest Windows "Fast Ring" updates or will be in the next couple of weeks.
I see another Process Lasso fan. MS should really be ashamed of themselves, seeing how they can't seem to do half of what Bitsum achieved with their software. Then again, it's Windows (10) and it just works.
 

starheap

Junior Member
Mar 4, 2017
5
0
1
I ran the draw call benchmark on my Ryzen 1800X (OC'd to 4GHz) and averaged 17 FPS. I've been messing with this system for hours through numerous reboots, and at some point I was getting 14, but I'm pretty sure I was at 4GHz then as well. The BIOS for the Asus Crosshair is really not in a good state. My voltages have been screwy, and I've given up on getting my 3200MHz RAM working past 2666MHz for now.

Link: https://forums.anandtech.com/thread...call-performance.2499609/page-2#post-38776423
You are not the only one with... problems. I also have an 1800X; I can't get my memory past 2133MHz unless I do a convoluted process that involves swapping back and forth between two sets of RAM (then I can set the memory up with Ryzen Master to get 2666MHz). I have a Gigabyte Gaming 5...

I've had 3 sets in this system, all with similar results (one 4-stick kit I returned to Newegg because half of the DIMMs were defective due to bad packaging, confirmed with a Skylake system). I think I'm going to order a cheap B350 motherboard and a cheap sub-3000MHz 16GB RAM kit (something on the supported memory list). This way I'll be able to better narrow down the problem. I bought the CPU + board at Micro Center, so if it's my CPU or board I'll be exchanging them. If I have to exchange my CPU, I'll probably get a 1700 or 1700X...
 

Kromaatikse

Member
Mar 4, 2017
83
169
56
Certainly Linux is not perfect; it has a scheduler, too. Time will tell if some patches can be applied to inform the scheduler about Ryzen's new architecture.
Linux has a *very good* scheduler. It already avoids needlessly moving threads around the way Windows does, and it's already aware of a wide variety of topologies (NUMA and otherwise). The kernel devs responsible are almost certainly fine-tuning it for Ryzen as we speak, but it is already reasonable.
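
As an aside, the topology information the Linux scheduler (and tools like hwloc) relies on is exposed through sysfs, so it's easy to check which logical CPUs share an L3 and therefore sit on the same CCX. A minimal sketch, assuming the standard /sys/devices/system/cpu/.../cache/index3 layout (index3 is normally the L3); it only inspects the topology, it is not the scheduler itself:

```python
#!/usr/bin/env python3
# Sketch: list which logical CPUs share an L3 cache (one group per CCX on Zen),
# using the standard Linux sysfs topology files.
import glob
import os

def l3_groups():
    groups = {}
    for cpu_dir in glob.glob("/sys/devices/system/cpu/cpu[0-9]*"):
        shared = os.path.join(cpu_dir, "cache", "index3", "shared_cpu_list")
        if not os.path.exists(shared):      # no L3 info exposed for this CPU
            continue
        cpu = int(os.path.basename(cpu_dir)[3:])
        with open(shared) as f:
            key = f.read().strip()          # e.g. "0-7" for the first CCX
        groups.setdefault(key, []).append(cpu)
    return groups

if __name__ == "__main__":
    for shared_list, cpus in sorted(l3_groups().items()):
        print(f"CPUs sharing one L3 ({shared_list}): {sorted(cpus)}")
    # A latency-sensitive process could then pin itself to a single group, e.g.:
    # os.sched_setaffinity(0, {0, 1, 2, 3, 4, 5, 6, 7})
```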
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Hmm, all because of Windows' odd behavior vis-à-vis thread allocation. Windows NT, at least back to 3.5, switches threads between cores (then CPUs) according to some algorithm (it always looked random to me). I remember a spat on COMP.ARCH between some server dude and Dave Cutler over this on a dual-CPU system - Task Manager showed exactly 50% utilization on each CPU when running a single-threaded process. AMD, apparently, couldn't afford to design and implement two separate CPUs for client and server (with both being monolithic).

re: 1) AMD had to know this, just based on the design. No easy way to fix it without sort of breaking Windows. I think Linux is smarter about threads and data locality - probably why it performs so well with Ryzen (I haven't looked at Linux scheduling in a long time, so AFAIK). Even within one CCX this will be an issue, since prefetches are to the private caches.

re: 2) As The Stilt pointed out, a ring 0 program can change this via core parking, or the /affinity switch can be used from the command prompt. I've been using Process Lasso for years to deal with this issue in some high-performance programs (and to manipulate priority levels). It sounds like MS may be waiting for the April launch of Redstone 2 ("Creators Update") to fix this within the Windows scheduler when a Zeppelin CPU is detected. I would guess that the fix may already be in the latest Windows "Fast Ring" updates or will be in the next couple of weeks.

I do wonder if AMD is planning to create a monolithic design for 7nm. I don't know if their fabric can support >12 cores with a single shared L3$ or if they can even afford to do that (since it must already be in development). If AMD is successful and is able to spend more on CPU development - I think we'll see something even more impressive than Zen (well, in absolute terms).

You can actually set the CPU group size using bcdedit and force NUMA on in Windows, but this will limit most individual processes to just one CCX, rather than only the ones that have a problem.

The scheduler just needs to get smart enough to keep threads on the same CCX on which their data has likely remained.

Games, meanwhile, need to manually set affinities for some of their threads on Ryzen for the best results. Locking a control-loop thread to a single core will allow that core's "AI" to adapt, and a few bonus points to be gained.
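
To illustrate what that manual pinning looks like in practice, here is a minimal sketch using the psutil package, one way to do by script what Process Lasso or `start /affinity` does for you. The CCX-to-logical-CPU mapping below (CPUs 0-7 = CCX0 on an 8C/16T part) is an assumption that should be verified on the actual system, and the PID is just a placeholder:

```python
# Sketch: restrict an already-running process to the logical CPUs of one CCX on
# Windows. Requires the third-party psutil package.
import psutil

CCX0_LOGICAL_CPUS = list(range(0, 8))   # assumed mapping: CPUs 0-7 = CCX0 (cores 0-3 + SMT siblings)

def pin_to_ccx0(pid: int) -> None:
    proc = psutil.Process(pid)
    proc.cpu_affinity(CCX0_LOGICAL_CPUS)          # the scheduler may now only use CCX0
    print(f"{proc.name()} ({pid}) limited to CPUs {proc.cpu_affinity()}")

if __name__ == "__main__":
    pin_to_ccx0(1234)   # placeholder PID of e.g. a game's main process
```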

As for what happens on 7nm... faster fabric, more CCXes, problem largely resolved. Maybe an L4.
 

rvborgh

Member
Apr 16, 2014
195
94
101
Interesting to read this...

I had this same issue on my 48-core overclocked quad-Opteron setup... I had a game that had a max of 6 threads... 1 heavy and the rest light. Performance was alright until I got to about 1800 bots... After much tuning... and trying different things... I found that using Process Lasso and tying the game to the fast cores on the processor in the first socket (4 out of the 12 K10 cores per socket run fast on my system)... upped the frame rate by about 80-90 FPS at the start of the game (from the low-to-mid 200s to 335)... and I could run the game with acceptable FPS up to around 2500 bots.

I really recommend folks try Process Lasso on their Ryzens... as it seems they have the same Windows scheduling issue.
 

Evil Azrael

Junior Member
Mar 4, 2017
2
0
1
@The Stilt,
So far you seem to be the only one to get Win7 running; could you write a few words about how you managed to do this? My own attempts with my existing installation or with the Win7 installation DVD (USB & SATA drive) had not much success; Win7 always hangs during booting.

Could you please tell us how you managed to get Win7 running?
 

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
Anyway, I thought the Phoronix data for Dota 2 Vulkan was pretty curious in regards to core scaling (SMT is enabled everywhere).
http://www.phoronix.com/scan.php?page=article&item=amd-ryzen-cores&num=2

The Infinity Fabric is at 512 GB/s in Vega, but it's unclear at what latency.
Uhm, they made up a fancy name for the memory controller?


The scheduler just needs to get smart enough to keep threads on the same CCX on which their data has likely remained.
Why would you do thread shuffling in an age when the main issue, most of the time, is cache/memory access in the first place? Just curious.

P.S. At this point I am confident that Himeno and Prime95 have the same issue: for some reason they use the K10 codepath without any AVX. For Prime95 I am confident because I have seen a screenshot that seems to claim as much; for Himeno it is the only viable explanation left.
 

Hi-Fi Man

Senior member
Oct 19, 2013
601
120
106
This inter-CCX communication issue reminds me of the Core 2 Quads, just without the MCM and FSB. I wonder how Windows dealt with that; was XP ever patched to deal with it? Is the right solution to treat each CCX as a NUMA node?
 

KTE

Senior member
May 26, 2016
478
130
76
@The Stilt... Sorry if I've missed any of this. I can't stand Win 10, so I don't like to use it except on a tablet.

Does parking a Core with Ryzen occur in logical pairs or one at a time?

What conditions have to be present in Win10 for the Core to be parked?

Are loads juggled about in Win 10? And seriously, why would they be if Cores are supposed to be idle->parked for efficiency? That makes no sense.

Core Park Manager and other tools can monitor and disable this feature, but so can the High Performance setting or changing min/max power values.
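
For reference, disabling parking without third-party tools comes down to raising the "minimum unparked cores" value on the active power plan. A rough sketch using powercfg's documented aliases, wrapped in Python only so the steps can be commented (run elevated; not verified on Ryzen specifically):

```python
# Sketch: force the "minimum unparked cores" setting to 100% on the active power
# plan, which effectively disables core parking (similar to what the High
# Performance profile does).
import subprocess

def powercfg(*args: str) -> None:
    cmd = ["powercfg", *args]
    print("> " + " ".join(cmd))
    subprocess.run(cmd, check=True)

def disable_core_parking() -> None:
    # CPMINCORES = processor performance core parking minimum cores (percent)
    powercfg("/setacvalueindex", "SCHEME_CURRENT", "SUB_PROCESSOR", "CPMINCORES", "100")
    powercfg("/setdcvalueindex", "SCHEME_CURRENT", "SUB_PROCESSOR", "CPMINCORES", "100")
    powercfg("/setactive", "SCHEME_CURRENT")   # re-apply the plan so the change takes effect

if __name__ == "__main__":
    disable_core_parking()
```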

What's the DRAM/CCX interconnect bandwidth/latency with a few parked cores? I.e., does it change?

Phenoms changed drastically.

What did you attain for L1, L2 and L3 access latency? (Have you tried Franck@CPUID's tool?)

What is Ryzen's idle frequency/voltage?

What are idle/load stock CPU temps like? Do the sensors seem accurate? (compared to K10 gen)

Can you monitor core throttling? Have you tried loading stock Ryzen to see if it throttles?

Trying to explain performance issues...

Two questions for @The Stilt

1. Since the frequency of the data fabric is fixed to the memory frequency at a ratio of 1:2, does this mean that using faster memory would result in much faster performance, and in what tasks would having a faster fabric frequency be most beneficial?

It seems that fixing the data fabric frequency to the memory frequency imposes a significant restriction on the data fabric. In the future, could the data fabric frequency be decoupled from the frequency of the memory controller, or perhaps could the ratio be changed for a higher data fabric frequency?

1. In every ideal arch, these areas are on separate power planes and clock domains... Completely separate voltage islands.

This is a major shortsightedness by AMD with any 'IMC problems'. Phenom suffered due to IMC clocking and power shenanigans. That was 2006.

Having a linked CCX power plane is even still backwards.

It's entirely possible that clocking might be impaired more by their CCX design than purely by the 14nm LPP process. It was like that with Phenom's IMC/L3 previously.

AMD implemented decoupling characteristics back in 2007/2008 silicon and saved a ton of power and performance.

It is never a case of just decoupling alone and it "just works", though. DRAM<->L3<->fabric is very tricky to get right. Clocks and power generally pose major sync and timing issues that must be handled to avoid corrupted data. Ryzen's implementation is simply symptomatic of a quick and easy job done under time constraints. It's obvious they couldn't afford to spend as much time as they needed on it.

Going forward, it would be the first area they would look to change.

2. These 'Windows issues' are AMD issues. We had them with Phenom, for Christ's sake! When the workload bounced with CnQ active, performance sucked and stuttered. AMD would have KNOWN about these issues during 'design considerations' 5 years back. It's AMD who has to adapt and get these fixed. Borked chip releases only destroy their own image and income.

Secondly, Core Parking has to be implemented by AMD. If you don't have a working driver for your test OS, why allow this?

Telling reviewers to switch to the High Performance profile, which doesn't park the cores, is just a band-aid at best and very misleading about your real-world performance. Ryzen obviously has issues sleeping and waking the cores. Again, a Phenom issue.

3. Having two CCXes with low interconnect bandwidth/high latency is an even bigger flaw, but this again will not be by design. This is going to pose a huge problem for server workloads unless fixed. Forget HPC altogether.

4. Now you see why Server was not launched. AMD chose their smallest, least risk market to troubleshoot the chip.

5. Intellectual prediction won't magically gain 10-15% performance -- this is wishful thinking, like the pre-release hype, every bit of which turned out wrong.


Hmm, all because of Windows' odd behavior vis-à-vis thread allocation. Windows NT, at least back to 3.5, switches threads between cores (then CPUs) according to some algorithm (it always looked random to me). I remember a spat on COMP.ARCH between some server dude and Dave Cutler over this on a dual-CPU system - Task Manager showed exactly 50% utilization on each CPU when running a single-threaded process. AMD, apparently, couldn't afford to design and implement two separate CPUs for client and server (with both being monolithic).

re: 2) As The Stilt pointed out, a ring 0 program can change this via core parking, or the /affinity switch can be used from the command prompt. I've been using Process Lasso for years to deal with this issue in some high-performance programs (and to manipulate priority levels). It sounds like MS may be waiting for the April launch of Redstone 2 ("Creators Update") to fix this within the Windows scheduler when a Zeppelin CPU is detected. I would guess that the fix may already be in the latest Windows "Fast Ring" updates or will be in the next couple of weeks.
I'm sure K10 had an app that could change the skew when different P-states are entered, and even force certain P-states.

And it would show the separate power plane portions.


Sent from HTC 10
(Opinions are own)
 

Greyguy1948

Member
Nov 29, 2008
156
16
91
All of the designs had their multithreading (SMT/CMT) enabled during the testing since that is what the end-users would be having.

I've verified some of the individual results without the multithreading being enabled on all of them, and the differences fell within the standard deviation of the results (final results being 3RA).
IIRC NBody illustrated a slight improvement on Excavator from having CMT disabled; however, that's pretty much irrelevant since such a configuration is not allowed by default.

Regarding Excavator: can you disable CMT in the BIOS?
Regarding Himeno and NBody: is it likely that another problem size for these benchmarks would give a very different result?
3D Euler seems to be hard to predict - Caselab and CFD are very different.
The 3D Euler test at techreport.com is even worse for Ryzen.
 

FlanK3r

Senior member
Sep 15, 2009
312
37
91
Guys, what's the best temp-monitoring tool for Ryzen? I would like to use Core Temp... if it works correctly. Or HWiNFO sensors? Thx.
 
May 11, 2008
19,300
1,129
126
Computerbase got a 17% boost in avg FPS in Total War: Warhammer when disabling SMT.
At 1080p; not sure with what GPU, as I don't speak German, but a Titan X or GTX 1080.

I guess it is better if I ask this question in this thread.

I was wondering about something. I read that for some situations, like gaming, it is better to disable SMT.
Whether SMT is enabled or disabled, the threads can stall just as often; with SMT enabled, other threads take over the execution time and fill in the gaps. The difference is that when SMT is disabled and a thread stalls, the core stalls as well. This would mean that Ryzen would dissipate less heat with SMT disabled, because there are moments when the core does less work, allowing for higher sustained overclocks and boost clocks.

Does that make any sense, or am I overlooking something?

I mean, with 8 cores, that would be sufficient for a lot of people.
Also, I have read that going for 2666MHz single-sided (single-rank, memory chips on only one side) DDR4 modules is the best option to get high memory speeds.

Is that true?
 

KompuKare

Golden Member
Jul 28, 2009
1,012
923
136
My own attempts with my existing installation or with the Win7 installation DVD (USB & SATA drive) had not much success; Win7 always hangs during booting.

Could you please tell us how you managed to get Win7 running?

I would imagine the same way you'd get it working on any new hardware:
  1. Either slipstream the drivers, or
  2. Have them ready on a USB stick and tell Windows about it at the right moment.
You should be able to get the chipset drivers from most motherboard makers, for example:
https://www.asus.com/uk/Motherboards/ROG-CROSSHAIR-VI-HERO/HelpDesk_Download/
If you've never slipstreamed before, it's probably best to follow a guide, preferably starting with the SP1 ISO from Microsoft:
https://www.microsoft.com/en-gb/software-download/windows7 (that now requires your product key, which is a hassle, especially if you want to make a universal disc)
This is a guide to slipstreaming the Intel RAID drivers:
http://www.win-raid.com/t750f25-Guide-Integration-of-drivers-into-a-Win-image.html
That uses NTLite, which makes it easy, but it is possible to do the same without third-party tools.
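
For anyone who'd rather script the injection step that NTLite automates, it boils down to a few DISM calls. A rough sketch (paths, image index, and driver folder are placeholders; the chipset/USB/storage drivers must already be extracted to .inf form, and boot.wim may need the same treatment if the installer itself hangs):

```python
# Sketch of the driver-injection step that NTLite wraps: mount the Win7 image,
# add the AM4 drivers (.inf packages), then commit. Run from an elevated prompt.
import subprocess

WIM     = r"C:\win7\sources\install.wim"   # copied from the SP1 ISO (placeholder path)
MOUNT   = r"C:\win7\mount"                 # empty folder to mount the image into
DRIVERS = r"C:\win7\drivers"               # extracted .inf driver packages

def dism(*args: str) -> None:
    cmd = ["dism", *args]
    print("> " + " ".join(cmd))
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    dism("/Mount-Wim", f"/WimFile:{WIM}", "/Index:1", f"/MountDir:{MOUNT}")   # index varies per edition
    dism(f"/Image:{MOUNT}", "/Add-Driver", f"/Driver:{DRIVERS}", "/Recurse")
    dism("/Unmount-Wim", f"/MountDir:{MOUNT}", "/Commit")
```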
 

JDG1980

Golden Member
Jul 18, 2013
1,663
570
136
Regarding the ST: 1800X at default, with turbo & XFR enabled scores 162 in Cinebench 15. With the TDP (PPT) limited to 30W the score is 155.

Wow. That's incredibly impressive, especially when compared to results for existing laptops. The only ones that get a higher Cinebench ST score than that are two Clevo desktop replacements that use enthusiast K-series CPUs (and thus have high TDPs). For normal laptop CPUs, the best score recorded was 147.74, on the Dell XPS 15. That laptop uses a Core i7-6700HQ Skylake CPU with a TDP of 45W. And Ryzen can beat that by ~5% at two-thirds the TDP?!
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
@The Stilt

Thank you for the article :).

Any chance you can fill in the x.xGHz ACXFRC for an R7 1700?

For example, for the 1700 SKU the clock configuration is as follows: 3.0GHz all-core frequency (MACF), 3.7GHz single-core frequency (MSCF), x.xGHz maximum all-core XFR ceiling (ACXFRC) and 3.75GHz maximum single-core XFR ceiling (SCXFRC).

The R7 1700 SHOULD have a 3.05GHz ACXFRC (due to being non-X), however there are indications that that's not necessarily the case.
Unfortunately I don't have the data available to check it right now.

Two questions for @The Stilt

1. Since the frequency of the data fabric is fixed to the memory frequency at a ratio of 1:2, does this mean that using faster memory would result in much faster performance, and in what tasks would having a faster fabric frequency be most beneficial?

It seems that fixing the data fabric frequency to the memory frequency imposes a significant restriction on the data fabric. In the future, could the data fabric frequency be decoupled from the frequency of the memory controller, or perhaps could the ratio be changed for a higher data fabric frequency?

2. Since the low clock speed is a result of Samsung's 14nm LPP, and increasing frequency beyond Critical 2 would require significantly higher voltage, would it be beneficial for AMD to instead move production of its high-frequency products to TSMC?

Or rather, should AMD be focused on increasing its IPC at the given frequency?

Based on my own tests, the average performance benefit from higher-than-2400MHz DRAM (i.e. 1200MHz DFICLK) is very marginal in 2D, even with an 8C/16T config. With a smaller core count, even 2133MHz is fine.

There are various interfaces / fabrics inside Zeppelin, and their functionality and relations are not fully known at the moment. So even though the data fabric operates at half the effective MEMCLK, that doesn't necessarily mean that the inter-CCX connections are operating at the speculated width and speed. There are parts of the fabric which are 256-bit wide, for example.
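
To put the 1:2 relationship in concrete numbers: the data fabric clock (DFICLK) equals the real MEMCLK, i.e. half of the DDR4 transfer rate. The per-cycle link width used below is purely an assumed figure for illustration, since, as noted above, the actual inter-CCX width and speed are not confirmed:

```python
# Worked example of the MEMCLK/DFICLK relationship: DDR4 is double data rate, so
# MEMCLK = (MT/s) / 2, and the data fabric clock follows MEMCLK 1:1.
ASSUMED_LINK_WIDTH_BYTES = 32   # ASSUMPTION for illustration only, not confirmed

def dficlk_mhz(dram_mt_s: int) -> float:
    return dram_mt_s / 2

for dram in (2133, 2400, 2666, 3200):
    clk = dficlk_mhz(dram)
    bw_gb_s = clk * 1e6 * ASSUMED_LINK_WIDTH_BYTES / 1e9
    print(f"DDR4-{dram}: DFICLK ~{clk:.0f} MHz, "
          f"~{bw_gb_s:.1f} GB/s per direction at an assumed {ASSUMED_LINK_WIDTH_BYTES} B/cycle")
```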

Moving to another node (such as 16nm FF+) is really a no-brainer, even if it only yields a <10% increase in Fmax. Porting the design will be extremely expensive, however not NEARLY as expensive as trying to increase the IPC of the µarch itself. The results in terms of performance are always guaranteed when increasing the frequency of an existing design, while that's not the case with a modified µarch featuring higher IPC. The modified µarch may not necessarily be able to reach the same speeds as the old one did, so the actual performance may remain the same or even degrade.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
Moving to another node (such as 16nm FF+) is really a no-brainer, even if it only yields a <10% increase in Fmax. Porting the design will be extremely expensive, however not NEARLY as expensive as trying to increase the IPC of the µarch itself. The results in terms of performance are always guaranteed when increasing the frequency of an existing design, while that's not the case with a modified µarch featuring higher IPC. The modified µarch may not necessarily be able to reach the same speeds as the old one did, so the actual performance may remain the same or even degrade.

So does it make more sense, cost- and time-wise, for AMD to use Samsung's 14nm LPU or to fully port their design to TSMC's 16nm FF+?
TSMC's 16FF+ seems to clock higher than Samsung's 14nm LPP/LPC, but isn't LPU supposed to catch up here?
Also, GloFo will skip 10nm, right?
 