Discussion in 'CPUs and Overclocking' started by The Stilt, Mar 2, 2017.
Thanks for the hard work Stilt!
I ran the draw-call benchmark on my Ryzen 1800X (OC'd to 4GHz) and averaged 17 fps. I've been messing with this system for hours through numerous reboots, and at some point I was getting 14, but I'm pretty sure I was at 4GHz then as well. The BIOS for the Asus Crosshair is really not in a good state. My voltages have been screwy and I've given up on getting my 3200 RAM working past 2666 for now.
I see another Process Lasso fan. MS should really be ashamed of themselves, seeing how they can't seem to do half of what Bitsum achieved with their software. Then again, it's Windows (10) and "it just works".
You are not the only one with... problems. I also have an 1800X; I can't get my memory past 2133 unless I do some convoluted process that involves swapping back and forth between two sets of RAM (then I can set the memory up with Ryzen Master to get 2666). I have a Gigabyte Gaming 5...
I've had 3 sets in this system, all with similar results (one 4-stick kit I returned to Newegg because half of the DIMMs were defective due to bad packaging, confirmed with a Skylake system). I think I'm going to order a cheap B350 motherboard and a cheap sub-3000MHz 16GB RAM kit (something on the supported memory list). That way I'll be able to better narrow down the problem. I bought the CPU + board at Microcenter, so if it's my CPU or board I'll be exchanging them. If I have to exchange my CPU, I'll probably get a 1700 or 1700X...
Linux has a *very good* scheduler. It already avoids moving threads around needlessly as Windows does, and it's already aware of a wide variety of topologies (NUMA and otherwise). The kernel devs responsible are almost certainly fine-tuning it for Ryzen as we speak, but it is already reasonable.
@Encrypted11 You might find some of what you're looking for in PCPER's review.
You can actually set the CPU group size using bcdedit and force NUMA on in Windows. But this will limit most individual processes to just one CCX, rather than only the ones that actually have a problem.
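As a concrete sketch of that bcdedit route (groupsize and maxgroup are real BCD settings, but the resulting group layout depends on firmware and core count, so treat this as a starting point rather than a recipe):

```bat
:: Run from an elevated command prompt; a reboot is needed for the change to apply.
:: Limit each processor group to 8 logical CPUs -- one CCX with SMT on an 8-core Ryzen.
bcdedit /set groupsize 8

:: Optionally maximize the number of groups the kernel creates.
bcdedit /set maxgroup on

:: To undo:
:: bcdedit /deletevalue groupsize
:: bcdedit /deletevalue maxgroup
```

With a group size of 8, scheduler-unaware processes stay inside one group (one CCX here) by default, which is exactly the "most processes get limited" behaviour described above.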
The scheduler just needs to get smart enough to keep threads on the same CCX on which their data has likely remained.
Games, meanwhile, need to manually set affinities for some of their threads on Ryzen for the best results. Locking a control-loop thread to a single core will allow that core's "AI" to adapt, with a few bonus points to be gained.
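For what it's worth, that kind of pinning can be scripted as well. A minimal sketch using the Linux affinity API (os.sched_setaffinity; on Windows the analogous calls are SetThreadAffinityMask / SetProcessAffinityMask, and the CPU numbering below is an assumption that varies by OS and BIOS):

```python
import os

def pin_to_ccx(cpus):
    """Pin the calling process to the given set of logical CPUs (e.g. one CCX).

    On an 8C/16T Ryzen, logical CPUs 0-7 would plausibly be the first CCX,
    but the exact numbering is an assumption -- verify with a topology tool.
    """
    os.sched_setaffinity(0, cpus)   # 0 = current process
    return os.sched_getaffinity(0)  # read back the effective mask

# Example: keep the process on logical CPUs 0-3.
# pin_to_ccx({0, 1, 2, 3})
```

Tools like Process Lasso do essentially this, just persistently and per-executable.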
As for what happens at 7nm... faster fabric, more CCXes, problem largely resolved. Maybe an L4.
The Infinity Fabric is at 512 GB/s in Vega, but the latency is unclear.
Interesting to read this...
I had this same issue on my 48-core overclocked quad-Opteron setup. I had a game with a max of 6 threads: 1 heavy and the rest light. Performance was alright until I got to about 1800 bots. After much tuning and trying different things, I found that using Process Lasso and tying the game to the fast cores on the processor in the first socket (4 out of the 12 K10 cores per socket run fast on my system) upped the frame rate by about 80-90 fps at the start of the game (from the low-to-mid 200s to 335), and I could run the game with acceptable fps up to around 2500 bots.
I really recommend folks try Process Lasso on their Ryzens, as it seems they have the same Windows scheduling issue.
I guess the Infinity Fabric is not a fixed width; it probably comes in links to scale up or down as needed.
So far you seem to be the only one to get Win7 running; could you write a few words on how you managed to do this? My own tries with my existing installation or with the Win7 installation DVD (USB & SATA drive) had not much success; Win7 always hangs during booting.
Could you please tell us how you managed to get Win7 running?
Anyway, I thought the Phoronix data for Dota 2 Vulkan was pretty curious with regard to core scaling (SMT is enabled everywhere).
Uhm, they made up a fancy name for the memory controller?
Why would you do thread shuffling in an age when the main issue most of the time is cache/memory access in the first place? Just curious.
P.S. At this point I am confident that Himeno and Prime95 have the same issue: for some reason they use a K10 codepath without any AVX. For Prime95 I am confident because I have seen a screenshot that seems to claim as much; for Himeno it is the only viable explanation left.
This inter-CCX communication issue reminds me of the Core 2 Quads, just without the MCM and FSB. I wonder how Windows dealt with that; was XP ever patched to deal with it? Is the right solution to treat each CCX as a NUMA node?
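One way to see where the CCX boundaries fall (on Linux, at least) is to group logical CPUs by shared L3, since each CCX has its own L3 slice. A sketch reading the standard sysfs cache topology files (nothing Ryzen-specific is assumed):

```python
import glob

def parse_cpu_list(s):
    """Parse a sysfs cpu list like '0-3,8-11' into a list of ints."""
    cpus = []
    for part in s.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        elif part:
            cpus.append(int(part))
    return cpus

def l3_domains():
    """Group logical CPUs by the L3 they share -- one group per CCX on Ryzen."""
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/cache/index3/shared_cpu_list"
    domains = set()
    for path in glob.glob(pattern):
        with open(path) as f:
            domains.add(tuple(parse_cpu_list(f.read())))
    return sorted(domains)

# On an 1800X you would expect two groups; the exact CPU numbering
# depends on how the kernel enumerates cores and SMT siblings.
```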
@TheStilt .. Sorry if I've missed any of this. I can't stand Win 10, so don't like to use it except for a tablet.
Does parking a Core with Ryzen occur in logical pairs or one at a time?
What conditions have to be present in Win10 for the Core to be parked?
Are loads juggled about in Win 10? And seriously, why would they be if Cores are supposed to be idle->parked for efficiency? That makes no sense.
Core Park Manager and other tools can monitor and disable this feature, but so can the High Performance setting or changing min/max power values.
What's the DRAM/CCX interconnect bandwidth/latency with a few parked cores? i.e. does it change?
Phenoms changed drastically.
What did you attain for L1, L2, L3 access latency? (Have you tried Franck @ CPUID's tool?)
What is Ryzen's idling frequency/voltage?
What are idle/load stock CPU temps like? Do the sensors seem accurate? (compared to K10 gen)
Can you monitor core throttling? Have you tried loading stock Ryzen to see if it throttles?
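On the parking questions above: besides Core Park Manager and the High Performance plan, parking can be disabled per power plan from the command line using the standard powercfg aliases (shown for the AC setting of the active scheme; this is a sketch, not AMD's recommended fix):

```bat
:: Require 100% of cores to stay unparked on the current plan (AC power).
powercfg -setacvalueindex scheme_current sub_processor CPMINCORES 100

:: Re-apply the scheme so the change takes effect.
powercfg -setactive scheme_current
```

The Balanced plan's default CPMINCORES value is what allows cores to park in the first place, which is why switching plans changes the behaviour.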
Trying to explain performance issues...
1. In every ideal arch, these areas are on separate power planes and clock domains... Completely separate voltage islands.
This is a major shortsightedness by AMD with any 'IMC problems'. Phenom suffered due to IMC clocking and power shenanigans. That was 2006.
Having a linked CCX power plane is even still backwards.
It's entirely possible that clocking might be impaired more by their CCX than purely by the 14nm LPP process. It was like that with Phenom's IMC/L3 previously.
AMD implemented decoupling characteristics back in 2007/2008 silicon and saved a ton of power and performance.
It is never a case of decoupling alone, though, and then it just works. DRAM<->L3<->Fabric is very tricky to get right. Clocks and power generally pose major sync and timing issues that must be handled to avoid corrupted data. Ryzen's implementation is simply symptomatic of a quick and easy job due to time constraints. It's obvious they couldn't afford to spend as much time as they needed on it.
Going forward, it would be the first area they would look to change.
2. These 'Windows issues' are AMD issues. We had them with Phenom, for Christ's sake! When the workload bounced, with CnQ active, performance sucked and stuttered. AMD would have KNOWN about these issues during 'design considerations' 5 years back. It's AMD who has to adapt and get these fixed. Borked chip releases are only destroying their own image and income.
Secondly, Core Parking has to be implemented by AMD. If you don't have a working driver for your test OS, why allow this?
Telling reviewers to switch to the High Performance profile, which doesn't park the cores, is a band-aid at best and very misleading about real-world performance. Ryzen obviously has issues sleeping and waking the cores. Again, a Phenom issue.
3. Having two CCX at low interconnect bandwidth/high latency is an even bigger flaw, but this again will not be by design. This is going to pose a huge problem on Server workloads unless fixed. Forget HPC altogether.
4. Now you see why Server was not launched. AMD chose their smallest, least risk market to troubleshoot the chip.
5. Intellectual prediction won't magically gain 10-15% performance -- this is wishful thinking, like the pre-release hype, every bit of which turned out wrong.
I'm sure K10 had an app that could change the skew when different PStates are entered, and even force certain PStates.
And it would show separate power planes portions.
Sent from HTC 10
(Opinions are own)
Regarding Excavator: can you disable CMT in the BIOS?
Regarding Himeno and N-body: is it likely another size of these benchmarks would give a very different result?
3D Euler seems to be hard to predict; Caselab and CFD are very different.
3D Euler as tested at techreport.com is even worse for Ryzen.
Guys, what's the best temp monitoring tool for Ryzen? I would like to use Core Temp... if it works correctly. Or HWiNFO sensors? Thanks.
I guess the HWinfo temperatures are correct at least. But my Crosshair just bricked, so any further testing has to wait. This could be a serious bug btw. I'm not the only one it seems. Edit: And this one seems to be similar as well.
Great Thread THX for the effort
I guess it is better if I ask this question in this thread.
I was wondering about something. I read that for some situations, like gaming, it is better to disable SMT.
But with SMT disabled, threads on the 8 cores can stall just as often as with SMT enabled; the difference is that with SMT enabled, other threads take over the execution time and fill in the gaps, whereas with SMT disabled, when a thread stalls, the core stalls as well. This would mean that Ryzen dissipates less heat with SMT disabled, because there are moments when the core does less work, allowing for higher sustained overclocks and boost clocks.
Does that make any sense, or am I overlooking something?
I mean, with 8 cores, that would be sufficient for a lot of people.
Also, I have read that going for 2666MHz single-rank (memory chips on one side of the module only) DDR4 modules is the best option to get high memory speeds.
Is that true?
I would imagine the same way you'd get it working on any new hardware:
Either slipstream the drivers, or
Have them ready on a USB stick and tell Windows about it at the right moment.
The chipset drivers you should be able to get from most motherboard makers, for example:
If you've never slipstreamed before, it's probably best if you follow a guide preferably starting with the SP1 ISO from Microsoft:
https://www.microsoft.com/en-gb/software-download/windows7 (that now requires your product key which is hassle especially if you want to make a universal disc)
This is a guide to slipstreaming the Intel RAID drivers:
That guide uses NTLite, which makes it easy, but it is possible to do the same without third-party tools.
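If you'd rather skip NTLite, the same slipstreaming can be done with the built-in DISM tool. The paths below are placeholders and the WIM index for your edition may differ, so this is only a sketch of the sequence:

```bat
:: Mount the install image (check the right index first with /Get-WimInfo).
dism /Mount-Wim /WimFile:C:\win7\sources\install.wim /Index:1 /MountDir:C:\mount

:: Inject every driver package (.inf) found under C:\drivers.
dism /Image:C:\mount /Add-Driver /Driver:C:\drivers /Recurse

:: Save the changes back into the WIM.
dism /Unmount-Wim /MountDir:C:\mount /Commit
```

For the installer itself to see USB3 ports on these boards, you generally have to repeat the /Add-Driver step on boot.wim as well.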
Wow. That's incredibly impressive, especially when compared to results for existing laptops. The only ones that get a higher Cinebench ST score than that are two Clevo desktop replacements that use enthusiast K-series CPUs (and thus have high TDPs). For normal laptop CPUs, the best score recorded was 147.74, on the Dell XPS 15. That laptop uses a Core i7-6700HQ Skylake CPU with a TDP of 45W. And Ryzen can beat that by ~5% at two-thirds the TDP?!
The R7 1700 SHOULD have a 3.05GHz ACXFRC (due to being non-X); however, there is indication that it's not necessarily the case.
Unfortunately I don't have the data available, to check it right now.
Based on my own tests, the average performance benefit from higher than 2400MHz DRAM (i.e. 1200MHz DFICLK) is very marginal in 2D, even with an 8C/16T config. With smaller core counts even 2133MHz is fine.
There are various interfaces / fabrics inside Zeppelin, and their functionality and relations are not fully known at the moment. So even though the data fabric operates at half the effective MEMCLK, that doesn't necessarily mean that the inter-CCX connections are operating at the speculated width and speed. There are parts of the fabric which are 256-bit wide, for example.
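To put rough numbers on that, here's the back-of-the-envelope math under the speculated figures. Both the half-MEMCLK fabric clock and the 32 B/cycle link width are assumptions, since, as noted above, the real widths aren't fully known:

```python
def fabric_bandwidth_gbs(dram_mts, bytes_per_cycle=32):
    """Theoretical fabric bandwidth in GB/s, assuming the fabric clock is
    half the effective DRAM transfer rate and an assumed link width."""
    fabric_mhz = dram_mts / 2  # e.g. DDR4-2666 -> 1333 MHz fabric clock
    return fabric_mhz * 1e6 * bytes_per_cycle / 1e9

# DDR4-2400 -> 1200 MHz fabric, ~38.4 GB/s with an assumed 32 B/cycle link.
# DDR4-2666 -> 1333 MHz fabric, ~42.7 GB/s under the same assumption.
```

If some fabric segments really are 256-bit (32 B) wide while others are narrower, the inter-CCX figure could land well below this ceiling.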
Moving to another node (such as 16nm FF+) is really a no-brainer if it yields even a <10% increase in Fmax. Porting the design will be extremely expensive, however not NEARLY as expensive as trying to increase the IPC of the µarch itself. The results in terms of performance are always guaranteed when increasing the frequency of an existing design, while that's not the case with a modified µarch featuring higher IPC. The modified µarch may not necessarily be able to reach the same speeds as the old one did, so the actual performance may remain the same or even degrade.
Unfortunately I don't have the equipment to do that.
So does it make more sense, cost- and time-wise, for AMD to use Samsung 14nm LPU or to fully port their design to TSMC 16nm FF+?
TSMC's 16FF+ seems to clock higher than Samsung's 14nm LPP/LPC, but isn't LPU supposed to catch up here?
Also, GloFo will skip 10nm, right?