Any chance you will do an updated version of that test? The lack of proper freesync vs gsync comparisons in the last year+ is pretty sad.
The 'dragged tooth and nail' you are referring to was actually multiple days' worth of testing and retesting on our part, calls to vendors, and attempts to replicate the very specific worst-case scenario that was being reported all over as some devious scandal. The reason for the additional testing was that we were not seeing the issue to the same extreme as other reports were indicating. Instead of just jumping the gun and posting 'we don't see it', we spent the additional time to push the systems / games / VRAM to the point where we could see it. In the end, we had to push upwards of 150% of 4K resolution, which I should remind you is still used by less than 1% of the install base over two years later.
Going back to your original statement, if you are referring to Frame Rating as 'special tests' we busted out, I should point out that we have been using Frame Rating on pretty much every single GPU review since Jan 2013. We did not 'bust out' a special thing to investigate that issue. The then two-year-old tool set was simply used to show that the issue turned out to not be as drastic as some folks were making it out to be.
I don't see how investigating issues and reporting our findings makes us 'jokers', but whatever floats your boat I guess.
The "devious scandal" was that Nvidia flat-out lied to reviewers in its spec sheets, claiming that the GTX 970 had 64 ROPs and 2 MB of L2 cache when in fact it had only 56 ROPs and 1.75 MB of L2 cache. This was illegal and immoral regardless of whether any of it had any real-world effect on performance or not.
Guys, has it occurred to anyone that the CCX switching and the slower-than-expected gaming performance are all connected by one root cause?

I am convinced the data fabric is multiple HyperTransport 3.1 links from various locations: four links for the memory controller, the same going between the CCXes, a couple of links for PCIe, a link for all the standard I/O, and however many for the GMI links.
The reason we have been observing these anomalies has nothing to do with Windows scheduling. It all comes down to the Data Fabric not having enough bandwidth to cope with the load placed on it when it is dealing with a CPU doing large amounts of calculation with the associated memory access, combined with the graphics-related load being pushed to the PCIe controller, combined with storage access, combined with the CCX switching.
When frame rates get really high, as with a Titan X at 1080p, the calculations and their associated memory accesses, the CCX switches, the storage accesses, and the stream of instructions telling the GPU to draw 200 fps all have to use the Data Fabric at the same time. That leads to contention for the available bandwidth and increased latency for every component in use, so the CPU or the GPU has to stop momentarily and wait before it can continue processing. All the added wait states combine to give the computer its reduced apparent performance.
Ethernet networking over dumb hubs instead of switches has the same issue: once traffic load on the wire hits about 40%, the number of packet collisions increases exponentially and network performance pretty much stops. With gigabit networks and network switches we do not see those sorts of performance drop-offs on LANs now, the way we used to on 10 Mbps coax networks 25 years ago.
When running a compute-only benchmark like Cinebench, for example, there is not the same amount of traffic on the Data Fabric without the added graphics load, and it does not get close to the point on the bandwidth/performance curve where performance starts being impacted.
The Data Fabric is clocked at half the memory frequency, as I note has been discussed already, so 3200 or 3600 MHz RAM should increase the available bandwidth and help alleviate the issue. Because memory support was poor on day one, the reviews tended to test with slower memory speeds, which exacerbated the DF bottleneck and made the performance drops more pronounced.
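If the 32 B/cycle figure from AMD's clock-domain slide (mentioned further down the thread) is right, the scaling with memory speed is easy to sanity-check. A quick sketch; the 32 B/cycle width and the half-rate fabric clock are assumptions from this thread, not confirmed specs:

```python
# Back-of-the-envelope Data Fabric bandwidth, assuming 32 bytes per
# fabric clock and a fabric clock of half the DDR transfer rate
# (i.e. the fabric runs at the actual memory clock, MEMCLK).
def fabric_bandwidth_gbs(ddr_rate_mts: int, bytes_per_cycle: int = 32) -> float:
    """One-direction Data Fabric bandwidth in GB/s for a DDR4-<rate> kit."""
    fabric_clock_mhz = ddr_rate_mts / 2          # e.g. DDR4-3200 -> 1600 MHz
    return bytes_per_cycle * fabric_clock_mhz * 1e6 / 1e9

for rate in (2133, 2666, 3200, 3600):
    print(f"DDR4-{rate}: {fabric_bandwidth_gbs(rate):.1f} GB/s")
```

Going from DDR4-2133 (~34 GB/s) to DDR4-3200 (~51 GB/s) is a ~50% bump in fabric bandwidth, which would explain why faster RAM helps more than its raw latency numbers suggest.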
If you want to see the contention/bandwidth issues in action, take a look at Ryzen Firestrike results and compare them to 6900K results with the same GPU: the Ryzen Graphics score is good and the Physics score is good, but the Combined score, which all the reviewers ignored, is terrible.
I'll say I've seen some odd behavior in Windows 10. With SMT disabled, while doing a Cinebench single-thread test, I've seen the load move from one core to another in the MIDDLE of a chunk of render... which seems really odd and wasteful.

In terms of Ethernet: Ethernet as used with "dumb hubs" is a half-duplex protocol. Data fabrics are designed to support multiple full-duplex data paths, and for that reason are used in Ethernet switches, which are very much more efficient than Ethernet hubs and can support Ethernet in full-duplex mode. In fact, it's impossible to use Gigabit Ethernet with dumb hubs, because it doesn't *have* a half-duplex mode.
So forget about 40% of anything.
Based on stuff I've found out in the past few days, here's how Windows appears to assign threads to cores. I'm basing this on experiments with Windows 7, but I believe Win10 is essentially similar.
1: After every timeslice, the core assignment for each thread is calculated anew, and is not stable. This results in frequent migrations of threads between cores, unless only one core is available to that thread (of which more below) or all other available cores are already busy.
2: The timeslice is based on the thread priority, which itself is influenced by the process priority. Higher priorities get longer timeslices - "realtime" gets a (pseudo?) infinite timeslice. You can therefore reduce the thread migration frequency just by increasing the process priority.
3: Core availability is determined solely by Core Parking. The scheduler itself **is not SMT aware**, but the Core Parking algorithm is. Core Parking is not part of the scheduler - it is part of the power management subsystem, and its tunables can be made available in the Power Options control panel with some registry tweaks. I emphasise: if all cores are unparked, Windows will happily migrate threads randomly onto both physical and virtual cores, just as if they were all physical.
4: If the thread or process has CPU Affinity set, that further limits the available set of cores for that thread, on top of Core Parking. However, if all affinity-set cores for a thread are parked, Core Parking is overridden and the thread is assigned to one of its affinity cores anyway. Core Parking can detect this and will unpark cores which see significant affinity activity. NB: a parked core is not necessarily shut down (in C-state) - but a parked core is *usually* idle and *therefore* usually shut down. C-stated cores are necessary to go beyond the all-core turbo speeds automatically.
5: On Windows 7, by default all physical cores have at least one virtual core unparked at all times. This might be different on Win10, based on screenshots. There is a toggle for this behaviour.
6: The Core Parking algorithm is very dumb - it doesn't seem to know how many threads are "runnable" at any given instant - and is therefore tuned to be very liberal about how many cores are unparked. I can fully believe that it will unpark about twice as many cores as there are full-time threads. This means that a 2T workload will appear to be spread evenly across 4 cores on a many-core system - but this **must not** be taken as evidence of NUMA awareness.
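For anyone who wants to poke at those tunables themselves: on the Windows builds I've looked at, the hidden Core Parking knobs can be exposed or set with `powercfg` from an elevated prompt, which is equivalent to the registry tweaks mentioned above. `CPMINCORES`/`CPMAXCORES` are the stock setting aliases; adjust values to taste:

```shell
:: Unhide the core parking tunables so they appear in Power Options
powercfg -attributes SUB_PROCESSOR CPMINCORES -ATTRIB_HIDE
powercfg -attributes SUB_PROCESSOR CPMAXCORES -ATTRIB_HIDE

:: Or set them directly: keep 100% of cores unparked on AC power
powercfg -setacvalueindex SCHEME_CURRENT SUB_PROCESSOR CPMINCORES 100
powercfg -setactive SCHEME_CURRENT
```

This is Windows-only configuration, so treat it as a starting point for experimentation rather than a guaranteed fix.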
In short, it's a mess. As I noted over on Reddit, Linux does all of this about 500% better, because Linux people care about technical results, not risk-averse arse-covering.
Well, now the chickens have come home to roost in Redmond, as far as technical debt in the basic scheduler is concerned. Here are a couple of things that could be fixed by one competent engineer in a week:
1: Make the core assignment algorithm stable, i.e. so that it prefers to assign the core that the thread is already running on, or that it last ran on (if it isn't busy). This will have immediate performance benefits on *all* SMP, SMT, CMT, and NUMA systems, not just Ryzen. Less context-switch overhead, fewer cache misses, better branch prediction, more truly-idle cores that can be C-stated. It's pure win.
2: Rework the Core Parking algorithm to use the runnable thread count to guide the number of unparked cores. On NUMA systems, unpark cores preferentially on the same node for runnable threads belonging to the same process, and preferentially on different nodes for runnable threads in different processes. Better yet, scrap Core Parking entirely and work on the best practical implementation of Point 1.
That, of course, is how Linux does it.
It's possible. Even if it isn't that big of a deal per se, there are obviously problems with CCX switching that can be alleviated by increasing data fabric speeds. It all points to a need for higher memory clocks.
If you have a C6H board you do not need any AM4 cooler kit. They have additional holes in the board to enable mounting of AM3 coolers. My TR Macho cooler runs perfectly on the C6H with the AM3 attachment.

I asked this before and I ask it again: did anyone set up their own Windows power profiles?
And if so, do "Core Parking" settings have an effect on Ryzen scheduling? In my experience it's not a CPU feature, but an OS feature with a fancy name where the OS scheduler tries to keep threads from jumping cores in order to allow other cores to enter deep(er) sleep states without the extra penalty of waking them up too often. This maybe could help keep threads from jumping to another CCX?!
Asus CH6 board only started shipping from most German retailers yesterday (just got it), so I still could not build my own rig. And even then I may need to order another CPU cooler, because the Arctic Liquid Freezer gets its AM4 retention module as late as April.
I am not suggesting that Ryzen has no other issues. It is a completely new architecture. The apparently unprepared motherboard manufacturers scrambling to catch up with BIOS revisions would suggest that AMD were rather secretive during development.

AMD released a GCC patch over a year ago which contained all the information anyone needed to optimize for Zen, aside from a few minor details. AMD then applied improvements to the Linux kernel to better handle Ryzen; that should have given any kernel developer the information they needed.
While the total bandwidth is 40 GB/s, multipoint networks cannot use all of it at once.

It would be point-to-point for critical paths, I'd imagine, as this was specifically mentioned in some documents (using an upper metal layer as a dedicated data bus). I believe it was The Stilt who said that there were areas that were 256 bits wide (the die shot would certainly allow plenty of room for that). That is EIGHT HT 3.1 links, assuming that is what they used (rather than some variant thereof), so they must be doing some dedicated pathways.
@looncraz, where does this 22 GB/s figure for the IF come from?

Several sources from AMD have given the 22 GB/s figure to various people, but I would not be surprised to see it be higher in aggregate. The figure, though, does make sense: 32 bytes/cycle is just four HT 3.1 links (32 bits × 4 links × 2 for DDR = 256 bits = 32 bytes).
According to the AMD slide with the clock domain, the IF can do 32B/cycle.
Even using the slowest case, 800 MHz for DDR4-1600, gives me 25.6 GB/s in one direction.
The only thing I could think of lowering this is CRC.
As there are separate control and data parts, it should not be lowered by signaling.
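The link-width arithmetic above is easy to verify. A quick check under the thread's assumptions (four 32-bit HT 3.1-style links, DDR signaling, fabric clock equal to MEMCLK; none of this is confirmed by AMD):

```python
# Verify: four 32-bit DDR links give 32 bytes per fabric cycle,
# and at an 800 MHz MEMCLK (DDR4-1600) that is 25.6 GB/s one way.
links, link_width_bits, ddr_factor = 4, 32, 2
bits_per_cycle = links * link_width_bits * ddr_factor   # 256 bits
bytes_per_cycle = bits_per_cycle // 8                   # 32 B/cycle, matching the slide

memclk_hz = 800e6                                       # DDR4-1600 -> 800 MHz MEMCLK
bandwidth_gbs = bytes_per_cycle * memclk_hz / 1e9
print(bandwidth_gbs)                                    # 25.6
```

So even the slowest officially supported memory already exceeds the quoted 22 GB/s, which is why the 22 GB/s figure looks more like a measured effective number than a raw link limit.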
Linux testing is next week. Virtual machine performance testing is part of that, though I was mostly going to use that for my own purposes (I'm trying to get rid of Windows and go back to running anything else as my main OS... but Linux went in the wrong direction for quite a while... Cinnamon is nice, though).
Let me know what MB you are using and how the IOMMU groups look.
I'm referring to all the 'special' tests PCPer busts out on an AMD launch, yet they have to be dragged tooth and nail to even acknowledge issues like the 4 GB/3.5 GB memory issue with the 970.

They could use the slow-motion camera they found just in time to show the FreeSync ghosting issue, which was actually a panel issue and nothing wrong with FreeSync.
But since you are in the mood for investigating, I wonder if you could set up your 4-core test to replicate this scenario and show the results. I'd like to know where the choke points are on a 4-core system. Is there anything positive you can investigate on Ryzen, or are you only interested in exploiting a single lower-than-expected result (which is being rectified with developers): gaming?
Anyway, Jayz2Cents seems quite surprised at the results here, like it is something he hasn't experienced before, which points to advantages with 8 cores. Any chance of some time spent on this?
Not all games/apps are impacted by these issues the same way. Setting the high-performance power profile helps, but that isn't the end of it, as that disables the higher turbo states!

And again I wonder why no one here tries to set up their own Windows power profile? It's not like Windows' behavior cannot be fine-tuned (partly even controlled) in many regards. Power profiles have been mentioned so often, yet no one has reported trying to get their hands dirty on a custom profile.
Good to know, thanks. I only got the board and CPU yesterday and was mostly out, so I'm going to try the AM3 retention kit today. Before I read your post I got my old 4790K stock cooler out of the basement, hoping it would use the old clamp design, but it does not. Thanks again for pointing me there.
I've been trying that myself in Win7, centred around the unlockable tweaks to the Core Parking algorithm I mentioned above. Unfortunately it's very hard, and maybe impossible, to make this algorithm behave correctly using just these tweaks, even with respect to the much simpler Kaveri APU I'm currently stuck with. It just doesn't have the correct data as inputs.