I have a source inside the PC testing circles; he had access to Ryzen from the first samples through QS samples, and he posted two interesting things about it.
Basically, you are actually BETA testing.
Motherboard, memory, overclocking and BIOS: nothing has given me issues. The product competes and feels finished to me. It looks right about on par, with possibly just stability and power-state issues at present.
> I have a source inside the PC testing circles; he had access to Ryzen from the first samples through QS samples, and he posted two interesting things about it. Basically, you are actually BETA testing.

AMD basically pulled an all-nighter with Zen.
> I think you misread/misheard. The only update for Linux, which happened right away, was to correctly assign SMT threads.

I think you might be confusing process scheduling with instruction scheduling. The latter is a function of the compiler; Windows apps, like Linux distro packages, generally optimise for a generic CPU to maximise compatibility.
-znver1 is relying upon the btver1 scheduler model, and btver1 is for AMD's Bobcat. People are finding that using the Haswell scheduler model improves performance by 5-10% on Linux, but a proper Zen scheduling model is in the works that should bring more like 10-20% improvements in some cases.
And this is exactly what Windows needs, its own scheduler model for Ryzen...
> Since we know of two different games that treat Ryzen as a 16 core processor it made me think. Is there a way to force a process to treat the CPU as 8/16 instead? If this is an issue with more games, could this be the cause of the less than optimal performance we are seeing with SMT enabled in some games?

Win10's scheduler already does that. Besides, didn't you see that explicitly setting the R7 as an 8-core in F1's case led to a measly 3% improvement?
> But if the scheduler already does that, how come there is any improvement at all?

Because SMT, by virtue of simply being enabled, gimps a few queues (the uop queue, retire queue and store queue) with static partitioning. If you can harness the full throughput with SMT (and Zen apparently does have more non-AVX throughput than Skylake), it is great. Not so much otherwise.
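A toy model of what static partitioning means for a single thread (the mechanism is as described above, but the entry counts and function names here are my own illustration, not measured Zen figures):

```python
# Toy model: with SMT enabled, statically partitioned structures are split
# in half per thread even if the sibling thread is idle, so one thread sees
# a smaller effective window. Competitively shared structures do not shrink.
def effective_entries(total_entries, smt_enabled, partitioning="static"):
    if not smt_enabled:
        return total_entries
    if partitioning == "static":      # e.g. the uop/retire/store queues
        return total_entries // 2     # fixed half per hardware thread
    return total_entries              # competitive sharing: up to full size

RETIRE_QUEUE = 192                    # hypothetical size, for illustration only
print(effective_entries(RETIRE_QUEUE, smt_enabled=False))  # 192
print(effective_entries(RETIRE_QUEUE, smt_enabled=True))   # 96
```

This is why a workload that cannot fill the core from two threads can end up slower with SMT on: each thread runs against the halved queues regardless of how busy the sibling is.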
> One thing that does scale, in the real world, with frequency that is unexpected: memory performance. Usually changing the core frequency doesn't have much of an impact on memory reads and writes - maybe 500MB/s or so. I'm seeing 35GB/s changing to 43GB/s going from 3GHz to 3.8GHz - and Geekbench memory scores jumping from 3500 @ 3GHz to 4000 @ 3.8GHz.

How fast is the memory you're running? Do you see changes in latency? Someone (I forget who) was saying they were seeing occasional 300ns latencies. I'm wondering if there's a general problem between the CPU and the MC?
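The numbers quoted above are worth a quick sanity check: if bandwidth were independent of core clock, it should barely move between 3.0 and 3.8GHz, but it tracks the clock surprisingly closely.

```python
# Using the figures from the post above: compare how much of the core-clock
# increase shows up as a bandwidth increase.
core_ratio = 3.8 / 3.0     # core clock scaling factor, ~1.267
bw_ratio = 43.0 / 35.0     # measured bandwidth scaling factor, ~1.229

fraction = (bw_ratio - 1) / (core_ratio - 1)
print(f"clock ratio:     {core_ratio:.3f}")
print(f"bandwidth ratio: {bw_ratio:.3f}")
print(f"fraction of clock gain realised: {fraction:.2f}")  # ~0.86
```

Roughly 85% of the clock gain shows up as bandwidth, which suggests the cores themselves (or something clocked with them) were the bottleneck at 3GHz, rather than the DRAM; that would be unusual for most x86 designs and fits the question above about a CPU-to-MC problem.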
> RAM acts as a last-level cache

This is an oxymoron and makes no sense. Please stop repeating it from wherever you heard it.
> This is an oxymoron and makes no sense. Please stop repeating it from wherever you heard it.

So the problems may begin at the level of the last two points, right? The more requests that miss and fall back to the L3 or RAM level, the more cycles are lost waiting for data, because the fabric bandwidth is shared between all the requests.
My understanding of the cache-lookup procedure on Ryzen is as follows:
- First cache lookup is to L1-D of same core. L2 isn't touched unless this fails. At this point TLB lookup (for virtual memory) has already succeeded, one way or another.
- Next the local L2 cache is checked. This takes a bit longer, as there are more ways to go through and it's further away.
- If the local L2 cache misses, the request is sent straight to the L3 cache. This holds, among other things, a partial copy of the L2 tags for other cores in the same CCX. If one of *those* hits, the request is then forwarded to that L2 cache in a partially-decoded state; the correct cache completes the lookup and supplies the data. Because the L2-L1 hierarchy is inclusive, it is not necessary to also perform a lookup in other L1 caches.
- If the L2 tag lookups fail, the L3 lookup was already in progress and now completes. If it hits, the data is supplied by the local L3 cache and promoted to the L2 and L1 caches.
- If the local L3 cache lookup fails, the request is broadcast over Infinity Fabric to the other L3 cache(s) in the system (possibly plural because of Naples & Snowy Owl). The hybrid L2/L3 lookup procedure above is thus repeated.
- If all of these lookups fail, the request is routed to the appropriate RAM controller. This is the final resort, and is only initiated after *all* possible cache lookups have proved fruitless.
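The cascade above can be sketched as a toy model. The class layout and names are my own illustration of the described lookup order, not AMD's actual implementation; in particular, the "L3 holds shadow copies of sibling L2 tags" step is modelled by simply probing sibling L2s before the local L3.

```python
from dataclasses import dataclass, field

@dataclass
class Core:
    ccx_id: int
    l1d: dict = field(default_factory=dict)   # address -> data
    l2: dict = field(default_factory=dict)

@dataclass
class CCX:
    cores: list
    l3: dict = field(default_factory=dict)

def lookup(addr, core, ccx_list, dram):
    """Return (level_found, data) following the described lookup cascade."""
    my_ccx = ccx_list[core.ccx_id]
    # 1. Local L1-D first; nothing else is touched if this hits.
    if addr in core.l1d: return ("L1", core.l1d[addr])
    # 2. Local L2.
    if addr in core.l2: return ("L2", core.l2[addr])
    # 3. Local L3, whose shadow L2 tags may redirect to a sibling core's L2.
    for sib in my_ccx.cores:
        if sib is not core and addr in sib.l2:
            return ("sibling-L2", sib.l2[addr])
    if addr in my_ccx.l3: return ("L3", my_ccx.l3[addr])
    # 4. Broadcast over Infinity Fabric to the other CCX(s).
    for ccx in ccx_list:
        if ccx is my_ccx: continue
        for sib in ccx.cores:
            if addr in sib.l2: return ("remote-L2", sib.l2[addr])
        if addr in ccx.l3: return ("remote-L3", ccx.l3[addr])
    # 5. Last resort: the RAM controller.
    return ("DRAM", dram[addr])
```

For example, with data sitting only in a sibling core's L2, a lookup from another core in the same CCX would return `("sibling-L2", ...)` rather than falling through to DRAM.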
This. The info is incomplete; support for these memory speeds won't be for AM4 but for another platform.
Not sure if that was in reference to my last post, but all of those are AM4, and the same platform as the ones we have right now.
The reason it's rather challenging to add new memory-divider support for existing CPUs is that boards are designed with specific trace lengths, and all signal routing is done for specific memory frequencies. It's not all the same from 2133 to 3600MHz. So even though the dividers may be added via microcode, they may not work at all, simply because the board's track/trace layout cannot handle such frequencies. Again, I'm not saying it's impossible, but these frequencies are for another platform built with these memory dividers in mind from the beginning.
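For reference, a divider is just a ratio applied to the reference clock, and the advertised DDR4 number is twice the resulting memory clock. A minimal sketch, assuming a 100MHz reference clock; the specific ratio values below are illustrative, not an actual BIOS table:

```python
# DDR4 data rate from a memory divider: MEMCLK = REFCLK * ratio, and DDR
# transfers twice per clock, so the advertised DDR4-xxxx figure is 2*MEMCLK.
REFCLK_MHZ = 100.0

def ddr_rating(ratio):
    memclk = REFCLK_MHZ * ratio      # actual memory clock in MHz
    return round(2 * memclk)         # DDR4 data rate in MT/s

for ratio in (12.0, 13.33, 14.66, 16.0):
    print(f"ratio {ratio:>5} -> DDR4-{ddr_rating(ratio)}")
```

Adding a new, higher ratio is trivial in firmware; the point of the post above is that the signal integrity of the board traces at the resulting memory clock is the hard part.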
> My understanding of the cache-lookup procedure on Ryzen is as follows:

Your understanding is most likely wrong, because the hardware.fr tests strongly imply that any accesses in any block larger than 8MB go straight to DRAM.
> Your understanding is most likely wrong, because the hardware.fr tests strongly imply that any accesses in any block larger than 8MB go straight to DRAM.

They assumed this after observing the abnormal jump in latency when the accesses grow larger than 8MB, right? If the latency goes sky-high, that means there is an access to RAM. The test is empirical, so is it possible to assume it's 100% correct?
> Your understanding is most likely wrong, because the hardware.fr tests strongly imply that any accesses in any block larger than 8MB go straight to DRAM.

Could the latency of cross-CCX access be the same as that of the memory?
> They assumed this after observing the abnormal jump in latency when the accesses grow larger than 8MB, right? If the latency goes sky-high, that means there is an access to RAM. The test is empirical, so is it possible to assume it's 100% correct?

It's very easy to test the wrong thing with a micro-benchmark; one should be very careful about such things.
> Your understanding is most likely wrong, because the hardware.fr tests strongly imply that any accesses in any block larger than 8MB go straight to DRAM.

Ah, but that's expected given the above, knowing that the L3 cache only contains data which has been evicted from its *local* L2 caches. So if you run a single-threaded microbenchmark, you'll see an 8MB LLC.
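For context, latency curves like the hardware.fr ones typically come from a pointer-chasing test: a random cyclic permutation defeats the prefetchers, so each dependent load's latency is exposed, and time per hop steps up as the working set outgrows each cache level. A minimal sketch of that methodology (my assumption about their approach, not their actual harness):

```python
# Pointer-chasing latency sketch: walk a random cyclic permutation so every
# load depends on the previous one and the prefetchers cannot help.
import random, time

def time_per_hop(n_slots, hops=1_000_000):
    # Build a random cyclic permutation: chain[i] is the next index to visit.
    idx = list(range(n_slots))
    random.shuffle(idx)
    chain = [0] * n_slots
    for a, b in zip(idx, idx[1:] + idx[:1]):
        chain[a] = b
    i = 0
    t0 = time.perf_counter()
    for _ in range(hops):
        i = chain[i]
    return (time.perf_counter() - t0) / hops  # seconds per dependent load
```

In Python the interpreter overhead swamps the cache effects, so absolute numbers are meaningless; the same loop written in C, swept across working-set sizes, is the standard way such latency plots are produced, and a single-threaded sweep would indeed show the step near 8MB discussed above.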
> I have a theory the dual CCX idea is basically a working development board for refining software (/hardware) for future AMD MCM (/"infinity fabric") systems. And further the dual CCX idea is currently a more applicable and "simplified" (i.e. lower initial complexity [software already knows SMP, now only 2 distinct sections needing communication]) version of Bulldozer's CMP idea; which I postulate had the same final goal (i.e. software/platform development on a competitive system) but was much too drastic a change for software to capitalize upon.

No, I think the CCX module is genuinely a tool for making larger multi-core designs than would otherwise be feasible for AMD. They are using the same Infinity Fabric based design for the larger Naples and Snowy Owl server/workstation parts, which are in fact MCM.
Did you say strictly technical or tin foil hats welcome?
> The latency of inter-CCX cache access is still less than going all the way to RAM, but is more complicated to measure.

To clarify all the CCX stuff, I've sent a few questions to my contact at AMD and hopefully we'll have some answers by the weekend. I specifically asked about latency to all points of the system, including from CCX0 to IMC1 and IMC2 in a Ryzen 7 processor.