Ryzen: Strictly technical

Blake_86 · Mar 15, 2017

Kromaatikse said:
The difference with Ryzen is that the "potential performance" does actually exist this time, and is easily demonstrated in productivity benchmarks. Games are achieving anomalously low performance in comparison, though it's still much better performance than anything from the Bulldozer family could even dream about. These are fixable problems.

the milion dollars question is "when" and how much time will take to these cpu's to fully be supported by game developers.
I'know, cpu demand is not 100% by gamers, but still an important part of the market share.
Most important, how much time will take to have motherboard whith decent BIOS possibly out of the box? cause ii is frustrating to buy a module of ram only because the one you chose/already have dont post with the current bios and you have to update first

richaron · Mar 15, 2017

Kromaatikse said:
No, I think the CCX module is genuinely a tool for making larger multi-core designs than would otherwise by feasible for AMD. They are using the same Infinity Fabric based design for the larger Naples and Snowy Owl server/workstation parts, which are in fact MCM.

Nope they could easily make more cores in a single segment. We are talking about a company with vast experience with GPUs and even more CPU cores in the past...

Kromaatikse said:
As for Bulldozer, it's been proved conclusively that there was no "potential performance" that could feasibly be unlocked by adopting better optimisations. Not only that, but K10 would in many workloads have achieved better performance in the same power and die size budgets on the same process, compared to *either* Bulldozer or Piledriver. Steamroller and Excavator might have been small improvements, but not nearly enough to justify the R&D costs and ecosystem disruption.

Performance comparisons of different individual cores is a moot discussion; I've not raised the subject, nor am I interested in arguing it. I would however be interested in seeing the conclusive proof "there was no "potential performance" that could feasibly be unlocked by adopting better optimisations" with the CMP line.

Kromaatikse said:
These are fixable problems.

I've heard something like this before...

looncraz · Mar 15, 2017

keymaster151 said:
Since we know of two different games that treat Ryzen as a 16 core processor it made me think. Is there a way to force a process to treat the CPU as 8/16 instead? If this is an issue with more games, could this be the cause of the less than optimal performance we are seeing with SMT enabled in some games?

Ignore the other guy - he clearly doesn't realize that games frequently use their own hardware detection methods.

If you use Windows' GetLogicalProcessorInformation() - the information for Ryzen, at least in terms of core counts, is accurate - it shows 8 cores and 16 threads. I haven't verified the caches, yet.

Games, however, frequently use the cpuid data manually - and if there's an AMD CPU, they don't do anything more than count cores. If it's Intel, they check to see if Hyper-threading is enabled. All they really did, with the game, was make note that Ryzen has SMT capabilities (and really good SMT capabilities at that - besting Intel almost across the board... simply amazing).

This manual hardware detection is done largely because games are designed to be at least somewhat portable (to consoles, for example) and the same basic code base is used on all systems. It doesn't help that Windows' C/++ API are a fractured mess.

lolfail9001 · Mar 15, 2017

looncraz said:
Ignore the other guy - he clearly doesn't realize that games frequently use their own hardware detection methods.

Do they like to shoot themselves in the foot that much? Can't say i am surprised...

looncraz said:
Games, however, frequently use the cpuid data manually - and if there's an AMD CPU, they don't do anything more than count cores.

Uhm, considering that on AMD CPUs since Bulldozer the odd numbered CPUs are best ignored in favor of even numbered ones, they clearly like to shoot themselves in the foot, if that's true.

looncraz · Mar 15, 2017

dnavas said:
How fast is the memory you're running? Do you see changes in latency? Someone (I forget who) was saying they were seeing occasional 300ns latencies. I'm wondering if there's a general problem between the CPU and the MC?
Your Windows 7 problems are troubling, given I'm using Win7, though I think I'll wait on making more than just memory system changes atm.

I tested with two memory configurations: DDR4-2133 CL15-15-15-35 2T and DDR4-2667 CL15-15-15-35-1T.

I don't have the results on this machine, but Ryzen is very sensitive to memory performance - to the point that it even impacts benchmarks that are usually unaffected by memory clocks. That, of course, is because the data fabric runs at the same speed as the IMC... a strange, and somewhat infuriating, choice.

What's strange is how core clock impacts memory latency and bandwidth. From 3Ghz to 3.8GHz on the CPU should see no more than 1~3% difference. I go from ~35GB/s to >43GB/s using DDR4-2667. Latency drops nearly 10ns (~10%).

Kind of seems like the requested data is sent directly to the L3, not the requesting core, and is then piped into the core... because L3 latencies drop with higher frequencies - as you'd expect.

That could also explain the high memory latencies...

IMC -> DF -> CCX DF -> L3 -> Core L2 -> Core Fetch

I'm writing an app to test this and trying to isolate the data fabric, caches, and IMC latencies.

coffeemonster · Mar 15, 2017

looncraz said:
Games, however, frequently use the cpuid data manually - and if there's an AMD CPU, they don't do anything more than count cores. If it's Intel, they check to see if Hyper-threading is enabled. All they really did, with the game, was make note that Ryzen has SMT capabilities (and really good SMT capabilities at that - besting Intel almost across the board... simply amazing).

This manual hardware detection is done largely because games are designed to be at least somewhat portable (to consoles, for example) and the same basic code base is used on all systems. It doesn't help that Windows' C/++ API are a fractured mess.

how easy is it to spoof all games/applications into thinking Ryzen is "genuine Intel"?

looncraz · Mar 15, 2017

lolfail9001 said:
Do they like to shoot themselves in the foot that much? Can't say i am surprised...

Uhm, considering that on AMD CPUs since Bulldozer the odd numbered CPUs are best ignored in favor of even numbered ones, they clearly like to shoot themselves in the foot, if that's true.

Most companies use the same code they've always used. Windows didn't even have GetLogicalProcessorInfo() until Windows 7, IIRC (correction: XP SP3)

Games would certainly not care about losing another 10% on a CPU as horrid as Bulldozer. AMD should have recognized that long ago.

looncraz · Mar 15, 2017

coffeemonster said:
how easy is it to spoof all games/applications into thinking Ryzen is "genuine Intel"?

Once upon a time it was very easy. Now, it's easiest to do it through a virtual machine.

https://communities.vmware.com/thread/467303?start=0&tstart=0

I've done this to make sure performance on Intel didn't drop because I had an AMD cpuid. Usually - it didn't. Sometimes, it did... sometimes by a lot.

Edit:

Then again, you could just edit the game binary and swap GenuineIntel for AuthenticAMD

But that is risky - and may trigger game protections (against hackers / cheaters).

Kromaatikse · Mar 15, 2017

richaron said:
I would however be interested in seeing the conclusive proof "there was no "potential performance" that could feasibly be unlocked by adopting better optimisations" with the CMP line.

That's actually quite straightforward, and we can do it with a direct comparison of K10 with the general Bulldozer-family blueprint.

Recall that K10 was the last in a long line of AMD's triple-pipeline CPUs, starting with the original Athlon. This made it a very mature and well-understood optimisation target. It was also exceptionally tolerant of badly optimised code, or code optimised for a completely different CPU such as the Pentium 4 family. Each of its three integer pipelines could handle the great majority of integer operations, including LEA and load-store ops (as they had an embedded AGU as well as an ALU). Memory and cache latencies were also respectably low, these having been a headline feature of K8. Thus, on any sort of integer code that wasn't memory-limited, it could get remarkably close to its theoretical IPC of 3.

K10 did have two serious weaknesses: primarily the FPU, which was very outdated and could execute only one (possibly SIMD) add plus one multiply per cycle. The three-wide design also extended to the retirement unit, which prevented the latter from clearing the pipelines any faster than the front-end could stuff new instructions into them. This could lead to front-end stalls when instructions of mixed latency were in play.

Now recall that each Bulldozer core (of which two per module, up to eight per die) has only two ALUs and two AGUs. Assuming the full four-wide front-end is available (ie. a single-threaded workload), this means the front-end can generate twice as many macro-ops as the back-end can execute (as each macro-op can use both an ALU and an AGU). The integer back-end is therefore 33% narrower than K10 and is obviously a potential bottleneck. Or, to put it another way, it needed 50% more clock speed to merely equal K10's integer throughput. If that sounds at all familiar, think of the Pentium 4.

Then we come to the memory performance, which was abysmal and remained so throughout the family series. The memory controller itself was fine, as evidenced by the perfectly good performance achieved by iGPUs sharing it. The problem was the cache hierarchy, which appeared to have been specified and designed by a committee of chimpanzees on mind-altering drugs. Or, more plausibly, it was the result of sticking religiously to synthesised SRAM cells (as opposed to hand-optimised ones) and still striving for that 50% higher core clock speed. Whatever the cause, it had bad latency, worse throughput and terrible hit rates (especially at L1, which was the only one actually running at core frequency). A lot of Excavator's improved IPC over Steamroller comes from having a 32KB L1-D instead of 16KB.

The one thing that Bulldozer did right was to address K10's two most serious bottlenecks, as described above. The retirement unit for each core was wider than the front-end, permanently solving mixed-latency instructions, and the FPU was substantially upgraded, capable of executing two multiply-adds, OR two multiplies, OR two adds per cycle - all in SIMD if required. For legacy code using separate multiplies and adds, however, the throughput could be merely equal to K10 at the same clock speed. Then AMD ruined it all by providing only one of these improved FPUs for two cores to share - and floating-point-heavy workloads also tend to be memory-heavy, so they ran face-first into the cache hierarchy.

So in theory, a heavy yet single-threaded FP workload, optimised using FMA instructions and with very modest memory-access requirements, could outperform K10 on Bulldozer. Anything else would find a bottleneck - *somewhere*. So shall we consider multithreaded workloads, which Bulldozer was allegedly designed for?

Straight away we notice that shared FPU, which eliminated Bulldozer's per-clock FP throughput advantage over K10, and the shared L1-I cache and decoders. Given a legacy workload, not using FMA instructions, this gives Bulldozer *half* the total FP throughput per core per clock as K10. By converting to FMA (the "unlocked capabilities"), we can get back up to parity. Big whoop. Any actual advantage here would have to come from higher clock speeds or core count.

Multithreading also makes the already-questionable L1-I cache in Bulldozer completely inadequate to sustain tolerable hit rates, unless by chance both cores are running the same code from the same location in memory. It also halves the decoders' throughput per core, so now they match that of the integer units we mentioned earlier, with nothing spare to keep the FPU properly fed. The wider retirement unit goes completely to waste in this scenario. Steamroller and Excavator have separate decoders (and are significantly better for it), but must still share fetch bandwidth and the L1-I cache.

As for the clock speed, this reached 5GHz if you are generous enough to include the absurd FX-9590 and count the maximum turbo clock speed. This is *not* 50% higher than the best K10 models, which could run all cores at their maximum speed all the time without violating a much more reasonable TDP - and on an older, larger process node to boot. Piledriver and Excavator did nudge north of 4GHz within reasonable TDPs, but that's not enough to overcome the other handicaps - and AMD had already given up on making Bulldozer "faster than the competition" by the time Excavator arrived.

The only trick left up Bulldozer's sleeve was core count. But we can directly compare Llano with Trinity - same process node, same core count, one with K10 and the other with Piledriver - to find that each Piledriver module takes up significantly more die space than a pair of hastily-shrunk K10s. Even after reducing the size of the iGPU (luckily it wasn't slower - they changed from VLIW5 to VLIW4 to save space), Trinity's total die size is larger than Llano's. We can extrapolate to say that simply shrinking K10 to 32nm should have allowed a Phenom II X8 within Vishera's die size, equalling the core count with much less R&D cost.

So from every angle, Bulldozer was a complete and utter failure, even after making optimistic assumptions about software evolution.

By contrast, Ryzen achieves excellent performance on *current* productivity software, with remarkably few exceptions. No arguments about "performance potential" are needed in that context. Only in games, which are a rather different type of workload, is there a question-mark.

formulav8 · Mar 15, 2017

looncraz said:
Windows didn't even have GetLogicalProcessorInfo() until Windows 7 SP1, IIRC

I think it came out in XP.

richaron · Mar 15, 2017

Kromaatikse said:
*SNIP*

Wow you just wrote a wonderful post, good game. But maybe I missed the point again, or maybe I just can't read between the lines, because you seem to keep on comparing it to other architectures.

Can you please point out the part where:

Kromaatikse said:
it's been proved conclusively that there was no "potential performance" that could feasibly be unlocked by adopting better optimisations

^ This is what I asked you to prove.

looncraz · Mar 15, 2017

formulav8 said:
I think it came out in XP.

You're right, it was made available on XP with SP3. Otherwise Windows Vista was required.

NostaSeronx · Mar 15, 2017

Kromaatikse said:
Now recall that each Bulldozer core (of which two per module, up to eight per die) has only two ALUs and two AGUs.

2 Complex ALUs & 2 Address Gen & Logic Units. Which is two Arithmetic units and two Address generation or four Logic units. Four is one more than three. As well as the PRF & Map+Scheduler OoOE is vastly larger than Stars OoOE. (It is also better than Zen's *sips tea* monstrosity)
https://www.google.com/patents/US20120144175
https://www.google.com/patents/US8990623

Kromaatikse said:
The one thing that Bulldozer did right was to address K10's two most serious bottlenecks, as described above. The retirement unit for each core was wider than the front-end, permanently solving mixed-latency instructions, and the FPU was substantially upgraded, capable of executing two multiply-adds, OR two multiplies, OR two adds per cycle - all in SIMD if required. For legacy code using separate multiplies and adds, however, the throughput could be merely equal to K10 at the same clock speed. Then AMD ruined it all by providing only one of these improved FPUs for two cores to share - and floating-point-heavy workloads also tend to be memory-heavy, so they ran face-first into the cache hierarchy.

Decode = Retirement in width. Four macro-ops in dispatch/decode/retire and in SR/XV it is slightly bigger to support loop. The FPU has one VIMUL or VIMAC and two VIADDs in BD/PD, in SR/XV it is one VIMUL/VIADD/VIMAC and one VIADD. Vector integer is more common in consumer workloads than vector floating point.

Kromaatikse said:
So from every angle, Bulldozer was a complete and utter failure...

Bulldozer was a success in mobile/Mainstream DT/embedded.

Trinity -> Richland // 10h-1Fh -> Kaveri -> Godavari // 30h-3Fh -> Carrizo -> Bristol Ridge/Stoney Ridge. // 60h-7Fh
Zambezi -> Vishera -> Centurion // Orochi (00h-0Fh) ... 20h-2Fh Cancelled, 40h-4Fh Cancelled. (Komodo PD != Vishera/Trinity PD)

Architecture wise... well those designs... were lost in a fire, or eaten by a doge. *sneaks into Sunnyvale then absconds with future 15h designs*

Kromaatikse · Mar 15, 2017

NostaSeronx said:
2 Complex ALUs & 2 Address Gen & Logic Units. Which is two Arithmetic units and two Address generation or four Logic units. Four is one more than three.

Except that K10 had an ALU and an AGU in each pipeline, for a total of six units - which is more than four. AMD's "macro-ops" meant that it could dispatch both an AGU micro-op and an ALU micro-op simultaneously (for a single instruction) for the cost of one, in every one of their CPUs since the original Athlon.

NostaSeronx said:
Bulldozer was a success in mobile/Mainstream DT/embedded.

No, it wasn't. AMD's iGPUs were a success in those areas. They just happened to have a Bulldozer-family CPU hanging off the side.

Kromaatikse · Mar 15, 2017

richaron said:
Wow you just wrote a wonderful post, good game. But maybe I missed the point again, or maybe I just can't read between the lines, because you seem to keep on comparing it to other architectures.

Seriously, read through it and do the maths.

Work out the total theoretical computing power available in Piledriver if it's fully loaded with 8 threads and given the best mix of instructions you can think of at precisely the right times. Even ignore the cache problems - suppose you have a workload that doesn't need memory access or a big blob of code. That's the "performance potential" of Bulldozer as actually built: two instructions per clock per core, limited by the shared front-end, so 75.2 GIPS, 75.2 scalar GFLOPS (counting an FMA as 2 FP ops) and 300.8 SIMD GFLOPS for the FX-9590 at stock all-core speed (4.7GHz).

Then do the same for ye olde K10 in 8-core form, which everyone already knew exactly how to optimise for because it really was available as a 6-core (and, in servers, as a 12-core MCM). With no shared hardware, it can actually achieve 3 instructions per clock, though it has a weaker FPU. Assume it goes no faster than a 45nm K10, in order to keep the total TDP constant with the higher core count; the 1100T ran at 3.3GHz stock with no turbo. That's 79.2 GIPS, 52.8 scalar GFLOPS and 211.2 SIMD GFLOPS for this notional Phenom II x8 - faster in integers, and only slower in FPU proportionately with the lower clock speed.

Piledriver supports AVX instructions but gets no performance bonus from them (they take two macro-op slots), and had a nasty bug which made storing AVX registers to memory excruciatingly slow. So both CPUs end up using plain old SSE.

Then compare performance per watt. FX-9590 achieves its clock speed only with a massive "factory overclock" which severely restricts motherboard compatibility, due to its 220W TDP. All of the Phenom II models topped out at 125W TDP, and sometimes less. So Phenom II x8 wins on perf/watt, even on an FP workload, and would have been cheaper to produce due to its smaller die, *and* would have been much easier to get peak performance from.

We can run similar numbers for, say, Haswell or Sandy Bridge - and not even the workstation parts, but the desktop ones, as they support the same 8 threads. Again, no shared hardware between physical cores, so both FPU pipelines are available. The front-end can decode 4 instructions per cycle as long as they produce at most 4 (fused) micro-ops. There's enough integer hardware to go around - 3 ALUs on SB, 4 on Haswell. The AVX pipelines are twice as wide as AMD uses, so there's a SIMD bonus for Intel. The i7-4790K even gets a base clock of 4GHz as standard.

So that's 64 GIPS, 64 scalar GFLOPS and 512 SIMD GFLOPS with FMA FP code on a Haswell desktop i7, in an 88W TDP - a mere 40% of the FX-9590's power budget for 85% of the theoretical scalar performance and 70% *extra* theoretical SIMD performance. Most of that performance is available from 4 threads rather than 8, which may simplify the multithreaded design of some algorithms (or complete systems, such as games). And again, this is a well-understood architecture that Intel will happily give you a highly-tuned compiler for.

Thus, I repeat: there was no way for any member of the Bulldozer family to outperform Intel by the application of carefully optimised software. Not even in theory.

innociv · Mar 15, 2017

Kromaatikse said:
I think you might be confusing process scheduling with instruction scheduling. The latter is a function of the compiler; Windows apps, like Linux distro packages, generally optimise for a generic CPU to maximise compatibility.

Thanks for the correction.

But it does not change that that other article looncraz as referring to was just a fix to have it correctly identify which threads are the second threads on a core, and not an update to scheduling to better optimize for the dual CCX structure.

richaron · Mar 15, 2017

Kromaatikse said:
Thus, I repeat: there was no way for any member of the Bulldozer family to outperform Intel by the application of carefully optimised software. Not even in theory.

Aaaand here we arrive at the issue. That's not what you said. That's not what I picked up upon. And that's not what I was talking about. What you said was:

Kromaatikse said:
As for Bulldozer, it's been proved conclusively that there was no "potential performance" that could feasibly be unlocked by adopting better optimisations.

Again I appear to be missing something, because all you have been doing is comparing theoretical performance of two different architectures. But I asked you to prove something related to the real world performance of a single architecture.

I think you got off to a good start introducing the previous architecture as a baseline, but to prove "conclusively that there was no 'potential performance' that could feasibly be unlocked by adopting better optimisations" what you will need to do use use real world data. I would be looking for data which shows the maximum theoretical performances (which you seem to be into) equate to real world performance relative to our introduced baseline. If you can show how AMD's CMT line used to their maximum potential versus the baseline in the real world, is directly related to their theoretical differences then I'll start seeing what you're talking about.

You made a pretty big claim saying "it's been proved conclusively that there was no "potential performance" that could feasibly be unlocked by adopting better optimisations", and it sounded like it was obvious. Maybe you can just link to someone else proving it in a way I can understand? Surely with something so obvious there is plenty out there?

lolfail9001 · Mar 15, 2017

innociv said:
But it does not change that that other article looncraz as referring to was just a fix to have it correctly identify which threads are the second threads on a core, and not an update to scheduling to better optimize for the dual CCX structure.

The fix to identify different CCXs been in kernel since what, November of last year? Positively present in 4.10. And you can't properly optimize for dual CCX structure, not in on-the-fly scheduler.

Kromaatikse · Mar 15, 2017

innociv said:
just a fix to have it correctly identify which threads are the second threads on a core, and not an update to scheduling to better optimize for the dual CCX structure.

True - but that's because Linux already supports weird memory architectures quite well. An 8-core Ryzen is actually better-connected, relatively speaking, than a 2P or 4P Opteron system of yore, and those were optimised for pretty carefully when they were most relevant.

Linux even already knows that there are two L3 caches; no fix is required there because it is reported using a standard mechanism which hasn't changed behaviour. I'm not 100% certain how much weight it puts on that information, but it could easily be used to group threads of a single process into the same cache when practical. Ryzen is not the only important CPU in recent times to have split LLCs, just the first one to have significant penetration on the Windows desktop.

Kromaatikse · Mar 15, 2017

richaron said:
That's not what you said. That's not what I picked up upon. And that's not what I was talking about.

Well I apologise if what you read differed from what I meant. The fact is, Bulldozer has never reached its theoretical peak performance in any non-trivial application, because optimising specifically for it is completely impractical - and not just because better alternatives exist. Real workloads always run into one or more of its many bottlenecks and gotchas.

Naturally, there also exists some software which runs *spectacularly* poorly on it. My attention was drawn recently to the Himeno benchmark, which turns out to be a rather inefficient implementation of a fluid-dynamics equation, and which apparently runs consistently worse on AMD CPUs (including Ryzen) than on anything recent by Intel. I dug into it to see if that could rationally be explained - and wound up making it run over 3x as fast on my Steamroller by rearranging the data structures and streamlining the array index calculations. But those are generic optimisations which work almost as well on other CPUs, including Bobcat, three different (relatively old) Intel CPUs I had to hand, a PowerPC G4 and an ARM Cortex-A7. I simply reduced the number of cache misses and integer instructions per element by a large fraction.

Was I "unlocking potential performance" on those CPUs as well, or just fixing a hideously bad piece of code? What if I vectorised it (which neither GCC nor Clang manage to do automatically, in either the old or new versions), and then saw recent Intel CPUs leap ahead by another factor of 2 due to their wider SIMD units? What if I rewrote it in OpenCL and fed it to the iGPU - does that count, since it's no longer hobbled by the CPU core itself?

richaron · Mar 15, 2017

Kromaatikse said:
Well I apologise if what you read differed from what I meant. The fact is, Bulldozer has never reached its theoretical peak performance in any non-trivial application, because optimising specifically for it is completely impractical - and not just because better alternatives exist. Real workloads always run into one or more of its many bottlenecks and gotchas.

Agreed. I'm glad we got to this point in the discussion because the other stuff was spammy and tiring (but for the record I just read what was written).

Now if I can get back to my tinfoil hat theories which I believe is relevant to this conversation; I think a non-zero amount of the software for microcodes, drivers, compilers or whatever is comparable from Bulldozer all the way through Ryzen up to Naples (at least at a higher level of abstraction). And if one were to be cynical we could almost think of these as development platforms. For example I don't think it's a coincidence the Bulldozer and a CCX both have 4 cores & 8 threads with comparable resource affinity. And I for sure don't think it's a coincidence both Bulldozer and Naples have 4 subunits (Zen cores|modules), and each of these subunits is split in 2 but with some shared resources. Shirley any tricks learnt to help resource sharing or to optimize inter-thing communication would help a number of products.

Kromaatikse · Mar 15, 2017

richaron said:
Shirley any tricks learnt to help resource sharing or to optimize inter-thing communication would help a number of products.

Tinfoil hat territory indeed. You do know that tinfoil hats have been shown to *amplify* the frequency ranges potentially used by government mind-control rays, due to the resonant cavity they form? Stick *that* in your conspiracy theory file.

In my view, there are fewer similarities between anything in the Bulldozer family and Ryzen, than between Ryzen and old K8 multi-socket Opterons, at least as far as topology goes. And that's a *good* thing. But if you want another AMD CPU with four cores per last-level cache and significantly more latency between those caches, look no further than Jaguar, as used in the PS4 and XBone. Of course, those consoles don't run Windows, but a specialised game-centric OS.

Ajay · Mar 16, 2017

Kromaatikse said:
Ryzen is not the only important CPU in recent times to have split LLCs, just the first one to have significant penetration on the Windows desktop.

And here is the rub - Zen doesn't have significant penetration on Windows desktops or Servers yet. So Mickey$oft has no incentive to risk destabilizing Intel scheduling performance for the sake of Ryzen processors. Evidently, it's not as simple as reading the CPUID and using switch/case conditionals.

Add to that the likelihood that AMD has improved performance under the windows scheduler, either in a new stepping or in Zenver2, and I suspect we won't be seeing an MS patch any time soon (despite earlier indications that we would).

For giggles, current versions of Windows use an MLFQ scheduler (http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-sched-mlfq.pdf), with the added bonus of a bolted on core parking algorithm to deal with modern p-states. I can't find any details of how windows handles multicore CPUs, never mind how it handles cache locality, etc.

tamz_msc · Mar 16, 2017

@Kromaatikse I can vouch for the efficacy of the code written by computational physicists, especially an old one like Himeno.

tamz_msc · Mar 16, 2017

Ajay said:
And here is the rub - Zen doesn't have significant penetration on Windows desktops or Servers yet. So Mickey$oft has no incentive to risk destabilizing Intel scheduling performance for the sake of Ryzen processors. Evidently, it's not as simple as reading the CPUID and using switch/case conditionals.

Add to that the likelihood that AMD has improved performance under the windows scheduler, either in a new stepping or in Zenver2, and I suspect we won't be seeing an MS patch any time soon (despite earlier indications that we would).

For giggles, current versions of Windows use an MLFQ scheduler (http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-sched-mlfq.pdf), with the added bonus of a bolted on core parking algorithm to deal with modern p-states. I can't find any details of how windows handles multicore CPUs, never mind how it handles cache locality, etc.

I think looncraz's findings and the suggestions he made regarding improving Windows scheduling would benefit Intel CPUs as well.

Ryzen: Strictly technical

Junior Member

Golden Member

Senior member

Golden Member

Senior member

Senior member

Senior member

Senior member

Member

Diamond Member

Golden Member

Senior member

Diamond Member

Member

Member

Member

Golden Member

Golden Member

Member

Member

Golden Member

Member

Lifer

Diamond Member

Diamond Member