Speculation: Ryzen 4000 series/Zen 3


Makaveli

Diamond Member
Feb 8, 2002
4,717
1,051
136
I am really tired of this gaming discussion.
The difference in this workload scenario is so negligible that people are arguing over the height of graph bars.
I'll tell you, if you can identify the system in a double-blind test across HU's 30-or-so-game benchmark suite, using resolutions and game settings appropriate to the hardware in the systems, I'll lick your pet's behind, tape it and upload it to YouTube.

100% agreed. Most aren't looking at what is playable vs. non-playable performance, and one CPU giving 5 extra fps is now "killing" the competition. It's laughable.
 

Thibsie

Senior member
Apr 25, 2017
746
798
136
100% agreed. Most aren't looking at what is playable vs. non-playable performance, and one CPU giving 5 extra fps is now "killing" the competition. It's laughable.

Well, if you truncate the X axis at the right spot, it looks like that. Most review sites love doing it that way.
 

arandomguy

Senior member
Sep 3, 2013
556
183
116
A general problem is that gaming reviews tend to be very GPU-oriented. The common AAA multiplatform games that get used for reviews are not really going to be CPU bound. Even desktop CPUs released years before the outgoing console generation greatly outperform the CPUs in those consoles (although I do wonder how the much stronger next-gen console CPUs, relative to the desktop, will impact PC gaming, a concern I've brought up here before). While console-plus graphics settings get added that can really leverage the stronger GPUs available, the same does not exist for CPUs. So effectively all CPUs, even low-end ones, now deliver more than playable performance in those types of games.

Why is this an issue? Well, there are games with relatively low GPU utilization that could show tangible differences in CPU (or other subsystem) performance, but they don't get tested. These are also the games that actually do have performance issues.

Furthermore, the community has a tendency to dismiss games that exhibit CPU performance issues as "unoptimized" or examples of "bad" software, which further discourages exploration of those cases.

So instead, those of us actually involved in those use cases have to extrapolate from data on the commonly reviewed games mentioned above.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,673
3,789
136
I am really tired of this gaming discussion.
The difference in this workload scenario is so negligible that people are arguing over the height of graph bars.
I'll tell you, if you can identify the system in a double-blind test across HU's 30-or-so-game benchmark suite, using resolutions and game settings appropriate to the hardware in the systems, I'll lick your pet's behind, tape it and upload it to YouTube.

I was eating when I read this. It made me a little queasy. Funny stuff, though.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
You have to be kidding me, or you lived on some lonely island last decade. Intel's engineers have been resurrecting Skylake for five years in a row and have basically bottled up development. AMD went backwards with the horrible Bulldozer, lost the entire server market and nearly went bankrupt. Feel free to explain to me why Bulldozer was such garbage with 2x ALUs and why Intel did so well with 4x ALUs in Haswell (twice the IPC of BD). Of course it has nothing to do with the number of ALUs, right?

Do you think Zen3 will have 1x ALU and will tear apart every uarch in IPC, including 6x ALU Apple?

My humble opinion is that computational performance comes from computation units like the ALU, AGU and FPU. If you buy an 8-bit microcontroller, it consists of 1x ALU and no cache, and yet it does the computation. I'm afraid that if you buy a chip with cache memory alone, you won't be able to do any computation. Feel free to prove me wrong :D



Of course, another messenger of Intel garbage. Intel planted into the heads of a whole generation the idea that CPU IPC has hit a hard wall and can only be increased a little, maybe with some cache tuning. It's so sad to see some people still believe this Intel BS. Intel did that because it was lazy and was earning big money while putting no effort into CPU development.

The whole 82% IPC advantage of the Apple core is just a coincidence with its 6x ALUs. The poor performance of the 2x ALU Bulldozer was also a coincidence with its low number of ALUs. I wonder why AMD didn't go back to the 3x ALU K10 design and instead went directly to a 4x ALU design similar to Haswell. Coincidence again, I guess.



Funny that you suggest IPC isn't dependent on the number of ALUs but at the same time ask for more FPUs. Did you realize that FPUs do the same thing as ALUs, just with a different format? :)


8-bit microcontroller = straw-man argument. It has nothing to do with the real argument.

Adding an extra ALU is generally not going to change performance at all, since the bottlenecks are elsewhere. You are doing well if you achieve an IPC of 1 on a modern processor, although it is highly dependent on the code in question. AMD made some decisions that did not turn out well with the Bulldozer/Excavator CPUs, but one of the main issues was that they got stuck on 28 nm with no next-generation process while Intel moved on to 22 nm and then 14 nm FinFET. AMD didn't really have much of a chance of competing during that time even if they had made a drastically improved design; it would have been too large and too power hungry on 28 nm. Intel is now in a surprisingly similar situation, pushing what, 250 to 300 watts, to compete on 14 nm. What will it look like if AMD gets to 5 nm and Intel is still stuck on 14? Intel has plenty of money to weather the AMD storm, though, so it is very different in that regard.

I don't remember the specs anymore and I don't have time to look them up, but the cache system in Zen was drastically improved over the previous Excavator CPUs. Also, if it turns out that more integer ALUs would increase performance, then they will probably add them; that adds more ports to an already very complicated scheduler, though. They will be increasing the cache bandwidth to support more FP throughput; 256-bit FMA takes a huge amount of bandwidth, but integer ALU performance depends on latency, which is much harder to reduce. Larger caches and other improvements may help the average latency, but latency reduction is significantly harder than increasing bandwidth. I don't think they are suddenly going to be able to keep more than 4 ALUs busy (IPC greater than 4 on branchy integer code) considering most applications are probably still around 1.

Also, if you don't know the difference between branchy integer code and FP code that can use two 256-bit vector FPUs (8 32-bit FLOPs each), soon probably to be at least three 256-bit vector units, then there isn't much point in arguing with you. I guess this gives us some idea of the level of your knowledge. We have GPUs that can do thousands of FLOPs per clock, but they generally don't run general branchy integer code. I have been of the opinion that the vector width on CPUs should remain limited; most stuff that can really make use of wide vectors should be run on a GPU.

We have CPUs with ~4 scalar integer ALUs for good reasons. Branchy integer code is constantly blocking on memory access; all of the caching, prefetching, speculative execution, etc. still leaves it waiting on the memory subsystem constantly. I have seen server applications that achieve an IPC of less than 0.25 because they have a huge number of hard-to-predict branches and large memory footprints that are not very cacheable. That is where threading and/or lots of small cores come in handy. I have thought that it would be interesting to make a CPU without the huge AVX vector units, perhaps with just a simple scalar FPU. A huge number of servers have almost no need for FP hardware; leave the FP crunching to the GPUs. Many of them don't actually need high-performance CPUs in the first place. A lot of things get lumped together under "server", and the needs of a storage server are a lot different from those of an HPC machine.
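To make the contrast concrete, here is a minimal, contrived C++ sketch (my own illustration, not from the post above): a pointer-chasing loop with a data-dependent branch that stalls on memory no matter how many ALUs exist, next to a dense FP loop the compiler can turn into 256-bit FMAs.

```cpp
#include <cstdint>
#include <vector>

struct Node { std::uint64_t key; Node* next; };

// Pointer chasing with a data-dependent branch: the next load depends on the
// previous one, and the branch depends on loaded data, so the core spends most
// of its time waiting on the memory subsystem and measured IPC stays low.
std::uint64_t chase(const Node* n) {
    std::uint64_t sum = 0;
    while (n) {
        if (n->key & 1)          // hard-to-predict branch on loaded data
            sum += n->key;
        n = n->next;             // serialized pointer chase
    }
    return sum;
}

// Dense, independent FP work: easily unrolled and vectorized into 256-bit
// FMAs, so it is limited by vector units and cache bandwidth, not latency.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] = a * x[i] + y[i];
}

int main() {
    std::vector<Node> nodes(1024);
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        nodes[i].key  = i * 2654435761u;   // arbitrary pseudo-random keys
        nodes[i].next = (i + 1 < nodes.size()) ? &nodes[i + 1] : nullptr;
    }
    std::vector<float> x(1024, 1.0f), y(1024, 2.0f);
    saxpy(3.0f, x, y);
    return static_cast<int>(chase(nodes.data()) & 1) + static_cast<int>(y[0]);
}
```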
 

dr1337

Senior member
May 25, 2020
330
558
106

Yet another ES leak from igorslab, this time confirming Zen 3 has the exact same amount of cache as Zen 2. Clock speeds are also identical to the early Rome ES.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,777
136
I hope AMD is not going to add anything of that sort, like Intel's AMX. For the many applications that don't use matrix math it is a terrible waste of die space.
I am a believer in HSA; in the future I hope AMD can stack a special accelerator die that can offload these operations, the way the original x87 coprocessor did.
Not all SKUs need to support matrix or specialized vector ops.
There is a discussion initiated by Red Hat about organizing the x86 instructions into feature levels, and one thing that strikes me is that Level D consists mostly of AVX512 instructions.
I am not sure why Red Hat would structure the proposal so that supporting a particular feature level requires supporting the previous feature level first.
This is absurd. It means that for any future feature level a processor has to support AVX512 first. What a mess. I hope this is not what ends up in the final proposal.
Level D : Level C + AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL. At this stage with the AVX-512 focus, just current Intel Xeon Scalable CPUs and Ice Lake.
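As a rough illustration of what gating on that level would look like in practice (my own sketch, not part of the Red Hat proposal), a program built with GCC or Clang could check the individual AVX-512 features listed for Level D at runtime before picking a code path:

```cpp
#include <cstdio>

// Returns true only if the CPU reports every AVX-512 subset named in the
// quoted "Level D" definition (F, BW, CD, DQ, VL).
bool has_level_d_avx512() {
    return __builtin_cpu_supports("avx512f")
        && __builtin_cpu_supports("avx512bw")
        && __builtin_cpu_supports("avx512cd")
        && __builtin_cpu_supports("avx512dq")
        && __builtin_cpu_supports("avx512vl");
}

int main() {
    __builtin_cpu_init();   // recommended before using the cpu-supports builtins
    std::printf("AVX-512 F/BW/CD/DQ/VL: %s\n",
                has_level_d_avx512() ? "present" : "absent");
    return 0;
}
```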

Linus was quite vocal against AVX512 as well.
I hope AVX512 dies a painful death, and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on.
I hope Intel gets back to basics: gets their process working again, and concentrate more on regular code that isn't HPC or some other pointless special case.
I've said this before, and I'll say it again: in the heyday of x86, when Intel was laughing all the way to the bank and killing all their competition, absolutely everybody else did better than Intel on FP loads. Intel's FP performance sucked (relatively speaking), and it mattered not one iota.
Because absolutely nobody cares outside of benchmarks.

The same is largely true of AVX512 now - and in the future. Yes, you can find things that care. No, those things don't sell machines in the big picture.
And AVX512 has real downsides. I'd much rather see that transistor budget used on other things that are much more relevant. Even if it's still FP math (in the GPU, rather than AVX512). Or just give me more cores (with good single-thread performance, but without the garbage like AVX512) like AMD did.
I want my power limits to be reached with regular integer code, not with some AVX512 power virus that takes away top frequency (because people ended up using it for memcpy!) and takes away cores (because those useless garbage units take up space).
Yes, yes, I'm biased. I absolutely detest FP benchmarks, and I realize other people care deeply. I just think AVX512 is exactly the wrong thing to do. It's a pet peeve of mine. It's a prime example of something Intel has done wrong, partly by just increasing the fragmentation of the market.
Stop with the special-case garbage, and make all the core common stuff that everybody cares about run as well as you humanly can. Then do a FPU that is barely good enough on the side, and people will be happy. AVX2 is much more than enough.
Yeah, I'm grumpy.

Linus
Zen3 not having AVX512 was good news to me. At least the rumors seem to indicate no AVX512 support.
I miss having something like the TriCore chips (the first HSA-style chip I worked with) for mainstream applications on x86.
The unified-memory DSP and FPU accelerator in the TC17XX series I worked with made a big difference in ease of programming and was orders of magnitude more powerful than the generic CPU cores.
Unified-memory HSA can't come to x86 soon enough to counter all these instructions like AVX512 and AMX. Leave special-purpose instructions to ASICs; they are called application-specific for a reason.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
Linus was quite vocal against AVX512 as well.

Well speaking purely as a gamer and a hardware enthusiast, I have to disagree with both you and Mr. Torvalds.

I like the prospect of having wide vectors on CPUs, because they have sped up encoding/decoding algorithms immensely, and also because of how they can be used to make complex physics effects run much faster on a CPU.

As a long-time PC gamer, I can reminisce back to the P4 and Athlon 64 days, when running a single instance of cloth simulation on a CPU would cause the framerate to tank to single digits. The inability of CPUs to handle those kinds of workloads is exactly what led Ageia to design a PPU, and then Nvidia to implement hardware-accelerated physics on their GPUs with PhysX. Theoretically a GPU was much better suited to physics calculations than a CPU, but in practice there were limitations: high latency from going across the PCIe bus, the risk of overwhelming the GPU with too much work, not to mention vendor lock-in, so it never flourished.

Fast forward 15 years and we now have high-core-count CPUs with SMT and wide vectors, and physics calculations that once needed hardware acceleration now run at blazing speed; in some cases, like cloth simulation, faster than running them on a GPU. For me, realistic physics is the last bastion of game technology that needs to be conquered. 3D rendering has advanced greatly over the years, but physics and animation quality have improved much more slowly.

With the next generation of consoles having much more capable CPUs, I expect that trend to change: AVX2 will see massive adoption in games. In fact, UE5 will ship with the Chaos physics engine, which is optimized with Intel's ISPC compiler and targets the very vector instruction sets that Mr. Torvalds is so disparaging of. :D
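For illustration, this is the flavour of data-parallel loop such a physics engine leans on (a contrived sketch with a made-up structure-of-arrays layout, not actual Chaos or ISPC output), written with AVX2/FMA intrinsics so eight particles advance per iteration:

```cpp
#include <immintrin.h>
#include <cstddef>

// x[], v[], a[] each hold one component for n particles (structure of arrays).
// Build with -mavx2 -mfma (or equivalent).
void integrate_avx2(float* x, const float* v, const float* a,
                    float dt, std::size_t n) {
    const __m256 vdt = _mm256_set1_ps(dt);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 px = _mm256_loadu_ps(x + i);
        __m256 pv = _mm256_loadu_ps(v + i);
        __m256 pa = _mm256_loadu_ps(a + i);
        __m256 nv = _mm256_fmadd_ps(pa, vdt, pv);   // v' = v + a*dt
        px = _mm256_fmadd_ps(nv, vdt, px);          // x' = x + v'*dt
        _mm256_storeu_ps(x + i, px);
    }
    for (; i < n; ++i)                              // scalar tail
        x[i] += (v[i] + a[i] * dt) * dt;
}
```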

But like I said, I'm looking at this purely as a consumer and not as an industry professional.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,777
136
Well speaking purely as a gamer and a hardware enthusiast, I have to disagree with both you and Mr. Torvalds.

I like the prospect of having wide vectors on CPUs, because they have sped up encoding/decoding algorithms immensely, and also because of how they can be used to make complex physics effects run much faster on a CPU.

As a long-time PC gamer, I can reminisce back to the P4 and Athlon 64 days, when running a single instance of cloth simulation on a CPU would cause the framerate to tank to single digits. The inability of CPUs to handle those kinds of workloads is exactly what led Ageia to design a PPU, and then Nvidia to implement hardware-accelerated physics on their GPUs with PhysX. Theoretically a GPU was much better suited to physics calculations than a CPU, but in practice there were limitations: high latency from going across the PCIe bus, the risk of overwhelming the GPU with too much work, not to mention vendor lock-in, so it never flourished.

Fast forward 15 years and we now have high-core-count CPUs with SMT and wide vectors, and physics calculations that once needed hardware acceleration now run at blazing speed; in some cases, like cloth simulation, faster than running them on a GPU. For me, realistic physics is the last bastion of game technology that needs to be conquered. 3D rendering has advanced greatly over the years, but physics and animation quality have improved much more slowly.

With the next generation of consoles having much more capable CPUs, I expect that trend to change: AVX2 will see massive adoption in games. In fact, UE5 will ship with the Chaos physics engine, which is optimized with Intel's ISPC compiler and targets the very vector instruction sets that Mr. Torvalds is so disparaging of. :D

But like I said, I'm looking at this purely as a consumer and not as an industry professional.
AVX2 is not the same as AVX512, and AMX is something else again. Linus was talking about AVX512, not AVX2.
You have summarized perfectly the need for wide vector instructions in gaming software (for HPC, AI and others it is obviously needed). UE doesn't use AVX512, however.
Even mainstream Intel chips don't have AVX512 (e.g. the non-X 10th Gen parts don't have it), and Alder Lake will not have it either.


The point was about HW, not SW, and how the HW handles these very wide vector/matrix instructions like AVX512/AMX. In an HSA design like the TC17XX, for example, the compiler emits instructions that are executed by the FPU/DSP or by the ALU; the programmer is unaware of where the instructions will run. There are some caveats, and this is from the 2000s, but that is the general idea.
Currently it is not always possible or straightforward to schedule work on the GPU without sync operations, memory copies, memory locks and the like, with the added drawback of latency.
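A minimal HIP-style sketch of that explicit flow (my own illustration with made-up names, assuming a ROCm/HIP toolchain and hipcc): every allocation, copy and sync is spelled out by the programmer, which is the friction a coherent unified-memory design is meant to remove.

```cpp
#include <hip/hip_runtime.h>
#include <vector>

__global__ void scale(float* v, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= s;
}

void scale_on_gpu(std::vector<float>& host, float s) {
    const int n = static_cast<int>(host.size());
    float* dev = nullptr;
    hipMalloc(reinterpret_cast<void**>(&dev), n * sizeof(float));  // explicit device allocation
    hipMemcpy(dev, host.data(), n * sizeof(float),
              hipMemcpyHostToDevice);                              // explicit copy in
    scale<<<(n + 255) / 256, 256>>>(dev, s, n);                    // launch the kernel
    hipDeviceSynchronize();                                        // wait for the result
    hipMemcpy(host.data(), dev, n * sizeof(float),
              hipMemcpyDeviceToHost);                              // explicit copy back
    hipFree(dev);
}
```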
If cache-coherent unified-memory HSA on x86 finally comes to fruition, the SW should not have to worry about where wide vector instructions run (that is the goal, at least); the libraries and compiler will decide that. On Intel, the vector/matrix math will end up being executed on the CPU, if that is the path they choose. On a unified-memory chip the instructions could run on an ASIC, which by virtue of its design is very good at this.
At the same time the CPU does not have to be unnecessarily large to fit the vector units, which frees die space on the SKUs that have no need for such instructions, while the SKUs that do need them can get a much bigger speedup than CPU vector units could provide.

Obviously there are hurdles to get there, like the latency of getting to the GPU.
But from all the papers we have seen from AMD, their 3D-stacked chip will have the GPU on die with a local interconnect. It remains to be seen what the latency will be for parts without a GPU die; it should be worse (but that is a different topic).

EDIT
Renoir is also HSA, but the GPU and CPU are not cache coherent even though they have unified memory.
 
Last edited:

HurleyBird

Platinum Member
Apr 22, 2003
2,682
1,266
136
Yet another ES leak from igorslab, this time confirming zen 3 has the exact same amount of cache as zen 2. Clockspeeds are also identical to early rome ES as well.

I'd say it points towards rather than confirms. There are three other possibilities: more cache physically exists but has been disabled, software isn't reporting cache sizes correctly, or there's a different die with more physical cache.
 
  • Like
Reactions: Thibsie

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
AVX2 is not the same as AVX512, and AMX is something else again. Linus was talking about AVX512, not AVX2.
You have summarized perfectly the need for wide vector instructions in gaming software (for HPC, AI and others it is obviously needed). UE doesn't use AVX512, however.
Even mainstream Intel chips don't have AVX512 (e.g. the non-X 10th Gen parts don't have it), and Alder Lake will not have it either.

My point though was that the instruction set can be useful for certain workloads, and that developers will likely find ways of using it. Who could say what the future holds?

If you had asked me 10 years ago where the future of game physics and encoding/transcoding lay, I would have told you it was with GPU hardware acceleration. But as of today, hardware accelerated physics and GPU based encoding are both pretty much dead in the water while the software versions are flourishing and very performant.

Intel seems to really be pushing wider vectors, and when you look at their investment in things like their SVT codecs and ISPC compiler, their methodology doesn't seem as haphazard as one might have initially believed.
 

amd6502

Senior member
Apr 21, 2017
971
360
136
I'd say it points towards rather than confirms. There are three other possibilities: more cache physically exists but has been disabled, software isn't reporting cache sizes correctly, or there's a different die with more physical cache.

Almost confirms. Another, extremely slim, possibility is that the cores reported here are logical rather than physical, but that would mean a doubling across the board of L1, L2 and L3, which is highly unlikely since they already doubled L3 in the previous gen. I'd hoped they would increase L1 or double L2, but that doesn't seem to be in the cards anymore.

As for not doubling L2, they must have done some calculations and figured out that powering twice the transistors wasn't worth the gains a doubled L2 would give (probably a diminishing-returns effect). Perf/watt is probably valued quite a bit more for Zen 3 than raw perf.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,777
136
Have any of AMD's APU designs been cache coherent?
None so far.
You probably know this already, but just in case: cache coherency would really simplify programming. All fabric masters are aware when data changes, and they can invalidate their cache line when the data is modified by another core or by the GPU. This lets the CPU and GPU access practically the same variables, vectors and so on; it is like multithreading with another core. There is no need to prepare a kernel, prepare the data, copy it to the GPU, launch the kernel, wait for the result and copy it back from GPU memory. Just call a library function and it will, for example, do a matrix FMA on your locally declared matrices. But as you can see, a lot of bandwidth is needed to keep those SIMD units occupied because they work on a lot of data at once, so I suspect cache coherency with GPUs will change what CPUs look like in the future. That is also part of the reason Mark highlighted the 10x bandwidth density with X3D.
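To show the other side of the explicit-copy sketch earlier in the thread, here is an illustrative HIP example (my own, not AMD's roadmap; it assumes a toolchain that provides hipMallocManaged, which gives unified memory today while true fine-grained coherence depends on the hardware underneath): CPU and GPU touch the same allocation with no explicit copies.

```cpp
#include <hip/hip_runtime.h>

__global__ void axpy(float* y, const float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x = nullptr, *y = nullptr;
    hipMallocManaged(reinterpret_cast<void**>(&x), n * sizeof(float));  // one allocation,
    hipMallocManaged(reinterpret_cast<void**>(&y), n * sizeof(float));  // visible to CPU and GPU
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }           // CPU writes directly
    axpy<<<(n + 255) / 256, 256>>>(y, x, 3.0f, n);                      // GPU works on the same memory
    hipDeviceSynchronize();
    const bool ok = (y[0] == 5.0f);   // CPU reads the result, no copy back needed
    hipFree(x);
    hipFree(y);
    return ok ? 0 : 1;
}
```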
 
  • Like
Reactions: moinmoin

DrMrLordX

Lifer
Apr 27, 2000
21,612
10,817
136
None so far.

That's what I figured. Kaveri and Carrizo certainly weren't. I wasn't sure about anything after that but my suspicion was "no".

You would probably have known but just in case... cache coherency would really simplify programming.

You can sort of cheat there with clever driver implementations, such as what nVidia tries to do over NVLink. I think they have support hardware on NVLink-compatible hardware that helps as well. That being said, I'm not sure AMD is going to go that route.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,673
3,789
136
Almost confirms. Another, extremely slim, possibility is that the cores reported here are logical rather than physical, but that would mean a doubling across the board of L1, L2 and L3, which is highly unlikely since they already doubled L3 in the previous gen. I'd hoped they would increase L1 or double L2, but that doesn't seem to be in the cards anymore.

As for not doubling L2, they must have done some calculations and figured out that powering twice the transistors wasn't worth the gains a doubled L2 would give (probably a diminishing-returns effect). Perf/watt is probably valued quite a bit more for Zen 3 than raw perf.

I agree, the L3 is fine where it is. I was hoping for more L2 cache. But I will clearly defer to AMD.
 
  • Like
Reactions: amd6502

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,777
136
"The AMD Kavari APU already has a fully coherent memory between the CPU and GPU."

You're absolutely right, I did not think back far enough. I need to do some more digging.
That is also a great link discussing the advantages of cache-coherent architectures.
Now I am intrigued and interested in buying one of them, but first I need to check compiler support. I highly doubt AMD did much with it back then without any money.
 
Last edited:

DrMrLordX

Lifer
Apr 27, 2000
21,612
10,817
136
"The AMD Kavari APU already has a fully coherent memory between the CPU and GPU."

I don't think that means they're cache coherent though.

Huh, guess they were, never mind. It never did much good, though.

@DisEnchantment

The software tools required to make use of SVM were pretty goony. I tried on my Kaveri ages ago. You had to run Linux to use the old HSA stack because Windows never got a working kernel fusion driver (kfd). I *think* you can sidestep that with OpenCL 2.0 buuuut I never dug into it. Carrizo is a slightly better incarnation of HSA hardware in that it permitted full support for HSA 1.1, or whichever was the last/latest version of the HSA spec.
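For reference, the OpenCL 2.0 route looks roughly like this (a sketch that assumes a context, queue and kernel already set up elsewhere; it uses coarse-grained buffer SVM, the baseline every OpenCL 2.0 device must support):

```cpp
#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>
#include <cstring>

// 'ctx', 'queue' and 'kernel' are assumed to exist; the kernel's first
// argument is a float buffer.
void run_with_svm(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                  const float* input, size_t n) {
    // Shared virtual memory: one allocation addressable by host and device.
    float* svm = static_cast<float*>(
        clSVMAlloc(ctx, CL_MEM_READ_WRITE, n * sizeof(float), 0));

    // Coarse-grained SVM still needs map/unmap around host access...
    clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, svm, n * sizeof(float),
                    0, nullptr, nullptr);
    std::memcpy(svm, input, n * sizeof(float));
    clEnqueueSVMUnmap(queue, svm, 0, nullptr, nullptr);

    // ...but the kernel takes the raw pointer instead of a cl_mem object.
    clSetKernelArgSVMPointer(kernel, 0, svm);
    size_t gws = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &gws, nullptr,
                           0, nullptr, nullptr);
    clFinish(queue);

    clSVMFree(ctx, svm);
}
```

On devices that expose fine-grained buffer SVM, the map/unmap pair can be dropped and host and device simply share the pointer, which is closer to the HSA model discussed above.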
 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,777
136
The software tools required to make use of SVM were pretty goony. I tried on my Kaveri ages ago.
Even now it is the same :(.

Getting ROCm working on my machine was such a nightmare.
If I install amdgpu, it conflicts with ROCm. If I don't have amdgpu, NoMachine cannot resize the window for me to connect remotely.
If I don't install ROCm, my code cannot use the GPU and is very slow.
ROCm keeps giving me "failed to allocate memory", but rocm-smi shows barely any usage :(
Libraries are reported missing even though I can see them right there; I make some symbolic links and force package installation to fix it.
What a mess. Now I am waiting for an update to support a newer kernel, because some other program needs a newer kernel (I'm trying out some newly upstreamed stuff). So I do one thing on one machine and another thing on another machine.
To work around this I have a gRPC service between the app on the machine with ROCm and the program on the machine with the newer kernel. :expressionless:
I tried Docker, but Linux does not have cgroup support for GPUs yet (it was being upstreamed, but there is a fight between the Intel and AMD devs over it :) ).

We have something else at work, so it is fine.
 

DrMrLordX

Lifer
Apr 27, 2000
21,612
10,817
136
@DisEnchantment

Is that using dGPUs? Let's face it, nVidia has the installed base they do for a reason. Their developer tools seem to work okay.

HSA was . . . actually I wrote a guide to making it work ages ago. Let's see if I can dig that up! Oh, here it is:


Those instructions are long-since deprecated. I don't think most of that software is even available anymore. AMD stopped using okra and . . . did other things prior to mostly throwing in the towel on HSA.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,777
136
Is that using dGPUs? Let's face it, nVidia has the installed base they do for a reason. Their developer tools seem to work okay.
Yes, using dGPUs.

Those instructions are long-since deprecated. I don't think most of that software is even available anymore. AMD stopped using okra and . . . did other things prior to mostly throwing in the towel on HSA.
amdkfd was rearchitected. Most likely it will not support the older SW stack.
 
  • Like
Reactions: Elfear

epsilon84

Golden Member
Aug 29, 2010
1,142
927
136
I am really tired of this gaming discussion.
The difference in this workload scenario is so negligible that people are arguing over the height of graph bars.
I'll tell you, if you can identify the system in a double-blind test across HU's 30-or-so-game benchmark suite, using resolutions and game settings appropriate to the hardware in the systems, I'll lick your pet's behind, tape it and upload it to YouTube.

Apart from worst case scenarios like Far Cry, I can't tell the difference between my R5 3600 and 8700K @ 5GHz either. Then again, I choose to game with IQ settings maxed or close to, so the GPU (5700XT) is the main bottleneck in almost all cases.

I'm not gonna put my head in the sand though and pretend that if I upgrade to Big Navi or GeForce 3000 series with +50% performance over current GPUs that it would be the same scenario.

AMD has done a fine job to this point in improving gaming on each Zen iteration, but if Ryzen 4000 actually overtakes my 8700K for gaming I would upgrade in a heartbeat.

Truthfully, I'm not holding my breath on that; if GN's latest CPU-bound gaming tests of the 3600XT vs the 10600K are anything to go by, AMD is likely more than one iteration away from dethroning Intel at gaming.

The problem is that by the time AMD actually has Skylake-beating gaming performance, you'd think Intel would be done milking their dead 14nm cow...

Yes, I'm aware that in 'blind test' scenarios like you described the difference might be negligible in most cases, at least with current gen GPUs. But that's a rather simplistic stance to take IMO, as the balance between being CPU bound and GPU bound can easily shift depending on the settings you use, and as faster GPUs come out in the coming years we'll ideally have 'faster-than-Skylake' CPUs as well to keep up the pace.
 
Last edited:

soresu

Platinum Member
Dec 19, 2014
2,650
1,853
136
Apart from worst case scenarios like Far Cry, I can't tell the difference between my R5 3600 and 8700K @ 5GHz either. Then again, I choose to game with IQ settings maxed or close to, so the GPU (5700XT) is the main bottleneck in almost all cases.

I'm not gonna put my head in the sand though and pretend that if I upgrade to Big Navi or GeForce 3000 series with +50% performance over current GPUs that it would be the same scenario.

AMD has done a fine job to this point in improving gaming on each Zen iteration, but if Ryzen 4000 actually overtakes my 8700K for gaming I would upgrade in a heartbeat.

Truthfully, I'm not holding my breath on that; if GN's latest CPU-bound gaming tests of the 3600XT vs the 10600K are anything to go by, AMD is likely more than one iteration away from dethroning Intel at gaming.

The problem is that by the time AMD actually has Skylake-beating gaming performance, you'd think Intel would be done milking their dead 14nm cow...

Yes, I'm aware that in 'blind test' scenarios like you described the difference might be negligible in most cases, at least with current gen GPUs. But that's a rather simplistic stance to take IMO, as the balance between being CPU bound and GPU bound can easily shift depending on the settings you use, and as faster GPUs come out in the coming years we'll ideally have 'faster-than-Skylake' CPUs as well to keep up the pace.
IMHO the very fact that any games are still CPU bound at all is a testament to just how badly optimised PC games are in that regard.

You would hope that a mostly standardised feature set with only a handful of well optimised common game engines (Unreal/Unity/CryEngine) would have erased this problem by now.

It seems like those engines need some kind of LTS version that freezes the feature set and concentrates on nothing but optimisation - a strategy I believe MS could also benefit from following with Windows 10, for that matter.