Discussion: Skylake Microarchitecture and Benchmarks/Games

chlamchowder

Junior Member
Jun 24, 2019
Let's take a deep dive into the factors affecting Skylake's performance on certain games and benchmarks. Skylake and other Intel CPUs have several performance counters that can be programmed with events documented reasonably well in the Intel Software Developer's Manual (download the combined-volume PDF from their site; Skylake/Kaby Lake/Coffee Lake events start at page 3544).

Starting with IPC as a quick overview:
[Chart: IPC across benchmarks and games]

Right away, we can see my i5-6600K achieves far lower IPC in games than in benchmarks. OK, why? Let's look at how often load instructions cause cache misses:

[Chart: load misses per 1K instructions, per cache level]

I feel like misses per 1K instructions (MPKI) gives a better representation of how cache misses impact performance, because it accounts for how often load instructions occur in the instruction stream. You can have a really bad cache hit rate, but if you have very few load instructions, it doesn't really matter.
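To make that concrete, here's a tiny C sketch (the mpki helper and the numbers are my own, not output from any tool mentioned here) showing how a workload can have an awful hit rate yet a negligible MPKI:

```c
#include <stdint.h>
#include <stdio.h>

/* Misses per 1K instructions: normalize misses by total instruction
 * count instead of by load count. (mpki is my own helper name.) */
static double mpki(uint64_t misses, uint64_t instructions) {
    return 1000.0 * (double)misses / (double)instructions;
}

int main(void) {
    /* Hypothetical workload: terrible 50% hit rate, but loads are
     * rare, so the normalized impact on the pipeline is tiny. */
    uint64_t instructions = 1000000, loads = 2000, misses = 1000;
    printf("miss rate: %.1f%%\n", 100.0 * misses / loads);   /* 50.0% */
    printf("MPKI:      %.2f\n", mpki(misses, instructions)); /* 1.00  */
    return 0;
}
```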

With all the benchmarks above except Geekbench's Dijkstra sub-test, the i5-6600K's 6 MB L3 cache pretty much catches everything. Either the tests don't deal with big data sets, or Skylake's prefetchers are really good, or both. The few games I tested, on the other hand, do miss L3 quite a bit.

Now a look at instruction cache misses:
[Charts: L1i and L2 code misses per 1K instructions]
That 32 KB L1 instruction cache is just not big enough for games, Cinebench, and a couple of tests in Geekbench (LLVM, SQLite). And instruction cache misses hurt more than data cache misses. With a data cache miss, you might have enough other instructions in flight to keep the execution units busy. An instruction cache miss reduces the amount of work the out-of-order execution engine can look at in the first place. GTA V and Overwatch really suffer here, with code fetches often missing L2 too.

Quick conclusion: Cache misses are a much larger problem for games than for some popular benchmarks. Games suffer heavily from both data and instruction cache misses. If we want to improve gaming performance, we need bigger caches, and faster RAM to cover the cases where we still miss in the last-level cache.

Methodology:
I used both Intel's VTune software (available for free with a community license) and their open-source PCM tool. PCM was used to get all the Geekbench data; I didn't use VTune there because it reported really bad mux reliability for Geekbench. VTune tries to collect data on a lot of performance events, but the CPU only has four programmable counters and three fixed-function counters (instructions retired, unhalted cycles, reference cycles). So VTune programs in a set of events, collects some data, reprograms the counters with another set of events, collects more data, and so on. Somehow it knows when this muxing isn't leading to good accuracy, and that seems to be the case for Geekbench. VTune was used to get data from Overwatch, ESO, GTA V, and Cinebench.
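As an aside for anyone reproducing this on Linux instead: the perf_event_open interface exposes the same multiplexing tradeoff directly, and you can judge reliability yourself from how long each event actually owned a hardware counter. A minimal sketch (my own, not how VTune computes its metric internally):

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* When the kernel multiplexes more events than there are hardware
 * counters, each event only runs part of the time. The usual fix is
 * to scale the raw count; time_running / time_enabled is effectively
 * the "mux reliability" VTune complains about. */
struct read_format {    /* layout when the event is opened with
                           PERF_FORMAT_TOTAL_TIME_ENABLED |
                           PERF_FORMAT_TOTAL_TIME_RUNNING */
    uint64_t value;
    uint64_t time_enabled;
    uint64_t time_running;
};

static uint64_t scaled_count(const struct read_format *r) {
    if (r->time_running == 0)
        return 0;       /* event never got scheduled onto a counter */
    return (uint64_t)((double)r->value *
                      (double)r->time_enabled / (double)r->time_running);
}
```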

IPC is calculated as (instructions retired) / (core clock cycles when the logical processor is not in a halt state). That accounts for fluctuations in clock frequency and for CPU idle time: if the OS doesn't schedule work on a CPU core, that core sits in a halted state.
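For anyone who wants to reproduce this outside VTune/PCM, here's a rough C sketch using Linux's perf_event_open (an assumption on my part; I measured on Windows). The generic events below correspond, on Intel hardware, to the fixed counters mentioned earlier:

```c
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static int open_counter(uint64_t config) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    /* pid = 0: measure this process; cpu = -1: whichever CPU it runs on */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void) {
    int instr  = open_counter(PERF_COUNT_HW_INSTRUCTIONS); /* ~INST_RETIRED.ANY */
    int cycles = open_counter(PERF_COUNT_HW_CPU_CYCLES);   /* ~CPU_CLK_UNHALTED */
    ioctl(instr, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(cycles, PERF_EVENT_IOC_ENABLE, 0);

    /* ... workload under test goes here ... */

    ioctl(instr, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(cycles, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t i = 0, c = 0;
    read(instr, &i, sizeof(i));
    read(cycles, &c, sizeof(c));
    /* Unhalted cycles in the denominator means idle time and frequency
     * changes don't distort the ratio. */
    printf("IPC = %.2f\n", c ? (double)i / (double)c : 0.0);
    return 0;
}
```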
Data cache misses are counted with:
  • L1 Misses = event 0xD1, umask 0x08 = MEM_LOAD_RETIRED.L1_MISS = "Retired load instructions missed L1 cache as data sources." (Grammar in that manual can be atrocious at times.)
  • L2 Misses = event 0xD1, umask 0x10 = MEM_LOAD_RETIRED.L2_MISS = "Retired load instructions missed L2. Unknown data source excluded."
  • L3 Misses = event 0xD1, umask 0x20 = MEM_LOAD_RETIRED.L3_MISS = "Retired load instructions missed L3. Excludes unknown data sources."
These events count retired instructions, which excludes instructions that never retire (for example, those fetched after a mispredicted branch). They also don't give any info about how effective prefetches are. Another caveat: if an instruction requests data that results in a cache miss, subsequent instructions requesting data from the same 64-byte cache line count as fill buffer hits, not separate cache misses. Thus, the L2/L3 miss counters don't fully capture the impact of those cache misses. You could have a dozen instructions in close proximity requesting data that's not in L3; the first counts as an L3 miss, and the rest count as fill buffer hits. Ugh.
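If you'd rather count these with Linux perf than with VTune/PCM (again, just an alternative I'm sketching, not what I used), the event/umask pairs above translate into raw configs as below; the instruction-side events in the next section encode the same way. The SKL_RAW_EVENT macro name is my own.

```c
#include <stdint.h>

/* Raw config encoding as Linux perf and perf_event_open (PERF_TYPE_RAW)
 * expect it on x86: bits 7:0 = event select, bits 15:8 = umask.
 * (Assumes no cmask/inv/edge bits are needed, true for these events.) */
#define SKL_RAW_EVENT(event, umask) (((uint64_t)(umask) << 8) | (uint64_t)(event))

#define MEM_LOAD_RETIRED_L1_MISS  SKL_RAW_EVENT(0xD1, 0x08)
#define MEM_LOAD_RETIRED_L2_MISS  SKL_RAW_EVENT(0xD1, 0x10)
#define MEM_LOAD_RETIRED_L3_MISS  SKL_RAW_EVENT(0xD1, 0x20)

/* Usage: attr.type = PERF_TYPE_RAW; attr.config = MEM_LOAD_RETIRED_L3_MISS;
 * then open with perf_event_open() exactly as in the IPC sketch above. */
```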

Instruction cache misses are counted with:
  • L1i Miss = event 0x80, umask 0x02 = ICACHE_64B.IFTAG_MISS = "Instruction fetch tag lookups that miss in the instruction cache (L1i). Counts at 64-byte cache-line granularity"
  • L2 Code Miss = event 0x24, umask 0x24 = L2_RQSTS.CODE_RD_MISS = "L2 cache misses when fetching instructions"
I'm still working on trying to count L3 code misses. There's OFFCORE_RESPONSE:request=DEMAND_CODE_READ:response=L3_MISS.ANY_SNOOP, but adding it to response=L3_HIT.ANY_SNOOP gives a count greater than L2_RQSTS.CODE_RD_MISS. That doesn't make any sense, especially since the sum of L2 code read misses and hits approximately equals L1i misses. These are also speculative events, meaning code fetches after a mispredicted branch that are later discarded still get counted.

What happened in games:
  • GTA V: some city driving, which tends to peg the CPU at times and hit some stutters
  • Overwatch: a game of quick play
  • ESO: running the Alik'r Desert dolmens, which have a mob of players farming them. Framerate often drops into the 30s or below. On a veteran Halls of Fabrication raid run late last year, I recorded 0.88 IPC. Framerate was worse back then, but I don't raid anymore. Too many bugs...
  • Dark Souls III: continuously parrying Gundyr (the first boss) and almost punching him to death before I get wrecked.
Finally, an ask for reviewers (e.g., AnandTech): can we get detailed analysis of benchmarked applications on different CPUs, so we can better understand how they stress different aspects of CPU microarchitectures? Tools are free from both Intel and AMD (CodeXL), and both publish docs listing the performance monitoring events we can track.
 

yerpuh

Junior Member
Jun 20, 2019
i7 Haswell, Broadwell, and Skylake all felt like this to me.

Lower fps and micro-stutters when crossing open-world zones, towns, etc.

Overclocks (RAM/GPU included) never helped much.

GTA V on my i9 is the opposite now; even heavily modded, I can't make it hitch anywhere.

The new Ryzens are probably pretty smooth too; I imagine they have the same type of fixes.

(P.S. Driver Fusion on Steam and IObit's DB6 are free, if you want newer/faster drivers than Windows Update. They both make restore points, and they're not adware/botnets.)
 

chlamchowder

Junior Member
Jun 24, 2019
yerpuh said:
i7 Haswell, Broadwell, and Skylake all felt like this to me.

Lower fps and micro-stutters when crossing open-world zones, towns, etc.

Overclocks (RAM/GPU included) never helped much.

GTA V on my i9 is the opposite now; even heavily modded, I can't make it hitch anywhere.

The new Ryzens are probably pretty smooth too; I imagine they have the same type of fixes.

(P.S. Driver Fusion on Steam and IObit's DB6 are free, if you want newer/faster drivers than Windows Update. They both make restore points, and they're not adware/botnets.)

What fixes are you referring to on the new Ryzen, or on the i9, that help with the low fps/stutter problems? If we can measure the impact of a larger cache, that'd be interesting. On a Haswell system with 10 MB of LLC, I get about 1.67 LLC misses per 1K instructions. That's a small improvement over the 6600K's 1.76, but I also didn't do a well-controlled test, and I no longer have GTA V working on that HSW system anyway.

Per-thread IPC was also lower on Haswell, but that's likely because of hyper-threading.

But anyway, a 66% bigger L3 cache reduced misses by less than 10%. That's consistent with diminishing returns from increasing cache size (just Google for more detailed research on that topic).
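For a rough sanity check, one classic rule of thumb from that research (my assumption that it applies here) says miss rate scales with roughly the inverse square root of cache size:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* Square-root rule of thumb: MPKI ~ (cache size)^-0.5.
     * Scale the 6 MB / 1.76 MPKI measurement up to 10 MB. */
    double predicted = 1.76 * sqrt(6.0 / 10.0);
    printf("predicted 10 MB MPKI: %.2f (measured: 1.67)\n", predicted);
    /* Prints ~1.36: the rule expects a ~23% drop, while I measured
     * only ~5%, i.e. even weaker returns than the rule suggests -
     * plausible given the uncontrolled comparison across systems. */
    return 0;
}
```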
 

yerpuh

Junior Member
Jun 20, 2019
chlamchowder said:
What fixes are you referring to on the new Ryzen, or on the i9, that help with the low fps/stutter problems? If we can measure the impact of a larger cache, that'd be interesting. On a Haswell system with 10 MB of LLC, I get about 1.67 LLC misses per 1K instructions. That's a small improvement over the 6600K's 1.76, but I also didn't do a well-controlled test, and I no longer have GTA V working on that HSW system anyway.

Per-thread IPC was also lower on Haswell, but that's likely because of hyper-threading.

But anyway, a 66% bigger L3 cache reduced misses by less than 10%. That's consistent with diminishing returns from increasing cache size (just Google for more detailed research on that topic).

(Great presentation btw, very knowledgeable! I learned a lot, thanks!)

I'm not much of an expert, but I believe that if you have limited memory, say 8 or 16 GB, you will see Windows compressing memory a lot more. Compression and decompression use up CPU cycles. You could turn off memory compression in Windows, but that led to bigger problems at times.

There were several hardware bugs, like the standby memory cache and timer resolution issues. Big open-world games were at the forefront, stuttering a lot. People blamed the Windows Creators Update at the time, but Microsoft never changed their code because it was actually correct; they simply gave out a temporary workaround (though you could still experience micropauses). Intel's and AMD's hardware supposedly addressed these limitations in newer chipsets.

https://www.ghacks.net/2018/10/29/r...-games-with-intelligent-standby-list-cleaner/
 

chlamchowder

Junior Member
Jun 24, 2019
yerpuh said:
(Great presentation btw, very knowledgeable! I learned a lot, thanks!)

I'm not much of an expert, but I believe that if you have limited memory, say 8 or 16 GB, you will see Windows compressing memory a lot more. Compression and decompression use up CPU cycles. You could turn off memory compression in Windows, but that led to bigger problems at times.
Yes, using a bit more CPU time to compress or decompress is better than going to disk (even an SSD). The alternative to memory compression when you run out of memory is swapping to disk, and I can't imagine swapping being faster than decompression, unless you have a ridiculously unbalanced setup like Optane DIMMs paired with a 1.0 GHz Atom.

yerpuh said:
There were several hardware bugs, like the standby memory cache and timer resolution issues. Big open-world games were at the forefront, stuttering a lot. People blamed the Windows Creators Update at the time, but Microsoft never changed their code because it was actually correct; they simply gave out a temporary workaround (though you could still experience micropauses). Intel's and AMD's hardware supposedly addressed these limitations in newer chipsets.

https://www.ghacks.net/2018/10/29/r...-games-with-intelligent-standby-list-cleaner/
There are several things going on here, so let's tackle them one by one:
  1. Standby memory cache, or standby list: this is simply a cache of file data from disk. I suppose it could cache other things, because standby is a pretty generic term, but I can't imagine what else. A large standby list should only improve performance, by increasing the chance that files you want are already in memory (instead of being loaded from disk).

    Perhaps Windows was really bad at traversing a very large standby list? But that's a software problem. Does any evidence point to a hardware problem?
  2. Timer resolution: you mean the high precision event timer (HPET) issue with Ryzen chipsets? I'm curious whether anyone ever pinpointed its effects. It should be as simple as running Windows Performance Recorder and looking for an unreasonable amount of DPC/ISR time related to chipset drivers.

    Checking over the AnandTech article again (https://www.anandtech.com/show/12678/a-timely-discovery-examining-amd-2nd-gen-ryzen-results/4), Ryzen saw less of a performance impact when HPET was forced on. But they also note that Meltdown patches and a higher timer frequency could explain the Intel results.