- Jun 24, 2019
Let's take a deep dive into the factors affecting Skylake's performance in certain games and benchmarks. Skylake and other Intel CPUs have several performance counters that can be programmed with events, which are documented somewhat well in the Intel Software Developer's Manual (download the combined volume PDF from their site; start at page 3544 for the Skylake/Kaby Lake/Coffee Lake events).
Starting with IPC as a quick overview:
Right away, we can see my i5-6600K achieves far lower IPC on games than in benchmarks. Ok, why? Let's look at how often load instructions cause cache misses:
I feel like misses per 1K instructions gives a better representation of how cache misses impact performance, because it accounts for how often load instructions occur in the instruction stream. You can have a really bad cache hit rate, but if you have very few load instructions, it doesn't really matter.
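To make that concrete, here's a minimal sketch comparing miss rate with misses per 1K instructions. The counter values are made up for illustration (they aren't measurements from my runs), and the function names are just placeholders.

```python
def miss_rate(misses, loads):
    """Fraction of load instructions that missed the cache."""
    return misses / loads


def mpki(misses, instructions):
    """Misses per 1,000 retired instructions."""
    return 1000 * misses / instructions


# Two hypothetical workloads, 1M instructions each (made-up numbers).
# Both miss on 10% of their loads, but one issues 30x more loads.
load_heavy = {"instructions": 1_000_000, "loads": 300_000, "misses": 30_000}
load_light = {"instructions": 1_000_000, "loads": 10_000, "misses": 1_000}

for name, w in (("load-heavy", load_heavy), ("load-light", load_light)):
    rate = miss_rate(w["misses"], w["loads"])
    per_1k = mpki(w["misses"], w["instructions"])
    print(f"{name}: miss rate {rate:.0%}, {per_1k:.1f} misses per 1K instructions")
```

Both workloads show the same 10% miss rate, but the load-heavy one takes 30 misses per 1K instructions against 1 for the load-light one, which is the difference that actually shows up in performance.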
With all the benchmarks above except Geekbench's Dijkstra subtest, the i5-6600K's 6 MB L3 cache pretty much catches everything. Either the tests don't deal with big data sets, Skylake's prefetchers are really good, or both. The few games I tested, on the other hand, miss L3 quite a bit.
Now a look at instruction cache misses:
That 32 KB L1 instruction cache is just not big enough for games, Cinebench, and a couple of tests in Geekbench (LLVM, SQLite). And instruction cache misses hurt more than data cache misses. With a data cache miss, you might have enough other instructions in flight to keep the execution units busy. An instruction cache miss cuts off the supply of new instructions, which reduces the amount of work the out-of-order execution engine can look at. GTA V and Overwatch really suffer here, with code fetches often missing L2 too.
Quick conclusion: Cache misses are a much larger problem with games than with some popular benchmarks. Games suffer heavily from both data and instruction cache misses. If we want to improve gaming performance, we need bigger caches, plus faster RAM to cover the cases when we do miss in the last level cache.
Methodology:
I used both Intel's VTune software (available for free with a community license) and their open source PCM tool. PCM was used to get all the Geekbench data; I didn't use VTune there because it reported really bad mux reliability for Geekbench. VTune tries to collect data on a lot of performance events, but the CPU only has four programmable counters and three fixed-function counters (instructions retired, unhalted cycles, reference cycles). So VTune programs in a set of events, collects some data, reprograms the counters with another set of events, collects data, and so on. Somehow, it knows when this muxing isn't giving good accuracy, and that seems to be the case for Geekbench. VTune was used to get data from Overwatch, ESO, GTA V, and Cinebench.
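The muxing idea can be sketched roughly like this. It's a simplified model, not VTune's actual implementation: the group size of four comes from the programmable counter count mentioned above, and the scaling step (raw count times enabled time over running time) is the extrapolation Linux perf applies to multiplexed events. I'm only assuming VTune does something similar. All names and numbers here are made up.

```python
def make_groups(events, counters=4):
    """Split an event list into groups that fit the programmable counters."""
    return [events[i:i + counters] for i in range(0, len(events), counters)]


def scale_count(raw_count, time_enabled, time_running):
    """Extrapolate a count observed during time_running to all of time_enabled."""
    if time_running == 0:
        return None  # the event never got a counter, so there is no estimate
    return raw_count * time_enabled / time_running


# Six hypothetical events need two groups on a four-counter CPU.
events = ["EV_A", "EV_B", "EV_C", "EV_D", "EV_E", "EV_F"]
groups = make_groups(events)
print(groups)

# An event that was only on a counter for 2.5 s of a 10 s run (made-up numbers).
print(scale_count(1000, time_enabled=10.0, time_running=2.5))
```

The extrapolation only holds up if the workload behaves about the same while an event is off the counters as while it's on them. A benchmark that moves through short, very different phases would break that assumption, which may be why Geekbench muxes so badly, though that's a guess on my part.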
IPC is calculated as (instructions retired) / (core clock cycles when the logical processor is not in a halt state). That accounts for fluctuations in clock frequency and CPU idle time: if the OS doesn't schedule work on a CPU core, the core is halted and those cycles aren't counted.
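As a quick sketch of why the denominator matters, here's the IPC calculation with made-up counter deltas (not numbers from my runs). The second call deliberately divides by wall-clock cycles to show how idle time would drag the figure down.

```python
def ipc(instructions_retired, cycles):
    """Instructions retired per cycle, for whichever cycle count is passed in."""
    return instructions_retired / cycles


# Hypothetical counter deltas over a one-second sample (made-up numbers).
instructions = 2_400_000_000
unhalted_cycles = 2_000_000_000    # cycles where the core was not halted
wall_clock_cycles = 3_500_000_000  # 3.5 GHz for 1 s, idle time included

print(ipc(instructions, unhalted_cycles))    # the IPC figure used in this post
print(ipc(instructions, wall_clock_cycles))  # diluted by time spent halted
```

Using unhalted cycles means a core that sat idle for part of the sample doesn't look like it was executing slowly.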
Data cache misses are counted with:
- L1 Misses = event 0xD1, umask 0x08 = MEM_LOAD_RETIRED.L1_MISS = "Retired load instructions missed L1 cache as data sources." (Grammar in that manual can be atrocious at times.)
- L2 Misses = event 0xD1, umask 0x10 = MEM_LOAD_RETIRED.L2_MISS = "Retired load instructions missed L2. Unknown data source excluded."
- L3 Misses = event 0xD1, umask 0x20 = MEM_LOAD_RETIRED.L3_MISS = "Retired load instructions missed L3. Excludes unknown data sources."
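For anyone programming these by hand, here's a rough sketch of how an event/umask pair gets packed into an IA32_PERFEVTSELx register value. The bit positions follow the layout in the SDM's performance monitoring chapter (event select in bits 0-7, umask in bits 8-15, USR in bit 16, OS in bit 17, EN in bit 22). The helper name and the dictionary are my own, and PCM and VTune take care of this for you.

```python
USR_BIT = 1 << 16  # count while running in user mode
OS_BIT = 1 << 17   # count while running in kernel mode
EN_BIT = 1 << 22   # enable the counter


def perfevtsel(event, umask, count_user=True, count_kernel=True):
    """Pack an event select and unit mask into an IA32_PERFEVTSELx value."""
    value = (event & 0xFF) | ((umask & 0xFF) << 8) | EN_BIT
    if count_user:
        value |= USR_BIT
    if count_kernel:
        value |= OS_BIT
    return value


# The three data cache miss events listed above.
data_miss_events = {
    "MEM_LOAD_RETIRED.L1_MISS": (0xD1, 0x08),
    "MEM_LOAD_RETIRED.L2_MISS": (0xD1, 0x10),
    "MEM_LOAD_RETIRED.L3_MISS": (0xD1, 0x20),
}

for name, (event, umask) in data_miss_events.items():
    print(f"{name}: {perfevtsel(event, umask):#08x}")
```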
Instruction cache misses are counted with:
- L1i Miss = event 0x83, umask 0x02 = ICACHE_64B.IFTAG_MISS = "Instruction fetch tag lookups that miss in the instruction cache (L1i). Counts at 64-byte cache-line granularity."
- L2 Code Miss = event 0x24, umask 0x24 = L2_RQSTS.CODE_RD_MISS = "L2 cache misses when fetching instructions"
What happened in games:
- GTA V: some city driving, which tends to peg the CPU at times and hit some stutters
- Overwatch: a game of quick play
- ESO: running the Alik'r Desert dolmens, which have a mob of players farming them. Framerate often drops into the 30s or below. On a vet Halls of Fabrication raid run late last year, I recorded 0.88 IPC. Framerate was worse back then, but I don't raid anymore. Too many bugs...
- Dark Souls III: continuously parrying Gundyr (the first boss) and almost punching him to death before I get wrecked.