Info Effect of context switches on cache performance

Doug S


videogames101

It's surprising how L1 and L2 cache sizes have remained mostly static since the early 2000s. My first CPU was a single-core Athlon 64 3700+ with 1MB of L2. The venerable i5 2500K had a mere 256kB of L2 per core. All generations of Ryzen so far have 512kB of L2 per core. Intel's screaming 12900K has just 1.25MB of L2 per performance core, about the same as my ancient 3700+. This ignores the hugely increased L3 sizes, but consider that the big cores in the Apple M1 share a massive 12MB L2 (3MB per core on average across the four big cores, and a single busy core can use most of it), along with a typically large L3/SLC. I see this as one of Apple's key efficiency differentiators, since a large cache (while costly in die area) is one of the few microarchitectural features that increases IPC while decreasing power consumption. Will x86 follow and develop larger low-level caches? AMD seems happy to pursue a massive 3D-stacked L3 for now.
 

JoeRambo

This article by Chips And Cheese is great at discussing various technical tradeoffs between different sized caches.

Looking at history: Intel had a huge and very good L2 cache in the Core 2 Duo days. Penryn was on 45nm and ~110mm^2 with 6MB of L2 shared by two cores. So the M1, years later, has no more L2 per big core (12MB shared by four cores), despite being far more powerful and dealing with workloads that have much larger working sets.
Then, with Nehalem, Intel switched to a small L2 backed by 8MB of inclusive L3 shared by four cores, and furiously reduced die area from the initial 263mm^2 to 122mm^2 (Skylake 4C) with node shrinks, while milking the market thanks to the lack of competition. It was still the same 4C with 256kB of L2 per core and 8MB of shared L3.

So Intel stuck with these caches on client CPUs because market forces let them; nothing was stopping them from increasing L2 or L3 per core.

Market forces are against Intel now and they are furiously increasing the size of their caches. Good for us customers.
 

Thunder 57

What I always thought was interesting was that the L1 on K7/K8/K10 was huge at 128kB. Heck, the Duron had a larger L1 than L2! The L2 wasn't great on them, though. It was slower than Intel's, and the 1MB versions generally weren't much faster than the 512kB ones. These chips just weren't like the P4, which needed all the cache it could get. These days AMD and Intel seem to have decided that a smaller, lower-latency L2 is the way to go.
 

JoeRambo

Thunder 57 said: "These days AMD and Intel seemed to have decided a smaller, lower latency L2 is the way to go."
Back in the day, L2 served as the LLC; most of the time (except on some special CPUs) there was no L3 to fall back to. And it was very fast too. On Penryn it was ~15 cycles, incredibly tight for a 6MB cache that was also shared between two cores. Contemporary Atoms have 4MB of shared L2 at ~20 cycles and don't clock much higher; that's how tight Intel had to design the memory subsystem when its FSB architecture from 1985 was getting hammered by AMD.
Nehalem's L2 was just mediocre at ~12 cycles and a tiny 256kB. The saving grace of this design was the inclusive L3 and the IMC, which finally reduced overall memory latency while enabling very fast inter-core communication in the emerging multithreaded world. The pinnacle of this architecture was Ivy Bridge, which had the same 8MB of L3 and a tight IMC that could run DDR3 at ~2400, giving memory latencies that later generations could not touch because their decoupled uncore added latency for no good reason (on desktop).

I think the era of anemic caches completely private to cores is finally over. Everyone will have 1-2MB of L2, ARM, AMD, Intel, all of them.
 

Thunder 57

I was thinking about saying how Intel always had better cache in the past. AMD's ranged from decent to crap (Bulldozer). It wasn't really until Ryzen they had good cache. I will give them credit for Phenom though which led the way to the three level cache structure we have today. Though it did take 45nm and Phenom II to make the L3 large enough.

I'm honestly a bit surprised Zen 3 didn't go with a 1MB L2 per core. I almost wonder if 1MB will be enough with Zen 4. Who knows how they decided to spend die space on that though.
 

igor_kavinski

I can't wait for virtual L4 cache to make its debut in x86 architecture. Imagine how much cache is wasted sitting idle when one core is fully boosted in single threaded tasks, wasting time waiting to get recently evicted data from RAM when it could just have gotten it from the caches of its idle neighboring cores.
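
To illustrate the idea, here is a toy Python sketch. Everything in it is an assumption for illustration: the `LRU` class, the `simulate` function, the cache sizes and the looping access pattern are all made up, and the extra latency of reaching into another core's cache is not modelled at all. It only counts where each access would be served from when lines evicted from the busy core's cache spill into the idle neighbours' capacity.

```python
from collections import OrderedDict

class LRU:
    """Tiny fully associative LRU cache that stores line addresses only."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def take(self, addr):
        """Remove addr if present and report whether it was there."""
        return self.lines.pop(addr, None) is not None

    def insert(self, addr):
        """Insert addr as most recently used; return the evicted address, if any."""
        self.lines[addr] = True
        if len(self.lines) > self.capacity:
            victim, _ = self.lines.popitem(last=False)
            return victim
        return None

def simulate(trace, own_lines, neighbour_lines):
    """Count accesses served by the core's own cache, idle neighbours, or RAM."""
    own = LRU(own_lines)
    spill = LRU(neighbour_lines) if neighbour_lines else None
    counts = {"own": 0, "neighbour": 0, "ram": 0}
    for addr in trace:
        if own.take(addr):
            counts["own"] += 1
        elif spill is not None and spill.take(addr):
            counts["neighbour"] += 1
        else:
            counts["ram"] += 1
        # The line now lives in the busy core's cache; whatever it pushes out
        # is parked in the neighbours' idle capacity instead of being dropped.
        victim = own.insert(addr)
        if victim is not None and spill is not None:
            spill.insert(victim)
    return counts

# Hypothetical workload: loop ten times over 96 lines with a 64-line cache.
trace = [line for _ in range(10) for line in range(96)]
print("private only    :", simulate(trace, own_lines=64, neighbour_lines=0))
print("with idle spill :", simulate(trace, own_lines=64, neighbour_lines=192))
```

A looping working set slightly larger than the private cache is the worst case for plain LRU, so this pattern flatters the spill scheme. Whether it pays off on real hardware depends on how expensive those neighbour hits are compared to RAM.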
 

JoeRambo

I will just cite the Chips and Cheese investigation. Keep in mind that their "huge" cache carried only a rather minor latency penalty, which is completely unrealistic for a "virtual L4" scheme where you have to cross the boundaries of caches with different design goals:

With a giant 90 MB L3 cache, we see some IPC gains on average. But the distribution shows some funny characteristics. A lot of traces don’t benefit at all or see slight IPC losses. On the other extreme, a batch of traces get completely broken by the huge L3, and see over 40% IPC gains. That drags up the average.

In the set of Spec2006 traces, leslie3d and libquantum both see gigantic gains (45% and 93% respectively). lbm from the Spec2017 traces does too (70%). Omnetpp and cactuBSSN see lower gains with around a 20% IPC lead over the baseline config. But that’s well over an average generational jump. When this crazy VCache config wins, it can win by massive margins.
In most cases though, traces either see modest gains or modest losses from increased L3 latency. The second case is actually more common. Out of 214 traces, 111 lost IPC compared to the baseline.
So even if IBM has a crazy virtual cache level scheme, that does not mean a desktop chip running workloads for which the current cache is already enough will benefit from it.
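
The "average dragged up by a few outliers" effect is easy to reproduce. Below is a small Python sketch with made-up per-trace IPC changes (hypothetical numbers, not the Chips and Cheese data): most traces lose a little to the extra latency, a couple gain a lot, and the mean still comes out clearly positive while the median is slightly negative.

```python
from statistics import mean, median

# Hypothetical IPC change per trace in percent (NOT the real measurements):
# six traces lose slightly, three are flat or barely up, and two
# cache-starved traces gain massively.
ipc_change = [-2.0, -1.5, -1.0, -1.0, -0.5, -0.5, 0.0, 0.5, 1.0, 45.0, 93.0]

losers = sum(1 for x in ipc_change if x < 0)

print(f"mean   : {mean(ipc_change):+.1f}%")
print(f"median : {median(ipc_change):+.1f}%")
print(f"traces that lost IPC: {losers} of {len(ipc_change)}")
```

So a positive average on its own says little about whether a typical desktop workload would see any benefit.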
 

igor_kavinski

JoeRambo said: "So even if IBM has crazy virtual cache level scheme, it does not mean desktop chip with workloads where current cache is enough will benefit from it."
It could make sense in a gaming optimized SKU. Like SMT, the transistor cost for implementing it may not be too high compared to the gains it might make possible.
 

Mopetar

videogames101 said: "It's surprising how L1 and L2 cache sizes have remained mostly static since the early 2000s."
Making a cache larger increases the number of cycles required to access it, so there's always going to be a point where the added capacity actually results in a performance loss.
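
As a rough illustration, here is the usual average memory access time (AMAT) formula in Python. All latencies and miss rates below are hypothetical, picked only to show the shape of the tradeoff, and the `amat` helper is just a name I made up for this sketch.

```python
def amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, next_level):
    """Average memory access time in cycles for an L1 + L2 hierarchy."""
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * next_level)

# Hypothetical constants: 4-cycle L1 with a 10% miss rate, 60 cycles beyond L2.
L1_HIT, L1_MISS, NEXT = 4, 0.10, 60

# Workload whose data already fits the small L2: the bigger L2 barely
# reduces misses but every L2 hit costs 18 cycles instead of 12.
fits_small = amat(L1_HIT, L1_MISS, l2_hit=12, l2_miss_rate=0.10, next_level=NEXT)
fits_big = amat(L1_HIT, L1_MISS, l2_hit=18, l2_miss_rate=0.09, next_level=NEXT)

# Workload that spills out of the small L2: the bigger L2 removes many misses.
spills_small = amat(L1_HIT, L1_MISS, l2_hit=12, l2_miss_rate=0.50, next_level=NEXT)
spills_big = amat(L1_HIT, L1_MISS, l2_hit=18, l2_miss_rate=0.15, next_level=NEXT)

print(f"fits in small L2  : small {fits_small:.2f} vs big {fits_big:.2f} cycles")
print(f"spills small L2   : small {spills_small:.2f} vs big {spills_big:.2f} cycles")
```

With these made-up numbers the larger, slower L2 loses on the first workload and wins on the second, which is the crossover described above.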

Intel and AMD have also really been pushing clock speeds. That almost necessitates a smaller, faster cache or you have fast cores that spend a lot of time doing nothing because they're waiting for data to arrive.

Apple can get away with the larger cache because they have longer cycle times. If Apple ever did push clock speeds, they'd almost certainly have to increase the cycles required to access the cache or change up the design so that the cache can operate faster.
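
A quick way to see this: if the cache array needs a fixed amount of wall-clock time to respond, the number of cycles that takes grows with clock speed. The Python sketch below assumes a hypothetical 5 ns array latency and three example clocks; none of these are measured figures for any real chip.

```python
def cycles_needed(array_latency_ps, clock_mhz):
    """Whole cycles needed to cover a fixed array latency at a given clock."""
    # latency_ps * clock_mhz / 1e6 is the latency in cycles; the negated
    # floor division rounds it up using integer math only.
    return -(-array_latency_ps * clock_mhz // 1_000_000)

ARRAY_LATENCY_PS = 5000  # hypothetical: the same physical cache, 5 ns to respond

for clock_mhz in (3200, 4000, 5000):
    print(f"{clock_mhz / 1000:.1f} GHz -> {cycles_needed(ARRAY_LATENCY_PS, clock_mhz)} cycles")
```

The same cache costs more cycles at the higher clock, so a high-clocking core either accepts a longer L2 pipeline or shrinks the cache to make it physically faster.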

It's also not as simple as there being one ideal cache configuration. It really depends on the core you have, what kind of workloads are being run, and what other constraints (e.g. power) exist. For years, Apple's cores were wider than most other designs in terms of execution units, so even with the lower clock speeds they were still capable of chewing through a lot of instructions and would likely need that larger cache to avoid situations where they can't be kept fed.
 

igor_kavinski

That's why I would like to see different types of cores in a CPU, each specialized to be the fastest for a different type of workload. How many different general workloads can there be? When cores get small enough that we can easily have 100 cores in a mainstream CPU, I suppose that's when this type of approach will make sense.
 

Mopetar

We don't have 100 different core types, but with all of the dedicated hardware paths in most modern SoCs, it's effectively the same idea. It seems like a CPU core, an efficiency CPU core, a GPU core, and a neural network core are currently a good enough segmentation to handle most workloads on top of any dedicated hardware blocks.

Perhaps we'll see a few other specialized hardware components arise as more transistors become available or the type of work we want our devices to do evolves, but I'm not sure if we ever get to 100. 10 seems like a more likely limit only because dedicated hardware blocks can handle the oddly specific requirements better than a more generalized core.
 

beginner99

videogames101 said: "but consider that the big cores in the Apple M1 share a massive 12MB L2 (3MB per core on average across the four big cores) along with a typically large L3/SLC. I see this as one of Apple's key efficiency differentiators, since a large cache (while costly in die area) is one of the few microarchitectural features that increases IPC while decreasing power consumption."
Fully agree. I've written this multiple times when people worship the Apple CPUs. Apple can, and must, focus its design on client usage patterns, not on server usage patterns like x86. On top of that, Apple buys efficiency with die size. It can do that because the chips only go into expensive devices. There's no need to also be commercially viable in $300 craptops from Best Buy. So Apple can make a much more focused design for a limited usage scenario with a relatively high price floor.

Intel/AMD could not make huge L1 and L2 caches because if you pack in 32 or 64 cores, the die simply gets too big. This can now change with chiplets and more advanced packaging.
In terms of the M1 Ultra, it's again "chiplets" (well, not really, but not monolithic either; Zen 1-like) and only 16 cores. Imagine the die size of the CPU part with 64 cores, and that is on 5nm. I guess the L1/L2 doesn't do that much for server workloads, as they can simply be split into one thread per request, so more cores will often be better.
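
For a feel of the scaling, here is a back-of-the-envelope Python sketch that multiplies L2 per core by core count. The 512kB and 1.25MB figures are the ones quoted earlier in the thread; the 3MB entry is a hypothetical "big L2" core, and the sketch says nothing about die area or cost, only raw capacity.

```python
KB_PER_MB = 1024

def total_l2_mb(cores, l2_per_core_kb):
    """Total L2 capacity in MB if every core gets l2_per_core_kb of L2."""
    return cores * l2_per_core_kb / KB_PER_MB

per_core_kb = {
    "512kB (Ryzen, per the thread)": 512,
    "1.25MB (12900K, per the thread)": 1280,
    "3MB (hypothetical big-L2 core)": 3072,
}

for label, kb in per_core_kb.items():
    row = ", ".join(f"{n} cores: {total_l2_mb(n, kb):g}MB" for n in (8, 16, 32, 64))
    print(f"{label:32s} {row}")
```

At 64 cores the hypothetical big-L2 core needs several times the L2 capacity of the 512kB design before any L3 is counted, which is the kind of budget that only starts to look plausible with chiplets.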
 

TheELF

igor_kavinski said: "That's why I would like to see different types of cores in a CPU, each specialized to be the fastest for a different type of workload. How many different general workloads can there be? When cores start getting smaller to the point where we can easily have a 100 cores in the mainstream CPU, I suppose that's when this type of approach will make sense."
The overhead of managing that, plus the context switch time of moving to a core with a different architecture, would pretty much destroy any speed benefit.
Also, this is why CPUs have ISAs: it's specialty hardware to deal with specialty code.
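
A simple break-even model shows why. The Python sketch below assumes a fixed cost for moving a task to the specialized core (context switch plus refilling cold caches) and a fixed speedup once it is there. Both numbers, and the helper names, are hypothetical.

```python
def migrated_time_us(task_us, speedup, migration_us):
    """Run time if the task is moved to a specialized core first."""
    return task_us / speedup + migration_us

def break_even_us(speedup, migration_us):
    """Shortest task for which migrating is no slower than staying put."""
    # Solve task / speedup + migration = task for task.
    return migration_us / (1.0 - 1.0 / speedup)

SPEEDUP = 2.0        # hypothetical: specialized core is twice as fast
MIGRATION_US = 50.0  # hypothetical: switch cost plus cold-cache refill

print(f"break-even task length: {break_even_us(SPEEDUP, MIGRATION_US):.0f} us")
for task_us in (20.0, 1000.0):
    moved = migrated_time_us(task_us, SPEEDUP, MIGRATION_US)
    verdict = "worth it" if moved < task_us else "not worth it"
    print(f"{task_us:6.0f} us task -> {moved:.0f} us after migrating ({verdict})")
```

Short bursts of work never recover the migration cost, and the more core types there are, the more often the scheduler has to make that call.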
 
