Discussion Apple Silicon SoC thread


Eug

Lifer
Mar 11, 2000
24,175
1,815
126
M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 Teraflops
82 Gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core count). Basically, Apple is taking the same approach with these chips as it does with the iPhones and iPads: just one SKU (excluding the X variants), which is the same across all iDevices (aside from maybe slight clock speed differences occasionally).

EDIT:


M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:


M1 Ultra discussion here:


M2 discussion here:


M2
Second-generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K H.264, HEVC (H.265), and ProRes

M3 Family discussion here:


M4 Family discussion here:


M5 Family discussion here:

 
Last edited:

GC2:CS

Member
Jul 6, 2018
39
21
81
Hi anandGPT! Would you mind explaining this trace cache stuff and why it is such a big deal? Is it obvious from the die shot?

Explain to a dummy level, multiple posts, maximum humanity, no use of external links

Thanks !
 

mikegg

Platinum Member
Jan 30, 2010
2,091
633
136

M2 MacBook Air 16GB for $650. Probably the best laptop deal right now. Ridiculously good for such a low price. I think the MacBook with the iPhone chip for $599 doesn't sound as ridiculous anymore, considering the crazy good margins Apple must have if Best Buy can sell one this low.
 
  • Like
Reactions: Mopetar

Cardyak

Member
Sep 12, 2018
85
204
106
Hi anandGPT! Would you mind explaining this trace cache stuff and why it is such a big deal? Is it obvious from the die shot?

Explain to a dummy level, multiple posts, maximum humanity, no use of external links

Thanks !

I'd be interested in learning more about this as well. My simplistic understanding is that a trace cache stores the dynamic execution of instructions, rather than the actual instructions in a static sense.

So for example, if you have a loop that you iterate over 4 times, it will appear in the trace cache 4 times, but in a micro-op cache only once. But as I already mentioned, I'm not too au fait with this and my assumption could be incorrect. I'm unsure of the benefits and drawbacks of such a system as well. I assume a trace cache is less efficient in terms of storage as you are storing the same "static" code over and over again when it's executed, but there must presumably be some benefits (maybe around extracting parallelism or resolving branches?)
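
To make that concrete, here's a toy picture of what I have in mind (purely illustrative, and based on my possibly-wrong understanding above):

    int data[4] = {1, 2, 3, 4};
    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += data[i];          // body is roughly: load, add, compare-and-branch

    // Micro-op cache (static view): the loop body is stored once and replayed
    // each iteration.
    // Classic trace cache (dynamic view): the cache stores the path as it
    // actually executed, so the same load/add/branch can appear several times
    // back-to-back within a single trace.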

I'm sure someone like @name99 can offer more details.
 

Nothingness

Diamond Member
Jul 3, 2013
3,365
2,457
136
I'd be interested in learning more about this as well. My simplistic understanding is that a trace cache stores the dynamic execution of instructions, rather than the actual instructions in a static sense.

So for example, if you have a loop that you iterate over 4 times, it will appear in the trace cache 4 times, but in a micro-op cache only once. But as I already mentioned, I'm not too au fait with this and my assumption could be incorrect. I'm unsure of the benefits and drawbacks of such a system as well. I assume a trace cache is less efficient in terms of storage as you are storing the same "static" code over and over again when it's executed, but there must presumably be some benefits (maybe around extracting parallelism or resolving branches?)

I'm sure someone like @name99 can offer more details.
Perhaps the patent linked here can help?
 

Jan Olšan

Senior member
Jan 12, 2017
620
1,254
136

M2 MacBook Air 16GB for $650. Probably the best laptop deal right now. Ridiculously good for such a low price. I think the MacBook with the iPhone chip for $599 doesn't sound as ridiculous anymore, considering the crazy good margins Apple must have if Best Buy can sell one this low.
256 GB fixed storage you can't expand or replace.
 

name99

Senior member
Sep 11, 2010
684
576
136
I'd be interested in learning more about this as well. My simplistic understanding is that a trace cache stores the dynamic execution of instructions, rather than the actual instructions in a static sense.

So for example, if you have a loop that you iterate over 4 times, it will appear in the trace cache 4 times, but in a micro-op cache only once. But as I already mentioned, I'm not too au fait with this and my assumption could be incorrect. I'm unsure of the benefits and drawbacks of such a system as well. I assume a trace cache is less efficient in terms of storage as you are storing the same "static" code over and over again when it's executed, but there must presumably be some benefits (maybe around extracting parallelism or resolving branches?)

I'm sure someone like @name99 can offer more details.
I will describe the problem a trace cache solves for Apple. Note that
1. other companies have different problems (generally decoding variable length instructions), and solve them with things they call a trace cache, but what they are doing with this "trace cache" is very different from Apple.
2. I will describe how Apple performs Fetch. Again, other companies have evolved their way to something that is kinda like Apple, but as the result of evolution rather than design, their solutions are generally not as elegant and not as performant, and use different terminology.

So here's the Fetch problem. Your CPU is 10-wide. That means that EVERY CYCLE, ideally, you want to pull in ten instructions to feed to the rest of the CPU. This has multiple difficulties, so let's start with the older parts before we get to the trace cache.

You can think of the executed stream of instructions as consisting of sequential blocks of instructions separated by branches, some of which are taken, some of which are not. (Taken branches include things like goto, function call, and return, so they may not be conditional.)
Sequential instructions are situated one after the other, within a single cache line, or straddling two cache lines.
So at this stage our problem looks essentially something like: every cycle
1. send this cycle's PC to the instruction cache to load a bunch of sequential instructions
2. simultaneously send this cycle's PC to the Fetch Predictor, which uses that as an index to look up the predicted PC for the next cycle. What's actually looked up is the "characteristics" of the next block of sequential instructions. This includes the PC of this block, the length of the block, and the types of branches within the block including the last branch.
- Let's amplify the above points. Obviously a block of sequentially executed instructions has an address, the PC for this block. And obviously a block of sequentially executed instructions has a length, so the hit against the I-cache knows how many instructions to load. But a block of say 15 sequentially executed instructions may also include say a conditional branch that is expected to not be taken, so we can pull in the full 15 instructions, not end Fetch at instruction 7, and feed the whole block of 15 on to the next pipeline stage (via a queue that holds the extra instructions, if the next stage can say only handle 8 or 10 instructions). It's an important point to load as many instructions as you can when you can, to make up for the cycles where you only load zero to a few instructions.

- So our Fetch predictor has predicted that we can load a run of 15 instructions with a conditional branch in the middle. But the Fetch predictor is very simple (basically a lookup table that records what past Fetch Sequences have looked like) so maybe that conditional branch in the middle could be incorrectly predicted? There are subsequent branch predictors of varying sophistication that handle this. They will kick in (they know to kick in because, remember, the Fetch Predictor includes records of branches in the middle of a Fetch Sequence) in later cycles [basically they are working at the same time that the request is going out to the I-cache and then returning].
If one of these more sophisticated, slower, branch predictors (which predict based not just on what happened last time we accessed this PC, but also on the entire history of branches and other behavior that led up to this conditional branch) disagrees with the Fetch Predictor, then ultimately what was fetched up till instruction 7 will be kept in the Instruction Queue, what came after instruction 7 will be thrown away, the PC will be reset to the (hopefully correct) taken PC, and things will restart at that PC.

This is NOT the same thing as a branch misprediction!
This is a much lighter-weight operation, often called a resteer. We lose two or three cycles of fetch, and we have to throw away some instructions in the Instruction Queue, that is all. This is a case of a better predictor overriding a fast but limited-accuracy earlier predictor. True misprediction occurs later, when we actually execute the branch inside the core; then we see the real taken vs not taken, and if the value we guessed with our final (slowest, most accurate) predictor was incorrect, then we have to flush everything in the machine that was done after this inaccurate prediction. That takes ~20 cycles or so, and can throw away hundreds of partially executed instructions.

OK, so at this stage of the game we have a
- Fetch Predictor that generates (every cycle) a prediction for the next Fetch Sequence (a PC, a length, an indication of the non-taken branches within the Sequence; and how the Sequence ends)

A Fetch Sequence usually ends at a taken branch. (Maybe predicted taken, maybe unconditionally taken, like a goto, a function call, or a return). A taken branch means a new location in the I-cache, and if the I-cache can only provide one cache line per cycle, then we have to stop at this point, so that next cycle we can feed that new, taken, PC to the I-cache.
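
If it helps to picture it, a single Fetch Predictor entry is conceptually something like the struct below. The field names and sizes are my own invention for illustration; Apple obviously doesn't publish the real format.

    #include <stdint.h>

    // Hypothetical Fetch Predictor entry -- illustrative only, not Apple's actual format.
    struct FetchPredictorEntry {
        uint64_t start_pc;         // PC of the first instruction of this Fetch Sequence
        uint8_t  length;           // how many sequential instructions to pull in (say up to 16)
        uint8_t  not_taken_pos[2]; // where predicted-not-taken branches sit inside the Sequence
        uint8_t  end_branch_type;  // how the Sequence ends: conditional, call, return, unconditional, none
        uint64_t next_pc;          // predicted PC to feed to Fetch next cycle
        // extras stored purely to save energy (described below):
        uint64_t line_phys_addr;   // physical address of the target line, so we can skip the I-TLB
        uint8_t  icache_way;       // which I-cache way holds the line, so we only read one way
    };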

Before we go further, make sure you have the above scheme firmly understood.
There are a number of optimizations Apple makes to this. I won't describe them all, only the ones that I think might help you understand the important points a little better.
In the Fetch Predictor, along with the essential items I've already described, Apple also stores some items to save energy. For example the physical address and the way of the target cache line are recorded, that way we don't have to waste energy accessing the I-TLB or querying multiple ways in the I-cache.
Also the type of branch that ends a trace is recorded because different such branches have to be handled differently. A function call is easy, just set the PC for next cycle to the address of the function. A conditional branch also requires the branch address to be sent to the fancy branch prediction machinery to validate (hopefully...) the prediction that this conditional branch was taken. A return has to be sent to the return address stack to pop off the address and thereby provide the PC for the next cycle fetch.

There is also the (somewhat rare case) where an Instruction Sequence does not end with a taken branch, it just runs on. This can happen if the Instruction Sequence is unusually long (let's say longer than 16 instructions) and also to handle a few unusual cases (eg if the instruction sequence crosses a page boundary, since now you need two Fetch Predictor entries to handle the two page lookups).

OK, now let's think about the performance of this machinery.
On average there's a branch about every 6 cycles or so in code.
So the most naive sort of model of what I have described (that broke Fetch at every branch) would only be able to load about 6 instructions per cycle on average. And in fact it would be a little worse because some Instruction Sequences cross a cache line boundary.
So our first improvement is to boost the I-cache a little (which then requires things like storing two ways in the Fetch Predictor) so that we can deliver a Fetch Sequence that crosses two cache lines. And we allow our Fetch Predictor to pull in long runs (let's say up to 16 instructions long) just to balance the very frequent cases where runs are short (say only three or four instructions between branches). We also provide a large Instruction Queue to hold these long runs, to build up a buffer of instructions we can continue to Decode even if Fetch has to glitch for a cycle or two.

Next we allow Fetch to pull in Instruction Sequences that cross untaken branches, as described above. To add this functionality we also need to now include
- a record of where this untaken branch (or maybe two?) is/are in the Instruction Sequence.
- ability for the fancy branch predictors to predict multiple branches in a single cycle
- ability to reach into the Instruction Queue and selectively invalidate instructions that were fetched but need to be thrown away as a consequence of a Resteer.
- there are also technical complications now having to do with things like how to update the Fetch Predictor when it's realized that an internal branch is usually taken, so that Fetch Record should be split.

About half of branches are taken, so if we do the above work, on average our Instruction Sequences can now be about 9 to 10 instructions long. That's still not quite enough if you want to run 10-wide, and remember there are also multiple cycles where you can perform no Fetch at all. Sometimes this is because of a Resteer (maybe 2 or 3 cycles lost), sometimes it's because the L0 Fetch Predictor couldn't provide a prediction and we lose a cycle going to the L1 Fetch Predictor.
So we need a way to get (on average) longer Fetch Sequences.
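
To put rough numbers on it (mine, purely illustrative): say the core can consume 10 instructions per cycle, the average Fetch Sequence is 9 instructions, and one cycle in five is lost to a resteer or a predictor miss. Over 5 cycles that is at most 4 x 9 = 36 instructions fetched, i.e. about 7 per cycle sustained, well short of the 10 the core can swallow. Longer Sequences (plus a deep Instruction Queue to smooth over the dead cycles) are what close that gap.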

There are three primary ways you can try to handle this.
One is to predict two PCs per cycle, and feed those both to the I-cache (which obviously also means some modifications to the I-cache). This is not utterly crazy. The second PC prediction is slightly less accurate than the first, and the I-cache modifications are feasible. This was being talked about as an idea for the last, never-manufactured Alpha back in the late 1990s, and I believe it's being used by the most recent AMD chips. (I don't follow AMD closely; I was told this by someone but I'm not utterly convinced they understood the issue well...)

The second way is being used by Apple and has to do with Short Forward Taken Branches.
Suppose you have a branch that looks like
if(condition){one instruction}else{...}
These are fairly common, for example if(condition) a+=b else ...
Now suppose our prediction is that usually the condition is false and we branch over the single instruction.
So suppose we have a Sequence of 16 instructions, and this setup above happens at around instruction 6.
Then what if, for initial Fetch purposes, we just ignore the Short Forward Taken branch (now you see the meaning of each of those adjectives) and just Fetch the entire 16 instructions? Then in the Instruction Queue we snip out the predicted unexecuted instruction (eg the a+=b).
The effect of this is that the rest of the machine still sees the same sequence of instructions as it would have, but we are able to pull in a Sequence of 16 instructions, rather than having to perform this as an initial Fetch of say 6 instructions, then increment the PC by 1 instruction, then load the next Sequence of 9 instructions.
This obviously can also work with runs of 2 or 3 or 4 skipped instructions. The maximum number it's worth skipping is a question for simulations; my guess is it's around 3 or 4.
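
Written out as code (ordinary C; the compiled form in the comments is just schematic, real compilers and register names will differ):

    // The pattern: a single guarded instruction, where the guard is usually false.
    long demo(int cond, long a, long b) {
        if (cond)
            a += b;        // usually skipped
        // ... the rest of the Sequence continues here ...
        return a;
    }

    // Roughly what the compiler emits:
    //       cbz   w_cond, skip    ; short forward branch, usually TAKEN (cond usually false)
    //       add   x_a, x_a, x_b   ; the a += b, usually not executed
    //   skip:
    //       ...                   ; the rest of the 16-instruction run
    //
    // Fetch ignores that branch and pulls in the whole 16-instruction run in one go;
    // the Instruction Queue then snips out the 'add' that is predicted not to execute.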

OK, so we have one Apple mechanism for giving us wider effective Fetches. This only kicks in for the pattern I described, but that pattern is in fact fairly common.

The third option for giving yourself effectively longer Fetch Sequences relies on the fact that many taken branches are either fixed or almost never not taken.
A fixed taken branch is something like a function call.
A predicted taken branch that almost never fails might be something like (if ptr==NULL) or (if divisor==0) or if(error). Things like this dominate most code, and force the code frequently to jump over rarely executed unusual-condition-handling.
So the actually executed code still looks like what we said at the start -- runs of sequential instructions separated by taken branches, but many of those taken branches are (effectively) ALWAYS taken. In this case, you can imagine storing somewhere the full run of the instructions as executed, ie two or three blocks of sequential instructions one after the other. Because they are now stored together, the fact that they are conceptually separated by a taken branch doesn't matter, not if they are all placed together in a single block of SRAM and can be loaded as a unit.
Such a run of sequentially executed instructions, separated by taken branches, is called a Trace, and the cache that holds them is called a Trace Cache.
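
Concretely, think of code like this (my own toy example; exactly how the compiler lays out the error paths will vary):

    typedef struct { int len; int payload; } item_t;
    void consume(item_t *p);               // the real work, defined elsewhere

    int process(item_t *p) {
        if (p == NULL)   return -1;        // essentially never true
        if (p->len == 0) return 0;         // essentially never true
        consume(p);
        return p->len;
    }

    // If the error handling sits inline, the hot path jumps over it every single
    // time, i.e. it crosses branches that are taken essentially 100% of the time.
    // The executed run "check p -> check len -> consume -> return len" can then be
    // stitched together and stored contiguously in SRAM as one Trace, and fetched
    // as a unit, even though those blocks aren't contiguous in the program itself.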

The way Apple implements this (at least right now), the Trace Cache is as described above. It holds standard instructions, no weirdness with uOps or whatever.
There are now two Fetch Predictors accessed in parallel, one that predicts fetching from the I-cache, the other that predicts Fetching from the Trace Cache.
Obviously this scheme helps with generating longer Fetch Sequences. It probably also helps with power (you have more flexibility to lay out the Trace Cache however you like, and things like the TLB storage and way storage in the traditional Fetch Predictor may not be necessary). There may also be a small improvement in general branch prediction efficiency. For various reasons, almost-always-taken branches ended up having to be stored in the fancy branch predictor even though they don't really need "prediction", so they used up precious branch predictor space. But when those branches are implicitly stored in the Trace Cache they no longer need to be predicted. (The patents are not clear on this, but the impression I get is that the only traces Apple wants to store are those where all taken branches are effectively ALWAYS taken, enough so that there's no point in performing a fancy branch prediction on them, we just assume they're always taken and the 0.1% of the time that we're wrong, that's caught when we actually execute the branch in the core, and then that trace is thrown out of the Trace Cache and not reused.)

So bottom line is
- 10-wide execution means SUSTAINED 10-wide Fetch
- this is difficult to impossible if there's a taken branch every 10 or so instructions BUT
- enough taken branches are "always" taken that you can construct Traces that are 16..32 or so instructions long (I'm guessing at the sizes, Apple doesn't say) and frequently load a trace in a single cycle.
- these long traces make up for the cycles where you load nothing or not much, helping you get closer to that sustained 10-wide Fetch.

Some final points:
None of this has anything to do with loops. Loops are so common and so easy to handle that everyone has had special handling for them forever. Even a decade ago Apple had not just specialized loop buffers, but loop buffers that unrolled small loops in hardware.

Almost all of the standard benchmark suites (eg GB6 or SPEC) do a terrible job of testing Fetch bandwidth. They just don't have very large code footprints, and most of their execution runs out of loops that are handled by loop buffers.
Browser benchmarks are somewhat better, but of course it's hard to compare those across devices because the OS is different, let alone things like Chrome vs Edge vs Safari. So this work done by Apple will show up when using large apps but will not be especially visible in the public benchmarks.

The above is far from the full story. Another dimension of Fetch is ensuring that the instructions you want have been prefetched into the I-cache before you need them. Yet another dimension is the whole question of how the Fetch Predictor gets populated, and how not to lose cycles while it is being populated. These are covered in my PDFs but we certainly don't have time for them here!

 

name99

Senior member
Sep 11, 2010
684
576
136

M2 MacBook Air 16GB for $650. Probably the best laptop deal right now. Ridiculously good for such a low price. I think the MacBook with the iPhone chip for $599 doesn't sound as ridiculous anymore, considering the crazy good margins Apple must have if Best Buy can sell one this low.
Damn! That's nice!
My initial thought was that it would be an 8GB/128GB config, but no, it's adequate on those fronts as well.

Just yesterday I was in Best Buy buying a Windows laptop for a foreign friend who wanted me to bring her one (something super-basic, super-cheap). You can get Windows laptops at say $400 to $450 that are OK (same nominal specs of 16/256) but of course everything is less nice -- physically larger and heavier, CPU not as performant, shorter battery life, etc.
Best Buy even had a few netbook-style machines (kinda like an iPad with the very lightweight keyboard) at $200 clearance, but they are 4GB/64GB! Feels like they're going to be just horrible to use under almost any circumstances, even for basic Windows web browsing.
 

Mopetar

Diamond Member
Jan 31, 2011
8,525
7,784
136
@name99

Thanks for the great post. I was also curious about all of the hoopla around Apple's trace cache and that was incredibly informative.
 

mikegg

Platinum Member
Jan 30, 2010
2,091
633
136
The most exciting M5 upgrade is definitely Neural Accelerators (aka matmul inside GPU). Prompt processing got 3-4x faster.


 

Doug S

Diamond Member
Feb 8, 2020
3,800
6,731
136
There has been some discussion about why the M5 Pro & Max are delayed, and I wonder if AI was the reason. Maybe they had some further improvements to their GPUs' AI support almost ready that they wanted to wait for, to double down on the gains the M5 already made?

They aren't gonna compete with Nvidia, AMD and Google for AI datacenter bubble money, but they could make some waves if they sold the best personal AI server you can buy - and it came in the form of a Macbook Pro.
 

jpiniero

Lifer
Oct 1, 2010
17,137
7,523
136
They aren't gonna compete with Nvidia, AMD and Google for AI datacenter bubble money, but they could make some waves if they sold the best personal AI server you can buy - and it came in the form of a Macbook Pro.

May not be wrong, but it'd be to boost their AI Cred with those moron "Investors"
 
  • Like
Reactions: Hesperax

GC2:CS

Member
Jul 6, 2018
39
21
81
I will describe the problem a trace cache solves for Apple. ...

That's a lot of tokens, thanks!
 

GC2:CS

Member
Jul 6, 2018
39
21
81
There has been some discussion about why the M5 Pro & Max are delayed, and I wonder if AI was the reason. Maybe they had some further improvements to their GPUs' AI support almost ready that they wanted to wait for, to double down on the gains the M5 already made?

They aren't gonna compete with Nvidia, AMD and Google for AI datacenter bubble money, but they could make some waves if they sold the best personal AI server you can buy - and it came in the form of a Macbook Pro.
The larger M5 chips are supposed to be modular to some degree, as per last year's rumor.
 

jdubs03

Golden Member
Oct 1, 2013
1,453
1,027
136
From what I remember it was due to using TSMC's new advanced SoIC-MH packaging process, which would indeed allow for greater modularity.
 

mvprod123

Senior member
Jun 22, 2024
530
573
96
From what I remember it was due to using TSMC's new advanced SoIC-MH packaging process, which would indeed allow for greater modularity.
It is also possible that Apple planned to use LPDDR5X-10700 for the M5 Pro/Max, but a shortage prevented these chips from being released on time.
 
  • Like
Reactions: mikegg

poke01

Diamond Member
Mar 8, 2022
4,769
6,100
106
I will describe the problem a trace cache solves for Apple. ...

Why now?
Are they planning on going even wider in future CPUs?
 

Doug S

Diamond Member
Feb 8, 2020
3,800
6,731
136
Why now?
Are they planning on going even wider in future CPUs?

Given their history it seems highly likely they'll do so in a future CPU (maybe not as soon as A20/M6, but "by the end of the decade" I'd say is a lock). The question is when, and what they feel they need to have in place to get enough of a performance boost out of it to make it worth the additional transistors.
 

mvprod123

Senior member
Jun 22, 2024
530
573
96

Apple supply chain analyst Ming-Chi Kuo today said Intel is expected to begin shipping Apple's lowest-end M-series chip as early as mid-2027.
Kuo said Apple plans to utilize Intel's 18A process, which is the "earliest available sub-2nm advanced node manufactured in North America."
If this rumor proves to be accurate, Intel could supply Apple with M6 or M7 chips for future MacBook Air, iPad Air, and iPad Pro models at a minimum.

...

TSMC would continue to supply the majority of Apple's M-series chips.
Kuo said that Apple choosing to have Intel supply its lowest-end M-series chip would appease the Trump administration's desire for "Made in USA" products, and it would also help Apple to diversify its supply chain for manufacturing.
 

jpiniero

Lifer
Oct 1, 2010
17,137
7,523
136
Kuo said that Apple choosing to have Intel supply its lowest-end M-series chip would appease the Trump administration's desire for "Made in USA" products

They could do that by using TSMC nodes that are in the US, or Samsung, really. Appeasing Trump while he still has Intel stock... more likely.