Discussion Apple Silicon SoC thread

Eug · Nov 10, 2020

M1
5 nm
Unified memory architecture - LP-DDR4
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-effiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 Teraflops
82 Gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core number). Basically, Apple is taking the same approach which these chips as they do the iPhones and iPads. Just one SKU (excluding the X variants), which is the same across all iDevices (aside from maybe slight clock speed differences occasionally).

EDIT:

M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:

Page 78 - Discussion - Apple Silicon SoC thread

Page 78 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

forums.anandtech.com

M1 Ultra discussion here:

Page 109 - Discussion - Apple Silicon SoC thread

Page 109 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

forums.anandtech.com

M2 discussion here:

Page 127 - Discussion - Apple Silicon SoC thread

Page 127 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

forums.anandtech.com

Second Generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K h.264, h.264, ProRes

M3 Family discussion here:

Page 215 - Discussion - Apple Silicon SoC thread

Page 215 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

forums.anandtech.com

M4 Family discussion here:

Page 263 - Discussion - Apple Silicon SoC thread

Page 263 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

forums.anandtech.com

HurleyBird · Nov 16, 2020

name99 said:
4 core SoC (+4 efficiency cores) doesn't match 8 core SoC (+SMT)? I should hope so!
That Apple gets to 75% of the AMD score with what looks like half the AMD resources is the real point...

Albeit in a very power constrained environment. That's sort of the point (along with background tasks) of the efficiency cores. The 8C16T machine needs to clock down a lot. If you took that Zen 2 mobile SOC, and replaced half the cores with ones that performed 66% as well while only consuming 25% the power (just pulling random numbers here), would the aggregate MT performance at 15W degrade or would it improve?

Eug · Nov 16, 2020

Antey said:
Macbook air M1
Cinebench R23 SC 1477, MC 6304.

https://forums.macrumors.com/thread...-comparison-with-other-brands.2267377/page-12

In absolute terms, those are very good numbers. Relative to the MacBook Pro scores they aren’t stellar but aren’t bad, with a 16% performance drop due to throttling in a fanless enclosure. I was hoping for around a 10-15% (or less), but I’ll take it. Close enough.

At these performance levels, I am still impressed that they didn’t make an A14 class low end Mac. I guess they felt it just wasn’t worth their time to add the extra pieces to a 2+4 design. No need to sandbag. Go big (plus LITTLE) or go home!

Antey · Nov 16, 2020

more like 19% (0,84)

the only thing i don't like about this chip is the lack of i/o, it's a smartphone soc in disguise, no egpu support (no pcie lanes for it), just 1 external monitor. it has no point of differentiation from a possible a14x

Roland00Address · Nov 16, 2020

Eug said:
At these performance levels, I am still impressed that they didn’t make an A14 class low end Mac. I guess they felt it just wasn’t worth their time to add the extra pieces to a 2+4 design. No need to sandbag. Go big (plus LITTLE) or go home!

They could always go M1— for less money.

(Please don't the M1 is awesome but it is also awesome that this is going to be the slowest mac chip EVER from now on.)

Eug · Nov 16, 2020

This was pointed out to me elsewhere:

Mac

The most powerful Mac laptops and desktops ever. Supercharged by Apple silicon. MacBook Air, MacBook Pro, iMac, Mac mini, Mac Studio, and Mac Pro.

www.apple.com

M1 delivers significantly higher performance at every power level when compared with the very latest PC laptop chip. At just 10 watts (the thermal envelope of a MacBook Air), M1 delivers up to 2x the CPU performance of the PC chip. And M1 can match the peak performance of the PC chip while using just a quarter of the power.

Interesting way to word it. It's mentioning a thermal envelope of 10 Watts for the Air, but it doesn't actually say that M1 is a 10 Watt chip. So if you read between the lines, it's saying it will do well at 10 Watts, but it will throttle.

And then later Apple talks about the MacBook Pro's fan and how good it is for sustained workloads.

---

BTW, there are bazillion unboxing showing up now.

Screen Shot 2020-11-16 at 10.54.10 PM.png

name99 · Nov 16, 2020

Antey said:
more like 19% (0,84)

the only thing i don't like about this chip is the lack of i/o, it's a smartphone soc in disguise, no egpu support (no pcie lanes for it), just 1 external monitor. it has no point of differentiation from a possible a14x

Well yes. It's a chip targeted at this type of computer! If you want the (slightly better) chip, wait for the iMac and higher end MBP! If you want the really good chips, wait for the iMac Pro.

I mean this is like looking at a Lakefield and saying "You know, I like it, but it really should be a Xeon"!

amrnuke · Nov 16, 2020

HurleyBird said:
Albeit in a very power constrained environment. That's sort of the point (along with background tasks) of the efficiency cores. The 8C16T machine needs to clock down a lot. If you took that Zen 2 mobile SOC, and replaced half the cores with ones that performed 66% as well while only consuming 25% the power (just pulling random numbers here), would the aggregate MT performance at 15W degrade or would it improve?

4500U (6/6) and 4700U (8/8) are both hitting 1151 on CB23 1T, and 5600-5900 on MT at 14-15W. If we assume similar CB23 increases from zen2->zen3 as we saw on CB20 zen2->zen3, that puts a theoretical 15W TDP "5500U" with 6C/6T at a CB23 score of 1400-1425 / 6700-6750. If the M1 is 1477 / 6304 and keeps a 10-15W SoC load, that's not awfully dissimilar. 6 big cores and no SMT vs 4 big and 4 small cores.

Obviously, there are a lot of gaps, namely that there, uh, isn't a 5500U, nor do we have any review-quality M1 benchmarks, and I'm making a lot of assumptions. But some of the extrapolation suggests that we have a great competition here. If that holds, Cezanne vs M1 will be a fun thing to watch analyzed over the next year.

I'm pumped to see where this lands us in 2-3 years too.

Eug · Nov 16, 2020

Mac mini Cinebench R23 multi core 7756

Screen Shot 2020-11-16 at 11.35.45 PM.png

Note it says 3.0 GHz multi core, 3.2 GHz single core.

amosliu137 · Nov 17, 2020

Some test are interesting. In some cases air beats pro. M1 uses only 2min 15secs to install Xcode.xip. My MacBook use much much longer time to open it. You can see the live link.
Test

amrnuke · Nov 17, 2020

name99 said:
It feels to me like you don't quite grasp the essential distinctions here.

All speculation is a (statistics informed) guess about how the program will probably behave, but requiring a way to recover if your guess was incorrect.
The most obvious and well-established form of speculation is branch prediction. The speculation is ultimately about the sequence of instructions through the CPU, and the recovery mechanism has two main pieces:
- various mechanisms (physical vs logical registers, and the LSQ [as a secondary function]) hold calculated values provisionally, until each branch is resolved, at which point all values calculated prior to that branch can be graduated from provisional state to correct state.
- the ROB holds the provisional ordering of instructions, connecting the instruction ordering (which instructions are now known to be valid, given that the branch that led to them is correct) to the values described above (sitting in physical registers and the LSQ).

What's important is the dependency graph: what set of subsequent speculated values are dependent on a speculated branch, and thus will be shown to be incorrect if that branch was incorrect.
Now the nature of control speculation (speculation on the sequence of instructions) is that once you make a branch for practical purposes EVERY instruction after that branch depends on that branch. Which means that if a branch was guessed incorrectly (direction or target) everything after it needs to be flushed.
Now you might protest that this is not true, that there are branch structures (diamonds) like
if (a<0){
b=-a
}else{
b=a
}
where there's a tiny amount of constrained control divergence, after which flow reconverges. This is true. But it doesn't help. Even if you want to correct speculation that led down the wrong half of the diamond just by flushing the instruction b=-a, and executing b=a, everything after the diamond closes is dependent on the value of b and also is now incorrect. It's just not practical to track everything that does or does not depend on a particular branch and selectively flush that branch and its dependents and nothing else
(a) almost EVERYTHING is dependent, so this buys you very little and
(b) branches are so dense (think ~1/6 of instructions) that you'd need tremendously complicated accounting to track what is dependent on this branch not that.

So end result is: control speculation as a practical matter has to recover by flushing *everything* (all instructions, all calculated values) after a misprediction.
If you think about this in detail, it leads to a whole set of issues.
- Of course you want an accurate predictor, that's given. But you also want to catch mispredicts if you can, up to decode and rename, but before they enter the OoO machinery, because catching them there only flushes the instructions queued after them in various buffers sitting between fetch, decode, up to rename. Hence the value of a long latency (but even more accurate) secondary branch detection mechanism using even larger pools of storage.
- you want to avoid branches with the characteristic that they are hard to predict and do very little work (like the diamond I described above). Things like MAX or ABS. This leads to the value of predicated instructions and things like CSEL/CMOV. The whole story of CMOV in the x86 world is a tragedy, and since this is supposed to be purely technical I won't cover it. But the fallout is that much of the x86 world, even today, is convinced that predication is a bad idea (and imagine that tiny micro-benchmarks prove this). But microbenchmarks miss the big picture. The value of predication is that it converts a branch (which becomes massively expensive if it's mis-predicted) into straightline execution with no speculation and no penalties. Fortunately ARM CSEL was well designed and implemented from the start so ARM doesn't have this weird x86 aversion. IBM even converts short branches to predication and I suspect Apple does the same (just on the grounds that Apple seems to have implemented every good idea that has ever been invented).

There's vastly more that can be said about branches (some of it said by me here, going forwards and backwards from this anchor link):
Look for the posts by me, Maynard Handley
https://www.realworldtech.com/forum/?threadid=196054&curpostid=196130
Even apart from that, you can start asking how exactly you recover from a misspeculation... The first ways of doing this in the late 80s based on walking the ROB (not very large at the time) were adequate but didn't scale. So then came checkpoints but there's a whole subspecialty in when you implement checkpoints... EVERYTHING in this space is much more than just the words -- I can implement checkpoints, and you can implement checkpoints, but I can get much more value out of my checkpoints than you (ie mispredicts are a lot cheaper) if I am smarter in when I create each checkpoint.

But all the above was throat clearing. The point is that control speculation has characteristics that mean recovering from a misprediction require flushing everything after the misprediction. But there are other forms of speculation.
One of the earliest, which you know something about, is speculative scheduling (ie guess that a load will hit in cache, and schedule subsequent instructions based on that).
Another is load/store aliasing: if a store doesn't yet know its address, so it's just sitting in the store queue waiting, what do we do with a subsequent load? We could delay it until we know the address of every store, but chances are the load doesn't actually load from the address of that store, so we are delaying for nothing. Classical speculation territory... (One way to do this were what I referred to with store sets and the Moshovos patent) But once again, if your speculation is incorrect, then the contents of the load, and everything that depends on it, are now invalid.
A third possibility is value prediction. This notes that there are some loads that occur over and over again but what's loaded never changes, so you can bypass the load and just supply that value. This is the kind of thing you'd say "that's dumb, how often does it happen?" Well, unfortunately it happens way more than it should... Value prediction is in a kinda limbo right now. For years it was talked about but not practical (in the sense that, with limited resources, transistors were better spent elsewhere). But we are getting close to the point where it might start making sense. QC have been publishing on it for a few years now, but as far as I know no-one has stated that a commercial product is using it, though who knows -- maybe Apple have already implemented an initial version?

For each of these cases, we again need recover in the case of misspeculation. Recovery from these kinds of misspeculation is generically called replay. The important difference compared to control speculation is that data speculation lends itself much better to tracking the precise dependencies of successor instructions on the speculated instruction: the dependency chains are shorter and less dense. This means that, although the easiest response, in the face of a data misspeculation, is to reuse the control misspeculation machinery, that's not the only possible response.
But even if you accept this idea and so want to engage in some sort of selective replay, there are still many different ways to do it, and getting the details wrong can have severe consequences (as you are aware from P4, and Cyclone as a suggested mechanism for doing a lot better. However I'd see Cyclone best thought of as an OoO scheduler [something I've not discussed at all] rather than as a *generic* replay mechanism. And Cyclone was Michigan, not Wisconsin, though I also frequently confuse the two!)

So some points from all this:
- the sheer size of structures is not that important. Of course it is important, but not in the way the fanboy internet thinks. Almost every structure of interest consists of not just the structure size but an associated usage algorithm. And the structure is almost always sized optimally for that algorithm, in the sense that growing the structure, even substantially (like doubling) buys you very little in improved performance. The best paper demonstrating this is

https://arxiv.org/pdf/1906.08170.pdf
which shows that even quadrupling Skylake resources in a naive way gets you only about 1.5x extra performance. People latch onto these numbers, like size of the ROB, or amount of storage provided for branch prediction, because they're available. But that's the drunk looking for his keys under the lamp because that's where the light is! The numbers are made available (not by Apple, but by ARM. AMD, Intel, ...) precisely because they are NOT important, they don't offer much of a competitive advantage. The magic is in the algorithm of how that storage is used. Saying 620-entry ROB implies there's a single way that something called a ROB is used, and anyone can just double a ROB and get much better performance. NO! ROB essentially means how much control provisional state can be maintained and scaling that up involves scaling up many many independent pieces.

This is one reason I get so furious at people who say that Apple performance is "just" from a larger ROB or larger BTB or whatever. Such thought display utter ignorance as to the problems that each piece of functionality solves and what goes into the solution. So, consider the ROB. The point of the ROB, as I explained is to track what needs to be flushed if a misprediction goes wrong. So why not just double the ROB?
Well, consider instruction flow. Instruction go into the Issue queue(s), wait for dependencies to be resolved, issue, execute. Executing takes worst case a few cycles, why not have a ROB of 30 or 40 entries?
Because SOME instructions (specifically loads that miss in cache) take a long time, and the instruction at the head of the ROB cannot be removed from the ROB until it completes. So with a short ROB, after you've filled up the 40 slots with the instruction after that load, you stop and wait till the load returns.
OK, so the ROB is just a queue, easily scaled up. Why not make it 4000 entries?
Because almost every instruction that goes into the ROB also requires some other resource, and that resource is held onto until the instruction move to the head of the ROB and completes. (What is called register rename is essentially resource allocation. During rename the resources an instruction will require -- a ROB slot, probably a destination register, perhaps an entry in the load or store queues -- are allocated, and they are held onto until completion.) So sure, you can have 4000 ROB slots, but if you only have 128 physical registers, then after those 128 are all allocated as destination registers, your ROB is filled with ~128 entries and the rest of them are empty because the machine is going to stall until a new physical register becomes available.)
So the first constraint on growing the ROB is that to do it you also need to grow the number of physical registers and the size of the LSQ. Neither of these are at all easy. And both of them involve their own usage algorithms where, yes, you can grow either the register file or the LSQ if you switch to using them in a novel way. But again this novel usage model is not captured by saying "of they have a 144 entry load queue" as though that's some trivial achievement, just a little harder than a 72 entry load queue.

But even THAT is not the end of the game. Because even if you can grow the other resources sufficiently (or can bypass them in other ways: my favorite solution for this is Long Term Parking https://hal.inria.fr/hal-01225019/document ) you have the problem that a good branch predictor is not a perfect branch predictor. Branches 1 in 6; a 600 entry ROB will hold ~100 branches. Even if each of those has only a 1% chance of misprediction that means there's close to 100% certainty that there is A misprediction somewhere in the ROB in all those instructions that piled up behind the load that missed. It's (to good enough accuracy) equally likely anywhere, meaning that half the work in your ROB, 300 instructions, is likely wasted along with wasted energy, done after a bad branch and will have to be flushed. 99% accurate sounds great for a branch predictor (and neither AMD nor Intel hit that, nor Apple who are somewhat closer) -- but by itself a huge ROB and an imperfect branch predictor just mean you're doing a lot more work that will eventually get flushed.
A12 results here:

https://twitter.com/x/status/1307645405883183104

Since Apple have this huge ROB and clearly get value from it
(they didn't start that way, they started with designs that were apparently very similar to say x86 at the time, scaled 1.5x wider. Best thing I've found showing the evolution over the years is here: Not great, but gets the gist and I know nothing better
https://users.nik.uni-obuda.hu/sima...e_2019/Apple's_processor_lines_2018_12_23.pdf )
they are clearly NOT just scaling all the structures up without changing the underlying algorithms. I have some hypotheses for how they get value from that massive ROB abd imperfect branch predictor, but this is already too long!

But I hope you read all this and think about it. And then realize why I get so angry, why saying "oh they *just* doubled the size of the ROB [or cache or branch predictor or whatever" is such an ignorant statement. It's not just that doubling the size of those structures is hard (though it IS), it's that doubling them is not nearly enough, it just doesn't get you very much without accompanying changes in the way those structures are used.
And Apple has been engaged in these changes at an astonishing rate -- looks like pretty much every two years or so they swap out some huge area of functionality like the PRF or the ROB or the branch predictor or the LSQ and swap into something new and, after the first round or two of this, something that has never been built before.

Second point
- your post keeps confusing replay with flushing. Read what I said. These are different concepts, implemented in different ways. Likewise you seem to think that a large ROB helps deal with low quality speculation. Precisely backward! A large ROB is only valuable if your (control) speculation is extremely high quality and provides for extremely fast recovery from mis-speculation. Likewise you seem to think that the ROB is somehow related to the machine width. Not at all.

I suggest that, based on all the reading I have given you, start thinking this stuff out in your head. Consider the flow of instructions from fetch to completion -- at first you don't even need OoO or superscalar, just consider an in-order single issue CPU that utilizes branch prediction. Think about the type of functionality that will HAVE TO be present for this to possibly work, how it will recover when something goes wrong. And then you can start adding additional functionality (superscalar, then OoO) into your mental model.

This is great, and will give me a lot to read. I really appreciate the time you took to write it out.

A few superficial things for now:

- With respect to Apple expanding the ROB: I didn't intend to imply that Apple just doubled the ROB to improve performance. I was asking why they might have a use for a larger ROB - is it because they made a wider core? Or because their branch prediction scheme benefited from it? Both? Clearly I didn't think it's as easy as adding transistors to make a larger buffer and IPC gainz lulz. In any case, I have a lot to read up on. I know the ROB and surrounding logic is incredibly complex, as is the entire pipeline for that matter. My knowledge only goes so deep as my (maybe low-mid-level?) computer science classes 16 years ago. I know about ambiguous dependence and CAMs and that there are complexities around in-flight instructions and RAW violations and such. I do not pretend to know how the hell these things work! So I'm deeply grateful that you've explained what you have, and it makes sense in layman's terms.

- As for replay vs flushing, yes, there's absolutely confusion. As I understand it, I thought that as it pertained to replay, you were talking about flushing as in flushing the whole pipeline, or the whole ROB, etc in the case of a misprediction. Am I mixing it up?

- With respect to predication, is the penalty for use of predication higher if you have a deeper pipeline? Upon what is the x86 aversion to predication based?

Eug · Nov 17, 2020

amosliu137 said:
Some test are interesting. In some cases air beats pro. M1 uses only 2min 15secs to install Xcode.xip. My MacBook use much much longer time to open it. You can see the live link.
Test

Be aware that on a fresh system, load times tend to be much faster than on a well-used system.

For example, when I freshly installed Office 2016 on my iMac, Word used to take a few bounces to launch. However, for some reason, later on after using it for a few months it would take up to 10 bounces to launch. It was extremely frustrating, considering I had a 2 GB/s SSD in this machine.

Also note that the MacBook Air M1 does over 2 GB/s read and 2.6 GB/s write (256 GB SSD). I believe this is literally twice the SSD speed of the last Intel MacBook Air.

amrnuke · Nov 17, 2020

Eug said:
For example, when I freshly installed Office 2016, Word used to take a few bounces to launch. However, for some reason, later on it would take up to 10 bounces to launch. It was extremely frustrating, considering I had a 2 GB/s SSD in this machine.

Imagine how that'll be on Office on Mac on Arm........

shady28 · Nov 17, 2020

That's a miss on R23, look what's above and below it :

Antey · Nov 17, 2020

1,69GHz base and 2,81 GHz base, it's probably 15W and 28W

shady28 · Nov 17, 2020

Antey said:
1,69GHz base and 2,81 GHz base, it's probably 15W and 28W

I don't think it will pull 28W on a single thread bench.

jeanlain · Nov 17, 2020

(ignore)

jeanlain · Nov 17, 2020

shady28 said:
I don't think it will pull 28W on a single thread bench.

The 28W 1185G7 tested by Anandtech draws 21W during a single-core SPEC test (as reported by Andrei F.). Not sure about the 1165G7.
During the same test, the 3GHz A14 draws 5W (which also includes the power consumed by "regulators").
The M1 at 3.1 GHz (that is, before the macOS 11.0.1 update) should draw more than 5W, but not nearly as much as 20W.

Geegeeoh · Nov 17, 2020

Maybe a bit offtopic, but this surprised me:

Mac Support Update -- November 16
With this week’s patch 9.0.2, we’re adding native Apple Silicon support to World of Warcraft. This means that the WoW 9.0.2 client will run natively on ARM64 architecture, rather than under emulation via Rosetta.
We’re pleased to have native day one support for Apple Silicon.
While our testing has been successful, we’re highly aware of the nature of day one support with updates like this. Please let us know if you run into any issues that may be related to Apple Silicon in our Technical Support forum.
Thank you very much.

Mac Support Update -- November 16

With this week’s patch 9.0.2, we’re adding native Apple Silicon support to World of Warcraft. This means that the WoW 9.0.2 client will run natively on ARM64 architecture, rather than under emulation via Rosetta. We’re pleased to have native day one support for Apple Silicon. While our testing...

eu.forums.blizzard.com

gai · Nov 17, 2020

amrnuke said:
- As for replay vs flushing, yes, there's absolutely confusion. As I understand it, I thought that as it pertained to replay, you were talking about flushing as in flushing the whole pipeline, or the whole ROB, etc in the case of a misprediction. Am I mixing it up?

Both flushes and replays are techniques to recover from incorrect speculative execution state. However, their cost, both in energy and in performance, differs wildly.

In this context*, a flush refers to completely removing all traces of program execution past a certain point in architectural order. This means the ROB, but also the LSQs, the scheduler, and so on.

A flush discriminates based on age, which can be computed in parallel for every op in all of these structures. This makes it well-suited for control speculation errors (e.g. branch misprediction) that void all speculative work past the point of the flush. Unfortunately, there are many cases when ops that performed useful work are also flushed. Those instructions will have to be fetched and executed anew, even if they correctly executed beforehand. When the instruction window of a processor grows, the cost of a flush increases greatly. The cost also increases with pipeline length.

Even when considering those instructions that are fetched and executed two or more times, the term "replay" does not apply. Every instruction appears to be brand new and is treated accordingly.

In contrast, a selective replay (or replay for short) does not clear everything out of the pipeline. If the processor can identify (a superset of) all of the ops that could possibly have incorrect state, then those ops can be executed again. It is possible and likely that the replayed ops are non-adjacent in program order.

For example, a modern processor is very likely to speculate that a load op will hit in the L1D$. This speculation can dramatically improve performance when it is correct. However, when that speculation is incorrect, all data-dependents of the load op must be canceled. It is easy to imagine that this optimization would be a net penalty if the recovery mechanism were a flush. However, a selective replay mechanism greatly reduces the cost, by (1) not tampering with unrelated ops and (2) keeping the dependent ops ready to replay as soon as the load data arrives. The reduction in cost, at equal benefit, creates a favorable net outcome.

The dependency analysis for a replay might be very expensive, because the dependent ops may have dependents of their own, and so on. There is a large trade-off space to balance the cost and performance of selective replay. There may be cases when instructions replay several times, and there may be cases when non-dependent instructions are replayed. Critically, the cost of selective replay is relatively independent of instruction window depth.

* The term "flush" can also be used in some other contexts, like "flushing" a cache (purging its contents).

DrMrLordX · Nov 17, 2020

Geegeeoh said:
Maybe a bit offtopic, but this surprised me:

I think that's relevant. Shows Blizzard is interested in where this line of Apple ARM chips under MacOS can go. Which is no small thing.

shady28 · Nov 17, 2020

jeanlain said:
The 28W 1185G7 tested by Anandtech draws 21W during a single-core SPEC test (as reported by Andrei F.). Not sure about the 1165G7.
During the same test, the 3GHz A14 draws 5W (which also includes the power consumed by "regulators").
The M1 at 3.1 GHz (that is, before the macOS 11.0.1 update) should draw more than 5W, but not nearly as much as 20W.

So M1 is king of performance / watt.

What was it they said, 'Fastest CPU in the world'?

Shouldn't they have said 'Most efficient CPU in the world'?

I mean these two things are not the same.

While this is a good CPU, even excellent, it's also contemporary in the sense that it is another good choice among several good choices. Its not blowing anything out of the water.

Soon we'll start seeing test results from using the iGPU and the AI core. These may change the game a bit, all the cards haven't yet been played.

Eug · Nov 17, 2020

shady28 said:
So M1 is king of performance / watt.

What was it they said, 'Fastest CPU in the world'?

Shouldn't they have said 'Most efficient CPU in the world'?

I mean these two things are not the same.

While this is a good CPU, even excellent, it's also contemporary in the sense that it is another good choice among several good choices. Its not blowing anything out of the water.

Soon we'll start seeing test results from using the iGPU and the AI core. These may change the game a bit, all the cards haven't yet been played.

Given how well the M1 chip is performing and how well it's been received by the tech media, at this point I don't think anyone really cares except a few geeks.

Heartbreaker · Nov 17, 2020

Browser benches vs Ryzen 4700u. Note that M1 Chrome numbers are from x86 version running in Rosetta:

Apple-M1-Mac-Mini.browser-benchmarks-1440x1080.png

Antey · Nov 17, 2020

why stop there then?

this is the source

Hands-on with the Apple M1—a seriously fast x86 competitor [Updated]

Apple’s M1 proves that ARM can compete with x86 in high-end systems.

arstechnica.com

there many more benchmarks

for example, this one is interesting

https://cdn.arstechnica.net/wp-content/uploads/2020/11/Apple-M1-Mac-Mini.power-consumption-640x480.png

M1 pulls up to 20W vs 23W with ryzen 4700U

Mopetar · Nov 17, 2020

The AT review/benchmarks have been posted: https://www.anandtech.com/show/16252/mac-mini-apple-m1-tested/2

Pretty good showing overall for Apple's debut. It's generally better than what Intel has to offer and even manages to hang with some of AMD's newer CPUs a lot of the time. The GPU results are also quite impressive considering Apple is a lot newer to designing GPU cores than it is CPU cores.

Discussion Apple Silicon SoC thread

Lifer

Platinum Member

Lifer

Member

Platinum Member

Lifer

Senior member

Golden Member

Lifer

Member

Golden Member

Lifer

Golden Member

Platinum Member

Member

Platinum Member

Member

Member

Member

Junior Member

Lifer

Platinum Member

Lifer

Diamond Member

Member

Diamond Member