Discussion Apple Silicon SoC thread

Eug · Nov 10, 2020

M1
5 nm
Unified memory architecture - LP-DDR4
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-effiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 Teraflops
82 Gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core number). Basically, Apple is taking the same approach which these chips as they do the iPhones and iPads. Just one SKU (excluding the X variants), which is the same across all iDevices (aside from maybe slight clock speed differences occasionally).

EDIT:

M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:

Page 78 - Discussion - Apple Silicon SoC thread

Page 78 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

forums.anandtech.com

M1 Ultra discussion here:

Page 109 - Discussion - Apple Silicon SoC thread

Page 109 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

forums.anandtech.com

M2 discussion here:

Page 127 - Discussion - Apple Silicon SoC thread

Page 127 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

forums.anandtech.com

Second Generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K h.264, h.264, ProRes

M3 Family discussion here:

Page 215 - Discussion - Apple Silicon SoC thread

Page 215 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

forums.anandtech.com

M4 Family discussion here:

Page 263 - Discussion - Apple Silicon SoC thread

Page 263 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

forums.anandtech.com

Hitman928 · Nov 17, 2020

Antey said:
why stop there then?

this is the source

Hands-on with the Apple M1—a seriously fast x86 competitor [Updated]

Apple’s M1 proves that ARM can compete with x86 in high-end systems.

arstechnica.com

there many more benchmarks

for example, this one is interesting

https://cdn.arstechnica.net/wp-content/uploads/2020/11/Apple-M1-Mac-Mini.power-consumption-640x480.png

M1 pulls up to 20W vs 23W with ryzen 4700U

There's also this quote from that article:

I did observe one higher power draw from the mini aside from those shown here—during multi-threaded Geekbench runs, the mini hits 30W for about five seconds total at one point during the run.

Based upon that and Anandtech's power results, looks like the M1 TDP is ~22W give or take a couple of watts.

Antey · Nov 17, 2020

Comparing Ryzen 4700U vs M1 GPU, i think yes, M1 is ahead but... it has 1024 ALUs, 1275 MHz (2600 Gflops FP32), and i think it also has a big pool of high bandwidth cache (ala rdna2)... thats compared to a 7 CU, 448 Stream processors, 1600 MHz (1433 Gflops FP32)...

amrnuke · Nov 17, 2020

gai said:
Both flushes and replays are techniques to recover from incorrect speculative execution state. However, their cost, both in energy and in performance, differs wildly.

In this context*, a flush refers to completely removing all traces of program execution past a certain point in architectural order. This means the ROB, but also the LSQs, the scheduler, and so on.

A flush discriminates based on age, which can be computed in parallel for every op in all of these structures. This makes it well-suited for control speculation errors (e.g. branch misprediction) that void all speculative work past the point of the flush. Unfortunately, there are many cases when ops that performed useful work are also flushed. Those instructions will have to be fetched and executed anew, even if they correctly executed beforehand. When the instruction window of a processor grows, the cost of a flush increases greatly. The cost also increases with pipeline length.

Even when considering those instructions that are fetched and executed two or more times, the term "replay" does not apply. Every instruction appears to be brand new and is treated accordingly.

In contrast, a selective replay (or replay for short) does not clear everything out of the pipeline. If the processor can identify (a superset of) all of the ops that could possibly have incorrect state, then those ops can be executed again. It is possible and likely that the replayed ops are non-adjacent in program order.

For example, a modern processor is very likely to speculate that a load op will hit in the L1D$. This speculation can dramatically improve performance when it is correct. However, when that speculation is incorrect, all data-dependents of the load op must be canceled. It is easy to imagine that this optimization would be a net penalty if the recovery mechanism were a flush. However, a selective replay mechanism greatly reduces the cost, by (1) not tampering with unrelated ops and (2) keeping the dependent ops ready to replay as soon as the load data arrives. The reduction in cost, at equal benefit, creates a favorable net outcome.

The dependency analysis for a replay might be very expensive, because the dependent ops may have dependents of their own, and so on. There is a large trade-off space to balance the cost and performance of selective replay. There may be cases when instructions replay several times, and there may be cases when non-dependent instructions are replayed. Critically, the cost of selective replay is relatively independent of instruction window depth.

* The term "flush" can also be used in some other contexts, like "flushing" a cache (purging its contents).

Great rundown, thank you!

In the context of his post, I took "flush" to mean non-selective replay:

But Apple may also be using a way predictor. (And likely are for power reasons.)
A way predictor is problematic if you are doing speculative scheduling (which Apple also likely are) IF you're a lazy moron and handle replay via flushing or something similarly heavyweight.

But it you have a quality replay mechanism (which is a gift that keeps giving in terms of how much additional speculation it allows) then a way predictor is probably well worth doing (and could allow them to grow the L1 even larger if time of flight allowed -- perhaps when 3D becomes available).

So going back to my response to that, I have some questions: why mention a way predictor as a possibility? It is almost guaranteed to be used. They've been in place in AMD and Intel (and presumptively Apple) uarch for a long time. And speculative scheduling has been used for a long time. Handling replay via flushing (i.e. non-selective replay) hasn't been used (to my knowledge, after extensive searching) for a long time outside of the academic realm. I guess I am more perplexed at the implication that there is a a world where Apple aren't using way prediction, speculative scheduling, and a refined replay scheme. More the way he worded the post perhaps? "a way predictor is probably well worth doing (and could allow them to grow the L1 even larger ..." -- the implication being that they aren't already using way prediction? At least to my interpretation.

Purportedly, high-quality replay mechanisms abound, whether token-based selective replay, a replay loop with latency prediction, or whatever. Apple probably has their own modified or perhaps entirely new mechanism that may share features of many schemes. I am curious whether we can glean anything about their branch prediction, scheduling, or replay schemes based on the uarch information we've been given thus far, but don't know enough about the interplay of the various components to know where to begin formulating a guess.

Regarding selective replay being relatively independent of instruction window depth, is that true of all selective replay schemes? IIRC there are some that become exponentially more costly with increasing width or depth (I cannot recall which, or if both possibly are contributors) - but I recall token-based selective replay being developed to circumvent that problem.

amrnuke · Nov 17, 2020

Hitman928 said:
There's also this quote from that article:

Based upon that and Anandtech's power results, looks like the M1 TDP is ~22W give or take a couple of watts.

I really can't wait for some Cezanne comparisons. M1 overall beats the 4700U (and in some cases, blows it out of the water), at a lower power draw, but I can't help but feel the the comparison feels off - because the Zen2 core came out in July of 2019, before even the A13 was released, and Vega is over 3 years old.

Will be interesting to see if the 10-30% gains from Zen2->Zen3 translate to the mobile parts and if so, whether there is any competition from a performance per watt standpoint to what the M1 brings here. Because this is a hell of a chip!

beginner99 · Nov 17, 2020

Mopetar said:
Pretty good showing overall for Apple's debut. It's generally better than what Intel has to offer and even manages to hang with some of AMD's newer CPUs a lot of the time. The GPU results are also quite impressive considering Apple is a lot newer to designing GPU cores than it is CPU cores.

This is basically same as with consoles. If you have a limited range of hardware, you can extract far more with less. In this case Apple can add accelerators and control it via their own API (metal). In terms of GPU it probably helps a lot they do not need to support a ton of legacy APIs and certainly not directx. Also it's almost certainly tile-based with all the advantages and disadvantages. Apple doesn't need to care about compute because they don't make compute cards based on same uArch and they can run AI on the dedicated AI cores. I would expect this to save die space and power. It's the fully vertically integrated part showing it's power.

This would be cool if it weren' fore the vendor-lock in and losing any kind of freedom of what you do with your device and what you install. App store on phone, OK if you can install in different ways. But on a PC? I don't even know why I would risk developing software for this ecosystem if the can pull my key at any moment for any reason and basically bankrupt developers. Once people are lured in, they will start pushing their agenda more and more and nothing you can do about it.

Cool hardware-wise and impossible for intel/amd/nv to compete given the need for higher flexibility of x86/dGPUs. But a no go platform-wise.

Hitman928 · Nov 17, 2020

amrnuke said:
I really can't wait for some Cezanne comparisons. M1 overall beats the 4700U (and in some cases, blows it out of the water), at a lower power draw, but I can't help but feel the the comparison feels off - because the Zen2 core came out in July of 2019, before even the A13 was released, and Vega is over 3 years old.

Will be interesting to see if the 10-30% gains from Zen2->Zen3 translate to the mobile parts and if so, whether there is any competition from a performance per watt standpoint to what the M1 brings here. Because this is a hell of a chip!

I don't see why it wouldn't. Zen3 not only has higher IPC, but is also able to hit higher frequencies than Zen2 at the same or less voltage. I expect in the mobile space that anything lightly threaded, Zen3 will see the most benefit. Things that are heavy loads that scale to 16 threads will see the least benefit due to mobile power limits and still being on the same process. It will be interesting to see how often Apple refreshes their M* line as Zen4 is expected to drop in early 2022 which should put AMD and Apple at process parity, at least until late that year when TSMC 3nm is expected to begin HVM.

shady28 · Nov 17, 2020

Eug said:
Given how well the M1 chip is performing and how well it's been received by the tech media, at this point I don't think anyone really cares except a few geeks.

Early adopter / enthusiast types may not care, but when a regular user (which most Mac users are) gets one and 95% of their applications run slow as heck it will start to backfire - especially on tech sites (who failed to do extensive real-world use testing).

The PPC -> x86 move was different, Apple laptop has G3 and G4 powered laptops at a time when G5 multi-chip multi-core desktops could barely compete with single chip Intel parts. The performance boost of going to Core/Core2 was huge, and mitigated the slowdown while running Rosetta.

This chip isn't fast enough vs previous gen ice lake / comet lake to mitigate that.

This is what users had to deal with in the PPC->x86 move. It was basically a net sum zero game for most uses. This time it isn't. Not sure why no one has noted this, other than hype.

Eug · Nov 17, 2020

shady28 said:
Early adopter / enthusiast types may not care, but when a regular user (which most Mac users are) gets one and 95% of their applications run slow as heck it will start to backfire - especially on tech sites (who failed to do extensive real-world use testing).

The PPC -> x86 move was different, Apple laptop has G3 and G4 powered laptops at a time when G5 multi-chip multi-core desktops could barely compete with single chip Intel parts. The performance boost of going to Core/Core2 was huge, and mitigated the slowdown while running Rosetta.

This chip isn't fast enough vs previous gen ice lake / comet lake to mitigate that.

Uh, I guess you didn't bother reading Anandtech's review.

I'll just sum it up with their own conclusion:

"Overall, Apple hit it out of the park with the M1."

This is what users had to deal with in the PPC->x86 move. It was basically a net sum zero game for most uses. This time it isn't. Not sure why no one has noted this, other than hype.

This is hilarious! Did you really just post that?!

Those are tests done by yours truly, on my own Macs.

You must have lifted that image from my blog!

name99 · Nov 17, 2020

amrnuke said:
This is great, and will give me a lot to read. I really appreciate the time you took to write it out.

A few superficial things for now:

- With respect to Apple expanding the ROB: I didn't intend to imply that Apple just doubled the ROB to improve performance. I was asking why they might have a use for a larger ROB - is it because they made a wider core? Or because their branch prediction scheme benefited from it? Both? Clearly I didn't think it's as easy as adding transistors to make a larger buffer and IPC gainz lulz. In any case, I have a lot to read up on. I know the ROB and surrounding logic is incredibly complex, as is the entire pipeline for that matter. My knowledge only goes so deep as my (maybe low-mid-level?) computer science classes 16 years ago. I know about ambiguous dependence and CAMs and that there are complexities around in-flight instructions and RAW violations and such. I do not pretend to know how the hell these things work! So I'm deeply grateful that you've explained what you have, and it makes sense in layman's terms.

- As for replay vs flushing, yes, there's absolutely confusion. As I understand it, I thought that as it pertained to replay, you were talking about flushing as in flushing the whole pipeline, or the whole ROB, etc in the case of a misprediction. Am I mixing it up?

- With respect to predication, is the penalty for use of predication higher if you have a deeper pipeline? Upon what is the x86 aversion to predication based?

- Here's a summary of what the ROB does:

RWT Forums - Real World Tech

content overridden

www.realworldtech.com

Basically a larger ROB (IF EVERYTHING ELSE IS ALSO SCALED APPROPRIATELY) allows you to do more work after a load misses to DRAM.

- Replay means that a single instruction (and usually a few dependents) generated incorrect results because of incorrect speculation, so they need to be executed again.
Note the issue.
CONTROL SPECULATION determines the order of instruction; when it goes wrong you have to flush everything and start again.
DATA SPECULATION (of various sorts) or SCHEDULE SPECULATION have the correct instructions in the correct order, it's just that the instruction generated the wrong result. So you don't have to flush everything, you just re-execute the instruction in question.

- There is no penalty for predication! Like I tried to say politely (but this infuriates me, it is part of a toxic legacy from the x86 community). Predicated instructions execute like, say, an Add With Carry. Three inputs go in an ALU (two registers and flags) and a single value comes out. Nothing special, just like any other single cycle ALU execution.

I could say more about just *why* the x86 community got it (and still gets it) so wrong, but I'm supposed to be withholding my bile.

shady28 · Nov 17, 2020

Eug said:
Uh, I guess you didn't bother reading Anandtech's review.

I'll just sum it up with their own conclusion:

"Overall, Apple hit it out of the park with the M1."

You are pointing to the tech-head sites as some kind of proof. All it will be is a black mark on their reliability as a source of *useful* information. AnandTech's overly technical review is fairly meaningless to normal users.

It's already starting:

https://www.reddit.com/r/mac/comments/jvojlx

gai · Nov 17, 2020

amrnuke said:
In the context of his post, I took "flush" to mean non-selective replay:

So going back to my response to that, that's why I was asking why he's even mentioning a way predictor - that's been in place in AMD and Intel (and presumptively Apple) uarch for a long time. And speculative scheduling has been used for a long time. Handling replay via flushing (i.e. non-selective replay) hasn't been used (to my knowledge, after extensive searching) for a long time outside of the academic realm.

Although I cannot speak for Maynard regarding his comments on way prediction, I would like to clarify that "flush" should not be considered a synonym for non-selective replay. I neglected non-selective replay in my earlier reply, but it is a third option. In short, imagine a selective replay option whose sole selectivity is the eligible issue window. If it takes 5 cycles to learn whether a load op either hit or missed in the cache, then replay every op that issued after the load wakeup signal in those 5 cycles.

If we consider the terminology established in academia, then hereon in this post I will use the conventions of [1] Kim and Lipasti, which are also used by [2] Perais, Seznec, et al. The three established recovery options, from most expensive to least expensive, are (1) refetch, (2) non-selective replay, and (3) selective replay.

The refetch policy is frequently referred to as "flush". Actually, the flush and the refetch are two coupled tasks with separate latencies. The flush is the removal of speculative work, and the refetch is the restarting process for new work. Both tasks must complete before the processor can continue. The words "flush" and "squash" both may be used at various times to refer to any cancellation of op execution (without flush and refetch) or the removal of ops from specific structures in the processor (for some specific purpose). However, in absence of any more specifics, "flush" is most likely to mean "flush and refetch".

The non-selective replay policy does not flush and refetch. It is a very conservative version of selective replay, where data dependencies are completely ignored. Possible dependent instructions are simply chosen for the cycles in which they issued from the scheduler. If an instruction could possibly have been data-dependent on the canceled load, then it is replayed. Non-selective replay scales negatively with increases in out-of-order resources, though it is at least better than refetch.

Selective replay, then, is a replay policy that reschedules only the data-dependent instructions. In the references that I cite, selective replay is perfect, i.e. it operates only on truly dependent instructions. This leaves a gap in the definitions, because we can imagine a scheme that is of intermediate intelligence. In my previous explanation, I took the liberty to expand my definition of "selective replay" to include intermediate schemes. Selective replay does not cancel independent ops, so the negative scaling is greatly reduced, but not eliminated. The replayed ops have energy and performance costs, but at least they are fewer in number, and the replays were required for correctness in any case.

I hope this brings more clarity than I could provide in my first attempt.

amrnuke said:
Purportedly high-quality replay mechanisms abound, whether token-based selective replay, a replay loop with latency prediction, or whatever. Apple probably has their own modified or perhaps entirely new mechanism that may share features of both. I am curious whether we can glean anything about their branch prediction, scheduling, or replay schemes based on the uarch information we've been given thus far.

Regarding selective replay being relatively independent of instruction window depth, is that true of all selective replay schemes? IIRC there are some that become exponentially more costly with increasing width or depth (I cannot recall which, or if both possibly are contributors) - but I recall token-based selective replay being developed to circumvent that problem.

I expect that selective replay also scales negatively with increased out of order resources. As you have mentioned, there are many specific proposals, so they may have significant differences or glass jaws. Generally speaking, more canceled ops mean more wasted energy and, in the long run, more performance loss (by delaying independent, available ops). However, I wanted to downplay the scaling cost for replay recovery in comparison to the much larger cost incurred by refetch recovery.

[1] Ilhyun Kim and Mikko H. Lipasti. "Understanding Scheduling Replay Schemes". 2004. See especially sections 3.2~3.4. Freely available online: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.75.3353&rep=rep1&type=pdf
[2] Arthur Perais, André Seznec, et al. "Cost-Effective Speculative Scheduling in High Performance Processors". 2015. See especially section 2.1. Freely available online: https://hal.inria.fr/hal-01193233/document

Eug · Nov 17, 2020

@shady28, I feel honoured you posted my blog benchmarks from 14 years ago. Ah, the memories. Blast from the past!

Here is my full blog post. Lots of benches in there.

iMac Core 2 Duo Benchmarks: It's Fast!

I received my 24" iMac a few days ago, and had a chance over the weekend to work with it a bit and run some benchmarks. I purchased a nearl...

everythingapple.blogspot.com

amrnuke · Nov 17, 2020

Hitman928 said:
I don't see why it wouldn't. Zen3 not only has higher IPC, but is also able to hit higher frequencies than Zen2 at the same or less voltage. I expect in the mobile space that anything lightly threaded, Zen3 will see the most benefit. Things that are heavy loads that scale to 16 threads will see the least benefit due to mobile power limits and still being on the same process. It will be interesting to see how often Apple refreshes their M* line as Zen4 is expected to drop in early 2022 which should put AMD and Apple at process parity, at least until late that year when TSMC 3nm is expected to begin HVM.

Looking at refresh timeframes on the Mac mini, MBA, and MBP - as well as their cadence on the iPad Pro and as such their X/Z chips, it looks like a 15-16 month cadence might be a starting point. However, the cadence on the iPad Pro muddies things. The device was refreshed in early 2020 but the chip is still using the same Vortex/Tempest 4+4 setup from the previous generation.

name99 · Nov 17, 2020

amrnuke said:
Great rundown, thank you!

In the context of his post, I took "flush" to mean non-selective replay:

So going back to my response to that, I have some questions: why mention a way predictor as a possibility? It is almost guaranteed to be used. They've been in place in AMD and Intel (and presumptively Apple) uarch for a long time. And speculative scheduling has been used for a long time. Handling replay via flushing (i.e. non-selective replay) hasn't been used (to my knowledge, after extensive searching) for a long time outside of the academic realm. I guess I am more perplexed at the implication that there is a a world where Apple aren't using way prediction, speculative scheduling, and a refined replay scheme. More the way he worded the post perhaps? "a way predictor is probably well worth doing (and could allow them to grow the L1 even larger ..." -- the implication being that they aren't already using way prediction? At least to my interpretation.

Purportedly, high-quality replay mechanisms abound, whether token-based selective replay, a replay loop with latency prediction, or whatever. Apple probably has their own modified or perhaps entirely new mechanism that may share features of many schemes. I am curious whether we can glean anything about their branch prediction, scheduling, or replay schemes based on the uarch information we've been given thus far, but don't know enough about the interplay of the various components to know where to begin formulating a guess.

Regarding selective replay being relatively independent of instruction window depth, is that true of all selective replay schemes? IIRC there are some that become exponentially more costly with increasing width or depth (I cannot recall which, or if both possibly are contributors) - but I recall token-based selective replay being developed to circumvent that problem.

Way predictor is not "almost guaranteed". If you
- use speculative scheduling and
- the cost of a mis-schedule is high (because your selective replay mechanism is inefficient in way way or another) and
- your way predictor is not close to perfect
then you'll lose as much from the mispredicted way's as you gain from the cycle saved.

Doing a very quick web search as far as I can tell AMD use a way predictor (but maybe not speculative scheduling); Intel use speculative scheduling (but maybe not a way predictor?).

(Perhaps it's not clear to you? Way predictor is a way to decrease the *energy* cost of a high associativity cache, in this case the L1D, at the cost of increasing access time if your predictions are incorrect. It has nothing to do with, eg, branch prediction. You don't have to use it; it gives no direct performance benefit, it's all about energy.)

beginner99 · Nov 17, 2020

shady28 said:
You are pointing to the tech-head sites as some kind of proof. All it will be is a black mark on their reliability as a source of *useful* information. AnandTech's overly technical review is fairly meaningless to normal users.

It's already starting:

https://www.reddit.com/r/mac/comments/jvojlx

Your own fault for not bowing to your new Apple overlords fully! /s

The network issue is also interesting...very high latency.

name99 · Nov 17, 2020

gai said:
Although I cannot speak for Maynard regarding his comments on way prediction, I would like to clarify that "flush" should not be considered a synonym for non-selective replay. I neglected non-selective replay in my earlier reply, but it is a third option. In short, imagine a selective replay option whose sole selectivity is the eligible issue window. If it takes 5 cycles to learn whether a load op either hit or missed in the cache, then replay every op that issued after the load wakeup signal in those 5 cycles.

If we consider the terminology established in academia, then hereon in this post I will use the conventions of [1] Kim and Lipasti, which are also used by [2] Perais, Seznec, et al. The three established recovery options, from most expensive to least expensive, are (1) refetch, (2) non-selective replay, and (3) selective replay.

The refetch policy is frequently referred to as "flush". Actually, the flush and the refetch are two coupled tasks with separate latencies. The flush is the removal of speculative work, and the refetch is the restarting process for new work. Both tasks must complete before the processor can continue. The words "flush" and "squash" both may be used at various times to refer to any cancellation of op execution (without flush and refetch) or the removal of ops from specific structures in the processor (for some specific purpose). However, in absence of any more specifics, "flush" is most likely to mean "flush and refetch".

The non-selective replay policy does not flush and refetch. It is a very conservative version of selective replay, where data dependencies are completely ignored. Possible dependent instructions are simply chosen for the cycles in which they issued from the scheduler. If an instruction could possibly have been data-dependent on the canceled load, then it is replayed. Non-selective replay scales negatively with increases in out-of-order resources, though it is at least better than refetch.

Selective replay, then, is a replay policy that reschedules only the data-dependent instructions. In the references that I cite, selective replay is perfect, i.e. it operates only on truly dependent instructions. This leaves a gap in the definitions, because we can imagine a scheme that is of intermediate intelligence. In my previous explanation, I took the liberty to expand my definition of "selective replay" to include intermediate schemes. Selective replay does not cancel independent ops, so the negative scaling is greatly reduced, but not eliminated. The replayed ops have energy and performance costs, but at least they are fewer in number, and the replays were required for correctness in any case.

I hope this brings more clarity than I could provide in my first attempt.

I expect that selective replay also scales negatively with increased out of order resources. As you have mentioned, there are many specific proposals, so they may have significant differences or glass jaws. Generally speaking, more canceled ops mean more wasted energy and, in the long run, more performance loss (by delaying independent, available ops). However, I wanted to downplay the scaling cost for replay recovery in comparison to the much larger cost incurred by refetch recovery.

[1] Ilhyun Kim and Mikko H. Lipasti. "Understanding Scheduling Replay Schemes". 2004. See especially sections 3.2~3.4. Freely available online: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.75.3353&rep=rep1&type=pdf
[2] Arthur Perais, André Seznec, et al. "Cost-Effective Speculative Scheduling in High Performance Processors". 2015. See especially section 2.1. Freely available online: https://hal.inria.fr/hal-01193233/document

The essential point that I hope you are getting from both mine and Gai's comments is that none of this stuff is nearly as simple as is expressed in just a few words.
To say that a CPU has x-sized ROB and y-cycle L1 cache access and z-wide PRF does not actually tell you very much. EVERYTHING interesting is in the details -- how does that ROB handle a control misspeculation? Does that L1 cache access involve scheduling speculation and/or way speculation? how often does it mis-speculate? how fast is the recovery from a mis-speculation? How many resource amplification mechanisms (like fusion) are in play to get more value from that PRF? etc etc

And these details are never made public :-(
This is one of the things we can hope for now, from the M1 Macs. That it will be easier for people to start writing various types of stressor code to start reverse engineering exactly how Apple does these things.
(Those as I pointed out, Apple seems to be moving crazy fast. By the time we have figured out some detail of the M1, it may well be completely different, not just scaled up, in the A15 generation...)

Heartbreaker · Nov 17, 2020

Antey said:
why stop there then?

I was heading out the door, and that was interesting because x86 Chrome was hanging in with Native Chrome on Ryzen 4700u.

name99 · Nov 17, 2020

gai said:
Although I cannot speak for Maynard regarding his comments on way prediction, I would like to clarify that "flush" should not be considered a synonym for non-selective replay. I neglected non-selective replay in my earlier reply, but it is a third option. In short, imagine a selective replay option whose sole selectivity is the eligible issue window. If it takes 5 cycles to learn whether a load op either hit or missed in the cache, then replay every op that issued after the load wakeup signal in those 5 cycles.

If we consider the terminology established in academia, then hereon in this post I will use the conventions of [1] Kim and Lipasti, which are also used by [2] Perais, Seznec, et al. The three established recovery options, from most expensive to least expensive, are (1) refetch, (2) non-selective replay, and (3) selective replay.

The refetch policy is frequently referred to as "flush". Actually, the flush and the refetch are two coupled tasks with separate latencies. The flush is the removal of speculative work, and the refetch is the restarting process for new work. Both tasks must complete before the processor can continue. The words "flush" and "squash" both may be used at various times to refer to any cancellation of op execution (without flush and refetch) or the removal of ops from specific structures in the processor (for some specific purpose). However, in absence of any more specifics, "flush" is most likely to mean "flush and refetch".

The non-selective replay policy does not flush and refetch. It is a very conservative version of selective replay, where data dependencies are completely ignored. Possible dependent instructions are simply chosen for the cycles in which they issued from the scheduler. If an instruction could possibly have been data-dependent on the canceled load, then it is replayed. Non-selective replay scales negatively with increases in out-of-order resources, though it is at least better than refetch.

Selective replay, then, is a replay policy that reschedules only the data-dependent instructions. In the references that I cite, selective replay is perfect, i.e. it operates only on truly dependent instructions. This leaves a gap in the definitions, because we can imagine a scheme that is of intermediate intelligence. In my previous explanation, I took the liberty to expand my definition of "selective replay" to include intermediate schemes. Selective replay does not cancel independent ops, so the negative scaling is greatly reduced, but not eliminated. The replayed ops have energy and performance costs, but at least they are fewer in number, and the replays were required for correctness in any case.

I hope this brings more clarity than I could provide in my first attempt.

I expect that selective replay also scales negatively with increased out of order resources. As you have mentioned, there are many specific proposals, so they may have significant differences or glass jaws. Generally speaking, more canceled ops mean more wasted energy and, in the long run, more performance loss (by delaying independent, available ops). However, I wanted to downplay the scaling cost for replay recovery in comparison to the much larger cost incurred by refetch recovery.

[1] Ilhyun Kim and Mikko H. Lipasti. "Understanding Scheduling Replay Schemes". 2004. See especially sections 3.2~3.4. Freely available online: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.75.3353&rep=rep1&type=pdf
[2] Arthur Perais, André Seznec, et al. "Cost-Effective Speculative Scheduling in High Performance Processors". 2015. See especially section 2.1. Freely available online: https://hal.inria.fr/hal-01193233/document

Oh, one other thing to add to Gai's points.
Everything he has been saying is in the context of schedule speculation.
Recall that I pointed out other types of data speculation - load/store aliasing (common now for years) and value speculation (apparently not yet implemented).

Both of these would require a mechanism for dealing with incorrectly speculated data, and the ideal would be some form of replay, not flush. In other words you'd have to tag the speculated instruction, propagate that tag on to everything that used that instruction result, and have a mechanism to replay every tagged instruction at the point where you check validity (probably, but not necessarily, at completion). These speculation checks have to operate differently from the mechanisms he described because they aren't bounded by the tight cycle boundaries inherent in speculative scheduling.

Yes, this stuff is complicated!

iwulff · Nov 17, 2020

Hitman928 said:
There's also this quote from that article:

Based upon that and Anandtech's power results, looks like the M1 TDP is ~22W give or take a couple of watts.

That's with being on a smaller node and without a screen. I'm not that impressed actually...

Mopetar · Nov 17, 2020

beginner99 said:
Your own fault for not bowing to your new Apple overlords fully! /s

The network issue is also interesting...very high latency.

I didn't read the post in full, but typically a gaming PC has a wired ethernet connection. Is the person comparing ping times over wireless against something that's likely plugged in?

Either this is some kind of elaborate troll or this person is exactly the kind of Mac user that PC enthusiasts like to make fun of all the time.

Viknet · Nov 17, 2020

Hitman928 said:
Based upon that and Anandtech's power results, looks like the M1 TDP is ~22W give or take a couple of watts.

Someone did power profile under multicore cinebench load:

https://twitter.com/x/status/1328715510100344833

Exactly 15W for CPU cores only.

Shivansps · Nov 17, 2020

beginner99 said:
This is basically same as with consoles. If you have a limited range of hardware, you can extract far more with less. In this case Apple can add accelerators and control it via their own API (metal). In terms of GPU it probably helps a lot they do not need to support a ton of legacy APIs and certainly not directx. Also it's almost certainly tile-based with all the advantages and disadvantages. Apple doesn't need to care about compute because they don't make compute cards based on same uArch and they can run AI on the dedicated AI cores. I would expect this to save die space and power. It's the fully vertically integrated part showing it's power.

This would be cool if it weren' fore the vendor-lock in and losing any kind of freedom of what you do with your device and what you install. App store on phone, OK if you can install in different ways. But on a PC? I don't even know why I would risk developing software for this ecosystem if the can pull my key at any moment for any reason and basically bankrupt developers. Once people are lured in, they will start pushing their agenda more and more and nothing you can do about it.

Cool hardware-wise and impossible for intel/amd/nv to compete given the need for higher flexibility of x86/dGPUs. But a no go platform-wise.

Well, in modern GPUs for Windows and Linux you dont need to support DX11/DX9/OGL either, we are well past the point were you can make DX11/DX9 and OGL to run over a Vulkan/DX12 wrapper perfectly fine, no need for hardware support anymore.

Eug · Nov 17, 2020

Some pertinent points from the AnandTech article.

AnandTech Forums: Technology, Hardware, Software, and Deals

Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

www.anandtech.com

As mentioned already, Andrei seem to think the M1 TDP is somewhere a bit north of 20 Watts. That surprised me a little as I thought it would be below 20 Watts, but I'll defer to his expertise of course.

M1 gets full memory bandwidth with just a single core, at 58 GB/s read and 35 GB/s write. Bandwidth actually decreases somewhat as you add more cores.

Ryzen 5950X wins for Cinebench R23 ST performance. M1 and Intel Core 1165G7 are effectively tied for second place.

M1 wins at Geekbench 5 ST.

M1 wins at SPEC2006 ST, both for int and for fp.

Ryzen 5950X wins at SPECint2017 ST, but M1 wins at SPECfp2017 ST.

Rosetta 2 performance ranges from 50-95% native according to SPEC2006 and SPEC2017 subtests, mostly in about the 70-80% range.

amrnuke · Nov 17, 2020

name99 said:
- Here's a summary of what the ROB does:

RWT Forums - Real World Tech

content overridden

www.realworldtech.com

Basically a larger ROB (IF EVERYTHING ELSE IS ALSO SCALED APPROPRIATELY) allows you to do more work after a load misses to DRAM.

- Replay means that a single instruction (and usually a few dependents) generated incorrect results because of incorrect speculation, so they need to be executed again.
Note the issue.
CONTROL SPECULATION determines the order of instruction; when it goes wrong you have to flush everything and start again.
DATA SPECULATION (of various sorts) or SCHEDULE SPECULATION have the correct instructions in the correct order, it's just that the instruction generated the wrong result. So you don't have to flush everything, you just re-execute the instruction in question.

- There is no penalty for predication! Like I tried to say politely (but this infuriates me, it is part of a toxic legacy from the x86 community). Predicated instructions execute like, say, an Add With Carry. Three inputs go in an ALU (two registers and flags) and a single value comes out. Nothing special, just like any other single cycle ALU execution.

I could say more about just *why* the x86 community got it (and still gets it) so wrong, but I'm supposed to be withholding my bile.

Excellent - thank you again.

As for predication, I seem to remember there being a penalty to using it somewhere - and perhaps this was a while ago. On a quick search, it appears that predication had a penalty on microarchitecture with long pipelines, but I'm not clear on whether this penalty exists on OoO systems or not.

Heartbreaker · Nov 17, 2020

Eug said:
Some pertinent points from the AnandTech article.

AnandTech Forums: Technology, Hardware, Software, and Deals

Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

www.anandtech.com

As mentioned already, Andrei seem to think the M1 TDP is somewhere a bit north of 20 Watts. That surprised me a little as I thought it would be below 20 Watts, but I'll defer to his expertise of course.

M1 gets full memory bandwidth with just a single core, at 58 GB/s read and 35 GB/s write. Bandwidth actually decreases somewhat as you add more cores.

Ryzen 5950X wins for Cinebench R23 ST performance. M1 and Intel Core 1165G7 are effectively tied for second place.

M1 wins at Geekbench 5 ST.

M1 wins at SPEC2006 ST, both for int and for fp.

Ryzen 5950X wins at SPECint2017, but M1 wins at SPECfp2017 ST.

Rosetta 2 performance ranges from 50-95% native according to SPEC2006 and SPEC2017 subtests, mostly in about the 70-80% range.

I just finished reading the AnandTech article, and it's another VERY impressive showing.

I never owned an Apple product in my life, but with M1, the odds just went up dramatically that this will change.

Discussion Apple Silicon SoC thread

Lifer

Diamond Member

Member

Golden Member

Golden Member

Diamond Member

Diamond Member

Platinum Member

Lifer

Senior member

Platinum Member

Junior Member

Lifer

Golden Member

Senior member

Diamond Member

Senior member

Diamond Member

Senior member

Junior Member

Diamond Member

Junior Member

Diamond Member

Lifer

Golden Member

Diamond Member