Question Zen 6 Speculation Thread

adroc_thurston · Jan 24, 2026

basix said:
But how will e.g. AVX512

Normal 256b SIMD, just with halved (1+1) FMA/FADD count.

basix said:
AMX supported by a super power and area optimized core?

Per cluster.
ACE is just x86 SME/sSVE duo.

jpiniero said:
I'd be surprised if AMD supports AMX.

Zen7 has ACE support rofl

Win2012R2 · Jan 24, 2026

adroc_thurston said:
Per cluster.
ACE is just x86 SME/sSVE duo.

Not per core? How is that going to work then when multiple cores try to execute AMX stuff?

adroc_thurston · Jan 24, 2026

Win2012R2 said:
Not per core? How is that going to work then when multiple cores try to execute AMX stuff?

In a very cursed and funny way (with internal arbitrator on the SME unit).
Refer to ARM SME docs, they're hella funny.

Win2012R2 · Jan 24, 2026

adroc_thurston said:
In a very cursed and funny way (with internal arbitrator on the SME unit).

That can't possibly be fast with cores being stalled competing for single AMX unit, I guess the advantage is that at least code won't crash, but it will run like dogs**t.

adroc_thurston · Jan 24, 2026

Win2012R2 said:
That can't possibly be fast with cores being stalled competing for single AMX unit

They don't, the unit has an arbitrator and is heavily multithreaded.
just read the docs.

Win2012R2 · Jan 24, 2026

adroc_thurston said:
They don't

How is a thread not stalled when it has to wait for AMX code to execute on that single unit?

It sure can work but very slowly if many cores start accessing same stuff on contended AMX unit, sounds total crap when OS moves AMX code to those lousy cores.

adroc_thurston · Jan 24, 2026

Win2012R2 said:
How is a thread not stalled when it has to wait for AMX code to execute on that single unit?

Because it's not executed on the CPU, same as Intel AMX.
It's a weird in-order coprocessor driven thru the load/store unit.

Win2012R2 said:
It sure can work but very slowly if many cores start accessing same stuff on contended AMX unit

It just SMT's threads.

Win2012R2 · Jan 24, 2026

adroc_thurston said:
Because it's not executed on the CPU, same as Intel AMX.

Are you saying that on Intel CPUs that claim to support AMX each core has not got at least one dedicated AMX unit?

adroc_thurston said:
It's a weird in-order coprocessor driven thru the load/store unit.

So execution on code will have to wait till it's done, in other words stalls waiting for its turn to get executed, how the feck does that work with AMX registers (tiles), also not on per core? Sounds BS to me.

adroc_thurston · Jan 24, 2026

Win2012R2 said:
Are you saying that on Intel CPUs that claim to support AMX each core has not got at least one dedicated AMX unit?

AMX is not part of the core.
It's a separate thing driven thru the load/store unit.

Win2012R2 · Jan 24, 2026

adroc_thurston said:
AMX is not part of the core. It's a separate thing driven thru the load/store unit.

How many units per core, at least 1?

adroc_thurston · Jan 24, 2026

Win2012R2 said:
How many units per core, at least 1?

Yeah but that's irrelevant.
your AMX .tile stuff usually hangs around your memory scheduler as kind-of memory ops until the unit gives you the matmul outputs back.

Win2012R2 · Jan 24, 2026

adroc_thurston said:
Yeah but that's irrelevant.

It's relevant for performance because each Intel core got dedicated AMX unit, whereas what you are trying to say about AMD is that they will have 1 per cluster (of multiple cores) - this will run like a dog, maybe ok in 2 core cluster, but who will make that for economy-cores.

adroc_thurston said:
your AMX .tile stuff usually hangs around your memory scheduler as kind-of memory ops until the unit gives you the matmul outputs back.

Yeah, and when core waits for memory then it's famously running very fast.

adroc_thurston · Jan 24, 2026

Win2012R2 said:
It's relevant for performance because each Intel core got dedicated AMX unit

NO!
It doesn't matter at all! they're weird async coprocessors.

Win2012R2 said:
this will run like a dog

how does Apple SME work then lmao

Win2012R2 · Jan 24, 2026

adroc_thurston said:
how does Apple SME work then lmao

Who cares, that's 10% of the market and Apple controls software stack, and ultimately most Apple's users will eat anything they give them.

adroc_thurston · Jan 24, 2026

Win2012R2 said:
Who cares, that's 10% of the market and Apple controls software stack, and ultimately most Apple's users will eat anything they give them.

the hardware.
how does it work?

Win2012R2 · Jan 24, 2026

adroc_thurston said:
the hardware. how does it work?

No idea, but what I know is that if there is a shared unit and many cores try to access it then it will stall those threads due to contention on shared resource.

adroc_thurston · Jan 24, 2026

Win2012R2 said:
No idea

why are you arguing with me then.

Win2012R2 said:
but what I know is that if there is a shared unit and many cores try to access it then it will stall those threads due to contention on shared resource.

it's a heavily multithreaded async coprocessor.

Win2012R2 · Jan 24, 2026

adroc_thurston said:
why are you arguing with me then.

Because you are full of it and your discussion attitude stinks and someone needs to point that out?

adroc_thurston said:
it's a heavily multithreaded async coprocessor.

Wow, that's genius invention, so explain to me how many cores try to execute AMX stuff on that coprocessor, that gets queued because it's a shared effing resource, and threads stall by waiting for results to be returned from AMX to do further processing, which they can't otherwise because further code is dependent on response from AMX = stalled due to contention on shared device, does not matter if it runs a million threads there if there is one actual AMX unit.

adroc_thurston · Jan 24, 2026

Win2012R2 said:
Because you are full of it and your discussion attitude stinks and someone needs to point that out?

cope.

Win2012R2 said:
so explain to me how many cores try to execute AMX stuff on that coprocessor, that gets queued because it's a shared effing resource

sharing what resource, it's multithreaded and does xboxhueg GEMM in very few cycles. that's the point.
core pushes DATA via sSVE/tilestuff and gets it back in a few cycles.
(well ok ARM SME has its own ld/st, sSVE just instructs the actual veclen and operand order and stuffz).

Win2012R2 said:
and threads stall by waiting for results to be returned from AMX

you do understand that modern cores are out-of-order and can hide the latency of weird async ops just fine?

Win2012R2 said:
which they can't otherwise because further code is dependent on response from AMX = stalled due to contention on shared device, does not matter if it runs a million threads there if there is one actual AMX unit.

the whole point is that it does GEMM a gajillion times faster than a normal CPU core. it's a matmul coprocessor. just like FP in the 1980s.

Joe NYC · Jan 24, 2026

adroc_thurston said:
Venice is >1.7x SIR2017 vs 9965.
OMR is ???? SIR2017 vs 9950X.

Ok, @adroc_thurston assigned the thread some homework and I come here a day later, and no one turned in the homework yet.

Win2012R2 · Jan 24, 2026

adroc_thurston said:
you do understand that modern cores are out-of-order and can hide the latency of weird async ops just fine?

Oh wow, if they are out-of-order why are we getting tiny IPC improvements? Obviously because there are dependencies on previous results, and that happens a lot in CPUs.

adroc_thurston said:
sharing what resource, it's multithreaded and does xboxhueg GEMM in very few cycles. that's the point.

Sharing AMX unit if it's one per cluster (as in per multiple cores), yes, then it's a shared resource by definition. Unless you start making multiple shared AMX units, then contention ratio will obviously be lower and perf will improve.

adroc_thurston said:
the whole point is that it does GEMM a gajillion times faster than a normal CPU core. it's a matmul coprocessor. just like FP in the 1980s.

Yes, awesome, but when 8 cores access single AMX unit that won't give more than 1 AMX unit worth of perf, less really due to sharing overheads.
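
To put rough numbers on both sides of this argument, here is a minimal back-of-the-envelope sketch. It is not a model of any real AMX, ACE or SME implementation. All names and parameters are hypothetical: `gemm_fraction` is the share of a job that is matmul work, `unit_speedup` is how much faster the matmul unit is than a plain core at that work, and the shared case assumes the worst, i.e. all GEMM work fully serializes on one unit with no overlap from out-of-order execution.

```python
def per_core_unit_time(gemm_fraction, unit_speedup):
    """Job time (normalized to 1.0 on a plain core) with a dedicated unit per core."""
    return (1.0 - gemm_fraction) + gemm_fraction / unit_speedup


def shared_unit_time(n_cores, gemm_fraction, unit_speedup):
    """Worst-case time for n_cores identical jobs sharing ONE matmul unit.

    The non-GEMM part runs in parallel on every core; the GEMM parts of
    all cores are assumed to queue up back to back on the shared unit.
    """
    scalar_part = 1.0 - gemm_fraction
    serialized_gemm = n_cores * gemm_fraction / unit_speedup
    return scalar_part + serialized_gemm


if __name__ == "__main__":
    # Hypothetical numbers: 8 cores, half of each job is GEMM, unit is 16x a core.
    n, f, s = 8, 0.5, 16.0
    print("no unit:       ", 1.0)
    print("unit per core: ", per_core_unit_time(f, s))
    print("one shared unit:", shared_unit_time(n, f, s))
```

With these made-up inputs the shared unit is slower than one unit per core but still faster than having no unit at all. Sharing only loses to plain cores in this model once `n_cores` exceeds `unit_speedup`, which is why the actual unit throughput and the number of cores per cluster decide who is right here.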

adroc_thurston · Jan 24, 2026

Joe NYC said:
Ok, @adroc_thurston assigned the thread some homework and I come here a day later, and no one turned in the homework yet.

people here love the discourse and hate doing math.
RWT forums are the opposite but unfortunately people like Maynard drive me absolutely insane so I'll never ever post there.

Win2012R2 said:
Oh wow, if they are out-of-order why are we getting tiny IPC improvements?

Because huge IPC improvements (see Royal Core) collapse your fmax.
Dennard scaling, is, you know, VERY DEAD.

Win2012R2 said:
Sharing AMX unit if it's one per cluster (as in per multiple cores), yes, then it's a shared resource by definition

read the goddamn docs. please.

Win2012R2 said:
but when 8 cores access single AMX unit that won't give more than 1 AMX unit worth of perf, less really due to sharing overheads.

you know that GEMM rates for Apple SME are public. they're pretty dang good man.

Win2012R2 · Jan 24, 2026

adroc_thurston said:
you know that GEMM rates for Apple SME are public. they're pretty dang good man.

Intel server cores: have they got at least 1 AMX unit per core or not? Simple yes or no.

adroc_thurston · Jan 24, 2026

Win2012R2 said:
Intel server cores: have they got at least 1 AMX unit per core or not?

we don't care about Intel because Intel sucks.
Focus on implementations that are worth mentioning and read the SME (and CX925) docs just in case.

poke01 · Jan 24, 2026

It’s pretty good at FP64 too
