adroc_thurston · Diamond Member · Jul 2, 2023

> It's pretty good at FP64 too
yeah, le DGEMM is the only argument I have for GEMM blobs in client outside of .gb6 pumping, but who the hell runs DGEMM in client?

> we don't care about Intel because Intel sucks.
They are certainly a benchmark for AMX execution and will be compared against AMD.

> They are certainly a benchmark for AMX execution and will be compared against AMD.
no, the benchmark is Apple since they ship it everywhere.

> So the question stands: if Intel uses dedicated per-core AMX units, then why the **** won't perf be terrible if a single AMX unit gets shared by multiple cores, potentially a high number like 8 or more?
a) you know you can build a bigger matmul unit for 8 cores. Apple's one is p chungus and has a ton of juice. very nice for DGEMM.

> yeah, le DGEMM is the only argument I have for GEMM blobs in client outside of .gb6 pumping, but who the hell runs DGEMM in client?
no one. But OpenBLAS did add support for SGEMM for SME targets, which is good to see.

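For context, a minimal sketch of what such a call looks like from user code. Nothing here is SME-specific: the CBLAS API is unchanged, and a recent OpenBLAS build decides internally whether to dispatch to an SME kernel on supporting hardware.

```c
/* Minimal CBLAS SGEMM example: C = alpha*A*B + beta*C.
 * Link against OpenBLAS, e.g. `cc sgemm_demo.c -lopenblas`. */
#include <cblas.h>
#include <stdio.h>

int main(void) {
    enum { M = 4, N = 4, K = 4 };
    float A[M * K], B[K * N], C[M * N];

    /* Fill A and B with something simple; zero C. */
    for (int i = 0; i < M * K; i++) A[i] = (float)(i + 1);
    for (int i = 0; i < K * N; i++) B[i] = 1.0f;
    for (int i = 0; i < M * N; i++) C[i] = 0.0f;

    /* Row-major SGEMM: C = 1.0*A*B + 0.0*C. */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0f, A, K, B, N, 0.0f, C, N);

    printf("C[0][0] = %f\n", C[0]);  /* = 10.0, since B is all ones */
    return 0;
}
```
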
> you know you can build a bigger matmul unit for 8 cores
So that's the equivalent of multiple AMX units then. It reduces shared contention, but it's still a shared resource that runs slower when all cores need it.

> no, the benchmark is Apple since they ship it everywhere.
Not in the x86 market.

> no one. But OpenBLAS did add support for SGEMM for SME targets, which is good to see.
yup.

> So that's the equivalent of multiple AMX units then
NO.

> reduces shared contention
NO.

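For what it's worth, here is the arithmetic both sides are gesturing at, as a toy model. All throughput numbers are invented for illustration, not real AMX/SME figures: one big shared unit with 8x the MAC rate of a small per-core unit, ideally time-sliced across whichever cores are active.

```c
/* Toy model of per-core matmul throughput: shared unit vs. per-core units.
 * All rates are hypothetical MACs/cycle, not measurements of real hardware. */
#include <stdio.h>

int main(void) {
    const double shared_rate  = 512.0; /* one big shared unit (assumed) */
    const double percore_rate = 64.0;  /* one small per-core unit (assumed) */
    const int cores = 8;

    printf("active  shared/core  per-core\n");
    for (int k = 1; k <= cores; k++) {
        /* Ideal time-slicing: the shared unit's rate divides across users. */
        double shared_per_core = shared_rate / k;
        printf("%6d  %11.1f  %8.1f\n", k, shared_per_core, percore_rate);
    }
    /* With these numbers the shared unit wins per-core until all 8 cores
     * contend (512/8 = 64), where the two designs tie at equal total rate. */
    return 0;
}
```
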
> The unit is heavily multithreaded and has its own ld/st pipeline in the SME case.
Will classic Zen cores have it per cluster (12 cores?) too?

> Will classic Zen cores have it per cluster (12 cores?) too?
yes? Again, feature support is fully flat across the uarch family.

> Then the question is if ACE makes it to Diamond Rapids or Coral Rapids.
Coral. DMR is AMX.

> On the AMD side, it will be in Zen 7, which should launch between Diamond and Coral.
Coral is trying to intercept Florence.

> yes? Again, feature support is fully flat across the uarch family.
Ok, shared unit too, so what will run faster at the same frequency:
1) 12 Intel cores with AMX as is now
or
2) 12 AMD cores with per-cluster AMX

> Ok, shared unit too, so what will run faster at the same frequency -
The former.

> It should also be a competition, by proxy, between TSMC A14 and Intel 14A, in time to HVM.
man that's an optimistic view of 14A lmao

> Then, if TSMC A14 beats 14A to HVM, it would imply that Florence beats Coral to market, assuming Florence is one of the first products on A14.
well Venice beats DMR to market and clowns all over it so...

> will Intel implement AMX per core in client?
no lmao, their core area is bad enough as is.

> The former.
Ok, good, dedicated per-core units will run faster, all other things equal.

> But why would you want that? If you wanna do Actual Real Matmul at AMD, GPU is your friend.
Indeed, why implement this in the CPU in the first place? Intel had to do it because of a lack of GPU; having an integrated GPU seems a far better way, maybe with higher latency than the CPU.

> dedicated per-core units will run faster, all other things equal.
It's not 'faster', you just dedicate more area to matmul.

> Now how much faster would Intel be with the 12-core scenario in this case vs AMD - like 2 times faster, or perhaps 12?
idk.

> How much faster is a single gfx1311 CU over an AMX core in GEMM?
No idea. We are talking about CPUs here; you just don't want to answer the obvious question that a shared single unit for 12 cores under heavy usage will be slower, like more than 12 times slower than in the scenario of 1 dedicated unit per core: i.e. dogsh*t slow like I said, which is obvious to anybody who actually profiles stuff and runs into contention like this.

> It's not 'faster', you just dedicate more area to matmul.
Oh wow, a 600 mm² GPU is not faster than a 200 mm² one, it's just got more area dedicated.

> No idea. We are talking about CPUs here; you just don't want to answer the obvious question that a shared single unit for 12 cores under heavy usage will be slower, like more than 12 times slower than in the scenario of 1 dedicated unit per core: i.e. dogsh*t slow like I said, which is obvious to anybody who actually profiles stuff and runs into contention like this.
shared units give you nice matmul rates at relatively minimal area investment.

> The solution is obviously running this stuff on GPUs and dropping the AMX crap.
No, cuz shared units are cheap and nice nuff.

> Oh wow, a 600 mm² GPU is not faster than a 200 mm² one, it's just got more area dedicated.
not my point.

> shared units give you nice matmul rates at relatively minimal area investment.
Yes sure, nice of you to acknowledge it is a shared unit, thus leading to contention = reduction in speed.

> Indeed, why implement this in the CPU in the first place? Intel had to do it because of a lack of GPU; having an integrated GPU seems a far better way, maybe with higher latency than the CPU.
Because the CPU is standardised, unlike GPU or NPU. It's easy for devs too.

> thus leading to contention = reduction in speed.
It's not a 'reduction in speed', you're just doing fewer matmul ops a clock.

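To put numbers on the "more than 12 times slower" claim: under ideal round-robin arbitration (a simplifying assumption; a real arbiter adds some overhead on top), full contention costs each core exactly a factor of N in issue slots while the unit's aggregate rate is unchanged. A toy simulation with made-up parameters, not a model of any real AMX/SME arbiter:

```c
/* Toy round-robin arbitration sim: 12 cores contend for one shared
 * matmul unit that accepts one tile op per cycle (hypothetical). */
#include <stdio.h>

int main(void) {
    const int cores = 12, cycles = 120000;
    long issued[12] = {0};

    /* Ideal round-robin: every cycle, exactly one waiting core issues. */
    for (int c = 0; c < cycles; c++)
        issued[c % cores]++;

    long total = 0;
    for (int i = 0; i < cores; i++) total += issued[i];

    printf("per-core ops/cycle:  %.4f (vs 1.0 uncontended)\n",
           (double)issued[0] / cycles);
    printf("aggregate ops/cycle: %.4f\n", (double)total / cycles);
    /* Each core sees exactly a 12x per-core slowdown at full contention;
     * the aggregate rate is unchanged. Anything worse than 12x would have
     * to come from arbitration/queueing overhead on top of this ideal. */
    return 0;
}
```
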
> Because the CPU is standardised, unlike GPU or NPU. It's easy for devs too.
bingo.
