Question: Zen 6 Speculation Thread


adroc_thurston

Diamond Member
Jul 2, 2023
How is a thread not stalled when it has to wait for AMX code to execute on that single unit?
Because it's not executed on the CPU, same as Intel AMX.
It's a weird in-order coprocessor driven thru the load/store unit.
It sure can work, but very slowly if many cores start accessing the same stuff on a contended AMX unit
It just SMT's threads.
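For reference, here's roughly what the core-side view looks like with Intel's documented AMX tile intrinsics: the tile loads and the final store are issued like memory ops, and the multiply-accumulate itself runs on the tile (TMUL) unit. A minimal sketch only; the function name and tile shapes are illustrative, not taken from any real code, and on Linux you also have to request the AMX tile state from the kernel first.

```c
// Minimal sketch: one 16x16 int8 tile multiply via Intel AMX intrinsics.
// Build with something like -mamx-tile -mamx-int8 on an AMX-capable CPU.
// (On Linux, AMX state must first be requested via arch_prctl(ARCH_REQ_XCOMP_PERM, 18).)
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// 64-byte tile configuration blob, laid out as documented by Intel.
struct tilecfg {
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];   // bytes per row for each tile register
    uint8_t  rows[16];    // rows for each tile register
} __attribute__((aligned(64)));

void tiny_gemm_int8(const int8_t *A, const int8_t *B, int32_t *C)
{
    struct tilecfg cfg;
    memset(&cfg, 0, sizeof(cfg));
    cfg.palette_id = 1;
    cfg.rows[0] = 16; cfg.colsb[0] = 64;   // tmm0: C accumulator, 16x16 int32
    cfg.rows[1] = 16; cfg.colsb[1] = 64;   // tmm1: A panel, 16x64 int8
    cfg.rows[2] = 16; cfg.colsb[2] = 64;   // tmm2: B panel, VNNI-packed
    _tile_loadconfig(&cfg);

    _tile_zero(0);
    _tile_loadd(1, A, 64);        // tile loads go down the load/store path
    _tile_loadd(2, B, 64);
    _tile_dpbssd(0, 1, 2);        // the actual matmul runs on the tile unit
    _tile_stored(0, C, 64);       // the result comes back as a store-like op

    _tile_release();
}
```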
 
  • Like
Reactions: marees

Win2012R2

Golden Member
Dec 5, 2024
Because it's not executed on the CPU, same as Intel AMX.
Are you saying that on Intel CPUs that claim to support AMX, each core does not have at least one dedicated AMX unit?
It's a weird in-order coprocessor driven thru the load/store unit.
So the code's execution will have to wait till it's done, in other words it stalls waiting for its turn to get executed. How the feck does that work with the AMX registers (tiles), are those also not per core? Sounds like BS to me.
 

Win2012R2

Golden Member
Dec 5, 2024
Yeah but that's irrelevant.
It's relevant for performance, because each Intel core has a dedicated AMX unit, whereas what you are trying to say about AMD is that they will have one per cluster (of multiple cores). That will run like a dog; maybe OK in a 2-core cluster, but who would build that for economy cores?
your AMX .tile stuff usually hangs around your memory scheduler as kind-of memory ops until the unit gives you the matmul outputs back.
Yeah, and a core waiting on memory is famously running very fast.
 

Win2012R2

Golden Member
Dec 5, 2024
why are you arguing with me then.
Because you are full of it and your discussion attitude stinks and someone needs to point that out?

it's a heavily multithreaded async coprocessor.
Wow, what a genius invention. So explain to me how that works when many cores try to execute AMX stuff on that coprocessor: it gets queued because it's a shared effing resource, and the threads stall waiting for results to come back from AMX before they can do further processing, because the rest of the code depends on the response from AMX = stalled due to contention on a shared device. It doesn't matter if it runs a million threads there if there is one actual AMX unit.
 

adroc_thurston

Diamond Member
Jul 2, 2023
Because you are full of it and your discussion attitude stinks and someone needs to point that out?
cope.
So explain to me how that works when many cores try to execute AMX stuff on that coprocessor: it gets queued because it's a shared effing resource
sharing what resource, it's multithreaded and does xboxhueg GEMM in very few cycles. that's the point.
core pushes DATA via sSVE/tilestuff and gets it back in a few cycles.
(well ok ARM SME has its own ld/st, sSVE just instructs the actual veclen and operand order and stuffz).
the threads stall waiting for results to come back from AMX
you do understand that modern cores are out-of-order and can hide the latency of weird async ops just fine?
because the rest of the code depends on the response from AMX = stalled due to contention on a shared device. It doesn't matter if it runs a million threads there if there is one actual AMX unit.
the whole point is that it does GEMM a gajillion times faster than a normal CPU core. it's a matmul coprocessor. just like FP in the 1980s.
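To make the latency-hiding point concrete, here's a minimal sketch, assuming the tile intrinsics from the earlier snippet plus a config that also defines tmm3/tmm4, with A and B pre-packed one 1 KiB (16 x 64 B) tile per K step. The only true serial dependence is the accumulator, so the loads and address math for the next step can be issued while the current multiply is still in flight (simplified; how much actually overlaps depends on how the core schedules tile registers).

```c
// Minimal sketch: accumulating a panel of C while overlapping tile loads
// with the in-flight multiply. Assumes a tile config with tmm0 as a 16x16
// int32 accumulator and tmm1-tmm4 as 16x64-byte int8 input tiles.
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

void gemm_panel_int8(const int8_t *A_tiles, const int8_t *B_tiles,
                     int32_t *C, size_t k_steps)
{
    _tile_zero(0);                                   /* tmm0 accumulates C    */
    for (size_t k = 0; k < k_steps; ++k) {
        if (k & 1) {
            _tile_loadd(3, A_tiles + k * 1024, 64);  /* odd steps use tmm3/4  */
            _tile_loadd(4, B_tiles + k * 1024, 64);
            _tile_dpbssd(0, 3, 4);
        } else {
            _tile_loadd(1, A_tiles + k * 1024, 64);  /* even steps use tmm1/2 */
            _tile_loadd(2, B_tiles + k * 1024, 64);
            _tile_dpbssd(0, 1, 2);
        }
        /* Only the tmm0 accumulator is a true serial dependence; the next
         * step's loads and address math don't wait on the pending multiply,
         * so an out-of-order core can keep issuing them. */
    }
    _tile_stored(0, C, 64);                          /* one dependent store   */
}
```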
 
  • Like
Reactions: Joe NYC

Win2012R2

Golden Member
Dec 5, 2024
you do understand that modern cores are out-of-order and can hide the latency of weird async ops just fine?
Oh wow, if they are out-of-order why are we getting tiny IPC improvements? Obviously because there are dependencies on previous results, and that happens a lot in CPU code.
sharing what resource, it's multithreaded and does xboxhueg GEMM in very few cycles. that's the point.
Sharing the AMX unit: if it's one per cluster (as in, per multiple cores), then yes, it's a shared resource by definition. Unless you start putting in multiple shared AMX units, in which case the contention ratio will obviously be lower and perf will improve.
the whole point is that it does GEMM a gajillion times faster than a normal CPU core. it's a matmul coprocessor. just like FP in the 1980s.
Yes, awesome, but when 8 cores access a single AMX unit, that won't give more than 1 AMX unit's worth of perf; less, really, due to sharing overheads.
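Whether sharing actually caps things at one useful unit's worth of perf depends entirely on how fast that one unit is relative to each core's own SIMD. A back-of-envelope sketch with purely illustrative numbers (none of these are figures for any real product):

```c
/* Back-of-envelope contention math; all numbers are illustrative assumptions.
 * Compares int8 MAC throughput of each core's own SIMD units against one
 * shared matrix unit divided evenly across the cores in a cluster. */
#include <stdio.h>

int main(void)
{
    const double cores         = 8;            /* cores sharing the matrix unit   */
    const double simd_macs_cyc = 256;          /* assumed int8 MACs/cycle per core */
    const double tile_macs     = 16 * 16 * 64; /* MACs in one 16x16x64 tile op     */

    /* Case A: a slow shared unit, one tile op every 16 cycles. */
    double slow_unit = tile_macs / 16.0;
    /* Case B: a fat shared unit, one tile op per cycle. */
    double fast_unit = tile_macs / 1.0;

    printf("per-core SIMD:            %6.0f MACs/cycle\n", simd_macs_cyc);
    printf("slow shared unit / core:  %6.0f MACs/cycle\n", slow_unit / cores);
    printf("fast shared unit / core:  %6.0f MACs/cycle\n", fast_unit / cores);
    return 0;
}
```

With the slow-unit assumption each core gets less than its own SIMD could deliver, so sharing really does hurt; with the fat-unit assumption each of the 8 cores still sees several times its own SIMD rate. The argument hinges on which of those the shared unit actually is.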
 

adroc_thurston

Diamond Member
Jul 2, 2023
Ok, @adroc_thurston assigned the thread some homework and I come here a day later, and no one turned in the homework yet.
people here love the discourse and hate doing math.
RWT forums are the opposite but unfortunately people like Maynard drive me absolutely insane so I'll never ever post there.
Oh wow, if they are out-of-order why are we getting tiny IPC improvements?
Because huge IPC improvements (see Royal Core) collapse your fmax.
Dennard scaling is, you know, VERY DEAD.
Sharing the AMX unit: if it's one per cluster (as in, per multiple cores), then yes, it's a shared resource by definition
read the goddamn docs. please.
but when 8 cores access a single AMX unit, that won't give more than 1 AMX unit's worth of perf; less, really, due to sharing overheads.
you know that GEMM rates for Apple SME are public. they're pretty dang good man.
 
  • Like
Reactions: Joe NYC