you do understand that modern cores are out-of-order and can hide the latency of weird async ops just fine?
Oh wow, if they are out-of-order, why are we getting such tiny IPC improvements? Obviously because out-of-order execution can't hide latency when an instruction depends on the previous result, and that happens a lot in CPU code.
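Rough sketch of the dependency point, as a toy model and not a simulation of any real core. The op count, latency, and issue width are made-up numbers, and both function names are hypothetical. It only shows that a chain of dependent ops is bound by latency no matter how wide the core is.

```python
import math

def cycles_dependent_chain(n_ops: int, latency: int) -> int:
    """Every op needs the previous result, so nothing can overlap."""
    return n_ops * latency

def cycles_independent(n_ops: int, latency: int, issue_width: int) -> int:
    """Ops are independent: issue `issue_width` per cycle, fully pipelined."""
    issue_cycles = math.ceil(n_ops / issue_width)
    # The last issued group still needs `latency` cycles to complete.
    return issue_cycles + latency - 1

if __name__ == "__main__":
    # Assumed numbers, for illustration only.
    n_ops, latency, issue_width = 1000, 4, 6
    cases = [
        ("dependent chain", cycles_dependent_chain(n_ops, latency)),
        ("independent ops", cycles_independent(n_ops, latency, issue_width)),
    ]
    for name, cycles in cases:
        print(f"{name}: {cycles} cycles, IPC = {n_ops / cycles:.2f}")
```

With these assumed numbers the dependent chain sits at an IPC of 0.25 while the independent stream gets close to the issue width. Real code is somewhere in between.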
sharing what resource? it's multithreaded and does xboxhueg GEMM in very few cycles. that's the point.
Sharing the AMX unit. If there's one per cluster (as in one per several cores), then yes, it's a shared resource by definition. Unless you start adding multiple shared AMX units, in which case the contention ratio will obviously be lower and perf will improve.
the whole point is that it does GEMM a gajillion times faster than a normal CPU core. it's a matmul coprocessor. just like FP in the 1980s.
Yes, awesome, but when 8 cores access a single AMX unit, that won't give more than 1 AMX unit's worth of perf, and really less due to sharing overheads.
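Back-of-the-envelope version of the sharing argument, as a sketch under assumptions. Throughput is in arbitrary units, the 10% sharing overhead is a number I picked for illustration, and `aggregate_throughput` is a hypothetical helper, not a model of how any actual AMX unit arbitrates requests.

```python
def aggregate_throughput(n_cores: int, demand_per_core: float,
                         n_units: int, unit_peak: float,
                         sharing_overhead: float = 0.0) -> float:
    """Total matmul throughput when n_cores share n_units coprocessors.

    demand_per_core: work each core could submit per unit time.
    unit_peak: peak throughput of one coprocessor.
    sharing_overhead: assumed fraction of capacity lost to arbitration,
        charged only when more cores than units are competing.
    """
    capacity = n_units * unit_peak
    if n_cores > n_units:
        capacity *= 1.0 - sharing_overhead
    # Throughput is capped by whichever is smaller: demand or capacity.
    return min(n_cores * demand_per_core, capacity)

if __name__ == "__main__":
    for cores, units in [(1, 1), (8, 1), (8, 2), (8, 8)]:
        total = aggregate_throughput(cores, 1.0, units, 1.0, sharing_overhead=0.1)
        print(f"{cores} cores, {units} unit(s): {total:.2f}")
```

Eight cores on one unit top out below what one core gets with the unit to itself, and adding a second shared unit roughly doubles the cap. That's the lower contention ratio I mentioned.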