adroc_thurston · Diamond Member · Jul 2, 2023

> It's pretty good at FP64 too
yeah, le DGEMM is the only argument I have for GEMM blobs in client outside of .gb6 pumping, but who the hell runs DGEMM in client?

> we don't care about Intel because Intel sucks.
They are certainly a benchmark for AMX execution and will be compared against AMD.

> They are certainly a benchmark for AMX execution and will be compared against AMD.
no, the benchmark is Apple since they ship it everywhere.

> So the question stands: if Intel uses dedicated per-core AMX units, then why the **** won't perf be terrible if a single AMX unit gets shared by multiple cores, potentially a high number like 8 or more?
a) you know you can build a bigger matmul unit for 8 cores. Apple's one is p chungus and has a ton of juice. very nice for DGEMM.

> yeah, le DGEMM is the only argument I have for GEMM blobs in client outside of .gb6 pumping, but who the hell runs DGEMM in client?
no one. But OpenBLAS did add support for SGEMM for SME targets, which is good to see.

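For context, a minimal sketch of what such a call looks like from user code. Nothing here is SME-specific: the CBLAS API is unchanged, and a recent OpenBLAS build decides internally whether to dispatch to an SME kernel on supporting hardware.

```c
/* Minimal CBLAS SGEMM example: C = alpha*A*B + beta*C.
 * Link against OpenBLAS, e.g. `cc sgemm_demo.c -lopenblas`. */
#include <cblas.h>
#include <stdio.h>

int main(void) {
    enum { M = 4, N = 4, K = 4 };
    float A[M * K], B[K * N], C[M * N];

    /* Fill A and B with something simple; zero C. */
    for (int i = 0; i < M * K; i++) A[i] = (float)(i + 1);
    for (int i = 0; i < K * N; i++) B[i] = 1.0f;
    for (int i = 0; i < M * N; i++) C[i] = 0.0f;

    /* Row-major SGEMM: C = 1.0*A*B + 0.0*C. */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0f, A, K, B, N, 0.0f, C, N);

    printf("C[0][0] = %f\n", C[0]);  /* = 10.0, since B is all ones */
    return 0;
}
```
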
> you know you can build a bigger matmul unit for 8 cores
So that's the equivalent of multiple AMX units then. It reduces shared contention, but it's still a shared resource that runs slower when all cores need it.

> no, the benchmark is Apple since they ship it everywhere.
Not in the x86 market.

> no one. But OpenBLAS did add support for SGEMM for SME targets, which is good to see.
yup.

> So that's the equivalent of multiple AMX units then
NO.

> reduces shared contention
NO.

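For what it's worth, here is the arithmetic both sides are gesturing at, as a toy model. All throughput numbers are invented for illustration, not real AMX/SME figures: one big shared unit with 8x the MAC rate of a small per-core unit, ideally time-sliced across whichever cores are active.

```c
/* Toy model of per-core matmul throughput: shared unit vs. per-core units.
 * All rates are hypothetical MACs/cycle, not measurements of real hardware. */
#include <stdio.h>

int main(void) {
    const double shared_rate  = 512.0; /* one big shared unit (assumed) */
    const double percore_rate = 64.0;  /* one small per-core unit (assumed) */
    const int cores = 8;

    printf("active  shared/core  per-core\n");
    for (int k = 1; k <= cores; k++) {
        /* Ideal time-slicing: the shared unit's rate divides across users. */
        double shared_per_core = shared_rate / k;
        printf("%6d  %11.1f  %8.1f\n", k, shared_per_core, percore_rate);
    }
    /* With these numbers the shared unit wins per-core until all 8 cores
     * contend (512/8 = 64), where the two designs tie at equal total rate. */
    return 0;
}
```
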
> The unit is heavily multithreaded and has its own ld/st pipeline in the SME case.
Will classic Zen cores have it per cluster (12 cores?) too?

> Will classic Zen cores have it per cluster (12 cores?) too?
yes? Again, feature support is fully flat across the uarch family.

> Then the question is if ACE makes it to Diamond Rapids or Coral Rapids.
Coral. DMR is AMX.

> On the AMD side, it will be in Zen 7, which should launch between Diamond and Coral.
Coral is trying to intercept Florence.

> yes? Again, feature support is fully flat across the uarch family.
Ok, shared unit too, so what will run faster at the same frequency:
1) 12 Intel cores with AMX as is now
or
2) 12 AMD cores with per-cluster AMX

> Ok, shared unit too, so what will run faster at the same frequency -
The former.

> It should also be a competition, by proxy, between TSMC A14 and Intel 14A, in time to HVM.
man that's an optimistic view of 14A lmao

> Then, if TSMC A14 beats 14A to HVM, it would imply that Florence beats Coral to market, assuming Florence is one of the first products on A14.
well Venice beats DMR to market and clowns all over it so...

> will Intel implement AMX per core in client?
no lmao, their core area is bad enough as is.

> The former.
Ok, good, dedicated per-core units will run faster, all other things equal.

> But why would you want that? If you wanna do Actual Real Matmul at AMD, GPU is your friend.
Indeed, why implement this in the CPU in the first place? Intel had to do it because of a lack of GPU; having an integrated GPU seems a far better way, maybe with higher latency than the CPU.

> dedicated per-core units will run faster, all other things equal.
It's not 'faster', you just dedicate more area to matmul.

> Now how much faster would Intel be with the 12-core scenario in this case vs AMD - like 2 times faster, or perhaps 12?
idk.

> How much faster is a single gfx1311 CU over an AMX core in GEMM?
No idea. We are talking about CPUs here; you just don't want to answer the obvious question that a shared single unit for 12 cores under heavy usage will be slower, like more than 12 times slower than in the scenario of 1 dedicated unit per core: i.e. dogsh*t slow like I said, which is obvious to anybody who actually profiles stuff and runs into contention like this.

> It's not 'faster', you just dedicate more area to matmul.
Oh wow, a 600 mm² GPU is not faster than a 200 mm² one, it's just got more area dedicated.

> No idea. We are talking about CPUs here; you just don't want to answer the obvious question that a shared single unit for 12 cores under heavy usage will be slower, like more than 12 times slower than in the scenario of 1 dedicated unit per core: i.e. dogsh*t slow like I said, which is obvious to anybody who actually profiles stuff and runs into contention like this.
shared units give you nice matmul rates at relatively minimal area investment.

> The solution is obviously running this stuff on GPUs and dropping the AMX crap.
No, cuz shared units are cheap and nice nuff.

> Oh wow, a 600 mm² GPU is not faster than a 200 mm² one, it's just got more area dedicated.
not my point.

> shared units give you nice matmul rates at relatively minimal area investment.
Yes sure, nice of you to acknowledge it is a shared unit, thus leading to contention = reduction in speed.

> Indeed, why implement this in the CPU in the first place? Intel had to do it because of a lack of GPU; having an integrated GPU seems a far better way, maybe with higher latency than the CPU.
Because the CPU is standardised, unlike GPU or NPU. It's easy for devs too.

> thus leading to contention = reduction in speed.
It's not a 'reduction in speed', you're just doing fewer matmul ops a clock.

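To put numbers on the "more than 12 times slower" claim: under ideal round-robin arbitration (a simplifying assumption; a real arbiter adds some overhead on top), full contention costs each core exactly a factor of N in issue slots while the unit's aggregate rate is unchanged. A toy simulation with made-up parameters, not a model of any real AMX/SME arbiter:

```c
/* Toy round-robin arbitration sim: 12 cores contend for one shared
 * matmul unit that accepts one tile op per cycle (hypothetical). */
#include <stdio.h>

int main(void) {
    const int cores = 12, cycles = 120000;
    long issued[12] = {0};

    /* Ideal round-robin: every cycle, exactly one waiting core issues. */
    for (int c = 0; c < cycles; c++)
        issued[c % cores]++;

    long total = 0;
    for (int i = 0; i < cores; i++) total += issued[i];

    printf("per-core ops/cycle:  %.4f (vs 1.0 uncontended)\n",
           (double)issued[0] / cycles);
    printf("aggregate ops/cycle: %.4f\n", (double)total / cycles);
    /* Each core sees exactly a 12x per-core slowdown at full contention;
     * the aggregate rate is unchanged. Anything worse than 12x would have
     * to come from arbitration/queueing overhead on top of this ideal. */
    return 0;
}
```
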
> Because the CPU is standardised, unlike GPU or NPU. It's easy for devs too.
bingo.
