stayfrosty
Member
- Apr 4, 2024
The larger the context, the longer the prefill phase takes. Prefill needs lots of low-precision compute but barely any bandwidth, so the big, expensive HBM sits mostly idle during prefill.

What exactly makes a GDDR7 card (aka the CPX solution) so much better for large-context inference? It appears to be a normal 202-class die that will be relatively cheap with relatively large memory, so using it for inference rather than the super expensive HBM versions makes sense.
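A quick back-of-envelope makes the prefill point concrete: prefill pushes every prompt token through the weights in one pass, so its arithmetic intensity (FLOPs per byte of weights fetched) scales with prompt length, while decode reads all the weights just to emit a single token. A minimal sketch with made-up model numbers (nothing below is a vendor spec):

```python
# Back-of-envelope roofline comparison of prefill vs. decode.
# All numbers are illustrative assumptions, not measured specs.

def phase_intensity(tokens, params_bytes, flops_per_token):
    """FLOPs executed per byte of weights read from memory."""
    total_flops = tokens * flops_per_token
    return total_flops / params_bytes

# Hypothetical 70B-parameter model in FP4 (0.5 bytes/param),
# ~2 FLOPs per parameter per token for the forward pass.
params = 70e9
params_bytes = params * 0.5
flops_per_token = 2 * params

# Prefill processes the whole prompt in one pass; decode reads
# all weights to produce one token.
prefill_intensity = phase_intensity(128_000, params_bytes, flops_per_token)
decode_intensity = phase_intensity(1, params_bytes, flops_per_token)

print(f"prefill: {prefill_intensity:,.0f} FLOPs/byte")  # ~512,000
print(f"decode:  {decode_intensity:,.0f} FLOPs/byte")   # ~4
```

At hundreds of thousands of FLOPs per byte, prefill is bound by compute rather than memory bandwidth, which is exactly why trading HBM for cheaper GDDR7 costs little in that phase.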
AMD's equivalent response will be the AT0 chip.
I'm guessing AT0 is not optimized for low-precision compute the way Rubin CPX is. Rubin CPX has 6x the FP4 compute of a 5090 on a similar die size, so this isn't just a GR202.
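Since prefill is compute-bound, that FP4 advantage should translate roughly linearly into prefill latency. A toy estimate with placeholder throughput figures (the 1x/6x ratio is the only thing taken from the post; the base throughput is an assumption):

```python
# Rough prefill-time comparison under different FP4 throughputs.
# Throughput figures are placeholders, not official specs.

def prefill_seconds(prompt_tokens, flops_per_token, fp4_flops_per_s):
    return prompt_tokens * flops_per_token / fp4_flops_per_s

flops_per_token = 2 * 70e9   # same hypothetical 70B model as above
prompt = 128_000

base = 3e15                  # stand-in "1x" FP4 throughput, FLOPs/s
for label, tput in [("1x FP4", base), ("6x FP4", 6 * base)]:
    print(f"{label}: {prefill_seconds(prompt, flops_per_token, tput):.1f} s")
```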