Discussion RDNA 5 / UDNA (CDNA Next) speculation


marees

Golden Member
Apr 28, 2024
1,748
2,381
96
I'm sure they use PyTorch. Just not sure if it's used in production. OpenAI uses a lot of Python in their stack, and Triton is a Python-based language.

While it doesn't go into the details of their operational software stack, I found this OpenAI account interesting:


edit: from my understanding, PyTorch has multiple backends as first-class citizens nowadays, not just CUDA.
I am deeply skeptical of anything that isn't C++ working in high-performance scenarios such as training

But I am old school that way 😜
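The multi-backend point above can be pictured as a dispatch table: one Python-level op, several per-device kernels chosen at runtime. Here's a toy pure-Python sketch of that idea — all names are made up for illustration, this is not PyTorch's actual dispatcher API.

```python
# Toy sketch of multi-backend dispatch, loosely in the spirit of how a
# framework routes an op to CUDA, ROCm, or CPU code at runtime.
# All names here are hypothetical, not PyTorch's real dispatcher.

def matmul_cpu(a, b):
    # Naive pure-Python matmul standing in for an optimised CPU kernel.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

# In a real framework these entries would be compiled CUDA/HIP kernels;
# here they all point at the same Python function as placeholders.
BACKENDS = {
    "cpu": matmul_cpu,
    "cuda": matmul_cpu,
    "rocm": matmul_cpu,
}

def matmul(a, b, device="cpu"):
    # One Python entry point; the per-device kernel is looked up at call time.
    return BACKENDS[device](a, b)

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], device="rocm"))
# [[19, 22], [43, 50]]
```

The Python layer only picks the kernel; the heavy lifting stays in whatever compiled code sits behind the table — which is exactly why the "it's all Python" worry mostly doesn't bite.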
 
  • Love
  • Like
Reactions: Z O X and RnR_au

RnR_au

Platinum Member
Jun 6, 2021
2,675
6,121
136
I am deeply skeptical of anything that isn't C++ working in high-performance scenarios such as training

But I am old school that way 😜
Hehe - Python is just being used as a scripting language calling highly optimised 'AI primitives' coded in C/C++.

There is a thing called MegaKernel - you describe the computation graph for your LLM in Python code, and then it compiles a single GPU kernel that is highly optimised in terms of memory accesses. Very interesting stuff. Very fast and no C++ :p
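The core win is fusion: instead of launching one kernel per op, with a memory round-trip between each, the whole graph gets compiled into a single pass. Here's a toy pure-Python analogue of that idea — the names are invented for illustration and have nothing to do with the real MegaKernel API, which emits an actual fused GPU kernel rather than Python.

```python
# Toy analogue of kernel fusion: rather than running each op as a separate
# pass over the data (one "kernel launch" per op), compose the whole graph
# into a single function applied in one sweep. Hypothetical names only.

def compile_megakernel(graph):
    # graph: an ordered list of elementwise ops.
    def fused(x):
        for op in graph:
            x = op(x)  # value stays "in registers": no intermediate buffers
        return x
    return fused

# A tiny "LLM-ish" elementwise chain: scale, bias, ReLU.
graph = [
    lambda v: v * 0.5,
    lambda v: v + 1.0,
    lambda v: max(v, 0.0),
]

kernel = compile_megakernel(graph)
print([kernel(v) for v in [-4.0, 0.0, 2.0]])
# [0.0, 1.0, 2.0]
```

On a GPU the same trick means one kernel reads the inputs once and writes the final result once, which is where the memory-access wins come from.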


A smidge off-topic though... looking forward to the 128 GB RDNA 5 AI cards!! :D
 
  • Like
Reactions: marees

marees

Golden Member
Apr 28, 2024
1,748
2,381
96
There is a thing called MegaKernel - you describe the computation graph for your LLM in Python code, and then it compiles a single GPU kernel that is highly optimised in terms of memory accesses. Very interesting stuff. Very fast and no C++ :p

This stuff seems to be specific to a particular generation of GPU architecture.

It doesn't seem as generic as C++, CUDA, or PyTorch, but maybe it works for massive hardware deployments.

Usually it is the Meta/Facebook guys who come up with generic software that works on all hardware.