I mean, this is kinda known already.
But the article in the first post is not saying that nVidia GPU bad
Not the GPUs themselves, but Nvidia didn't step in to help them do things the secure way, and they ended up wasting a lot of time and effort.

We could have shipped GPUs very quickly by doing what Nvidia recommended: standing up a standard K8s cluster to schedule GPU jobs on. Had we taken that path, and let our GPU users share a single Linux kernel, we’d have been on Nvidia’s driver happy-path.
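For anyone unfamiliar with that happy-path: it's a stock Kubernetes cluster running NVIDIA's device plugin, with pods requesting GPUs as an extended resource. A minimal sketch of what such a job looks like (the pod name and container image here are just illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job  # example name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # example image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1  # resource exposed by NVIDIA's k8s device plugin
```

Note the trade-off being described: every pod scheduled this way shares the node's kernel and driver, which is exactly the "share a single Linux kernel" model Fly didn't want.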
Alternatively, we could have used a conventional hypervisor. Nvidia suggested VMware (heh). But they could have gotten things working had we used QEMU. We like QEMU fine, and could have talked ourselves into a security story for it, but the whole point of Fly Machines is that they take milliseconds to start. We could not have offered our desired Developer Experience on the Nvidia happy-path.
Instead, we burned months trying (and ultimately failing) to get Nvidia’s host drivers working to map virtualized GPUs into Intel Cloud Hypervisor. At one point, we hex-edited the closed-source drivers to trick them into thinking our hypervisor was QEMU.
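To be clear about what "hex-edited the closed-source drivers" means in general terms: patching a byte string inside a binary with a replacement of the same length, so nothing else in the file shifts. This is a purely hypothetical sketch of that technique; the blob contents and strings below are made up, not the actual bytes Fly patched:

```python
def patch_blob(blob: bytes, old: bytes, new: bytes) -> bytes:
    """Replace one byte pattern with another of identical length."""
    if len(old) != len(new):
        # A different length would shift every subsequent offset in the file.
        raise ValueError("replacement must be the same length as the original")
    if old not in blob:
        raise ValueError("pattern not found in blob")
    return blob.replace(old, new)

# Toy stand-in for a driver binary that embeds a hypervisor identifier.
blob = b"\x7fELF...hypervisor=CLH\x00..."
patched = patch_blob(blob, b"CLH\x00", b"QEMU")  # same length: 4 bytes
```

The same-length constraint is the key detail: it lets you edit a compiled binary in place without re-linking or fixing up offsets.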
I’m not sure any of this really mattered in the end. There’s a segment of the market we weren’t ever really able to explore because Nvidia’s driver support kept us from thin-slicing GPUs. We’d have been able to put together a really cheap offering for developers if we hadn’t run up against that, and developers love “cheap”, but I can’t prove that those customers are real.
What you say is mostly right. But the article in the first post is not saying "nVidia GPU bad", but rather that people want ready-to-use solutions instead of rolling their own. Fly.io hoped to make a business case of renting GPU instances that customers could configure to run LLMs, but they now believe that's too niche a market compared to people who just prefer to use APIs to existing services. And most of those existing services are running nVidia HW. Or did I misunderstand?
I read the same in the article: they hoped developers would seek optimized environments (better cost/performance), but developers are seeking easy and rapid deployment instead; they're more than willing to trade cost for convenience. In a way this makes sense: we're still in the early days of AI tech, so real costs are hidden for the sake of market adoption while the offerings change fast. A developer may not even know what model/provider they want to use 6-9 months from now.