I particularly remember the TNT series catching my attention. I think I bought one of those. Ironically, at the time I was drawn to Nvidia because of their support for industry-standard APIs (OpenGL/DirectX) versus the proprietary Glide API from 3dfx. Oh, how things have changed!
True, although we'll have to admit they popularised the term "GPU" in the PC space. Before that we just called it a "3D graphics card", or just a "3D card", didn't we? The buzzword of the 1990s was "3D", as I recall.
The more interesting development, though, was the migration of more tasks and computational capabilities onto the graphics card (transform and lighting, etc.), which allowed Ian Buck to "abuse" this functionality and turn the graphics card into an accelerator for more general-purpose computation (GPGPU). So the background story of CUDA, as the quoted article tells it (post #75), begins with Ian Buck's PhD work at Stanford University on the Brook programming language in 1999, although he wasn't hired by Nvidia until 2004, and CUDA wasn't introduced until 2006. Between 1999 and 2006 we got programmable shaders and better hardware support for floating-point arithmetic, which set the stage for even more general-purpose programmability in CUDA. I remember following the exciting developments in DirectX in this period, although I didn't end up doing much 3D graphics or GPGPU programming, sadly.
"A significant milestone for GPGPU was the year 2003 when two research groups independently discovered GPU-based approaches for the solution of general linear algebra problems on GPUs that ran faster than on CPUs."
At the Hot Chips conference, running this week, Intel has shown off some SYCL performance results for their "Ponte Vecchio" server GPU, a key component of the upcoming Aurora supercomputer. Just as interesting, though, is the "A100-SYCL" performance compared with the "A100-CUDA" performance on Nvidia's A100 GPU. SYCL is very performance-portable here, which is remarkable given that the A100-SYCL code is the result of automated migration using Intel's DPC++ Compatibility Tool.
Still not a word on SYCL from AMD, but HIP, the CUDA dialect serving as the programming model for their ROCm framework, is gaining traction in the HPC space:
"Devito is a domain-specific language (DSL) and code generation framework for the design of highly optimized finite-difference kernels for use in simulation, inversion, and optimization. Devito utilizes a combination of symbolic computation and compiler technologies to automatically generate highly optimized software for a wide range of computer architectures. Previously Devito only supported AMD GPUs using OpenMP offloading. Thanks to Devito Codes’ new collaboration with AMD, we quickly adapted our existing CUDA [code] to also support HIP for AMD GPUs. This resulted in a substantial uplift in performance, achieving competitive levels of performance with comparable architectures."
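To make "finite-difference kernel" concrete: what Devito generates are, at heart, stencil loops over a grid. Here is a hand-written 1D heat-equation stencil in plain Python as a minimal sketch of the idea; it is not Devito output or the Devito API (real generated kernels are heavily optimised C/CUDA/HIP), and all names and parameters here are illustrative.

```python
# Minimal explicit finite-difference step for the 1D heat equation
# u_t = alpha * u_xx, with grid spacing and time step normalised to 1.
# This is the kind of stencil loop a DSL like Devito generates and
# offloads to GPUs; written by hand here purely for illustration.

def heat_step(u, alpha=0.1):
    """One explicit time step: each interior point relaxes toward its neighbours."""
    new = list(u)  # boundary values are kept fixed
    for i in range(1, len(u) - 1):
        new[i] = u[i] + alpha * (u[i - 1] - 2 * u[i] + u[i + 1])
    return new

# A spike of heat in the middle of the grid diffuses outward:
u = [0.0, 0.0, 1.0, 0.0, 0.0]
u = heat_step(u)
print(u)  # the centre value drops, its neighbours warm up
```

Each grid point depends only on its immediate neighbours from the previous step, which is exactly why such kernels parallelise so well on GPUs.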
Samsung just announced a supercomputer based on their ground-breaking processing-in-memory (PIM) technology in combination with AMD Instinct GPUs — using SYCL as the programming model:
"The supercomputer, disclosed Tuesday at an industry event in South Korea, includes 96 AMD Instinct MI100 GPUs, each of which is loaded with a processing-in-memory (PIM) chip, a new kind of memory technology that reduces the amount of data that needs to move between the CPU and DRAM. Choi Chang-kyu, the head of the AI Research Center at Samsung Electronics Advanced Institute of Technology, reportedly said the cluster was able to train the Text-to-Text Transfer Transformer (T5) language model developed by Google 2.5 times faster while using 2.7 times less power compared to the same cluster configuration that didn't use the PIM chips. [...] Samsung hopes to spur adoption of its PIM chips in the industry by creating software that will allow organizations to use the tech in an integrated software environment. To do this, it's relying on SYCL, a royalty-free, cross-architecture programming abstraction layer that happens to underpin Intel's implementation of C++ for its oneAPI parallel programming model."
Codeplay is making great progress on their support for AMD and Nvidia hardware in oneAPI, all based on the SYCL programming model standard. Although I'm an AMD investor, this is one area where I'm rooting for Raja Koduri, James Reinders and the other SYCL champions at Intel. In one important sense (standards-based heterogeneous system architectures and programming models), oneAPI is a continuation of the HSA effort started at AMD years ago. (Interestingly, Phil Rogers, former HSA Foundation president and one of AMD's few Corporate Fellows, a distinguished title bestowed only upon their brightest engineers, jumped ship to Nvidia back in 2015.)
Develop using the open, standards-based SYCL™ programming model for multiple accelerators with oneAPI. [Dec. 16, 2022 - Edinburgh, UK] – Today, Codeplay...
Here is The Next Platform's article on the news:
"In a world where the number of chip platforms is rapidly expanding and accelerators from GPUs to FPGAs to DPUs are becoming the norm, being able to use the same tools when programming for the myriad chip architectures has a utopian vibe for many developers. It’s one of the reasons James Reinders returned to Intel just over two years ago after spending more than 27 years at the chip maker before leaving in 2016. It was a chance to help create a technology that could bring benefits to the IT industry, from enterprises out to HPC organizations. [...] OneAPI is seeing some momentum among early adopters – as of a year ago, more than 100 national laboratories, research organizations, educational institutions, and enterprises were using the platform, with Intel pulling in community input and contributions to the oneAPI spec through the open project. There also are now 30 oneAPI Centers of Excellence around the world."
“What will happen when these tools come out is you can download the tools from Intel, but then Codeplay will have a download that … plugs in and adds support for Nvidia GPUs and can plug in and support AMD GPUs,” Reinders says. “To the user, once those are all installed, you just run the compiler and it’ll take advantage of all of them and it can produce a binary – this is what really distinguishes it – that when you run it, if it turns out you have a system with, say, AVX-512 on your CPU, maybe an integrated graphics from Intel, a plug-in graphics from Nvidia, plug-in graphics from AMD, a plug-in from Intel, your program can come up and use all five of them in one run.”
"The rise of heterogeneous computing (typically CPU/GPU pairings) is the big challenge that many hope SYCL can help address, and the first round of U.S. exascale supercomputers is something of a poster child for that challenge."
Not directly related to SYCL, but the just-announced OpenXLA compiler project for machine learning (ML) further erodes CUDA's incumbency in the AI/ML space. With this open, cross-platform solution for accelerated linear algebra (XLA), the high-level frameworks (the ones researchers and developers actually use) reduce ML models to high-level operations (HLO) per the StableHLO specification; these are then compiled by OpenXLA's target-independent optimising compilers into MLIR code, which is finally compiled into actual code for the specific target platform. All the important vendors seem to be on board, including Nvidia:
"OpenXLA is an open source ML compiler ecosystem co-developed by AI/ML industry leaders including Alibaba, Amazon Web Services, AMD, Apple, Arm, Cerebras, Google, Graphcore, Hugging Face, Intel, Meta, and NVIDIA. It enables developers to compile and optimize models from all leading ML frameworks for efficient training and serving on a wide variety of hardware. Developers using OpenXLA will see significant improvements in training time, throughput, serving latency, and, ultimately, time-to-market and compute costs. [...] OpenXLA provides out-of-the-box support for a multitude of hardware devices including AMD and NVIDIA GPUs, x86 CPU and Arm architectures, as well as ML accelerators like Google TPUs, AWS Trainium and Inferentia, Graphcore IPUs, Cerebras Wafer-Scale Engine, and many more. OpenXLA additionally supports TensorFlow, PyTorch, and JAX via StableHLO, a portability layer that serves as OpenXLA's input format."
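The key architectural idea is the split into a portable lowering stage and a target-specific code generation stage. The following toy Python sketch is an analogy only, not real OpenXLA/StableHLO code; all op names and "instructions" in it are invented for illustration.

```python
# Toy analogy (NOT real OpenXLA code): framework-level ops are first
# lowered to a small set of portable "HLO-like" primitives, and only
# the final stage knows about the concrete target. This mirrors the
# framework -> StableHLO -> target-specific codegen split.

PORTABLE_LOWERING = {
    # framework-level op -> portable primitive ops (invented names)
    "dense_layer": ["dot", "add", "max0"],  # matmul + bias + ReLU
    "residual":    ["add"],
}

TARGET_CODEGEN = {
    # portable primitive -> per-target "instruction" (invented names)
    "gpu": {"dot": "gemm_tensor_cores", "add": "vec_add", "max0": "vec_relu"},
    "cpu": {"dot": "gemm_avx512",       "add": "vec_add", "max0": "vec_relu"},
}

def compile_model(ops, target):
    """Lower framework ops to portable primitives, then emit target code."""
    portable = [p for op in ops for p in PORTABLE_LOWERING[op]]   # stage 1
    return [TARGET_CODEGEN[target][p] for p in portable]          # stage 2

print(compile_model(["dense_layer", "residual"], "gpu"))
print(compile_model(["dense_layer", "residual"], "cpu"))
```

The point of the split is that a new framework only has to target the portable layer, and a new accelerator only has to implement the final stage; neither needs to know about the other.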
Notably, on low-level performance tuning, SYCL is mentioned as an option:
"OpenXLA gives users the flexibility to manually tune hotspots in their models. Extension mechanisms such as Custom-call enable users to write deep learning primitives with CUDA, HIP, SYCL, Triton and other kernel languages so they can take full advantage of hardware features."
The European Union is investing in RISC-V and SYCL:
"The wide-spread adoption of AI has resulted in a market for novel hardware accelerators that can efficiently process AI workloads. Unfortunately, all popular AI accelerators today use proprietary hardware/software stacks, leading to a monopolization of the acceleration market by a few large industry players. Eight leading European organizations have joined in an effort to break this monopoly via Horizon Europe project SYCLOPS (Scaling extreme analYtics with Cross-architecture acceleration based on Open Standards). The vision of SYCLOPS is to democratize AI acceleration using open standards, and enabling a healthy, competitive, innovation-driven ecosystem for Europe and beyond."
OpenAI's Triton is emerging as an open-source implementation language for neural network AI frameworks, replacing (or at least marginalising) low-level heterogeneous programming models such as CUDA and SYCL. The Triton compiler performs optimisations that are hard to do by hand.
"We’re releasing Triton 1.0, an open-source Python-like programming language which enables researchers with no CUDA experience to write highly efficient GPU code—most of the time on par with what an expert would be able to produce."
Triton is now used as a backend for the popular PyTorch 2.0 framework:
"TorchInductor is a deep learning compiler that generates fast code for multiple accelerators and backends. For NVIDIA and AMD GPUs, it uses OpenAI Triton as a key building block. [...] For a new compiler backend for PyTorch 2.0, we took inspiration from how our users were writing high performance custom kernels: increasingly using the Triton language. We also wanted a compiler backend that used similar abstractions to PyTorch eager, and was general purpose enough to support the wide breadth of features in PyTorch. TorchInductor uses a pythonic define-by-run loop level IR to automatically map PyTorch models into generated Triton code on GPUs and C++/OpenMP on CPUs."
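"Define-by-run" means the IR is not built by parsing source code: the model function is simply *executed* with tracing objects, and every operation it performs is recorded as it happens. The toy tracer below shows the idea in plain Python; it is not the TorchInductor implementation, and the class and IR format are invented for illustration.

```python
# Toy define-by-run tracer (NOT TorchInductor itself): running ordinary
# Python code with Tracer operands records each operation into a flat IR.

class Tracer:
    def __init__(self, name, ir):
        self.name, self.ir = name, ir

    def _emit(self, op, other):
        # Record one IR instruction and return a tracer for its result.
        result = Tracer(f"t{len(self.ir)}", self.ir)
        rhs = other.name if isinstance(other, Tracer) else repr(other)
        self.ir.append((result.name, op, self.name, rhs))
        return result

    def __add__(self, other): return self._emit("add", other)
    def __mul__(self, other): return self._emit("mul", other)

def model(x, w, b):
    return x * w + b  # ordinary Python; the tracers record what it does

ir = []
model(Tracer("x", ir), Tracer("w", ir), Tracer("b", ir))
for instr in ir:
    print(instr)  # each tuple is one IR instruction: (dest, op, lhs, rhs)
```

Because the IR is produced by actually running the model, Python control flow, loops and helper functions all "just work": whatever path executes is what gets recorded and compiled.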
Over the last decade, the landscape of machine learning software development has undergone significant changes. Many frameworks have come and gone, but most have relied heavily on Nvidia's CUDA and performed best on Nvidia GPUs. However, with the arrival of PyTorch 2.0 and OpenAI's...