Speculation: SYCL will replace CUDA

Page 4 - AnandTech community forum

What does the future hold for heterogeneous programming models?


(Poll: 26 voters)

Vattila

Senior member
Oct 22, 2004
799
1,351
136
"Minor quibble, but the GeForce 256 wasn't their first dGPU. It was actually the STG2000. Most of us only remember the Riva 128."

I particularly remember the TNT series catching my attention. I think I bought one of those. Ironically, at the time I was drawn to Nvidia because of their support for open/platform standards (OpenGL/DirectX) versus the proprietary Glide API from 3dfx. Oh, how things have changed!

PS. The GeForce 256 was marketed by Nvidia as "the world's first Graphics Processing Unit". See also Wikipedia | GPU.
 

Vattila

Senior member
Oct 22, 2004
799
1,351
136
"Well okay. But it doesn't mean their marketing was correct."

True, although we have to admit they popularised the term "GPU" in the PC space. Before then we just called it a "3D graphics card", or simply a "3D card", didn't we? The buzzword of the 1990s was "3D", as I recall.

The more interesting development, though, was the move of more tasks and computational abilities to the graphics card — transform and lighting, etc. — which allowed Ian Buck to "abuse" this functionality to make the graphics card into an accelerator of more general-purpose computations (GPGPU). So the background story of CUDA, as the quoted article tells it (post #75), begins with Ian Buck's PhD work at Stanford University on the Brook programming language in 1999, although he wasn't hired by Nvidia until 2004, and CUDA wasn't introduced until 2006. In between 1999 and 2006 we got programmable shaders and better support for floating-point arithmetic in the hardware, which set the stage for even more general-purpose programmability in CUDA. I remember following the exciting developments in DirectX in this period, although I didn't end up doing much 3D graphics or GPGPU programming, sadly.

"A significant milestone for GPGPU was the year 2003 when two research groups independently discovered GPU-based approaches for the solution of general linear algebra problems on GPUs that ran faster than on CPUs."

Wikipedia
 
  • Like
Reactions: moinmoin

Vattila

Senior member
Oct 22, 2004
799
1,351
136
At the Hot Chips conference, running this week, Intel has shown off some SYCL performance results for their "Ponte Vecchio" server GPU, a key component of the upcoming Aurora supercomputer. Just as interesting, though, is the "A100-SYCL" performance compared with the "A100-CUDA" performance on Nvidia's A100 GPU. SYCL is very performance-portable here, which is all the more remarkable because the A100-SYCL code is the result of automated migration using Intel's DPC++ Compatibility Tool.



Intel Ponte Vecchio Seemingly Offers 2.5x Higher Performance Than Nvidia's A100 | Tom's Hardware (tomshardware.com)
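
For context, here is a minimal sketch (my illustration, not Intel's actual output) of what single-source SYCL code looks like, roughly the shape the DPC++ Compatibility Tool produces when migrating a simple CUDA kernel. Building it requires a SYCL compiler such as DPC++ or AdaptiveCpp:

```cpp
#include <sycl/sycl.hpp>
#include <cassert>
#include <vector>

int main() {
    const size_t n = 1024;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    sycl::queue q;  // selects a default device: an Intel, Nvidia or AMD GPU, or the CPU
    {
        sycl::buffer<float> A(a.data(), sycl::range<1>(n));
        sycl::buffer<float> B(b.data(), sycl::range<1>(n));
        sycl::buffer<float> C(c.data(), sycl::range<1>(n));
        q.submit([&](sycl::handler& h) {
            sycl::accessor pa(A, h, sycl::read_only);
            sycl::accessor pb(B, h, sycl::read_only);
            sycl::accessor pc(C, h, sycl::write_only);
            // The kernel body is what a CUDA __global__ function would contain.
            h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
                pc[i] = pa[i] + pb[i];
            });
        });
    }  // buffer destructors synchronise and copy results back to the host vectors

    assert(c[0] == 3.0f && c[n - 1] == 3.0f);
    return 0;
}
```

The same source can target different backends depending on which SYCL runtime is used, which is what makes an apples-to-apples A100 comparison like the one above possible in the first place.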
 
  • Like
Reactions: moinmoin

Vattila

Senior member
Oct 22, 2004
799
1,351
136
Still not a word on SYCL from AMD, but HIP — the CUDA dialect that serves as the programming model for their ROCm framework — is gaining traction in the HPC space:

"Devito is a domain-specific language (DSL) and code generation framework for the design of highly optimized finite-difference kernels for use in simulation, inversion, and optimization. Devito utilizes a combination of symbolic computation and compiler technologies to automatically generate highly optimized software for a wide range of computer architectures. Previously Devito only supported AMD GPUs using OpenMP offloading. Thanks to Devito Codes’ new collaboration with AMD, we quickly adapted our existing CUDA code to also support HIP for AMD GPUs. This resulted in a substantial uplift in performance, achieving competitive levels of performance with comparable architectures."

DevitoPRO getting HIP with AMD Instinct™ | Devito Codes

 

Vattila

Senior member
Oct 22, 2004
799
1,351
136
Samsung just announced a supercomputer based on their ground-breaking processing-in-memory (PIM) technology in combination with AMD Instinct GPUs — using SYCL as the programming model:

"The supercomputer, disclosed Tuesday at an industry event in South Korea, includes 96 AMD Instinct MI100 GPUs, each of which are loaded with a processing-in-memory (PIM) chip, a new kind of memory technology that reduces the amount of data that needs to move between the CPU and DRAM. Choi Chang-kyu, the head of the AI Research Center at Samsung Electronics Advanced Institute of Technology, reportedly said the cluster was able to train the Text-to-Text Transfer Transformer (T5) language model developed by Google 2.5 times faster while using 2.7 times less power compared to the same cluster configuration that didn't use the PIM chips. [...] Samsung hopes to spur adoption of its PIM chips in the industry by creating software that will allow organizations to use the tech in an integrated software environment. To do this, it's relying on SYCL, a royalty-free, cross-architecture programming abstraction layer that happens to underpin Intel's implementation of C++ for its oneAPI parallel programming model."

 

Vattila

Senior member
Oct 22, 2004
799
1,351
136
Codeplay is making great progress on their support for AMD and Nvidia in oneAPI, all based on the SYCL programming model standard. Despite being an AMD investor, in this one area I'm rooting for Raja Koduri, James Reinders and the other SYCL champions at Intel. In one important sense — standards-based heterogeneous system architectures and programming models — oneAPI is a continuation of the HSA effort started at AMD years ago. (Interestingly, Phil Rogers, former HSA Foundation president and one of AMD's few Corporate Fellows, a distinguished title bestowed only upon their brightest engineers, jumped ship to Nvidia back in 2015.)


Here is The Next Platform's article on the news:

"In a world where the number of chip platforms is rapidly expanding and accelerators from GPUs to FPGAs to DPUs are becoming the norm, being able to use the same tools when programming for the myriad chip architectures has a utopian vibe for many developers. It’s one of the reasons James Reinders returned to Intel just over two years ago after spending more than 27 years at the chip maker before leaving in 2016. It was a chance to help create a technology that could bring benefits to the IT industry, from enterprises out to HPC organizations. [...] OneAPI is seeing some momentum among early adopters – as of a year ago, more than 100 national laboratories, research organizations, educational institutions, and enterprises were using the platform, with Intel pulling in community input and contributions to the oneAPI spec through the open project. There also are now 30 oneAPI Centers of Excellence around the world."

“What will happen when these tools come out is you can download the tools from Intel, but then Codeplay will have a download that … plugs in and adds support for Nvidia GPUs and can plug in and support AMD GPUs,” Reinders says. “To the user, once those are all installed, you just run the compiler and it’ll take advantage of all of them and it can produce a binary – this is what really distinguishes it – that when you run it, if it turns out you have a system with, say, AVX-512 on your CPU, maybe an integrated graphics from Intel, a plug-in graphics from Nvidia, plug-in graphics from AMD, a plug-in from Intel, your program can come up and use all five of them in one run.”


oneAPI 2023: One Plug-In To Run Them All (nextplatform.com)

 
  • Like
Reactions: moinmoin

Chaotic42

Lifer
Jun 15, 2001
33,929
1,097
126
I appreciate you posting all of this info, Vattila. This is definitely something interesting and I hope it gains traction.
 
  • Like
Reactions: Vattila

Vattila

Senior member
Oct 22, 2004
799
1,351
136
Not directly related to SYCL, but the just-announced OpenXLA compiler project for machine learning (ML) further erodes CUDA's incumbency in the AI/ML space. With this open cross-platform solution for accelerated linear algebra (XLA), the high-level frameworks (the ones researchers and developers actually use) lower ML models to high-level operations (HLO) as defined by the StableHLO specification; these are then compiled by OpenXLA's target-independent optimising compilers, producing MLIR code, which is finally compiled into actual code for the specific target platform. All the important vendors seem to be aboard, including Nvidia:

"OpenXLA is an open source ML compiler ecosystem co-developed by AI/ML industry leaders including Alibaba, Amazon Web Services, AMD, Apple, Arm, Cerebras, Google, Graphcore, Hugging Face, Intel, Meta, and NVIDIA. It enables developers to compile and optimize models from all leading ML frameworks for efficient training and serving on a wide variety of hardware. Developers using OpenXLA will see significant improvements in training time, throughput, serving latency, and, ultimately, time-to-market and compute costs. [...] OpenXLA provides out-of-the-box support for a multitude of hardware devices including AMD and NVIDIA GPUs, x86 CPU and Arm architectures, as well as ML accelerators like Google TPUs, AWS Trainium and Inferentia, Graphcore IPUs, Cerebras Wafer-Scale Engine, and many more. OpenXLA additionally supports TensorFlow, PyTorch, and JAX via StableHLO, a portability layer that serves as OpenXLA's input format."

Notably, on low-level performance tuning, SYCL is mentioned as an option:

"OpenXLA gives users the flexibility to manually tune hotspots in their models. Extension mechanisms such as Custom-call enable users to write deep learning primitives with CUDA, HIP, SYCL, Triton and other kernel languages so they can take full advantage of hardware features."

OpenXLA is available now to accelerate and simplify machine learning | Google Open Source Blog (googleblog.com)

 
  • Like
Reactions: soresu and moinmoin

Vattila

Senior member
Oct 22, 2004
799
1,351
136
The European Union is investing in RISC-V and SYCL:

"The wide-spread adoption of AI has resulted in a market for novel hardware accelerators that can efficiently process AI workloads. Unfortunately, all popular AI accelerators today use proprietary hardware-software stacks, leading to a monopolization of the acceleration market by a few large industry players. Eight leading European organizations have joined in an effort to break this monopoly via Horizon Europe project SYCLOPS (Scaling extreme analYtics with Cross-architecture acceleration based on Open Standards). The vision of SYCLOPS is to democratize AI acceleration using open standards, and enabling a healthy, competitive, innovation-driven ecosystem for Europe and beyond."

 
  • Like
Reactions: moinmoin

Vattila

Senior member
Oct 22, 2004
799
1,351
136
OpenAI's Triton is emerging as an open-source implementation language for neural network AI frameworks, replacing (or at least marginalising) low-level heterogeneous programming models such as CUDA and SYCL. The Triton compiler performs optimisations that are hard to do by hand.

"We’re releasing Triton 1.0, an open-source Python-like programming language which enables researchers with no CUDA experience to write highly efficient GPU code—most of the time on par with what an expert would be able to produce."

Introducing Triton: Open-source GPU programming for neural networks

Triton is now used as a backend for the popular PyTorch 2.0 framework:

"TorchInductor is a deep learning compiler that generates fast code for multiple accelerators and backends. For NVIDIA and AMD GPUs, it uses OpenAI Triton as a key building block. [...] For a new compiler backend for PyTorch 2.0, we took inspiration from how our users were writing high performance custom kernels: increasingly using the Triton language. We also wanted a compiler backend that used similar abstractions to PyTorch eager, and was general purpose enough to support the wide breadth of features in PyTorch. TorchInductor uses a pythonic define-by-run loop level IR to automatically map PyTorch models into generated Triton code on GPUs and C++/OpenMP on CPUs."

PyTorch 2.0 | Get started

Support for AMD hardware in Triton is forthcoming, opening up the options for AI development and deployment:

ROCm 5.6.1 | Release Highlights
GitHub | Triton | [ROCM] Core Functionality for AMD

Here is some further reading on Triton:

Paper: Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations
 

Vattila

Senior member
Oct 22, 2004
799
1,351
136
It appears OpenSYCL (formerly known as hipSYCL) has yet again been renamed, due to legal pressure. The new name sounds straight out of AMD's new marketing-terms dictionary (strongly influenced by the former Xilinx group). Hopefully AMD will get on board, if it isn't already.



"AdaptiveCpp is the independent, community-driven modern platform for C++-based heterogeneous programming models targeting CPUs and GPUs from all major vendors. AdaptiveCpp lets applications adapt themselves to all the hardware found in the system. This includes use cases where a single binary needs to be able to target all supported hardware, or utilize hardware from different vendors simultaneously."

"It currently supports the following programming models: (1) SYCL and (2) C++ standard parallelism."

"AdaptiveCpp is currently the only solution that can offload C++ standard parallelism constructs to GPUs from Intel, NVIDIA and AMD — even from a single binary."
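
To make the "C++ standard parallelism" point concrete, here is a small sketch (mine, not from the AdaptiveCpp docs) using only standard C++17 parallel algorithms. Compiled with a stdpar-offload-capable compiler (AdaptiveCpp's stdpar mode, or e.g. Nvidia's nvc++ -stdpar), the same call can run on a GPU, with no vendor-specific API in sight; with a plain host compiler it runs on CPU threads:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <execution>
#include <vector>

int main() {
    const std::size_t n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    // A standard C++ parallel algorithm: the execution policy is the only
    // hint that this work may be offloaded to an accelerator.
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(), y.begin(), y.begin(),
                   [](float a, float b) { return a + b; });

    assert(y[0] == 3.0f && y[n - 1] == 3.0f);
    return 0;
}
```

This is what makes the single-binary claim above interesting: the source contains no device code at all, so the toolchain decides where it runs.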



GitHub - AdaptiveCpp
 

moinmoin

Diamond Member
Jun 1, 2017
4,946
7,656
136
It's odd that there would be legal issues about an open Khronos standard. Such efforts ideally should be shared with upstream anyway. Sounds like some member of Khronos wants segmentation there?
 

soresu

Platinum Member
Dec 19, 2014
2,660
1,860
136
"It's odd that there would be legal issues about an open Khronos standard. Such efforts ideally should be shared with upstream anyway. Sounds like some member of Khronos wants segmentation there?"
Seems more like the name change signifies diversification in programming models to include C++ standard parallelism as well as SYCL.

Though why they are already splitting their attention when they haven't even got significant software support as yet, I don't know.
 
  • Like
Reactions: moinmoin

Vattila

Senior member
Oct 22, 2004
799
1,351
136
It looks like AMD's software stack is coming together at the right time for AI take-off. AMD CEO Lisa Su said years ago that AMD's biggest investments were in software, and we are now seeing it bearing fruit.

PyTorch (the most popular machine learning framework) now comes with AMD ROCm support out of the box.

And DeepSpeed (Microsoft's open-source deep-learning optimisation library for PyTorch) is now "hipified", i.e. it uses AMD's HIP language to provide compatibility with both the ROCm and CUDA backends. To be accepted into the code base, any code change must now pass AMD compatibility checks as part of the test suite.

As a reminder, HIP is essentially a dialect of the CUDA programming language, and the ROCm framework is a drop-in replacement for the CUDA framework.
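
To illustrate how close the dialects are, here is a hedged side-by-side sketch (my example): a HIP vector-add whose CUDA twin differs only in the header and the runtime-API prefix. This mechanical rewrite is essentially what AMD's "hipify" tools automate. It needs hipcc to build (or nvcc for the CUDA variant):

```cpp
#include <hip/hip_runtime.h>  // the CUDA twin includes <cuda_runtime.h> instead

// The kernel source is identical in CUDA and HIP.
__global__ void vadd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    float *a, *b, *c;
    // Only the API prefix changes: hipMallocManaged <-> cudaMallocManaged, etc.
    hipMallocManaged((void**)&a, n * sizeof(float));
    hipMallocManaged((void**)&b, n * sizeof(float));
    hipMallocManaged((void**)&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    vadd<<<(n + 255) / 256, 256>>>(a, b, c, n);  // same launch syntax in both
    hipDeviceSynchronize();                      // cudaDeviceSynchronize in CUDA

    // c[i] now holds a[i] + b[i] for all i
    hipFree(a); hipFree(b); hipFree(c);
    return 0;
}
```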

 
  • Like
Reactions: moinmoin