Speculation: SYCL will replace CUDA

soresu · Jun 21, 2021

moinmoin said:
We are also still lacking an open source Windows.

Well, an up to date and bug free OSS Windows anyway.

After a couple of decades ReactOS is sadly still far from viable.

I also do wonder if MS's eventual strategy is just to pivot heavily towards Linux as Proton further matures and just make Windows a proprietary Wine implementation on steroids.

Highly unlikely to be sure, but an interesting possibility.

ThatBuzzkiller · Jun 22, 2021

I think people are getting confused here ...

SYCL is purely just a source language that can be abstracted over other APIs like CUDA, HIP, or OpenCL ...

For APIs such as CUDA/HIP/OpenCL, there's almost always a built-in programming interface for the drivers to potentially trigger device acceleration, and they usually feature their own native source languages, like the CUDA/HIP kernel language for CUDA/HIP or OpenCL C for OpenCL. Then we have APIs like OpenMP, which doesn't have a source language of its own at all: you can either use C++ (CPUs) as your source language or do interop with CUDA/HIP (GPUs) to use their source languages, but there is minimal code reuse in that case ...

APIs also feature an ingestion format which can be either in the form of a source language like OpenCL C or a bytecode like PTX and SPIR-V ...

As a hard lesson learned, it is never sane to have drivers compile from the source language, which is what ultimately led to the downfall of both OpenGL and OpenCL. GLSL shaders were never portable between different vendors in practice because of the idiosyncrasies of their unique GLSL compilers, so authors had to rewrite their GLSL shaders until BOTH AMD AND Nvidia's GLSL compilers could generate the same results. Blender deprecated Cycles' OpenCL backend for these same reasons. Cycles' initial megakernel OpenCL implementation only ever worked on Nvidia's OpenCL C compiler. It wasn't until AMD submitted a split/micro OpenCL kernel implementation that the Cycles renderer would start working on AMD's OpenCL compiler. When the new OpenCL kernel implementation was being brought up, the Blender team had another upcoming problem: there was a lot of duplicated code that would inevitably cause issues down the line, so they decided to remove the OpenCL megakernel altogether. Was this ever the end of the story? Of course not!

When Intel wanted to get serious about GPUs, they too desired to get Blender's Cycles renderer working on their devices, but much to the shock of no one the split/micro OpenCL kernel implementation didn't work at all on Intel's OpenCL compiler! The Blender team was coming to another crossroads, between having more code duplication or maintaining a more complex codebase with tons of workarounds for each vendor's OpenCL compiler. The Blender team ultimately decided that a future with duplicated backends instead of a unified backend was more maintainable and less complex for their project ...

In recent years with the introduction of SPIR-V, the reuse of GLSL shaders improved massively with Vulkan compared to early days with OpenGL because programmers didn't have to pray anymore for the driver to correctly compile their GLSL shaders and could instead feed the much simpler SPIR-V format which was far harder for other vendors to mess up. To Microsoft's credit, this was originally Direct3D's model. When Microsoft created HLSL they knew that vendors could not be trusted to make their own HLSL compilers so instead they standardized DXBC for driver ingestion and Khronos Group eventually came to follow a similar model with SPIR-V. Unfortunately, all attempts to replicate the same success for compute kernels have failed thus far ...

One of the goals behind HSA Foundation was standardizing an intermediate language/representation to fix OpenCL's portability issues. AMD released HSA drivers that supported HSAIL for ingestion but other vendors like ARM or Qualcomm didn't care about HSAIL at all so OpenCL remained in a broken state. Later on during the past decade, AMD attempted to implement SPIR for their OpenCL drivers, however they looked around and nobody was following them at the time so they dropped the work on their new SPIR compiler and shortly thereafter decided to start the ROCm project. SPIR 1.2 was only ever supported on AMD's Fury series and Vega never supported SPIR ...

Even if AMD did support SYCL, there would be no portability value in reusing its source code, as we saw with GLSL and OpenCL C. Meanwhile there's a lot of portability value in reusing binaries with a bytecode format like SPIR-V. C++ source code is NOT reusable across different architectures but Java bytecode binaries are reusable across different architectures. If we really want to follow the philosophy behind reusability, then we won't ever truly find an answer by focusing solely on a source language such as SYCL; the answer lies in converging on a common bytecode format ...

We wouldn't even have to think about replacing the CUDA source language if we could compile it into SPIR-V bytecode! If we want no changes to the CUDA source code for existing programs then we could emulate PTX on SPIR-V as well! The question shouldn't be whether SYCL will replace CUDA. The question should be whether SPIR-V will ever replace PTX. That's the important question we should be thinking about ...
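
The source-versus-bytecode argument above can be illustrated with a toy analogy. The sketch below is not SPIR-V, PTX, or any real driver interface; the opcodes, function names, and the two "vendor" back ends are all hypothetical, and Python is used only for brevity. It shows one front end emitting a small bytecode once, and two independent back ends consuming that same bytecode without ever seeing the source language.

```python
# Toy analogy only: a tiny stack bytecode standing in for an ingestion
# format such as SPIR-V or PTX. Nothing here is a real GPU or driver API.

def compile_axpy():
    """'Front end': emit bytecode for a*x + y once, ahead of time."""
    return [("load", "a"), ("load", "x"), ("mul",), ("load", "y"), ("add",)]


def run_stack_backend(bytecode, env):
    """Hypothetical 'vendor A driver': interprets the bytecode on a stack."""
    stack = []
    for op, *args in bytecode:
        if op == "load":
            stack.append(env[args[0]])
        elif op == "mul":
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(lhs * rhs)
        elif op == "add":
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(lhs + rhs)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack.pop()


def run_codegen_backend(bytecode, env):
    """Hypothetical 'vendor B driver': lowers the same bytecode to an
    expression string first, then evaluates that expression."""
    stack = []
    for op, *args in bytecode:
        if op == "load":
            stack.append(args[0])
        elif op in ("mul", "add"):
            rhs, lhs = stack.pop(), stack.pop()
            symbol = "*" if op == "mul" else "+"
            stack.append(f"({lhs} {symbol} {rhs})")
        else:
            raise ValueError(f"unknown opcode: {op}")
    # Evaluate with no builtins; variable names resolve from env only.
    return eval(stack.pop(), {"__builtins__": {}}, dict(env))
```

Both back ends only need to agree on five simple opcodes, which is a much smaller surface to get wrong than a full source-language compiler. That is the gist of the argument for a common ingestion format, under the assumptions of this toy model.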

DrMrLordX · Jun 22, 2021

Cogman said:
AMD's "Instinct" has been a flop

Seems like they made money off MI60 at least.

moinmoin · Jun 22, 2021

soresu said:
Well, an up to date and bug free OSS Windows anyway.

After a couple of decades ReactOS is sadly still far from viable.

That's like mentioning Nouveau when talking about Nvidia and open source.

soresu said:
I also do wonder if MS's eventual strategy is just to pivot heavily towards Linux as Proton further matures and just make Windows a proprietary Wine implementation on steroids.

Highly unlikely to be sure, but an interesting possibility.

Actually Microsoft's main business arguably has been Azure for quite some time now, and there it pivoted very hard toward Linux pretty early on already, along with which Microsoft started introducing its tech (like support for Hyper-V) into Linux, as well as WSL into Windows.

soresu · Jun 22, 2021

Cogman said:
What needs to happen first is others need to break into the GPGPU market in server farms. AMD's "Instinct" has been a flop and Intel's "Phi" didn't really go anywhere.

It's waaaayyy too early to declare that when CDNA2 is not even out yet and is already in a planned (pre-ordered?) supercomputer.

Vattila · Jun 22, 2021

ThatBuzzkiller said:
I think people are getting confused here ... [about the backend complexities]

That was a great post! Thanks for sharing your insight and thoughts on the subject. Heterogeneous programming models and implementations are complex and difficult to get right, especially when there isn't a common interest across the industry. As you point out, there have been many initiatives, failures and some successes. We're not quite there yet. But SYCL looks promising to me.

That said, although I as a programmer enjoyed your post, I don't think it cleared up this complex subject for non-technical readers. I have had comments on other forums that the interactions and layering of technologies are utterly confusing.

For those confused, the infographic I included in my initial post should give a broad overview of the layering of the technologies, with SYCL being the frontend high-level language (based on ISO C++) in which you write your application, sitting on top of SYCL implementations (oneAPI/DPC++, hipSYCL, ComputeCpp, etc.), which provide toolchains (compilers, linkers, debuggers, etc.) and support libraries (math, AI/ML, vision, algorithms, etc.). These implementations in turn sit on top of backend technologies (OpenCL, OpenMP, CUDA, ROCm, etc.) in which the low-level detail resides.

To clear up confusion, it is also good to distinguish between programming language and framework (which includes toolchains, support libraries, etc.).

Intel: Language is SYCL, framework is oneAPI.
AMD: Language is HIP (very similar to CUDA), framework is ROCm.
Nvidia: Language is CUDA, framework is CUDA (and just to conflate even more, GPU cores are CUDA).

Hope that is helpful to the confused.

PS. Note that Intel and AMD's frameworks are open and non-proprietary with open-source implementations readily available. Nvidia's framework is closed and proprietary.

DrMrLordX · Jun 22, 2021

Vattila said:
Nvidia: Language is CUDA, framework is CUDA (and just to conflate even more, GPU cores are CUDA).

To me this is where nVidia wins. It's all CUDA, from top to bottom. It's always been CUDA. My head still spins when I think of all the frameworks and toolchains I would need to tackle GPGPU using Radeon Instinct. And I don't even like NV as a company. But it "just works". And they still have the fastest accelerators on the market . . . for now.

moinmoin · Jun 22, 2021

DrMrLordX said:
To me this is where nVidia wins. It's all CUDA, from top to bottom. It's always been CUDA. My head still spins when I think of all the frameworks and toolchains I would need to tackle GPGPU using Radeon Instinct. And I don't even like NV as a company. But it "just works". And they still have the fastest accelerators on the market . . . for now.

And that coverage is why Jen-Hsun Huang called Nvidia a software company over a decade ago already.

beginner99 · Jun 23, 2021

moinmoin said:
And that coverage is why Jen-Hsun Huang called Nvidia a software company over a decade ago already.

Yeah and AMD is a hardware company and it shows.

Vattila · Jun 27, 2021

Rather than just blabber about SYCL, I've decided to learn it!

So, I've started reading the free book "Data Parallel C++", which is an easy introduction to SYCL programming using Intel's free Data Parallel C++ compiler (DPC++). The book is written by very competent Intel people (PhDs with a long history in supercomputing, compiler writing, etc.), but the text is very novice friendly, starting with brief explanations of fundamental concepts in parallel programming, such as Amdahl's limiting law (speed-up is limited by the serial part of a program) and Gustafson's counteracting law (the parallel part of a program can scale tremendously with more data). The example code is simple and easy to understand.
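
For readers who want to see the two laws as numbers, here is a minimal sketch using their standard textbook forms. It is not taken from the book; the function and parameter names are made up for illustration, and Python is used only because the arithmetic is language-neutral.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's law: fixed problem size.

    speedup = 1 / ((1 - p) + p / n), where p is the fraction of the
    run time that can be parallelised and n is the number of workers.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def gustafson_speedup(parallel_fraction, workers):
    """Gustafson's law: problem size grows with the machine.

    scaled speedup = (1 - p) + p * n, where p is the parallel fraction
    of the run time measured on the parallel machine.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return (1.0 - parallel_fraction) + parallel_fraction * workers
```

With a 90% parallel program, `amdahl_speedup(0.9, n)` stays below 10 however large `n` gets, because the 10% serial part dominates. `gustafson_speedup(0.9, n)` keeps growing with `n`, because the parallel part is assumed to grow with the data.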

The introduction to the book is enlightening when it comes to the origins of SYCL (Khronos) and its aims, in particular relative to ISO C++, the key participants in the standardisation (Codeplay, in particular) and Intel's involvement with DPC++ and oneAPI.

So, well worth having a look for anyone that wants to know more about parallel programming and SYCL in particular. Again, it is completely free and available for simple download at the following link.

Data Parallel C++

This open access book focuses on the key aspects of parallel programming as a wide and complex topic, namely language support for data parallel algorithm coding. Learn to effectively and swiftly use DPC++, the key to programming for Intel’s new One API initiative.

link.springer.com

Here is a video presentation by James Reinders, one of the authors, on the book and on SYCL and DPC++/oneAPI in general:

https://youtu.be/CV6oqyGocCo

PS. AMD and Nvidia should get aboard officially! I'm sure they are both already working with SYCL behind the scenes, through their cooperation with the national labs on the existing and upcoming supercomputers and the related programming models to be supported.

Vattila · Jun 27, 2021

beginner99 said:
Yeah and AMD is a hardware company and it shows.

Absolutely. Although ROCm shows promise, and surely has contributed to the supercomputer wins, the software situation at AMD is still dire. In particular, the following issues need resolving:

ROCm isn't supported on Windows, and support on Linux is limited (see installation guide).
ROCm only supports a limited range of GPUs (see documentation at GitHub).
SYCL isn't directly supported as a programming model (but you may use hipSYCL).
Although OpenCL is listed as a supported programming model for ROCm, AMD last released a conforming OpenCL 2.0 driver in 2015 (according to the Khronos site), leaving later hardware without support in frameworks and applications built on OpenCL (such as some SYCL implementations).

Hopefully, they can accelerate their software development. Codeplay's work on implementing SYCL support in DPC++ for AMD GPUs should help improve the situation.

Vattila · Jun 27, 2021

Here is some academic work done back in 2019 on automated CUDA-to-SYCL translation:

ReSYCLator: Transforming CUDA C++ source code into SYCL

"CUDA™ while very popular, is not as flexible with respect to target devices as OpenCL™. While parallel algorithm research might address problems first with a CUDA C++ solution, those results are not easily portable to a target not directly supported by CUDA. In contrast, a SYCL™ C++ solution can operate on the larger variety of platforms supported by OpenCL. ReSYCLator is a plug-in for the C++ IDE Cevelop[2], that is itself an extension of Eclipse-CDT. ReSYCLator bridges the gap between algorithm availability and portability, by providing automatic transformation of CUDA C++ code to SYCL C++. A first attempt basing the transformation on NVIDIA®’s Nsight™ Eclipse CDT plug-in showed that Nsight™’s weak integration into CDT’s static analysis and refactoring infrastructure is insufficient. Therefore, an own CUDA-C++ parser and CUDA language support for Eclipse CDT was developed (CRITTER) that is a sound platform for transformations from CUDA C++ programs to SYCL based on AST transformations."

iwocl-2019-dhpcc-tobias-stauber-resyclator-transforming-cuda-C-source-code-into-sycl.pdf

PS. Peter Sommerlad is a prominent participant in the C++ community and a member of the standardisation committee. See his presentations and C++ standard proposal papers.

Vattila · Jul 5, 2021

This year's Hans Meuer Award goes to researchers at University of Bristol for their performance analysis of parallel programming models:

"This 2021 award went to Andrei Poenaru and co-authors Wei-Chen Lin and Simon McIntosh-Smith (all from the University of Bristol) for their paper: A Performance Analysis of Modern Parallel Programming Models Using a Compute-Bound Application. [...] Poenaru, a PhD student at the University of Bristol, provides a brief summary of his paper: “We introduced a mini-app, based on a compute bound computational chemistry application developed here at the University of Bristol. It’s a molecular docking application. And what we did is we took the core kernel code and we implemented that in lots of different programming models. These range from more traditional models like of OpenMP and OpenCL to newer ones like SYCL and Kokkos, and these can target both CPUs and GPUs, so we then took that mini-app and we run across a wide range of platforms in an attempt to evaluate the performance portability of all of these programming models.” [...] Coauthor Wei-Chin (Tom) Lin, a PhD student at University of Bristol, added: “The mini-app we refer to as miniBUDE has already been very useful in my previous work, where I surveyed the performance of several SYCL implementations over time to track their improvements historically. And my current work is surveying potential HPC languages, such as Rust, Julia, maybe even Scala, and I think having a compute-bound mini-app like miniBUDE will be very helpful for comparing these programming languages. [...]” Poenaru again: “The performance portability work is interesting from two perspectives. One of them is heterogeneous architectures — we’re seeing a lot of different CPUs and GPUs and other kinds of processors from a wide range of vendors, and it’s really in everyone’s advantage to be able to program as many of those as uniformly as we can. That’s why we’re interested in APIs that are not vendor proprietary but you can use them on a wide range of different platforms. And the other perspective is programming languages. 
In other areas of software development, new languages have made big improvements over the past decades but in HPC, we’re still using older languages and frameworks, so I really want to see new C++ based modern approaches to parallel programming because I think it will greatly improve the productivity of programmers working in HPC”."

ISC 2021 Reveals Winners of Research Awards (hpcwire.com)

Ajay · Jul 16, 2021

Vattila said:
PS. AMD and Nvidia should get aboard officially!

Why would Nvidia get aboard? They have the advantage. They have a huge collection of tools and libraries, developed in-house and by ISVs, to support their compute systems. HPC systems get a lot of press, but compute systems exist across a wide range of businesses. The Gov't makes sure these systems get parceled out across multiple vendors; the private sectors could care less about this.

moinmoin · Jul 16, 2021

Ajay said:
the private sectors could care less about this.

Maybe it couldn't, but monopolization is never a good thing in the long run, so the private sectors should care more about this.

Vattila · Aug 25, 2021

Argonne National Laboratory buys a 44-petaflops AMD+Nvidia supercomputer to start transitioning their codes to the exascale era using the emerging SYCL standard programming model, as well as the traditional MPI and OpenMP standards. Note that the upcoming Aurora supercomputer is an Intel CPU+GPU system that will be programmed using ISO C++ and SYCL (on which Intel's DPC++ compiler and oneAPI framework are based). However, this transitional supercomputer ("Polaris") is based on AMD CPUs and Nvidia GPUs, so the software porting effort will be an extreme exercise in cross-platform development.

"With its heterogeneous CPU-GPU architecture (in a 1:4 ratio), Polaris is helping Argonne make the transition to the Intel-HPE Aurora system, which slipped from 2021 to 2022 on account of Intel roadmap delays (impacting Sapphire Rapids and Ponte Vecchio). Polaris will be used by researchers within the DOE’s Exascale Computing Project and the ALCF’s Aurora Early Science Program to start prepping their codes for Aurora. 'We looked at many possible solutions with Aurora in the back of our mind,' said Kumaran of the technology selection process. 'We wanted something with multi-GPU node support. And we wanted something that would support some of the key programming models on Aurora, which is MPI, OpenMP, and also SYCL in DPC++ (the SYCL 2020 variant from Intel). We wanted these programming models supported, and Polaris offered that solution.' [...] A wider goal, long-sought and steadily inching forward, is cross-platform code portability. Argonne has researchers working with NERSC (Berkeley Lab) and Codeplay (prominent SYCL supporter) to port SYCL and DPC++ to the A100 GPU. 'If people are porting code to Aurora using SYCL or DPC++, they will be able to continue to support that programming model and not have to rewrite to OpenMP or MPI or CUDA to use on Polaris,' said Kumaran. 'And similarly, we’ve also explored supporting HIP on this platform (Polaris), and so if you have CUDA support, and you are developing with CUDA on Summit, or for future AMD-based platforms, with Frontier, then you can use that. And finally, we are also exploring SYCL and DPC++ for AMD GPUs [in collaboration with Oak Ridge and Codeplay]. And so if you’re looking for an alternate solution to CUDA and HIP on AMD GPUs and you want to run your DPC++ code, we have a proof-of-concept working on that.'"

Argonne’s 44-Petaflops ‘Polaris’ Supercomputer Will Be Testbed for Aurora, Exascale Era

A new 44-petaflops (theoretical peak) supercomputer is under construction at the Department of Energy’s Argonne National Laboratory. Called Polaris, this new supercomputing star has been selected to light the way […]

www.hpcwire.com

PS. Here is a page on SYCL at Argonne's online support centre for Aurora. Lots of resources here:

SYCL and DPC++ for Aurora | Argonne Leadership Computing Facility

www.alcf.anl.gov

Heartbreaker · Nov 5, 2021

It looks more like SYCL is an extra abstraction layer on top of lower level GPU APIs like CUDA or OpenCL, than it is a replacement for either.

Abstraction layers don't replace the lower layers. You still need them.

Your question should be, "will more people program in SYCL, rather than in CUDA directly?"

I am dubious that you will win over many CUDA programmers with another abstraction layer.

Vattila · Nov 15, 2021

HIP replaces CUDA in Blender:

"HIP (Heterogeneous-computing Interface for Portability) is a C++ Runtime API and kernel language that allows developers to create portable applications for AMD and NVIDIA GPUs from a single source code. This allows the Blender Cycles developers to write one set of rendering kernels and run them across multiple devices. The other advantage is that the tools with HIP allow easy migration from existing CUDA code to something more generic. AMD has been working closely with Blender to add support for HIP devices in Blender 3.0, and this code is already available in the latest daily Blender 3.0 beta release."

Next level support for AMD GPUs — Blender Developers Blog

Ultimately, generic code will end up using SYCL, I hope, thereby supporting all competitors.

moinmoin · Nov 15, 2021

That blog entry is by Brian Savery, product manager at AMD. It's good news from AMD since it is also the start of support for HIP/ROCm in consumer cards under Windows (with the Radeon 6000 series noted as being compatible). On the other hand it may be a little misleading as it implies HIP is being used to merge the previous separate CUDA and OpenCL code bases. The previous blog entry by the Blender Foundation chairman Ton Roosendaal indicates something different though:
"Blender now is being supported by all major silicon manufacturers: Intel, AMD, Nvidia and Apple."

Vattila · Nov 15, 2021

moinmoin said:
It's good news from AMD since it is also the start of support for HIP/ROCm in consumer cards under Windows

If that is the case, that would be a great step forward indeed for AMD.

On the other hand it may be a little misleading as it implies HIP is being used to merge the previous separate CUDA and OpenCL code bases.

I think the blog post only concerns Cycles (Blender’s physically-based path tracer). According to a screenshot in the blog post, it now apparently only supports "CUDA" and "HIP" devices, and not OpenCL.

moinmoin · Nov 15, 2021

Ton mentions in the comments that Apple will move to Metal for Cycles. Not sure what Intel will do (I'm sure they want Xe to be supported as well, maybe oneAPI?), but this doesn't appear to be a code merge, rather the opposite.

Vattila · Nov 15, 2021

moinmoin said:
Not sure what Intel will do

Hopefully, they'll contribute an implementation using SYCL (on which oneAPI/DPC++ is based). Then, hopefully, with backend support for Nvidia and AMD (which Codeplay is working on), there can be convergence towards a common generic codebase.

Vattila · Nov 22, 2021

QUDA (not to be confused with CUDA) is being ported to cross-platform backends (HIP, SYCL, OpenMP, C++ STL) by a Portability Working Group consisting of representatives from AMD, Intel, Nvidia and national labs. QUDA is a library for performing calculations in lattice quantum chromodynamics (QCD) on graphics processing units (GPUs), originally leveraging NVIDIA’s CUDA platform.

QUDA into the Frontier - YouTube

moinmoin · Nov 22, 2021

Renaming to QUIP incoming?

Vattila · Nov 23, 2021

LUMI is a recent HPE Cray EX supercomputer deployed in the EU. The largest partition of the system is the LUMI-G partition consisting of GPU-accelerated nodes using the new AMD EPYC "Trento" CPU and Instinct MI250X GPU, providing 375 petaflops of committed Linpack performance.

How is the LUMI-G supercomputer programmed?

Below are the relevant sections from a LUMI blog post introducing the system and its programming models. Note in particular the mentions of HIP (AMD's CUDA dialect), as well as SYCL (the open Khronos standard for heterogeneous system programming). Also note the section on how the vendor was chosen. In particular, the benchmark criteria included 3 MLPerf benchmarks out of 7 codes, as well as a requirement for "a robust way to translate or run programs using CUDA". So HPE and AMD's winning bid had to clearly demonstrate sufficient ML performance and that the CUDA moat could be overcome — which it convincingly did, apparently!

Programming environment introduction

The AMD programming environment comes under the ROCm (Radeon Open Compute) brand. As the name suggests, it is mostly open source components, but they are actively developed and maintained by AMD, the source code is hosted at RadeonOpenCompute/ROCm, and the documentation can be found on ROCm documentation platform.

The ROCm software stack contains the usual set of accelerated scientific libraries, such as ROCBlas for BLAS functionality and ROCfft for FFT, etc. The AMD software stack naturally also comes with the necessary compilers needed to get code compiled for the GPUs. The AMD GPU compiler will have support for offloading through OpenMP directives. In addition, the ROCm stack comes with HIP, which is AMDs replacement for CUDA, and the tools required to make translating code from CUDA to HIP much easier. The code translated to HIP can still work on CUDA hardware. We will cover translating codes to HIP in more detail in later posts.

The ROCm stack also includes the tools needed to debug code running on the GPUs in the form of ROCgdb, AMDs ROCm source-level debugger for Linux based on the GNU Debugger (GDB). For profiling, the ROCm stack also comes with rocProf, implemented on the top of rocProfiler and rocTracer APIs, allowing you to profile code running on the GPUs. AMD provides its MIOpen library, an open-source library for high performance machine learning primitives for machine learning. MIOpen provides optimized hand-tuned implementations such as convolutions, batch normalizations, poolings, activation functions, and recurrent neural networks (RNNs).

In addition to the AMD software stack, LUMI will also come with the full Cray Programming Environment (CPE) stack. The Cray programming environment comes with the needed compilers and tools that help users port, debug, and optimize for GPUs and conventional multi-core CPUs. It also includes fine-tuned scientific libraries that can use the CPU host and the GPU accelerator when executing kernels [...].

Getting your code ready for LUMI

While LUMI does include a partition with CPU-only nodes, most of the performance comes from GPU nodes. If you want to take full advantage of LUMI, you will need to make sure your code can do at least the majority of its work on the GPUs. If your codes currently don’t utilize any GPU acceleration, now is a great time to start looking into what parts could benefit from GPU acceleration. If you are using a code that currently uses GPU acceleration, some porting may still be needed. In cases where the code is widely used in the community, and you are not the developer, the porting effort will likely be made by someone else. But in the cases where the code is something you have developed yourself, you will need to do some porting work. If the current code uses CUDA, that code will need to be converted to AMD’s equivalent to HIP. The HIP is very similar to CUDA, and there are source-to-source translation tools available that will do the majority of the conversion work, see the Questions and Answers section below. Some OpenACC support level is available for the AMD hardware used, but it may be worth considering converting OpenACC codes to use OpenMP offloading instead. [...]

Questions and Answers

How was the LUMI vendor chosen? — The LUMI contract was awarded based on a combination of functional requirements and performance/cost metrics. The functional requirements included things like a working software stack etc. For the cost/performance metric, the performance was a combination of synthetic benchmarks such as the industry-standard HPL and HPCG benchmarks but also a set of application benchmarks. The application benchmark set consisted of three benchmarks from the MLPerf benchmarks set and 4 codes with 6 different input cases; the codes are applications that are widely used within the LUMI consortium. The non-ML part of the application benchmarks consisted of codes using either CUDA and one using OpenACC to offload their computation to the GPU. With this type of benchmark set, we enforced additional functionality requirements that there needs to be a robust way to translate or run programs using CUDA on the machine.

How will my CUDA-code work on LUMI? — Your CUDA code has to be converted to AMD HIP to work with the AMD GPUs. The HIP is a C++ Runtime API and Kernel Language that can be used to create portable applications for AMD and NVIDIA GPUs using the same source code. AMD provides various tools for porting CUDA code to HIP so that the developer does not have to port all of the code manually, we still expect some manual modifications, and performance tuning for the hardware to be needed. Code converted to HIP can also run on CUDA hardware using just the HIP header files, allowing you to maintain only the HIP version of your code. When running HIP on CUDA hardware, the performance should be similar to the original CUDA code’s performance. Many CUDA libraries are also ported or in the process of being ported. AMD provided its optimized libraries; for example, the AMD BLAS library comes in the form of rocBLAS. They also offer an additional convenience library, hipBLAS, to allow HIP programs to call the appropriate BLAS libraries based on the hardware they are running on, i.e., rocBLAS will be called on an AMD platform and cuBLAS on a CUDA platform. Recently Hipfort was also released, which is a way for Fortran codes to access the HIP libraries and allow them to offload code to AMD GPUs.

How will my OpenACC code work on LUMI? — A community effort, led by Oak Ridge National Laboratory, supports OpenACC in the GNU compilers. The most recent GCC release, currently 10.1, supports OpenACC 2.6. We expect that GCC’s support and stability will improve and be better closer to the GPU partition availability. If you can currently build your OpenACC programs with the GNU compiler, you should be able to use OpenACC on LUMI. As an alternative to OpenACC, LUMI will support programs using OpenMP directives to offload code to the GPUs. OpenMP offloading has better support from the system vendor, meaning it may be worth considering porting your OpenACC code to OpenMP.

How can I run my favourite ML framework on LUMI? — The most well-known machine learning frameworks are going to be available for AMD GPUs. For example, TensorFlow and PyTorch are already supported by AMD, and their effort continues to bring the best experience for the machine learning users. You can find more information here.

Can other programming approaches for GPUs be used on LUMI?

OpenMP — With OpenMP, it is possible to offload computation to the GPUs using similar directives as one can use to create multithreaded applications. The AMD compiler and the Cray compiler both support offloading to the AMD GPUs used in LUMI and will support the latest version 5.0 of the OpenMP standard.

OpenCL — AMD has long been supporting OpenCL, and that support should continue. The AMD OpenCL driver supports CPUs and GPUs, and the programming environment includes debugging and profiling tools (CodeXL) and performance libraries like clMath. Documentation for OpenCL on AMD GPUs can be found at ROCm Documentation platform.

SYCL — SYCL is a single-source C++-based programming model targeting systems with heterogeneous architecture. Multiple SYCL implementations are supporting AMD GPUs. HipSYCL is an SYCL implementation built on HIP; as such, it targets both AMD and NVIDIA GPUs; another alternative is using Codeplays ComputeCpp SYCL implementation that also supports AMD GPUs.

Libraries — Many libraries offer GPU acceleration for specific operations. For instance, for linear algebra, there is the BLAS library that AMD has created their version that can offload the computation done to the GPU. Using these libraries can be an easy way to get part of your program offloaded to GPUs.

lumi-supercomputer.eu
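
The hipBLAS idea from the Q&A above (one call site in the application, with a vendor routine chosen per platform) can be sketched roughly as follows. This is a pure-Python illustration of the dispatch pattern only: the function names, the platform strings, and the list-based "vendor routines" are all hypothetical stand-ins, not the real rocBLAS, cuBLAS, or hipBLAS APIs, and the real library may make the selection at build time rather than at run time.

```python
# Hypothetical stand-ins: these are NOT the real rocBLAS/cuBLAS/hipBLAS APIs.

def _amd_axpy(alpha, x, y):
    """Pretend vendor routine for the AMD platform (stand-in for rocBLAS)."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def _nvidia_axpy(alpha, x, y):
    """Pretend vendor routine for the CUDA platform (stand-in for cuBLAS)."""
    return [yi + alpha * xi for xi, yi in zip(x, y)]


# The convenience layer only knows which vendor routine serves which platform.
_BACKENDS = {"amd": _amd_axpy, "nvidia": _nvidia_axpy}


def portable_axpy(platform, alpha, x, y):
    """Single entry point used by application code: computes alpha*x + y
    by forwarding to whichever vendor routine matches the platform."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    try:
        vendor_routine = _BACKENDS[platform]
    except KeyError:
        raise ValueError(f"unsupported platform: {platform}") from None
    return vendor_routine(alpha, x, y)
```

The application calls `portable_axpy` everywhere and never names a vendor routine directly, so supporting another platform means adding one entry to the table rather than touching every call site.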

PS. The HPE+AMD supercomputer architecture and programming models also apply to the upcoming exascale supercomputers in USA — Frontier and El Capitan. The upcoming Intel-based Aurora system will use the oneAPI framework based primarily on the SYCL standard.
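
To give a feel for why the LUMI post says source-to-source tools can do "the majority of the conversion work", here is a deliberately tiny sketch of the renaming part of a CUDA-to-HIP port. It covers only a handful of runtime names whose HIP counterparts differ by prefix; the function name `hipify_names` is made up for this example, and AMD's actual porting tools handle far more (libraries, types, error codes, build changes) than this toy does.

```python
import re

# A few CUDA runtime names and their HIP counterparts. This table is a
# small illustrative subset, not a complete or authoritative mapping.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Longest names first, with word boundaries, so that identifiers which
# merely contain a CUDA name (e.g. my_cudaMallocWrapper) are left alone.
_NAMES = sorted(CUDA_TO_HIP, key=len, reverse=True)
_PATTERN = re.compile(r"\b(" + "|".join(re.escape(n) for n in _NAMES) + r")\b")


def hipify_names(source):
    """Return source text with the known CUDA names replaced by HIP names."""
    return _PATTERN.sub(lambda match: CUDA_TO_HIP[match.group(1)], source)
```

For example, `hipify_names("cudaMalloc(&d, n); cudaFree(d);")` returns `"hipMalloc(&d, n); hipFree(d);"`. Anything not in the table is left untouched, which is where the "manual modifications" mentioned in the Q&A come in.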

What does the future hold for heterogeneous programming models?

CUDA is going to lead for the foreseeable future, due to installed base and support.

SYCL will rapidly replace CUDA, due to being an open standard with wide backing.
