News "Aurora’s Troubles Move Frontier into Pole Exascale Position" - HPCwire

moinmoin · May 31, 2021

With Sapphire Rapids delayed Intel's Aurora exascale supercomputer misses yet another date. This means Frontier is now on route of becoming the first exascale supercomputer.

Aurora's Troubles Move Frontier into Pole Exascale Position

Intel’s 7nm node delay has raised questions about the status of the Aurora supercomputer that was scheduled to be stood up at Argonne National Laboratory next year. Aurora was in the running to be the United States’ first exascale supercomputer although it was on a contemporaneous timeline with...

www.hpcwire.com

Joe NYC · Aug 29, 2021

DrMrLordX said:
Debateable. But if you're a developer, are you really going to want to work with the AMD software stack?

I am not familiar with the API, differences between them and ease of use. But if you are a programmer, you are hired for a job, you do the job.

DrMrLordX said:
Got any links? I don't remember seeing that.

Intel has been playing games here, maybe they are still trying to make it work.

https://twitter.com/x/status/1429874228887801868

DrMrLordX · Aug 29, 2021

Joe NYC said:
I am not familiar with the API, differences between them and ease of use. But if you are a programmer, you are hired for a job, you do the job.

That's not how it works. With supercomputers provided to research organizations, you either:

a). have a toolset the existing researchers and support staff can use while keeping to deadlines/budgets or
b). adopt a toolset that requires extending deadlines or bloating budgets to bring on more support staff.

It's worth a lot of $$$ to stick with solution a).

Intel has been playing games here, maybe they are still trying to make it work.

That would surprise me if true. Need to see more hard evidence of that, or at least more independent sources to corroborate.

Joe NYC · Aug 29, 2021

DrMrLordX said:
That's not how it works. With supercomputers provided to research organizations, you either:

a). have a toolset the existing researchers and support staff can use while keeping to deadlines/budgets or
b). adopt a toolset that requires extending deadlines or bloating budgets to bring on more support staff.

It's worth a lot of $$$ to stick with solution a).

These supercomputer projects have multiple year lead times for hardware, and same lead time to get the software tools ready.

And perhaps the government sees a software lock to one vendor as a big negative (0 for 3 for NVidia on exascale supercomputer). It is as if government wanted quite strenuously to avoid this single vendor lock.

DrMrLordX said:
That would surprise me if true. Need to see more hard evidence of that, or at least more independent sources to corroborate.

You will have hard time finding hard evidence because Intel went radio silence on the subject of CXL link between SPR and PVC.

dacostafilipe · Aug 30, 2021

Well, a lot of scientists have to write their own code so having a mature/simple stack like CUDA is really helpful, specially when you have limited access to the supercomputer ressources or limited budget for your local hardware. This happens a lot in universities.

But, when you have a big project, you normally have your own team of developers that write code and the benefit of CUDA should be less obvious or/and the issue with portability could make it a less desirable choice.

DrMrLordX · Aug 30, 2021

Joe NYC said:
These supercomputer projects have multiple year lead times for hardware, and same lead time to get the software tools ready.

The hardware vendors have years of lead time to prepare the hardware, but those who wish to use said hardware are presumably busy with other projects on the hardware already in their possession; furthermore, unless you're dealing in truly bespoke hardware (rather than clusters featuring what are essentially commodity dGPUs), the hardware vendor provides software tools. Not the research team that's going to be using said hardware.

It is as if government wanted quite strenuously to avoid this single vendor lock.

Aurora is (like many supercomputer projects) largesse. It should not be seen as an allergic reaction to CUDA.

You will have hard time finding hard evidence because Intel went radio silence on the subject of CXL link between SPR and PVC.

That's awfully convenient.

NeoLuxembourg said:
Well, a lot of scientists have to write their own code so having a mature/simple stack like CUDA is really helpful, specially when you have limited access to the supercomputer ressources or limited budget for your local hardware. This happens a lot in universities.

But, when you have a big project, you normally have your own team of developers that write code and the benefit of CUDA should be less obvious or/and the issue with portability could make it a less desirable choice.

There's a great deal of effort going on to make sure that CUDA projects will be portable to SYCL. It isn't yet clear how or how well that will work out. Modern CUDA relies a great deal on drivers to balance workloads on NVLink-equipped systems, for example.

moinmoin · Aug 30, 2021

DrMrLordX said:
CUDA is still a force, and NV intends to keep it that way.

Not so much "still" but actually coming into its own in the last couple years where Nvidia's datacenter business' revenue passed its consumer business for the first time ever. It's now a major juggernaut that has been a long time coming. The more important it is to foster valid alternatives, unless monopolies and lack of choices are of no concern whatsoever.

DisEnchantment · Aug 30, 2021

DrMrLordX said:
CUDA is still a force, and NV intends to keep it that way. They don't want their hardware to compete on a level playing field with a common software stack. Especially not when their competitors control the CPUs and motherboards. However . . .

CUDA for the exascale era of computing is fading.
Not relying on open standards for gigantic investments like Exascale programs and all the researches under its umbrella is setting up yourself for disaster.
Look at what happened to Argonne, the HW keeps getting pushed but the research will keep moving forward because DoE has settled on open standards to be used across all projects. It is a requirement.

Preparing for science in the exascale era | Argonne National Laboratory

www.anl.gov

Argonne is paying CodePlay to support SYCL for AMD GPUs in oneAPI/DPC++ and LLVM, because the code from Aurora will eventually be targetted to run on Frontier too.

Argonne and Oak Ridge National Laboratories Award Codeplay Software to Further Strengthen SYCL Support Extending the Open Standard Software for AMD GPUs | Argonne Leadership Computing Facility

www.alcf.anl.gov

For gigantic projects where HW is far less cheaper than all the research effort and the SW used therein, getting your research of decades locked in a vendor is just setting up yourself for the perfect storm.
CUDA will have a place for smaller enterprises/projects.
That too, open standards are maturing and implemented by multiple vendors.

DrMrLordX · Aug 30, 2021

moinmoin said:
Not so much "still" but actually coming into its own in the last couple years where Nvidia's datacenter business' revenue passed its consumer business for the first time ever. It's now a major juggernaut that has been a long time coming. The more important it is to foster valid alternatives, unless monopolies and lack of choices are of no concern whatsoever.

If we could turn back time and find a way to keep OpenCL development fresh and relevant to serve as a counter to CUDA, we would all be better off today. As it stands, everyone's playing catchup.

DisEnchantment said:
CUDA for the exascale era of computing is fading.

Hope you're right, but having to hire an outside shop just to get your software tools ready really stinks. In the long run, the most likely factor to knock out CUDA will be NV's inability to control platforms.

DisEnchantment · Aug 30, 2021

DrMrLordX said:
Hope you're right, but having to hire an outside shop just to get your software tools ready really stinks. In the long run, the most likely factor to knock out CUDA will be NV's inability to control platforms.

Paying a third party to get the SW, tools and standards is fairly standard practice in big enterprises, it is peanuts compared to the actual cost and effort of the engineering teams.
3rd party ISVs, like Mentor, WindRiver are specialists in providing support to ensure your massive SW investment remains compatible with lots of SW libraries and platforms provided by different players.

Just an example even if we skip HPC for a moment,
Basically every SoC platform you plan to develop on comes with a BSP and support for the standard interfaces and libraries like OpenGL, Vulkan, OpenCV, OpenMP etc.
Adherence to standards means you get MTK this year and QC next year or Exynos or whatever. Your SW will remain compatible.

At least this is what we do. We jumped off the NV train many years ago.
Bad luck for the folks with deep dependencies to Vibrante and all the myriad of SDKs to realize the other SoC vendors have much better parts but they would still have to stick with the Tegras

moinmoin · Aug 30, 2021

DrMrLordX said:
If we could turn back time and find a way to keep OpenCL development fresh and relevant to serve as a counter to CUDA, we would all be better off today.

Khronos did turn back the time on OpenCL.

Khronos Announces OpenCL 3.0: Hitting the Reset Button on Compute Frameworks

www.anandtech.com

Khronos (which president is VP developer ecosystem at Nvidia) being an industry consortium is exactly why OpenCL can't counter CUDA. Its standards rely on the support of all its members, and after OpenCL 2.0 that support fell apart until nobody was left for 2.2.

Joe NYC · Aug 30, 2021

That was quick. From AMD press release:

Polaris is scheduled to be delivered and installed in August 2021

AMD EPYC™ Processors Picked by Argonne National Laboratory to Prepare for Exascale Future

Browse AMD’s company-wide and financial press releases.

ir.amd.com

beginner99 · Aug 30, 2021

moinmoin said:
Actual scientists preferring proprietary closed source tool chains, really? Eww...

They use what works to get their papers out vs. tinkering 4 years to get the software running somewhat...

moinmoin · Aug 30, 2021

beginner99 said:
They use what works to get their papers out vs. tinkering 4 years to get the software running somewhat...

They use black boxes to get their papers out vs. researching how they actually get all their crucial data, got it.

Vattila · Aug 30, 2021

Here are a couple of nice quotes from researchers attending Argonne's training program for exascale computing:

“The presentation on SYCL/DPC++ by Argonne’s Thomas Applecourt ended up being the most helpful session for me,” said participant Ral Bielawski, a doctoral student in aerospace engineering at the University of Michigan. “He covered a programming model I had never been exposed to before and provided a solution that would target Intel GPUs, and potentially most GPUs in the future.”

“I’ll be using a lot of the ideas I’ve seen here to update material in our high-performance computing courses,” [said participant Kevin Green, a research scientist for the Department of Computer Science at the University of Saskatchewan in Canada]. “I’ve also gotten a good feel for how we can migrate our current code designs to designs that will perform well across different supercomputing architectures.”

Virtual Argonne training program prepares researchers for extreme-scale computing | Argonne National Laboratory

www.anl.gov

Makes me hopeful that SYCL will trickle down fast — in research labs and universities, at least.

PS. By the way, here is a page on SYCL at Argonne's online support centre for Aurora. Lots of resources here:

SYCL and DPC++ for Aurora | Argonne Leadership Computing Facility

www.alcf.anl.gov

Hitman928 · Aug 30, 2021

Vattila said:
Here is a couple of nice quotes from researchers attending Argonne's training program for exascale computing:

“The presentation on SYCL/DPC++ by Argonne’s Thomas Applecourt ended up being the most helpful session for me,” said participant Ral Bielawski, a doctoral student in aerospace engineering at the University of Michigan. “He covered a programming model I had never been exposed to before and provided a solution that would target Intel GPUs, and potentially most GPUs in the future.”

“I’ll be using a lot of the ideas I’ve seen here to update material in our high-performance computing courses,” [said participant Kevin Green, a research scientist for the Department of Computer Science at the University of Saskatchewan in Canada]. “I’ve also gotten a good feel for how we can migrate our current code designs to designs that will perform well across different supercomputing architectures.”

Virtual Argonne training program prepares researchers for extreme-scale computing | Argonne National Laboratory

www.anl.gov

Makes me hopeful that SYCL will trickle down fast — in research labs and universities, at least.

If SYCL spreads quickly in research labs and universities, it won't be too long before it's incorporated industry wide. There's a reason Nvidia sponsors so much university research and has research lab partnerships where they spend a decent chunk of change on subsidizing hardware and support to get them using CUDA.

DrMrLordX · Aug 31, 2021

DisEnchantment said:
Paying a third party to get the SW, tools and standards is fairly standard practice in big enterprises, it is peanuts compared to the actual cost and effort of the engineering teams.

See post from @Hitman928 , I thought NV was in the habit of providing 3rd party support to try to get people stuck on CUDA? I mean yeah if you're coming from a Tegra standpoint then it would make sense to buck them anyway since the hardware is so bad. But we're talking about the segment where NV has had dominant hardware performance for awhile.

moinmoin said:
Khronos did turn back the time on OpenCL.

That wasn't exactly what I had in mind since it rolls back all their work on SVM. We'll see if they can do any good.

moinmoin · Aug 31, 2021

DrMrLordX said:
That wasn't exactly what I had in mind since it rolls back all their work on SVM. We'll see if they can do any good.

They didn't roll back "all their work on SVM", now it's just an optional part of the spec. As the article points out:

"This, as it turns out, is very similar to how Khronos has tackled Vulkan, which has been far more successful in recent years. Giving vendors some flexibility in what their API implements has allowed Vulkan to be stretched from mobile devices to the desktop, so there is some very clear, real-world evidence that this structure can work. And it’s this kind of success that the OpenCL working group would like to see as well."

DrMrLordX · Aug 31, 2021

moinmoin said:
They didn't roll back "all their work on SVM", now it's just an optional part of the spec. As the article points out:

Yeah but that part of the spec may be orphaned if nobody uses it. Which, under the circumstances, I can see people not using it unless Intel builds on it as part of OneAPI.

beginner99 · Sep 3, 2021

moinmoin said:
They use black boxes to get their papers out vs. researching how they actually get all their crucial data, got it.

With that logic they would also need to build their own CPUs and GPUs because you can say the exact same thing about the hardware. And you also imply that researchers using SYCL would read and then try to understand all the source code because if they don't, it's still a black box to them.

I mean just look how many stuff actually uses excel still and yes for publications. Given your logic all these use a black box called excel and often on top a black box called windows which itself runs on a black box called the CPU microcode.

moinmoin · Sep 3, 2021

beginner99 said:
With that logic they would also need to build their own CPUs and GPUs because you can say the exact same thing about the hardware.

System design is part of computer science, and that's exactly what's (should be, was for me) happening there. There exist many educational projects in that area, and RISC-V started that way.

beginner99 said:
Given your logic all these use a black box called excel

Eww. (Seriously though, people stuffing everything into spreadsheets for no reason at all is a disease.)

moinmoin · Sep 30, 2021

Frontier is currently being installed.

US Closes in on Exascale: Frontier Installation Is Underway

At the Advanced Scientific Computing Advisory Committee (ASCAC) meeting, held by Zoom this week (Sept. 29-30), it was revealed that the Frontier supercomputer is currently being installed at Oak Ridge National Laboratory in Oak Ridge, Tenn. The staff at the Oak Ridge Leadership...

www.hpcwire.com

moinmoin · Oct 5, 2021

Perfectly fitting this topic: Summary of the Aurora woes contrasted with Frontier:

First Look At Oak Ridge’s “Frontier” Exascaler, Contrasted To Argonne’s “Aurora”

The fiscal year of the federal government in the United States ends on September 30, and whether we all knew it or not, the US Department of Energy had a

www.nextplatform.com

Joe NYC · Oct 5, 2021

moinmoin said:
Perfectly fitting this topic: Summary of the Aurora woes contrasted with Frontier:

First Look At Oak Ridge’s “Frontier” Exascaler, Contrasted To Argonne’s “Aurora”

The fiscal year of the federal government in the United States ends on September 30, and whether we all knew it or not, the US Department of Energy had a

www.nextplatform.com

"Aurora dates not yet publicly available"

On another topic, the article speculates that the mini-Aurora (Polaris) stand it is unlikely to be extended to full Aurora spec, that more likely it would be a wholesale replace to full AMD system, because Polaris would need to many nodes to hit the performance targets.

Does that mean that NVidia A100 would lag behind both Mi200 and PVC?

moinmoin · Oct 5, 2021

Joe NYC said:
Does that mean that NVidia A100 would lag behind both Mi200 and PVC?

In price to performance ratio definitely.

Joe NYC · Oct 5, 2021

moinmoin said:
In price to performance ratio definitely.

It seems that also in raw performance, since it takes 9,000 nodes for Mi200 and PVC to get to 1.5 exaflop, but 16,000 for A100 (according to the estimates in the article). That's almost 2:1 ratio.

News "Aurora’s Troubles Move Frontier into Pole Exascale Position" - HPCwire

Diamond Member

Golden Member

Lifer

Golden Member

Senior member

Lifer

Diamond Member

Golden Member

Lifer

Golden Member

Diamond Member

Golden Member

Diamond Member

Diamond Member

Senior member

Diamond Member

Lifer

Diamond Member

Lifer

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Golden Member

Diamond Member

Golden Member