News "Aurora’s Troubles Move Frontier into Pole Exascale Position" - HPCwire


moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
With Sapphire Rapids delayed, Intel's Aurora exascale supercomputer misses yet another date. This means Frontier is now on track to become the first exascale supercomputer.

 

Joe NYC

Golden Member
Jun 26, 2021
1,934
2,272
106

DrMrLordX

Lifer
Apr 27, 2000
21,620
10,829
136
I am not familiar with the APIs, the differences between them, or their ease of use. But if you are a programmer and you are hired for a job, you do the job.

That's not how it works. With supercomputers provided to research organizations, you either:

a). have a toolset the existing researchers and support staff can use while keeping to deadlines/budgets or
b). adopt a toolset that requires extending deadlines or bloating budgets to bring on more support staff.

It's worth a lot of $$$ to stick with solution a).

Intel has been playing games here; maybe they are still trying to make it work.

That would surprise me if true. Need to see more hard evidence of that, or at least more independent sources to corroborate.
 

Joe NYC

That's not how it works. With supercomputers provided to research organizations, you either:

a). have a toolset the existing researchers and support staff can use while keeping to deadlines/budgets or
b). adopt a toolset that requires extending deadlines or bloating budgets to bring on more support staff.

It's worth a lot of $$$ to stick with solution a).

These supercomputer projects have multi-year lead times for the hardware, and the same lead time to get the software tools ready.

And perhaps the government sees a software lock to one vendor as a big negative (Nvidia is 0 for 3 on the exascale supercomputers). It is as if the government quite strenuously wanted to avoid this single-vendor lock-in.

That would surprise me if true. Need to see more hard evidence of that, or at least more independent sources to corroborate.

You will have a hard time finding hard evidence, because Intel has gone radio silent on the subject of the CXL link between SPR and PVC.
 

dacostafilipe

Senior member
Oct 10, 2013
771
244
116
Well, a lot of scientists have to write their own code, so having a mature/simple stack like CUDA is really helpful, especially when you have limited access to supercomputer resources or a limited budget for your local hardware. This happens a lot in universities.

But when you have a big project, you normally have your own team of developers writing the code, so the benefit of CUDA should be less obvious, and/or the portability issue could make it a less desirable choice.
 

DrMrLordX

These supercomputer projects have multi-year lead times for the hardware, and the same lead time to get the software tools ready.

The hardware vendors have years of lead time to prepare the hardware, but those who wish to use said hardware are presumably busy with other projects on the hardware already in their possession; furthermore, unless you're dealing in truly bespoke hardware (rather than clusters featuring what are essentially commodity dGPUs), the hardware vendor provides software tools. Not the research team that's going to be using said hardware.

It is as if the government quite strenuously wanted to avoid this single-vendor lock-in.

Aurora is (like many supercomputer projects) largesse. It should not be seen as an allergic reaction to CUDA.

You will have a hard time finding hard evidence, because Intel has gone radio silent on the subject of the CXL link between SPR and PVC.

That's awfully convenient.

Well, a lot of scientists have to write their own code, so having a mature/simple stack like CUDA is really helpful, especially when you have limited access to supercomputer resources or a limited budget for your local hardware. This happens a lot in universities.

But when you have a big project, you normally have your own team of developers writing the code, so the benefit of CUDA should be less obvious, and/or the portability issue could make it a less desirable choice.

There's a great deal of effort going on to make sure that CUDA projects will be portable to SYCL. It isn't yet clear how or how well that will work out. Modern CUDA relies a great deal on drivers to balance workloads on NVLink-equipped systems, for example.
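To make the porting question concrete, the same trivial vector add looks like this in each model. This is a rough, untested sketch for illustration only; the SYCL half assumes a SYCL 2020 compiler such as DPC++ and USM allocations, and function names here are my own, not from any of the Aurora/Frontier codebases:

```
// --- CUDA ---
__global__ void vadd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}
// launch: vadd<<<(n + 255) / 256, 256>>>(a, b, c, n);

// --- SYCL (single-source, vendor-neutral) ---
#include <sycl/sycl.hpp>

void vadd_sycl(sycl::queue& q, const float* a, const float* b, float* c, int n) {
    // a, b, c are USM allocations (e.g. sycl::malloc_shared) visible to the device
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        c[i] = a[i] + b[i];
    }).wait();
}
```

The kernel bodies map almost one-to-one, which is what the automated CUDA-to-SYCL migration efforts lean on; the harder parts are exactly the things outside the kernel, like the driver-managed multi-GPU balancing mentioned above.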
 

moinmoin

CUDA is still a force, and NV intends to keep it that way.
Not so much "still" as actually coming into its own: in the last couple of years Nvidia's datacenter revenue passed its consumer revenue for the first time ever. It's now a major juggernaut that has been a long time coming. All the more important, then, to foster valid alternatives, unless monopolies and a lack of choice are of no concern whatsoever.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
CUDA is still a force, and NV intends to keep it that way. They don't want their hardware to compete on a level playing field with a common software stack. Especially not when their competitors control the CPUs and motherboards. However . . .
CUDA for the exascale era of computing is fading.
Not relying on open standards for gigantic investments like the exascale programs, and all the research under their umbrella, is setting yourself up for disaster.
Look at what happened at Argonne: the HW keeps getting pushed back, but the research will keep moving forward because the DoE has settled on open standards to be used across all projects. It is a requirement.

Argonne is paying Codeplay to support SYCL for AMD GPUs in oneAPI/DPC++ and LLVM, because the code written for Aurora will eventually be targeted to run on Frontier too.
For gigantic projects where the HW is far cheaper than all the research effort and the SW used therein, locking decades of research into a single vendor is just setting yourself up for the perfect storm.
CUDA will still have a place for smaller enterprises/projects.
On top of that, open standards are maturing and are implemented by multiple vendors.
 

DrMrLordX

Not so much "still" as actually coming into its own: in the last couple of years Nvidia's datacenter revenue passed its consumer revenue for the first time ever. It's now a major juggernaut that has been a long time coming. All the more important, then, to foster valid alternatives, unless monopolies and a lack of choice are of no concern whatsoever.

If we could turn back time and find a way to keep OpenCL development fresh and relevant to serve as a counter to CUDA, we would all be better off today. As it stands, everyone's playing catch-up.

CUDA for the exascale era of computing is fading.

Hope you're right, but having to hire an outside shop just to get your software tools ready really stinks. In the long run, the most likely factor to knock out CUDA will be NV's inability to control platforms.
 

DisEnchantment

Hope you're right, but having to hire an outside shop just to get your software tools ready really stinks. In the long run, the most likely factor to knock out CUDA will be NV's inability to control platforms.
Paying a third party for SW, tools and standards support is fairly standard practice in big enterprises; it is peanuts compared to the actual cost and effort of the engineering teams.
Third-party ISVs like Mentor and Wind River specialize in ensuring that your massive SW investment remains compatible with the many SW libraries and platforms provided by different players.

Just one example, even if we skip HPC for a moment:
Basically every SoC platform you plan to develop on comes with a BSP and support for standard interfaces and libraries like OpenGL, Vulkan, OpenCV, OpenMP, etc.
Adherence to standards means you can get MTK this year and QC or Exynos next year, or whatever. Your SW will remain compatible.

At least this is what we do; we jumped off the NV train many years ago.
Bad luck for the folks with deep dependencies on Vibrante and its myriad of SDKs, who realize the other SoC vendors have much better parts but still have to stick with the Tegras.
 

moinmoin

If we could turn back time and find a way to keep OpenCL development fresh and relevant to serve as a counter to CUDA, we would all be better off today.
Khronos did turn back the time on OpenCL. ;)

Khronos (whose president is VP of developer ecosystem at Nvidia) being an industry consortium is exactly why OpenCL can't counter CUDA. Its standards rely on the support of all its members, and after OpenCL 2.0 that support fell apart, until nobody was left for 2.2.
 

Vattila

Senior member
Oct 22, 2004
799
1,351
136
Here are a couple of nice quotes from researchers attending Argonne's training program for exascale computing:

“The presentation on SYCL/DPC++ by Argonne’s Thomas Applecourt ended up being the most helpful session for me,” said participant Ral Bielawski, a doctoral student in aerospace engineering at the University of Michigan. “He covered a programming model I had never been exposed to before and provided a solution that would target Intel GPUs, and potentially most GPUs in the future.”

“I’ll be using a lot of the ideas I’ve seen here to update material in our high-performance computing courses,” [said participant Kevin Green, a research scientist for the Department of Computer Science at the University of Saskatchewan in Canada]. “I’ve also gotten a good feel for how we can migrate our current code designs to designs that will perform well across different supercomputing architectures.”


Makes me hopeful that SYCL will trickle down fast — in research labs and universities, at least.

PS. By the way, here is a page on SYCL at Argonne's online support centre for Aurora. Lots of resources here:

 

Hitman928

Diamond Member
Apr 15, 2012
5,245
7,793
136
Here are a couple of nice quotes from researchers attending Argonne's training program for exascale computing:

“The presentation on SYCL/DPC++ by Argonne’s Thomas Applecourt ended up being the most helpful session for me,” said participant Ral Bielawski, a doctoral student in aerospace engineering at the University of Michigan. “He covered a programming model I had never been exposed to before and provided a solution that would target Intel GPUs, and potentially most GPUs in the future.”

“I’ll be using a lot of the ideas I’ve seen here to update material in our high-performance computing courses,” [said participant Kevin Green, a research scientist for the Department of Computer Science at the University of Saskatchewan in Canada]. “I’ve also gotten a good feel for how we can migrate our current code designs to designs that will perform well across different supercomputing architectures.”


Makes me hopeful that SYCL will trickle down fast — in research labs and universities, at least.

If SYCL spreads quickly in research labs and universities, it won't be long before it's incorporated industry-wide. There's a reason Nvidia sponsors so much university research and has research-lab partnerships where it spends a decent chunk of change subsidizing hardware and support to get them using CUDA.
 

DrMrLordX

Paying a third party to get the SW, tools and standards is fairly standard practice in big enterprises, it is peanuts compared to the actual cost and effort of the engineering teams.

See the post from @Hitman928; I thought NV was in the habit of providing 3rd-party support to try to get people stuck on CUDA? I mean, yeah, if you're coming from a Tegra standpoint it would make sense to buck them anyway, since the hardware is so bad. But we're talking about the segment where NV has had dominant hardware performance for a while.

Khronos did turn back the time on OpenCL. ;)

That wasn't exactly what I had in mind since it rolls back all their work on SVM. We'll see if they can do any good.
 

moinmoin

That wasn't exactly what I had in mind since it rolls back all their work on SVM. We'll see if they can do any good.
They didn't roll back "all their work on SVM", now it's just an optional part of the spec. As the article points out:

"This, as it turns out, is very similar to how Khronos has tackled Vulkan, which has been far more successful in recent years. Giving vendors some flexibility in what their API implements has allowed Vulkan to be stretched from mobile devices to the desktop, so there is some very clear, real-world evidence that this structure can work. And it’s this kind of success that the OpenCL working group would like to see as well."
 

DrMrLordX

They didn't roll back "all their work on SVM", now it's just an optional part of the spec. As the article points out:

Yeah, but that part of the spec may be orphaned if nobody uses it. Which, under the circumstances, I can see happening unless Intel builds on it as part of oneAPI.
 

beginner99

Diamond Member
Jun 2, 2009
5,210
1,580
136
They use black boxes to get their papers out vs. researching how they actually get all their crucial data, got it. ;)

By that logic they would also need to build their own CPUs and GPUs, because you can say the exact same thing about the hardware. And you also imply that researchers using SYCL would read and then try to understand all the source code, because if they don't, it's still a black box to them.

I mean, just look at how much stuff still uses Excel, yes, even for publications. By that logic all of these use a black box called Excel, often on top of a black box called Windows, which itself runs on a black box called the CPU microcode.
 

moinmoin

By that logic they would also need to build their own CPUs and GPUs, because you can say the exact same thing about the hardware.
System design is part of computer science, and that's exactly what should be happening there (and, for me, was). Many educational projects exist in that area, and RISC-V started that way.

By that logic all of these use a black box called Excel
Eww. (Seriously though, people stuffing everything into spreadsheets for no reason at all is a disease.)
 

moinmoin

Frontier is currently being installed.
 

moinmoin

Perfectly fitting this topic: a summary of the Aurora woes contrasted with Frontier:

[Image: ecp-frontier-aurora-timing.jpg]
 

Joe NYC

Perfectly fitting this topic: a summary of the Aurora woes contrasted with Frontier:

[Image: ecp-frontier-aurora-timing.jpg]

"Aurora dates not yet publicly available"

On another topic, the article speculates that the mini-Aurora (Polaris) stand-in is unlikely to be extended to the full Aurora spec; more likely it would be wholesale replaced with a full AMD system, because Polaris would need too many nodes to hit the performance targets.

Does that mean that the Nvidia A100 would lag behind both the MI200 and PVC?