News "Aurora’s Troubles Move Frontier into Pole Exascale Position" - HPCwire

Page 4 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

moinmoin

Diamond Member
Jun 1, 2017
4,948
7,656
136
With Sapphire Rapids delayed, Intel's Aurora exascale supercomputer misses yet another date. This means Frontier is now en route to becoming the first exascale supercomputer.

 

moinmoin

Diamond Member
Jun 1, 2017
4,948
7,656
136
It seems Frontier leads in raw performance too, since it takes 9,000 nodes for MI200 and PVC to get to 1.5 exaflops, but 16,000 for A100 (according to the estimates in the article). That's almost a 2:1 ratio.
The article also indirectly mentions that it's AMD's approach to MCM that made this possible: technically it's not one CPU but 8 chiplets, and not four GPUs but 8, let's call them "GPUlets". Nvidia has no product (yet) that fits the same amount of performance in that space.
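Back-of-the-envelope on those node counts (both figures are the article's estimates, not official specs):

```python
# Per-node throughput implied by the article's node-count estimates.
target_flops = 1.5e18  # 1.5 exaflops

nodes_mcm  = 9_000     # MI200 / Ponte Vecchio class nodes (article's estimate)
nodes_a100 = 16_000    # hypothetical A100-based system (article's estimate)

per_node_mcm  = target_flops / nodes_mcm  / 1e12  # TFLOPS per node
per_node_a100 = target_flops / nodes_a100 / 1e12

print(f"MCM node:  {per_node_mcm:.1f} TFLOPS")    # ~166.7
print(f"A100 node: {per_node_a100:.1f} TFLOPS")   # ~93.8
print(f"node ratio: {nodes_a100 / nodes_mcm:.2f}:1")  # ~1.78:1
```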
 

moinmoin

Diamond Member
Jun 1, 2017
4,948
7,656
136
Along with promising an ambitious node cadence despite troubled and delayed nodes in the past decade, Intel, despite having delayed Aurora multiple times, is now also pushing for Zettascale (so 1000x Exascale) by 2027-28. STH met with Raja to find out how:

I expect a big part of this to be optimization for specific formats. E.g. coming from MI100, AMD increased Matrix BF16 peak throughput from 92.3 to 383 TFLOPS on MI200. Also mentioned is the "Packed FP32" software optimization to further double FP32 throughput on MI200.
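Putting numbers on those format-specific gains; the 92.3 and 383 TFLOPS figures are from the specs cited above, while the 47.9 TFLOPS vector FP32 baseline for MI250X is my assumption:

```python
# Gain from the MI100 -> MI200 Matrix BF16 jump (figures cited above)
mi100_bf16 = 92.3    # TFLOPS, Matrix BF16 peak
mi200_bf16 = 383.0   # TFLOPS, Matrix BF16 peak
print(f"BF16 matrix gain: {mi200_bf16 / mi100_bf16:.1f}x")  # ~4.1x

# "Packed FP32": two FP32 operands packed per FP64-wide lane,
# doubling vector FP32 peak (baseline is an assumed MI250X figure)
mi200_fp32_vector = 47.9
mi200_fp32_packed = mi200_fp32_vector * 2
print(f"Packed FP32 peak: ~{mi200_fp32_packed:.1f} TFLOPS")
```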
 

NTMBK

Lifer
Nov 14, 2011
10,236
5,018
136
Along with promising an ambitious node cadence despite troubled and delayed nodes in the past decade, Intel, despite having delayed Aurora multiple times, is now also pushing for Zettascale (so 1000x Exascale) by 2027-28. STH met with Raja to find out how:

I expect a big part of this to be optimization for specific formats. E.g. coming from MI100, AMD increased Matrix BF16 peak throughput from 92.3 to 383 TFLOPS on MI200. Also mentioned is the "Packed FP32" software optimization to further double FP32 throughput on MI200.

So Raja continues to be full of it?
 

Panino Manino

Senior member
Jan 28, 2017
821
1,022
136
Along with promising an ambitious node cadence despite troubled and delayed nodes in the past decade, Intel, despite having delayed Aurora multiple times, is now also pushing for Zettascale (so 1000x Exascale) by 2027-28. STH met with Raja to find out how:

I expect a big part of this to be optimization for specific formats. E.g. coming from MI100, AMD increased Matrix BF16 peak throughput from 92.3 to 383 TFLOPS on MI200. Also mentioned is the "Packed FP32" software optimization to further double FP32 throughput on MI200.

RAJA IS BACK BABY!
But can he really do it now that he is in a place that has "unlimited" resources?
 
  • Like
Reactions: lightmanek

moinmoin

Diamond Member
Jun 1, 2017
4,948
7,656
136
So Raja continues to be full of it?
It's certainly perfectly in character for Raja, eh?

While 1000x seems outrageous, the math behind it makes it a tad more feasible, though I doubt Intel manages it by 2027:
Current Aurora (>= 2 EFLOPS) x 16 (architecture) x 2 (power/thermals) x 3 (data movement) x 5 (process nodes)

Looking at this in detail may help in imagining future developments in the server space:
  • Aurora is already said to be at half the efficiency of AMD's Frontier, so that's room Intel needs to make up anyway.
  • As I wrote before, in architecture there's a lot of wiggle room in supporting and accelerating specific formats. Nvidia currently excels at INT8; both AMD and Intel will want to catch up there. By supporting full-speed double precision in MI200, AMD built a lot of resources that still need to be put to efficient use with lower-precision formats. Packed FP32 is a first step in that direction.
  • Data movement is the old story of power consumption for uncore and I/O. Due to needing a lot of bandwidth, GPUs and their compute units are essentially the worst-case scenario for I/O. I expect more and more focus on caching (see AMD's Infinity Cache) and different packaging techniques.
  • Power/thermals and process nodes seem like one and the same thing to me. With PVC Intel kinda cheated by using TSMC; it's interesting that they still see room for a 6x improvement in 6 years there. That's where I expect the deadline to slip, though Intel is behind AMD in efficiency, so for AMD it's actually "only" 3x, which should be more feasible.
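The factor breakdown multiplies out like this (a sketch; the factors are Intel's claims, not measurements):

```python
# Intel's claimed path from ~2 EFLOPS to roughly 1 ZFLOPS
base_eflops   = 2    # "Current Aurora >= 2 EFLOPS"
architecture  = 16
power_thermal = 2
data_movement = 3
process_nodes = 5

total = base_eflops * architecture * power_thermal * data_movement * process_nodes
print(f"{total} EFLOPS ~= {total / 1000:.2f} ZFLOPS")  # 960 EFLOPS ~= 0.96 ZFLOPS
```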
 

lobz

Platinum Member
Feb 10, 2017
2,057
2,856
136
It's certainly perfectly in character for Raja, eh?

While 1000x seems outrageous, the math behind it makes it a tad more feasible, though I doubt Intel manages it by 2027:
Current Aurora (>= 2 EFLOPS) x 16 (architecture) x 2 (power/thermals) x 3 (data movement) x 5 (process nodes)

Looking at this in detail may help in imagining future developments in the server space:
  • Aurora is already said to be at half the efficiency of AMD's Frontier, so that's room Intel needs to make up anyway.
  • As I wrote before, in architecture there's a lot of wiggle room in supporting and accelerating specific formats. Nvidia currently excels at INT8; both AMD and Intel will want to catch up there. By supporting full-speed double precision in MI200, AMD built a lot of resources that still need to be put to efficient use with lower-precision formats. Packed FP32 is a first step in that direction.
  • Data movement is the old story of power consumption for uncore and I/O. Due to needing a lot of bandwidth, GPUs and their compute units are essentially the worst-case scenario for I/O. I expect more and more focus on caching (see AMD's Infinity Cache) and different packaging techniques.
  • Power/thermals and process nodes seem like one and the same thing to me. With PVC Intel kinda cheated by using TSMC; it's interesting that they still see room for a 6x improvement in 6 years there. That's where I expect the deadline to slip, though Intel is behind AMD in efficiency, so for AMD it's actually "only" 3x, which should be more feasible.
Almost like a "Poor Volta" interview.
 

NTMBK

Lifer
Nov 14, 2011
10,236
5,018
136
It's certainly perfectly in character for Raja, eh?

While 1000x seems outrageous, the math behind it makes it a tad more feasible, though I doubt Intel manages it by 2027:
Current Aurora (>= 2 EFLOPS) x 16 (architecture) x 2 (power/thermals) x 3 (data movement) x 5 (process nodes)

Looking at this in detail may help in imagining future developments in the server space:
  • Aurora is already said to be at half the efficiency of AMD's Frontier, so that's room Intel needs to make up anyway.
  • As I wrote before, in architecture there's a lot of wiggle room in supporting and accelerating specific formats. Nvidia currently excels at INT8; both AMD and Intel will want to catch up there. By supporting full-speed double precision in MI200, AMD built a lot of resources that still need to be put to efficient use with lower-precision formats. Packed FP32 is a first step in that direction.
  • Data movement is the old story of power consumption for uncore and I/O. Due to needing a lot of bandwidth, GPUs and their compute units are essentially the worst-case scenario for I/O. I expect more and more focus on caching (see AMD's Infinity Cache) and different packaging techniques.
  • Power/thermals and process nodes seem like one and the same thing to me. With PVC Intel kinda cheated by using TSMC; it's interesting that they still see room for a 6x improvement in 6 years there. That's where I expect the deadline to slip, though Intel is behind AMD in efficiency, so for AMD it's actually "only" 3x, which should be more feasible.

16x for Architecture is the one that really feels like nonsense to me. The article specified that this was for FP64, not any reduced-precision format, which makes sense, as that's the benchmark for general-purpose HPC. So where on earth is a 16x efficiency increase for FP64 going to come from architecturally? Maybe some specialized instructions to make it more efficient to operate on sparse matrices...? But come on, that isn't going to give you a 16x improvement.
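To put a number on that skepticism: even granting ideal zero-skipping hardware (a generous assumption; shipping structured-sparsity hardware gets about 2x), the sparsity required for 16x is extreme:

```python
# Under ideal zero-skipping, speedup = 1 / (1 - sparsity).
# How sparse would FP64 matrices have to be for a 16x architectural gain?
target_speedup = 16
required_sparsity = 1 - 1 / target_speedup
print(f"{required_sparsity:.2%} of operands must be skippable zeros")  # 93.75%
```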
 

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
It depends a lot on where the current bottlenecks in performance are and how they are affected by current technology advances. If the bottlenecks exist mainly in data throughput, then the current progress in DRAM performance will likely lead to a doubling of throughput there. Stacking large caches can make for multiples of performance improvement in working sets that fit in them. We expect circuit density for computational units to double at least twice in that time frame, likely giving a 4x throughput increase there. Those things alone, when combined, can, in VERY specific circumstances, lead to a more than 16x total throughput increase.
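That combination can be sketched as a toy model; the per-factor gains are the projections above, and the 20% un-accelerated fraction is an arbitrary assumption to show why the multiplication only holds in very specific circumstances:

```python
# Per-factor gains only multiply when each resource is the bottleneck
# for its share of the work; otherwise Amdahl's law limits the total.
dram_gain    = 2.0  # projected DRAM throughput doubling (assumption)
compute_gain = 4.0  # two circuit-density doublings (assumption)
cache_gain   = 2.0  # stacked caches fitting the working set (assumption)

ideal = dram_gain * compute_gain * cache_gain
print(f"ideal multiplicative speedup: {ideal:.0f}x")  # 16x

# With 20% of runtime left un-accelerated (e.g. I/O, synchronization):
serial = 0.2
amdahl = 1 / (serial + (1 - serial) / ideal)
print(f"with a 20% un-accelerated fraction: {amdahl:.1f}x")  # 4.0x
```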

What I don't see is system cooling allowing anything like the kind of improvement in performance without massively increasing the footprint of these systems.
 

lobz

Platinum Member
Feb 10, 2017
2,057
2,856
136
16x for Architecture is the one that really feels like nonsense to me. The article specified that this was for FP64, not any reduced-precision format, which makes sense, as that's the benchmark for general-purpose HPC. So where on earth is a 16x efficiency increase for FP64 going to come from architecturally? Maybe some specialized instructions to make it more efficient to operate on sparse matrices...? But come on, that isn't going to give you a 16x improvement.
They could do an inverse HT moment (again) and get not only full-speed DP, but design an architecture that brings in resources from other dimensions as well, making it double- or quadruple-speed DP.
 

moinmoin

Diamond Member
Jun 1, 2017
4,948
7,656
136
16x for Architecture is the one that really feels like nonsense to me.
"but Raja told me that Intel knows the architectural changes to scale well beyond that."

🤣 🤷

I guess the question is rather: how? Exascale is mainly possible due to the high number of GPU and CPU cores. I kinda doubt Raja is talking about staying within the same power and temperature envelope, as that's covered by power/thermals... dunno, it's really classic Raja. Better not to read too much into it and to set a reminder for 2027.
 

andermans

Member
Sep 11, 2020
151
153
76
It's certainly perfectly in character for Raja, eh?

While 1000x seems outrageous, the math behind it makes it a tad more feasible, though I doubt Intel manages it by 2027:
Current Aurora (>= 2 EFLOPS) x 16 (architecture) x 2 (power/thermals) x 3 (data movement) x 5 (process nodes)

I feel like a bunch of those are not multiplicative though. Like if you improve computation you also need to improve data movement, just to keep up.
 
  • Like
Reactions: Tlh97 and NTMBK

moinmoin

Diamond Member
Jun 1, 2017
4,948
7,656
136
I feel like a bunch of those are not multiplicative though. Like if you improve computation you also need to improve data movement, just to keep up.
The multiplication for data movement isn't with respect to bandwidth (where you're right, though bandwidth is never multiplicative; even when it's a bottleneck it's not multiplicative, just a hindrance...) but with respect to making data movement more efficient and using the saved power for computation instead.
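A toy power-budget model of that idea (all numbers are illustrative assumptions, not Aurora figures):

```python
# The "x 3 data movement" factor read as a power reallocation:
# moving bytes 3x more cheaply frees watts for compute.
node_budget_w = 500
io_share      = 0.4   # fraction of node power spent moving data (assumption)
io_efficiency = 3.0   # 3x fewer joules per byte moved (the claimed factor)

io_before      = node_budget_w * io_share        # 200 W on I/O
io_after       = io_before / io_efficiency       # ~66.7 W on I/O
compute_before = node_budget_w - io_before       # 300 W for compute
compute_after  = node_budget_w - io_after        # ~433.3 W for compute

print(f"compute power: {compute_before:.0f} W -> {compute_after:.0f} W "
      f"({compute_after / compute_before:.2f}x)")  # 300 W -> 433 W (1.44x)
```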
 

Ajay

Lifer
Jan 8, 2001
15,448
7,858
136
I agree. 6-7 years down the road? AMD may be so far ahead in servers that hardly anybody will use Intel.

Wishful thinking.
The biggest problem for AMD is getting enough wafers to actually compete with Intel in terms of volume. It's kind of the Opteron situation all over again.
In 6-7 years, Intel could be doing better if Gelsinger can actually change the culture at Intel, and if they actually get some profitable volume clients with IDM 2.0.
It surely will be a tough road for them, no doubt. I just hope they don't auger in.
 

Joe NYC

Golden Member
Jun 26, 2021
1,938
2,280
106
  • Aurora is already said to be at half the efficiency of AMD's Frontier, so that's room Intel needs to make up anyway.

If some of the rumored efficiency increases of RDNA3 transfer to CDNA3, there could be another 2x efficiency gain for MI300 vs. Aurora.

From the latest public info from the Supercomputing Conference, it seems that Aurora will be launching against El Capitan, in the same time frame. So AMD may have 4x the efficiency with El Capitan vs. Aurora.

  • Data movement is the old story of power consumption for uncore and I/O. Due to needing a lot of bandwidth, GPUs and their compute units are essentially the worst-case scenario for I/O. I expect more and more focus on caching (see AMD's Infinity Cache) and different packaging techniques.
I think Intel is quite aware of this and is working on it, as we can see from Ponte Vecchio. But based on the exotic cache configuration that Ponte Vecchio already has, I am not surprised that it did not make greater efficiency gains.


  • Power/thermals and process nodes seem like one and the same thing to me. With PVC Intel kinda cheated by using TSMC; it's interesting that they still see room for a 6x improvement in 6 years there. That's where I expect the deadline to slip, though Intel is behind AMD in efficiency, so for AMD it's actually "only" 3x, which should be more feasible.

Intel is apparently using TSMC N7 and N5. I would assume that N5 is being used for the densest compute part. So yeah, if Intel is already taking advantage of that (and AMD is not, for comparison), a huge further gain on top of TSMC N5 will be quite challenging in such a short time frame.
 
  • Like
Reactions: Tlh97