News "Aurora’s Troubles Move Frontier into Pole Exascale Position" - HPCwire

Page 4 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

moinmoin

Diamond Member
Jun 1, 2017
4,948
7,656
136
With Sapphire Rapids delayed, Intel's Aurora exascale supercomputer misses yet another date. This means Frontier is now en route to becoming the first exascale supercomputer.

 

moinmoin

Diamond Member
Jun 1, 2017
4,948
7,656
136
It seems Frontier leads in raw performance too, since it takes 9,000 nodes for MI200 and PVC to get to 1.5 exaflops, but 16,000 for A100 (according to the estimates in the article). That's almost a 2:1 ratio.
The article also indirectly mentions that it's AMD's approach to MCM that made this possible: technically it's not one CPU but 8 chiplets, and not four GPUs but 8, let's call them "GPUlets". Nvidia has no product (yet) that fits the same amount of performance in that space.
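Back-of-the-envelope on those node counts (both figures are the article's estimates, not official specs):

```python
# Per-node throughput implied by the article's node-count estimates.
target_flops = 1.5e18  # 1.5 exaflops

nodes_mcm  = 9_000     # MI200 / Ponte Vecchio class nodes (article's estimate)
nodes_a100 = 16_000    # hypothetical A100-based system (article's estimate)

per_node_mcm  = target_flops / nodes_mcm  / 1e12  # TFLOPS per node
per_node_a100 = target_flops / nodes_a100 / 1e12

print(f"MCM node:  {per_node_mcm:.1f} TFLOPS")    # ~166.7
print(f"A100 node: {per_node_a100:.1f} TFLOPS")   # ~93.8
print(f"node ratio: {nodes_a100 / nodes_mcm:.2f}:1")  # ~1.78:1
```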
 

moinmoin

Diamond Member
Jun 1, 2017
4,948
7,656
136
Along with promising an ambitious node cadence despite troubled and delayed nodes in the past decade, Intel, despite having delayed Aurora multiple times, is now also pushing for Zettascale (so 1000x Exascale) by 2027-28. STH met with Raja to find out how:

I expect a big part of this to be optimization for specific formats. E.g. coming from MI100, AMD increased Matrix BF16 peak throughput from 92.3 to 383 TFLOPS on MI200. Also mentioned is the "Packed FP32" software optimization to further double FP32 throughput on MI200.
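Putting numbers on those format-specific gains; the 92.3 and 383 TFLOPS figures are from the specs cited above, while the 47.9 TFLOPS vector FP32 baseline for MI250X is my assumption:

```python
# Gain from the MI100 -> MI200 Matrix BF16 jump (figures cited above)
mi100_bf16 = 92.3    # TFLOPS, Matrix BF16 peak
mi200_bf16 = 383.0   # TFLOPS, Matrix BF16 peak
print(f"BF16 matrix gain: {mi200_bf16 / mi100_bf16:.1f}x")  # ~4.1x

# "Packed FP32": two FP32 operands packed per FP64-wide lane,
# doubling vector FP32 peak (baseline is an assumed MI250X figure)
mi200_fp32_vector = 47.9
mi200_fp32_packed = mi200_fp32_vector * 2
print(f"Packed FP32 peak: ~{mi200_fp32_packed:.1f} TFLOPS")
```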
 

NTMBK

Lifer
Nov 14, 2011
10,236
5,018
136
Along with promising an ambitious node cadence despite troubled and delayed nodes in the past decade, Intel, despite having delayed Aurora multiple times, is now also pushing for Zettascale (so 1000x Exascale) by 2027-28. STH met with Raja to find out how:

I expect a big part of this to be optimization for specific formats. E.g. coming from MI100, AMD increased Matrix BF16 peak throughput from 92.3 to 383 TFLOPS on MI200. Also mentioned is the "Packed FP32" software optimization to further double FP32 throughput on MI200.

So Raja continues to be full of it?
 

Panino Manino

Senior member
Jan 28, 2017
821
1,022
136
Along with promising an ambitious node cadence despite troubled and delayed nodes in the past decade, Intel, despite having delayed Aurora multiple times, is now also pushing for Zettascale (so 1000x Exascale) by 2027-28. STH met with Raja to find out how:

I expect a big part of this to be optimization for specific formats. E.g. coming from MI100, AMD increased Matrix BF16 peak throughput from 92.3 to 383 TFLOPS on MI200. Also mentioned is the "Packed FP32" software optimization to further double FP32 throughput on MI200.

RAJA IS BACK BABY!
But can he really do it now that he is in a place that has "unlimited" resources?
 
  • Like
Reactions: lightmanek

moinmoin

Diamond Member
Jun 1, 2017
4,948
7,656
136
So Raja continues to be full of it?
It's certainly perfectly in character for Raja, eh?

While 1000x seems outrageous, the math behind it makes it a tad more feasible, though I doubt Intel manages it by 2027:
Current Aurora (>= 2 EFLOPS) x 16 (architecture) x 2 (power/thermals) x 3 (data movement) x 5 (process nodes)

Looking at this in detail may help in imagining future developments in the server space:
  • Aurora is already said to be at half the efficiency of AMD's Frontier, so that's room Intel needs to make up anyway.
  • As I wrote before, in architecture there's a lot of wiggle room in supporting and accelerating specific formats. Nvidia currently excels at INT8; both AMD and Intel will want to catch up there. By supporting full-speed double precision in MI200, AMD built a lot of resources that still need to be put to efficient use with lower-precision formats. Packed FP32 is a first step in that direction.
  • Data movement is the old story of power consumption for uncore and I/O. Due to needing a lot of bandwidth, GPUs and their compute units are essentially the worst-case scenario for I/O. I expect more and more focus on caching (see AMD's Infinity Cache) and different packaging techniques.
  • Power/thermals and process nodes seem like one and the same thing to me. With PVC Intel kinda cheated by using TSMC; it's interesting that they still see room for a 6x improvement in 6 years there. That's where I expect the deadline to slip, though Intel is behind AMD in efficiency, so for AMD it's actually "only" 3x, which should be more feasible.
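The factor breakdown multiplies out like this (a sketch; the factors are Intel's claims, not measurements):

```python
# Intel's claimed path from ~2 EFLOPS to roughly 1 ZFLOPS
base_eflops   = 2    # "Current Aurora >= 2 EFLOPS"
architecture  = 16
power_thermal = 2
data_movement = 3
process_nodes = 5

total = base_eflops * architecture * power_thermal * data_movement * process_nodes
print(f"{total} EFLOPS ~= {total / 1000:.2f} ZFLOPS")  # 960 EFLOPS ~= 0.96 ZFLOPS
```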
 

lobz

Platinum Member
Feb 10, 2017
2,057
2,856
136
It's certainly perfectly in character for Raja, eh?

While 1000x seems outrageous, the math behind it makes it a tad more feasible, though I doubt Intel manages it by 2027:
Current Aurora (>= 2 EFLOPS) x 16 (architecture) x 2 (power/thermals) x 3 (data movement) x 5 (process nodes)

Looking at this in detail may help in imagining future developments in the server space:
  • Aurora is already said to be at half the efficiency of AMD's Frontier, so that's room Intel needs to make up anyway.
  • As I wrote before, in architecture there's a lot of wiggle room in supporting and accelerating specific formats. Nvidia currently excels at INT8; both AMD and Intel will want to catch up there. By supporting full-speed double precision in MI200, AMD built a lot of resources that still need to be put to efficient use with lower-precision formats. Packed FP32 is a first step in that direction.
  • Data movement is the old story of power consumption for uncore and I/O. Due to needing a lot of bandwidth, GPUs and their compute units are essentially the worst-case scenario for I/O. I expect more and more focus on caching (see AMD's Infinity Cache) and different packaging techniques.
  • Power/thermals and process nodes seem like one and the same thing to me. With PVC Intel kinda cheated by using TSMC; it's interesting that they still see room for a 6x improvement in 6 years there. That's where I expect the deadline to slip, though Intel is behind AMD in efficiency, so for AMD it's actually "only" 3x, which should be more feasible.
Almost like a "Poor Volta" interview.
 

NTMBK

Lifer
Nov 14, 2011
10,236
5,018
136
It's certainly perfectly in character for Raja, eh?

While 1000x seems outrageous, the math behind it makes it a tad more feasible, though I doubt Intel manages it by 2027:
Current Aurora (>= 2 EFLOPS) x 16 (architecture) x 2 (power/thermals) x 3 (data movement) x 5 (process nodes)

Looking at this in detail may help in imagining future developments in the server space:
  • Aurora is already said to be at half the efficiency of AMD's Frontier, so that's room Intel needs to make up anyway.
  • As I wrote before, in architecture there's a lot of wiggle room in supporting and accelerating specific formats. Nvidia currently excels at INT8; both AMD and Intel will want to catch up there. By supporting full-speed double precision in MI200, AMD built a lot of resources that still need to be put to efficient use with lower-precision formats. Packed FP32 is a first step in that direction.
  • Data movement is the old story of power consumption for uncore and I/O. Due to needing a lot of bandwidth, GPUs and their compute units are essentially the worst-case scenario for I/O. I expect more and more focus on caching (see AMD's Infinity Cache) and different packaging techniques.
  • Power/thermals and process nodes seem like one and the same thing to me. With PVC Intel kinda cheated by using TSMC; it's interesting that they still see room for a 6x improvement in 6 years there. That's where I expect the deadline to slip, though Intel is behind AMD in efficiency, so for AMD it's actually "only" 3x, which should be more feasible.

16x for Architecture is the one that really feels like nonsense to me. The article specified that this was for FP64, not any reduced-precision format, which makes sense, as that's the benchmark for general-purpose HPC. So where on earth is a 16x efficiency increase for FP64 going to come from architecturally? Maybe some specialized instructions to make it more efficient to operate on sparse matrices...? But come on, that isn't going to give you a 16x improvement.
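To put a number on that skepticism: even granting ideal zero-skipping hardware (a generous assumption; shipping structured-sparsity hardware gets about 2x), the sparsity required for 16x is extreme:

```python
# Under ideal zero-skipping, speedup = 1 / (1 - sparsity).
# How sparse would FP64 matrices have to be for a 16x architectural gain?
target_speedup = 16
required_sparsity = 1 - 1 / target_speedup
print(f"{required_sparsity:.2%} of operands must be skippable zeros")  # 93.75%
```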
 

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
It depends a lot on where the current bottlenecks in performance are and how they are affected by current technology advances. If the bottlenecks exist mainly in data throughput, then the current progress in DRAM performance will likely lead to a doubling of throughput there. Stacking large caches can make for multiples of performance improvement in working sets that fit in them. We expect circuit density for computational units to double at least twice in that time frame, likely giving a 4x throughput increase there. Those things alone, when combined, can, in VERY specific circumstances, lead to a more than 16x total throughput increase.
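That combination can be sketched as a toy model; the per-factor gains are the projections above, and the 20% un-accelerated fraction is an arbitrary assumption to show why the multiplication only holds in very specific circumstances:

```python
# Per-factor gains only multiply when each resource is the bottleneck
# for its share of the work; otherwise Amdahl's law limits the total.
dram_gain    = 2.0  # projected DRAM throughput doubling (assumption)
compute_gain = 4.0  # two circuit-density doublings (assumption)
cache_gain   = 2.0  # stacked caches fitting the working set (assumption)

ideal = dram_gain * compute_gain * cache_gain
print(f"ideal multiplicative speedup: {ideal:.0f}x")  # 16x

# With 20% of runtime left un-accelerated (e.g. I/O, synchronization):
serial = 0.2
amdahl = 1 / (serial + (1 - serial) / ideal)
print(f"with a 20% un-accelerated fraction: {amdahl:.1f}x")  # 4.0x
```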

What I don't see is system cooling allowing anything like the kind of improvement in performance without massively increasing the footprint of these systems.
 

lobz

Platinum Member
Feb 10, 2017
2,057
2,856
136
16x for Architecture is the one that really feels like nonsense to me. The article specified that this was for FP64, not any reduced-precision format, which makes sense, as that's the benchmark for general-purpose HPC. So where on earth is a 16x efficiency increase for FP64 going to come from architecturally? Maybe some specialized instructions to make it more efficient to operate on sparse matrices...? But come on, that isn't going to give you a 16x improvement.
They could do an inverse HT moment (again) and get not only full-speed DP, but design an architecture that brings in resources from other dimensions as well, making it double- or quadruple-speed DP.
 

moinmoin

Diamond Member
Jun 1, 2017
4,948
7,656
136
16x for Architecture is the one that really feels like nonsense to me.
"but Raja told me that Intel knows the architectural changes to scale well beyond that."

🤣 🤷

I guess the question is rather: how? Exascale is mainly possible due to the high number of GPU and CPU cores. I kinda doubt Raja is talking about staying within the same power and temperature envelope, as that's covered by power/thermals... dunno, it's really classic Raja. Better not to read too much into it and to set a reminder for 2027.
 

andermans

Member
Sep 11, 2020
151
153
76
It's certainly perfectly in character for Raja, eh?

While 1000x seems outrageous, the math behind it makes it a tad more feasible, though I doubt Intel manages it by 2027:
Current Aurora (>= 2 EFLOPS) x 16 (architecture) x 2 (power/thermals) x 3 (data movement) x 5 (process nodes)

I feel like a bunch of those are not multiplicative though. Like if you improve computation you also need to improve data movement, just to keep up.
 
  • Like
Reactions: Tlh97 and NTMBK

moinmoin

Diamond Member
Jun 1, 2017
4,948
7,656
136
I feel like a bunch of those are not multiplicative though. Like if you improve computation you also need to improve data movement, just to keep up.
The multiplication for data movement isn't with respect to bandwidth (where you're right, though bandwidth is never multiplicative; even when it's a bottleneck it's not multiplicative, just a hindrance...) but with respect to making data movement more efficient and using the saved power for computation instead.
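A toy power-budget model of that idea (all numbers are illustrative assumptions, not Aurora figures):

```python
# The "x 3 data movement" factor read as a power reallocation:
# moving bytes 3x more cheaply frees watts for compute.
node_budget_w = 500
io_share      = 0.4   # fraction of node power spent moving data (assumption)
io_efficiency = 3.0   # 3x fewer joules per byte moved (the claimed factor)

io_before      = node_budget_w * io_share        # 200 W on I/O
io_after       = io_before / io_efficiency       # ~66.7 W on I/O
compute_before = node_budget_w - io_before       # 300 W for compute
compute_after  = node_budget_w - io_after        # ~433.3 W for compute

print(f"compute power: {compute_before:.0f} W -> {compute_after:.0f} W "
      f"({compute_after / compute_before:.2f}x)")  # 300 W -> 433 W (1.44x)
```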
 

Ajay

Lifer
Jan 8, 2001
15,448
7,858
136
I agree. 6-7 years down the road? AMD may be so far ahead in servers that hardly anybody will use Intel.

Wishful thinking.
The biggest problem for AMD is getting enough wafers to actually compete with Intel in terms of volume. It's kind of the Opteron situation all over again.
In 6-7 years, Intel could be doing better if Gelsinger can actually change the culture at Intel, and if they actually get some profitable volume clients with IDM 2.0.
It surely will be a tough road for them, no doubt. I just hope they don't auger in.
 

Joe NYC

Golden Member
Jun 26, 2021
1,938
2,280
106
  • Aurora is already said to be at half the efficiency of AMD's Frontier, so that's room Intel needs to make up anyway.

If some of the rumored efficiency increases of RDNA3 transfer to CDNA3, there could be another 2x efficiency gain for MI300 vs. Aurora.

From the latest public info from the Supercomputing Conference, it seems that Aurora will be launching against El Capitan, in the same time frame. So AMD may have 4x the efficiency with El Capitan vs. Aurora.

  • Data movement is the old story of power consumption for uncore and I/O. Due to needing a lot of bandwidth, GPUs and their compute units are essentially the worst-case scenario for I/O. I expect more and more focus on caching (see AMD's Infinity Cache) and different packaging techniques.
I think Intel is quite aware of this and is working on it, as we can see from Ponte Vecchio. But based on the exotic cache configuration that Ponte Vecchio already has, I am not surprised that it did not make greater efficiency gains.


  • Power/thermals and process nodes seem like one and the same thing to me. With PVC Intel kinda cheated by using TSMC; it's interesting that they still see room for a 6x improvement in 6 years there. That's where I expect the deadline to slip, though Intel is behind AMD in efficiency, so for AMD it's actually "only" 3x, which should be more feasible.

Intel is apparently using TSMC N7 and N5. I would assume that N5 is being used for the densest compute part. So yeah, if Intel is already taking advantage of that (and AMD is not, for comparison), a huge further gain on top of TSMC N5 will be quite challenging in such a short time frame.
 
  • Like
Reactions: Tlh97