Question Here comes A64FX: Fugaku is now the fastest supercomputer in the world

Hitman928

Diamond Member
Apr 15, 2012
5,237
7,785
136
I think he was kidding.

A64FX is a pretty direct variation of the same microarchitecture Fujitsu has been iterating on since SPARC64 V, across three different instruction sets. (GS, SPARC, ARM)

You would think he was kidding, but...

Keller left AMD in 1999 when AMD canceled his new big K8a core based on Alpha EV8 (EV8 was a super-wide core with SMT4, and Keller was an ex-Alpha engineer). Unfortunately, AMD decided just to evolve the K7 core and integrate the memory controller into the CPU, as Alpha EV7 did. Then Keller was at P.A. Semi from its beginning, which was later bought by Apple, and laid down the first independent Apple uarch, A6 (2xALU, OoO, 32-bit ARM), and A7 Cyclone (the first 64-bit ARM core ever, a pretty powerful 4xALU OoO core, similar to Intel Haswell released the same year, 2013, so yes, Apple has had a very competitive state-of-the-art core like Intel's since 2013). Keller left Apple in 2012, one year before the A7 release (though it had surely taped out), and left A8, A9 and A10 in development. He probably set goals for the 6xALU monster A11 Monsoon family, including AMX and SVE. Then he decided to help AMD return to the top and build a super-wide core with modern SIMD and SMT4 like EV8. Obviously he decided to create a hybrid of A7 and Bulldozer first (Zen1) and then, for the next uarch, to choose the ARM ISA, a super-wide 6xALU+SVE/AMX core like Apple's... and finally to implement the main feature of the revolutionary EV8, SMT4. But AMD staff were scared by the parameters he had chosen for K12; they thought he was already risking a lot by deciding that Zen1 would be 4xALU (remember, in 2012 there was no 4xALU core on the market; Haswell and Apple A7 came in 2013). The K12 spec was sci-fi, like the original K8a before its cancellation. So later on K12 was cut down to 4xALU and SVE and then sold to Fujitsu. Zen3 is an x86 version of K12 lite, so probably still 4xALU but with powerful FPUs similar to A64FX (no surprise, it has the same roots; some Zen3 leak also mentioned a 50% IPC jump in FPU loads, confirming that). Since Fujitsu A64FX doesn't have SMT4, it looks like SMT4 was cut from Zen3 as well.
At Intel, Keller is responsible for the chiplet design of Alder Lake (8 big Golden Cove cores and 8 small Gracemont cores active out of a 16-core Gracemont chiplet; fully 16-core-capable chiplets will be used in the Snow Ridge server CPU platform). So yes, Apple's 6xALU core family, SVE, AMX, Fujitsu A64FX, Zen3/Zen4 and chiplet Alder Lake are all Jim Keller's babies, maybe.

Still no sources.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,673
3,793
136
Paragraphs. Use them. Otherwise I will not read this.
 

SarahKerrigan

Senior member
Oct 12, 2014
353
506
136
No. A64FX is a very clear evolution of what Fujitsu was already building. It looks almost exactly like XIfx at a microarchitectural level, just enhanced. Fujitsu has a very clear uarch family starting from SPARC64 V, and they have iterated on it for specific products in the SPARC64, SPARC64fx (HPC chips prior to A64FX), GS, and now A64FX family.

A64FX also isn't particularly oriented toward general purpose loads. It is nothing like K12, lol.

a64fx_pipeline.png
xifx_pipeline.png
 
IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Theoretically, had Intel's process advantage survived, we would still have AVX512-based Phi products out there doing essentially the same thing, but in the end Phi was still never all THAT great compared to GPGPU options.

The problem with Phi was that its cores were too narrow, which bottlenecked it in real-world HPC applications, and even in Linpack.

Had Intel's process been more in line, a similar Xeon line may have been possible. Fujitsu's A64FX gets its Flops with only 48 cores at 2.2 GHz.

It's likely a better system than Summit because it's a CPU, but the perf/watt doesn't really improve, and Summit is nearly 2 years old.
 

SarahKerrigan

Senior member
Oct 12, 2014
353
506
136
The problem with Phi was that its cores were too narrow, which bottlenecked it in real-world HPC applications, and even in Linpack.

Had Intel's process been more in line, a similar Xeon line may have been possible. Fujitsu's A64FX gets its Flops with only 48 cores and 2.2GHz.

It's likely a better system than Summit but the perf/watt doesn't really improve and Summit is nearly 2 years old.

Perf/W in Linpack didn't improve. In HPCG performance went up by several times, so if it holds to ~28MW for Fugaku and ~10MW for Summit, that's still a win. I suspect a lot of it will come down to application performance, and that's something Linpack doesn't tell us. (I expect it to be generally both significantly faster and at least somewhat more efficient than Summit, but there are realistically going to be apps that are friendly enough to GPUs that the efficiency win doesn't materialize.)
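To make the break-even point in that argument concrete, here is a minimal sketch. It assumes the rough power figures quoted above (~28 MW and ~10 MW); the speedup values fed in are hypothetical inputs, not measured HPCG results, and `efficiency_ratio` is just an illustrative helper name.

```python
# Rough power figures as quoted in the post above (assumed, not measured here).
FUGAKU_MW = 28.0
SUMMIT_MW = 10.0


def efficiency_ratio(speedup: float, power_new_mw: float, power_old_mw: float) -> float:
    """Return perf/W of the new system relative to the old one.

    `speedup` is how many times faster the new system runs a given
    benchmark or application; a result above 1.0 means better perf/W.
    """
    return speedup / (power_new_mw / power_old_mw)


# The speedup at which perf/W is exactly equal is simply the power ratio.
break_even = FUGAKU_MW / SUMMIT_MW

# Hypothetical speedups, chosen only to show both sides of the break-even point.
for speedup in (2.0, break_even, 4.0):
    ratio = efficiency_ratio(speedup, FUGAKU_MW, SUMMIT_MW)
    print(f"{speedup:.1f}x faster -> {ratio:.2f}x perf/W vs. Summit")
```

In other words, any workload where Fugaku is more than about 2.8x faster than Summit is also an efficiency win at those power levels; anything below that is faster but less efficient.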
 

DrMrLordX

Lifer
Apr 27, 2000
21,616
10,823
136
So yes, Apple's 6xALU core family, SVE, AMX, Fujitsu A64FX, Zen3/Zen4 and chiplet Alder Lake are all Jim Keller's babies, maybe.

Keller had nothing to do with Golden Cove. It was too far into development by the time he joined for him to be directly responsible for it. Its successor? Sure.

The problem with Phi was that its cores were too narrow, which bottlenecked it in real-world HPC applications, and even in Linpack.

I guess? Nearly every x86 design ever has been too narrow compared to GPGPU compute accelerators.
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
No. A64FX is a very clear evolution of what Fujitsu was already building. It looks almost exactly like XIfx at a microarchitectural level, just enhanced. Fujitsu has a very clear uarch family starting from SPARC64 V, and they have iterated on it for specific products in the SPARC64, SPARC64fx (HPC chips prior to A64FX), GS, and now A64FX family.

A64FX also isn't particularly oriented toward general purpose loads. It is nothing like K12, lol.

Yeah, I know. That was a joke about K12 being sold to Fujitsu ;)
I also joked that Keller is responsible for the 6xALU A11 Monsoon family and for the SVE and AMX instruction extensions too.

On a serious note: if AMD hadn't canceled K12, they could be competing against Fujitsu's A64FX today. And maybe the fastest supercomputer would be AMD's. I guess AMD would have gotten access to the SVE/SVE2 instruction set at an early stage, which means K12 would have had SVE too.

The timeline was:
  • Fujitsu A64FX manufactured 2020
  • A64FX start of development 2016
  • SVE specifications 2014-2016
  • AMD K12 was canceled 2015

Maybe Keller wanted to adopt SVE for K12 and rework the FPUs?
Maybe Keller also wanted to adopt SVE for Zen3 (the x86 sister core of K12), and AMD management didn't have enough courage to step out of the shadow of Intel's AVX (like they did with the AMD64 extension)?

Either way, canceling K12 was a horrible, horrible mistake by AMD. They did this to Keller for the second time (the first was canceling his much more ambitious K8a in 1999), and it turned out that Keller was right in both cases.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
I guess? Nearly every x86 design ever has been too narrow compared to GPGPU compute accelerators.

No, Phi targeted applications that were meant for server CPUs, but with more vectors and more threads. You are talking about something different, which is vector width.

In those very applications it was narrow. The 2-issue unit limited performance. So an ideal Phi CPU would improve both scalar and vector performance significantly in every generation.
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
The article in OP had several mentions of Fugaku helping with "Society 5.0" which sounded odd to me, so this is what I got:


Very Japanese approach to life if that's your thing.
 

myocardia

Diamond Member
Jun 21, 2003
9,291
30
91
Very impressive, though I am a bit surprised at the power use.
I just read the article, then read it again looking specifically for mention of power usage, and I saw nothing at all concerning power usage. Where did you see mention of power usage, in a separate article? If so, mind linking it for us?
 

myocardia

Diamond Member
Jun 21, 2003
9,291
30
91
My dual Celeron system can outperform that.

1,000,000 years Celeron vs. 1 second Fugaku
I assume you mean the 300A Celerons? Running at 450 MHz, of course. Hmm, that means the in-order 1.6 GHz Intel Atom CPU in an MSI Wind could complete the same task that the dual 450 MHz 300A CPUs could, although it would take more like 10,000,000 years, instead of the speedy 1,000,000 years that the Celerons would take. ;)
 

myocardia

Diamond Member
Jun 21, 2003
9,291
30
91
It's listed in the top 500 list.

Thanks, I had clicked on that link but only read the article. I had not noticed there was a list below the article, and wow, 28,300 kW is crazy, especially for a system not running any GPUs, even more so since they have not finished adding all of the nodes they're planning on having. They're going to be right at 30 MW with all of the nodes. Still, once you realize it is up to 2.8X as fast as the 2nd-place system, the power numbers become much more in line with expectations.
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
Thanks, I had clicked on that link but only read the article. I had not noticed there was a list below the article, and wow, 28,300 kW is crazy, especially for a system not running any GPUs, even more so since they have not finished adding all of the nodes they're planning on having. They're going to be right at 30 MW with all of the nodes. Still, once you realize it is up to 2.8X as fast as the 2nd-place system, the power numbers become much more in line with expectations.
I'd like to see how Fugaku's efficiency stands against CPU-only Xeon/Epyc systems:
  • Fugaku (ARM A64FX, SVE): Rmax 414,530 TFlops / 28,335 kW, efficiency 14.63 TFlops/kW
  • Summit (Power9 + Volta GV100): Rmax 148,600 TFlops / 10,096 kW, efficiency 14.72 TFlops/kW
  • Selene (Epyc 7742 + Ampere A100): Rmax 27,580 TFlops / 1,344 kW, efficiency 20.52 TFlops/kW
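For anyone who wants to re-run the arithmetic, a quick sketch follows. The Rmax and power numbers are simply the ones listed in this post, not values pulled from any API, and the names `systems` and `tflops_per_kw` are made up for illustration. Note that 1 TFlops/kW is the same thing as 1 GFlops/W.

```python
# (Rmax in TFlops, power in kW), exactly as listed in the post above.
systems = {
    "Fugaku": (414_530, 28_335),
    "Summit": (148_600, 10_096),
    "Selene": (27_580, 1_344),
}


def tflops_per_kw(rmax_tflops: float, power_kw: float) -> float:
    """Linpack efficiency; numerically identical to GFlops/W."""
    return rmax_tflops / power_kw


# Efficiency per system, rounded to two decimals like the list above.
efficiency = {
    name: round(tflops_per_kw(rmax, kw), 2)
    for name, (rmax, kw) in systems.items()
}

# How much faster Fugaku's Rmax is than Summit's.
fugaku_vs_summit = systems["Fugaku"][0] / systems["Summit"][0]

for name, value in efficiency.items():
    print(f"{name}: {value:.2f} TFlops/kW")
print(f"Fugaku / Summit Rmax: {fugaku_vs_summit:.1f}x")
```

With these inputs the rounded results match the three efficiency figures in the list, and the Rmax ratio comes out at roughly the 2.8X mentioned in the quote.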

That's damn good efficiency for the CPU-only Fugaku. That's huge competition for GPUs in terms of price. Nvidia is way overpriced IMHO. In 2021 ARM's new Matterhorn core line-up is coming with SVE2 vector SIMD, so maybe we will see some Matterhorn-based supercomputers too. Who knows, maybe ARM is preparing not only the A58, A79 and X2 cores. Maybe ARM will release an F2 core with wider FPUs specifically for supercomputers (based on X2).

  • Cortex X1: 4x128-bit NEON
  • Cortex X2 could have 4x256-bit SVE2
  • Cortex F2 could have 4x1024-bit SVE2 (4x faster than Fugaku's A64FX with 2x512-bit SVE)

That would massacre Nvidia based supercomputers.
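The "4x" figure is just a ratio of total SIMD bits per cycle, which can be sketched as below. This assumes every pipe can issue one FP64 fused multiply-add per lane per cycle (counted as 2 flops) and ignores clocks, memory bandwidth and everything else. The X2 and F2 entries are the speculative configurations from this post, not real products, and `SimdConfig` is an illustrative name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimdConfig:
    name: str
    pipes: int       # number of SIMD pipelines per core
    width_bits: int  # vector width of each pipeline

    def fp64_lanes(self) -> int:
        """Total 64-bit lanes across all pipes."""
        return self.pipes * self.width_bits // 64

    def fp64_flops_per_cycle(self) -> int:
        """Peak FP64 flops per cycle per core, assuming one FMA (2 flops) per lane."""
        return 2 * self.fp64_lanes()


a64fx = SimdConfig("A64FX", pipes=2, width_bits=512)
configs = [
    SimdConfig("Cortex X1", pipes=4, width_bits=128),
    SimdConfig("Cortex X2 (speculative)", pipes=4, width_bits=256),
    SimdConfig("Cortex F2 (speculative)", pipes=4, width_bits=1024),
]

for cfg in configs:
    rel = cfg.fp64_flops_per_cycle() / a64fx.fp64_flops_per_cycle()
    print(f"{cfg.name}: {cfg.fp64_flops_per_cycle()} flops/cycle, {rel:.1f}x A64FX")

# Per-chip peak for A64FX with the 48 cores at 2.2 GHz mentioned earlier in the thread.
a64fx_peak_gflops = a64fx.fp64_flops_per_cycle() * 48 * 2.2
print(f"A64FX peak: {a64fx_peak_gflops:.1f} GFlops per chip")
```

Under those assumptions a 4x1024-bit core would have 4x the per-cycle FP64 throughput of A64FX, while a 4x256-bit core would only match it per cycle.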