Question Here comes A64FX: Fugaku is now the fastest supercomputer in the world

DrMrLordX · Jun 23, 2020

Japan’s Fugaku gains title as world’s fastest supercomputer

www.riken.jp

Summit has been dethroned by A64FX. That's a heck of a linpack score, doncha think? And it did that without even using all of its nodes.

Hitman928 · Jun 23, 2020

Very impressive, though I am a bit surprised at the power use.

Frontier is supposed to come online next year I believe which will replace this at the top spot with 3X the flops and El Capitan a year after that at 4X the flops, at least on paper. It will be interesting to compare the power use and actual linpack score (if they submit it) when the new ones come online.

lyonwonder · Jun 23, 2020

And the naming makes me think of AMD's decade old Athlon 64 FX and not modern ARM CPU's

ThatBuzzkiller · Jun 23, 2020

Hitman928 said:
Frontier is supposed to come online next year I believe which will replace this at the top spot with 3X the flops and El Capitan a year after that at 4X the flops, at least on paper. It will be interesting to compare the power use and actual linpack score (if they submit it) when the new ones come online.

Frontier will also usher us into the new era of heterogeneous compute which will be the future of high performance computing.

Powerful CPU-GPU communication features as offered with ROCm/oneAPI will enable new compute offloading scenarios ...

IntelUser2000 · Jun 23, 2020

DrMrLordX said:
Japan’s Fugaku gains title as world’s fastest supercomputer

www.riken.jp

Summit has been dethroned by A64FX. That's a heck of a linpack score, doncha think? And it did that without even using all of its nodes.

What's cool is it does this without using a GPU.

mikegg · Jun 24, 2020

Hitman928 said:
Very impressive, though I am a bit surprised at the power use.

Frontier is supposed to come online next year I believe which will replace this at the top spot with 3X the flops and El Capitan a year after that at 4X the flops, at least on paper. It will be interesting to compare the power use and actual linpack score (if they submit it) when the new ones come online.

Frontier is using GPUs so it’s not a fair one to one comparison.

GPUs significantly add to the flops.

AnandThenMan · Jun 24, 2020

My dual Celeron system can outperform that.

1,000,000 years Celeron vs. 1 second Fugaku

Hitman928 · Jun 24, 2020

senttoschool said:
Frontier is using GPUs so it’s not a fair one to one comparison.

GPUs significantly add to the flops.

Yes it's not 1:1 but I don't see how it's not fair. I understand your point and agree to a point, but it's not like A64X is a vanilla ARM core, they added ML instructions and SVE 512 bit pathways to make it a competitive HPC device. It was a system design choice to not include GPUs. I'm not going to claim to be any kind of expert on supercomputers, but I'm sure that having a single programming model was a big motivation here, but this system will be compared to other recent and soon to come online supercomputers. AMD and Intel are trying to bring these heterogeneous programming models that @ThatBuzzkiller mentioned to bridge the GPU and CPU compute capabilities. Not just performance but also performance per total system power will be interesting to see because the A64X supercomputer didn't move the needle in this regard. It's actually a pretty exciting time in CPU/server tech right now with the most legit players we've seen in a long time. Should be a fun ride.

DrMrLordX · Jun 24, 2020

IntelUser2000 said:
What's cool is it does this without using a GPU.

Indeed. Fujitsu has touted A64FX as a serious competitor to GPU/compute cards. Intel mostly failed to compete with AVX512 against compute accelerators. It seems like Fujitsu has had more luck with SVE. Seems. There would have to be perf/watt and perf/area comparisons to really know what is the most-efficient way to deploy a supercomputer.

DisEnchantment · Jun 24, 2020

I hope Intel or AMD will not try to focus only on vector instructions on the main CPU core.
Fujitsu is taking this to the extreme because this chip is tailor made for this purpose, they want to have a lot of vector compute without going GPU and therefore improve programmability.
On the other hand, AMD and Intel make chips for general computing.

Intel has been trying to push a lot of vector instructions in CPU which may, imo, comes down to not having GPU accelerators and ecosystem unlike AMD which is not too keen to invest in things like AVX512 for example.
This brings no additional value for the average desktop for everyday compute in most cases.
Because of this focus, there may be not enough additional silicon die area available on chip and/or lack of focus in other things which could benefit desktop use a lot more.

I think the new die stacking and HSA architectures will alleviate this problem greatly on x86.
For AMD at least, it will allow them to focus the main core for general compute and the vector cores can be tacked on as an additional chiplets for specific SKUs.
The coherent fabric will ensure programmability is taken care and the best of both worlds are taken in a single package.

For Intel it will allow massive acceleration using the Altera FPGAs since CXL is also designed to be cache coherent.
The best part of Intel's solution is that they can even choose application specific LUTs which can be so specific it can accelerate these algorithms by factors of magnitude.

Anyway, I am really not in favour of these approaches of having too much Vector units in CPU. For a specific deployment, sure, but not as a general purpose CPU. Cache coherent HSA is way to go.
Right now Milan is not really that exciting for me unless it brings something else to the table.
There is improvement but it is not really exciting.

3D Stacked Chiplets, HSA, FPGAs etc would be a welcome change to enable new use cases. We are stuck with same programming paradigms, same frameworks, same use cases for quite a bit now.
The last few inflection points was like when we had the FPU and then GPU. It is long overdue for a new paradigm shift.

It is an interesting approach nonetheless whatever means they take take to achieve the ends.

These ARM, SPARC, RISC-V or whatever chips are interesting from technical standpoint. However they are not something that I(we) would get to buy and deploy and work with everyday. In this regard I wish AMD/Intel better bring something to the table that enables us to do more.

Richie Rich · Jun 24, 2020

It's pretty surprising that out-of-order CPU can outperform GPU what is specialized highly parallel SIMD machine based on VLIW or in-order cores. And CPU is much more flexible, it can handle general purpose code, especially OoO core, single programming model as @Hitman928 mentioned (also all data in one RAM instead splitting between GPU VRAM and RAM with all sync delays). This looks like breakthrough in supercomputing kind of thing. It's pretty exciting time about CPUs now (Apple moving to ARM, Nuvia super core under development, Tachyum VLIW CPU with Transmeta-like code morphing under development...).

SVE extension is more powerful than I'd expect. The good thing is that SVE2 vectors are coming next year for every ARM core as a part of ARMv9 ISA. Supercomputer in every pocket. It also uncover there is something wrong about AVX512. It failed also in Knight Landing GPU-like cards. ARM with SVE shows much higher performance. That could be big advantage in servers because next year the Matterhorn cores (ARMv9 and SVE2) are coming. This could be a true disaster for stagnating x86.

DisEnchantment · Jun 24, 2020

Richie Rich said:
It's pretty surprising that out-of-order CPU can outperform GPU what is specialized highly parallel SIMD machine based on VLIW or in-order cores. And CPU is much more flexible, it can handle general purpose code, especially OoO core, single programming model as @Hitman928 mentioned (also all data in one RAM instead splitting between GPU VRAM and RAM with all sync delays). This looks like breakthrough in supercomputing kind of thing. It's pretty exciting time about CPUs now (Apple moving to ARM, Nuvia super core under development, Tachyum VLIW CPU with Transmeta-like code morphing under development...).

SVE extension is more powerful than I'd expect. The good thing is that SVE2 vectors are coming next year for every ARM core as a part of ARMv9 ISA. Supercomputer in every pocket. It also uncover there is something wrong about AVX512. It failed also in Knight Landing GPU-like cards. ARM with SVE shows much higher performance. That could be big advantage in servers because next year the Matterhorn cores (ARMv9 and SVE2) are coming. This could be a true disaster for stagnating x86.

Does not matter. Everything will be beaten by Tachyum.

Every core is faster than a Xeon core or an Epyc core, and it is smaller than an Arm core, and overall, our chip is faster than a GPU on HPC

Richie Rich · Jun 24, 2020

DisEnchantment said:
Does not matter. Everything will be beaten by Tachyum.

Well, Tachyum Prodigy looked like BS when I saw it for the first time. However after some digging into it has some interesting features:

it's VLIW core and we know that Itanium and Transmeta VLIW cores worked pretty well performance and efficient wise
I was skeptical about GPU performance.... well until Fujitsu built A64FX CPU which outperforms GPUs in a massive way
Tachyum is using micro-instruction core only. If I understand correctly it's something like x86 CISC core without decoder and CISC garbage, stripping out only the RISC micro-instruction engine. So yes, it can be smaller than ARM because ARMv8 has decoder and 1000 instruction set (in compare to 1300 for x86 that's not a huge difference) and probably more power efficient as well.
But the key factor is how they will handle x86->VLIW or ARM->VLIW compiling. And the guy was working at NVidia GPU core so he has experience with VLIW GPU core compilation of code for shaders, probably much bigger experience than team aroud Transmeta at that time, and much fresh technology knowledge than Itanium team.

So, I'd not underestimate Tachyum. From every point of view it looks as reasonable and smart design. Sure it will have some major disadvantage coming from very slow VLIW re-compilation of code. However it's different approach than normal CPUs so it can be really good in some ways. Just look at Fujitsu A64FX - it's different approach and is surprisingly outperforming GPUs. Who would expect that year ago?

SarahKerrigan · Jun 24, 2020

DisEnchantment said:
Does not matter. Everything will be beaten by Tachyum.

I'll believe it when silicon is actually available and lives up to the claims. Until then, I'm skeptical. I've spent a lot of quality time with VLIW and VLIW-oid (IPF) systems, and "amazingly fast and efficient at general purpose code" was not particularly a highlight of the experience.

DisEnchantment · Jun 24, 2020

SarahKerrigan said:
I'll believe it when silicon is actually available and lives up to the claims. Until then, I'm skeptical. I've spent a lot of quality time with VLIW and VLIW-oid (IPF) systems, and "amazingly fast and efficient at general purpose code" was not particularly a highlight of the experience.

Do you think you will see silicon?

SarahKerrigan · Jun 24, 2020

DisEnchantment said:
Do you think you will see silicon?

It's possible. Tachyum has had a lot of funding coming their way. I'm a lot more skeptical that those chips will do particularly well at general-purpose code, though.

I'm willing to be convinced, but wide general-purpose in-order is not an approach that has been particularly fruitful in the past.

DrMrLordX · Jun 24, 2020

Richie Rich said:
It's pretty surprising that out-of-order CPU can outperform GPU what is specialized highly parallel SIMD machine based on VLIW or in-order cores.

Not necessarily. Again, we have no performance/watt or performance/area comparisons available. Theoretically, had Intel's process advantage survived, we would still have AVX512-based Phi products out there doing essentially the same thing, but in the end Phi was still never all THAT great compared GPGPU options. Or someone could put together a world-beater of a Cooper Lake system if they reeeeeeaaaallllly wanted to waste that many nodes and that much power just to beat something like Summit.

Doing it with Cooper Lake would be absurd, since it would require too much silicon area and too much power versus other options. A64FX appears to be less absurd.

DisEnchantment · Jun 24, 2020

SarahKerrigan said:
It's possible. Tachyum has had a lot of funding coming their way. I'm a lot more skeptical that those chips will do particularly well at general-purpose code, though.

For me I am just skeptical in general of such promises. I am skeptical of designs which are not in final shape or form which I cannot get to evaluate the advantages of developing on that platform.

I manage a number of compute clusters which support our ARM devices on the field which we sell in a B2B framework.
We process enormous amounts of data coming from the fleet which feeds into the verification and validation pipeline of the algorithms (in this day and age it has to be ML accelerated right) we develop that are deployed to said platforms.
(For obvious reasons I cannot show my LinkedIn Account or GitHub Activity)

From experience I stick(If I make a purchase decision or influence if I dont make the decision) with something known or with a vendor known to deliver.
We have had so many vendors delaying products, cancelling products, or product not reaching advertised performance. Time and time and time again.
We have aggressive contracts with most vendors and this has led us to burn bridges with several ARM SoC vendors too.
I have a personal contempt for a specific vendor for ruining the career of some of my colleagues because of promises like these. Big Promises, we evaluate a kit, they say final performance is gonna be huge. We develop our product roadmap on it, and three years later it turns out to be garbage and our Leadership takes the obvious step of culling those who make those platform decisions because of the money and effort and time lost to come up with a competitive product.

SarahKerrigan · Jun 24, 2020

DisEnchantment said:
For me I am just skeptical in general of such promises. I am skeptical of designs which are not in final shape or form which I cannot get to evaluate the advantages of developing on that platform.

I manage a number of compute clusters which support our ARM devices on the field which we sell in a B2B framework.
We process enormous amounts of data coming from the fleet which feeds into the verification and validation pipeline of the algorithms (in this day and age it has to be ML accelerated right) we develop that are deployed to said platforms.
(For obvious reasons I cannot show my LinkedIn Account or GitHub Activity)

From experience I stick(If I make a purchase decision or influence if I dont make the decision) with something known or with a vendor known to deliver.
We have had so many vendors delaying products, cancelling products, or product not reaching advertised performance. Time and time and time again.
We have aggressive contracts with most vendors and this has led us to burn bridges with several ARM SoC vendors too.
I have a personal contempt for a specific vendor for ruining the career of some of my colleagues because of promises like these. Big Promises, we evaluate a kit, they say final performance is gonna be huge. We develop our product roadmap on it, and three years later it turns out to be garbage and our Leadership takes the obvious step of culling those who make those platform decisions because of the money and effort and time lost to come up with a competitive product.

All of that is entirely fair.

Extraordinary claims require extraordinary proof. If Tachyum does in fact ship a world-beating CPU in the near future, we'll know. Until then, I have my doubts. Even releasing SPECint numbers with measured power consumption would be a good start to showing this is more than Itanium Redux.

EXCellR8 · Jun 24, 2020

let's all stop pretending like it's not gonna be used to render weird Japanese CG schoolgirl porn...

Asterox · Jun 24, 2020

lyonwonder said:
And the naming makes me think of AMD's decade old Athlon 64 FX and not modern ARM CPU's

Hm, ok maybe we can boil some weird conspiracy theory.

Richie Rich · Jun 24, 2020

Asterox said:
Hm, ok maybe we can boil some weird conspiracy theory.

That's the AMD's K12. AMD did cancel it in 2015 but sold the design to Fujitsu and they continued development

And Jim Keller left AMD due to canceled K12 and he didn't want to move to Japan either.

Hitman928 · Jun 24, 2020

Richie Rich said:
That's the AMD's K12. AMD did cancel it in 2015 but sold the design to Fujitsu and they continued development

And Jim Keller left AMD due to canceled K12 and he didn't want to move to Japan either.

Do you have a source for this info?

SarahKerrigan · Jun 24, 2020

Hitman928 said:
Do you have a source for this info?

I think he was kidding.

A64FX is a pretty direct variation of the same microarchitecture Fujitsu has been iterating on since SPARC64 V, across three different instruction sets. (GS, SPARC, ARM)

Richie Rich · Jun 24, 2020

Keller left AMD in 1999 when AMD canceled his new big K8a core based on Alpha EV8 (EV8 was super wide core with SMT4 and Keller was ex-Alpha engineer). But unfortunately AMD decided to just to evolve K7 core and implement memory controler into CPU like Alpha EV7 did.
Then Keller was at beginning of PA semi, then bought by Apple, layed down first independent Apple uarch A6 (2xALU, OoO, 32-bit ARM) and A7 Cyclone (first 64-bit ARM core ever, 4xALU OoO pretty powerfull core, similar to Intel Haswell released in the same 2013, so yes, Apple had very competitive state of the art core like Intel since 2013) and Keller left Apple in 2012, one year before A7 release (but surely taped out) and left A8, A9 and A10 in development.
He probably set goals for 6xALU monster A11 Monsoon family, including AMX and SVE.
Then he decided to help AMD to return to the top and build super wide core with modern SIMD and SMT4 like EV8. Obviously he decided to create hybrid of Apple's A7 and Bulldozer first (Zen1) and then for next uarch to choose ARM ISA, super wide 6xALU+SVE/AMX core like in Apple.... and finally to implement the main feature of revolutionary EV8, the SMT4. But AMD staff was scared by parameters he has chosen for K12, they thought he is risking already a lot by deciding that Zen1 to be 4xALU (remember in 2012 there was no 4xALU on market, Haswell and Apple A7 came 2013).
K12 spec was sci-fi like original K8a before cancelation. So later on K12 was cut down to 4xALU and SVE and later on sold to Fujitsu.
Zen3 is a x86 version of K12 lite, so probably still 4xALU but powerfull FPUs similar to A64FX (no surprise, it has same roots, also some Zen3 leak mentioned 50% IPC jump in FPU load, confirming that). Since Fujitsu A64FX doesn't have SMT4, it looks like SMT4 was cut down from Zen3 as well.
In Intel Keller is responsible for chiplet design of Alder Lake (8 big Golden Cove cores and 8-small Gracemont cores active out of 16-core Gracemont chiplet, fully 16-core chiplets will be used in Snow Ridge server CPU platform).
So yes, Apples 6xALU core family, SVE, AMX, Fujitsu A64FX and Zen3/Zen4, chiplet Alder Lake are all Jim Keller's babys maybe.

Edit: Yeah, K12 sold to Fujitsu, SVE and AMX was a joke

IMHO I think every world needs its own hero. For silicon world we have Jim Keller, for the rest there is Chuck Norris. The funny thing is that Jim Keller is real person and he really influenced industry. And he would influence even more if those morons wouldn't push him out of AMD twice.

Question Here comes A64FX: Fugaku is now the fastest supercomputer in the world

Lifer

Diamond Member

Member

Golden Member

Elite Member

Golden Member

Diamond Member

Diamond Member

Lifer

Golden Member

Senior member

Golden Member

Senior member

Senior member

Golden Member

Senior member

Lifer

Golden Member

Senior member

Diamond Member

Golden Member

Senior member

Diamond Member

Senior member

Senior member