Cascade Lake beats Rome in the race for 2019 TACC Supercomputer

itsmydamnation · Aug 31, 2018

tamz_msc said:
AVX is niche? There is no need for AVX when GPUs exist? I don't mean to sound rude but some of you guys are woefully unaware of just how many applications use AVX these days. Frostbite, UE4, idTech use it. Driving Sims like Project Cars 2 use it, Path of Exile uses it for particle effects. The Ashes of The Singularity engine, whose Devs are AMD partners, uses it. These are but a few examples in games alone. Emulators use it. One of them even uses TSX. Then there is video encoding, openssl, distributed computing, plugins for video editing, etc. A lot many things use AVX that some of you might be unaware of.

The Stilt is right, Zen2 without AVX2 at least would be extremely disappointing.

Would help if you were correct before educating everyone. Zen has AVX2; what it doesn't have is 256-bit wide vector hardware. What people call niche is the middle-of-nowhere 512-bit vector. GPUs are significantly wider and general compute significantly narrower; 512 is just kind of stuck in the middle. What workload is great at 512-bit but not 4096? What workload targeting games gets big advantages using 256-bit vectors over 128?

beginner99 · Aug 31, 2018

The Stilt said:
I think at this point we already have enough of cores.
So for the consumer market having even more is not much of an advantage.
>= 8 cores with as high IPC, high frequency and wide on demand (>= 256-bit) as possible is pretty much the optimum.

Pretty much what Intel is doing with i9-9900K.
AMD needs to respond to that, and preferably exceed it in every way.

But how many apps actually use AVX512? I think x265 is one of the few a consumer would use and the speed up there isn't that big.

Server market for "simple" servers that host a ton of VMs which run internal business apps is large enough for AMD to make profit with a "narrow core".

tamz_msc said:
A lot many things use AVX that some of you might be unaware of.

The Stilt is right, Zen2 without AVX2 at least would be extremely disappointing.

Using it and having a large speed increase are 2 different things. Are there any benches of a game with/wo AVX enabled?

But yeah for a scientific supercomputer it makes sense they choose intel with AVX-512.

slashy16 · Aug 31, 2018

Not sure why people think AVX512 is not important. Intel wouldn't be placing it in their consumer chips if application developers weren't asking for it. They don't give away silicon for no reason.

KompuKare · Aug 31, 2018

slashy16 said:
Not sure why people think AVX512 is not important. Intel wouldn't be placing it in their consumer chips if application developers weren't asking for it. They don't give away silicon for no reason.

Well, maybe.
But where Intel think they are behind, like Atom in mobile or x86 in GPU-accelerated HPC, Intel have very often shown they are willing to pour billions in to try to gain a foothold.
AVX-512 fits that pattern.
Obviously that doesn't mean that they don't have HPC customers who asked for and will use that feature, but that's not necessarily the whole picture. And I'm sure that for HPC servers Intel consider AVX-512 to be of strategic value and would indeed be willing to 'give' some silicon away for that.

tamz_msc · Aug 31, 2018

itsmydamnation said:
Would help if you were correct before educating everyone. Zen has AVX2; what it doesn't have is 256-bit wide vector hardware. What people call niche is the middle-of-nowhere 512-bit vector. GPUs are significantly wider and general compute significantly narrower; 512 is just kind of stuck in the middle. What workload is great at 512-bit but not 4096? What workload targeting games gets big advantages using 256-bit vectors over 128?

AVX2 in Zen needs reworking the code to issue instructions so that you do not send 2 FMAs at the same time, as the split 2x128-bit micro-op can only handle one at a time, unlike Intel, which has 2x256-bit FMA units and can execute them simultaneously. Now, since Intel has the majority market share, almost nobody implements AVX2 attuned to the specifics of the Zen architecture. As a result, the speedup going from AVX to AVX2 on Zen is much lower than what you get on Haswell+. As for GPUs, well, there are so many existing codes for CPUs that it is far easier to apply AVX2/512 updates to legacy code than to rewrite it completely for GPUs, and then there are many applications like CFD where it is quite easy to add more cores to achieve the same level of parallelization offered by GPUs.

beginner99 said:
Using it and having a large speed increase are 2 different things. Are there any benches of a game with/wo AVX enabled?

Path of Exile got more than double FPS after including AVX optimizations for particle physics. From what I have read it is also important for VR performance.

Nothingness · Aug 31, 2018

slashy16 said:
Not sure why people think AVX512 is not important. Intel wouldn't be placing it in their consumer chips if application developers weren't asking for it. They don't give away silicon for no reason.

Intel are completely schizophrenic as far as vector extensions go. Just look how many current chips have AVX disabled. Even some Coffee Lake chips lack AVX. Now ask yourself why as a software developer you'd want to add code for AVX (or AVX2 or AVX-512) when the brain dead marketers of Intel decided that it would be good to castrate a lot of their chips.

itsmydamnation · Aug 31, 2018

tamz_msc said:
AVX2 in Zen needs reworking the code to issue instructions so that you do not send 2 FMAs at the same time as the split 2x128-bit micro-op can only handle one at a time, unlike Intel which due to having 2x256 bit FMA units can execute simultaneously.

You'd better tell Agner that he is wrong, or it could be exactly like I said:

https://www.agner.org/optimize/microarchitecture.pdf

256-bit instructions are split into two µops each so that the throughput for 256-bit vectors is the half of the throughput for 128-bit vectors. Most instructions have two or more pipes to choose between, as table 20.1 shows, so that most 256-bit instructions can execute with a throughput of at least one instruction per clock cycle.

Why do you need to rewrite anything? If you have 512 bits of data to operate on, it doesn't matter how Zen executes it (uop 1 from ops 1 and 2, then uop 2 from ops 1 and 2; or uops 1 and 2 from op 1, then uops 1 and 2 from op 2): it's going to take the same number of cycles. Next point: FMA is generally only a very small percentage of the operations that will be executed.

Now since Intel has the majority market-share almost nobody implements AVX2 attuned to the specifics of the architecture of Zen. As a result, the speedup going from AVX->AVX2 in Zen is much lower than what you get on Haswell+

Sorry, you just don't understand what you are talking about. Sandy Bridge has 256-bit AVX units, yet its performance is worse than Zen's. What makes Haswell+ better at 256-bit SIMD throughput than Zen is that it has a 256-bit load/store pipeline along with the 256-bit AVX units, while Zen is 128-bit end to end.

Also, AVX1/2 can be either 128-bit or 256-bit. 128-bit AVX2 code (if Zen 2 is still 128-bit you will see a lot of game code like this from consoles) will be better on Zen than Haswell+ because overall it has more 128-bit execution resources, while both have the same amount of load/store for 128-bit ops.

As for GPUs, well, there are so many existing codes for CPUs that it is far easier to apply AVX2/512 updates to legacy code than to rewrite it completely for GPUs, and then there are many applications like CFD where it is quite easy to add more cores to achieve the same level of parallelization offered by GPUs.

By "far easier" you mean that unless Intel has hand-written the lib with the functions you need, it's really, really hard? It's painful to admit, but regardless of the underlying hardware, CUDA is basically the best language for writing parallel code (NV call their model SIMT). If AVX512 was so great, Knights whatever-they-are-up-to wouldn't have been discarded and the family tree killed off like it has.

JoeRambo · Aug 31, 2018

itsmydamnation said:
Sorry, you just don't understand what you are talking about. Sandy Bridge has 256-bit AVX units, yet its performance is worse than Zen's. What makes Haswell+ better at 256-bit SIMD throughput than Zen is that it has a 256-bit load/store pipeline along with the 256-bit AVX units, while Zen is 128-bit end to end.

Only partially right. SB was the first implementation of AVX, and it had 2x16B load and 1x16B store, obviously not enough to feed two 256-bit wide AVX units. And it is a CPU from 2011; no wonder AMD can beat it with a 2017 design that has 2+2 128-bit wide ADD/MUL.
HSW+ simply has too many resources for Zen to touch: 2x32B load and 1x32B store, and two pipes of 256-bit wide FMA units.
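
A quick back-of-the-envelope check of those load figures, as a rough sketch only. It assumes every 256-bit operation pulls one 32-byte operand from memory per cycle, which is a simplification, and the port widths are the ones quoted in this post rather than datasheet values.

```python
# Per-cycle load/store widths as quoted in this thread (bytes per cycle).
# These are the post's numbers, not values taken from a datasheet.
CORES = {
    "Sandy Bridge": {"load_bytes": 2 * 16, "store_bytes": 1 * 16},
    "Haswell+": {"load_bytes": 2 * 32, "store_bytes": 1 * 32},
}

VECTOR_BYTES = 32  # size of one 256-bit operand


def vectors_loaded_per_cycle(core: str) -> float:
    """How many 256-bit operands the load ports can deliver each cycle."""
    return CORES[core]["load_bytes"] / VECTOR_BYTES


for name in CORES:
    print(f"{name}: {vectors_loaded_per_cycle(name):.0f} x 256-bit loads per cycle")
```

Under that assumption Sandy Bridge can stream one 256-bit operand per cycle and Haswell+ two, which is the gap being described.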

But all this is ignoring the real impact of the new instructions. People focus on bit width too much, forgetting the new stuff the instructions bring. For example, on 32-bit code, using VEX-encoded instructions for 128-bit vectors was a huge win in compiler-generated code, since spills to memory were reduced big time and even brain-dead compilers could generate better code.

AVX512 brings more of this greatness, making completely new classes of code vectorizable thanks to the various goodies it includes. Of course this is currently hurt by lowered clocks and no presence in consumer-grade hardware, but once/if 10nm is fixed, AVX512 will come.

jpiniero · Aug 31, 2018

Nothingness said:
Intel are completely schizophrenic as far as vector extensions go. Just look how many current chips have AVX disabled. Even some Coffee Lake chips lack AVX. Now ask yourself why as a software developer you'd want to add code for AVX (or AVX2 or AVX-512) when the brain dead marketers of Intel decided that it would be good to castrate a lot of their chips.

"A lot"? It's only Celeron and Pentium.

Nothingness · Aug 31, 2018

jpiniero said:
"A lot"? It's only Celeron and Pentium.

Do you know how much of the sold CPUs that represents? Certainly enough that if your app only supports AVX2 you'll be in trouble. So you have no other choice than to develop two code paths, or just not support AVX2 and stick to SSEn.

That's completely stupid given that those dies have AVX; it's just fused off to create different SKUs. You can spin that however you want, it's not the best way to get rapid adoption 🙂
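
For anyone wondering what "two code paths" means in practice, here is a minimal sketch of runtime dispatch. Everything in it is hypothetical: `add_sse2`, `add_avx2` and `select_kernel` are made-up names, the kernels are plain Python stand-ins for real SIMD code, and the feature flags are passed in by hand (the spellings mirror the flag names Linux lists in `/proc/cpuinfo`).

```python
from typing import Callable, Iterable, List

Kernel = Callable[[List[float], List[float]], List[float]]


def add_sse2(a: List[float], b: List[float]) -> List[float]:
    """Stand-in for the baseline SSE2 code path (hypothetical)."""
    return [x + y for x, y in zip(a, b)]


def add_avx2(a: List[float], b: List[float]) -> List[float]:
    """Stand-in for the AVX2 code path (hypothetical); same result, wider kernel."""
    return [x + y for x, y in zip(a, b)]


def select_kernel(cpu_flags: Iterable[str]) -> Kernel:
    """Pick the AVX2 path when the CPU reports it, otherwise fall back to SSE2."""
    flags = set(cpu_flags)
    return add_avx2 if "avx2" in flags else add_sse2


# A Core-based Pentium with AVX fused off would land on the SSE2 path.
kernel = select_kernel({"sse", "sse2"})
print(kernel.__name__, kernel([1.0, 2.0], [3.0, 4.0]))
```

The cost being complained about is that both kernels have to be written, tested and shipped, even though only one of them runs on any given machine.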

jpiniero · Aug 31, 2018

Nothingness said:
Do you know how much of the sold CPUs that represents?

Core Celeron and Pentium really isn't that much volume. Atom (and the Atom based Celerons and Pentiums) on the other hand is quite a bit.

DrMrLordX · Aug 31, 2018

I'm not sure that it's a big deal that Rome wasn't used in TACC. So what? It's not intrinsically an HPC-oriented chip anyway. Anyone looking to port over an AVX-heavy codebase is still going to want to use an Intel product, and anyone not doing so might do better to use Power9 or Power10. Honestly I think Rome would do fine in a Summit-style CPU+GPGPU setup, too, so long as the HPC software involved did not sit in the sweet spot where having AVX2 or AVX512 is actually of benefit (while also hosting GPGPU devices).

Right now AMD needs to focus on getting their product into as many data centers as possible. Getting the TACC win would be nice advertising, but in terms of total sales, it's not a big deal.

Not sure if/when AMD will "get serious" about 256-bit SIMD. We still don't know exactly what their plans are for Fusion/HSA, other than that they seem to be leaning heavily on OpenCL 2.x right now.

Nothingness · Aug 31, 2018

jpiniero said:
Core Celeron and Pentium really isn't that much volume. Atom (and the Atom based Celerons and Pentiums) on the other hand is quite a bit.

For Atom, AVX was never part of the design due to Intel thinking it was pointless for mobile platforms. That's different from Core-derived Pentium and Celeron that have the AVX units on silicon.

But yeah Atom is enough to impede AVX adoption.

tamz_msc · Aug 31, 2018

itsmydamnation said:
You'd better tell Agner that he is wrong, or it could be exactly like I said:

https://www.agner.org/optimize/microarchitecture.pdf

Agner also says

The Ryzen supports the AVX2 instruction set. 256-bit AVX and AVX2 instructions are split into two µops that do 128 bits each. The throughput for most 256-bit instructions is one instruction per clock cycle because there are two multiplication units and two addition units. The 256-bit instructions are decoded at a rate of four instructions per clock cycle. Therefore, it is more efficient to use 256-bit instructions than 128-bit instructions when instruction fetch and decoding is a bottleneck.
The maximum throughput for floating point calculations in a single thread is one 256-bit vector multiplication or FMA instruction and one 256-bit vector addition per clock cycle.

You cannot execute two 256b FMAs simultaneously.

Why do you need to rewrite anything? If you have 512 bits of data to operate on, it doesn't matter how Zen executes it (uop 1 from ops 1 and 2, then uop 2 from ops 1 and 2; or uops 1 and 2 from op 1, then uops 1 and 2 from op 2): it's going to take the same number of cycles. Next point: FMA is generally only a very small percentage of the operations that will be executed.

I was wrong to say rewriting, but it will need compiler intrinsics to account for the way Zen does AVX, since it cannot do 2 FMA/cycle in a straightforward manner. FMA is the basis of the dot product, so any scientific code with lots of vector and matrix operations would fare quite poorly on Zen. Stilt has shown this to be true in the case of Linpack.

Sorry, you just don't understand what you are talking about. Sandy Bridge has 256-bit AVX units, yet its performance is worse than Zen's. What makes Haswell+ better at 256-bit SIMD throughput than Zen is that it has a 256-bit load/store pipeline along with the 256-bit AVX units, while Zen is 128-bit end to end.

Also, AVX1/2 can be either 128-bit or 256-bit. 128-bit AVX2 code (if Zen 2 is still 128-bit you will see a lot of game code like this from consoles) will be better on Zen than Haswell+ because overall it has more 128-bit execution resources, while both have the same amount of load/store for 128-bit ops.

Yeah but sandy cannot do 2FMAs/cycle right? In terms of throughput it is more like 1FMA/2cycle for SB, 1FMA/cycle for Zen and 2FMA/cycle for Haswell+.
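
To put those FMA rates into FLOPs, here is a small sketch. It only takes the Zen and Haswell+ figures from this post (one and two 256-bit FMAs per cycle), counts each FMA as two floating-point operations per lane, and ignores clocks, memory and everything else. The function name is made up for illustration.

```python
def peak_dp_flops_per_cycle(fma_per_cycle: float, vector_bits: int) -> float:
    """Peak double-precision FLOPs per cycle for a given FMA issue rate.

    Each 64-bit lane of an FMA performs one multiply and one add,
    so it counts as two floating-point operations.
    """
    lanes = vector_bits // 64
    return fma_per_cycle * lanes * 2


# FMA rates as claimed in this post, for 256-bit vectors.
claimed = {"Zen": 1, "Haswell+": 2}

for core, rate in claimed.items():
    print(f"{core}: {peak_dp_flops_per_cycle(rate, 256):.0f} DP FLOPs/cycle")
```

That gives 8 for Zen and 16 for Haswell+, which lines up with the 8 and 16 DP/clock figures quoted later in the thread.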

By "far easier" you mean that unless Intel has hand-written the lib with the functions you need, it's really, really hard? It's painful to admit, but regardless of the underlying hardware, CUDA is basically the best language for writing parallel code (NV call their model SIMT). If AVX512 was so great, Knights whatever-they-are-up-to wouldn't have been discarded and the family tree killed off like it has.

By far easier I mean reading some books on optimization and adding more cores, since they're already familiar with OpenMP and MPI. Knights was cancelled because AVX512 is becoming part of the regular CPUs, which is why even a half-broken Cannon Lake U has it.

itsmydamnation · Aug 31, 2018

tamz_msc said:
Agner also says

You cannot execute two 256b FMAs simultaneously.

It will need rewriting because you cannot do 2FMA/cycle. FMA is the basis of the dot product, so any scientific code with lots of vector and matrix operations would fare quite poorly on Zen. Stilt has shown this to be true in the case of Linpack.

Yeah but sandy cannot do 2FMAs/cycle right? In terms of throughput it is more like 1FMA/2cycle for SB, 1FMA/cycle for Zen and 2FMA/cycle for Haswell+.

Man...... serious.......

1. Why are you so hung up on FMA? It's no different from any other SIMD op, except that on Zen it will borrow a read port from the standalone ADD pipe to the FPRF.

2. You are missing the point: you don't rewrite anything for Zen; Zen simply has half the throughput for 256-bit ops. There is no rewriting; if you took the time to actually understand what I was saying, that would be obvious. Zen can start the execution of two 256-bit FMAs in a cycle, if that means something for some reason, but it won't be able to execute any on the next cycle because it will be doing the second uops for both of them.
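
The ordering argument can be checked with a toy model. This is only a sketch under loud assumptions: two independent 256-bit ops A and B, each split into a low and a high 128-bit uop, two pipes that can each take any uop, and no latency or dependency effects. `issue_cycles` is a made-up helper, not a model of the real scheduler.

```python
from typing import Sequence


def issue_cycles(uops: Sequence[str], pipes: int = 2) -> int:
    """Cycles needed to issue `uops` in order, at most `pipes` per cycle."""
    if pipes < 1:
        raise ValueError("pipes must be at least 1")
    queue = list(uops)
    cycles = 0
    while queue:
        queue = queue[pipes:]  # up to `pipes` uops leave the queue each cycle
        cycles += 1
    return cycles


# Low halves of both ops first, then both high halves.
interleaved = ["A.lo", "B.lo", "A.hi", "B.hi"]
# Both halves of A, then both halves of B.
sequential = ["A.lo", "A.hi", "B.lo", "B.hi"]

print(issue_cycles(interleaved), issue_cycles(sequential))
```

Either order drains the four uops in two cycles in this model, i.e. one 256-bit op per cycle on average, which is the half-throughput point being made.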

dark zero · Sep 1, 2018

But... Cascade Lake is Meltdown free?

beginner99 · Sep 1, 2018

tamz_msc said:
Path of Exile got more than double FPS after including AVX optimizations for particle physics. From what I have read it is also important for VR performance.

Your own link states something different. It went from 70 fps to 103 fps. So a 33 fps increase. That is a 47% increase using AVX.
(all based on their multi-threaded engine numbers)

Then later down in the text there is following info:

The actual particle subsystem by itself is roughly 4X faster using these instructions. For CPUs without AVX support, we also have an SSE2 implementation which is roughly 2X faster than before, which will still have a fairly significant end result on your frame rate.

Meaning an old CPU with SSE2 will still get half of that increase, so 23.5%. 70 fps * 1.235 = 86.45 fps, or 16.55 less than with AVX.

16.55 / 86.45 = 19%. So the actual benefit of AVX vs. a legacy CPU (SSE2 is from 2000; any CPU capable of playing this game has it) is roughly 19% and not 200% as you claim. Yes, it still matters in this day and age with the lack of CPU progress. But saying it doubles FPS is plain wrong for this game.
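
The same estimate, spelled out step by step. This is just the arithmetic from this post and carries the same big assumption: that an SSE2-only CPU gets exactly half of the frame-rate gain that AVX gets.

```python
base_fps = 70.0   # before the optimizations (multi-threaded engine numbers)
avx_fps = 103.0   # after, with AVX

# Relative gain with AVX: 33 / 70, about 47%.
avx_gain = avx_fps / base_fps - 1.0

# Assumption from the post: SSE2 gets half of that gain.
sse2_gain = avx_gain / 2.0
sse2_fps = base_fps * (1.0 + sse2_gain)

# Advantage of AVX over the SSE2 path, relative to the SSE2 frame rate.
avx_over_sse2 = avx_fps / sse2_fps - 1.0

print(f"AVX gain over baseline: {avx_gain:.1%}")
print(f"Estimated SSE2 frame rate: {sse2_fps:.1f} fps")
print(f"AVX advantage over SSE2: {avx_over_sse2:.1%}")
```

Without the intermediate rounding the SSE2 estimate comes out at 86.5 fps rather than 86.45, and the AVX advantage stays at roughly 19%.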

Abwx · Sep 1, 2018

JoeRambo said:
HSW+ simply has too much resources for ZEN to touch, .

HSW has fewer resources when looking at the uarch: the exe units are clustered, meaning that integer and FP use the same exe clusters, so the resources are not available simultaneously, contrary to Zen where the FP part is literally a coprocessor.

If HSW had more resources it would be better in benchmarks, given that everything is optimised for Intel. Just look at Hardware.fr, which said that they optimised their benchmarks for the latest Intel uarch, yet HSW trails Zen.

tamz_msc · Sep 1, 2018

itsmydamnation said:
Man...... serious.......

1. Why are you so hung up on FMA? It's no different from any other SIMD op, except that on Zen it will borrow a read port from the standalone ADD pipe to the FPRF.

2. You are missing the point: you don't rewrite anything for Zen; Zen simply has half the throughput for 256-bit ops. There is no rewriting; if you took the time to actually understand what I was saying, that would be obvious. Zen can start the execution of two 256-bit FMAs in a cycle, if that means something for some reason, but it won't be able to execute any on the next cycle because it will be doing the second uops for both of them.

Can Zen execute 2 256b FMAs in a single cycle? It cannot, because at most it is executing 2x uop1 and 2x uop2 per cycle. That's why Haswell+ is double the throughput in Linpack using MSVC. In fact Zen does 8DP/clock, when Skylake can do 16/clock and Skylake-X should do 32/clock but does 24/clock due to memory bandwidth limitations. That's why you get this:

beginner99 said:
Your own link states something different. It went from 70 fps to 103 fps. So a 33 fps increase. That is a 47% increase using AVX.
(all based on their multi-threaded engine numbers)

Then later down in the text there is following info:

Meaning an old CPU with SSE2 will still get half of that increase, so 23.5%. 70 fps * 1.235 = 86.45 fps, or 16.55 less than with AVX.

16.55 / 86.45 = 19%. So the actual benefit of AVX vs. a legacy CPU (SSE2 is from 2000; any CPU capable of playing this game has it) is roughly 19% and not 200% as you claim. Yes, it still matters in this day and age with the lack of CPU progress. But saying it doubles FPS is plain wrong for this game.

It is 2x when comparing single-threaded unoptimized code. Particle effects in older engines, like Source, are single-threaded and don't even use SSE2, which is the reason why FPS tanks when you throw a smoke grenade. Also, the performance gains achieved by the particle simulation do not map linearly into FPS gains, so the comparison you're making is not accurate.
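
The non-linear mapping from a subsystem speedup to a frame-rate gain is essentially Amdahl's law. A minimal sketch follows; the fractions of frame time used here are hypothetical, since the thread does not say how much of a frame the particle subsystem actually takes.

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when only `fraction` of the
    run time is accelerated by a factor of `local_speedup`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Hypothetical shares of frame time spent in a subsystem that gets 4x faster.
for share in (0.10, 0.25, 0.50, 1.00):
    print(f"{share:.0%} of the frame at 4x -> {overall_speedup(share, 4.0):.2f}x overall")
```

Only when the accelerated part is the whole frame does a 4x subsystem gain show up as 4x FPS; at half the frame it is 1.6x.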

Besides, when the code you wrote as a student, which needs an 18C server-grade CPU to run, can perform just as well with AVX512 optimizations on your dual-core laptop, there is little reason to doubt the usefulness of AVX512 in the real world.

tamz_msc · Sep 1, 2018

Abwx said:
HSW has fewer resources when looking at the uarch: the exe units are clustered, meaning that integer and FP use the same exe clusters, so the resources are not available simultaneously, contrary to Zen where the FP part is literally a coprocessor.

You're referring to the block-diagram layout, which should have nothing to do with the number of execution resources available. Haswell has 64B/cycle load and 32B/cycle store while Zen has 32B/cycle for both.

beginner99 · Sep 1, 2018

tamz_msc said:
It is 2x when comparing single-threaded unoptimized code. Particle effects in older engines, like Source, are single-threaded and don't even use SSE2, which is the reason why FPS tanks when you throw a smoke grenade. Also, the performance gains achieved by the particle simulation do not map linearly into FPS gains, so the comparison you're making is not accurate.

Well, my question was merely what AVX brings over non-AVX (or what Cascade Lake has over Zen 2). Both can do multithreading, hence that doesn't matter. The fact remains that AVX was only a small part of that 2x increase, and they also added SSE2 support, which gives half of the AVX gains. So again my point remains: the difference between AVX and a legacy CPU with SSE2 isn't that great.

And of course my calculations are very hypothetical but it clearly debunks that AVX leads to 2x increase. Nothing more, nothing less.

tamz_msc said:
Besides, when the code you wrote as a student, which needs an 18C server-grade CPU to run, can perform just as well with AVX512 optimizations on your dual-core laptop, there is little reason to doubt the usefulness of AVX512 in the real world.

Well, it's a benchmark for exactly this, so it scales. No real-world software uses 100% "vectorizable" code; heck, even supposedly ideal-case scenarios like x265 don't get gains that huge. It's mostly less than 10%.

NostaSeronx · Sep 1, 2018

AMD is pretty safe in regards to modern ISAs.
-> One of the techs from SVE. (Broadcom's with Vulcan SMT4 and future arch w/ ARMv8.3-SVE)
-> One of the techs from AVX512. (Intel's with Skylake-SP and Xeon Phi)

I'm pretty sure AMD is going to destroy with XOP2(AMD's version of SVE) or AVX512 support.
AMD going the SVE route gets the most complete set;
- Scalable Vector Length
- Vector Length Agnostic
- Enhanced Auto-vec support
- Predicated Execution
- Speculative Vectorization
- Gather-scatter
Some of which AVX512 supports as well. Implied automatic support of SSE/AVX/AVX512.
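
To illustrate what "vector length agnostic" plus "predicated execution" buys you, here is a toy sketch in plain Python. It is not SVE code: `vla_add` and `vector_len` are made-up names, and the inner loop over lanes stands in for what a single predicated vector instruction would do in hardware.

```python
from typing import List


def vla_add(a: List[float], b: List[float], vector_len: int) -> List[float]:
    """Add two equal-length lists in chunks of `vector_len` lanes.

    The loop never hard-codes the vector width; a per-chunk predicate
    masks off lanes past the end of the data, so no scalar tail loop
    is needed.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    if vector_len < 1:
        raise ValueError("vector_len must be at least 1")
    out = [0.0] * len(a)
    for start in range(0, len(a), vector_len):
        # Predicate: True for lanes that still hold real elements.
        mask = [start + lane < len(a) for lane in range(vector_len)]
        for lane, active in enumerate(mask):
            if active:
                out[start + lane] = a[start + lane] + b[start + lane]
    return out


data_a = [1.0, 2.0, 3.0, 4.0, 5.0]
data_b = [10.0, 20.0, 30.0, 40.0, 50.0]
# The same code gives the same answer whether the "hardware" has 2 or 16 lanes.
print(vla_add(data_a, data_b, 2) == vla_add(data_a, data_b, 16))
```

The appeal is that one binary could run on a narrow low-power core and a wide HPC core without being rewritten for each width.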

Having an ISA that goes from current LP cores (>128-bit peak width) to future HPC cores (>2048-bit peak width) would be ideal for a new vector ISA.

So, let's say a customer of AMD utilizes
1. semi-custom 7nm for special LP-HPC uses with a modified Zen core.
2. Latest Zen X(number) core for HPC.

Utilizing something like AMD64-SVE or Extended Operations 2, etc., would reduce porting costs between the two systems and leave room for upgrades, such as a smaller node, more cores, etc.

On the consumer side, the programs can just be optimized for the largest vector if need be, and AMD will get there some day.

Dayman1225 · Sep 1, 2018

dark zero said:
But... Cascade Lake is Meltdown free?

[Image: Intel Cascade Lake HC30 presentation slide, page 25]

Has in silicon/hardware mitigations

DrMrLordX · Sep 1, 2018

Should be interesting to see if those hardware mitigations produce better performance than a fully-patched Skylake-X server running the same clocks.

tamz_msc · Sep 2, 2018

beginner99 said:
Well, my question was merely what AVX brings over non-AVX (or what Cascade Lake has over Zen 2). Both can do multithreading, hence that doesn't matter. The fact remains that AVX was only a small part of that 2x increase, and they also added SSE2 support, which gives half of the AVX gains. So again my point remains: the difference between AVX and a legacy CPU with SSE2 isn't that great.

And of course my calculations are very hypothetical but it clearly debunks that AVX leads to 2x increase. Nothing more, nothing less.

The point is that in situations where ST performance is the bottleneck, AVX can be up to 2x faster than SSE under certain conditions.

beginner99 said:
Well, it's a benchmark for exactly this, so it scales. No real-world software uses 100% "vectorizable" code; heck, even supposedly ideal-case scenarios like x265 don't get gains that huge. It's mostly less than 10%.

It isn't just a benchmark - it is real-world code written for conducting someone's research. It's as real as it gets as far as code written by an average guy doing his PhD is concerned. People need to stop making it sound like the only thing "real-world" refers to is some .exe that you download off the Internet. Linpack is real-world code for solving real-world problems.
