AMD EPYC Server Processor Thread - EPYC 7000 series specs and performance leaked

tamz_msc · Sep 2, 2017

TheGiant said:
So what is the bottleneck of SKL-SP with AVX512? Memory bandwitch? L2 cache, mesh frequency? L3 cache?

I think it could be tested with 18C SKL-X and DDR4 2666 as a baseline what is really behind it as you can change DDR4 frequency, mesh frequency and other settings you cannot with xeon

I don't believe avx512 is utterly useless....

According to STREAM triad, it seems to be memory bandwidth/core. What you can do by fiddling with desktop SKL-X is irrelevant here, these Xeons will run as is according to what they officially support. There is little to no added performance improvement in the SPEC2017 suite when using AVX512 over AVX2.

IntelUser2000 · Sep 2, 2017

TheGiant said:
I don't believe avx512 is utterly useless....

I don't either.

tamz_msc: What is the performance increase due to AVX in Sandy Bridge and AVX2 in Haswell for SpecCPU? The gains if I remember correctly were always quite small for Spec tests. Its in the real applications the benefits were shown.

tamz_msc · Sep 2, 2017

IntelUser2000 said:
I don't either.

tamz_msc: What is the performance increase due to AVX in Sandy Bridge and AVX2 in Haswell for SpecCPU? The gains if I remember correctly were always quite small for Spec tests. Its in the real applications the benefits were shown.

They don't test SB or HSW here. Only BD-E, SKL-SP and Naples. SPEC is as close to a real-world benchmark as it gets. Here's the description of one of the benchmarks:

649.fotonik3d_s:
Fotonik3D computes the transmission coefficient of a photonic waveguide using the finite-difference time-domain (FDTD) method for the Maxwell equations. UPML for dielectric materials is used to terminate the computational domain.

The core of the FDTD method is second-order accurate central-difference approximations of the Faraday's and Ampere's laws. These central-differences are employed on a staggered Cartesian grid resulting in an explicit finite-difference method. The FDTD method is also referred to as the Yee scheme. It is the standard time-domain method within CEM.

If this isn't real-world, I don't know what is.

IntelUser2000 · Sep 2, 2017

tamz_msc said:
They don't test SB or HSW here. Only BD-E, SKL-SP and Naples. SPEC is as close to a real-world benchmark as it gets. Here's the description of one of the benchmarks:

I agree on Spec being close to a real world benchmark. However AVX has no effect on SPEC. Otherwise, we'd have seen outliers on SpecFP results when Sandy Bridge launched, then Haswell.

In fact, there's zero difference between SSE, AVX and AVX2
https://www.pressreader.com/netherlands/ct-magazine/20170620/283609580195527
http://www.hardware.fr/articles/847-10/performances-spec-gombk-hmmer-sjeng.html

By not being sensitive on vector ISA extensions(which is sort of a specialized function), it allows fair, architectural-based comparisons.

Arachnotronic · Sep 2, 2017

IntelUser2000 said:
By not being sensitive on vector ISA extensions(which is sort of a specialized function), it allows fair, architectural-based comparisons.

Such comparisons are moot. All that matters is delivered performance in your desired workload, and a lot of people who care about performance will update their software to use the new features on the CPU.

IntelUser2000 · Sep 2, 2017

Arachnotronic said:
Such comparisons are moot. All that matters is delivered performance in your desired workload, and a lot of people who care about performance will update their software to use the new features on the CPU.

That is true, but people want to know, and its also important. Because you are talking about a general purpose chip. A benchmark gives you a ball park idea. But if you want to know the actual result, sometimes you have to dig deep, really deep. Oftentimes people stop digging when they find the result they themselves find satisfactory.

But in this case, in an enthusiast forum, we talk too much about things that don't help us at all. Like the huge thread about Vega. Who cares about the details we'll never get to know? It's a hobby that's all.

tamz_msc · Sep 2, 2017

IntelUser2000 said:
I agree on Spec being close to a real world benchmark. However AVX has no effect on SPEC. Otherwise, we'd have seen outliers on SpecFP results when Sandy Bridge launched, then Haswell.

In fact, there's zero difference between SSE, AVX and AVX2
https://www.pressreader.com/netherlands/ct-magazine/20170620/283609580195527
http://www.hardware.fr/articles/847-10/performances-spec-gombk-hmmer-sjeng.html

By not being sensitive on vector ISA extensions(which is sort of a specialized function), it allows fair, architectural-based comparisons.

AVX didn't really get a proper implementation till Haswell and AVX2 was spotty until Broadwell. So SB results really don't tell much; also those are for the 2006 edition of the benchmark. For example, there are clear differences in AVX2 and SSE3 performance with the Intel compiler in the first heise.de article you linked to, especially on the 7700K. Just the margin of improvement is not as huge as PR would have you believe.

Arachnotronic said:
Such comparisons are moot. All that matters is delivered performance in your desired workload, and a lot of people who care about performance will update their software to use the new features on the CPU.

Updated software, meaning updated compiler with new optimization flags, which is not doing much at the moment. It's total fantasy if you believe a researcher is going to rewrite his whole code just so that it can do better with AVX512. If the compiler doesn't do it for him, it's over.

JoeRambo · Sep 2, 2017

tamz_msc said:
What more optimization do you need than -O3 and -avx512 on the Intel compiler? It doesn't help at all on real workloads, at least not in its present state. In many cases you're going to be better off with explicit vectorization than relying on compiler autovectorization, which is basically the whole point of avx.

Really? Compiler autovectorization is academic bs that usually requires stars to align in code and bunch of restrict keywords thrown in on top of that. While AVX512 helps with "auto" vectorization of loops, that is not going to happen in generic "real world code" unless compiler can prove "basics" like two arrays not overlapping each other.

Where AVX and co shines, is in hand tuned code, usually written by hand or by using vendor provided libraries. And just throwing in AVX512 support randomly, can in fact hurt the performance, as glibc guys found out after using AVX512 in memcpy that in turn lowered clocks for millions of instructions resulting net loss in mem copy and world of pain when next context switch had to save/restore truckload of 512bit wide registers.

tamz_msc said:
Updated software, meaning updated compiler with new optimization flags, which is not doing much at the moment. It's total fantasy if you believe a researcher is going to rewrite his whole code just so that it can do better with AVX512. If the compiler doesn't do it for him, it's over.

Yeah, they will buy expensive hardware and leave up to half of performance on the table. Sounds about right

IntelUser2000 · Sep 2, 2017

tamz_msc said:
AVX didn't really get a proper implementation till Haswell and AVX2 was spotty until Broadwell. So SB results really don't tell much; also those are for the 2006 edition of the benchmark. For example, there are clear differences in AVX2 and SSE3 performance with the Intel compiler in the first heise.de article you linked to, especially on the 7700K. Just the margin of improvement is not as huge as PR would have you believe.

Take a closer look. FP results don't show the difference. Integer shows under 5% improvement. That suggests to me Integer benefitted from AVX just being a superior ISA over SSE. Since AVX is for vector extensions though, it shows zero gains.

And answering both you and your reply to Arachronic. Yes, they do those sort of optimizations. Sandy Bridge EP release had floods of results based on it. Of course general market as a whole doesn't. But significant part of the HPC market does.

tamz_msc · Sep 2, 2017

IntelUser2000 said:
Take a closer look. FP results don't show the difference. Integer shows under 5% improvement. That suggests to me Integer benefitted from AVX just being a superior ISA over SSE. Since AVX is for vector extensions though, it shows zero gains.

And answering both you and your reply to Arachronic. Yes, they do those sort of optimizations. Sandy Bridge EP release had floods of results based on it. Of course general market as a whole doesn't. But significant part of the HPC market does.

HPC market mostly does gcc/gfortran with added flags, unless you're writing code for CERN or LIGO. Once a code reaches production-level, it's all about results - going back and using new and updated libraries, which basically amounts to rewriting the whole code, is very rare, unless it's a big-name project.

JoeRambo said:
Really? Compiler autovectorization is academic bs that usually requires stars to align in code and bunch of restrict keywords thrown in on top of that. While AVX512 helps with "auto" vectorization of loops, that is not going to happen in generic "real world code" unless compiler can prove "basics" like two arrays not overlapping each other.

Where AVX and co shines, is in hand tuned code, usually written by hand or by using vendor provided libraries. And just throwing in AVX512 support randomly, can in fact hurt the performance, as glibc guys found out after using AVX512 in memcpy that in turn lowered clocks for millions of instructions resulting net loss in mem copy and world of pain when next context switch had to save/restore truckload of 512bit wide registers.

Yeah, they will buy expensive hardware and leave up to half of performance on the table. Sounds about right

Yeah and people will go back to rewrite code that has been squeezed dry for 15-20 years, whose original authors are probably dead/retired just because the uni decided to buy some new workstations bearing shiny logos of the latest Intel processor. Sounds about right.

Can you give some examples of "non-generic" real-world code?

IntelUser2000 · Sep 2, 2017

tamz_msc said:
HPC market mostly does gcc/gfortran with added flags, unless you're writing code for CERN or LIGO. Once a code reaches production-level, it's all about results - going back and using new and updated libraries, which basically amounts to rewriting the whole code, is very rare, unless it's a big-name project.

Intel probably sees a justification for extending their vector extensions, unlike you do. There's also enough group of people doing optimizations to allow a separate market for HPC just based on Nvidia GPUs.

AVX extensions are likely a defensive strategy. I agree with what you are saying, that most don't like to rewrite code. And even those that do, they want to do minimal work as possible. By potentially doubling performance in the applications that are starting to get suited for porting to GPUs, they encourage people to stay with Xeons for HPC work. Working with existing toolset and knowledge is much easier and cheaper than to move to a completely new one.

And I know it works. Because they still have something like 90% of the HPC market.

JoeRambo · Sep 2, 2017

tamz_msc said:
Yeah and people will go back to rewrite code that has been squeezed dry for 15-20 years, whose original authors are probably dead/retired just because the uni decided to buy some new workstations bearing shiny logos of the latest Intel processor. Sounds about right.

They are already using some Intel provided library and by upgrading it they will benefit from AVX512. And world does not end with academics, business runs with different rules.

Anyway this discussion is moot, AVX512 as it is implemented is grave error by Intel that will cost them dearly.

tamz_msc · Sep 2, 2017

JoeRambo said:
They are already using some Intel provided library and by upgrading it they will benefit from AVX512. And world does not end with academics, business runs with different rules.

Anyway this discussion is moot, AVX512 as it is implemented is grave error by Intel that will cost them dearly.

What is it that the libraries do? What are the real-world codes that businesses run that supposedly benefit from said libraries? All I'm asking for is a concrete example.

Arachnotronic · Sep 2, 2017

JoeRambo said:
Anyway this discussion is moot, AVX512 as it is implemented is grave error by Intel that will cost them dearly.

Why?

JoeRambo · Sep 2, 2017

tamz_msc said:
What is it that the libraries do? What are the real-world codes that businesses run that supposedly benefit from said libraries? All I'm asking for is a concrete example.

I meant stuff like:

https://www.cloudera.com/downloads/partner/intel.html

so if your app is using those functions from libraries, you can benefit from simple library upgrade to AVX512.

Arachnotronic said:
Why?

I have already explained it a bit in Skylake thread, but reasoning is:

1) Instruction set is OK, lots of nice stuff there. If this instruction set was applied to 256 bit vectors while adding mask registers, i'd be happy.
2) 512 bit wide SIMD registers in CPU is a solution looking for a problem, GPU's are much wider and more efficient in HPC, AVX2/FMA 256 bits are already wide enough.
3) AVX512 drops clocks even harder than AVX2 mode, resulting in strict requirements to code speed up to not loose performance overall.

So it is perfect storm for Intel. They blew area and power while chasing HPC pipe dream, while AMD built a chip that is completely opposite in philosophy - 4 independant pipes that combine in two pairs, but can execute 2+2 mul/add per clock when stars align. And SPEC tests, "Cinebenches" of web testing show the results of that debacle, 4 pipes being more flexible than 3 pipes. Add Epyc memory bw advantage once core counts get high, and...

Ajay · Sep 2, 2017

tamz_msc said:
What more optimization do you need than -O3 and -avx512 on the Intel compiler? It doesn't help at all on real workloads, at least not in its present state. In many cases you're going to be better off with explicit vectorization than relying on compiler autovectorization, which is basically the whole point of avx.

Libraries, software engineers use libraries. That's the whole point of AVX. If you are doing vector math for arrays you'll be using a BLAS libraries of some sort already compiled. Real workloads like data analytics and such. Do you know what you're talking about?

estarkey7 · Sep 2, 2017

Ajay said:
Libraries, software engineers use libraries. That's the whole point of AVX. If you are doing vector math for arrays you'll be using a BLAS libraries of some sort already compiled. Real workloads like data analytics and such. Do you know what you're talking about?

Software engineers write libraries too

!

TheGiant · Sep 2, 2017

tamz_msc said:
According to STREAM triad, it seems to be memory bandwidth/core. What you can do by fiddling with desktop SKL-X is irrelevant here, these Xeons will run as is according to what they officially support. There is little to no added performance improvement in the SPEC2017 suite when using AVX512 over AVX2.

Why irrelevant. SKL-X is basically Xeon without ECC. Otherwise its the same (well except 6CH vs 4CH mem support).

But with Xeon you cannot experiment with so many settings. So I wonder if anyone can do a test with skl-x 8core and 16 core with the same SPEC2017 benches with 4ch memory and with various memory speed and mesh speed settings.

From the results it seems Intel didn't balance the architecture well enough. AVX512 needs 12CH memory controller...

tamz_msc · Sep 2, 2017

Ajay said:
Libraries, software engineers use libraries. That's the whole point of AVX. If you are doing vector math for arrays you'll be using a BLAS libraries of some sort already compiled. Real workloads like data analytics and such. Do you know what you're talking about?

Yeah what "analytics" are you actually doing, could you please elaborate?

Basic Linear Algebra Subprograms(BLAS) have become such a routine feature of computations that it's is actually easier to use library functions instead of writing your own sparse matrix solver, or dusting off your old numerical analysis textbook to figure out that LU-factorization method that people toyed with in the sixties and seventies. Same for FFT.

You can get away with the most sloppy implementation of your gauss-jordans or tookey-cooleys(unless it's huge) and it would not matter because processors are orders of magnitude faster than they were during the times these things were invented.

What is ML but a fancy name for statistics? Just because it's now in fashion to use libraries with hipster-sounding names for doing standard things like parameter estimation doesn't mean that people haven't tried their hand in writing their own libraries to deal with the same issues that proprietary libraries with support for new CPU instructions are addressed at. These routine tasks are rarely a bottleneck to achieving optimal performance.

You have a crazy system of ten second-order PDEs that need to be solved on a staggered grid which requires reconstructing intermediate values through interpolation before each subsequent time-step - now that's your heavy-duty computation right there, and those who understand what it means to navigate these kind of problems know better than to fall for 'over 2X better performance in gemm' bull***p.

Just do a simple math - people are going gaga over 600GFLOP/s DP performance with AVX512 on LINPACK - how much DDR memory BW do you think is needed to sustain that figure?

TheGiant · Sep 2, 2017

estarkey7 said:
Software engineers write libraries too !

Maybe software engineers do, but the latest engineer level doesn't. I am using mostly outside written code for my CFD calculations that is optimized for the HW. Latest one improved performance on my BW Xeon 14C about 12% in general, sometime more than 80% for pre-calculations.

So you guys think AVX512 is in SKL-X/SP implementation a big fat fail and isn't worth a buy?

I generally like more the AMD approach because it may not produce the fastest performance if optimized correctly but let's face it. No coder is doing more than it's needed and if the performance is "acceptable", which is very relative they coders won't do more.
So generally the AMD approach seems better....

Looks to me again Intel is going the P4 way.....

krumme · Sep 2, 2017

In a monopoly situation its darn smart for Intel to take on and fight for the hpc market with avx512.
Then enters epyc and its obvious its dragging you down. Not smart.
But planning for an epyc 10 years in a row and losing billions of opportunities to even greaters competitors is even less smart.
Business is taking risk. Thats where the profit is.
Now its just back in the saddle and take a new direction.

Ajay · Sep 3, 2017

tamz_msc said:
Yeah what "analytics" are you actually doing, could you please elaborate?

Map Reduce for example. Large data sets are always a problem for run of the mill x86 server CPUs - they simply don't have the bandwidth required per socket for HPC-like performance. Kind of a no brainer. I'm not actually doing any sftw dev right now, but I've done plenty.

tamz_msc · Sep 3, 2017

Ajay said:
Map Reduce for example. Large data sets are always a problem for run of the mill x86 server CPUs - they simply don't have the bandwidth required per socket for HPC-like performance. Kind of a no brainer. I'm not actually doing any sftw dev right now, but I've done plenty.

And that is true regardless of how wide your vector SIMD units are. Memory bandwidth increase hasn't kept up with ALU performance increase, and suddenly making registers 2-4X wider than before means that you'll run into memory bottlenecks even faster, which is precisely what is happening with AVX512.

Run-of-the-mill x86 server CPUs can't handle HPC? Then tell me what is actually used for HPC, other than supercomputers?

Topweasel · Sep 3, 2017

JoeRambo said:
2) 512 bit wide SIMD registers in CPU is a solution looking for a problem, GPU's are much wider and more efficient in HPC, AVX2/FMA 256 bits are already wide enough.

More specifically it one or two volume partners specifically asking for those registers. It really wasn't meant for general computing and even general server work. It was meant for a handful of very specific tasks by specific companies. That's why Intel fractured it in the first place to support portions of it on products meant for certain markets and therefore their specific vendors (and because including all of it in one device might double core sizes). It's kind of irritating because so many people without knowing what AVX512 is treat it like the next big thing in computing, when in reality its probably more of an anchor for any device that adopts it.

itsmydamnation · Sep 3, 2017

Arachnotronic said:
Why?

Skylake-X is exactly like Vega, its not that its bad, its just at either end of the spectrum there are more targeted options. On the scalar/low vector width sides there is Zen and on the really wide vector side their is GP100 and GV100. For Vega there is GP104 on the optimized game side and GP100 on the optimized compute side, Vega floats in the middle.

But lets be honest here heap of orgs will by skylake-X regardless, it will take a few generations of deficit to move the more conservative orgs .

AMD EPYC Server Processor Thread - EPYC 7000 series specs and performance leaked

Diamond Member

Elite Member

Diamond Member

Elite Member

Lifer

Elite Member

Diamond Member

Golden Member

Elite Member

Diamond Member

Elite Member

Golden Member

Diamond Member

Lifer

Golden Member

Lifer

Member

Senior member

Diamond Member

Senior member

Diamond Member

Lifer

Diamond Member

Diamond Member

Diamond Member