AMD EPYC Server Processor Thread - EPYC 7000 series specs and performance leaked


tamz_msc

Diamond Member
Jan 5, 2017
So what is the bottleneck of SKL-SP with AVX512? Memory bandwidth? L2 cache? Mesh frequency? L3 cache?

I think it could be tested with an 18C SKL-X and DDR4-2666 as a baseline to find out what is really behind it, since you can change the DDR4 frequency, mesh frequency and other settings that you cannot on a Xeon.

I don't believe avx512 is utterly useless....
According to STREAM triad, it seems to be memory bandwidth per core. What you can do by fiddling with a desktop SKL-X is irrelevant here; these Xeons will run as-is, according to what they officially support. There is little to no performance improvement in the SPEC2017 suite when using AVX512 over AVX2.
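
For reference, the triad kernel itself is tiny - here's a minimal sketch of it (the array size and the OpenMP pragma are my own shorthand, not the official STREAM harness):

Code:
#include <stdio.h>
#include <stdlib.h>

#define N 80000000L  /* ~640 MB per array, far bigger than any cache */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    const double scalar = 3.0;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    /* STREAM "triad": 2 flops per 24 bytes of traffic, so it measures
       memory bandwidth, not ALU width. Compile with -fopenmp. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];

    printf("%f\n", a[0]);  /* keep the compiler from deleting the loop */
    free(a); free(b); free(c);
    return 0;
}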
 

IntelUser2000

Elite Member
Oct 14, 2003
I don't believe avx512 is utterly useless....

I don't either.

tamz_msc: What is the performance increase due to AVX in Sandy Bridge and AVX2 in Haswell for SPEC CPU? The gains, if I remember correctly, were always quite small for SPEC tests. It's in real applications that the benefits were shown.
 

tamz_msc

Diamond Member
Jan 5, 2017
I don't either.

tamz_msc: What is the performance increase due to AVX in Sandy Bridge and AVX2 in Haswell for SPEC CPU? The gains, if I remember correctly, were always quite small for SPEC tests. It's in real applications that the benefits were shown.
They don't test SB or HSW here. Only BD-E, SKL-SP and Naples. SPEC is as close to a real-world benchmark as it gets. Here's the description of one of the benchmarks:

649.fotonik3d_s:
Fotonik3D computes the transmission coefficient of a photonic waveguide using the finite-difference time-domain (FDTD) method for the Maxwell equations. UPML for dielectric materials is used to terminate the computational domain.

The core of the FDTD method is second-order accurate central-difference approximations of Faraday's and Ampère's laws. These central differences are employed on a staggered Cartesian grid, resulting in an explicit finite-difference method. The FDTD method is also referred to as the Yee scheme. It is the standard time-domain method within CEM.
If this isn't real-world, I don't know what is.
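
For the curious, the guts of a Yee/FDTD update really are just two staggered stencil sweeps. A bare-bones 1D sketch (the real benchmark is 3D with UPML absorbing boundaries, which this ignores):

Code:
#include <math.h>
#include <stdio.h>

#define N 1000

int main(void)
{
    static double e[N], h[N];         /* E and H on staggered half-cells */
    const double c1 = 0.5, c2 = 0.5;  /* dt/(mu*dx), dt/(eps*dx), normalized */

    for (int t = 0; t < 2000; t++) {
        e[0] = sin(0.02 * t);            /* hard source at the left edge */
        for (int i = 0; i < N - 1; i++)  /* Faraday's law: curl E updates H */
            h[i] += c1 * (e[i + 1] - e[i]);
        for (int i = 1; i < N; i++)      /* Ampere's law: curl H updates E */
            e[i] += c2 * (h[i] - h[i - 1]);
    }
    printf("%f\n", e[N / 2]);  /* sample the field after 2000 steps */
    return 0;
}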
 
  • Like
Reactions: nnunn

IntelUser2000

Elite Member
Oct 14, 2003
They don't test SB or HSW here. Only BD-E, SKL-SP and Naples. SPEC is as close to a real-world benchmark as it gets. Here's the description of one of the benchmarks:

I agree on SPEC being close to a real-world benchmark. However, AVX has no effect on SPEC. Otherwise we'd have seen outliers in SPECfp results when Sandy Bridge launched, then Haswell.

In fact, there's zero difference between SSE, AVX and AVX2:
https://www.pressreader.com/netherlands/ct-magazine/20170620/283609580195527
http://www.hardware.fr/articles/847-10/performances-spec-gombk-hmmer-sjeng.html

By not being sensitive to vector ISA extensions (which are sort of a specialized function), it allows fair, architecture-based comparisons.
 
Arachnotronic

Mar 10, 2006
By not being sensitive to vector ISA extensions (which are sort of a specialized function), it allows fair, architecture-based comparisons.

Such comparisons are moot. All that matters is delivered performance in your desired workload, and a lot of people who care about performance will update their software to use the new features on the CPU.
 

IntelUser2000

Elite Member
Oct 14, 2003
Such comparisons are moot. All that matters is delivered performance in your desired workload, and a lot of people who care about performance will update their software to use the new features on the CPU.

That is true, but people want to know, and it's also important, because you are talking about a general-purpose chip. A benchmark gives you a ballpark idea. But if you want to know the actual result, sometimes you have to dig deep, really deep. Oftentimes people stop digging when they find the result they themselves find satisfactory.

But in this case, in an enthusiast forum, we talk too much about things that don't help us at all. Like the huge thread about Vega. Who cares about the details we'll never get to know? It's a hobby, that's all.
 

tamz_msc

Diamond Member
Jan 5, 2017
I agree on SPEC being close to a real-world benchmark. However, AVX has no effect on SPEC. Otherwise we'd have seen outliers in SPECfp results when Sandy Bridge launched, then Haswell.

In fact, there's zero difference between SSE, AVX and AVX2:
https://www.pressreader.com/netherlands/ct-magazine/20170620/283609580195527
http://www.hardware.fr/articles/847-10/performances-spec-gombk-hmmer-sjeng.html

By not being sensitive to vector ISA extensions (which are sort of a specialized function), it allows fair, architecture-based comparisons.
AVX didn't really get a proper implementation till Haswell, and AVX2 was spotty until Broadwell. So SB results really don't tell us much; also, those are for the 2006 edition of the benchmark. For example, there are clear differences between AVX2 and SSE3 performance with the Intel compiler in the first heise.de article you linked to, especially on the 7700K. It's just that the margin of improvement is not as huge as PR would have you believe.
Such comparisons are moot. All that matters is delivered performance in your desired workload, and a lot of people who care about performance will update their software to use the new features on the CPU.
Updated software, meaning an updated compiler with new optimization flags, which is not doing much at the moment. It's total fantasy if you believe a researcher is going to rewrite his whole code just so that it can do better with AVX512. If the compiler doesn't do it for him, it's over.
 

JoeRambo

Golden Member
Jun 13, 2013
What more optimization do you need than -O3 and -avx512 on the Intel compiler? It doesn't help at all on real workloads, at least not in its present state. In many cases you're going to be better off with explicit vectorization than relying on compiler autovectorization, which is basically the whole point of avx.

Really? Compiler autovectorization is academic BS that usually requires the stars to align in the code and a bunch of restrict keywords thrown in on top of that. While AVX512 helps with "auto" vectorization of loops, that is not going to happen in generic "real world code" unless the compiler can prove "basics" like two arrays not overlapping each other.
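
To illustrate: the compiler has to assume the first version below may alias, so it either won't vectorize or emits a runtime overlap check; the restrict-qualified version is what actually gets clean SIMD code (a sketch, any C99 compiler at -O3):

Code:
/* Without restrict, dst and src may legally overlap, so the compiler
   must be conservative: no vectorization, or a runtime aliasing check. */
void scale_maybe_alias(double *dst, const double *src, double k, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = k * src[i];
}

/* restrict is the programmer's promise that the arrays don't overlap;
   now the stars align and the loop autovectorizes directly. */
void scale_restrict(double *restrict dst, const double *restrict src,
                    double k, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = k * src[i];
}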

Where AVX and co. shine is in hand-tuned code, usually written by hand or using vendor-provided libraries. And just throwing in AVX512 support randomly can in fact hurt performance, as the glibc guys found out after using AVX512 in memcpy: it lowered clocks for millions of instructions, resulting in a net loss on memory copies and a world of pain when the next context switch had to save/restore a truckload of 512-bit-wide registers.
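
By "hand tuned" I mean stuff like this - a daxpy written directly with AVX512F intrinsics instead of hoping the compiler gets there (a sketch: needs -mavx512f, and it assumes n is a multiple of 8; real code needs a scalar tail loop):

Code:
#include <immintrin.h>

/* y[i] += a * x[i], eight doubles per iteration via one 512-bit FMA. */
void daxpy_avx512(double a, const double *x, double *y, long n)
{
    __m512d va = _mm512_set1_pd(a);         /* broadcast a to all 8 lanes */
    for (long i = 0; i < n; i += 8) {
        __m512d vx = _mm512_loadu_pd(x + i);
        __m512d vy = _mm512_loadu_pd(y + i);
        vy = _mm512_fmadd_pd(va, vx, vy);   /* vy = va * vx + vy */
        _mm512_storeu_pd(y + i, vy);
    }
}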

Updated software, meaning an updated compiler with new optimization flags, which is not doing much at the moment. It's total fantasy if you believe a researcher is going to rewrite his whole code just so that it can do better with AVX512. If the compiler doesn't do it for him, it's over.

Yeah, they will buy expensive hardware and leave up to half of the performance on the table. Sounds about right :)
 
  • Like
Reactions: Arachnotronic

IntelUser2000

Elite Member
Oct 14, 2003
AVX didn't really get a proper implementation till Haswell, and AVX2 was spotty until Broadwell. So SB results really don't tell us much; also, those are for the 2006 edition of the benchmark. For example, there are clear differences between AVX2 and SSE3 performance with the Intel compiler in the first heise.de article you linked to, especially on the 7700K. It's just that the margin of improvement is not as huge as PR would have you believe.

Take a closer look. The FP results don't show the difference. Integer shows under 5% improvement. That suggests to me that integer benefitted from AVX just being a superior ISA over SSE. Since AVX is about the vector extensions, though, it shows zero gains where they should matter.

And answering both you and your reply to Arachnotronic: yes, they do those sorts of optimizations. The Sandy Bridge-EP release had floods of results based on it. Of course the general market as a whole doesn't, but a significant part of the HPC market does.
 

tamz_msc

Diamond Member
Jan 5, 2017
Take a closer look. The FP results don't show the difference. Integer shows under 5% improvement. That suggests to me that integer benefitted from AVX just being a superior ISA over SSE. Since AVX is about the vector extensions, though, it shows zero gains where they should matter.

And answering both you and your reply to Arachnotronic: yes, they do those sorts of optimizations. The Sandy Bridge-EP release had floods of results based on it. Of course the general market as a whole doesn't, but a significant part of the HPC market does.
The HPC market mostly does gcc/gfortran with added flags, unless you're writing code for CERN or LIGO. Once a code reaches production level, it's all about results - going back and using new and updated libraries, which basically amounts to rewriting the whole code, is very rare unless it's a big-name project.
Really? Compiler autovectorization is academic BS that usually requires the stars to align in the code and a bunch of restrict keywords thrown in on top of that. While AVX512 helps with "auto" vectorization of loops, that is not going to happen in generic "real world code" unless the compiler can prove "basics" like two arrays not overlapping each other.

Where AVX and co. shine is in hand-tuned code, usually written by hand or using vendor-provided libraries. And just throwing in AVX512 support randomly can in fact hurt performance, as the glibc guys found out after using AVX512 in memcpy: it lowered clocks for millions of instructions, resulting in a net loss on memory copies and a world of pain when the next context switch had to save/restore a truckload of 512-bit-wide registers.



Yeah, they will buy expensive hardware and leave up to half of the performance on the table. Sounds about right :)
Yeah, and people will go back to rewrite code that has been squeezed dry for 15-20 years, whose original authors are probably dead/retired, just because the uni decided to buy some new workstations bearing the shiny logos of the latest Intel processor. Sounds about right.

Can you give some examples of "non-generic" real-world code?
 

IntelUser2000

Elite Member
Oct 14, 2003
The HPC market mostly does gcc/gfortran with added flags, unless you're writing code for CERN or LIGO. Once a code reaches production level, it's all about results - going back and using new and updated libraries, which basically amounts to rewriting the whole code, is very rare unless it's a big-name project.

Intel probably sees a justification for extending their vector extensions, even if you don't. There's also a big enough group of people doing optimizations to support a separate HPC market based just on Nvidia GPUs.

The AVX extensions are likely a defensive strategy. I agree with what you are saying, that most don't like to rewrite code. And even those that do want to do as little work as possible. By potentially doubling performance in the applications that are starting to become suited for porting to GPUs, they encourage people to stay with Xeons for HPC work. Working with an existing toolset and knowledge base is much easier and cheaper than moving to a completely new one.

And I know it works, because they still have something like 90% of the HPC market.
 

JoeRambo

Golden Member
Jun 13, 2013
Yeah, and people will go back to rewrite code that has been squeezed dry for 15-20 years, whose original authors are probably dead/retired, just because the uni decided to buy some new workstations bearing the shiny logos of the latest Intel processor. Sounds about right.

They are already using some Intel-provided library, and by upgrading it they will benefit from AVX512. And the world does not end with academics; business runs by different rules.

Anyway, this discussion is moot; AVX512 as it is implemented is a grave error by Intel that will cost them dearly.
 

tamz_msc

Diamond Member
Jan 5, 2017
They are already using some Intel-provided library, and by upgrading it they will benefit from AVX512. And the world does not end with academics; business runs by different rules.

Anyway, this discussion is moot; AVX512 as it is implemented is a grave error by Intel that will cost them dearly.
What is it that the libraries do? What are the real-world codes that businesses run that supposedly benefit from said libraries? All I'm asking for is a concrete example.
 
  • Like
Reactions: Drazick

JoeRambo

Golden Member
Jun 13, 2013
What is it that the libraries do? What are the real-world codes that businesses run that supposedly benefit from said libraries? All I'm asking for is a concrete example.

I meant stuff like:

https://www.cloudera.com/downloads/partner/intel.html

so if your app is using those functions from the libraries, you can benefit from a simple library upgrade to AVX512.


I have already explained it a bit in the Skylake thread, but the reasoning is:

1) The instruction set is OK, lots of nice stuff there. If this instruction set were applied to 256-bit vectors while adding the mask registers, I'd be happy.
2) 512-bit-wide SIMD registers in a CPU are a solution looking for a problem; GPUs are much wider and more efficient in HPC, and AVX2/FMA at 256 bits is already wide enough.
3) AVX512 drops clocks even harder than AVX2 mode, resulting in strict requirements on how much the code has to speed up to not lose performance overall (rough break-even math below).

So it is a perfect storm for Intel. They blew area and power chasing an HPC pipe dream, while AMD built a chip that is completely opposite in philosophy: 4 independent pipes that combine into two pairs, but can execute 2+2 mul/add per clock when the stars align. And the SPEC tests, the "Cinebenches" of web testing, show the results of that debacle, 4 pipes being more flexible than 3 pipes. Add EPYC's memory bandwidth advantage once core counts get high, and...
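
To put rough numbers on point 3 (clock figures are illustrative, not official): if sustained AVX512 execution drops a core from, say, 3.0 GHz to 2.4 GHz, the vectorized code has to get more than 3.0/2.4 = 1.25x faster per clock just to break even - and the lowered clock also taxes every scalar instruction mixed into the same stretch of code, so the real bar is higher still.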
 
  • Like
Reactions: moinmoin and .vodka

Ajay

Lifer
Jan 8, 2001
What more optimization do you need than -O3 and -avx512 on the Intel compiler? It doesn't help at all on real workloads, at least not in its present state. In many cases you're going to be better off with explicit vectorization than relying on compiler autovectorization, which is basically the whole point of avx.

Libraries; software engineers use libraries. That's the whole point of AVX. If you are doing vector math on arrays you'll be using a BLAS library of some sort, already compiled. Real workloads like data analytics and such. Do you know what you're talking about?
 

estarkey7

Member
Nov 29, 2006
Libraries; software engineers use libraries. That's the whole point of AVX. If you are doing vector math on arrays you'll be using a BLAS library of some sort, already compiled. Real workloads like data analytics and such. Do you know what you're talking about?
Software engineers write libraries too!
 

TheGiant

Senior member
Jun 12, 2017
According to STREAM triad, it seems to be memory bandwidth per core. What you can do by fiddling with a desktop SKL-X is irrelevant here; these Xeons will run as-is, according to what they officially support. There is little to no performance improvement in the SPEC2017 suite when using AVX512 over AVX2.
Why irrelevant? SKL-X is basically a Xeon without ECC. Otherwise it's the same (well, except for 6-channel vs 4-channel memory support).

But with a Xeon you cannot experiment with so many settings. So I wonder if anyone can do a test with an 8-core and a 16-core SKL-X on the same SPEC2017 benches, with 4-channel memory and various memory-speed and mesh-speed settings.

From the results it seems Intel didn't balance the architecture well enough. AVX512 needs a 12-channel memory controller...
 

tamz_msc

Diamond Member
Jan 5, 2017
Libraries; software engineers use libraries. That's the whole point of AVX. If you are doing vector math on arrays you'll be using a BLAS library of some sort, already compiled. Real workloads like data analytics and such. Do you know what you're talking about?
Yeah, what "analytics" are you actually doing? Could you please elaborate?

Basic Linear Algebra Subprograms (BLAS) have become such a routine feature of computations that it is actually easier to use library functions than to write your own sparse matrix solver, or to dust off your old numerical analysis textbook to figure out that LU-factorization method that people toyed with in the sixties and seventies. Same for FFT.
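
Case in point - calling the tuned library is one line, and the library picks the fastest kernel for whatever CPU it finds at runtime (a sketch against the standard CBLAS interface; link MKL or OpenBLAS):

Code:
#include <cblas.h>

/* C = A * B for n x n row-major matrices. One call replaces a
   hand-written triple loop; the library handles blocking, threading
   and whatever SIMD width the CPU actually has. */
void matmul(int n, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A, n,   /* alpha, A, lda */
                B, n,        /* B, ldb */
                0.0, C, n);  /* beta, C, ldc */
}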

You can get away with the sloppiest implementation of your Gauss-Jordans or Cooley-Tukeys (unless it's huge) and it would not matter, because processors are orders of magnitude faster than they were when these things were invented.

What is ML but a fancy name for statistics? Just because it's now in fashion to use libraries with hipster-sounding names for doing standard things like parameter estimation doesn't mean that people haven't tried their hand at writing their own libraries to deal with the same issues that proprietary libraries with support for new CPU instructions are aimed at. These routine tasks are rarely a bottleneck to achieving optimal performance.

You have a crazy system of ten second-order PDEs that needs to be solved on a staggered grid, which requires reconstructing intermediate values through interpolation before each subsequent time-step - now that's your heavy-duty computation right there, and those who understand what it means to navigate these kinds of problems know better than to fall for 'over 2X better performance in gemm' bull***p.

Just do the simple math - people are going gaga over 600 GFLOP/s DP performance with AVX512 on LINPACK - how much DDR memory bandwidth do you think is needed to sustain that figure?
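
For anyone who doesn't want to do it (back-of-envelope, assuming a streaming code at STREAM-triad intensity of 2 flops per 24 bytes): sustaining 600 GFLOP/s would take 600e9 flops/s x 12 bytes/flop = 7.2 TB/s, while six channels of DDR4-2666 deliver about 6 x 8 bytes x 2.666 GT/s ≈ 128 GB/s - a shortfall of over 50x. LINPACK only escapes this because dgemm reuses data out of cache; truly streaming codes don't.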
 
  • Like
Reactions: Drazick

TheGiant

Senior member
Jun 12, 2017
Software engineers write libraries too!
Maybe software engineers do, but engineers at my end of the chain don't. I mostly use externally written code for my CFD calculations that is optimized for the HW. The latest version improved performance on my 14C BW Xeon by about 12% in general, sometimes more than 80% for pre-calculations.

So you guys think AVX512 in the SKL-X/SP implementation is a big fat fail and isn't worth buying?

I generally like the AMD approach more. It may not produce the fastest performance when code is optimized correctly, but let's face it: no coder does more than is needed, and if the performance is "acceptable" (which is very relative), coders won't do more.
So generally the AMD approach seems better....

Looks to me like Intel is again going the P4 way.....
 

krumme

Diamond Member
Oct 9, 2009
In a monopoly situation it's darn smart for Intel to take on and fight for the HPC market with AVX512.
Then EPYC enters, and it's obviously dragging you down. Not smart.
But planning for an EPYC 10 years in a row and losing billions of opportunities to even greater competitors is even less smart.
Business is taking risks. That's where the profit is.
Now it's just a matter of getting back in the saddle and taking a new direction.
 
  • Like
Reactions: Tuna-Fish

Ajay

Lifer
Jan 8, 2001
Yeah what "analytics" are you actually doing, could you please elaborate?

MapReduce, for example. Large data sets are always a problem for run-of-the-mill x86 server CPUs - they simply don't have the bandwidth per socket required for HPC-like performance. Kind of a no-brainer. I'm not actually doing any software dev right now, but I've done plenty.
 

tamz_msc

Diamond Member
Jan 5, 2017
MapReduce, for example. Large data sets are always a problem for run-of-the-mill x86 server CPUs - they simply don't have the bandwidth per socket required for HPC-like performance. Kind of a no-brainer. I'm not actually doing any software dev right now, but I've done plenty.
And that is true regardless of how wide your vector SIMD units are. Memory bandwidth increases haven't kept up with ALU performance increases, and suddenly making registers 2-4X wider than before means that you'll run into memory bottlenecks even faster, which is precisely what is happening with AVX512.

Run-of-the-mill x86 server CPUs can't handle HPC? Then tell me what is actually used for HPC, other than supercomputers?
 
  • Like
Reactions: Drazick

Topweasel

Diamond Member
Oct 19, 2000
2) 512-bit-wide SIMD registers in a CPU are a solution looking for a problem; GPUs are much wider and more efficient in HPC, and AVX2/FMA at 256 bits is already wide enough.

More specifically, it was one or two volume partners specifically asking for those registers. It really wasn't meant for general computing or even general server work. It was meant for a handful of very specific tasks by specific companies. That's why Intel fractured it in the first place, to support portions of it on products meant for certain markets and therefore their specific vendors (and because including all of it in one device might double core sizes). It's kind of irritating, because so many people, without knowing what AVX512 is, treat it like the next big thing in computing, when in reality it's probably more of an anchor for any device that adopts it.
 

itsmydamnation

Diamond Member
Feb 6, 2011

Skylake-X is exactly like Vega: it's not that it's bad, it's just that at either end of the spectrum there are more targeted options. On the scalar/low-vector-width side there is Zen, and on the really wide vector side there are GP100 and GV100. For Vega, there is GP104 on the optimized gaming side and GP100 on the optimized compute side; Vega floats in the middle.

But let's be honest here, a heap of orgs will buy Skylake-X regardless; it will take a few generations of deficit to move the more conservative orgs.