Intel Knights Landing Yields Big Bang For The Buck Jump

ShintaiDK · Jun 20, 2016

http://www.nextplatform.com/2016/06/20/intel-knights-landing-yields-big-bang-buck-jump/

Wuischpard said that Intel expects to ship more than 100,000 Xeon Phi units this year into the HPC market, and there is a good chance that more than a few hyperscalers are going to buy a bunch, too, for machine learning and possibly other workloads. More than 30 software vendors have ported their code to the new chip, and others will no doubt follow. And more than 30 system makers are bending metal around the Knights Landing processors.

We also hear that Intel may do something special with Knights Landing for the machine learning crowd from a keynote address at the ISC 2016 conference today, so stay tuned for that.

The single-socket Knights Landing processor is compatible with both the Linux and Windows Server operating systems that dominate datacenters today, and indeed any application that has been certified to run on either can run on a Knights Landing. That opens up the market for them pretty wide, too.

Now we find out how customers will use it.

TheRyuu · Jun 20, 2016

While not specifically about Knights Landing, this is still relevant because the chip includes it: AVX-512 is interesting because, unlike a lot of other instruction sets, it seems to have put ease of programming first rather than ease of implementation (on the processor). It's been said that AVX-512 is nicer to use than some previous instruction sets that Intel has introduced over the years.

Nothingness · Jun 21, 2016

Performance wise this is significantly behind NVIDIA P100, even for power efficiency.

On the other hand, the price looks better and, of course, ease of use should also be better.

IntelUser2000 · Jun 21, 2016

http://www.nextplatform.com/wp-content/uploads/2015/08/intel-knights-landing-performance.jpg

They improved on the SpecFP_Rate performance. On the early graph, they were showing about 80% of 2P Xeon E5 2697 v3 performance. Now, it's on par. I guess someone could say that the arrow meant they'd do better... the good thing is they did.

No doubt the 14nm delays pushed this product back, likely by nearly a year! We should have seen this announcement last year.

Things are not that bad though. SpecFP_Rate is general purpose enough, and IVB vs HSW comparisons show there's no gain to be had from FMA. If the performance across individual SpecFP results is similar to 2P Xeon E5 2697 v3, then for folks that use applications similar to SpecFP subtests, they'll be able to run applications with no recompile and have the opportunity to recompile for much better performance later, all in a 1P platform!

http://ark.intel.com/products/94034/Intel-Xeon-Phi-Processor-7230-16GB-1_30-GHz-64-core

PCI Express Revision 3.0
PCI Express Configurations ‡ x16 port (Port and 3) may negotiate down to x8, x4, x2, or x1. x4 port (Port1) may negotiate down to x2, or x1
Max # of PCI Express Lanes 36

1x Xeon Phi 7250 + 2x Nvidia Pascal P100 PCI Express

Imagine that haha. It is a full CPU not a co-processor anymore.

Nothingness · Jun 21, 2016

IntelUser2000 said:
If the performance across individual SpecFP results is similar to 2P Xeon E5 2697 v3, then for folks that use applications similar to SpecFP subtests, they'll be able to run applications with no recompile and have the opportunity to recompile for much better performance later, all in a 1P platform!

No recompile? Then it's easy to see where it would stand: Atom-level performance. And a slow clocked one.

SAAA · Jun 21, 2016

Nothingness said:
No recompile? Then it's easy to see where it would stand: Atom-level performance. And a slow clocked one.

It's not really your standard Atom, considering it supports AVX512 instructions and four threads per core. Yes, the clock speed is slow, but this version should be a good 50% ahead of Airmont cores.
Goldmont is already 30% better, and if I remember correctly this processor was derived from the latest Atom architecture.
It's easily Core-level performance, at only 1.5GHz, but with 72 cores!

IntelUser2000 said:
1x Xeon Phi 7250 + 2x Nvidia Pascal P100 PCI Express

Imagine that haha. It is a full CPU not a co-processor anymore.

It would be fun to test Ashes on such a monster... does it scale to 288 threads? :sneaky:

IntelUser2000 · Jun 21, 2016

It's a different core really. They call it "heavily modified Silvermont". Silvermont, because that's what would have been available for design back when they conceived the idea.

A Goldmont successor is in the works apparently. Probably for the successor, Knights Hill.

Also the ARK site shows a Turbo mode. 200MHz above base.

They already have a developer system for $5K, a desktop tower. Windows Server support is rumored, so what you are suggesting isn't far-fetched. 🙂

Nothingness · Jun 21, 2016

SAAA said:
It's not really your standard Atom, considering it supports AVX512 instructions and four threads per core. Yes, the clock speed is slow, but this version should be a good 50% ahead of Airmont cores.
Goldmont is already 30% better, and if I remember correctly this processor was derived from the latest Atom architecture.
It's easily Core-level performance, at only 1.5GHz, but with 72 cores!

Now explain to us how, without recompiling, you can benefit from AVX-512 and 72 cores 😉

ShintaiDK · Jun 21, 2016

Nothingness said:
Now explain to us how, without recompiling, you can benefit from AVX-512 and 72 cores 😉

You have been able to generate AVX512 code for a long time before its launch early this year 😉

Nothingness · Jun 21, 2016

ShintaiDK said:
You have been able to generate AVX512 code for a long time before its launch early this year 😉

And I'm sure if I pick a random pre-compiled package on any Linux distro, I'll get AVX-512 support 😉

Anyway that'd not give you good MT support. Taking the best out of Phi is difficult no matter what. Most likely easier than CUDA, but still it's not a matter of recompiling.
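To make the pre-compiled package point concrete, here is a minimal C sketch of the kind of run-time check a binary would need to carry in order to pick an AVX-512 code path on its own. It assumes GCC or Clang on an x86 target, where `__builtin_cpu_supports` is available; `has_avx512f` and `pick_kernel` are hypothetical names made up for illustration, and on other targets the sketch simply reports no support.

```c
/* Returns 1 if the CPU running this binary reports AVX-512 Foundation
 * support, 0 otherwise. __builtin_cpu_supports is a GCC/Clang builtin
 * for x86 targets; elsewhere this sketch just returns 0. */
int has_avx512f(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512f") ? 1 : 0;
#else
    return 0;
#endif
}

/* Hypothetical dispatcher: names the code path a program would take.
 * A real program would call a different function per path, and the
 * AVX-512 variant would still have to exist in the binary already. */
const char *pick_kernel(void)
{
    return has_avx512f() ? "avx512" : "generic";
}
```

A binary built without any such check and without an AVX-512 variant of its hot loops just runs the generic path, whatever the CPU supports.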

It's also interesting to note that the TOP500 KNL result shows it's farther from its peak rate than CUDA-based accelerators are, and you can bet Intel optimized the most out of LINPACK.

ShintaiDK · Jun 21, 2016

The things you run on these in the first place are inherently multithreaded by nature.

DrMrLordX · Jun 21, 2016

Nothingness said:
Now explain to us how, without recompiling, you can benefit from AVX-512 and 72 cores 😉

Well okay, I'll try.

Let's say you have an "embarrassingly parallel" workload like . . . Dr. Ian Cutress' 3DPM v2. Now depending on how you code it, you can set it up to spawn as many threads as there are processors to handle them. Or 1.5x as many actually, since that speeds things up sometimes.

So you can at least scale to 288 threads that way.
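A rough C sketch of that thread-count arithmetic, not taken from 3DPM itself: all three function names are made up for illustration, and `_SC_NPROCESSORS_ONLN` is a common extension (Linux, macOS, the BSDs) rather than something every platform guarantees.

```c
#include <unistd.h>

/* Hardware threads exposed by a chip: cores times SMT threads per core. */
long hardware_threads(long cores, long threads_per_core)
{
    return cores * threads_per_core;
}

/* Worker count for an oversubscription factor (1.0 = one worker per
 * hardware thread, 1.5 = the "1.5x as many" variant), never below 1. */
long worker_count(long hw_threads, double factor)
{
    long n = (long)((double)hw_threads * factor);
    return n < 1 ? 1 : n;
}

/* Logical processors currently online, as reported by the OS. */
long online_processors(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n < 1 ? 1 : n;
}
```

With 72 cores and four threads per core, `hardware_threads(72, 4)` gives 288, and `worker_count(288, 1.5)` gives 432. A program written this way would size its pool from `online_processors()` at startup, which is why it scales to a bigger chip with no rebuild.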

As for AVX512, it depends on how you set up your code and how well your compiler autovectorizes. But if you have simple loops with lots of iterations, the compiler can probably do it for you if the code is sufficiently free of branches breaking up the add/multiply operations.
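For illustration, this is the sort of loop I mean, as a minimal C sketch. The function name `scale_add` is made up, and whether a given compiler actually emits AVX-512 for it depends on the compiler version and flags (something like `gcc -O3 -mavx512f`, as an assumption about the toolchain), so check the generated assembly rather than taking it on faith.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for n elements.
 * restrict promises the compiler that x and y do not overlap, which
 * removes one common obstacle to autovectorization. */
void scale_add(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];   /* no branches: one multiply, one add */
    }
}
```

The loop body has no branches and no dependency between iterations, so the compiler is free to process many elements per instruction at whatever vector width the target offers.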

The Java version I did of 3DPM v1 Stage 1 was specifically aimed at AVX2 with computational blocks broken into octets of sorts, and the JVM did support AVX/AVX2, so it optimized with that target in realtime. Not sure if it could go the extra mile and optimize that for AVX512 if/when the JVM supports that ISA extension. It will sooner or later, assuming Oracle continues to cooperate with Intel as they have in the past. Maybe Java 9 supports AVX512? Hmm:

https://bugs.openjdk.java.net/browse/JDK-8081247

Looks like support for AVX512 has been in the works for a while actually.

Nothingness · Jun 21, 2016

DrMrLordX said:
Well okay, I'll try.

OK, I pick Cinebench... Still don't get it? 🙂

SAAA · Jun 21, 2016

Nothingness said:
OK, I pick Cinebench... Still don't get it? 🙂

It's a server chip. Any program that already runs on dozens of cores and threads (like the dual Xeons they compared with, which have 28 cores!) will scale to just twice the amount. If it doesn't, either you don't need this chip or you have to recompile/fix the algorithms, in which case the speedup may easily cover any initial cost for reprogramming.

The Stilt · Jun 21, 2016

Cannonlake and Skylake-E are the first CPUs which have AVX-512 enabled, right?

jhu · Jun 21, 2016

The Stilt said:
Cannonlake and Skylake-E are the first CPUs which have AVX-512 enabled, right?

Xeon Phi actually

The Stilt · Jun 21, 2016

jhu said:
Xeon Phi actually

Outside Xeon Phi, obviously.

Burpo · Jun 21, 2016

", Cannonlake processors will be the first processors to have AVX 512 support from launch."

Intel Skylake-S Processors will have AVX 512 disabled

http://wccftech.com/mainstream-intel-core-processors-support-avx-512-skylake-xeon/

jpiniero · Jun 21, 2016

Burpo said:
", Cannonlake processors will be the first processors to have AVX 512 support from launch."
http://wccftech.com/mainstream-intel-core-processors-support-avx-512-skylake-xeon/

That's not what the article says. Skylake Server does support AVX-512. Cannonlake supports additional instructions though, at least in its full implementation.

ShintaiDK · Jun 22, 2016

The Stilt said:
Cannonlake and Skylake-E are the first CPUs which have AVX-512 enabled, right?

I wouldn't count on AVX512 outside the server segment yet. Unless you meant Cannonlake-E 😉

The Stilt · Jun 22, 2016

ShintaiDK said:
I wouldn't count on AVX512 outside the server segment yet. Unless you meant Cannonlake-E 😉

:\

DrMrLordX · Jun 22, 2016

Nothingness said:
OK, I pick Cinebench... Still don't get it? 🙂

Why not 3DPM v2 instead? It should scale better with core count than Cinebench which, if I recall, can max out beyond a certain number of cores (I know R10 does, not sure if that applies to R15).

IntelUser2000 · Jun 22, 2016

Knights Landing is FULLY compatible with most Intel CPUs, all the way up to AVX2. With AVX3-512 it diverges a bit, but it shares common instructions with Skylake. And it's going to benefit from that, unlike Knights Corner.

They did a great thing by beefing up from a meagre P54 to a Silvermont one with updated ISA. Performance results show that each KNL core is 3-3.5x as fast as a KNC core.

-Itanium: OK performance with native, still lot of effort to optimize. You needed native to be useful at all. Poor x86 performance.
-Atom on Android: Entrenched ARM user base and code. Software performance with ARM native code was much better, but ecosystem is massively in favor of ARM

With Knights Landing, you don't have the biggest issues that plagued Itanium and Atom on Android. SpecFP results are showing decent performance with basically no optimization. Sure, there is some. But Haswell results show that things like FMA aren't being used. Unlike with Itanium, SpecFP isn't its best result, because it's quite a general-purpose benchmark. AVX3 is going to be developed on the most widely used server platform, and they can "specialize" with KNL-specific AVX3.

jhu · Jun 23, 2016

DrMrLordX said:
Why not 3DPM v2 instead? It should scale better with core count than Cinebench which, if I recall, can max out beyond a certain number of cores (I know R10 does, not sure if that applies to R15).

You didn't explain how a program can use AVX512 and 72 cores without recompiling. That was the challenge proposed. What you explained was how a program can use AVX512 or/xor 72 cores without recompiling.

DrMrLordX · Jun 23, 2016

jhu said:
You didn't explain how a program can use AVX512 and 72 cores without recompiling. That was the challenge proposed. What you explained was how a program can use AVX512 or/xor 72 cores without recompiling.

I don't see the two explanations as being mutually exclusive. If you have a program like 3DPM v2 with multi-iteration loops using something like OpenMP on an autovectorizing compiler, all you need to do is recompile using a newer compiler that supports AVX512 and boom, you have both.
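Something like this minimal C sketch, which is not 3DPM code (the function `dot` is just a stand-in): one OpenMP directive asks for both the threading and the SIMD, and the source stays the same whichever vector ISA the compiler targets. It assumes a compiler with OpenMP 4.0 support, enabled with a flag such as `-fopenmp` on GCC.

```c
#include <stddef.h>

/* Dot product where a single directive requests thread-level and
 * SIMD-level parallelism. If OpenMP is not enabled at compile time
 * the pragma is skipped and the loop simply runs serially. */
double dot(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
#ifdef _OPENMP
    #pragma omp parallel for simd reduction(+:sum)
#endif
    for (size_t i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Note that a parallel reduction adds the terms in a different order than the serial loop, so for general floating-point inputs the last few bits of the result can differ between builds.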

Or if you had used Java to begin with, all you have to do is run the bench under Java 9 without having to recompile anything.

The two different versions of AVX512 may complicate things a bit, or maybe not.
