Intel Knights Landing Yields Big Bang For The Buck Jump

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
http://www.nextplatform.com/2016/06/20/intel-knights-landing-yields-big-bang-buck-jump/

Wuischpard said that Intel expects to ship more than 100,000 Xeon Phi units this year into the HPC market, and there is a good chance that more than a few hyperscalers are going to buy a bunch, too, for machine learning and possibly other workloads. More than 30 software vendors have ported their code to the new chip, and others will no doubt follow. And more than 30 system makers are bending metal around the Knights Landing processors.

We also hear that Intel may do something special with Knights Landing for the machine learning crowd from a keynote address at the ISC 2016 conference today, so stay tuned for that.

The single-socket Knights Landing processor is compatible with both the Linux and Windows Server operating systems that dominate datacenters today, and indeed any application that has been certified to run on either can run on a Knights Landing. That opens up the market for them pretty wide, too.

Now we find out how customers will use it.

intel-knights-landing-overview.jpg

intel-xeon-phi-firsts.jpg

intel-xeon-phi-compare-table.jpg

intel-xeon-phi-performance.jpg

intel-xeon-phi-ecosystem.jpg
 

TheRyuu

Diamond Member
Dec 3, 2005
5,479
14
81
This isn't specifically about Knights Landing, but it's relevant because the chip includes it: AVX-512 is interesting because, unlike a lot of other instruction sets, it seems to have put ease of programming ahead of ease of implementation (on the processor). It's been said that AVX-512 is nicer to use than some previous instruction sets that Intel has introduced over the years.
 

Nothingness

Platinum Member
Jul 3, 2013
2,413
748
136
Performance-wise, this is significantly behind NVIDIA's P100, even on power efficiency.

On the other hand, the price looks better and, of course, ease of use should also be better.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
http://www.nextplatform.com/wp-content/uploads/2015/08/intel-knights-landing-performance.jpg

They improved on the SpecFP_Rate performance. On the early graph, they were showing about 80% of 2P Xeon E5 2697 v3 performance. Now it's on par. I guess someone could say that the arrow meant they'd do better... the good thing is they did.

No doubt 14nm delayed this product, likely by nearly a year! We should have seen this announcement last year.

Things are not that bad though. SpecFP_Rate is general purpose enough, and IVB vs HSW comparisons show there's no gain to be had from FMA. If the performance across individual SpecFP results is similar to 2P Xeon E5 2697 v3, then folks who use applications similar to the SpecFP subtests will be able to run them with no recompile and have the opportunity to recompile for much better performance later, all on a 1P platform!


PCI Express Revision: 3.0
PCI Express Configurations: x16 port (Port and 3) may negotiate down to x8, x4, x2, or x1. x4 port (Port1) may negotiate down to x2 or x1.
Max # of PCI Express Lanes: 36

1x Xeon Phi 7250 + 2x Nvidia Pascal P100 PCI Express

Imagine that haha. It is a full CPU not a co-processor anymore.
 

Nothingness

Platinum Member
Jul 3, 2013
2,413
748
136
If the performance across individual SpecFP results is similar to 2P Xeon E5 2697 v3, then folks who use applications similar to the SpecFP subtests will be able to run them with no recompile and have the opportunity to recompile for much better performance later, all on a 1P platform!
No recompile? Then it's easy to see where it would stand: Atom-level performance. And a slow clocked one.
 

SAAA

Senior member
May 14, 2014
541
126
116
No recompile? Then it's easy to see where it would stand: Atom-level performance. And a slow clocked one.

It's not really your standard Atom, considering it supports AVX512 instructions and four threads per core. Yes, the clock speed is low, but this version should be a good 50% ahead of Airmont cores.
Goldmont is already 30% better, and if I remember correctly this processor was derived from the latest Atom architecture.
It's easily core-level performance, at only 1.5GHz, but with 72 cores!

1x Xeon Phi 7250 + 2x Nvidia Pascal P100 PCI Express

Imagine that haha. It is a full CPU not a co-processor anymore.

It would be fun to test Ashes on such a monster... does it scale to 288 threads? :sneaky:
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
It's a different core, really. They call it "heavily modified Silvermont". Silvermont, because that's what would have been available for design back when they conceived the idea.

A Goldmont successor is in the works, apparently. Probably for the successor, Knights Hill.

Also the ARK site shows a Turbo mode. 200MHz above base.

They already have a developer system for $5K, a desktop tower. Windows Server support is rumored, so what you are suggesting isn't far-fetched. :)
 

Nothingness

Platinum Member
Jul 3, 2013
2,413
748
136
It's not really your standard Atom, considering it supports AVX512 instructions and four threads per core. Yes, the clock speed is low, but this version should be a good 50% ahead of Airmont cores.
Goldmont is already 30% better, and if I remember correctly this processor was derived from the latest Atom architecture.
It's easily core-level performance, at only 1.5GHz, but with 72 cores!
Now explain to us how, without recompiling, you can benefit from AVX-512 and 72 cores ;)
 

Nothingness

Platinum Member
Jul 3, 2013
2,413
748
136
You have been able to generate AVX512 code for a long time before its launch early this year ;)
And I'm sure if I pick a random pre-compiled package on any Linux distro, I'll get AVX-512 support ;)

Anyway, that wouldn't give you good MT support. Getting the best out of Phi is difficult no matter what. Most likely easier than CUDA, but still, it's not just a matter of recompiling.

It's also interesting to note that the KNL result on the TOP500 is farther from its peak rate than the CUDA-based accelerators are, and you can bet Intel got the most it could out of LINPACK.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
The things you run on these in the first place are inherently multithreaded.
 

DrMrLordX

Lifer
Apr 27, 2000
21,632
10,845
136
Now explain to us how, without recompiling, you can benefit from AVX-512 and 72 cores ;)

Well okay, I'll try.

Let's say you have an "embarrassingly parallel" workload like . . . Dr. Ian Cutress's 3DPM v2. Depending on how you code it, you can set it up to spawn as many threads as there are processors to handle them. Or 1.5x as many, since that sometimes speeds things up.

So you can at least scale to 288 threads that way.
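
The sizing rule above can be sketched in a few lines. This is only an illustration, not 3DPM's actual code: `worker_count` and `split_particles` are hypothetical helper names, Python stands in for whatever language the benchmark is written in, and the 1.5x factor is simply the oversubscription mentioned above. The 72-core, four-threads-per-core figures are the ones quoted earlier in the thread.

```python
import math
import os


def worker_count(logical_cpus=None, oversubscribe=1.0):
    """Return how many workers to spawn for an embarrassingly parallel job.

    logical_cpus defaults to what the OS reports; oversubscribe > 1.0
    spawns more workers than hardware threads (the "1.5x" idea above).
    """
    if logical_cpus is None:
        logical_cpus = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, math.floor(logical_cpus * oversubscribe))


def split_particles(n_particles, n_workers):
    """Split n_particles into contiguous (start, stop) ranges, one per worker.

    Range sizes differ by at most one particle, and no range depends on
    another, which is what makes the job embarrassingly parallel.
    """
    base, extra = divmod(n_particles, n_workers)
    ranges = []
    start = 0
    for i in range(n_workers):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


if __name__ == "__main__":
    knl_threads = 72 * 4  # 72 cores x 4 threads per core, as quoted in the thread
    print(worker_count(knl_threads))       # one worker per hardware thread
    print(worker_count(knl_threads, 1.5))  # oversubscribed by 1.5x
    print(len(split_particles(1_000_000, worker_count(knl_threads))))
```

Nothing here has to change when the binary moves from a 28-core box to a 72-core one, because the worker count is read from the machine at run time. Whether the extra workers actually help still depends on the workload.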

As for AVX512, it depends on how you set up your code and how well your compiler autovectorizes. But if you have simple loops with lots of iterations, the compiler can probably do it for you if the code is sufficiently free of branches breaking up the add/multiply operations.

The Java version I did of 3DPM v1 Stage 1 was specifically aimed at AVX2 with computational blocks broken into octets of sorts, and the JVM did support AVX/AVX2, so it optimized with that target in realtime. Not sure if it could go the extra mile and optimize that for AVX512 if/when the JVM supports that ISA extension. It will sooner or later, assuming Oracle continues to cooperate with Intel as they have in the past. Maybe Java 9 supports AVX512? Hmm:

https://bugs.openjdk.java.net/browse/JDK-8081247

Looks like support for AVX512 has been in the works for a while, actually.
 

SAAA

Senior member
May 14, 2014
541
126
116
OK, I pick Cinebench... Still don't get it? :)

It's a server chip: any program that already runs on dozens of cores and threads (like the dual Xeons they compared with, which have 28 cores!) will scale to just twice that amount. If it doesn't, either you don't need this chip or you have to recompile/fix the algorithms, in which case the speedup may easily cover the initial cost of reprogramming.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Cannonlake and Skylake-E are the first CPUs which have AVX-512 enabled, right?
 

DrMrLordX

Lifer
Apr 27, 2000
21,632
10,845
136
OK, I pick Cinebench... Still don't get it? :)

Why not 3DPM v2 instead? It should scale better with core count than Cinebench which, if I recall, can max out beyond a certain number of cores (I know R10 does, not sure if that applies to R15).
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Knights Landing is FULLY compatible with most Intel CPUs, all the way up to AVX2. With AVX3-512 it diverges a bit, but it shares common instructions with Skylake. And it's going to benefit from that, unlike Knights Corner.

They did a great thing by beefing up from a meagre P54 core to a Silvermont one with an updated ISA. Performance results show that each KNL core is 3-3.5x as fast as a KNC core.

-Itanium: OK performance with native code, still a lot of effort to optimize. You needed native code for it to be useful at all. Poor x86 performance.
-Atom on Android: Entrenched ARM user base and code. Software performance with ARM native code was much better, and the ecosystem is massively in favor of ARM.

With Knights Landing, you don't have the biggest issues that plagued Itanium and Atom on Android. SpecFP results are showing decent performance with basically no optimization. Sure, there is some. But Haswell results show that things like FMA aren't being used. Unlike with Itanium, SpecFP isn't the best-case result, because it's quite a general purpose benchmark. AVX3 is going to be developed on the most widely used server platform, and they can "specialize" with KNL-specific AVX3.
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
Why not 3DPM v2 instead? It should scale better with core count than Cinebench which, if I recall, can max out beyond a certain number of cores (I know R10 does, not sure if that applies to R15).

You didn't explain how a program can use AVX512 and 72 cores without recompiling. That was the challenge proposed. What you explained was how a program can use AVX512 or/xor 72 cores without recompiling.
 

DrMrLordX

Lifer
Apr 27, 2000
21,632
10,845
136
You didn't explain how a program can use AVX512 and 72 cores without recompiling. That was the challenge proposed. What you explained was how a program can use AVX512 or/xor 72 cores without recompiling.

I don't see the two explanations as being mutually exclusive. If you have a program like 3DPM v2 with multi-iteration loops using something like OpenMP on an autovectorizing compiler, all you need to do is recompile using a newer compiler that supports AVX512 and boom, you have both.

Or if you had used Java to begin with, all you have to do is run the bench under Java 9 without having to recompile anything.

The two different versions of AVX512 may complicate things a bit, or maybe not.
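
As a rough illustration of why the two flavours may or may not matter: code that sticks to the shared subsets runs on both, and anything beyond that ties the binary to one family. The subset lists below are my reading of Intel's public documentation for Knights Landing and the Skylake server parts, not something stated in this thread, and `portable_subset` and `runs_on` are hypothetical helper names.

```python
# AVX-512 subsets per target, as I understand Intel's public documentation
# (an assumption added here, not taken from the thread).
KNL_AVX512 = frozenset({"AVX512F", "AVX512CD", "AVX512ER", "AVX512PF"})
SKX_AVX512 = frozenset({"AVX512F", "AVX512CD", "AVX512BW", "AVX512DQ", "AVX512VL"})


def portable_subset(*targets):
    """Return the AVX-512 subsets that every listed target supports."""
    return frozenset.intersection(*targets)


def runs_on(required, target):
    """True if every subset the code requires is available on the target."""
    return set(required) <= target


if __name__ == "__main__":
    # Subsets a compiler could use while staying portable across both targets.
    print(sorted(portable_subset(KNL_AVX512, SKX_AVX512)))
    # Code built to require AVX512BW would not run on Knights Landing.
    print(runs_on({"AVX512F", "AVX512BW"}, KNL_AVX512))
```

So a build that only requires the common subsets should be fine on both, while a build tuned for one family's extras needs a separate binary or runtime dispatch.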