Intel Knights Landing Yields Big Bang For The Buck Jump

jhu · Jun 23, 2016

DrMrLordX said:
I don't see the two explanations as being mutually exclusive. If you have a program like 3DPM v2 with multi-iteration loops using something like OpenMP on an autovectorizing compiler, all you need to do is recompile using a newer compiler that supports AVX512 and boom, you have both.

The challenge was to do it without recompiling. Also, "and" and "or" are not the same thing.

DrMrLordX said:
Or if you had used Java to begin with, all you have to do is run the bench under Java9 without having to recompile anything.

This is the only solution that I can see as long as the JVM supports AVX512.

Nothingness · Jun 23, 2016

DrMrLordX said:
I don't see the two explanations as being mutually exclusive. If you have a program like 3DPM v2 with multi-iteration loops using something like OpenMP on an autovectorizing compiler, all you need to do is recompile using a newer compiler that supports AVX512 and boom, you have both.

That's exactly what I have been saying since the beginning: you have to recompile.

And in fact, you'll also have some heavy tuning to do because you can't rely on the compiler to vectorize existing code without some massaging. But that's another discussion.

Nothingness · Jun 23, 2016

jhu said:
This is the only solution that I can see as long as the JVM supports AVX512.

That'd be cheating :biggrin:

And my remark above about the rework needed to benefit from 512-bit vectors still applies to the Java source.

Headfoot · Jun 23, 2016

If you have control of the application, recompiling should be trivially easy. If you don't, and it's a highly threaded application that may benefit from AVX512 and your vendor doesn't recompile, your vendor sucks and you should call their support line and complain.

The recompile argument smacks of grasping at straws.

DrMrLordX · Jun 23, 2016

Eh. It depends on who is maintaining the benchmark at that point. "Just recompile" becomes "wait for the code maintainer to recompile". Then it's a matter of whenever support for AVX512 is added to the autovectorization routines for the compiler(s) used by whoever maintains the code.

Obviously you are not going to see the benefits of AVX512 running an old Cinebench since Maxon is not going to recompile for it. They do not even use AVX2 anyway . . .

As for massaging the code, if you look at some of Intel's own style guides for programmers, there isn't much they recommend aside from reducing code dependency in loops. Stuff they wrote in collaboration with Oracle to showcase Java's support for AVX2 was pretty simplistic.

I could see any of those examples featuring enough loop iterations for the JVM to optimize to AVX512, especially in the cases where they use 64-bit operands.

PPB · Jun 23, 2016

Imagine taking one of those for a ride with Vray and other CPU oriented renderers....... things would be blazing fast. This needs a prosumer variant soon.

IntelUser2000 · Jul 12, 2016

Interestingly, the first post with the TFlops figure quoted is incorrect. Only the Xeon Phi 7290 model achieves the promised 3TFlops figure.

The Xeon Phi 7250 for example...

1.6GHz 1 core frequency
1.5GHz 1-2 core frequency
1.4GHz all core frequency

Not mentioned is AVX frequency, which for this part is 1.2GHz.

AVX frequency:
7290: 1.3GHz
7250: 1.2GHz
7210/7230: 1.1GHz

It explains the seemingly "low efficiency" of measuring only 2TFlops versus 3TFlops theoretical. For the 7210 and 7230 the peak flops is only 2.25TFlops. For the 7250 it's 2.611TFlops. The "efficiency" is actually in the 80-90% range typical of regular Xeons.
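
As a rough check of those peak figures, here is a minimal Python sketch. The AVX clocks come from the list above. The core counts (72/68/64) and the per-core factor of 32 DP flops per cycle (8 DP lanes x 2 for FMA x 2 vector units) are assumptions taken from Intel's published specs and from the arithmetic used elsewhere in this thread, not from this post.

```python
# Peak DP FLOPS = cores * AVX clock * DP flops per core per cycle.
# Assumed per-core factor: 8 DP lanes * 2 (FMA) * 2 vector units = 32.
FLOPS_PER_CORE_PER_CYCLE = 8 * 2 * 2

# model -> (cores, AVX clock in GHz).
# Clocks are from the post; core counts are assumed from Intel's spec sheets.
MODELS = {
    "7290": (72, 1.3),
    "7250": (68, 1.2),
    "7230": (64, 1.1),
    "7210": (64, 1.1),
}


def peak_dp_gflops(cores, avx_ghz):
    """Theoretical peak double-precision GFLOPS at the AVX clock."""
    return cores * avx_ghz * FLOPS_PER_CORE_PER_CYCLE


if __name__ == "__main__":
    for model, (cores, ghz) in MODELS.items():
        print(f"{model}: {peak_dp_gflops(cores, ghz) / 1000:.3f} TFLOPS")
```

Under those assumptions only the 7290 lands near 3 TFlops; the other models come out at roughly 2.6 and 2.25.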

Obviously they stumbled a bit with 14nm problems and associated delays. A Knights Landing part with 20% higher average performance a year ago would have been very impressive.

Nothingness · Jul 12, 2016

IntelUser2000 said:
Not mentioned is AVX frequency, which for this part is 1.2GHz.

AVX frequency:
7290: 1.3GHz
7250: 1.2GHz
7210/7230: 1.1GHz

Where did you get that from?

If it's correct then Intel lied when quoting their peak DP GFLOPS, which I'd find surprising.

Nothingness · Jul 12, 2016

It looks like you are right:

http://www.intel.com/content/dam/ww...t-briefs/xeon-phi-processor-product-brief.pdf

Code:

      Freq*
7290  1.5 GHz
7250  1.4 GHz
7230  1.3 GHz
7210  1.3 GHz

*Frequency listed is nominal (non-AVX) TDP frequency.
For all-tile turbo frequency, add 100 MHz. For single-tile turbo
frequency, add 200 MHz. For high-AVX instruction frequency,
subtract 200 MHz.

I'm surprised...
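
To make the footnote concrete, here is a small Python sketch that applies its offsets to the nominal clocks in the table. The function name and the dictionary layout are just for illustration; the numbers are the ones from the product brief quoted above.

```python
# Nominal (non-AVX) TDP frequencies in GHz, from the product brief table.
NOMINAL_GHZ = {"7290": 1.5, "7250": 1.4, "7230": 1.3, "7210": 1.3}


def derived_clocks_mhz(nominal_ghz):
    """Apply the footnote offsets to a nominal clock.

    Works in integer MHz to avoid floating-point noise.
    """
    nominal = round(nominal_ghz * 1000)
    return {
        "nominal": nominal,
        "all_tile_turbo": nominal + 100,     # footnote: add 100 MHz
        "single_tile_turbo": nominal + 200,  # footnote: add 200 MHz
        "high_avx": nominal - 200,           # footnote: subtract 200 MHz
    }


if __name__ == "__main__":
    for model, ghz in NOMINAL_GHZ.items():
        print(model, derived_clocks_mhz(ghz))
```

The high-AVX values this gives (1300, 1200, 1100 MHz) line up with the AVX frequencies IntelUser2000 listed.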

DrMrLordX · Jul 12, 2016

Huh. So KNL has to downclock for AVX512? Disturbing actually.

jhu · Jul 12, 2016

DrMrLordX said:
Huh. So KNL has to downclock for AVX512? Disturbing actually.

Maybe not that surprising given that GPUs are also massively parallel with low clocks. My Radeon RX 480 is also only 1.2 GHz.

Edrick · Jul 13, 2016

DrMrLordX said:
Huh. So KNL has to downclock for AVX512? Disturbing actually.

So you lose 200MHz of clock speed for almost double the flops (with optimized code). Haven't Xeons been doing that since Haswell?

Nothingness · Jul 13, 2016

Edrick said:
So you lose 200MHz of clock speed for almost double the flops (with optimized code). Haven't Xeons been doing that since Haswell?

It started with Broadwell IIRC. The difference here is that 200 MHz is more than 10% of base clock so the impact is higher than on most Xeon chips.

But I agree this had to be expected: we are talking of 2*(68-72) 512-bit FMA units on a die, that's massive. Intel nonetheless missed their 3+ TFLOPS target by a little :biggrin:

mikk · Jul 13, 2016

Nothingness said:
Intel nonetheless missed their 3+ TFLOPS target by a little :biggrin:

They didn't miss, 7290 is still easily over 3+ TFLOP.

TeknoBug · Jul 13, 2016

"Knights Landing"? Whoever came up with the codename is hooked on GoT.

But those are crazy crazy CPUs, would make a great ESXi server.

Nothingness · Jul 13, 2016

mikk said:
They didn't miss, 7290 is still easily over 3+ TFLOP.

No it isn't: 1.3 GHz * 72 cores * 8 DP * 2 FMA * 2 units = 2995.2 GFLOPS. I call that 3 TFLOPS, not 3+ (and certainly not the 3.46 TFLOPS found in the original post of this thread). As far as I'm concerned they missed their targets (though I still find the chip impressive).

EDIT: to clarify, the original post that contains the PEAK DP seems to be coming from a web site not directly from Intel.

jhu · Jul 13, 2016

TeknoBug said:
"Knights Landing"? Whoever came up with the codename is hooked on GoT.

But those are crazy crazy CPUs, would make a great ESXi server.

Sure, if you don't mind running on a bunch of Atom cores at 1.2 GHz.

DrMrLordX · Jul 13, 2016

Edrick said:
So you lose 200MHz of clock speed for almost double the flops (with optimized code). Haven't Xeons been doing that since Haswell?

I expect "general purpose" CPUs to maybe have to do this for AVX/AVX2. They are built for more than just enormous, highly-parallel series of calculations. Xeon Phi is built for a fairly narrow set of use cases without the need to support a ton of legacy code (though the fact that it can handle some legacy code is kinda neat, I guess).

You would think that anyone "serious" about getting the most out of Phi would be using it only for AVX512 workloads. Any clockspeed potential for non-AVX512 workloads would be mostly irrelevant.

Edrick · Jul 13, 2016

DrMrLordX said:
You would think that anyone "serious" about getting the most out of Phi would be using it only for AVX512 workloads. Any clockspeed potential for non-AVX512 workloads would be mostly irrelevant.

I agree, non-AVX speeds seem to be irrelevant for most purposes of these chips. I suppose Intel could have just stated the base speed is 1.2GHz with a turbo boost of 200MHz for non-AVX workloads. :\

IntelUser2000 · Jul 14, 2016

SpecFP comparisons: Xeon Phi 7250 versus Xeon E5 v4.

(Base/Peak)
Xeon Phi 7250 68 cores: 842/870
Xeon E5 2699 v4 1P 22 cores: 551/568
Xeon E5 2699 v4 2P 44 cores: 1100/1130

Compare that to Xeon Phi Knights Corner.

http://iopscience.iop.org/article/10.1088/1742-6596/513/5/052024/pdf

According to the paper above, HEPSPEC06 is a Spec06 equivalent for High Energy Nuclear Physics. A search also indicates that Spec isn't easily vectorized.

Xeon Phi KNC: 140-210
4 core 3.1GHz Haswell Xeon (looks like a Xeon E3 1220 v3): 119

SpecFP Xeon E3 1220 v3: 137/139

Xeon Phi extrapolated assuming similar ratios: 161-245
Ivy Bridge Xeon E5-2697 v2 1P 12 cores: 335/345

So with Knights Corner, the basically unoptimized but high bandwidth/high thread loving SpecFP would have got 48-73% of a single Xeon E5 of its day. With Knights Landing, it does 1.5-1.55x the performance of the top 1P Xeon of its day. This is its strength. It's basically "out of the box", with better performance than Xeon, and the chance to vectorize, which will boost it further.
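
For anyone who wants to retrace the ratios, here is a short Python sketch using only the scores quoted in this post. How the 161-245 range was built (low end from the E3's base score, high end from its peak score) and the choice to compare against the Ivy Bridge base score are my assumptions about the arithmetic, not something stated above.

```python
# SpecFP (base, peak) scores as quoted in this post.
KNL_7250 = (842, 870)
BDW_E5_2699V4_1P = (551, 568)
IVB_E5_2697V2_1P = (335, 345)

# HEPSPEC06 figures and the E3-1220 v3 SpecFP scores, also from this post.
KNC_HEPSPEC = (140, 210)
E3_HEPSPEC = 119
E3_SPECFP = (137, 139)


def extrapolate_knc_specfp():
    """Scale the KNC HEPSPEC06 range by the E3's SpecFP/HEPSPEC06 ratio.

    Assumption: low end uses the E3 base score, high end the peak score.
    """
    low = KNC_HEPSPEC[0] * E3_SPECFP[0] / E3_HEPSPEC
    high = KNC_HEPSPEC[1] * E3_SPECFP[1] / E3_HEPSPEC
    return low, high


def ratios(a, b):
    """Element-wise ratio of two (base, peak) score pairs."""
    return tuple(x / y for x, y in zip(a, b))


if __name__ == "__main__":
    low, high = extrapolate_knc_specfp()
    ivb_base = IVB_E5_2697V2_1P[0]
    print(f"KNC extrapolated SpecFP: {low:.0f}-{high:.0f}")
    print(f"KNC vs Ivy Bridge 1P (base): {low / ivb_base:.0%}-{high / ivb_base:.0%}")
    base, peak = ratios(KNL_7250, BDW_E5_2699V4_1P)
    print(f"KNL vs Broadwell 1P: {base:.2f}x base, {peak:.2f}x peak")
```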

Further comparison also shows that the Xeon Phi 7250 is consistently faster than the Broadwell Xeon E5 even in subtests of the SpecFP benchmark.
