Intel Knights Landing Yields Big Bang For The Buck Jump


jhu

Lifer
Oct 10, 1999
11,918
9
81
I don't see the two explanations as being mutually exclusive. If you have a program like 3DPM v2 with multi-iteration loops using something like OpenMP on an autovectorizing compiler, all you need to do is recompile using a newer compiler that supports AVX512 and boom, you have both.

The challenge was to do it without recompiling. Also, "and" and "or" are not the same thing.

Or if you had used Java to begin with, all you have to do is run the bench under Java9 without having to recompile anything.

This is the only solution that I can see as long as the JVM supports AVX512.
 

Nothingness

Diamond Member
Jul 3, 2013
3,292
2,360
136
I don't see the two explanations as being mutually exclusive. If you have a program like 3DPM v2 with multi-iteration loops using something like OpenMP on an autovectorizing compiler, all you need to do is recompile using a newer compiler that supports AVX512 and boom, you have both.
That's exactly what I have been saying since the beginning: you have to recompile.

And in fact, you'll also have some heavy tuning to do because you can't rely on the compiler to vectorize existing code without some massaging. But that's another discussion :)
 

Nothingness

Diamond Member
Jul 3, 2013
3,292
2,360
136
This is the only solution that I can see as long as the JVM supports AVX512.
That'd be cheating :biggrin:

And my remark above about the rework needed to benefit from 512-bit vectors still applies to the Java source.
 

Headfoot

Diamond Member
Feb 28, 2008
4,444
641
126
If you have control of the application, recompiling should be trivially easy. If you don't, and it's a highly threaded application that may benefit from AVX512 and your vendor doesn't recompile, your vendor sucks and you should call their support line and complain.

The recompile argument smacks of grasping at straws.
 

DrMrLordX

Lifer
Apr 27, 2000
22,702
12,652
136
Eh. It depends on who is maintaining the benchmark at that point. "Just recompile" becomes "wait for the code maintainer to recompile". Then it's a matter of whenever support for AVX512 is added to the autovectorization routines for the compiler(s) used by whoever maintains the code.

Obviously you are not going to see the benefits of AVX512 running an old Cinebench since Maxon is not going to recompile for it. They do not even use AVX2 anyway . . .

As for massaging the code, if you look at some of Intel's own style guides for programmers, there isn't much they recommend aside from reducing code dependency in loops. Stuff they wrote in collaboration with Oracle to showcase Java's support for AVX2 was pretty simplistic.

I could see any of those examples featuring enough loop iterations for the JVM to optimize to AVX512, especially in the cases where they use 64-bit operands.
 

PPB

Golden Member
Jul 5, 2013
1,118
168
106
Imagine taking one of those for a ride with Vray and other CPU oriented renderers....... things would be blazing fast. This needs a prosumer variant soon.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Interestingly, the first post with the TFlops figure quoted is incorrect. Only the Xeon Phi 7290 model achieves the promised 3TFlops figure.

The Xeon Phi 7250 for example...

1.6GHz 1 core frequency
1.5GHz 1-2 core frequency
1.4GHz all core frequency

Not mentioned is AVX frequency, which for this part is 1.2GHz.

AVX frequency:
7290: 1.3GHz
7250: 1.2GHz
7210/7230: 1.1GHz

It explains the seeming "low efficiency" of only having 2TFlops in measurements versus 3TFlops theoretical. For the 7210 and 7230 the peak flops is only 2.25TFlops. For the 7250 it's 2.611TFlops. The "efficiency" is actually in the 80-90% range typical of regular Xeons.
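For anyone who wants to check those peaks, here is a rough sketch of the arithmetic. It assumes 32 DP flops per cycle per core (8 DP lanes * 2 for FMA * 2 vector units) and core counts of 72, 68 and 64 for the 7290, 7250 and 7210/7230. The 64-core count and the function name are my own assumptions, not something quoted from Intel.

```python
# Assumed per-core throughput: 8 DP lanes x 2 (FMA) x 2 vector units.
FLOPS_PER_CYCLE_PER_CORE = 8 * 2 * 2

def peak_tflops(cores, avx_ghz):
    """Theoretical peak DP TFlops at the given AVX clock (hypothetical helper)."""
    return cores * avx_ghz * FLOPS_PER_CYCLE_PER_CORE / 1000.0

# (cores, AVX frequency in GHz); core counts are assumptions.
skus = {
    "7290": (72, 1.3),
    "7250": (68, 1.2),
    "7210/7230": (64, 1.1),
}

for name, (cores, ghz) in skus.items():
    print(f"{name}: {peak_tflops(cores, ghz):.3f} TFlops")
```

With these inputs the 7250 comes out at about 2.611 TFlops and the 7210/7230 at about 2.25 TFlops, matching the figures above.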

Obviously they stumbled a bit with 14nm problems and associated delays. A Knights Landing part with 20% higher average performance a year ago would have been very impressive.
 

Nothingness

Diamond Member
Jul 3, 2013
3,292
2,360
136
Not mentioned is AVX frequency, which for this part is 1.2GHz.

AVX frequency:
7290: 1.3GHz
7250: 1.2GHz
7210/7230: 1.1GHz
Where did you get that from?

If it's correct then Intel lied when quoting their peak DP GFLOPS, which I'd find surprising.
 

Nothingness

Diamond Member
Jul 3, 2013
3,292
2,360
136
So you lose 200 MHz of clock speed for almost double the flops (with optimized code). Haven't Xeons been doing that since Haswell?
It started with Broadwell IIRC. The difference here is that 200 MHz is more than 10% of base clock so the impact is higher than on most Xeon chips.

But I agree this had to be expected: we are talking about 2*(68 to 72) 512-bit FMA units on a die, that's massive. Intel nonetheless missed their 3+ TFLOPS target by a little :biggrin:
 

TeknoBug

Platinum Member
Oct 2, 2013
2,084
31
91
"Knights Landing"? Whoever came up with the codename is hooked on GoT.

But those are crazy crazy CPUs, would make a great ESXi server.
 

Nothingness

Diamond Member
Jul 3, 2013
3,292
2,360
136
They didn't miss, 7290 is still easily over 3+ TFLOP.
No it isn't: 1.3 GHz * 72 cores * 8 DP * 2 FMA * 2 units = 2995.2 GFLOPS. I call that 3 TFLOPS, not 3+ (and certainly not the 3.46 TFLOPS found in the original post of this thread). As far as I'm concerned they missed their targets (though I still find the chip impressive).
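As a sanity check, the same formula can be inverted to see what clock a quoted peak figure implies. This is only a sketch: the 32 flops per cycle per core comes from the calculation above, and the function name is made up for illustration.

```python
def implied_clock_ghz(target_gflops, cores, flops_per_cycle=32):
    """Clock (GHz) needed to reach target_gflops, assuming every core
    retires flops_per_cycle DP flops each cycle (hypothetical helper)."""
    return target_gflops / (cores * flops_per_cycle)

# Clock implied by the 3.46 TFLOPS figure from the original post, 72 cores.
print(round(implied_clock_ghz(3460, 72), 2))

# Clock implied by the 2995.2 GFLOPS result computed above.
print(round(implied_clock_ghz(2995.2, 72), 2))
```

The 3.46 TFLOPS figure works out to roughly 1.5 GHz across all 72 cores, which is well above the 1.3 GHz AVX clock. Whether that is really how the figure was derived is a guess.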

EDIT: to clarify, the peak DP figure in the original post seems to come from a web site, not directly from Intel.
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
"Knights Landing"? Whoever came up with the codename is hooked on GoT.

But those are crazy crazy CPUs, would make a great ESXi server.

Sure, if you don't mind running on a bunch of Atom cores at 1.2 GHz.
 

DrMrLordX

Lifer
Apr 27, 2000
22,702
12,652
136
So you lose 200 MHz of clock speed for almost double the flops (with optimized code). Haven't Xeons been doing that since Haswell?

I expect "general purpose" CPUs to maybe have to do this for AVX/AVX2. They are built for more than just enormous, highly-parallel series of calculations. Xeon Phi is built for a fairly narrow set of use cases without the need to support a ton of legacy code (though the fact that it can handle some legacy code is kinda neat, I guess).

You would think that anyone "serious" about getting the most out of Phi would be using it only for AVX512 workloads. Any clockspeed potential for non-AVX512 workloads would be mostly irrelevant.
 

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
You would think that anyone "serious" about getting the most out of Phi would be using it only for AVX512 workloads. Any clockspeed potential for non-AVX512 workloads would be mostly irrelevant.

I agree, non-AVX speeds seem to be irrelevant for most purposes of these chips. I suppose Intel could have just stated the base speed is 1.2GHz with a turbo boost of 200MHz for non-AVX workloads. :\
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
SpecFP comparisons: Xeon Phi 7250 versus Xeon E5 v4.

(Base/Peak)
Xeon Phi 7250 68 cores: 842/870
Xeon E5 2699 v4 1P 22 cores: 551/568
Xeon E5 2699 v4 2P 44 cores: 1100/1130

Compare that to Xeon Phi Knights Corner.

http://iopscience.iop.org/article/10.1088/1742-6596/513/5/052024/pdf

According to the above paper, HEPSPEC06 is a Spec06 equivalent for High Energy Physics. A search also indicates that Spec isn't easily vectorized.

Xeon Phi KNC: 140-210
4 core 3.1GHz Haswell Xeon (looks like a Xeon E3 1220 v3): 119

SpecFP Xeon E3 1220 v3: 137/139

Xeon Phi extrapolated assuming similar ratios: 161-245
Ivy Bridge Xeon E5-2697 v2 1P 12 cores: 335/345

So with Knights Corner, a basically unoptimized but bandwidth- and thread-loving SpecFP would have reached 48-73% of a single Xeon E5 of its day. With Knights Landing, it does 1.5-1.55x the performance of the top 1P Xeon of its day. This is its strength. It's basically "out of the box", with better performance than Xeon, and the chance to vectorize, which will boost it further.
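Here is a small sketch of how those ratios fall out of the numbers quoted above. The scaling step assumes HEPSPEC06 and SpecFP track each other linearly on the Xeon E3, which is a rough assumption on my part, and the variable names are just for illustration.

```python
# Scores quoted above as (base, peak) unless noted.
knl_7250 = (842, 870)     # Xeon Phi 7250, SpecFP
bdw_1p = (551, 568)       # Xeon E5 2699 v4 1P, SpecFP
knc_hepspec = (140, 210)  # Xeon Phi KNC, HEPSPEC06 (low, high)
e3_hepspec = 119          # 4-core Haswell Xeon, HEPSPEC06
e3_specfp = (137, 139)    # Xeon E3 1220 v3, SpecFP
ivb_1p = (335, 345)       # Xeon E5-2697 v2 1P, SpecFP

# Scale KNC's HEPSPEC06 range to a SpecFP estimate via the E3 ratio
# (assumes the two benchmarks scale alike, which is only approximate).
knc_est = tuple(h * s / e3_hepspec for h, s in zip(knc_hepspec, e3_specfp))

# KNC estimate relative to the Ivy Bridge 1P base score.
knc_vs_ivb = tuple(est / ivb_1p[0] for est in knc_est)

# KNL relative to Broadwell 1P, base vs base and peak vs peak.
knl_vs_bdw = tuple(k / b for k, b in zip(knl_7250, bdw_1p))

print("KNC SpecFP estimate:", [round(x) for x in knc_est])
print("KNC vs Ivy Bridge 1P:", [f"{r:.0%}" for r in knc_vs_ivb])
print("KNL vs Broadwell 1P:", [f"{r:.2f}x" for r in knl_vs_bdw])
```

The KNC estimate lands at about 161-245, or roughly 48-73% of the Ivy Bridge part, while the 7250 comes out at about 1.53x the 1P Broadwell.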

Further comparison also shows that the Xeon Phi 7250 is consistently faster than the Broadwell Xeon E5, even in the subtests of the SpecFP benchmark.
 