[NextPlatform] Case study: achieving 6/3 TFLOPS SP/DP on Xeon Phi

witeken

Diamond Member
Dec 25, 2013
3,899
193
106
Physics Code Modifications Push Xeon Phi Peak Performance

Interesting read about a physicist who needs a lot of computation for his code to analyze some experiment. He decides to try out Intel's x86 Xeon Phi instead of dealing with CUDA. After a lot of experimentation, he reaches and even surpasses Intel's performance claims of 6 TFLOPS FMA single precision, and half of that in double precision.

Sadly, the article does not finish with a comparison or evaluation of what this hardware experiment yielded. Still, it's nice to hear about this machine in action in real life (and thus get a real-life benchmark).

“I thought I could just write a C/C++ program with obviously vectorizable code that used OpenMP for threading, and I would have no trouble getting 256 hyperthreads to exercise 64 cores in the Xeon Phi processor. Most of that went easily, but for single precision I was stuck at 1 or maybe 2 Tflops per second for a week or two, which was disappointing.”

“It does exactly what I want and just raw computes, with no extraneous memory accesses or use of cache,” explained Dunham. In the end, Dunham said he was able to surpass the performance claims made by Intel (6 TFLOP/s for single precision and 3 TFLOP/s for double precision) by about 5 – 10%.
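A rough sanity check on those figures, as a back-of-envelope sketch rather than anything taken from the article: theoretical peak is cores x vector units per core x lanes per vector x 2 flops per FMA x clock. The 64 cores come from the quote above; two 512-bit vector units per core is the usual Knights Landing configuration. The clock value below is purely an assumption picked for illustration, since the article excerpt does not state one.

```python
def peak_tflops(cores, vpus_per_core, lanes, clock_ghz, flops_per_fma=2):
    """Theoretical peak in TFLOP/s, assuming every vector unit
    retires one FMA per lane on every cycle."""
    gflops = cores * vpus_per_core * lanes * flops_per_fma * clock_ghz
    return gflops / 1000.0

CORES = 64          # from the quote above
VPUS_PER_CORE = 2   # two 512-bit vector units per core
CLOCK_GHZ = 1.5     # ASSUMED clock, not stated in the article
SP_LANES = 16       # 512 bits / 32-bit floats
DP_LANES = 8        # 512 bits / 64-bit doubles

sp = peak_tflops(CORES, VPUS_PER_CORE, SP_LANES, CLOCK_GHZ)
dp = peak_tflops(CORES, VPUS_PER_CORE, DP_LANES, CLOCK_GHZ)
print(f"SP peak: {sp:.3f} TFLOP/s, DP peak: {dp:.3f} TFLOP/s")
```

With these inputs the double-precision peak is exactly half the single-precision one, because the only thing that changes is the lane count.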

Read the full article in the link above.

For those who missed it, by the way: in July BK announced that in the first two quarters of this year (and I think the first quarter was still limited availability, IIRC) Xeon Phi revenue was already 8x that of all of 2015. Not all that surprising given that Knights Corner was at the end of its life, but it's nice to see momentum growing for what came out of Larrabee.

Progress in the data center extended beyond our CPU product lines. Our latest Xeon Phi accelerator, formerly known as Knights Landing, continued to ramp after shipping the first limited production units in December of last year. Xeon Phi revenue grew 8x in the first six months of this year versus all of 2015, gaining share in the supercomputing and machine learning segments.


Phycode2.jpg
 

Ajay

Lifer
Jan 8, 2001
15,454
7,862
136
So, I will rant for a moment. NO CODE. Neither in Dunham's article nor in DeSerio's original publication. Why do computer scientists have to suck at the most fundamental of things? The code, data set, OS, compiler and optimization flags are all necessary to reproduce work :mad:.

Otherwise, I'm sure Intel will do well with KNL. I wonder how much Intel makes on each VPU vs. what it makes on other system components (I suppose the integrator could be any company that offers HPC installation).
 

DrMrLordX

Lifer
Apr 27, 2000
21,633
10,845
136
Yeah lack-of-code makes that article less-than-impressive. But it probably is closed-source so . . . there you go. What else can you do?
 

Ajay

Lifer
Jan 8, 2001
15,454
7,862
136
VPU = Vector Processing Unit?

Yeah, it's not really a GPU and it's more than a CPU. I heard Xeon Phis referred to this way. Don't really know what the convention is, or if there is one.
 

Ajay

Lifer
Jan 8, 2001
15,454
7,862
136
Yeah lack-of-code makes that article less-than-impressive. But it probably is closed-source so . . . there you go. What else can you do?

Well, that is a long-standing rant of mine (and others). When I went back to graduate school in 1996, these sorts of articles really hacked me off. In general, IBM hates this sort of stuff. It really makes it hard to reproduce results, which, of course, is a cornerstone of science. Articles w/o code are computer art, not science, IMHO (or even, just a humble brag).
 

witeken

Diamond Member
Dec 25, 2013
3,899
193
106
Yeah, it's not really a GPU and it's more than a CPU. I heard Xeon Phis referred to this way. Don't really know what the convention is, or if there is one.
Intel just calls it a processor.
 

DrMrLordX

Lifer
Apr 27, 2000
21,633
10,845
136
Well, that is a long-standing rant of mine (and others). When I went back to graduate school in 1996, these sorts of articles really hacked me off. In general, IBM hates this sort of stuff. It really makes it hard to reproduce results, which, of course, is a cornerstone of science. Articles w/o code are computer art, not science, IMHO (or even, just a humble brag).

That's why, on those rare occasions when I produce some code that might be of interest to others, I just slap on a BSD 2-clause license and let it go. That is, when I'm allowed to do so (assuming I am not directly hacking on someone else's closed-source code that they published somehow).
 

NTMBK

Lifer
Nov 14, 2011
10,237
5,020
136
Yeah, it's not really a GPU and it's more than a CPU. I heard Xeon Phis referred to this way. Don't really know what the convention is, or if there is one.

In Intel lingo, a VPU is the vector unit, with two VPUs per core:

KnightsLanding.png
 

Ajay

Lifer
Jan 8, 2001
15,454
7,862
136
That's why, on those rare occasions when I produce some code that might be of interest to others, I just slap on a BSD 2-clause license and let it go. That is, when I'm allowed to do so (assuming I am not directly hacking on someone else's closed-source code that they published somehow).

Yeah, anything interesting I've done has been proprietary. This is from a University professor - too restrictive an environment in a time when IP is so valuable - it's wrecking computer science, IMHO.
 

Ajay

Lifer
Jan 8, 2001
15,454
7,862
136
Intel just calls it a processor.

In Intel lingo, a VPU is the vector unit, with two VPUs per core:

So Intel looks at KNL as a processor with enhanced vector units - that's fine. I seem to recall it is built around an Atom core, so it makes sense. An OS can be run on KNL - it certainly qualifies as a general processor. It will be interesting to see how the next iteration proceeds: http://www.anandtech.com/show/10575/intel-announces-knights-mill-a-xeon-phi-for-deep-learning
 

witeken

Diamond Member
Dec 25, 2013
3,899
193
106
So Intel looks at KNL as a processor with enhanced vector units - that's fine. I seem to recall it is built around an Atom core, so it makes sense. An OS can be run on KNL - it certainly qualifies as a general processor. It will be interesting to see how the next iteration proceeds: http://www.anandtech.com/show/10575/intel-announces-knights-mill-a-xeon-phi-for-deep-learning
Knights Mill mainly will add half-precision (with 32-bit defined as single precision). Knights Hill will be the next big upgrade. They probably inserted Mill because 10nm is delayed and for sure they won't come with a 700mm² chip in 2018.
 
Mar 10, 2006
11,715
2,012
126
Knights Mill mainly will add half-precision (with 32-bit defined as single precision). Knights Hill will be the next big upgrade. They probably inserted Mill because 10nm is delayed and for sure they won't come with a 700mm² chip in 2018.

Not quite. Intel is claiming a significant boost in single precision perf compared to Knights Landing.

Looks to me like a GP102-equivalent.
 

imported_ats

Senior member
Mar 21, 2008
422
63
86
Not quite. Intel is claiming a significant boost in single precision perf compared to Knights Landing.

Looks to me like a GP102-equivalent.

I think you are confusing Mill and Hill. Hill is the 3rd gen on 10nm that will power the next generation of leadership supercomputers, and it is expected to bring a significant performance increase over Knights Landing. Mill is largely a modification to Knights Landing to support FP16 and potentially FP8 to add additional performance in the machine learning realm. Now, it might have some additional process/circuit enhancements that make it slightly better than KL, but it isn't really intended to be the 3rd generation, more of a 2.2/2.5 step in line with, say, Kaby Lake.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
I can't even tell if this benchmark is performing a real task or just a loop with unrolled register FMA. The comment where he says it just "raw computes" with "no extraneous memory accesses or use of cache" makes me skeptical. Calling the demonstration a "speed of light" program, while still rather ambiguous, is not that inspiring either.

Naturally you always should be able to achieve very close to the advertised peak (or in the case where they were rounding it down like this, higher) for some kind of program. That is, unless the advertisement was dishonest or something ended up very broken with the design. But that doesn't say an awful lot.

I think you can take a quick look at the uarch details and determine that for a lot of real code you won't be able to get very close to the peak FP throughput, because it can only issue up to two instructions per cycle. So in particular stores, pointer arithmetic, comparisons and branches will erode peak performance, assuming the kernel was otherwise pure FMA.
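To put a number on that, here is a toy model, not a simulation of the real pipeline. It assumes a strict two-instructions-per-cycle issue limit, two vector FMA units, and that every instruction (FMA or not) costs exactly one issue slot. The instruction mixes in the example are made up for illustration.

```python
def peak_fraction(fma_ops, other_ops, issue_width=2, fma_units=2):
    """Fraction of peak FMA throughput a loop body can reach when
    non-FMA instructions compete with FMAs for the same issue slots."""
    if fma_ops == 0:
        return 0.0
    # Cycles needed if only the FMA units were the bottleneck.
    ideal_cycles = fma_ops / fma_units
    # Cycles needed once every instruction has to pass through issue.
    issue_cycles = (fma_ops + other_ops) / issue_width
    return ideal_cycles / max(ideal_cycles, issue_cycles)

# Hypothetical loop bodies: (FMA count, everything-else count)
for fma, other in [(8, 0), (8, 2), (8, 4), (4, 4)]:
    print(f"{fma} FMA + {other} other -> {peak_fraction(fma, other):.0%} of peak")
```

Under these assumptions, a loop body with 8 FMAs plus just a pointer increment and a compare-and-branch is already capped at 80% of peak, which is why only a register-only FMA loop can sit at the advertised number.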
 
Mar 10, 2006
11,715
2,012
126
I think you are confusing Mill and Hill. Hill is the 3rd gen on 10nm that will power the next generation of leadership supercomputers, and it is expected to bring a significant performance increase over Knights Landing. Mill is largely a modification to Knights Landing to support FP16 and potentially FP8 to add additional performance in the machine learning realm. Now, it might have some additional process/circuit enhancements that make it slightly better than KL, but it isn't really intended to be the 3rd generation, more of a 2.2/2.5 step in line with, say, Kaby Lake.

intel_knights_phi_mill_678x452.jpg
 
Mar 10, 2006
11,715
2,012
126
So what? Where does it say "significant boost in single precision"?

Y-axis is single precision floating point performance.

14nm+ is probably giving Intel a frequency boost, and they could be unlocking the other six cores on the die that were disabled in the top Knights Landing SKUs.
 

imported_ats

Senior member
Mar 21, 2008
422
63
86
Y-axis is single precision floating point performance.

14nm+ is probably giving Intel a frequency boost, and they could be unlocking the other six cores on the die that were disabled in the top Knights Landing SKUs.

Um, that graph could be inverted log for all we know. Seriously, basing expectations off of unlabeled graphs is just a plain bad idea.

And apparently everyone in the industry agrees with me as basically you are the only person who seems to think there is any significant performance increase.

As far as enabling more cores goes, they might be able to enable another 2 cores, but it is highly unlikely that they'll end up selling 76-core parts. It's pretty rare for a die that size to come out 100% functional. The disabled cores are a significant yield salvage and cost reduction.
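The yield-salvage argument can be illustrated with a simple binomial sketch. It assumes each core fails independently with the same probability and ignores defects anywhere else on the die (mesh, memory controllers, I/O). The per-core defect probability is a made-up placeholder, not real fab data.

```python
from math import comb

def prob_at_least_good(total_cores, needed_cores, p_core_bad):
    """Probability that at least `needed_cores` out of `total_cores`
    are defect-free, with independent, identical per-core failures."""
    p_good = 1.0 - p_core_bad
    return sum(
        comb(total_cores, good) * p_good**good * p_core_bad**(total_cores - good)
        for good in range(needed_cores, total_cores + 1)
    )

P_CORE_BAD = 0.02  # HYPOTHETICAL per-core defect probability

for needed in (76, 74, 72):
    share = prob_at_least_good(76, needed, P_CORE_BAD)
    print(f"dies with >= {needed} of 76 good cores: {share:.1%}")
```

Whatever the real defect rate is, the shape is the same: the fraction of sellable dies rises quickly as the required number of working cores drops, which is the point of shipping with a few cores disabled.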
 

Ajay

Lifer
Jan 8, 2001
15,454
7,862
136
I can't even tell if this benchmark is performing a real task or just a loop with unrolled register FMA. The comment where he says it just "raw computes" with "no extraneous memory accesses or use of cache" makes me skeptical. Calling the demonstration a "speed of light" program, while still rather ambiguous, is not that inspiring either.

Naturally you always should be able to achieve very close to the advertised peak (or in the case where they were rounding it down like this, higher) for some kind of program. That is, unless the advertisement was dishonest or something ended up very broken with the design. But that doesn't say an awful lot.

I think you can take a quick look at the uarch details and determine that for a lot of real code you won't be able to get very close to the peak FP throughput, because it can only issue up to two instructions per cycle. So in particular stores, pointer arithmetic, comparisons and branches will erode peak performance, assuming the kernel was otherwise pure FMA.

Strange uArch. Dual issue with six execution ports - wth?
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Strange uArch. Dual issue with six execution ports - wth?

Yes, well, anyone who thinks x86 is an excellent fit for this application has gone straight from drinking to smoking the kool-aid. Or is required by their employer to think that.

But Goldmont went three wide, so maybe Knights Hill (?) will as well.
 
Mar 10, 2006
11,715
2,012
126
Yes, well, anyone who thinks x86 is an excellent fit for this application has gone straight from drinking to smoking the kool-aid. Or is required by their employer to think that.

But Goldmont went three wide, so maybe Knights Hill (?) will as well.

I believe Hill is based on a heavily modified Goldmont core.