[NextPlatform] Case study: achieving 6/3 TFLOPS SP/DP on Xeon Phi

witeken · Nov 4, 2016

Physics Code Modifications Push Xeon Phi Peak Performance

Interesting read about a physicist who needs a lot of compute for the code that analyzes his experiment. He decides to try Intel's x86 Xeon Phi instead of dealing with CUDA. After a lot of trial and error, he reaches and even surpasses Intel's performance claims of 6 TFLOPS of single-precision FMA throughput, and half of that in double precision.

Sadly, the article does not finish with a comparison or evaluation of what this hardware experiment yielded. But it's nice to hear about this machine in action in real life (and thus to get a real-life benchmark).

“I thought I could just write a C/C++ program with obviously vectorizable code that used OpenMP for threading, and I would have no trouble getting 256 hyperthreads to exercise 64 cores in the Xeon Phi processor. Most of that went easily, but for single precision I was stuck at 1 or maybe 2 Tflops per second for a week or two, which was disappointing.”

“It does exactly what I want and just raw computes, with no extraneous memory accesses or use of cache,” explained Dunham. In the end, Dunham said he was able to surpass the performance claims made by Intel (6 TFLOP/s for single precision and 3 TFLOP/s for double precision) by about 5 – 10%.
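As a back-of-the-envelope check on where figures like 6 and 3 TFLOP/s can come from, the sketch below multiplies cores, vector units, SIMD lanes, and clock. The 64 cores come from the quote above, and two 512-bit vector units per core with 2 FLOPs per FMA match Knights Landing. The 1.5 GHz clock is an assumption picked for illustration (no frequency is quoted here), and `peak_tflops` is a hypothetical helper name.

```python
def peak_tflops(cores, vpus_per_core, lanes, clock_ghz, flops_per_fma=2):
    """Theoretical peak in TFLOP/s, assuming every lane of every
    vector unit retires one FMA per cycle."""
    flops_per_cycle = cores * vpus_per_core * lanes * flops_per_fma
    return flops_per_cycle * clock_ghz / 1000.0  # GFLOP/s -> TFLOP/s


SP_LANES = 512 // 32   # 16 single-precision lanes in a 512-bit register
DP_LANES = 512 // 64   # 8 double-precision lanes in a 512-bit register
CLOCK_GHZ = 1.5        # ASSUMPTION: illustrative clock, not from the article

sp_peak = peak_tflops(cores=64, vpus_per_core=2, lanes=SP_LANES, clock_ghz=CLOCK_GHZ)
dp_peak = peak_tflops(cores=64, vpus_per_core=2, lanes=DP_LANES, clock_ghz=CLOCK_GHZ)

print(f"SP peak: {sp_peak:.3f} TFLOP/s")
print(f"DP peak: {dp_peak:.3f} TFLOP/s")
```

Under these assumptions the double-precision figure is exactly half the single-precision one, simply because each register holds half as many 64-bit lanes. Changing the core count or clock moves both numbers proportionally.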

Read the full article in the link above.

For those who missed it, by the way: in July BK announced that in the first two quarters of this year (and I think the first quarter was still limited availability, IIRC) Xeon Phi revenue had already reached 8x that of all of 2015. Not all that surprising given that Knights Corner was at the end of its life, but nice to see momentum growing for what came out of Larrabee.

Progress in the data center extended beyond our CPU product lines. Our latest Xeon Phi accelerator, formerly known as Knights Landing, continued to ramp after shipping the first limited production units in December of last year. Xeon Phi revenue grew 8x in the first six months of this year versus all of 2015, gaining share in the supercomputing and machine learning segments.

Ajay · Nov 5, 2016

So, I will rant for a moment. NO CODE. Neither in Dunham's article nor in DeSerio's original publication. Why do computer scientists have to suck at the most fundamental of things? The code, data set, OS, compiler and optimization flags are all necessary to reproduce work.

Otherwise, I'm sure Intel will do well with KNL. I wonder how much Intel makes on each VPU vs what it makes on other system components (I suppose the integrator could be any company that offers HPC installation).

witeken · Nov 5, 2016

Ajay said:
VPU

VPU = Vector Processing Unit?

DrMrLordX · Nov 5, 2016

Yeah lack-of-code makes that article less-than-impressive. But it probably is closed-source so . . . there you go. What else can you do?

Ajay · Nov 5, 2016

witeken said:
VPU = Vector Processing Unit?

Yeah, it's not really a GPU and it's more than a CPU. I heard Xeon Phis referred to this way. Don't really know what the convention is, or if there is one.

Ajay · Nov 5, 2016

DrMrLordX said:
Yeah lack-of-code makes that article less-than-impressive. But it probably is closed-source so . . . there you go. What else can you do?

Well, that is a long-standing rant of mine (and others). When I went back to graduate school in 1996, these sorts of articles really hacked me off. In general, IBM hates this sort of stuff. It really makes it hard to reproduce results, which, of course, is a cornerstone of science. Articles w/o code are computer art, not science, IMHO (or even just a humble brag).

witeken · Nov 5, 2016

Ajay said:
Yeah, it's not really a GPU and it's more than a CPU. I heard Xeon Phis referred to this way. Don't really know what the convention is, or if there is one.

Intel just calls it a processor.

DrMrLordX · Nov 5, 2016

Ajay said:
Well, that is a long-standing rant of mine (and others). When I went back to graduate school in 1996, these sorts of articles really hacked me off. In general, IBM hates this sort of stuff. It really makes it hard to reproduce results, which, of course, is a cornerstone of science. Articles w/o code are computer art, not science, IMHO (or even just a humble brag).

That's why, on those rare occasions when I produce some code that might be of interest to others, I just slap on a BSD 2-clause license and let it go. That is, when I'm allowed to do so (assuming I am not directly hacking on someone else's closed-source code that they published somehow).

NTMBK · Nov 5, 2016

Ajay said:
Yeah, it's not really a GPU and it's more than a CPU. I heard Xeon Phis referred to this way. Don't really know what the convention is, or if there is one.

In Intel lingo, a VPU is the vector unit, with two VPUs per core:

Ajay · Nov 5, 2016

DrMrLordX said:
That's why, on those rare occasions when I produce some code that might be of interest to others, I just slap on a BSD 2-clause license and let it go. That is, when I'm allowed to do so (assuming I am not directly hacking on someone else's closed-source code that they published somehow).

Yeah, anything interesting I've done has been proprietary. This is from a University professor - too restrictive an environment in a time when IP is so valuable - it's wrecking computer science, IMHO.

Ajay · Nov 5, 2016

witeken said:
Intel just calls it a processor.

NTMBK said:
In Intel lingo, a VPU is the vector unit, with two VPUs per core:

So Intel looks at KNL as a processor with enhanced vector units - that's fine. I seem to recall it is built around an Atom core, so it makes sense. An OS can be run on KNL, so it certainly qualifies as a general-purpose processor. It will be interesting to see how the next iteration proceeds: http://www.anandtech.com/show/10575/intel-announces-knights-mill-a-xeon-phi-for-deep-learning

witeken · Nov 5, 2016

Ajay said:
So Intel looks at KNL as a processor with enhanced vector units - that's fine. I seem to recall it is built around an Atom core, so it makes sense. An OS can be run on KNL, so it certainly qualifies as a general-purpose processor. It will be interesting to see how the next iteration proceeds: http://www.anandtech.com/show/10575/intel-announces-knights-mill-a-xeon-phi-for-deep-learning

Knights Mill mainly will add half-precision (with 32-bit defined as single precision). Knights Hill will be the next big upgrade. They probably inserted Mill because 10nm is delayed and for sure they won't come with a 700mm² chip in 2018.

Arachnotronic · Nov 5, 2016

witeken said:
Knights Mill mainly will add half-precision (with 32-bit defined as single precision). Knights Hill will be the next big upgrade. They probably inserted Mill because 10nm is delayed and for sure they won't come with a 700mm² chip in 2018.

Not quite. Intel is claiming a significant boost in single precision perf compared to Knights Landing.

Looks to me like a GP102-equivalent.

imported_ats · Nov 6, 2016

Arachnotronic said:
Not quite. Intel is claiming a significant boost in single precision perf compared to Knights Landing.

Looks to me like a GP102-equivalent.

I think you are confusing Mill and Hill. Hill is the 3rd gen on 10nm that will power the next generation of leadership supercomputers, and it is expected to be a significant performance increase over Knights Landing. Mill is largely a modification of Knights Landing to support FP16 and potentially FP8, adding performance in the machine learning realm. Now, it might have some additional process/circuit enhancements that make it slightly better than KL, but it isn't really intended to be the 3rd generation; it's more of a 2.2/2.5 step, in line with, say, Kaby Lake.

Exophase · Nov 6, 2016

I can't even tell if this benchmark is performing a real task or just a loop of unrolled register-to-register FMAs. The comment where he says it "just raw computes" with "no extraneous memory accesses or use of cache" makes me skeptical. Calling the demonstration a "speed of light" program, while still rather ambiguous, is not that inspiring either.

Naturally you always should be able to achieve very close to the advertised peak (or in the case where they were rounding it down like this, higher) for some kind of program. That is, unless the advertisement was dishonest or something ended up very broken with the design. But that doesn't say an awful lot.

I think you can take a quick look at the uarch details and determine that for a lot of real code you won't be able to get very close to the peak FP throughput, because it can only issue up to two instructions per cycle. So in particular stores, pointer arithmetic, comparisons and branches will erode peak performance, assuming the kernel was otherwise pure FMA.
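To put rough numbers on that erosion, here is a minimal throughput sketch. It assumes a deliberately simple model: two issue slots per cycle, two FMA-capable vector units, every instruction taking exactly one issue slot, and no stalls from latency, memory, or decode. The function name and the example instruction mixes are hypothetical and are not taken from the article.

```python
def fma_fraction_of_peak(n_fma, n_other, issue_width=2, fma_units=2):
    """Upper bound on the fraction of peak FMA throughput reachable by a
    loop body containing n_fma FMA instructions and n_other non-FMA
    instructions (stores, pointer arithmetic, compares, branches).

    ASSUMPTION: a pure issue-slot model; latencies and memory are ignored.
    """
    if n_fma <= 0:
        raise ValueError("loop body needs at least one FMA")
    # Cycles if only the vector units limited throughput.
    ideal_cycles = n_fma / fma_units
    # Cycles once every instruction competes for the shared issue slots.
    issue_cycles = (n_fma + n_other) / issue_width
    return ideal_cycles / max(ideal_cycles, issue_cycles)


pure_fma = fma_fraction_of_peak(n_fma=8, n_other=0)     # FMA-only loop body
light_mix = fma_fraction_of_peak(n_fma=8, n_other=2)    # a store and a branch
heavy_mix = fma_fraction_of_peak(n_fma=8, n_other=8)    # one other op per FMA

for name, frac in [("pure", pure_fma), ("light", light_mix), ("heavy", heavy_mix)]:
    print(f"{name}: {frac:.0%} of peak")
```

In this model, with issue width equal to the number of FMA units, the bound reduces to n_fma / (n_fma + n_other), which is why only a nearly pure-FMA kernel can sit at the advertised peak.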

Arachnotronic · Nov 6, 2016

imported_ats said:
I think you are confusing Mill and Hill. Hill is the 3rd gen on 10nm that will power the next generation of leadership supercomputers, and it is expected to be a significant performance increase over Knights Landing. Mill is largely a modification of Knights Landing to support FP16 and potentially FP8, adding performance in the machine learning realm. Now, it might have some additional process/circuit enhancements that make it slightly better than KL, but it isn't really intended to be the 3rd generation; it's more of a 2.2/2.5 step, in line with, say, Kaby Lake.

witeken · Nov 6, 2016

Arachnotronic said:

So what? Where does it say "significant boost in single precision"?

NTMBK · Nov 6, 2016

witeken said:
So what? Where does it say "significant boost in single precision"?

On the graph

Arachnotronic · Nov 6, 2016

witeken said:
So what? Where does it say "significant boost in single precision"?

Y-axis is single precision floating point performance.

14nm+ is probably giving Intel a frequency boost, and they could be unlocking the other six cores on the die that were disabled in the top Knights Landing SKUs.

imported_ats · Nov 6, 2016

Arachnotronic said:
Y-axis is single precision floating point performance.

14nm+ is probably giving Intel a frequency boost, and they could be unlocking the other six cores on the die that were disabled in the top Knights Landing SKUs.

Um, that graph could be inverted log for all we know. Seriously, basing expectations off of unlabeled graphs is just a plain bad idea.

And apparently everyone in the industry agrees with me as basically you are the only person who seems to think there is any significant performance increase.

As far as enabling more cores, they might be able to enable another 2 cores, but it is highly unlikely that they'll end up selling 76 core parts. That's a pretty rare occurrence for that size die to have a 100% functional die. The disabled cores are a significant yield salvage and cost reduction.
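The yield-salvage point can be illustrated with a toy binomial model. It assumes each of the 76 cores fails independently with the same probability and that nothing else on the die matters; real defect behaviour is more complicated. The 2% per-core defect rate and the 72-core cut-off are made-up illustration values, and `prob_at_least_good` is a hypothetical helper.

```python
from math import comb


def prob_at_least_good(n_cores, min_good, p_defect):
    """Probability that at least min_good of n_cores are defect-free,
    assuming independent, identically likely core defects (toy model)."""
    p_good = 1.0 - p_defect
    return sum(
        comb(n_cores, good) * p_good**good * p_defect**(n_cores - good)
        for good in range(min_good, n_cores + 1)
    )


N_CORES = 76      # full die, as discussed above
P_DEFECT = 0.02   # ASSUMPTION: illustrative per-core defect probability

fully_working = prob_at_least_good(N_CORES, 76, P_DEFECT)
salvageable = prob_at_least_good(N_CORES, 72, P_DEFECT)

print(f"all 76 cores good:   {fully_working:.1%}")
print(f"at least 72 good:    {salvageable:.1%}")
```

Whatever defect rate is plugged in, requiring every core to work always yields fewer sellable dies than tolerating a few bad ones, which is the cost argument for shipping parts with some cores disabled.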

Ajay · Nov 8, 2016

Exophase said:
I can't even tell if this benchmark is performing a real task or just a loop of unrolled register-to-register FMAs. The comment where he says it "just raw computes" with "no extraneous memory accesses or use of cache" makes me skeptical. Calling the demonstration a "speed of light" program, while still rather ambiguous, is not that inspiring either.

Naturally you always should be able to achieve very close to the advertised peak (or in the case where they were rounding it down like this, higher) for some kind of program. That is, unless the advertisement was dishonest or something ended up very broken with the design. But that doesn't say an awful lot.

I think you can take a quick look at the uarch details and determine that for a lot of real code you won't be able to get very close to the peak FP throughput, because it can only issue up to two instructions per cycle. So in particular stores, pointer arithmetic, comparisons and branches will erode peak performance, assuming the kernel was otherwise pure FMA.

Strange uArch. Dual issue with six execution ports - wth?

Exophase · Nov 8, 2016

Ajay said:
Strange uArch. Dual issue with six execution ports - wth?

Yes, well, anyone who thinks x86 is an excellent fit for this application has gone straight from drinking to smoking the kool-aid. Or is required by their employer to think that.

But Goldmont went three wide, so maybe Knights Hill (?) will as well.

Arachnotronic · Nov 8, 2016

Exophase said:
Yes, well, anyone who thinks x86 is an excellent fit for this application has gone straight from drinking to smoking the kool-aid. Or is required by their employer to think that.

But Goldmont went three wide, so maybe Knights Hill (?) will as well.

I believe Hill is based on a heavily modified Goldmont core.
