Knights Landing package pictures


NTMBK

Lifer
Nov 14, 2011
10,523
6,048
136
The interesting thing about Knight's Landing is that it gets rid of the "host" CPU. Today, a Xeon Phi or Tesla based system needs a fast Xeon to negotiate network traffic, figure out what data the accelerator is going to need, and get it off the network and onto the accelerator's DRAM (via the host memory). Knight's Landing gets rid of that extra processor, and has the ultra-parallel processor plugged directly into the network. This should significantly improve power efficiency, cost, and overall network latencies.
 

DrMrLordX

Lifer
Apr 27, 2000
23,222
13,300
136
someone simply made up that adapter story for some reason

Might have, and no I've never seen an adapter for high-end CPUs (or ultra-high-end, in the case of Knight's Landing). It's just that seeing Intel share sockets between Xeons and Xeon Phis would be awesome, and reminiscent of the original AMD Fusion plan (wherein CPUs and GPUs would fit into sockets on MP boards and share the same HT links).

The interesting thing about Knight's Landing is that it gets rid of the "host" CPU. Today, a Xeon Phi or Tesla based system needs a fast Xeon to negotiate network traffic, figure out what data the accelerator is going to need, and get it off the network and onto the accelerator's DRAM (via the host memory). Knight's Landing gets rid of that extra processor, and has the ultra-parallel processor plugged directly into the network. This should significantly improve power efficiency, cost, and overall network latencies.

You bring up an interesting point here. Despite Nvidia's current prominence in the HPC market when it comes to high-performance compute devices, it's really quite obvious that they have very little market control over the underlying platforms upon which their products are reliant.

Nvidia's future in the HPC market is based on the idea that HPC machines will continue to feature PCIe slots, and that everyone will be quite happy dispatching work queues to PCIe-connected devices hosted on otherwise-traditional server hardware.

Intel wants to eliminate the PCIe element entirely with a high-performance compute device that can also run the underlying OS and handle "non-compute" functions generally ill-suited to GPGPU devices.

AMD wants to fuse the CPU and GPU into one cheap little SoC node unit and hope that people will buy a whole lot of them (Berlin, or what Berlin could have been if it had seen the light of day).

Neither Intel's nor AMD's intended HPC cluster paradigm "of the future" leaves much room for Nvidia's products. Maybe that's one of the reasons why that new ORNL supercomputer is using POWER + Tesla instead of Xeons.
 

Nothingness

Diamond Member
Jul 3, 2013
3,373
2,469
136
The interesting thing about Knight's Landing is that it gets rid of the "host" CPU. Today, a Xeon Phi or Tesla based system needs a fast Xeon to negotiate network traffic, figure out what data the accelerator is going to need, and get it off the network and onto the accelerator's DRAM (via the host memory). Knight's Landing gets rid of that extra processor, and has the ultra-parallel processor plugged directly into the network. This should significantly improve power efficiency, cost, and overall network latencies.
Agreed, this is interesting, but OTOH the low single-thread performance will be an issue for many tasks (as Intel apologists like to [correctly] remind us all when talking about microserver SoCs :D), so high-end Xeons will still be needed.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Intel has the advantage of 14nm vs. Nvidia's 28nm, so they are simply in a different league in perf/watt.

And x86 compatibility is never to be underestimated. Intel moving the Xeon Phi project towards vanilla x86 + AVX512 is an alarm bell for NV/AMD compute projects. How they would stack up if all were on the same process, I don't know, but I'd take AVX anything over CUDA/OpenCL any day EVEN if Xeon was half as fast.
 
Mar 10, 2006
11,715
2,012
126
Agreed, this is interesting, but OTOH the low single-thread performance will be an issue for many tasks (as Intel apologists like to [correctly] remind us all when talking about microserver SoCs :D), so high-end Xeons will still be needed.

I'm very interested to see how much ST perf that enhanced Silvermont delivers. Everything about that core looks a lot better than the bog standard Silvermont, but it might run at a significantly lower frequency.

What I find more interesting is that Intel was able to deliver significantly more per clock performance with the Knights Landing Silvermont even though it remains 2-issue, which might suggest that future Atom CPUs will stay 2-issue for quite a while yet.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
I'm very interested to see how much ST perf that enhanced Silvermont delivers. Everything about that core looks a lot better than the bog standard Silvermont, but it might run at a significantly lower frequency.

Why is ST important at all??? It's an FP monster, and FP is all about flops and memory hierarchy bandwidth/latency. So you have your Atom core with HT, and you run multiple HT threads with your AVX512 code that gets executed on the SIMD units. It's all about throughput.

EDIT: What I mean is that while ST performance is relevant for the OS etc., Intel has been moving everything including the kitchen sink into the AVX512 instruction set; you can do a ton of stuff that previously was done outside of vectorization or had to be explicitly set up for an accelerator (like converting structures, processing the head/tail of a structure, etc.). Scatter/gather, masks for logic/flow control, and explicit rounding are all there.
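
To make that concrete, here's a minimal C sketch of the kind of thing AVX-512 lets you express directly: a gather plus a masked operation, i.e. the irregular access and flow control that used to need scalar fallbacks or explicit accelerator setup. The function and array names are just placeholders, and it assumes the caller hands in at least 16 elements.

```c
#include <immintrin.h>

// Minimal AVX-512 sketch (hypothetical helper): gather 16 floats through an
// index vector, then add another vector only in the lanes where it is > 0.
// Masking and gather are exactly the features mentioned above.
void masked_gather_add(const float *base, const int *idx,
                       const float *b, float *out)
{
    __m512i vindex   = _mm512_loadu_si512(idx);                  // 16 indices
    __m512  gathered = _mm512_i32gather_ps(vindex, base, 4);     // base[idx[i]]
    __m512  vb       = _mm512_loadu_ps(b);
    __mmask16 m      = _mm512_cmp_ps_mask(vb, _mm512_setzero_ps(),
                                          _CMP_GT_OQ);           // lanes with b > 0
    __m512  sum      = _mm512_mask_add_ps(gathered, m, gathered, vb);
    _mm512_storeu_ps(out, sum);                                  // store all 16 lanes
}
```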
 
Last edited:

Nothingness

Diamond Member
Jul 3, 2013
3,373
2,469
136
I'm very interested to see how much ST perf that enhanced Silvermont delivers. Everything about that core looks a lot better than the bog standard Silvermont, but it might run at a significantly lower frequency.
Did Intel say they improved the integer core?

What I find more interesting is that Intel was able to deliver significantly more per clock performance with the Knights Landing Silvermont even though it remains 2-issue, which might suggest that future Atom CPUs will stay 2-issue for quite a while yet.
Given what is achievable with a two-way decode chip (cf. Cortex-A17), there surely is room for improvement over Silvermont. But I'm not sure it would make sense to go too far for a multi-core chip targeting HPC. OTOH the integer core probably looks tiny compared to the 512-bit ALUs :D
 

Nothingness

Diamond Member
Jul 3, 2013
3,373
2,469
136
Why is ST important at all??? It's an FP monster, and FP is all about flops and memory hierarchy bandwidth/latency. So you have your Atom core with HT, and you run multiple HT threads with your AVX512 code that gets executed on the SIMD units. It's all about throughput.

EDIT: What I mean is that while ST performance is relevant for the OS etc., Intel has been moving everything including the kitchen sink into the AVX512 instruction set; you can do a ton of stuff that previously was done outside of vectorization or had to be explicitly set up for an accelerator (like converting structures, processing the head/tail of a structure, etc.). Scatter/gather, masks for logic/flow control, and explicit rounding are all there.
Look up Amdahl's law ;)
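
(For anyone who hasn't run the numbers: Amdahl's law says the serial fraction caps your speedup no matter how many cores or vector lanes you throw at the problem. A quick sketch, with a made-up 5% serial fraction purely for illustration:)

```c
#include <stdio.h>

// Amdahl's law: speedup = 1 / ((1 - p) + p / n), where p is the parallel
// fraction and n the number of processors. p = 0.95 is an arbitrary example.
int main(void)
{
    double p = 0.95;
    int n[] = {1, 4, 16, 64, 256};

    for (int i = 0; i < 5; i++)
        printf("n = %3d -> speedup = %5.2fx\n",
               n[i], 1.0 / ((1.0 - p) + p / n[i]));

    return 0;   // as n grows, speedup saturates at 1/(1-p) = 20x
}
```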
 
Mar 10, 2006
11,715
2,012
126
Did Intel say they improved the integer core?


Given what is achievable with a two-way decode chip (cf. Cortex-A17), there surely is room for improvement over Silvermont. But I'm not sure it would make sense to go too far for a multi-core chip targeting HPC. OTOH the integer core probably looks tiny compared to the 512-bit ALUs :D

[Image: Intel slide listing the Knights Landing core enhancements over Silvermont]


Deeper OoO buffers, advanced prediction, and high cache bandwidth all sound like they would benefit integer perf/clock.
 

Nothingness

Diamond Member
Jul 3, 2013
3,373
2,469
136
Thanks for the slide, I had missed it :)

Deeper OoO buffers, advanced prediction, and high cache bandwidth all sound like they would benefit integer perf/clock.
Cache BW and deeper OoO buffers could apply only to the SIMD units, but advanced branch prediction will surely benefit integer code too.
 

DrMrLordX

Lifer
Apr 27, 2000
23,222
13,300
136
How they would stack up if all were on same process, I don't know, but i'd take AVX anything over CUDA/OpenCL any day EVEN if Xeon was half as fast.

GPGPU stuff can be pretty difficult to approach from the programming angle. Just looking at AMD's 1.0 final HSA documents is enough to give one a headache.

Stuff like OpenMP and other compiler-based solutions are supposed to "cover up" the problem and make it easier for coders to use compute resources, but hell compiler support for Knight's Landing is practically already there (just not necessarily the AVX512 part, yet). If you can code for a 16-core Xeon, you can pretty much code for Knight's Landing.
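
As a rough illustration of that last point, a plain OpenMP loop is about all it takes; the compiler and runtime handle threading and vectorization whether the target has 16 cores or 60+. This is a generic sketch, not KNL-specific code, and the function name is made up.

```c
#include <stddef.h>

// Generic sketch: the same OpenMP code that runs on a 16-core Xeon can run on
// a many-core x86 part; "parallel for" spreads iterations across threads and
// "simd" lets the compiler vectorize each chunk.
void scaled_add(float *y, const float *x, float a, size_t n)
{
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```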
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
but hell compiler support for Knight's Landing is practically already there (just not necessarily the AVX512 part, yet).

note that the Intel compiler has fully supported AVX-512 for KNL targets (i.e. using the vectorizer, not only code using intrinsics) for more than a year now, and AVX-512 for SKX targets for several weeks

you can see a discussion about the different compiler back-end code generation strategies for KNL and SKX here: http://www.realworldtech.com/forum/?threadid=147882&curpostid=147882
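
for the curious, the workflow is just ordinary C plus a target switch; the option names below (-xMIC-AVX512 for KNL, -xCORE-AVX512 for SKX) are from memory, so double-check them against the compiler documentation

```c
// Plain C loop the Intel compiler's vectorizer can turn into AVX-512 code.
// Example invocations (option names from memory, verify against the docs):
//   icc -O3 -std=c99 -xMIC-AVX512  saxpy.c -c   // target KNL
//   icc -O3 -std=c99 -xCORE-AVX512 saxpy.c -c   // target SKX
void saxpy(float *restrict y, const float *restrict x, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```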
 
Last edited:

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
GPGPU stuff can be pretty difficult to approach from the programming angle. Just looking at AMD's 1.0 final HSA documents is enough to give one a headache.

I am well aware of those difficulties; I've been toying with converting some of the code we are using to CUDA, and while it worked, we have stayed with the old Intel code, simply because Haswell started pushing instructions damn fast and Intel's math libs are so sweet.

Look up Amdahl's law ;)

Not sure if this is a joke, but HPC and Amdahl's law? Aren't those guys all about insanely parallel processing and always looking for algorithms to expand the parallel domains?
They are certainly not buying these CPUs for the Atom cores, even if they come with 4-way HT and improved ST performance :) There are cheaper, higher-clocked Haswells for that. And if you run 4 integer-code threads (like 7z) on that core, I doubt you will get much speedup from it; you'd probably get a slowdown instead. But run some real-world code, with real memory BW requirements and latencies, and the 4 HT threads will help mask latencies and increase utilization. HPC is all about working towards theoretical throughput while still having sensible perf per watt.
 

DrMrLordX

Lifer
Apr 27, 2000
23,222
13,300
136
note that the Intel compiler fully supports (i.e using the vectorizer not only code using intrinsics) AVX-512 for KNL targets since more than one year and AVX-512 for SKX targets since several weeks now

Gotta hand it to Intel. They are once more on the ball with compiler support.

I am well aware of those difficulties; I've been toying with converting some of the code we are using to CUDA, and while it worked, we have stayed with the old Intel code, simply because Haswell started pushing instructions damn fast and Intel's math libs are so sweet.

CUDA is supposed to have one of the better toolsets out there for GPGPU work. OpenCL and HSA are even less dev-friendly.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
He was playing with the new Phi a week ago & said..
"The new Knights Landing is over 5 times faster than its predecessor and uses 1/3 the power." Each tile has an Atom core section = 1 execution unit, 30 EU (essentially 8 x 30 processors) for a total of 240 processors on the die and is only using around 100 watts. He had to boot up firmware for the phi processors prior to booting the machine, but says final product will have those instructions included, and will probably use an adapter to fit LGA 2011 socket. The target for release is 13 teraflops per die, x 2 is 26 teraflops. He says 1 of these can replace 20 servers & has Nvidia shaking in their boots. He is NOT an Intel fanboy either, but says Intel will take over HPC soon..

Out of this nonsense we may still find some truth.

"Over 5 times faster" - Knights Corner is pretty poor in some code. This seems realistic, since it's a second generation and Intel would have learned a lot from the first one.

"uses 1/3 the power" - I doubt we'll see this, but because socketable versions will be available, we might see something in the 100-130W range. That's 1/3 the power.

"The target for release is 13 teraflops per die, x 2 is 26 teraflops."

Not sure why he thinks this way, as Intel is aiming for "14-16 DP Flops/watt". That means 13 TF = 813W (:cool:) in the best-case scenario. If your friend is sort of clueless and it's an SP figure, then it's still 407W, still way too much.

The real figures are half that, since Knights Landing will be in the 160-215W range. Perhaps for some reason they'll boost SP by 2x and end up at 3.25 TFLOPS DP / 13 TFLOPS SP?

Not sure if this is a joke, but HPC and Amdahl's law? Aren't those guys all about insanely parallel processing and always looking for algorithms to expand the parallel domains?

The weird thing though is that they tout "3x performance in scalar over Knights Corner". And obviously that slide above shows further improvements over Silvermont, which in itself is a huge gain over P54C. Either they are aiming Knights Landing at parallel but not-quite-HPC code, or Knights Corner's weak performance was due to certain limits in single-threaded performance. Or both. The successor is going to boost that even further by using Goldmont cores.

Everything about that core looks a lot better than the bog standard Silvermont, but it might run at a significantly lower frequency.

If we assume certain SKUs of Knights Landing achieve the stated peak goal of 16 DP FLOPS/W, and the top SKU is stated to be 215W, we get 3.4 TFLOPS of performance. That works out to 1.45-1.5GHz. It's possible, though, that we might see Turbo for less-utilized cores or vector units.
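
If you want to check the arithmetic, here's the back-of-the-envelope version. The 72-core, 2-VPU-per-core layout is only the commonly rumored configuration, not a confirmed spec:

```c
#include <stdio.h>

// Back-of-the-envelope numbers: 16 DP GFLOPS/W (stated goal) at a 215W top SKU,
// divided by the rumored 72 cores x 2 VPUs x 8 DP lanes x 2 FMA ops per cycle.
int main(void)
{
    double peak_dp = 16e9 * 215.0;                       // ~3.44 TFLOPS DP
    double flops_per_cycle = 72.0 * 2 * 8 * 2;           // 2304 DP FLOPs/cycle

    printf("peak DP       : %.2f TFLOPS\n", peak_dp / 1e12);
    printf("implied clock : %.2f GHz\n", peak_dp / flops_per_cycle / 1e9);
    return 0;                                            // ~1.49 GHz
}
```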
 
Last edited:

bronxzv

Senior member
Jun 13, 2011
460
0
71
The weird thing though is that they tout "3x performance in scalar over Knights Corner".

3 x *single thread* peak *theoretical* throughput (see the fine print), typically achieved with 1.5 x the clock frequency and a dual vector unit in KNL instead of the single vector unit in KNC

near 3 x better *effective* throughput will be achievable only with perfectly cache-blocked algorithms, though

we can expect way less than a 3 x speedup (from KNC to KNL) for workloads with working sets in local memory, since local memory bandwidth is only slightly improved from KNC to KNL (not even 1.5 x more bandwidth)
 
Last edited:

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
3 x *single thread* peak *theoretical* throughput (see fine prints), typically achieved with 1.5 x the clock frequency and a dual vector unit in KNL instead of a single vector unit in KNC

KNC has a P54C core, a design from two decades ago. KNL has a Silvermont core. Silvermont alone is 50% faster per clock than the first Atom, never mind a 1995 P54C Pentium. The fastest KNC runs at 1.238GHz, so KNL only has a ~20% clock advantage. BTW, the original Atom was almost as fast as the first Pentium 4s, except in code that took advantage of SSE and the FP improvements introduced since the Pentium 4. That's way faster than P54C.

Silvermont>>First Atom>>P54C

Vector unit performance is explicitly stated separately for the 3+TFlops figure.

Those are two DIFFERENT things.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
KNL only has ~20% clock advantage

source ?


Vector unit performance is explicitly stated separately for the 3+TFlops figure.
Those are two DIFFERENT things.

IMHO much the same thing (but for a *single thread*) from fine print (6) in the slide posted above: "Projected peak theoretical single-thread performance [...]"

or do you have a source where *scalar* performance and/or *effective* performance is claimed ?
 
Last edited:

NTMBK

Lifer
Nov 14, 2011
10,523
6,048
136

6 TFLOPs single precision performance should be all the information you need. ;)

6TFLOPs / (72 cores * 2 vector units * 16 vector lanes * 2 ops [FMA is two ops] ) = 1302083333.3 Hz = 1.3GHz

Or if it has "only" 60 cores, it comes out to 1.56GHz.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
6TFLOPs / (72 cores * 2 vector units * 16 vector lanes * 2 ops [FMA is two ops] ) = 1302083333.3 Hz = 1.3GHz

Or if it has "only" 60 cores, it comes out to 1.56GHz.

sure, but the slide says in fine print (0) "Over 3 Teraflops [...]"; it doesn't make much sense to make overly precise predictions based on the fact that it isn't 3 TFlops (DP theoretical peak) but > 3 TFlops

one possible indicator for KNL frequency will be the frequency range for Airmont, i.e. the frequency at which it runs with a power consistent with the expected power for KNL

this SKU for example is given at 2 W SDP for 4 cores (+ iGPU) and @ 1.6-2.4 GHz:
http://ark.intel.com/products/85475/Intel-Atom-x7-Z8700-Processor-2M-Cache-up-to-2_40-GHz

we must also take into account the two wide vector units, so it's not so simple
 
Last edited:

NTMBK

Lifer
Nov 14, 2011
10,523
6,048
136
sure, but the slide says in fine print (0) "Over 3 Teraflops [...]"; it doesn't make much sense to make overly precise predictions based on the fact that it isn't 3 TFlops (DP theoretical peak) but > 3 TFlops

one possible indicator for KNL frequency will be the frequency range for Airmont, i.e. the frequency at which it runs with a power consistent with the expected power for KNL

Well, we know it isn't much over 3TFLOPs... if it was >3.5TFLOPs, that's what they would put on the slide. ;) It's enough for a rough ballpark figure, and it means we know it isn't going to be some crazy >2GHz thing. I'm sure my clock speed estimate won't turn out to be precisely correct, but I'm pretty confident that it's not that far off.

Airmont is an extremely different core from KNL: it isn't crunching 512-bit vectors or performing big fat 512-bit loads/stores through the entire cache hierarchy. I wouldn't use that as a guide for clock speeds. The thing is basically a GPU; expect GPU clock speeds.