Knights Landing package pictures


NTMBK

Lifer
Nov 14, 2011
10,523
6,048
136
The interesting thing about Knight's Landing is that it gets rid of the "host" CPU. Today, a Xeon Phi or Tesla based system needs a fast Xeon to negotiate network traffic, figure out what data the accelerator is going to need, and get it off the network and onto the accelerator's DRAM (via the host memory). Knight's Landing gets rid of that extra processor, and has the ultra-parallel processor plugged directly into the network. This should significantly improve power efficiency, cost, and overall network latencies.
 

DrMrLordX

Lifer
Apr 27, 2000
23,222
13,300
136
someone simply made up that adapter story for some reason

Might have, and no I've never seen an adapter for high-end CPUs (or ultra-high-end, in the case of Knight's Landing). It's just that seeing Intel share sockets between Xeons and Xeon Phis would be awesome, and reminiscent of the original AMD Fusion plan (wherein CPUs and GPUs would fit into sockets on MP boards and share the same HT links).

The interesting thing about Knight's Landing is that it gets rid of the "host" CPU. Today, a Xeon Phi or Tesla based system needs a fast Xeon to negotiate network traffic, figure out what data the accelerator is going to need, and get it off the network and onto the accelerator's DRAM (via the host memory). Knight's Landing gets rid of that extra processor, and has the ultra-parallel processor plugged directly into the network. This should significantly improve power efficiency, cost, and overall network latencies.

You bring up an interesting point here. Despite Nvidia's current prominence in the HPC market when it comes to high-performance compute devices, it's really quite obvious that they have very little market control over the underlying platforms upon which their products are reliant.

Nvidia's future in the HPC market is based on the idea that HPC machines will continue to feature PCIe slots, and that everyone will be quite happy dispatching work queues to PCIe-connected devices hosted on otherwise-traditional server hardware.

Intel wants to eliminate the PCIe element entirely with a high-performance compute device that can also run the underlying OS and handle "non-compute" functions generally ill-suited to GPGPU devices.

AMD wants to fuse the CPU and GPU into one cheap little SoC node unit and hope that people will buy a whole lot of them (Berlin, or what Berlin could have been if it had seen the light of day).

Neither Intel's nor AMD's intended HPC cluster paradigm "of the future" leaves much room for Nvidia's products. Maybe that's one of the reasons why that new ORNL supercomputer is using POWER + Tesla instead of Xeons.
 

Nothingness

Diamond Member
Jul 3, 2013
3,373
2,469
136
The interesting thing about Knight's Landing is that it gets rid of the "host" CPU. Today, a Xeon Phi or Tesla based system needs a fast Xeon to negotiate network traffic, figure out what data the accelerator is going to need, and get it off the network and onto the accelerator's DRAM (via the host memory). Knight's Landing gets rid of that extra processor, and has the ultra-parallel processor plugged directly into the network. This should significantly improve power efficiency, cost, and overall network latencies.
Agreed, this is interesting, but OTOH the low single-thread performance will be an issue for many tasks (as Intel apologists like to [correctly] remind us all when talking about microserver SoCs :D), so high-end Xeons will still be needed.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Intel has the advantage of 14nm vs. Nvidia's 28nm, so they are simply in a different league in perf/watt.

And x86 compatibility is never to be underestimated. Intel moving the Xeon Phi project towards vanilla x86 + AVX512 is an alarm bell for NV/AMD compute projects. How they would stack up if all were on the same process, I don't know, but I'd take AVX anything over CUDA/OpenCL any day EVEN if Xeon was half as fast.
 
Mar 10, 2006
11,715
2,012
126
Agreed, this is interesting, but OTOH the low single-thread performance will be an issue for many tasks (as Intel apologists like to [correctly] remind us all when talking about microserver SoCs :D), so high-end Xeons will still be needed.

I'm very interested to see how much ST perf that enhanced Silvermont delivers. Everything about that core looks a lot better than the bog standard Silvermont, but it might run at a significantly lower frequency.

What I find more interesting is that Intel was able to deliver significantly more per clock performance with the Knights Landing Silvermont even though it remains 2-issue, which might suggest that future Atom CPUs will stay 2-issue for quite a while yet.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
I'm very interested to see how much ST perf that enhanced Silvermont delivers. Everything about that core looks a lot better than the bog standard Silvermont, but it might run at a significantly lower frequency.

Why is ST important at all??? It's an FP monster, and FP is all about flops and memory hierarchy bandwidth/latency. So you have your Atom core with HT, and you run multiple HT threads with your AVX512 code that gets executed on the SIMD units. It's all about throughput.

EDIT: What I mean is that while ST performance is relevant for the OS etc., Intel has been moving everything including the kitchen sink into the AVX512 instruction set; you can do a ton of stuff that previously was done outside of vectorization or had to be explicitly set up for an accelerator (like converting structures, processing the head/tail of a structure, etc.). Scatter/gather, masks for logic/flow control, and explicit rounding are all there.
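
To make that concrete, here's a minimal C sketch of the kind of thing AVX-512 lets you express directly: a gather plus a masked operation, i.e. the irregular access and flow control that used to need scalar fallbacks or explicit accelerator setup. The function and array names are just placeholders, and it assumes the caller hands in at least 16 elements.

```c
#include <immintrin.h>

// Minimal AVX-512 sketch (hypothetical helper): gather 16 floats through an
// index vector, then add another vector only in the lanes where it is > 0.
// Masking and gather are exactly the features mentioned above.
void masked_gather_add(const float *base, const int *idx,
                       const float *b, float *out)
{
    __m512i vindex   = _mm512_loadu_si512(idx);                  // 16 indices
    __m512  gathered = _mm512_i32gather_ps(vindex, base, 4);     // base[idx[i]]
    __m512  vb       = _mm512_loadu_ps(b);
    __mmask16 m      = _mm512_cmp_ps_mask(vb, _mm512_setzero_ps(),
                                          _CMP_GT_OQ);           // lanes with b > 0
    __m512  sum      = _mm512_mask_add_ps(gathered, m, gathered, vb);
    _mm512_storeu_ps(out, sum);                                  // store all 16 lanes
}
```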
 
Last edited:

Nothingness

Diamond Member
Jul 3, 2013
3,373
2,469
136
I'm very interested to see how much ST perf that enhanced Silvermont delivers. Everything about that core looks a lot better than the bog standard Silvermont, but it might run at a significantly lower frequency.
Did Intel say they improved the integer core?

What I find more interesting is that Intel was able to deliver significantly more per clock performance with the Knights Landing Silvermont even though it remains 2-issue, which might suggest that future Atom CPUs will stay 2-issue for quite a while yet.
Given what is achievable with a two-way decode chip (cf. Cortex-A17), there surely is room for improvement over Silvermont. But I'm not sure it would make sense to go too far for a multi-core chip targeting HPC. OTOH the integer core probably looks tiny compared to the 512-bit ALUs :D
 

Nothingness

Diamond Member
Jul 3, 2013
3,373
2,469
136
Why is ST important at all??? It's an FP monster, and FP is all about flops and memory hierarchy bandwidth/latency. So you have your Atom core with HT, and you run multiple HT threads with your AVX512 code that gets executed on the SIMD units. It's all about throughput.

EDIT: What I mean is that while ST performance is relevant for the OS etc., Intel has been moving everything including the kitchen sink into the AVX512 instruction set; you can do a ton of stuff that previously was done outside of vectorization or had to be explicitly set up for an accelerator (like converting structures, processing the head/tail of a structure, etc.). Scatter/gather, masks for logic/flow control, and explicit rounding are all there.
Look up Amdahl's law ;)
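
(For anyone who hasn't run the numbers: Amdahl's law says the serial fraction caps your speedup no matter how many cores or vector lanes you throw at the problem. A quick sketch, with a made-up 5% serial fraction purely for illustration:)

```c
#include <stdio.h>

// Amdahl's law: speedup = 1 / ((1 - p) + p / n), where p is the parallel
// fraction and n the number of processors. p = 0.95 is an arbitrary example.
int main(void)
{
    double p = 0.95;
    int n[] = {1, 4, 16, 64, 256};

    for (int i = 0; i < 5; i++)
        printf("n = %3d -> speedup = %5.2fx\n",
               n[i], 1.0 / ((1.0 - p) + p / n[i]));

    return 0;   // as n grows, speedup saturates at 1/(1-p) = 20x
}
```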
 
Mar 10, 2006
11,715
2,012
126
Did Intel say they improved the integer core?


Given what is achievable with a two-way decode chip (cf. Cortex-A17), there surely is room for improvement over Silvermont. But I'm not sure it would make sense to go too far for a multi-core chip targeting HPC. OTOH the integer core probably looks tiny compared to the 512-bit ALUs :D

[Image: Intel slide listing the Knights Landing core enhancements over Silvermont]


Deeper OoO buffers, advanced prediction, and high cache bandwidth all sound like they would benefit integer perf/clock.
 

Nothingness

Diamond Member
Jul 3, 2013
3,373
2,469
136
Thanks for the slide, I had missed it :)

Deeper OoO buffers, advanced prediction, and high cache bandwidth all sound like they would benefit integer perf/clock.
Cache BW and deeper OoO buffers could apply only to the SIMD units, but advanced branch prediction will surely benefit integer code too.
 

DrMrLordX

Lifer
Apr 27, 2000
23,222
13,300
136
How they would stack up if all were on same process, I don't know, but i'd take AVX anything over CUDA/OpenCL any day EVEN if Xeon was half as fast.

GPGPU stuff can be pretty difficult to approach from the programming angle. Just looking at AMD's 1.0 final HSA documents is enough to give one a headache.

Stuff like OpenMP and other compiler-based solutions are supposed to "cover up" the problem and make it easier for coders to use compute resources, but hell compiler support for Knight's Landing is practically already there (just not necessarily the AVX512 part, yet). If you can code for a 16-core Xeon, you can pretty much code for Knight's Landing.
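
As a rough illustration of that last point, a plain OpenMP loop is about all it takes; the compiler and runtime handle threading and vectorization whether the target has 16 cores or 60+. This is a generic sketch, not KNL-specific code, and the function name is made up.

```c
#include <stddef.h>

// Generic sketch: the same OpenMP code that runs on a 16-core Xeon can run on
// a many-core x86 part; "parallel for" spreads iterations across threads and
// "simd" lets the compiler vectorize each chunk.
void scaled_add(float *y, const float *x, float a, size_t n)
{
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```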
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
but hell compiler support for Knight's Landing is practically already there (just not necessarily the AVX512 part, yet).

note that the Intel compiler has fully supported AVX-512 for KNL targets (i.e. using the vectorizer, not only code using intrinsics) for more than a year now, and AVX-512 for SKX targets for several weeks

you can see a discussion about the different compiler back-end code generation strategies for KNL and SKX here: http://www.realworldtech.com/forum/?threadid=147882&curpostid=147882
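
for the curious, the workflow is just ordinary C plus a target switch; the option names below (-xMIC-AVX512 for KNL, -xCORE-AVX512 for SKX) are from memory, so double-check them against the compiler documentation

```c
// Plain C loop the Intel compiler's vectorizer can turn into AVX-512 code.
// Example invocations (option names from memory, verify against the docs):
//   icc -O3 -std=c99 -xMIC-AVX512  saxpy.c -c   // target KNL
//   icc -O3 -std=c99 -xCORE-AVX512 saxpy.c -c   // target SKX
void saxpy(float *restrict y, const float *restrict x, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```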
 
Last edited:

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
GPGPU stuff can be pretty difficult to approach from the programming angle. Just looking at AMD's 1.0 final HSA documents is enough to give one a headache.

I am well aware of those difficulties; I've been toying with converting some of the code we are using to CUDA, and while it worked, we have stayed with the old Intel code, simply because Haswell started pushing instructions damn fast and Intel's math libs are so sweet.

Look up Amdahl's law ;)

Not sure if this is a joke, but HPC and Amdahl's law? Aren't those guys all about insanely parallel processing and always looking for algorithms to expand the parallel domains?
They are certainly not buying these CPUs for the Atom cores, even if they come with 4-way HT and improved ST performance :) There are cheaper, higher-clocked Haswells for that. And if you run 4 integer-code threads (like 7z) on that core, I doubt you will get much speedup from it; you'd probably get a slowdown instead. But run some real-world code, with real memory BW requirements and latencies, and the 4 HT threads will help mask latencies and increase utilization. HPC is all about working towards theoretical throughput while still having sensible perf per watt.
 

DrMrLordX

Lifer
Apr 27, 2000
23,222
13,300
136
note that the Intel compiler fully supports (i.e using the vectorizer not only code using intrinsics) AVX-512 for KNL targets since more than one year and AVX-512 for SKX targets since several weeks now

Gotta hand it to Intel. They are once more on the ball with compiler support.

I am well aware of those difficulties; I've been toying with converting some of the code we are using to CUDA, and while it worked, we have stayed with the old Intel code, simply because Haswell started pushing instructions damn fast and Intel's math libs are so sweet.

CUDA is supposed to have one of the better toolsets out there for GPGPU work. OpenCL and HSA are even less dev-friendly.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
He was playing with the new Phi a week ago & said..
"The new Knights Landing is over 5 times faster than its predecessor and uses 1/3 the power." Each tile has an Atom core section = 1 execution unit, 30 EU (essentially 8 x 30 processors) for a total of 240 processors on the die and is only using around 100 watts. He had to boot up firmware for the phi processors prior to booting the machine, but says final product will have those instructions included, and will probably use an adapter to fit LGA 2011 socket. The target for release is 13 teraflops per die, x 2 is 26 teraflops. He says 1 of these can replace 20 servers & has Nvidia shaking in their boots. He is NOT an Intel fanboy either, but says Intel will take over HPC soon..

Out of this nonsense we may still find some truth.

"Over 5 times faster" - Knights Corner is pretty poor in some code. This seems realistic, since it's a second generation and Intel would have learned a lot from the first one.

"uses 1/3 the power" - I doubt we'll see this, but because socketable versions will be available, we might see something in the 100-130W range. That's 1/3 the power.

"The target for release is 13 teraflops per die, x 2 is 26 teraflops."

Not sure why he thinks this way, as Intel is aiming for "14-16 DP Flops/watt". That means 13 TF = 813W (:cool:) in the best-case scenario. If your friend is sort of clueless and it's an SP figure, then it's still 407W, still way too much.

The real figures are half that, since Knights Landing will be in the 160-215W range. Perhaps for some reason they'll boost SP by 2x and end up at 3.25 TFLOPS DP / 13 TFLOPS SP?

Not sure if this is a joke, but HPC and Amdahl's law? Aren't those guys all about insanely parallel processing and always looking for algorithms to expand the parallel domains?

The weird thing though is that they tout "3x performance in scalar over Knights Corner". And obviously that slide above shows further improvements over Silvermont, which in itself is a huge gain over P54C. Either they are aiming Knights Landing at parallel but not-quite-HPC code, or Knights Corner's weak performance was due to certain limits in single-threaded performance. Or both. The successor is going to boost that even further by using Goldmont cores.

Everything about that core looks a lot better than the bog standard Silvermont, but it might run at a significantly lower frequency.

If we assume certain SKUs of Knights Landing achieve the stated peak goal of 16 DP FLOPS/W, and the top SKU is stated to be 215W, we get 3.4 TFLOPS of performance. That works out to 1.45-1.5GHz. It's possible, though, that we might see Turbo for less-utilized cores or vector units.
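
If you want to check the arithmetic, here's the back-of-the-envelope version. The 72-core, 2-VPU-per-core layout is only the commonly rumored configuration, not a confirmed spec:

```c
#include <stdio.h>

// Back-of-the-envelope numbers: 16 DP GFLOPS/W (stated goal) at a 215W top SKU,
// divided by the rumored 72 cores x 2 VPUs x 8 DP lanes x 2 FMA ops per cycle.
int main(void)
{
    double peak_dp = 16e9 * 215.0;                       // ~3.44 TFLOPS DP
    double flops_per_cycle = 72.0 * 2 * 8 * 2;           // 2304 DP FLOPs/cycle

    printf("peak DP       : %.2f TFLOPS\n", peak_dp / 1e12);
    printf("implied clock : %.2f GHz\n", peak_dp / flops_per_cycle / 1e9);
    return 0;                                            // ~1.49 GHz
}
```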
 
Last edited:

bronxzv

Senior member
Jun 13, 2011
460
0
71
The weird thing though is that they tout "3x performance in scalar over Knights Corner".

3 x *single thread* peak *theoretical* throughput (see the fine print), typically achieved with 1.5 x the clock frequency and a dual vector unit in KNL instead of the single vector unit in KNC

near 3 x better *effective* throughput will be achievable only with perfectly cache-blocked algorithms, though

we can expect way less than a 3 x speedup (from KNC to KNL) for workloads with working sets in local memory, since local memory bandwidth is only slightly improved from KNC to KNL (not even 1.5 x more bandwidth)
 
Last edited:

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
3 x *single thread* peak *theoretical* throughput (see fine prints), typically achieved with 1.5 x the clock frequency and a dual vector unit in KNL instead of a single vector unit in KNC

KNC has a P54C core, a design from two decades ago. KNL has a Silvermont core. Silvermont alone is 50% faster per clock than the first Atom, never mind a 1995 P54C Pentium. The fastest KNC runs at 1.238GHz, so KNL only has a ~20% clock advantage. BTW, the original Atom was almost as fast as the first Pentium 4s, except in code that took advantage of SSE and the FP improvements introduced since the Pentium 4. That's way faster than P54C.

Silvermont>>First Atom>>P54C

Vector unit performance is explicitly stated separately for the 3+TFlops figure.

Those are two DIFFERENT things.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
KNL only has ~20% clock advantage

source ?


Vector unit performance is explicitly stated separately for the 3+TFlops figure.
Those are two DIFFERENT things.

IMHO much the same thing (but for a *single thread*) from fine print (6) in the slide posted above: "Projected peak theoretical single-thread performance [...]"

or do you have a source where *scalar* performance and/or *effective* performance is claimed ?
 
Last edited:

NTMBK

Lifer
Nov 14, 2011
10,523
6,048
136

6 TFLOPs single precision performance should be all the information you need. ;)

6TFLOPs / (72 cores * 2 vector units * 16 vector lanes * 2 ops [FMA is two ops] ) = 1302083333.3 Hz = 1.3GHz

Or if it has "only" 60 cores, it comes out to 1.56GHz.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
6TFLOPs / (72 cores * 2 vector units * 16 vector lanes * 2 ops [FMA is two ops] ) = 1302083333.3 Hz = 1.3GHz

Or if it has "only" 60 cores, it comes out to 1.56GHz.

sure, but the slide says in fine print (0) "Over 3 Teraflops [...]"; it doesn't make much sense to make overly precise predictions based on the fact that it isn't 3 TFlops (DP theoretical peak) but > 3 TFlops

one possible indicator for KNL frequency will be the frequency range for Airmont, i.e. the frequency at which it runs with a power consistent with the expected power for KNL

this SKU for example is given at 2 W SDP for 4 cores (+ iGPU) and @ 1.6-2.4 GHz:
http://ark.intel.com/products/85475/Intel-Atom-x7-Z8700-Processor-2M-Cache-up-to-2_40-GHz

we must also take into account the two wide vector units, so it's not so simple
 
Last edited:

NTMBK

Lifer
Nov 14, 2011
10,523
6,048
136
sure, but the slide says in fine print (0) "Over 3 Teraflops [...]"; it doesn't make much sense to make overly precise predictions based on the fact that it isn't 3 TFlops (DP theoretical peak) but > 3 TFlops

one possible indicator for KNL frequency will be the frequency range for Airmont, i.e. the frequency at which it runs with a power consistent with the expected power for KNL

Well, we know it isn't much over 3TFLOPs... if it was >3.5TFLOPs, that's what they would put on the slide. ;) It's enough for a rough ballpark figure, and it means we know it isn't going to be some crazy >2GHz thing. I'm sure my clock speed estimate won't turn out to be precisely correct, but I'm pretty confident that it's not that far off.

Airmont is an extremely different core from KNL: it isn't crunching 512-bit vectors or performing big fat 512-bit loads/stores through the entire cache hierarchy. I wouldn't use that as a guide for clock speeds. The thing is basically a GPU; expect GPU clock speeds.