Knights Landing announced

NTMBK · Jun 18, 2013

Any news on how the socketed version will work? Can it be used in a multisocket system- and more importantly, can it be used alongside a standard Xeon in a different socket? (Although obviously you'd need an OS smart enough to know where to schedule threads.) Will KL support MMX, SSE1-4, AVX1-2?

mrmt · Jun 18, 2013

NTMBK said:
Any news on how the socketed version will work? Can it be used in a multisocket system- and more importantly, can it be used alongside a standard Xeon in a different socket? (Although obviously you'd need an OS smart enough to know where to schedule threads.) Will KL support MMX, SSE1-4, AVX1-2?

Speculation is that it will be on the same socket as of Haswell-E. No news about MMX, SSE1-4 and AVX1-2.

cytg111 · Jun 18, 2013

Woo .. next level of co-processing here we come

where was that thread on 'the next big uarch thing' ...

pm · Jun 18, 2013

I took a tour of the NACR supercomputer ("Yellowstone", up near Laramie, Wyoming... really interesting tour if you are driving by on I80 and want to stop somewhere in the middle of Wyoming) a few months ago, and one thing that they mentioned that hadn't occurred to me was that FORTRAN compiler support was huge for them. They said that a lot of the original work in fluid and heat transfer simulations were all coded up in FORTRAN and then over time the code base has continued to develop in FORTRAN. So they said that one advantage of KC from their perspective was the strong FORTRAN compiler support.

* Although I work for Intel, I am not an Intel spokesperson and my opinions are my own.

Jodell88 · Jun 18, 2013

pm said:
I took a tour of the NACR supercomputer ("Yellowstone", up near Laramie, Wyoming... really interesting tour if you are driving by on I70 and want to stop somewhere in the middle of Wyoming) a few months ago, and one thing that they mentioned that hadn't occurred to me was that FORTRAN compiler support was huge for them. They said that a lot of the original work in fluid and heat transfer simulations were all coded up in FORTRAN and then over time the code base has continued to develop in FORTRAN. So they said that one advantage of KC from their perspective was the strong FORTRAN compiler support.

* Although I work for Intel, I am not an Intel spokesperson and my opinions are my own.

In the scientific community FORTRAN is the king followed in distant second by C.

sushiwarrior · Jun 18, 2013

Nvidia should be worried but they're hardly beaten. Rumour is Phi was sold at a loss, more or less, in order to get China to use it. Considering they have 3 Phi's for every K20x, I don't think Nvidia should be running home scared. Phi lacks any kind of flexibility and porting legacy code is just as difficult as CUDA (no SSE, no x87, just x86). So it still needs it's own specialized code.

Khato · Jun 18, 2013

sushiwarrior said:
Nvidia should be worried but they're hardly beaten. Rumour is Phi was sold at a loss, more or less, in order to get China to use it.

What exactly is the source of this rumor? Other than pure speculation on the part of those who wish it to be so.

For comparison, we know that the initial cost of upgrading Jaguar to Titan was approximately $60M. That's for 18688 nodes which each required at least a new motherboard, Opteron 6274, and K20x. Simple division of that initial cost by the number of nodes arrives at only $3.2k per node... which is already below the $3.3k price of the basic K20 and has to pay for at least the CPU and motherboard as well. Now sure, there's no question that NVIDIA made a profit given that the cost per card is probably ~$500, but it's a given that they sold well below normal price.

Point of the above being that it's common to sell below the standard price in order to get into the #1 Top500 spot... but it's doubtful that any supplier would go below cost, or even go below ~50% margins.

NTMBK · Jun 18, 2013

sushiwarrior said:
Phi lacks any kind of flexibility and porting legacy code is just as difficult as CUDA (no SSE, no x87, just x86). So it still needs it's own specialized code.

It depends on how your codebase is written. Plenty of scientific apps will use pure C or Fortran, with perhaps a few highly optimized libraries (e.g. BLAS) that they will call which may contain assembly/intrinsics. For Phi they would just need to recompile with the Phi compiler and the correct libraries linked.

LogOver · Jun 18, 2013

sushiwarrior said:
Nvidia should be worried but they're hardly beaten. Rumour is Phi was sold at a loss, more or less, in order to get China to use it. Considering they have 3 Phi's for every K20x,

This is not true. Tianhe-2 vs. Titan - 2.6x more accelerators but also 1.92x more performance.

I don't think Nvidia should be running home scared. Phi lacks any kind of flexibility and porting legacy code is just as difficult as CUDA (no SSE, no x87, just x86). So it still needs it's own specialized code.

Not true as well. Phi support almost all kinds of usual server tools (C, FORTRAN, OpenMP, etc) which can generate code for phi without any significant effort for application porting. Through some effort is still required for speed optimization.

sushiwarrior · Jun 18, 2013

Khato said:
What exactly is the source of this rumor? Other than pure speculation on the part of those who wish it to be so.

For comparison, we know that the initial cost of upgrading Jaguar to Titan was approximately $60M. That's for 18688 nodes which each required at least a new motherboard, Opteron 6274, and K20x. Simple division of that initial cost by the number of nodes arrives at only $3.2k per node... which is already below the $3.3k price of the basic K20 and has to pay for at least the CPU and motherboard as well. Now sure, there's no question that NVIDIA made a profit given that the cost per card is probably ~$500, but it's a given that they sold well below normal price.

Point of the above being that it's common to sell below the standard price in order to get into the #1 Top500 spot... but it's doubtful that any supplier would go below cost, or even go below ~50% margins.

Not below margin/cost, but with a slim enough margin that the project is a loss after R&D costs. Just speculation really, given that the project is performing below Intel's expectations.

NTMBK said:
It depends on how your codebase is written. Plenty of scientific apps will use pure C or Fortran, with perhaps a few highly optimized libraries (e.g. BLAS) that they will call which may contain assembly/intrinsics. For Phi they would just need to recompile with the Phi compiler and the correct libraries linked.

I haven't used CUDA myself but I'm guessing it's not much harder to do the exact same thing. Fact of the matter is that "ease of coding" doesn't really matter a lot with supercomputers - you make the code specifically for THAT supercomputer, it's "hardware first coding later" generally. I think the benefit of being x86 is pretty much lost considering the weaker performance compared to GPGPU/K20x.

For comparison:

The Nvidia Tesla K20X is a 7.1 billion transistor, 235 watt, 1.31 TP DP, 3.95 TP SP on a 28nm process
and the Tesla K20 is a 7.1 billion transistor, 225 watt, 1.17 TP DP, 3.52 TP SP on a 28nm process

The Intel Xeon Phi 5110P is a 5.0 billion transistor, 225 watt, 1.01 TP DP, 2.02 TP SP on a 22nm process

Less DP performance, less SP performance, same TDP even though the process is more refined. Intel has not "trounced" anything.

Khato · Jun 18, 2013

sushiwarrior said:
Not below margin/cost, but with a slim enough margin that the project is a loss after R&D costs. Just speculation really, given that the project is performing below Intel's expectations.

Oh, the ship has long since sailed on Larrabee being a profitable venture for Intel anywhere in the near term. The project has been nowhere near Intel's expectations for it, but they're at least finding a potential route to salvage the investment with the Xeon Phi product line. But it's difficult to state 'below cost' in those terms since it depends upon total sales and future products really.

sushiwarrior said:
Less DP performance, less SP performance, same TDP even though the process is more refined. Intel has not "trounced" anything.

Eh, it's a similar issue to what Intel's graphics have encountered thus far - they save a bit on die space (5B transistors versus 7.08B) but pay for it in power due to having to run at a higher frequency (a bit over 1GHz versus ~700MHz.)

sushiwarrior · Jun 18, 2013

Khato said:
Oh, the ship has long since sailed on Larrabee being a profitable venture for Intel anywhere in the near term. The project has been nowhere near Intel's expectations for it, but they're at least finding a potential route to salvage the investment with the Xeon Phi product line. But it's difficult to state 'below cost' in those terms since it depends upon total sales and future products really.

Eh, it's a similar issue to what Intel's graphics have encountered thus far - they save a bit on die space (5B transistors versus 7.08B) but pay for it in power due to having to run at a higher frequency (a bit over 1GHz versus ~700MHz.)

Yes, if Intel has a breakthrough it could become profitable, but the R&D costs I'm referring to are designing Xeon Phi, not the Larrabee project as a whole. I just mean the R&D to design and fabricate the chip given the previous work already done on Larrabee.

It seems like a bad approach from Intel IMO. No matter how many transistors they save, x86 is simply not cut out for massively parallel work. It's likely that the cache misses with a 57 core chip are absolutely destroying its performance, and are likely to do the same to any x86 processor on this scale. Maybe SMT would be able to help extract performance out of the wasted cycles, but even then we're reaching a point where this has really been done before (Sparc anyone? 64 threads in 2005). I don't think Intel is going to get anything meaningful out of shrinking x86 cores. They really need to step outside of the box.

LogOver · Jun 18, 2013

sushiwarrior said:
I haven't used CUDA myself but I'm guessing it's not much harder to do the exact same thing. Fact of the matter is that "ease of coding" doesn't really matter a lot with supercomputers - you make the code specifically for THAT supercomputer, it's "hardware first coding later" generally. I think the benefit of being x86 is pretty much lost considering the weaker performance compared to GPGPU/K20x.

It's really harder to port SW to CUDA. So much harder. Few more Gflops doesn't worth the effort you would spend on programming for CUDA. Especially because GPU programming has so many limitation that in many cases you would not be able to utilize K20 gflops advantage over MIC.

For comparison:

Less DP performance, less SP performance, same TDP even though the process is more refined. Intel has not "trounced" anything.

In most cases it is much harder to utilize these Gflops because of very limited programming model.

sushiwarrior · Jun 18, 2013

LogOver said:
It's really harder to port SW to CUDA. So much harder. Few more Gflops doesn't worth the effort you would spend on programming for CUDA. Especially because GPU programming has so many limitation that in many cases you would not be able to utilize K20 gflops advantage over MIC.

In most cases it is much harder to utilize these Gflops because of very limited programming model.

The CUDA platform is accessible to software developers through CUDA-accelerated libraries, compiler directives (such as OpenACC), and extensions to industry-standard programming languages, including C, C++ and Fortran. C/C++ programmers use 'CUDA C/C++', compiled with "nvcc", NVIDIA's LLVM-based C/C++ compiler,[2] and Fortran programmers can use 'CUDA Fortran', compiled with the PGI CUDA Fortran compiler from The Portland Group.

I understand that CUDA is more difficult than doing nothing, but I don't think JUST x86 is a whole lot better to work with.

LogOver · Jun 18, 2013

sushiwarrior said:
I understand that CUDA is more difficult than doing nothing, but I don't think JUST x86 is a whole lot better to work with.

Cuda Fortan, Cuda C++ are just extension to Fortran and C++ allow adding device low-level GPU code to Fortran/C++ sources. It doesn't mean you can just compile existing program into GPU code.
It also does not eliminate any programming limitations associated with CUDA (such as thread schedule restrictions - no support for POSIX threads) which makes it hard to impossible to reach good enough gflops efficiency on many algorithms.

Ajay · Jun 18, 2013

ShintaiDK said:
Didint take long for Intel to take the HPC crown. And Phi will be on 14nm before Tesla will be on 20nm I guess.

I've been trying to find that out, but haven't been able to find dates on 14nm Phi. Obviously, NV's experience means allot, seeing how their 28nm product is generally considered to have high performance than Intel's 22nm product. Much will depend on how good TSMC's 20nm node turns out.

jhu said:
KC is way easier to program for.

Sure looks that way based on the small samples I've looked at. I haven't looked at the any programming guide personally and don't know if there are various memory pools (with varying performance parameters) and cache levels like with GPUs - and hence the need for specific hardware related optimizations. Also don't know if Phi has the a huge software libraries like CUDA does.

ShintaiDK said:
And able to handle alot more tasks due to its much more versatile uarch.

It looks more straight forward, but I'm not sure it's more versatile, that is, for high performance software. I guess I should get off my duff and doing some more reading. In any case, interesting times, in terms of performance jumps, in this area of computing

Ajay · Jun 18, 2013

USER8000 said:

Three cheers for XKCD!

sushiwarrior · Jun 18, 2013

LogOver said:
Cuda Fortan, Cuda C++ are just extension to Fortran and C++ allow adding device low-level GPU code to Fortran/C++ sources. It doesn't mean you can just compile existing program into GPU code.
It also does not eliminate any programming limitations associated with CUDA (such as thread schedule restrictions - no support for POSIX threads) which makes it hard to impossible to reach good enough gflops efficiency on many algorithms.

But it's inefficient to "port" software designed for supercomputers anyways. I'm sure software designed for SSE has functions which, while implementable in x86, come with performance penalties. Things written for supercomputers should always be written for the specific supercomputer in use.

Yeah, CUDA is very lacking in several areas.

Khato · Jun 18, 2013

sushiwarrior said:
Yes, if Intel has a breakthrough it could become profitable, but the R&D costs I'm referring to are designing Xeon Phi, not the Larrabee project as a whole. I just mean the R&D to design and fabricate the chip given the previous work already done on Larrabee.

Well, considering where the Larrabee projects were at the time of the decision to switch it from graphics to a pure processing accelerator... Yeah, it doesn't make sense to view that as the R&D costs for the product. Regardless, it's safe to say that the first generation of Xeon Phi is going to do nothing more than help Intel begin to recoup a small portion of the LRB investment... it's the next iteration and beyond that are going to get interesting.

sushiwarrior said:
It seems like a bad approach from Intel IMO. No matter how many transistors they save, x86 is simply not cut out for massively parallel work. It's likely that the cache misses with a 57 core chip are absolutely destroying its performance, and are likely to do the same to any x86 processor on this scale. Maybe SMT would be able to help extract performance out of the wasted cycles, but even then we're reaching a point where this has really been done before (Sparc anyone? 64 threads in 2005). I don't think Intel is going to get anything meaningful out of shrinking x86 cores. They really need to step outside of the box.

Uhhhhhhhh... No, just no. If that were the case then the GPU architectures would be in the exact same boat. As well, the x86 portion of each 'core' is merely a front end for the SIMD units (accessed by an x86 instruction set extension) that provide the actual compute throughput (simplest way to think of it is as being the same as a GPU architecture, but with a simple x86 core in front of a block of shaders.) The advantage of this approach being that the simple x86 core can take care of a number of basic operations that would otherwise need to go back to the host processors/hit restrictions with a normal GPU co-processor.

sushiwarrior · Jun 18, 2013

Khato said:
Uhhhhhhhh... No, just no. If that were the case then the GPU architectures would be in the exact same boat. As well, the x86 portion of each 'core' is merely a front end for the SIMD units (accessed by an x86 instruction set extension) that provide the actual compute throughput (simplest way to think of it is as being the same as a GPU architecture, but with a simple x86 core in front of a block of shaders.) The advantage of this approach being that the simple x86 core can take care of a number of basic operations that would otherwise need to go back to the host processors/hit restrictions with a normal GPU co-processor.

Yes, but the x86 instruction set is bloated and requires a large front end, which GPU's don't need and things like ARM don't need either. It adds a large power requirement and takes up tons of die space for being able to perform a few extra instructions without the host processor. Is this advantageous for some work? Sure, recursive work in particular is just about useless on a GPU and presumably would work well on Xeon Phi, but being good at a very specific subset of work is much different than being the next best HPC invention.

Khato · Jun 18, 2013

sushiwarrior said:
Yes, but the x86 instruction set is bloated and requires a large front end, which GPU's don't need and things like ARM don't need either. It adds a large power requirement and takes up tons of die space for being able to perform a few extra instructions without the host processor. Is this advantageous for some work? Sure, recursive work in particular is just about useless on a GPU and presumably would work well on Xeon Phi, but being good at a very specific subset of work is much different than being the next best HPC invention.

You're severely over-estimating both the die size and power requirements of the 'enhanced' P54C x86 front-end being used for each 'core' in KNC. (Here's a hint, the original P54C was 163mm^2 on a 0.6um process.) I'd guess that the x86 portion of KNC takes up maybe 5% of the die space, which is a fair bit less than some other things that were left in and don't contribute to HPC throughput.

sushiwarrior · Jun 18, 2013

Khato said:
You're severely over-estimating both the die size and power requirements of the 'enhanced' P54C x86 front-end being used for each 'core' in KNC. (Here's a hint, the original P54C was 163mm^2 on a 0.6um process.) I'd guess that the x86 portion of KNC takes up maybe 5% of the die space, which is a fair bit less than some other things that were left in and don't contribute to HPC throughput.

Maybe I am underestimating how much x86 is cut down then. I am making inferences based on the size of comparable ARM and x86 processors. The extended instruction sets must have more impact than I thought on the front end. It's hard to say exactly until we get a really solid die shot (I can't make much out on the SemiAccurate pic).

Ajay · Jun 19, 2013

Khato said:
You're severely over-estimating both the die size and power requirements of the 'enhanced' P54C x86 front-end being used for each 'core' in KNC. (Here's a hint, the original P54C was 163mm^2 on a 0.6um process.) I'd guess that the x86 portion of KNC takes up maybe 5% of the die space, which is a fair bit less than some other things that were left in and don't contribute to HPC throughput.

IIRC, it's not even the full P54C implementation, so just lean and mean. There would be no need for legacy x87 for instance and if we went down the list of instructions, I think we'd find plenty to cut.

Khato · Jun 19, 2013

Ajay said:
IIRC, it's not even the full P54C implementation, so just lean and mean. There would be no need for legacy x87 for instance and if we went down the list of instructions, I think we'd find plenty to cut.

Quite possible. (Which is my way of leaving it up in the air because I can never remember what information has actually been released versus not and trying to find out is a pain.)

It's also quite amusing to see Intel using a derivative of a 20 year old processor design in modern products.

IntelUser2000 · Jun 19, 2013

sushiwarrior said:
Less DP performance, less SP performance, same TDP even though the process is more refined. Intel has not "trounced" anything.

The official pricing is less on the Phi, so its not that big of a disadvantage. Based on similar pricing, they have about the same DP/watt.

Maybe I am underestimating how much x86 is cut down then. I am making inferences based on the size of comparable ARM and x86 processors.

x86 may have a slight impact, but Intel's logic size compared to TSMC is actually larger in the similar generation(~32nm generation for example).

I assume Intel's die size is between TSMC generations. For example, in terms of size, 32nm Intel would be right smack dab between 40nm TSMC and 28nm* TSMC. Of course Intel has higher performing transistors.

*Before arguing about number designations, its only relevant as a reference point. 28nm does not mean its necessarily smaller than 32nm, unless they are comparing against their own. In this case though, Intel should have similar SRAM densities at 32nm vs TSMC's 28nm, but TSMC may have smaller logic.

Knights Landing announced

Lifer

Diamond Member

Lifer

Elite Member Mobile Devices

Diamond Member

Senior member

Golden Member

Lifer

Member

Senior member

Golden Member

Senior member

Member

Senior member

Member

Lifer

Lifer

Senior member

Golden Member

Senior member

Golden Member

Senior member

Lifer

Golden Member

Elite Member