Intel Skylake / Kaby Lake

TheF34RChannel · Jul 7, 2017

DrMrLordX said:
So how much of a power draw delta is there between a 7820x running Prime95 29.2 and CBR15 when OCed to 4.5 GHz with a standard -500 MHz AVX offset?

Not owning this CPU I couldn't tell you. Someone who does own it might?

Markfw said:
Someone here asked for anyone to run this software on a Ryzen, so I thought it was OK. If no one wants to know, I will ignore future requests. I was not arguing the case like others were before.

True, and personally I am not against having good references to the competition in any thread. Because mods stepped in often, my understanding was that it was a strict rule meant to keep fanboys and trolls out - not that I am accusing you of being either, rest assured. I am all for a discussion broader than one brand, as long as it's honest.

Markfw · Jul 7, 2017

@TheF34RChannel

Thanks for understanding, that's the EXACT intent.

TheF34RChannel · Jul 7, 2017

Markfw said:
@TheF34RChannel

Thanks for understanding, that's the EXACT intent.

Sure thing. It may be worth its own thread so the results won't be buried under a plethora of other, often unrelated posts?

Markfw · Jul 7, 2017

TheF34RChannel said:
Sure thing. It may be worth its own thread so the results won't be buried under a plethora of other, often unrelated posts?

That's up to you!

TheF34RChannel · Jul 7, 2017

Markfw said:
That's up to you!

Not really ha ha

jpiniero · Jul 7, 2017

Ajay said:
Why Xeon D? And do we know for sure that Intel is planning a Xeon D "Icelake"?

There haven't been any rumors recently about any Xeon D's in general, but given Intel's server first mentality now it would make sense that it would be the lead product for the mainstream line. Could be pretty compelling if Intel delivers with Icelake+EMIB. That's not to say that they will though.

DrMrLordX · Jul 7, 2017

If the past is any judge, I would think their next Xeon-D product would be Cannonlake-based. Unless 10nm is really just that immature. Then yeah they might delay until Icelake.

jpiniero · Jul 7, 2017

DrMrLordX said:
If the past is any judge, I would think their next Xeon-D product would be Cannonlake-based. Unless 10nm is really just that immature. Then yeah they might delay until Icelake.

Yields at 10 nm appear to be that bad. If Icelake mainstream is EMIB powered, that obviously solves a lot of yield problems.

DrMrLordX · Jul 7, 2017

jpiniero said:
Yields at 10 nm appear to be that bad. If Icelake mainstream is EMIB powered, that obviously solves a lot of yield problems.

That's actually really bad for Intel. Xeon-D is one of the products keeping ARM out of the server room. Delays in updating the platform are just bad, bad, bad. I don't like to see that at all. Maybe ARM in the server room isn't so bad, but in the consumer space, it is nothing but cancer; or rather, the software ecosystems that come with it are cancerous. Personally I do not want anyone or anything encouraging ARM to get out of the tablet and phone sector.

DrMrLordX · Jul 8, 2017

Okay, taking this back to the issue of heat/power throttling, VRMs, and other fun stuff with Skylake-X:

http://www.numberworld.org/y-cruncher/

Read the July 6th news. For those that don't want to click the link:

Skylake X and AVX512: (July 6, 2017)

Let's talk about Skylake X and AVX512. Because everyone's been waiting for this. Since there's currently a lack of AVX512 benchmarks and stress tests. And because of that, I've had at least half a dozen people and organizations contact me about y-cruncher's AVX512.

Okay... some AVX512 benchmarks already existed. SiSoftware Sandra had some support. And my little-known FLOPs benchmark did too. But people either weren't aware of them, or wanted more. And by advertising y-cruncher's internal AVX512 support for at least a year now, I basically brought this on myself.

So let's get to the point. Unfortunately, AVX512 will not bring the "instant massive performance gain" that a lot of people were expecting. Realistically speaking, the speedups over AVX2 seem to vary around 10 - 50% - usually on the lower end of that scale. While the investigation is on-going, there are some known factors:

* Not all Skylake X and Skylake Purley processors will have the full AVX512 capability.
* "Phantom throttling" of performance when certain thermal limits are exceeded.
* Memory bandwidth is a significant bottleneck.
* Amdahl's law and other unknown scalability issues.

Not all Skylake X and Skylake Purley processors will have the full AVX512 capability:

While this reason doesn't apply to my system, it's worth mentioning it anyway.

Architecturally, Skylake X retains Skylake desktop's architecture with 2 x 256-bit FMA units. In Skylake X, those two 256-bit FMA units can merge to form a single 512-bit FMA. On the processors with full-throughput AVX512, there is also a dedicated 512-bit FMA - thereby providing 2 x 512-bit FMA capability.

However, that dedicated 512-bit FMA is only enabled on the Core i9 parts. The 6-core and 8-core Core i7 parts are supposed to have it disabled. Therefore they only have half the AVX512 performance.
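
To put rough numbers on "half the AVX512 performance", here is a back-of-the-envelope sketch in Python. It assumes double precision (64-bit lanes) and counts an FMA as two floating-point operations per lane; the function name is made up for illustration and nothing here is measured.

```python
def peak_dp_flops_per_cycle(vector_bits, fma_units):
    """Theoretical double-precision FLOPs per cycle for one core."""
    lanes = vector_bits // 64      # 64-bit doubles per vector register
    return fma_units * lanes * 2   # one FMA = multiply + add per lane

avx2 = peak_dp_flops_per_cycle(256, 2)         # 2 x 256-bit FMA
avx512_half = peak_dp_flops_per_cycle(512, 1)  # fused 512-bit FMA only
avx512_full = peak_dp_flops_per_cycle(512, 2)  # fused + dedicated 512-bit FMA

print(avx2, avx512_half, avx512_full)
```

Under these assumptions, a part without the dedicated unit has the same peak FMA rate per cycle as AVX2, so any AVX512 gain on it would have to come from something other than raw FMA width.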

It's worth mentioning that there is a benchmark on an engineering-sample 6-core Core i7 that shows full-throughput AVX512 anyway. However, engineering sample processors are not always representative of the retail parts.

So as of this writing, I still don't know if the 6 and 8-core Skylake X Core i7's have the full AVX512. The only Skylake X processor I have at this time is the Core i9 7900X which is supposed to have the full AVX512 anyway. (and indeed it does based on my tests)

"Phantom throttling" of performance when certain thermal limits are exceeded:

Within minutes of getting my system set up, I started noticing inconsistencies in performance. And after spending a long Friday night investigating the issue, I determined that there was a sort of "Phantom throttling" of AVX512 code when certain thermal limits are exceeded.

"Phantom throttling" is the term that I used to describe the problem in my emails with the Silicon Lottery vendor. And it looks like I'm not the only one using that term anymore. Phantom throttling is when the processor gets throttled without a change in clock frequency. For many years, processors have throttled down for many reasons to protect themselves from damage. But when throttling happens, it has always been done by lowering the clock frequency - which is visible in a monitor like CPU-Z. Skylake X is the first line of processors to break from this and it makes it more difficult to detect the throttling.

Right now, the phantom throttling phenomenon is still not well understood. Overclocker der8auer has mentioned that it could be caused by CPU-Z not reacting fast enough to actual clock frequency changes. On the other hand, the tests that Silicon Lottery and I have done seem to show that there really is no drop in clock frequency at all.

Initially, I observed this effect only with AVX512 code and thus hypothesized that the mechanism behind the throttling is the shutdown of the dedicated 512-bit FMA. But others have found that phantom throttling also occurs on AVX and scalar code. In short, much more investigation is needed. The lack of AVX512 programs out there certainly doesn't help and is partially why I'm rushing this release of y-cruncher v0.7.3.

Currently, there are no known reliable ways of stopping the throttling and results vary heavily by motherboard manufacturer. But maxing out thermal limits and disabling all thermal protections seems to help. (Don't try this at home if you don't know what you're doing or you aren't at least moderately experienced in overclocking. You can destroy your processor and/or motherboard if you aren't careful.)

Memory bandwidth is a significant bottleneck:

y-cruncher was already slightly memory-bound on Haswell-E. Now on Skylake X, it is much worse. While I had anticipated a memory bottleneck on Skylake X with AVX512, it seems that I've underestimated the severity of it:

(The CPU frequencies in this benchmark were chosen to be low enough to avoid any throttling or phantom throttling.)

1 billion digits of Pi - Core i9 7900X @ 3.8 GHz

Times in Seconds

Threads      Memory Frequency   AVX2      AVX512
1 thread     2133 MHz           444.434   325.543
1 thread     3200 MHz           438.432   319.737
20 threads   2133 MHz           51.884    45.658
20 threads   3200 MHz           47.672    39.723

In the single threaded benchmarks, the memory frequency has less than 2% effect for both AVX2 and AVX512. Multi-threaded, that jumps to 9% and 15% respectively. This is much more than is expected for a program that used to be completely compute-bound just a few years ago.
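
The percentages quoted above can be reproduced from the table. The sketch below is plain Python arithmetic on the published timings; the dictionary layout and the helper name are only illustrative.

```python
# Times in seconds, copied from the table above:
# (threads, memory MHz) -> (AVX2, AVX512)
times = {
    (1, 2133): (444.434, 325.543),
    (1, 3200): (438.432, 319.737),
    (20, 2133): (51.884, 45.658),
    (20, 3200): (47.672, 39.723),
}

def memory_gain_percent(threads, isa_index):
    """Extra run-time at 2133 MHz relative to 3200 MHz, in percent."""
    slow = times[(threads, 2133)][isa_index]
    fast = times[(threads, 3200)][isa_index]
    return (slow / fast - 1.0) * 100.0

for threads in (1, 20):
    for isa_index, isa in enumerate(("AVX2", "AVX512")):
        gain = memory_gain_percent(threads, isa_index)
        print(f"{threads:2d} threads, {isa}: {gain:.1f}%")
```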

Amdahl's law and other unknown scalability issues:

In a typical y-cruncher computation, only about 80% of the CPU time is spent running vectorized code when AVX2 is used. So by Amdahl's law, even if we get perfect scaling with AVX512, we can only cut 40% off the run-time. Right now, the single-threaded benchmarks (which are least memory-bound) are only showing a 27% reduction in run-time with AVX512 over AVX2.

This remaining 13% discrepancy is currently unresolved. Microbenchmarks of y-cruncher's AVX512 code show near perfect 2x speedups over AVX2. (Some show >2x thanks to the increased register count.) But this speedup seems to drop off as the data sizes increase - even while still fitting in cache. This seems to hint at unknown bottlenecks within the L2 and L3 caches. The fact that cache sizes haven't increased along with the wider SIMD also doesn't help.
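
The Amdahl's law estimate can be checked with a few lines of Python. This is only a sketch of the arithmetic: the 80% vectorized fraction and the ideal 2x factor come from the text, and the measured figure uses the 1-thread, 2133 MHz timings from the benchmark table. The function name is illustrative.

```python
def amdahl_time_fraction(vector_fraction, vector_speedup):
    """Remaining run-time (as a fraction of the original) under Amdahl's law."""
    return (1.0 - vector_fraction) + vector_fraction / vector_speedup

# Best case: the vectorized 80% runs exactly twice as fast.
ideal_cut = 1.0 - amdahl_time_fraction(0.80, 2.0)

# Measured: 1 thread, 2133 MHz, AVX512 time vs AVX2 time.
measured_cut = 1.0 - 325.543 / 444.434

gap = ideal_cut - measured_cut

print(f"ideal run-time cut:    {ideal_cut:.1%}")
print(f"measured run-time cut: {measured_cut:.1%}")
print(f"gap:                   {gap:.1%}")
```

Read this way, the 27% figure in the text is a reduction in run-time rather than a throughput ratio, and the 13% is the gap to the 40% ceiling.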

For now, investigation is difficult because none of my performance profilers support Skylake X yet.

Implications for Stress-Testing:

y-cruncher's failure to achieve a decent speedup for AVX512 also means that it is unable to put a heavy load on the AVX512 computation units. Therefore it is not a great stress-test for Skylake X with full AVX512.

But there is one y-cruncher feature which seems to be unaffected - the BBP benchmark.

The BBP benchmark feature is contained entirely in cache and is thus free of the memory bottleneck. It is able to put a much higher stress on the processor than the stress-tester and the computations. So if you run the BBP benchmark (option 4) and set the offset to 100 billion, you can still put a pretty heavy load on your AVX512-capable processor.
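
For readers unfamiliar with it, BBP refers to the Bailey-Borwein-Plouffe formula, which yields hexadecimal digits of Pi at a given offset without computing the earlier ones. The Python sketch below is a textbook floating-point version, not y-cruncher's implementation, and it is only trustworthy for small offsets. It is meant to show that the working state is a handful of scalars, which is consistent with the benchmark staying in cache.

```python
def pi_hex_digit(n):
    """Hex digit of Pi at position n + 1 after the point (toy BBP version)."""
    def series(j):
        # Fractional part of sum over k of 16^(n-k) / (8k + j).
        total = 0.0
        for k in range(n + 1):
            denom = 8 * k + j
            # Modular exponentiation keeps the numerator small.
            total = (total + pow(16, n - k, denom) / denom) % 1.0
        k = n + 1
        while True:
            term = 16.0 ** (n - k) / (8 * k + j)
            if term < 1e-17:
                break
            total += term
            k += 1
        return total % 1.0

    x = (4 * series(1) - 2 * series(4) - series(5) - series(6)) % 1.0
    return int(x * 16)

first_digits = "".join("%X" % pi_hex_digit(n) for n in range(8))
print(first_digits)
```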

A future version of y-cruncher will revamp the stress-tester to incorporate the BBP benchmark as well as other possible improvements.

Lots to take in there, folks.

.vodka · Jul 8, 2017

der8auer's findings get validated once again. I wonder if the phantom throttling comes from the power or thermals side of things.

If it is thermals related, a delidded+CLU'd 7900x should not throttle under stock settings with an adequate heatsink, which would again prove that the bean-counting savings (in the grand scheme of things, with Intel being a multi-billion-dollar company) are harmful.

If it is power related, a stock CPU shouldn't throttle under any load. If it does, it's running out of spec... at stock settings. Unacceptable.

This is quite the clusterf*ck. Again, what the hell do they expect to do with the 12-18 core parts that were announced as a knee jerk reaction to Threadripper??

....Does this also happen to Skylake-X in Xeon form, in server/datacenter conditions?

tamz_msc · Jul 8, 2017

DrMrLordX said:
Okay, taking this back to the issue of heat/power throttling, VRMs, and other fun stuff with Skylake-X:

http://www.numberworld.org/y-cruncher/

Read the July 6th news. For those that don't want to click the link:

Lots to take in there, folks.

That improper implementation of AVX-512 might have negative consequences was an issue raised by Agner Fog in 2013:

AVX-512 is a big step forward - but repeating past mistakes!

Agner Fog said:
I will repeat what I have argued before, that instruction set extensions should be discussed in an open forum before they are implemented. This is the best way to prevent lapses and short-sighted decisions like these ones.

Regarding y-cruncher's findings on AVX-512 being memory bandwidth starved, I've detailed this before, where people in the HPC community found that the cache subsystem can have a huge impact on performance even with AVX2. Performance decreases as soon as you exceed a certain problem size, the threshold for which is pretty low in HPC workloads.

coercitiv · Jul 8, 2017

.vodka said:
If it is power related, a stock CPU shouldn't throttle under any load. If it does, it's running out of spec... at stock settings. Unacceptable.

Wow, watch that throttle!

A CPU throttling at 4 GHz+ while its base speed is 3 GHz+ is not running out of spec.

That being said, depending on its purpose and intensity, this phantom throttling will have some very interesting consequences.

Wyrm · Jul 8, 2017

.vodka said:
der8auer's findings get validated once again. I wonder if the phantom throttling comes from the power or thermals side of things.

If it is thermals related, a delidded+CLU'd 7900x should not throttle under stock settings with an adequate heatsink, which would again prove that the bean-counting savings (in the grand scheme of things, with Intel being a multi-billion-dollar company) are harmful.

If it is power related, a stock CPU shouldn't throttle under any load. If it does, it's running out of spec... at stock settings. Unacceptable.

This is quite the clusterf*ck. Again, what the hell do they expect to do with the 12-18 core parts that were announced as a knee jerk reaction to Threadripper??

....Does this also happen to Skylake-X in Xeon form, in server/datacenter conditions?

I suspect the power and throttling are caused by the FPU and by the memory bandwidth bottleneck. Der8auer posted this set of data that clearly shows power dependency on data size:

There is a significant drop off between FFT 96K (768KB) and FFT 128K (1024KB). Notice that i9 7900X has L2 cache of 1 MB/core. So FFT 128K has a small spill into shared L3 cache. While the CPU is waiting on the data from L3, FPU stalls with no logic switching (no-op) and there is no dynamic power dissipated.

For shorter FFT the entire data set is in core's L1/data or L2 cache, so FPU will be running full speed dissipating power. That would generally cause power throttling but der8auer removed the power limit in his BIOS.
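
The cache argument can be made concrete with a small Python sketch. It assumes 8 bytes per FFT element, which is what the 96K = 768KB figure implies, and it ignores everything else competing for L2 (code, tables, other buffers). The helper name is illustrative.

```python
L2_BYTES_PER_CORE = 1024 * 1024   # 1 MB/core on the i9 7900X, per the post
BYTES_PER_ELEMENT = 8             # assumption: one double per FFT element

def l2_occupancy(fft_size_k):
    """Fraction of one core's L2 taken by an FFT of fft_size_k * 1024 elements."""
    return fft_size_k * 1024 * BYTES_PER_ELEMENT / L2_BYTES_PER_CORE

for size_k in (96, 128):
    print(f"FFT {size_k}K uses {l2_occupancy(size_k):.0%} of L2")
```

At 128K the data alone fills L2 with no headroom, which fits the "small spill" into L3 described above.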

Many other applications will be gated by the memory bandwidth. DDR4 at 4 ch reaches 80 GB/s:
http://techreport.com/review/32111/intel-core-i9-7900x-cpu-reviewed-part-one/4

If I'm reading the benchmarking data correctly, the double-precision floating point (DPFP) throughput for 7900X is reported to be 223 GFLOPs, and with FMA/AVX512 the DPFP throughput is 1140 GFLOPs:
http://www.sisoftware.eu/2017/06/23/intel-core-i9-skl-x-review-and-benchmarks-cpu-avx512-is-here/

The DPFP throughput is way higher than the memory bandwidth, so the CPU can't fetch large data sets fast enough to prevent the FPU pipeline from stalling. That stalling causes significant fluctuations between applications (benchmarks) and data sets (fits in cache or not).
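
As a rough check on that mismatch, the two quoted figures can be turned into a required arithmetic intensity. This Python sketch assumes the 80 GB/s and 1140 GFLOPs numbers above are both achievable at once and that the data is double precision; it is a roofline-style estimate, not a measurement.

```python
MEM_BANDWIDTH_BYTES = 80e9    # quad-channel DDR4, as quoted above
PEAK_DP_FLOPS = 1140e9        # FMA/AVX512 figure, as quoted above
BYTES_PER_DOUBLE = 8

# How many doubles per second DRAM can deliver at the quoted bandwidth.
doubles_per_second = MEM_BANDWIDTH_BYTES / BYTES_PER_DOUBLE

# Operations the FPU must find per streamed double to stay busy.
flops_per_double = PEAK_DP_FLOPS / doubles_per_second

print(f"FLOPs needed per double loaded from memory: {flops_per_double:.0f}")
```

Under these assumptions, a kernel that performs fewer operations than that per value streamed from DRAM would leave the FPU waiting.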

AVX512 is more complicated to analyze because it depends on software. Standard object oriented code requires packing/unpacking instructions and wastes a lot of bandwidth. Activating AVX512 on such a code will lead to the CPU waiting on memory. Contiguous arrays that fit into L2 cache will overload AVX and lead to high power dissipation.

Assuming I haven't messed up something with my back of the envelope calculations, customers affected by the high power use will be the ones with:
* higher bandwidth memory with elevated power limits
* high compute demand vs smaller data sets (e.g. exp/log, convolution, fft, multiple kernels on small subimages, statistics and probability modeling, etc).

That said, it is likely that high power usage and power fluctuations are unavoidable with a high DPFP throughput unless there is a new silicon technology that can make more efficient transistor switching.

tamz_msc · Jul 8, 2017

Wyrm said:
I suspect the power and throttling are caused by the FPU and by the memory bandwidth bottleneck. Der8auer posted this set of data that clearly shows power dependency on data size:

There is a significant drop off between FFT 96K (768KB) and FFT 128K (1024KB). Notice that i9 7900X has L2 cache of 1 MB/core. So FFT 128K has a small spill into shared L3 cache. While the CPU is waiting on the data from L3, FPU stalls with no logic switching (no-op) and there is no dynamic power dissipated.

For shorter FFT the entire data set is in core's L1/data or L2 cache, so FPU will be running full speed dissipating power. That would generally cause power throttling but der8auer removed the power limit in his BIOS.

Many other applications will be gated by the memory bandwidth. DDR4 at 4 ch reaches 80 GB/s:
http://techreport.com/review/32111/intel-core-i9-7900x-cpu-reviewed-part-one/4

If I'm reading the benchmarking data correctly, the double-precision floating point (DPFP) throughput for 7900X is reported to be 223 GFLOPs, and with FMA/AVX512 the DPFP throughput is 1140 GFLOPs:
http://www.sisoftware.eu/2017/06/23/intel-core-i9-skl-x-review-and-benchmarks-cpu-avx512-is-here/

The DPFP throughput is way higher than the memory bandwidth, so the CPU can't fetch large data sets fast enough to prevent the FPU pipeline from stalling. That stalling causes significant fluctuations between applications (benchmarks) and data sets (fits in cache or not).

AVX512 is more complicated to analyze because it depends on software. Standard object oriented code requires packing/unpacking instructions and wastes a lot of bandwidth. Activating AVX512 on such a code will lead to the CPU waiting on memory. Contiguous arrays that fit into L2 cache will overload AVX and lead to high power dissipation.

Assuming I haven't messed up something with my back of the envelope calculations, customers affected by the high power use will be the ones with:
* higher bandwidth memory with elevated power limits
* high compute demand vs smaller data sets (e.g. exp/log, convolution, fft, multiple kernels on small subimages, statistics and probability modeling, etc).

That said, it is likely that high power usage and power fluctuations are unavoidable with a high DPFP throughput unless there is a new silicon technology that can make more efficient transistor switching.

Plus this will have further implications when L2 is spilled by ever larger problem sizes, and the lower performing L3 wouldn't help. In scientific workloads, the data sets are primarily constrained by the problem under consideration. Now it'll be constrained by hardware, if AVX-512 is intended to be utilized.

Wyrm · Jul 8, 2017

tamz_msc said:
Plus this will have further implications when L2 is spilled by ever larger problem sizes, and the lower performing L3 wouldn't help. In scientific workloads, the data sets are primarily constrained by the problem under consideration. Now it'll be constrained by hardware, if AVX-512 is intended to be utilized.

I agree with you that the usefulness of AVX-512 looks rather questionable when memory bandwidth is taken into account. In my prior experience with the HPC guys, their large problems have always been limited by memory bandwidth and the node interconnect technology. That's why supercomputing companies spend more resources on the interconnect than on the CPU performance of individual nodes. To be fair to both Skylake-X and the upcoming Threadripper though, neither is marketed as an HPC processor. Perhaps Intel is expecting some new higher bandwidth memory to hit the HEDT market soon?

NTMBK · Jul 8, 2017

DrMrLordX said:
That's actually really bad for Intel. Xeon-D is one of the products keeping ARM out of the server room. Delays in updating the platform are just bad, bad, bad. I don't like to see that at all. Maybe ARM in the server room isn't so bad, but in the consumer space, it is nothing but cancer; or rather, the software ecosystems that come with it are cancerous. Personally I do not want anyone or anything encouraging ARM to get out of the tablet and phone sector.

ARM servers are just Linux servers that happen to run on ARM. It's not like they run Android and only use an app store.

DrMrLordX · Jul 8, 2017

NTMBK said:
ARM servers are just Linux servers that happen to run on ARM. It's not like they run Android and only use an app store.

It's the first step on a slippery slope . . . but let's not get too far off-topic.

StefanR5R · Jul 8, 2017

TheF34RChannel said:
OC3D together with Der8auer:
[...]

Timmah! said:
So does this mean, i actually can overclock 7900x to say 4,5GHz and use it for regular gaming, 3D rendering, Cinebench, etc... simply usual stuff, and it will be fine, even possibly cooled by 240/280mm AIO? And all the throttling, 90+C temps on CPU, 100+C temps on VRMs and ridiculous power-draw would concern me only if i tried to stress test with the Prime?

TheF34RChannel said:
Yes.

No.

The VRMs won't heat up as much, but the CPU will still get very hot due to (a) heat spots, (b) bad conductivity of the thermal interface between die and the so-called heat spreader, (c) still very high power to be dissipated. And it may even throttle occasionally.

For gaming with a 4.5 GHz CPU, use the mainstream platform, not X299 whose CPUs are derived from the scalable but higher latency server CPU design.

For rendering, don't use a 4.5 GHz CPU, use CPUs at 3.5 GHz or less. Or at least configure the CPU to run at 4.5 GHz only under lightly threaded loads, and keep it at stock clocks under multithreaded loads.

TheGiant · Jul 8, 2017

Wyrm said:
For shorter FFT the entire data set is in core's L1/data or L2 cache, so FPU will be running full speed dissipating power. That would generally cause power throttling but der8auer removed the power limit in his BIOS.

Very interesting. I am doing some CFD with chemical reaction modelling, using commercial and my own Fortran-based programs. Do you think SKL-X, when the code is compiled properly for it, will have the best perf/watt ratio?

tamz_msc · Jul 8, 2017

TheGiant said:
Very interesting. I am doing some CFD with chemical reaction modelling, using commercial and my own Fortran-based programs. Do you think SKL-X, when the code is compiled properly for it, will have the best perf/watt ratio?

CFD you say? You're most likely to be memory latency or bandwidth bound, or even both.

aigomorla · Jul 8, 2017

StefanR5R said:
For gaming with a 4.5 GHz CPU, use the mainstream platform, not X299 whose CPUs are derived from the scalable but higher latency server CPU design.

For rendering, don't use a 4.5 GHz CPU, use CPUs at 3.5 GHz or less. Or at least configure the CPU to run at 4.5 GHz only under lightly threaded loads, and keep it at stock clocks under multithreaded loads.

I'm fairly sure most of the 7900X and 7920X chips that get overclocked will be under custom mid-tier (NOT AIO) watercooling, so heat issues will probably not even bother us. In fact, we'll welcome it, as most of us now like having big blingy radiators which can handle all that heat.

I for one intend to wait and see which boards EK makes a monoblock (MOSFET + CPU) for, then go that route and not have to worry about anything overheating.

However, I'm still waiting for the 7920X or even the 7960X, as I am a member of the "MOAR CORES" association, which states you need to have more cores than the fingers on both hands.

Ajay · Jul 8, 2017

@aigomorla I think the 7960X will be very limited by power and temps for any useful amount of overclocking. The 7920X might be interesting as a challenge in terms of overclocking and performance - with VRMs and CPU under custom H2O (or better!).

NTMBK · Jul 8, 2017

aigomorla said:
However, I'm still waiting for the 7920X or even the 7960X, as I am a member of the "MOAR CORES" association, which states you need to have more cores than the fingers on both hands.

Threadripper says hi.

StefanR5R · Jul 8, 2017

@aigomorla, I responded to a question whether or not workloads like rendering on 10 cores @ 4.5 GHz will involve (a) throttling, (b) 90+C temps on CPU, (c) 100+C temps on VRMs, (d) ridiculous power-draw.

The workload will avoid (c). The large radiators that you mention will avoid (a). (b) likely remains but could be avoided with direct die cooling. (d) remains.
