Coffeelake thread, benchmarks, reviews, input, everything.

tamz_msc · Oct 2, 2017

Zucker2k said:
Superior in what sense? Are you suggesting Intel's SMT implementation is inefficient? Or is this a case of a stronger, more efficient core leaving little resources for smt?

Superior in the sense that when workloads are ideally suited to SMT in general, which means excluding low latency or FPU intensive scenarios, AMD's SMT comes out ahead of Intel's HT more often than not.

hnizdo · Oct 2, 2017

tamz_msc said:
Superior in the sense that when workloads are ideally suited to SMT in general, which means excluding low latency or FPU intensive scenarios, AMD's SMT comes out ahead of Intel's HT more often than not.

Do you have any example for this statement?

tamz_msc · Oct 2, 2017

hnizdo said:
Do you have any example for this statement?

See for yourself the SMT yield in these tests:
https://forums.anandtech.com/threads/ryzen-strictly-technical.2500572/#post-38770120
https://forums.anandtech.com/threads/ryzen-strictly-technical.2500572/#post-38770122

epsilon84 · Oct 2, 2017

I think Cinebench MT is a 'best case' scenario for AMD, as in, it shows its SMT implementation in the best light. Look at upcoming reviews, you will see the 8700K trade blows with the 1800X in most multithreaded apps/benchmarks but the 1800X will be well ahead in CB 15 MT.

That being said, it has been shown that AMD's implementation of SMT *is* slightly superior to Intel's HT, though not by a huge margin; I think it was a few percent. IIRC, averaged out, Intel's HT adds ~25% to MT throughput whereas AMD's SMT adds ~28%. I don't remember where I saw this, so don't ask me for a source; I just remember reading it during the launch of Ryzen.

So AMD does gain on Intel slightly in heavy MT loads, but it's not enough to bring it to performance parity, clock for clock. Otherwise a 1600X would come very close to an 8700/8700K at stock, which is obviously not the case. The 8700K is closer to 1700X/1800X levels of MT performance.
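
As a rough sketch of how small that edge is: taking the recalled ~25% and ~28% figures at face value (they are from memory, not measured here) and assuming equal core count, clocks and single-thread throughput, the relative advantage comes to a couple of percent. The function name is made up for illustration.

```python
def smt_edge(gain_a: float, gain_b: float) -> float:
    """Relative MT advantage of chip A over chip B when only SMT yield differs."""
    return (1 + gain_a) / (1 + gain_b) - 1


# ~28% (AMD SMT) vs ~25% (Intel HT): recalled figures, not measurements
edge = smt_edge(0.28, 0.25)
print(f"{edge:.1%}")  # 2.4%
```

Any single-thread gap larger than that would swamp the SMT difference.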

itsmydamnation · Oct 2, 2017

epsilon84 said:
I think Cinebench MT is a 'best case' scenario for AMD, as in, it shows its SMT implementation in the best light. Look at upcoming reviews, you will see the 8700K trade blows with the 1800X in most multithreaded apps/benchmarks but the 1800X will be well ahead in CB 15 MT.

That being said, it has been shown that AMD's implementation of SMT *is* slightly superior to Intel's HT, though not by a huge margin; I think it was a few percent. IIRC, averaged out, Intel's HT adds ~25% to MT throughput whereas AMD's SMT adds ~28%. I don't remember where I saw this, so don't ask me for a source; I just remember reading it during the launch of Ryzen.

It's mostly down to core execution resources. Both have the same number of load/store ops per cycle, so more often than not that is the limit. When it isn't the limit, AMD's uop throughput and its much more symmetrical pipelines (more likely to be able to schedule an instruction sooner) would be the main reasons for the difference.

Now, if either AMD or Intel channeled IBM and went to a 4x load/store setup, the SMT dynamic would change completely. But in the world of small dies with on-package interconnects, SMT is just an optional thing: it doesn't improve perf/watt (in many cases you go backwards), but it does help with the benchmark wars.

Zucker2k · Oct 2, 2017

epsilon84 said:
I think Cinebench MT is a 'best case' scenario for AMD, as in, it shows its SMT implementation in the best light. Look at upcoming reviews, you will see the 8700K trade blows with the 1800X in most multithreaded apps/benchmarks but the 1800X will be well ahead in CB 15 MT.

That being said, it has been shown that AMD's implementation of SMT *is* slightly superior to Intel's HT, though not by a huge margin; I think it was a few percent. IIRC, averaged out, Intel's HT adds ~25% to MT throughput whereas AMD's SMT adds ~28%. I don't remember where I saw this, so don't ask me for a source; I just remember reading it during the launch of Ryzen.

So AMD does gain on Intel slightly in heavy MT loads, but it's not enough to bring it to performance parity, clock for clock. Otherwise a 1600X would come very close to an 8700/8700K at stock, which is obviously not the case. The 8700K is closer to 1700X/1800X levels of MT performance.

25% of what base score? We can't look at these numbers in isolation. HT is extra/surplus processing. Remember, it's the same core doing the work. If we take a hypothetical SMT score of 25% for Ryzen 1700, 8x25=200, whereas, for Kabylake at 25%, 4x25=100, assuming linear core scaling as in CB. In the above example, this translates into a 50% HT deficit for kabylake.

https://forums.anandtech.com/threads/ryzen-strictly-technical.2500572/#post-38770111

Edited: Changed coffeelake to kabylake - in line with the charts.
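
The arithmetic in this post can be sketched as follows, under its own assumptions: a hypothetical 25% SMT gain per core on both chips and linear core scaling. The 6-core row is included only to show how the figure moves with core count; the function names are illustrative.

```python
def smt_points(cores: int, gain_pct: float = 25.0) -> float:
    """Total SMT contribution in 'percent of one core' units."""
    return cores * gain_pct


def smt_deficit(cores_small: int, cores_big: int, gain_pct: float = 25.0) -> float:
    """Shortfall of the smaller chip's SMT total relative to the bigger chip's."""
    return 1 - smt_points(cores_small, gain_pct) / smt_points(cores_big, gain_pct)


print(smt_points(8), smt_points(4))  # 200.0 100.0
print(smt_deficit(4, 8))             # 0.5  (4-core vs 8-core)
print(smt_deficit(6, 8))             # 0.25 (6-core vs 8-core)
```

Since the per-core gain is the same on both sides, it cancels out, and the result reflects only the core-count ratio.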

tamz_msc · Oct 2, 2017

Zucker2k said:
25% of what base score? We can't look at these numbers in isolation. HT is extra/surplus processing. Remember, it's the same core doing the work. If we take a hypothetical SMT score of 25% for Ryzen 1700, 8x25=200, whereas, for Coffeelake at 25%, 4x25=100%, assuming linear core scaling as in CB. In the above example, this translates into a 50% HT deficit for coffeelake.

That should be 6, and consequently the HT deficit, according to your calculation, should be 25%.

Glo. · Oct 2, 2017

To close your IPC debate:

There is Instructions Per Clock Dispatched (by the scheduler), and Instructions Per Clock Executed.

The first is 100% CPU dependent. It's hardware-level IPC.
The second is a mix of hardware and software performance.

In each and every IPC calculation, you assume everything on the CPU side is 100% hardware dependent. It's not. Otherwise we would not see gains with each software update that brings optimizations for said CPUs.

Know this difference before you start any IPC debate, which is derailing every thread.

Can we get back to the COFFEE LAKE thread?

Zucker2k · Oct 2, 2017

tamz_msc said:
That should be 6, and consequently the HT deficit, according to your calculation, should be 25%.

Should be Kabylake. See charts. So 50% at the same core clock, sku vs sku. See your link.

tamz_msc · Oct 2, 2017

Zucker2k said:
Should be Kabylake. See graph. So 50% at the same core clock, sku vs sku. See your link.

Did you mean Kaby Lake or Coffee Lake? Because if it's the latter then 50% more cores should slice a 50% deficit to 25%.

Zucker2k · Oct 2, 2017

Dropping these here for clarity. @tamz_msc Nice find!

tamz_msc · Oct 2, 2017

Zucker2k said:
Dropping these here for clarity. @tamz_msc Nice find!

Your point being? Without the AVX2 advantage, Kaby Lake->Zen is 12% more IPC. Is that so far from 8-10%?

Zucker2k · Oct 2, 2017

tamz_msc said:
Your point being? Without the AVX2 advantage, Kaby Lake->Zen is 12% more IPC. Is that so far from 8-10%?

Splitting hairs again, I see.

tamz_msc · Oct 2, 2017

Zucker2k said:
Splitting hairs again, I see.

Maybe I should point out how dedicated AES hardware on Zen gives it superior performance in encryption, just like how dedicated FPUs on Skylake give it superior performance in LINPACK.

JoeRambo · Oct 2, 2017

So 12% from IPC and 25% from clock advantage is the final word in ST performance? Sounds fine to me. 40% is not too shabby an advantage.
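
For what it's worth, the 40% comes from compounding the two factors rather than adding them. A minimal sketch, assuming performance scales perfectly with both IPC and clock (a best case):

```python
def combined_uplift(ipc_gain: float, clock_gain: float) -> float:
    """Single-thread uplift when IPC and clock gains multiply."""
    return (1 + ipc_gain) * (1 + clock_gain) - 1


print(f"{combined_uplift(0.12, 0.25):.0%}")  # 40%
```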

tamz_msc · Oct 2, 2017

JoeRambo said:
So 12% from IPC and 25% from clock advantage is the final word in ST performance? Sounds fine to me. 40% (theoretical) is not too shabby an advantage.

Corrected.

Zucker2k · Oct 2, 2017

JoeRambo said:
So 12% from IPC and 25% from clock advantage is the final word in ST performance? Sounds fine to me. 40% is not too shabby an advantage.

Looks like @epsilon84 was on to something, after all!

JoeRambo · Oct 2, 2017

tamz_msc said:
Corrected.

Theoretical as in not using AVX2 apps? Or some other strings attached?

tamz_msc · Oct 2, 2017

JoeRambo said:
Theoretical as in not using AVX2 apps? Or some other strings attached?

Yes, without AVX2 assuming perfect scaling.

Zucker2k · Oct 2, 2017

tamz_msc said:
Yes, without AVX2 assuming perfect scaling.

Thought the 12% was already 'AVX-corrected'? And Linpack is only 1 app out of 20+ tested. Yes, 25% clock scaling is best case, but that applies somewhat to AMD as well. These chips are running at 3.5GHz, ya know.

JoeRambo · Oct 2, 2017

tamz_msc said:
Does that mean that the 2nd CPU is worse for multithreaded performance? Absolutely not. If you believe otherwise then explain why the Platinum 8160 exists when it should in theory just barely beat the Gold 6154 in a Cinebench-like workload with 33% more cores(freq*core count = 67.2 for 8160, 66.6 for 6154)

Xeon Platinums have RAS and 8S capability and should not be compared with Gold stuff. You either need those features or you don't.

Gold is for 1-4S and priced according to perf/features. For example, both Gold 6138 and Gold 6148 have 20 cores, but their turbo ratios have enough differential (~60 vs 66) to justify purchase of the latter.

So obviously you can't compare different lineups, and even blind multiplication of turbo x core does not account for differences in cache and/or wattage.
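
For reference, the "blind multiplication" in question is just cores times all-core turbo. A sketch, where the core counts and clocks (24 x 2.8 GHz and 18 x 3.7 GHz) are assumptions chosen to be consistent with the 67.2 and 66.6 products quoted above. It deliberately ignores cache, power limits and AVX clocks, which is exactly the objection.

```python
def core_ghz(cores: int, all_core_turbo_ghz: float) -> float:
    """Naive throughput proxy: core count times all-core turbo clock."""
    return cores * all_core_turbo_ghz


# Assumed specs, picked to match the products quoted in the thread
platinum_8160 = core_ghz(24, 2.8)
gold_6154 = core_ghz(18, 3.7)

print(f"{platinum_8160:.1f} vs {gold_6154:.1f}")  # 67.2 vs 66.6
print(f"{platinum_8160 / gold_6154 - 1:.1%}")     # 0.9%
```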

P.S. Intel's lineup is straight from marketing hell. We just went through a server purchase and the decision was really muddy compared to the 2690s we kept on buying before. Anandtech's non-AVX turbo ratio table is pure gold.

NTMBK · Oct 2, 2017

This talk all makes no sense to me. Why on earth would AVX2 boost IPC? Surely at best IPC should be the same vs. SSE4, and at worst go down slightly? Each AVX2 instruction can pull in more data, so the odds of a memory stall on each instruction are higher, meaning IPC is liable to go down.

Of course, throughput/clock will be up... but that's not IPC.

tamz_msc · Oct 2, 2017

Zucker2k said:
Thought the 12% was already 'AVX-corrected'? And Linpack is only 1 app out of 20+ tested. Yes, 25% clock scaling is best case, but that applies somewhat to AMD as well. These chips are running at 3.5GHz, ya know.

The Stilt said:
An additional data point was added to the charts to indicate performance with 256-bit workloads excluded (the ones which have actual gains from 256-bit code).
The excluded 256-bit workloads are: Blender, Bullet (IPC only), Embree, Euler3D, Himeno, Linpack, NBody & X265.

JoeRambo · Oct 2, 2017

NTMBK said:
This talk all makes no sense to me. Why on earth would AVX2 boost IPC?

If one of the CPUs executes the same 256-bit vector workload in half the time, and the program's count of instructions to retire is constant, wouldn't the faster CPU have double the IPC?
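
The two views can be put side by side in a small sketch. The instruction and cycle counts are made-up numbers, and the "half as many instructions" case is an idealized assumption about wider vectors, not a measurement.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions retired per clock cycle."""
    return instructions / cycles


# Case 1: identical instruction stream, one CPU needs half the cycles.
slow = ipc(1_000_000, 1_000_000)
fast = ipc(1_000_000, 500_000)

# Case 2: 256-bit code does the same work with half the instructions
# and half the cycles of the 128-bit version.
sse = ipc(1_000_000, 1_000_000)
avx2 = ipc(500_000, 500_000)

print(fast / slow)  # 2.0
print(avx2 / sse)   # 1.0
```

In the first case IPC doubles. In the second, work done per clock doubles while IPC stays flat, which is the throughput/clock versus IPC distinction raised in the quoted post.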

tamz_msc · Oct 2, 2017

JoeRambo said:
Xeon Platinums have RAS and 8S capability and should not be compared with Gold stuff. You either need those features or you don't.

What if I want a 2P 56 core system? Or even a single 28C workstation? Surely Platinums being capable of 8p doesn't mean that that's the only configuration they're capable of running.

JoeRambo said:
So obviously you can't compare different lineups, and even blind multiplication of turbo x core does not account for differences in cache and/or wattage.

So what's happening with speculation around Coffee Lake Cinebench MT scores, and with posts with a similar line of argument?
