AMD Kaveri OC On Planet Neptune

ShintaiDK · Dec 30, 2013

inf64 said:
PS: It takes a 4.2GHz PD to match a 3.7GHz SR in C15. The 7850K runs C15 at 3.7GHz; it doesn't turbo up in this workload (at least not in the run that you see in the image above).

So a 7850K might roughly match a 6800K in CPU performance.

inf64 · Dec 30, 2013

ShintaiDK said:
So a 7850K might roughly match a 6800K in CPU performance.

Yeah, in some workloads it can significantly outperform it, but it's almost like the Haswell story in that regard (if we select a special workload, it rocks).
The worse part is that Richland really clocks much higher due to a better process.

ShintaiDK · Dec 30, 2013

inf64 said:
Yeah, in some workloads it can significantly outperform it, but it's almost like the Haswell story in that regard (if we select a special workload, it rocks).
The worse part is that Richland really clocks much higher due to a better process.

What select workloads would those be? Steamroller offers nothing that would be significantly faster on paper, and Cinebench is already showing this. No AVX2, no 256-bit caches. Kaveri so far seems to offer a 3% performance boost over the previous generation. It's the speedracer design going backwards.

I wonder if the IGP part will have the same problems in delivering...

inf64 · Dec 30, 2013

Well, you will have to wait for the launch. One of those (or a subset of it) will show some big gains, but like I said, it's all select cherry-picked cases.
PS: Where do you see a "3% boost"? Do you mean the overall performance increase over Richland, or what?

ShintaiDK · Dec 30, 2013

inf64 said:
Well, you will have to wait for the launch. One of those (or a subset of it) will show some big gains, but like I said, it's all select cherry-picked cases.
PS: Where do you see a "3% boost"? Do you mean the overall performance increase over Richland, or what?

Just used the optimistic numbers between top parts. But that doesn't seem to hold either. Even Trinity can get 314 points in R15 at 4GHz. So Kaveri might be 3% slower, model vs model.

It seems this is another case where expectations and real world are miles apart.

inf64 · Dec 30, 2013

First of all, Kaveri runs the C15 benchmark at a fixed 3.7GHz clock (don't ask) with DDR3-1600 RAM, and scores ~311pts. For comparison, I get 283pts at a fixed 3.7GHz clock on my QC PD. Do the math. It's always wrong to use best/worst-case benchmarks from "stock" parts found on internet forums/reviews, as the clock fluctuates a lot. It's best to use a manually set clock like I did.
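Doing the math on the two fixed-clock scores above can be sketched as follows. This is a minimal Python sketch: the helper name `same_clock_gain` is made up for illustration, and it assumes both runs really were held at the same 3.7GHz clock, so that the score ratio can be read as a per-clock gain.

```python
# Cinebench R15 multi-thread scores quoted in the post, both at a fixed 3.7GHz.
KAVERI_MT = 311      # ~311pts, Kaveri (Steamroller)
PILEDRIVER_MT = 283  # 283pts, quad-core Piledriver


def same_clock_gain(new_score: float, old_score: float) -> float:
    """Relative gain of new_score over old_score (0.10 means 10% faster).

    Hypothetical helper: only meaningful when both scores were taken
    at the same clock, which is assumed here.
    """
    return new_score / old_score - 1.0


gain = same_clock_gain(KAVERI_MT, PILEDRIVER_MT)
print(f"Same-clock MT gain: {gain:.1%}")
```

With these two scores the same-clock MT gain works out to a little under 10%.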

ShintaiDK · Dec 30, 2013

Stock 6800K gets 326 points.

Now tell me how much you got in single-threaded, and let's see how far it falls short of the fabled ST performance benefits we were told about. The 30% IPC is obviously long gone.

PPB · Dec 30, 2013

Homeles said:
The point of these benchmarks is to emulate real world workloads.

CB doesn't represent a real workload. It should represent rendering performance, but it doesn't. What was a measly single-digit IPC increase in CB from IB to Haswell was more like a 20%+ gain in a real-world renderer like V-Ray. What seems like a regression for the BD/PD uarchs in CB score compared to K10 is actually a gain in performance in a real-world renderer like V-Ray.

Homeles said:
Isn't Cinebench integer-heavy?

It's FP heavy. That's what makes it gibberish, as real-world renders have a lot of integer code mixed in too. Heck, most of the code in real-world workloads tends to be integer heavy (and that's AMD's argument for trimming down the FPU and focusing on integer performance increases).

PS: The usual FUDers will hold on to this to say AMD missed expectations, but the 30% part holds true for ST integer perf, at least by the leaks shown. What need is there for pure, parallelizable FP perf in the CPU if most respectable software has already moved on to using the GPU for that? ICC-compiled CB is just trash.

inf64 · Dec 30, 2013

ShintaiDK said:
Stock 6800K gets 326 points.

Now tell me how much you got in single-threaded, and let's see how far it falls short of the fabled ST performance benefits we were told about. The 30% IPC is obviously long gone.

You cannot apply all improvements to a benchmark like Cinebench, and you know it. How much did Haswell gain over IB, and how much did IB gain over SB? C15 is just not the best example to show all that is improved in a core. There are other benchmarks, and if properly "tuned", Kaveri can be solidly faster; that's why I said it's sort of a Haswell-like product release.

BTW, the 6800K runs this benchmark at ~4.2GHz, just FYI.
I get 93pts in the ST test and 328pts in the MT test at a fixed 4.2GHz. Scaling is 3.55x.

ShintaiDK · Dec 30, 2013

PPB said:
CB doesn't represent a real workload. It should represent rendering performance, but it doesn't. What was a measly single-digit IPC increase in CB from IB to Haswell was more like a 20%+ gain in a real-world renderer like V-Ray. What seems like a regression for the BD/PD uarchs in CB score compared to K10 is actually a gain in performance in a real-world renderer like V-Ray.

It's FP heavy. That's what makes it gibberish, as real-world renders have a lot of integer code mixed in too. Heck, most of the code in real-world workloads tends to be integer heavy (and that's AMD's argument for trimming down the FPU and focusing on integer performance increases).

Cinebench is INT heavy. Also, the instruction set support in Cinebench is very bad: no SSSE3, SSE4, AVX etc. support, which is something BD/PD/SR depends on. It's easy to see the INT dependency in Cinebench when you compare 4M/4T and 4M/8T runs.

inf64 · Dec 30, 2013

I forgot to add that if you take a look at the MT and ST portions of that run, Kaveri runs the ST test at 4GHz and scores 88pts, while at 3.7GHz it gets 311pts in MT. This means that the actual MT scaling, when you adjust for the clock disparity, is 3.82x, an improvement over the 3.55x that PD gets. Now it's ~5% off the "ideal" 4x mark that a "true" QC gets.
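The clock adjustment described above can be sketched like this. It is a minimal Python sketch, not a definitive method: the function name is hypothetical, and it assumes the ST score scales linearly with clock speed, which is only an approximation. The PD figures (93pts ST, 328pts MT at a fixed 4.2GHz) are the ones quoted earlier in the thread.

```python
def clock_adjusted_scaling(mt_score, mt_clock_ghz, st_score, st_clock_ghz):
    """MT/ST scaling factor after rescaling the ST score to the MT clock.

    Assumption (not from the benchmark itself): the ST score scales
    linearly with clock speed.
    """
    st_at_mt_clock = st_score * mt_clock_ghz / st_clock_ghz
    return mt_score / st_at_mt_clock


# Kaveri: 311pts MT at 3.7GHz, 88pts ST at 4.0GHz.
kaveri = clock_adjusted_scaling(311, 3.7, 88, 4.0)
# PD: 328pts MT and 93pts ST, both at a fixed 4.2GHz.
piledriver = clock_adjusted_scaling(328, 4.2, 93, 4.2)

print(f"Kaveri scaling: {kaveri:.2f}x")
print(f"PD scaling:     {piledriver:.2f}x")
print(f"Kaveri shortfall from ideal 4x: {1 - kaveri / 4:.1%}")
```

Kaveri comes out at about 3.82x. Plain division of the PD scores gives roughly 3.53x, slightly below the rounded 3.55x figure quoted.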

PPB · Dec 30, 2013

ShintaiDK said:
Cinebench is INT heavy. Also, the instruction set support in Cinebench is very bad: no SSSE3, SSE4, AVX etc. support, which is something BD/PD/SR depends on. It's easy to see the INT dependency in Cinebench when you compare 4M/4T and 4M/8T runs.

CB uses AVX for Intel and SSE2 for AMD, and it's FP heavy. This is per Agner Fog's analysis. Who should I trust? A respectable source, or some random FUDer known to twist reality to fit his agenda? How's that "never will be APUs on next gen consoles" going for you? :awe:

ShintaiDK · Dec 30, 2013

inf64 said:
You cannot apply all improvements to a benchmark like Cinebench, and you know it. How much did Haswell gain over IB, and how much did IB gain over SB? C15 is just not the best example to show all that is improved in a core. There are other benchmarks, and if properly "tuned", Kaveri can be solidly faster; that's why I said it's sort of a Haswell-like product release.

BTW, the 6800K runs this benchmark at ~4.2GHz, just FYI.
I get 93pts in the ST test and 328pts in the MT test at a fixed 4.2GHz. Scaling is 3.55x.

Looks fine to me.
http://www.anandtech.com/show/7003/the-haswell-review-intel-core-i74770k-i54560k-tested/6

Again, SR offers nothing new on the instruction front, or 256-bit caches for that matter. So where again is all this hidden SR core performance? With Haswell it's in AVX2/256-bit caches.

ShintaiDK · Dec 30, 2013

PPB said:
CB uses AVX for Intel and SSE2 for AMD, and it's FP heavy. This is per Agner Fog's analysis. Who should I trust?

Cinebench does not use AVX. SSE2 is the highest instruction set it uses.
http://www.hardwareluxx.com/index.php/news/software/benchmarks/28060-cinebench-r15-released.html

The new version runs only on 64bit systems, but still provides no support for current instruction sets such as Intel's AVX. In turn, up to 256 threads are now supported, which will probably be sufficient for the next few years. Requirements for the benchmark are a 64bit CPU (AMD Athlon 64 or Intel Pentium 4 Prescott core) and, consequently, a 64bit operating system (Windows Vista, 7, or 8 or Mac OS X version 10.6.8). In addition, a video card with OpenGL 2.6 support should be present for the graphics tests.

Durp · Dec 30, 2013

Is that leak from a trustworthy source? I hope it's not accurate, because those numbers disappoint me and I was very excited to build a Kaveri system. Isn't this still worse than Nehalem from 2008, IPC-wise?

inf64 · Dec 30, 2013

ShintaiDK said:
Looks fine to me.
http://www.anandtech.com/show/7003/the-haswell-review-intel-core-i74770k-i54560k-tested/6

Again, SR offers nothing new on the instruction front, or 256-bit caches for that matter. So where again is all this hidden SR core performance? With Haswell it's in AVX2/256-bit caches.

Well, you will have to wait 2 more weeks to see the cherry-picked benchmarks, I guess.
SR is ~8% faster in overall C15 score (at the same clock) than PD.

ShintaiDK · Dec 30, 2013

inf64 said:
I forgot to add that if you take a look at the MT and ST portions of that run, Kaveri runs the ST test at 4GHz and scores 88pts, while at 3.7GHz it gets 311pts in MT. This means that the actual MT scaling, when you adjust for the clock disparity, is 3.82x, an improvement over the 3.55x that PD gets. Now it's ~5% off the "ideal" 4x mark that a "true" QC gets.

88 points at 4GHz ST?

A10-5800K: 297 MT / 86 ST / 3.46x scaling

Hardly any ST improvement for SR. That IGP part had better be fantastic, or it's a chip going nowhere.

inf64 · Dec 30, 2013

ShintaiDK said:
88 points at 4GHz ST?

A10-5800K: 297 MT / 86 ST / 3.46x scaling

Hardly any ST improvement for SR. That IGP part had better be fantastic, or it's a chip going nowhere.

Well yes, in C15 the single-thread score doesn't seem to be much higher. It's just one benchmark, though; there will be more showing similar results, but also some showing bigger gains. It's almost exactly a copy of the IB->Haswell launch. This time the clock has regressed a bit, which is a shame.

Vesku · Dec 30, 2013

Not that anyone was expecting it to be good, but Global Foundries 28nm is looking kind of sad. We'll have to wait for retail units to hit to see exactly how "meh".

No wonder AMD isn't spending any money to put out a 28nm many-core server/performance chip. Shame for me though; with Steamroller it seems that they will finally meet the low-end expectations of the original Bulldozer. It would have been nice to retire my AMD machine with a 4-6 module Steamroller so I could say "see, AMD CAN deliver ~8 Nehalem cores @ mid to high 2GHz levels of performance. Just 2 or 3 years after it would be relevant".

Is AMD still committed to moving all production to Global Foundries? If so management must be losing sleep over being stuck on GF 28nm through most if not all of 2015.

mikk · Dec 30, 2013

inf64 said:
I forgot to add that if you take a look at the MT and ST portions of that run, Kaveri runs the ST test at 4GHz and scores 88pts, while at 3.7GHz it gets 311pts in MT. This means that the actual MT scaling, when you adjust for the clock disparity, is 3.82x, an improvement over the 3.55x that PD gets. Now it's ~5% off the "ideal" 4x mark that a "true" QC gets.

I thought you claimed Turbo was off in this R15 score from Kaveri.

PPB said:
CB uses AVX for Intel and SSE2 for AMD, and it's FP heavy.

When I disabled AVX and FMA on my Intel, it didn't make any difference to the score. Also, I read not long ago that Cinebench is more integer-heavy than expected. Maybe someone knows what I mean and can link it.

inf64 · Dec 30, 2013

mikk said:
I thought you claimed Turbo was off in this R15 score from Kaveri.

Yes, in the MT portion, as there is no power budget left. The overall score is what basically counts (and it ran that at base clock). Shintai wanted to know how the ST score compares to PD. Also, comparing the ST and MT scores, we can now see that multi-core scaling actually improved 7-8% vs PD, which makes all the difference in "per clock" results.

PPB · Dec 30, 2013

About CB using ICC:

The images show the compiler being selective when it comes to the CPU vendor string. Wonder why that would be.....?

Credit goes to OCN user sdlvx.

Don't trust the source? Let's see what Anand thinks about this (in the Interlagos review):

"Before we look at the Multi-threaded benchmark, Andreas Stiller, the legendary German C't Journalist ("Processor Whispers") sent me this comment:

'You should be aware that Cinebench 11.5 is using Intel openMP (libguide40.dll), which does not support AMD-NUMA'

So while Cinebench is a valid bench as quite a few people use the Intel OpenMP libraries, it is not representative of all render engines. In fact, Cinebench probably only represent the smaller part of the market that uses the Intel OpenMP API. On dual CPU systems, the Opteron machines run a bit slower than they should; on quad CPU systems, this lack of "AMD NUMA" awareness will have a larger impact.

Cinebench R11.5 MT

We did not expect that the latest Opteron would outperform the previous one by a large margin. Cinebench is limited by SSE processing power. The ICC 11.0 compiler was the fastest compiler of its time for SSE/FP intensive software, even for the Opterons (up to 24% faster than the competing compilers), but it has no knowledge of newer architectures. And of course, the intel compiler does favor the Xeons."

2 notes on this review:

1. Note how Anand always discusses FP performance when talking about CB. Wonder why that would be?
2. Anand mentions the usage of the ICC 11.0 on this test, but omits the usage of AVX. Let's see what truth there is to your claim of "CB doesnt use AVX"

ICC being unfair by making different paths and using different instructions depending on vendor string: LINK. Credit goes to Agner Fog's findings on the subject.

As you can see, ICC 11 already makes the distinction between SSE2 for a non-Intel vendor string and AVX for Intel processors.

So now let's build it back up so that anyone, even the usual suspects, can understand it:

If CB uses ICC (version 11 specifically) and ICC 11 uses AVX for Intel processors, CB will run under AVX if an Intel processor is detected. In AMD's case, it will run under SSE2.

CB is FP heavy, from the moment we are discussing SSE2 and AVX.

You can spin it any way you want and keep citing doubtful sources; I'm even citing the very owner of this site as my source of information.
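For anyone unsure what "different paths depending on vendor string" would mean in practice, here is a purely illustrative Python sketch of the two behaviours being argued about. It is not taken from ICC or Cinebench: `pick_code_path`, its parameters, and the path names are hypothetical, and which behaviour Cinebench actually has is exactly what is disputed in this thread.

```python
def pick_code_path(vendor: str, features: set, vendor_gated: bool) -> str:
    """Choose a SIMD code path for a CPU (hypothetical dispatcher).

    vendor       -- CPUID vendor string, e.g. "GenuineIntel" or "AuthenticAMD"
    features     -- instruction sets the CPU reports, e.g. {"SSE2", "AVX"}
    vendor_gated -- True: non-Intel CPUs are held to the SSE2 baseline
                    regardless of features; False: decide on features only
    """
    if "SSE2" not in features:
        raise ValueError("SSE2 is the assumed minimum requirement")
    if vendor_gated and vendor != "GenuineIntel":
        return "SSE2"
    return "AVX" if "AVX" in features else "SSE2"


amd = {"SSE2", "AVX"}  # e.g. a Piledriver-class CPU, which supports AVX
print(pick_code_path("AuthenticAMD", amd, vendor_gated=True))   # SSE2
print(pick_code_path("AuthenticAMD", amd, vendor_gated=False))  # AVX
```

A binary built with a single SSE2 path and no dispatcher at all would match neither branch: every CPU would simply get SSE2.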

inf64 · Dec 30, 2013

Guys, don't turn this into a compiler discussion. Let's discuss Kaveri and the results...

USER8000 · Dec 30, 2013

Well, I suppose at least GPU compute will get a decent uplift with the new GPU, and hopefully Crossfire will actually work properly, as there are 384 and 512 shader parts.

mikk · Dec 30, 2013

> 2. CineBench 11.5 requires SSE2 compatible cpus. There is no differentiation between Intel or AMD cpus (the
> compilers are set to create SSE2 code without creating jump code for different cpus or cpu vendors).
>
> 3. The CineBench 11.5 Windows version uses ICC (the OS X version GCC 4.2), as these have been the compilers
> creating the fastest code at that time (end of 2009) for these platforms - independent of the cpu vendor.
> To be more specific: With the (SSE2) compiler setting used in CINEMA 4D and CineBench 11.5, the speed advantage
> of ICC over MSVC (roughly 15-20%) has been slightly bigger on AMD cpus than it was on Intel cpus.

http://www.realworldtech.com/forum/?threadid=135978&curpostid=136051

Confirmation from a Cinebench dev that there is no differentiation, which is no surprise since Piledriver works as expected. Some other heavily multithreaded programs like x264 are much worse for it than Cinebench R15: there, an i5 Haswell is close to the FX-8350, while in CB R15 the FX-8350 is much further ahead. In fact, CB R15 is a good benchmark for Piledriver, the MT score at least.
