looncraz
Senior member
- Sep 12, 2011
- 722
- 1,651
- 136
It's possible the guy you knew was saying something accurate but it was just misinterpreted somehow.
Very possible. Not like any of us were sober
Well, the whole "modern compiled" thing may very well put his whole statement in context. This was only months after the Pentium 4 release
Maybe it is one of those things that was true then, and not even remotely true now?
I remember reading some details about Bulldozer a loong time ago that made me believe a branch misprediction would result in an L1 flush, and the L1 would need to be refilled from the L2 before execution resumes. The problem, as I recall, was how little the integer unit can do at once. This meant, at least to me, that the branch misprediction penalty would, in effect, be constrained by the L2 latency (since a misprediction flushes the pipelines at the same time it initiates an L1 refill, there would only be, maybe, a 1-cycle penalty that was not driven by the L2).
I think this came up in a conversation with JF-AMD, but I really have no idea how to search for that :'(
On fp workloads, I saw more like an 11% increase in IPC from Steamroller to Excavator.
[snip]
Haswell-like performance? I think not.
Nice graphs! It looks like instruction execution changes settled somehow with XV. Then the biggest part of the IPC improvement comes from L1 cache size and other core improvements (as usual I admit).
On fp workloads, I saw more like an 11% increase in IPC from Steamroller to Excavator. The problem is that a lot of the XV numbers going around are coming from cTDP-strangled, throttle-prone Carrizo chips. Check out what happens when you have a cTDP-unlocked developer platform with Carrizo in it:
http://www.overclock.net/t/1560230/jagatreview-hands-on-amd-fx-8800p-carrizo/400_100#post_24310470
Connecting the dots from there isn't too hard. Carrizo @ 3.4 GHz managed an R10 score of 13146. In a MT scenario, that chip is getting ~966.6 CB per thread per GHz. Zen will (allegedly) get ~1353.3 per thread per GHz in the same scenario, assuming a 40% improvement.
Now if you look here:
![]()
You'll see the 6700k getting 40731 @ 4.8 GHz. That's ~1060.7 per thread per GHz. Here we see two interesting facts: First, XV is very close to the 6700k in this old SSE2 benchmark, losing by the margin it does thanks to AMD being unable to clock it very high and deploy more modules. Secondly, a 4c/8t Zen should require only a clockspeed of ~3.8 GHz to match the 4.8 GHz 6700k in R10.
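For anyone who wants to check the arithmetic above, here is a rough sketch in Python. The function names are made up for illustration, and the thread counts (4 for the 2M Carrizo, 8 for the 6700k and for a 4c/8t Zen) and the flat +40% are the assumptions stated in the post, not measurements.

```python
def per_thread_per_ghz(score, threads, ghz):
    """Normalize a multithreaded benchmark score by thread count and clock."""
    return score / (threads * ghz)


def clock_to_match(target_score, threads, rate):
    """Clock (GHz) needed to reach target_score at a given per-thread-per-GHz rate."""
    return target_score / (threads * rate)


# Figures quoted in the post: Carrizo (2M/4T) at 3.4 GHz, 6700k (4c/8t) at 4.8 GHz.
carrizo = per_thread_per_ghz(13146, threads=4, ghz=3.4)
skylake = per_thread_per_ghz(40731, threads=8, ghz=4.8)

# Assumption: Zen gets a flat +40% over XV on this metric.
zen = carrizo * 1.40

# Clock a hypothetical 4c/8t Zen would need to match the 6700k score.
zen_clock = clock_to_match(40731, threads=8, rate=zen)

print(f"Carrizo: {carrizo:.1f}  Skylake: {skylake:.1f}  Zen (assumed): {zen:.1f}")
print(f"Zen clock to match the 6700k: {zen_clock:.2f} GHz")
```

The result is only as good as the +40% assumption and the idea that R10 scales linearly with clock and thread count.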
Haswell-like performance? I think not.
SMT performance is only about 30% better in Cinebench. With 4 cores/4 threads that Skylake would be around 30k, which puts its score above 1500 per thread per GHz.
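A quick sketch of that estimate in Python, assuming the ~30% SMT gain quoted here and the 40731 @ 4.8 GHz score from earlier in the thread. `without_smt` is a hypothetical helper, not anything from a benchmark tool.

```python
def without_smt(score_with_smt, smt_gain):
    """Back out an assumed SMT uplift to estimate the score with SMT disabled."""
    return score_with_smt / (1.0 + smt_gain)


# Assumption: SMT adds ~30% in this benchmark, as stated in the post.
no_smt_score = without_smt(40731, smt_gain=0.30)

# 4 cores / 4 threads at 4.8 GHz.
no_smt_rate = no_smt_score / (4 * 4.8)

print(f"Estimated 4c/4t score: {no_smt_score:.0f}")
print(f"Per thread per GHz: {no_smt_rate:.0f}")
```

Rounding the score down to 30k still leaves the per-thread figure above 1500.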
Broadwell-E will be 3.3 GHz for 8 cores and 3 GHz for 10 cores, all at 140 W.
Anyone in their right mind still believing in 4 GHz Haswell IPC 8C/16T at 95 W with 14 LPP for Zen?
Do we have any confirmation for the 140W figure?
In another forum I asked for FMA usage in applications/games.
FWIW VC 2013's C library uses them for a few things. We found out right away since MS didn't bother checking if AVX was enabled before using them (this didn't get fixed until VC 2015).
DrMrLordX, SKL has an SMT penalty in the score/GHz/thread calculation. XV has only a small one (CMT). That (avg.) 40% number could also apply to single-threaded integer code. We don't know.
The link also shows the Carrizo being locked to 15W TDP, though I'm not sure that was really enforced.
Skylake's SMT penalty is irrelevant. Running at its highest performance level, the chip in question (6700k) has the capability to handle 8 threads. A quad-core Zen would do exactly the same thing, just as a 2M XV can handle 4 threads. If Zen is really going to have 40% higher IPC than XV, then that's going to be after we take SMT penalties into account (which is something Zen will face, since it will feature SMT). If not, then AMD is simply not telling the truth about Zen.
I doubt it'll be that much of an improvement in int-heavy code. There will still be some improvement thanks to faster cache and (probably) shorter pipeline. Most of the gains will be in floating-point code.
Now we're just playing with numbers. The only way we can objectively compare Construction core modules and SMT-capable cores is to push them to their maximum thread counts and see where we lie. Yes, Intel does sell CPUs that simply don't utilize SMT for anything, but as you clearly stated, they're slower without SMT in R10 and many other benchmarks.
Sure, you can try to eliminate SMT from the equation to make Skylake look like it's getting better performance per thread per GHz, but since you have eliminated SMT, the maximum thread count for Skylake drops by half, hurting overall performance. Why do you think Carrizo is a performance loser against Skylake today in a benchmark like R10? Simple: thread counts and clockspeeds.
Except you calculated that a 3.8 GHz Zen will equal a 4.8 GHz Skylake because Skylake's SMT penalty is much bigger than Carrizo's CMT penalty...
If you scale up Carrizo's score: 13146 + 40% IPC + 30% SMT + 26.3% clock = 30218, and even if the CMT penalty isn't part of that 40%: 30218 + 10% = 33240. Zen would be 20% behind.
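The chain of percentages is just compounding multipliers, which can be sketched in Python as follows. `scale` is a hypothetical helper, and every gain is the poster's assumption, not a measured value.

```python
import math


def scale(base, *gains):
    """Apply a series of fractional gains multiplicatively to a base score."""
    return base * math.prod(1.0 + g for g in gains)


# Carrizo's R10 score with the assumed +40% IPC, +30% SMT, +26.3% clock.
zen_estimate = scale(13146, 0.40, 0.30, 0.263)

# Optionally add back an assumed 10% CMT penalty.
zen_with_cmt = scale(zen_estimate, 0.10)

# Shortfall against the 6700k's 40731.
deficit = 1.0 - zen_with_cmt / 40731

print(f"Zen estimate: {zen_estimate:.0f}")
print(f"With CMT penalty added back: {zen_with_cmt:.0f}")
print(f"Behind the 6700k by: {deficit:.1%}")
```

With these inputs the shortfall comes out a little under the 20% quoted, but it is in the same ballpark.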
Summary/tl;dr: 1 Zen core + SMT is going to have to manage +40% performance at the same clockspeed vs. 1 XV module in order to fulfill AMD's claim of +40% IPC. They made good (and then some) on their IPC claim moving from SR to XV. Can they do it with Zen? We'll find out next year.
Do you have a source for the claim that AMD is promising one Zen core will produce 40% higher IPC than one XV module? That seems highly suspect, as it would imply that an 8 core Zen chip at the same clocks would get a score 5.6X higher in MT benchmarks than the two module Carrizo chips.
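The 5.6X figure follows from treating one Zen core as 1.4 XV modules at the same clock. A minimal sketch in Python, assuming perfect scaling across cores and modules (the function name is made up for illustration):

```python
def mt_ratio(new_units, old_units, per_unit_gain):
    """Multithreaded score ratio at equal clocks, assuming perfect scaling."""
    return (new_units / old_units) * (1.0 + per_unit_gain)


# 8 Zen cores vs. 2 XV modules, each Zen core assumed 40% faster than one module.
ratio = mt_ratio(new_units=8, old_units=2, per_unit_gain=0.40)

print(f"Implied MT speedup: {ratio:.1f}x")
```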
The actual claim is that a single Zen core {1C|2T} has 40% higher IPC than a single XV core {1C|1T}.
Anyway, application performance doesn't scale with IPC.
How not? IPC is supposed to refer to some average measurement of real-world applications' work per MHz.
Because the CPU is just one part of the whole? If you were to scale up the rest of the gang too (buses, memory, etc.), that would be another story.
