Yes, every ZEN core has 2x the FP and L/S resources of an EX core, but it remains to be seen how AMD's SMT will work. If it just cuts the core down into two equal threads, then one ZEN core will be the equivalent of an EX module, minus the module penalty and plus some other improvements.
The single core speed (when running a single thread) will stay pretty much the same while the ZEN core (module) running two threads will have an ~40% improvement.
(Presto chango: a 40% IPC improvement per "core", as long as you run multiple threads.)
A Zen core has about 1.3x the FP resources of an XV core with ST code. L/S resources are roughly the same, but likely more efficient.
AMD's SMT (a first try has been visible in the FPUs since BD) doesn't work via back-end resource partitioning. And your example with single-core vs. SMT IPC improvement backfires, as such good scaling wouldn't be possible with hardware resources that don't already increase ST performance significantly. If there are not enough FUs (say 2 ALUs + 2 AGUs), SMT scaling would be <10%, because SMT is not just about better utilization. There are also a lot of shared resources. This is also supported by their statement:
more execution resources benefit both modes
If some discussion in the heat of arguments somehow gets disconnected from known statements, it's time to look at them again.
What I see is an IPC improvement of a "Zen core" over an "Excavator core" on the original slide. Plus a footnote:
3. Based on internal AMD estimates for “Zen” x86 CPU core compared to “Excavator” x86 CPU core.
Nothing else. And with what I've seen so far, 40% only with SMT makes no sense. If this is not the case, then I'm open to sound explanations of why reality differs from what I'm seeing.
One more observation regarding Blender. The SMT yield in Blender appears to be unusually high. In similar applications, such as Cinebench, the yield is around 27% on Haswell-E. In Blender the yield is > 59%. The Blender BMW benchmark (at default resolution, 20x20 tiles) completed in 127.98 seconds with 18C/18T, while with SMT enabled the time dropped to 90.07 seconds.
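For anyone who wants to redo the arithmetic, here is a minimal Python sketch of how an SMT yield can be derived from two render times. The function names are my own, and treating "yield" as the throughput gain (time without SMT divided by time with SMT, minus one) is an assumption; the post does not say which definition was used. With the two Blender times quoted above, this definition gives roughly 42%, so the > 59% figure presumably rests on a different definition or on different runs.

```python
def smt_yield_pct(t_smt_off: float, t_smt_on: float) -> float:
    """Throughput gain from SMT in percent, from wall-clock times in seconds.

    Assumed definition: (t_smt_off / t_smt_on - 1) * 100.
    """
    if t_smt_off <= 0 or t_smt_on <= 0:
        raise ValueError("render times must be positive")
    return (t_smt_off / t_smt_on - 1.0) * 100.0


def time_reduction_pct(t_smt_off: float, t_smt_on: float) -> float:
    """Share of wall-clock time saved by enabling SMT, in percent."""
    if t_smt_off <= 0 or t_smt_on <= 0:
        raise ValueError("render times must be positive")
    return (1.0 - t_smt_on / t_smt_off) * 100.0


# Blender BMW times quoted in the post (18C/18T vs. SMT enabled).
blender_off = 127.98
blender_on = 90.07

blender_yield = smt_yield_pct(blender_off, blender_on)
blender_saved = time_reduction_pct(blender_off, blender_on)

print(f"SMT throughput gain: {blender_yield:.1f}%")
print(f"Render time saved:   {blender_saved:.1f}%")
```

The two numbers answer different questions, which is why it matters to state which one a quoted "yield" refers to before comparing it with the Cinebench figure.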
This could only be properly analyzed at code or assembly level, plus some IBS data. If CB's working sets are bigger than Blender's, the latter might just fit the data of 2 threads better into the caches. And if CB code extracts more ILP in one thread, there is less room to fill the gaps with a 2nd thread's FP ops.
As to why AMD might have shown a (can we not do tildes anymore? What is this?) approximate 130% gain in IPC from Vishera/Piledriver in their Summit Ridge benchmark, well, Blender does make use of AVX2 . . . and even though XV's implementation of AVX2 is quite poor, I have observed (through programs like y-cruncher) that XV is already leaps and bounds ahead of PD on a per-module basis in SIMD-heavy code.
A tilde is a legit symbol. It saves space, and it should actually be used more often in speculative threads.
A tilde is also used to indicate "approximately equal to" (e.g. 1.902 ~= 2).
https://en.wikipedia.org/wiki/Tilde#As_a_relational_operator