Considering that process generally has a minor impact on actual architectural performance (per clock in particular), it kinda discredits the interview as a whole, if anything.
I think he was referring to the remaining power efficiency improvements, because part of the process improvement has likely already been used up in uarch and cycle time.
The problem here is the words he used, particularly "per clock". Maybe he made a mistake and so gave misleading information.
AMD mentions 40% for ST IPC, first of all because they would always communicate the higher number (if 40% were the MT figure, they would communicate 50%+ for ST instead), second because if it were otherwise Zen would tie or beat Intel's HEDT in more traditional benchmarks such as Cinebench. So not only is there confusion in the way you used IPC in relation to the number of threads per core, you may have also chosen a rather unreliable path to interpret what little AMD claimed in regard to Zen vs. XV.
So you're stating that there are separate ST and MT definitions of IPC, right? I beg to differ.
There was nothing like that in the definition that someone reported before, but it's not an isolated case.
For example, here you can find the definition from a respectable source (they should know about measuring performance, right?). Pay attention to this: "IPC is an excellent metric for judging an overall potential for application performance tuning". So: the application, as a whole. No ST or MT distinction, because IPC is measured by running an application, not a single part of its execution.
From another source (64-ia-32 manual-325462.pdf, "Intel® 64 and IA-32 Architectures Software Developer’s Manual"), at p.45:
"2.2.3.2 Execution Core
The execution core of the Intel Core microarchitecture is superscalar and can process instructions out of order to increase the overall rate of instructions executed per cycle (IPC)."
So the whole core is considered, NOT a part of it, and the definition is not split into ST/MT terms.
From another source (64-ia-32-architectures-optimization-manual.pdf, "Intel® 64 and IA-32 Architectures Optimization Reference Manual"), at p.20:
"2.3.3 The Out-of-Order Engine
The Out-of-Order engine provides improved performance over prior generations with excellent power characteristics. It detects dependency chains and sends them to execution out-of-order while maintaining the correct data flow. When a dependency chain is waiting for a resource, such as a second-level data cache line, it sends micro-ops from another chain to the execution core. This increases the overall rate of instructions executed per cycle (IPC)."
Pay attention to the "overall".
At p.583:
"Retiring denotes slots utilized by “good operations”. Ideally, you want to see all slots attributed here since it correlates with Instructions Per Cycle (IPC). Nevertheless, a high Retiring fraction does not necessary mean there is no room for speedup."
In fact, when the CPU retires instructions, it doesn't distinguish between threads: it retires them whichever thread (one or two) they come from.
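To make that reading concrete, here is a minimal sketch of IPC computed for the core as a whole: the instructions retired by every active hardware thread are summed and divided by the core's cycle count. The function name and the counts are made up for illustration; they are not measurements and do not come from the Intel manuals.

```python
def core_ipc(retired_per_thread, core_cycles):
    """Whole-core IPC: all retired instructions, whichever thread they
    came from, divided by the core cycles elapsed (hypothetical helper)."""
    if core_cycles <= 0:
        raise ValueError("core_cycles must be positive")
    return sum(retired_per_thread) / core_cycles


# Made-up counts over the same 1000-cycle window, for illustration only.
one_thread = core_ipc([1500], 1000)        # one thread active on the core
two_threads = core_ipc([1000, 900], 1000)  # two SMT threads active

print(one_thread, two_threads)
```

With these invented numbers, each thread in the second case retires fewer instructions than the lone thread in the first, yet the core as a whole retires more per cycle.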
At p.586:
"B.1.7 Retiring
This category reflects slots utilized by “good micro-ops” – issued micro-ops that get retired expeditiously without performance bottlenecks. Ideally, we would want to see all slots attributed to the Retiring category; that is Retiring of 100% of every slots correspond to hitting the maximal micro-ops retired per cycle of the given microarchitecture. For example, assuming one instruction is decoded into one microop, Retiring of 50% in one slot means an IPC of 2 was achieved in a four-wide machine. In other words, maximizing the Retiring category increases the IPC of your program."
Again, it talks about the whole program. Not ST and/or MT.
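The arithmetic in that quote can be sketched in a few lines. This assumes, as the manual's example does, that one instruction decodes into one micro-op; the function and parameter names are mine, not Intel's.

```python
def ipc_from_retiring(retiring_fraction, machine_width, uops_per_instruction=1.0):
    """Estimate IPC from the Retiring fraction of issue slots.

    retiring_fraction: share of slots attributed to Retiring, 0.0 to 1.0
    machine_width: slots per cycle (4 for the four-wide machine in the quote)
    uops_per_instruction: assumed 1.0, as in the manual's example
    """
    if not 0.0 <= retiring_fraction <= 1.0:
        raise ValueError("retiring_fraction must be between 0 and 1")
    if uops_per_instruction <= 0:
        raise ValueError("uops_per_instruction must be positive")
    uops_retired_per_cycle = retiring_fraction * machine_width
    return uops_retired_per_cycle / uops_per_instruction


print(ipc_from_retiring(0.5, 4))  # 2.0, the manual's four-wide example
```

Note that nothing in the calculation refers to how many threads produced the retired micro-ops.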
I think that's enough. So there was no confusion when I talked about IPC, nor when I talked about Blender's results.
I was at HC28. Mr. Clark was quite clear that it was ST.
Questioner from Nvidia: You had 40% uplift on IPC. Did it include the dual-thread, or was that per thread?
Clark: That was just a one-thread number. We do have good throughput on SMT but we're not stating numbers on that right now.
For those who have access to HC presentation videos, the timecode is 1:27:46 on the session9 video.
I have no problem believing you, but see above: I respectfully disagree with such a definition of IPC.
@bjt2: I've no time now to reply to your post. However you can take a look at Intel's optimization manual, and you'll see the architectures' diagrams that you're looking for, as well as a lot of other useful information.
