At this point, I think we all just need to come to terms with the fact that IPC has become synonymous with performance per clock and no amount of "well, actually" can change that. As long as people realize that they aren't quite using the term in the most literal sense, it's fine.
Absolutely great point. We do have to respect terminology, so finding another term for what we're actually measuring as "IPC" seems appropriate for discussion purposes. Modern processors have so many features that defy the traditional IPC definition, such as hyperthreading (SMT) or new ISA instructions suited only to certain programs, that when we talk about IPC where the wheels meet the road, we're really talking about something like "clock efficiency."
I'll put this definition out there for comment as a real-world, measurable stand-in for IPC, so that when we discuss it we know we're talking about the same thing.
"Clock efficiency," or maybe "processor efficiency"? I don't know, though, as the latter brings power consumption to mind. Anyway, the definition would be: at a fixed frequency, the rate of output for a specific input. Example: fix the clock, run the benchmark, measure the time to completion, and you have the rate. Time to completion can serve as the performance-efficiency metric in this case.
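To make that concrete, here's a minimal sketch of the metric. All the numbers below (work units, clocks, runtimes) are invented purely for illustration, not real measurements: the point is just that "work per cycle" falls out of dividing work done by (clock × time).

```python
# Hypothetical sketch of "clock efficiency": fix the clock, run the same
# benchmark on two CPUs, time them, and compare work done per cycle.
# All figures are made up for illustration.

def clock_efficiency(work_units, clock_ghz, seconds):
    """Abstract work completed per clock cycle: work / (clock * time)."""
    cycles = clock_ghz * 1e9 * seconds
    return work_units / cycles

# Same benchmark (say, 1e12 abstract work units), both CPUs locked to 4 GHz:
cpu_a = clock_efficiency(1e12, 4.0, 100.0)  # finishes in 100 s
cpu_b = clock_efficiency(1e12, 4.0, 80.0)   # finishes in 80 s

print(cpu_b / cpu_a)  # → 1.25: CPU B does 25% more work per clock
```

Since the clock and the workload are both fixed, the ratio of the two runtimes is all that actually matters, which is why time to completion works as the metric.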
Furthermore, if benchmarks could be run while measuring the "average effective clock" with HWiNFO, as we are doing in the Handbrake bench in this forum, we could get a pretty valid result for what I'm calling clock efficiency. This will only become more important going forward, with clocks jumping all over the place. HWiNFO constantly polls the clock on each core at a very fine-grained interval and then integrates the samples to arrive at the average effective clock for the entire core cluster on the CPU.
The main reason I analyzed the Anandtech data is not only for my curiosity but so we could discuss performance from the same set of results when talking about the generations.
I generally shy away from putting out controversial posts like that for fear of doing a lot of work and then being berated for it. You know the saying: "no good deed goes unpunished." But I've been around here a long time, and I know most people here are like me in that we really enjoy following the processor industry on all fronts. Add to that the fact that I'm always willing to learn, and I try to take posts directed at me in the best light, not the worst. When someone posted "you should use geomean," my first reaction was: all that work, and I should have used geomean? Then after thinking about it I realized, "yeah, that's an honest and positive contribution to the discussion, I can do that."
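For anyone wondering why the geomean suggestion matters: when you average speedup ratios across a suite of different benchmarks, the geometric mean is the standard choice, because it treats a 2x gain and a 0.5x loss symmetrically where the arithmetic mean does not. A quick illustration with made-up numbers:

```python
# Why geomean for benchmark ratios: the arithmetic mean of relative
# speedups is biased upward; the geometric mean is not. Numbers invented.

import math

def geomean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

speedups = [2.0, 0.5]          # one test doubled in speed, one halved
print(sum(speedups) / 2)       # arithmetic mean: 1.25 (looks like a net win)
print(geomean(speedups))       # geometric mean: 1.0 (correctly a wash)
```

Python 3.8+ also ships this as `statistics.geometric_mean` if you don't want to roll your own.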
I think that should be solved on the OS front, just like multicore CPUs, SMT, or even Bulldozer's module approach (modules were loaded in a poor order until the OS treated them as 1-2-3-4 cores + SMT).
A purely parallel load? Well, the small cores won't be that much slower in the first place; at worst, I believe, half the speed:
Goldmont: 1.5 IPC × 5 GHz = 7.5 units of work
Gracemont: 1.0 IPC × 3.5 to 4 GHz = 3.5 to 4 units of work
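Those figures follow directly from throughput = IPC × clock. A quick check of the arithmetic, keeping in mind the IPC values above are the poster's guesses, not measured data:

```python
# Per-core throughput as IPC x clock, using the assumed (not measured)
# figures from the post above.

def units_of_work(ipc, clock_ghz):
    return ipc * clock_ghz

big = units_of_work(1.5, 5.0)       # 7.5 units
small_lo = units_of_work(1.0, 3.5)  # 3.5 units
small_hi = units_of_work(1.0, 4.0)  # 4.0 units

# The small core lands at roughly 0.47x to 0.53x of the big core,
# which is where the "at worst half the speed" claim comes from.
print(small_lo / big, small_hi / big)
```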
Making asymmetrical loads scale well might be hard but if every smartphone can do this (even on 3 different clusters) I don't see Microsoft and Intel not pulling it off.
A good idea would be to run the OS and other background apps (think antivirus, mail, etc.) on the small cores constantly; every other program would then see only big cores with SMT threads.
Ideally, when the load is light enough, the big cores could be turned off and the workload moved to the small-core clusters only, just as cores today throttle down to idle clock speeds when they aren't doing much.
Good points here. The units of work for Goldmont and Gracemont... did you pull them from the video of that guy on the previous page? I'm not critiquing, just wondering. His analysis (guesses) did seem reasonable.
Also, to get into even finer-grained "thread proportioning" of work, I wonder if the scheduler mechanism could begin to take thermals into account. For example, you have 8 physical cores available. Within thermal parameters you can run 4 at full speed and 4 at quarter speed, or 2 at full, 2 at three-quarter, and 4 at half, etc. A really smart scheduler could control the frequency of various threads to maximize overall application execution speed. Adding big/little cores just adds more variables to this "thread optimization" formula.
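As a toy illustration of that idea: given a fixed thermal budget, the scheduler's problem is to pick per-core speed levels that maximize total throughput. The speed/power levels below are invented (real power grows superlinearly with clock, which is exactly why running more cores slower can beat running fewer cores flat-out), and brute force stands in for whatever a real scheduler would do:

```python
# Toy "thread proportioning" under a thermal budget: brute-force the
# per-core speed levels that maximize total throughput. Speed fractions
# and power costs are invented; power deliberately rises faster than speed.

from itertools import product

# (speed fraction, power cost) -- power in arbitrary integer units
LEVELS = [(0.0, 0), (0.25, 15), (0.5, 35), (0.75, 65), (1.0, 100)]

def best_allocation(n_cores, budget):
    """Return (total_speed, per-core levels) maximizing speed within budget."""
    best_speed, best_combo = 0.0, None
    for combo in product(LEVELS, repeat=n_cores):
        power = sum(p for _, p in combo)
        speed = sum(s for s, _ in combo)
        if power <= budget and speed > best_speed:
            best_speed, best_combo = speed, combo
    return best_speed, best_combo

speed, combo = best_allocation(4, 200)
print(speed, sorted(s for s, _ in combo))  # → 2.5 [0.5, 0.5, 0.75, 0.75]
```

With this (made-up) power curve and a budget of 200 units, the winner is two cores at three-quarter speed plus two at half speed (2.5 units of throughput), beating both "all four at half" (2.0) and any combination that includes a full-speed core, which is the intuition behind letting a scheduler proportion frequency across threads.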
If you have an app coded such that it needs one or two really screaming-fast threads, the scheduler could crank those up at the expense of the others.
This would be an on-the-fly learning process for the OS; ideally it would involve AI so it could "remember" previous runs, or better yet, have access to an online database of similar data.