Maybe we're having some miscommunication about what the "module penalty" is? Piledriver and Bulldozer had the infuriating quality of producing inferior throughput in some sparsely-threaded applications when the thread scheduler would indiscriminately load two threads onto the same module. For example, a 4m/8t Bulldozer running an application that spawned four threads would run significantly faster if one thread were allocated to each module instead of allowing two threads per module (or something similar) for any significant amount of time.
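For anyone who never dealt with it: the usual workaround back then was to pin worker threads manually so each got its own module. A rough Linux-only sketch (the CPU numbering is my assumption; on the 4m/8t FX parts, logical CPUs 2k and 2k+1 typically shared a module):

```python
import os

# Assumption: logical CPUs 2k and 2k+1 share a module (typical for the
# 4m/8t FX chips on Linux), so the even-numbered CPUs are each the
# first CPU of their own module. Linux-only: sched_setaffinity is not
# available on other platforms.
MODULE_FIRST_CPUS = [0, 2, 4, 6]

def pin_to_own_module(pid: int, worker_index: int) -> None:
    """Restrict `pid` to the first logical CPU of its own module."""
    os.sched_setaffinity(pid, {MODULE_FIRST_CPUS[worker_index]})

# e.g. pin the current process (pid 0 means "self") to module 0's CPU
pin_to_own_module(0, 0)
```

With one worker pinned per module, no two of the four threads contend for the same module's shared front end and FPU.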
SR and XV aren't nearly so bad, as your own comparisons highlight, though there still isn't 100% performance scaling moving from one thread per module to two. It's a lot closer for SR and XV than it was for Bulldozer, but let's face it, AMD's CMT implementation isn't good enough to deliver a 100% performance increase. On Cinebench R10, my 7700K manages 84% thread scaling going from ST to MT (4 threads). That's quite good compared to some of the nightmarish scaling in FP apps that happened on Bulldozer.
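For reference, that scaling figure is just the MT score divided by (thread count x ST score). Using made-up scores purely to illustrate the arithmetic (not my actual R10 numbers):

```python
def thread_scaling(st_score: float, mt_score: float, n_threads: int) -> float:
    """Fraction of ideal linear scaling achieved going from 1 to n threads."""
    return mt_score / (n_threads * st_score)

# Hypothetical scores chosen only to show the math:
st, mt = 4000, 13440
print(f"{thread_scaling(st, mt, 4):.0%}")  # 84% of an ideal 4x speedup
```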
Of course, it also helps that modern OSes (Linux and Win10 in particular) are much better about scheduling threads on modules than Windows 7 or XP used to be.
As for SuperPi, I think XV is doing well in that bench due to the cache improvements XV has over SR. And SR was no slouch in SuperPi compared to other AMD chips!
The module penalty is still there, but, yes, it is tiny compared to Bulldozer or Piledriver... but huge compared to not having one at all.
A little rundown of the difference I calculated between the measured x4 XV scores and a hypothetical module-penalty-free XV (estimated using Phenom II X4 scaling values):
Code:
Bench    | x4 XV | x4 NoMod | Improvement
CB R10   |  7708 |     8378 |        8.7%
CB R11.5 |  2.99 |     3.26 |        9.0%
CB R15   |   258 |      298 |       15.5%
GeekInt  |  7935 |     8406 |        5.9%
GeekFPU  |  6474 |     7312 |       12.9%
3dPM     |   153 |      195 |       27.4%
7-zip    |  9055 |     9644 |        6.5%
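The Improvement column is just NoMod / XV - 1 for each bench. A quick check of the numbers above (every row lands within a tenth of a percent of the table):

```python
# (bench, measured x4 XV score, estimated "no module penalty" score)
results = [
    ("CB R10",   7708, 8378),
    ("CB R11.5", 2.99, 3.26),
    ("CB R15",    258,  298),
    ("GeekInt",  7935, 8406),
    ("GeekFPU",  6474, 7312),
    ("3dPM",      153,  195),
    ("7-zip",    9055, 9644),
]

for bench, xv, nomod in results:
    improvement = (nomod / xv - 1) * 100  # percent gained without the penalty
    print(f"{bench:<8} {improvement:5.1f}%")
```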
This also includes some cache penalties (mostly L3), which probably impact 3dPM more than anything else, but should only be a small part of the rest of the results. In any event, we'd never really expect 100% scaling; 90% is quite decent. I have some code designed to scale to the highest degree possible, and it scales at 100.2%. Yeah, figure that one out, LOL!*
SuperPi is definitely a sweet spot for XV. It gains more there than in almost anything else.
*There's a fixed cost regardless of thread count, so once you get enough threads going, the parallel work overwhelms the cost of executing in one thread and scaling gets better and better. With synthetic loads, Intel's HT can manage 100% scaling - giving me 802% more performance than with just one thread.
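To illustrate the footnote with a toy model (my own simplification, not the actual benchmark code): treat runtime as a fixed cost F that doesn't shrink with thread count plus parallel work W that divides evenly across threads. As W dwarfs F, measured scaling efficiency climbs toward 100%; anything past 100% (like that 100.2%, or superlinear HT results) has to come from something outside this model, e.g. cache effects or measurement noise.

```python
def efficiency(n_threads: int, fixed: float, work: float) -> float:
    """Fraction of an ideal n-x speedup under a fixed-cost-plus-work model."""
    t1 = fixed + work                  # single-threaded runtime
    tn = fixed + work / n_threads      # n-threaded runtime
    return t1 / (n_threads * tn)

# The bigger the parallel work relative to the fixed cost,
# the closer scaling gets to 100% (never past it in this model).
for work in (10, 100, 10_000):
    print(f"W/F = {work:>6}: {efficiency(8, 1, work):.1%}")
```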