Glad you liked them. And thanks for the video pointer; I have not seen that before! And yes, without Netburst's failure against the Athlon64, Intel would not have implemented 64-bit x86 or tick-tock!
I'm eagerly waiting to read that reply.
OK, here it is:
Regarding Theseus' ship: Of course, the components don't survive process-related and microarchitectural changes (even an ALU has to change with a new scheduler or result bus), but at a high level it continues to be a ship of about the same class, just faster while consuming less, less ugly, whatever. It doesn't become a Titanic instead.

Does adding a HW divider and increasing some buffers make the Husky core (Llano) a new microarchitecture? Yes, as it's not the same K10 anymore. But it's clearly been derived from it. So at what level, or at what scale, do changes need to happen to call it totally new (as in created from scratch) rather than just new (not the same anymore)?
The discussion itself reminds me of the long Barcelona IPC-enhancing feature list. And of course, each single improvement doesn't improve performance for all software out there, and some might even be mutually exclusive. Once I made a list of K10's microarchitectural improvements (majord should know it well), which are many, but of course they didn't improve IPC by 1% each. According to one of the fading old reviews, which compared K8 and K10 at the same clock speed (no result-spoiling turbo mode at least!), the improvement is ~24% for Cinebench. But it seems the process in AMD's own fab ate up some of this improvement.
Phenom 2.6GHz vs. Athlon X2 2.6GHz
CB ST (2936 vs. 2359): +24.4%
CB MT (10311 vs. 4570): +225.6% (2578 vs. 2285 pts/core -> +12.8% per core, scaling 87.8% vs. 93.7%!)
CB OGL (3396 vs. 2937): +15.6%
The tested game Supreme Commander: Forged Alliance seems to be scaling well in its AI-heavy benchmark:
SC:FA min/avg/max FPS: +71.5% / +17.9% / +3.7%
2.3GHz K10 vs. K8 min/avg/max: +64.0% / +17.4% / +3.0%
WinRAR (MT): +40.2%
Source:
http://www.pcper.com/reviews/Processors/AMD-Phenom-versus-Athlon-X2-Something-Old-Something-New
pcper said:
Despite its 900 MHz clock speed disadvantage, the Phenom 9600 can outpace the Athlon 64 X2 6400+ by up to 40% in applications that take advantage of multi-core CPUs, such as video encoding and 3D rendering.
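For transparency, the per-core and scaling figures above can be reproduced from the raw Cinebench scores. Here is a quick sketch of that arithmetic (my own reconstruction, not from the review; the original post's rounding may differ slightly from what this prints):

```python
# Cinebench scores from the pcper review linked above.
phenom_st, phenom_mt, phenom_cores = 2936, 10311, 4   # Phenom 2.6 GHz
athlon_st, athlon_mt, athlon_cores = 2359, 4570, 2    # Athlon X2 2.6 GHz

def gain(new, old):
    """Percent improvement of `new` over `old`."""
    return (new / old - 1) * 100

# Single-thread uplift of K10 over K8 at the same clock.
st_gain = gain(phenom_st, athlon_st)

# Per-core multi-thread uplift (MT score divided by core count).
per_core_gain = gain(phenom_mt / phenom_cores,
                     athlon_mt / athlon_cores)

# MT scaling: how close the MT score comes to core_count x ST score.
phenom_scaling = phenom_mt / (phenom_cores * phenom_st) * 100
athlon_scaling = athlon_mt / (athlon_cores * athlon_st) * 100

print(f"ST gain: +{st_gain:.1f}%")
print(f"Per-core MT gain: +{per_core_gain:.1f}%")
print(f"MT scaling: {phenom_scaling:.1f}% vs. {athlon_scaling:.1f}%")
```

The per-core comparison matters because raw MT gain (+225.6%) mixes the core-count doubling with the actual uarch improvement.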
Regarding your design complexity points (and likely less informative for you than for the forum):
DEC with its Alpha CPUs surely had to learn something on its way to the desired performance levels. I think they also had to find new ways to design things while going down the low-FO4 path. The ARM1 also wasn't sold in a product.
Interesting: the ARM developers wrote the ISA model and a microarchitectural simulator in BASIC back then. But that was a simple uarch; Alpha was complex. And as someone said a while ago, the best simulator of a chip is the chip itself. But I think that situation has changed, as also indicated by Keller's statements regarding the availability of x86 and ARM traces, LinkedIn profiles, papers, the use of FPGAs, etc. While in the past the logic complexity of big chips was far beyond what could be simulated in a cycle-exact uarch simulator (or even significant parts of them in SPICE), computing performance continued to grow exponentially, while microarchitectures gained features in a more linear fashion. Colwell's "big head" argument actually describes what happens without these options.
The growth might also help to manage testability. The improvements and complexity can only grow as much as the design team can handle. So there is a shift from creating abstract "mind models" of the uarch to think through known use cases (while simulating small parts to support this analysis) towards simulating increasing amounts of the uarch's components in increasing detail. I often have the feeling that many discussions going on here revolve around hidden wrong or simply missed assumptions, due to this increasing complexity. How often is energy efficiency left out of uarch discussions, even though in the presence of a power wall it helps high-performance processors too?! Simulation is the key to standardizing the evaluation of ideas and handling more complex scenarios. We use simulators for ADAS too. It's just not possible to recreate lots of realistic traffic conditions in a few architected NCAP/NHTSA tests. And nobody wants to go out and try to provoke some crashes to test new ideas (at least not here).
I think these changes (simulators, new design targets) might help in seeing some success when creating something new out of reused, proven components. As a counterexample to BD, Bobcat didn't do that badly for a synthesizable design created from scratch: AMD started selling the B0 stepping, which might have been necessitated by the core or any other component, and Jaguar hit the market as an A1 stepping.
Also, a shift in the design goals, for example to avoid logic that causes big voltage droops (to reduce the margin), could mean that known high-performance design decisions have to be revisited. Instead of the typical multipliers, AMD might use rectangular or unpipelined arrays. With multiple dimensions in play, such a choice might for example reduce FP-workload IPC by 10% while reducing FPU area by 10% and max power by 25%.
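Taking those hypothetical numbers at face value, the efficiency effect of such a trade-off is easy to sketch (illustrative only; the 10%/10%/25% figures are the example above, not real data):

```python
# Hypothetical FPU trade-off from the example above:
# -10% FP IPC, -10% FPU area, -25% max power.
ipc_factor = 0.90    # relative FP-workload IPC after the change
power_factor = 0.75  # relative max power after the change
area_factor = 0.90   # relative FPU area after the change

# FP performance per watt at a fixed clock scales as IPC / power.
perf_per_watt = ipc_factor / power_factor
print(f"FP perf/W change: {(perf_per_watt - 1) * 100:+.0f}%")  # +20%
```

So despite losing 10% raw FP throughput, perf/W improves by 20%, which is exactly the kind of win that matters under a power wall.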
P.S.: Oh, I have to add about the newer Colwell video that the relevant part, about using AMD to force the development of arch/uarch improvements at Intel, begins at minute 26. BTW, Dave Ditzel sat in the auditorium during that other talk at Stanford. All of this is very valuable input for my chip design game. ^^