Indeed, architecture is loosely reflected at the RTL level, while design is reflected at the gate/cell level.
In my example above, we are using the very same architecture with a different design (e.g. a different cell mix, drivers, netlist), in conjunction with low-power memories for the caches etc.
The achievable frequency range spans at least a factor of 2 between ultra low power (at or below nominal voltage) and high performance (at overdrive voltage) for the very same architecture.
I do not agree that it is overstated; based on your comments, it is understated. It's not just the uOp cache itself, it's the fact that there is a uOp cache at all, with a large miss penalty (e.g. on a pre-decode miss). It's the fact that with variable instruction lengths you cannot start decoding an instruction before knowing the length of the previous one. It's the fact that there are few 3-operand instructions, which requires additional moves (either register-to-register or register-to-memory). It's the fact that you only have 16 registers, which increases the number of memory accesses. It's the fact that memory references are hard to prove independent and cannot be subject to register renaming during OoO scheduling. It's the fact that the return address is stored on the stack instead of in a register. It's the fact that far calls use segment descriptors to access the GDT or LDT... I could go on and on. Most of these "features" were much less of a problem in the 20th century, but today they both limit efficiency and set an upper bound on the realistically achievable IPC.
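To make the variable-length point concrete, here is a toy sketch. The length encoding below is invented for illustration (it is not real x86, where length depends on prefixes, opcode, ModRM, etc.), but it shows the structural issue: each instruction boundary depends on decoding the previous instruction, so boundary-finding is a serial chain, whereas a fixed-width ISA knows every boundary up front.

```python
# Toy encoding (an assumption for illustration, NOT real x86):
#   first byte 0x00-0x7F -> 1-byte instruction
#   first byte 0x80-0xBF -> 2-byte instruction
#   first byte 0xC0-0xFF -> 4-byte instruction
def insn_length(first_byte: int) -> int:
    if first_byte < 0x80:
        return 1
    if first_byte < 0xC0:
        return 2
    return 4

def decode_variable(code: bytes) -> list[tuple[int, int]]:
    """Find (offset, length) of each instruction sequentially.

    The start of instruction i+1 is unknown until instruction i has been
    (at least partially) decoded -- this serial dependency is what
    pre-decode hardware and the uOp cache exist to hide.
    """
    boundaries, pc = [], 0
    while pc < len(code):
        n = insn_length(code[pc])
        boundaries.append((pc, n))
        pc += n
    return boundaries

def decode_fixed(code: bytes, width: int = 4) -> list[tuple[int, int]]:
    """With a fixed width, all boundaries are known immediately, so every
    instruction in the fetch block can go to a decoder in parallel."""
    return [(pc, width) for pc in range(0, len(code), width)]
```

Real decoders speculate on likely boundaries and keep pre-decode bits in the instruction cache, but the fundamental serial dependency shown in `decode_variable` is what they are working around.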