Thanks for backing me up ch424
🙂
But, I'm going to have to politely - and somewhat reservedly - disagree with what you said here:
Clearly they're in the cross-over area where it does matter; somewhere very much below Haswell-EX/POWER8 but quite a bit above (possibly; I'm guessing) Silvermont/Cortex-A12 levels.
I think there's quite a bit of evidence that goes against this. Consider what Jim Keller told AT:
Jim Keller added some details on K12. He referenced AMD's knowledge of doing high frequency designs as well as "extending the range" that ARM is in. Keller also mentioned he told his team to take the best of the big and little cores that AMD presently makes in putting together this design.
A Silvermont/A12 level core is going to be far below that sort of mark. For that matter, it even falls below the level set by the Cortex-A57s in AMD's Seattle, to be released this year. A57 is not exactly Haswell level, or Cyclone level for that matter, but it's still quite a bit wider and deeper than A12 or A17. From all of this information I very much think that AMD is doing a custom core so they can go beyond the performance of an ARM design, and incorporating elements from the construction and cat cores (note that cat also already goes beyond Silvermont to an extent) reaffirms this.
I agree absolutely that the wider, faster, bigger and so on you go with a core design, the less x86 overhead hits you. But without empirical data it's hard to nail down exactly at which point it fades into the noise. The only people in a good position to analyze this are those intimately involved in designing the CPUs, who understand the tradeoffs made. Someone like Jim Keller can at least run simulations to determine the impact of the pieces of the pipeline that are catering to x86.
I mentioned the uop cache earlier; I'd like to follow that up with this additional insight from David Kanter:
The uop cache is one of the most promising features in Sandy Bridge because it both decreases power and improves performance. It avoids power hungry x86 decoding, which spans several pipeline stages and requires fairly expensive hardware to handle the irregular instruction set. For a hit in the uop cache, Sandy Bridge’s pipeline (as measured by the mispredict penalty) is several cycles shorter than Nehalem’s, although in the case of a uop cache miss, the pipeline is about 2 stages longer.
http://www.realworldtech.com/sandy-bridge/4/
Scaling an x86 design introduces a lot of machinery that amortizes the cost of the decoders, but as you scale up the rest of the core you still have to scale up the decoders with it. Complex instruction encoding means a longer serialized decode, which means more pipeline stages - and beyond increasing your branch mispredict penalty, more stages mean more area and power spent latching and forwarding state, plus more clock distribution. The cost of decoders also doesn't merely increase linearly as you add them, because each instruction's start position depends on the length of the one before it - even if you add predecode bits to the L1 icache, you still have to scan for instruction boundaries first. And chunking the instructions along boundaries doesn't equalize everything: x86 still has complex variable offsets within the instruction depending on prefixes (REX effectively being a common prefix), opcode type (opcodes are variable size), the presence of operand extension bytes, and the combination of address and immediate fields.
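To make the serial-dependency point concrete, here's a toy sketch of boundary finding in a variable-length encoding. The length rules here are invented purely for illustration (real x86 length decode involves prefix scanning, ModRM/SIB, displacement, and immediate fields); the point is just that instruction N+1's start address isn't known until instruction N's length has been determined:

```python
def insn_length(first_byte):
    """Toy length rule (invented, NOT real x86): low 2 bits of the
    first byte give an operand-byte count, bit 2 adds a 4-byte
    immediate."""
    length = 1 + (first_byte & 0x3)
    if first_byte & 0x4:
        length += 4
    return length

def find_boundaries(code):
    """Walk the byte stream serially to find instruction start offsets.
    Note the loop-carried dependency: each step needs the previous
    instruction's decoded length before it can proceed."""
    starts, pos = [], 0
    while pos < len(code):
        starts.append(pos)
        pos += insn_length(code[pos])
    return starts
```

A parallel decoder has to break this chain speculatively (e.g. decode at every byte offset and discard the wrong ones, or cache predecoded boundary bits), which is exactly where the extra area and power go.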
The uop cache is a nice alternative to this, but it comes at a cost. I described this in that post earlier, but to add some detail: the mapping from icache lines to uop cache lines is not trivial. The lines are tagged by the address of the first instruction in the line, but lookups can target an arbitrary instruction in the line. So each line must maintain a list of instruction offsets, which it has to scan to find which uops to return - at least when the code isn't known to be continuing into the next sequential uop cache line. And since each icache line corresponds to 1-3 uop cache lines, there needs to be some mechanism for sequencing multiple ways out of a single line lookup. One other thing: on a uop cache miss you either take extra latency or you look up the L1 icache tags in parallel with the uop cache, which uses more power. Intel appears to have done the latter, in the spirit of not letting the uop cache (which doesn't have an amazing hit rate) degrade performance vs the L1 icache.
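The lookup asymmetry described above can be sketched like this. The structure and names are invented for illustration (real Sandy Bridge line formats, way counts, and replacement details differ): a line is tagged by its first instruction's address, but a branch can land on any instruction in it, so the line records per-instruction offsets and scans them on lookup:

```python
class UopCacheLine:
    """Toy uop cache line: tag = address of the line's first
    instruction; entries = list of (insn_offset, uops) pairs."""
    def __init__(self, tag, entries):
        self.tag = tag
        self.entries = entries

    def lookup(self, addr):
        """Return uops from the instruction at addr to the end of the
        line, or None if addr doesn't match an instruction start."""
        offset = addr - self.tag
        # Scan the recorded instruction offsets for a match; this scan
        # is the extra work a plain icache lookup doesn't need.
        for i, (insn_off, _) in enumerate(self.entries):
            if insn_off == offset:
                return [u for _, uops in self.entries[i:] for u in uops]
        return None  # miss within the line
```

A branch into the middle of the line (matching a recorded offset) hits; an address that falls between instruction starts misses even though it's inside the line's address range.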
Bottom line, it's not negligible. A lot of that could just be that it's a bunch of work to get right, I don't pretend to have real numbers on performance/efficiency/area impact.
Also, I don't really know the details on this, but I suspect that x86's memory model imposes additional expenses even in OoO CPUs that implement speculative memory ordering/disambiguation/alias prediction because of the read ordering requirements. But I could be totally wrong about that one. An x86 core will have to maintain coherency between icache and dcache where an ARM core won't, but there are good arguments for the ARM core to do this anyway (although I don't know any that do?) so that's kind of moot.