New x86-64! - 64-256 64bit registers!
This is not useful. Modern desktop CPUs already have hundreds of physical registers, which they use to increase execution width through register renaming. What you are proposing to add is several hundred register names, which have both questionable utility and very real costs in a renaming OoOE CPU.
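To make the names-versus-registers distinction concrete, here is a minimal sketch of a rename table. The class, the register pool size and the method names are all made up for illustration and do not model any real CPU. It never frees physical registers, which a real renamer has to do.

```python
# Toy register renamer: a few architectural names map onto a larger pool
# of physical registers. All sizes and names are illustrative assumptions.

class Renamer:
    def __init__(self, arch_names, num_physical):
        # Each architectural name starts out mapped to its own physical register.
        self.table = {name: i for i, name in enumerate(arch_names)}
        # The remaining physical registers are free for renaming.
        self.free = list(range(len(arch_names), num_physical))

    def rename(self, dest, sources):
        # Sources read whichever physical register currently holds each name.
        src_phys = [self.table[s] for s in sources]
        # Every write gets a fresh physical register, so a later write to the
        # same name does not have to wait for earlier readers of that name.
        new_phys = self.free.pop(0)
        self.table[dest] = new_phys
        return new_phys, src_phys


r = Renamer(["rax", "rbx", "rcx"], num_physical=8)
first = r.rename("rax", ["rbx"])   # rax = f(rbx)
second = r.rename("rax", ["rcx"])  # rax = g(rcx), independent of the first write
print(first, second)
```

Both writes target the name `rax`, yet they land in different physical registers, so the two instructions can execute in parallel. The width comes from the physical pool, not from the number of names.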
When AMD was designing x64, they ran a lot of simulations of normal x86 code with more or fewer registers. Going from 8 to 16 was a no-brainer, but they also simulated 32 registers and found that it gave only a few percent more speed in most code. In contrast, increasing the number of register names has a major cost on context switches, since every one of them has to be stored to and reloaded from memory, so 32 registers turned out to give a net performance deficit. (There would have been no implementation cost for going to 32 names, so AMD could simply pick the best choice, which on normal x86 code was 16 names.)
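A rough back-of-the-envelope for the context-switch side, as a sketch only: it counts integer registers alone, assumes each one is stored once and loaded once per switch, and ignores FP/SIMD state, flags and any cache effects. The function name is made up for illustration.

```python
REG_BYTES = 8  # one 64-bit register

def save_restore_bytes(num_names):
    # Assumption: each register name is stored once on switch-out
    # and loaded once on switch-in.
    return 2 * num_names * REG_BYTES

# Memory traffic per context switch for various register-name counts.
traffic = {n: save_restore_bytes(n) for n in (8, 16, 32, 64, 256)}
for names, nbytes in traffic.items():
    print(f"{names:3d} names: {nbytes} bytes per switch")
```

Under these assumptions the traffic scales linearly with the number of names, so 256 names costs 16 times what 16 names does on every single switch, whether or not the code being switched ever used the extra names.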
Itanium had to have that many registers because it was in-order, with no renaming. This means that for execution width purposes, register names = registers on Itanium.
Most ARM workloads do fewer context switches than x86 because in embedded there's less varied software running at the same time, so for their workloads they chose 32, probably also because they intend to push ARMv8 to very low power targets eventually, where register renaming and OoO are not a given.
8 Decode x86 - 2x complex and 6 simple
Also, this would be ridiculously expensive in power, and the 2 complex decoders are not useful at all. Because of its very variable-width instructions, increasing x86 decode width has an exponential power cost: each extra instruction you decode per clock costs more power than the decode you already have. This is the one place where ARM has a genuine advantage over x86: increasing decode width on ARM has linear cost, so they can just go wild and decode as much as they feel like. For x86, it's better to use a uop cache to catch hot loops; it's not like you are ever going to be decode-limited at 5 instructions per clock in straight-line code anyway.
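The root of the asymmetry is finding instruction boundaries. A small sketch, using a made-up toy encoding (`toy_length` is hypothetical; real x86 lengths depend on prefixes, opcode, ModRM and more, and range from 1 to 15 bytes):

```python
def fixed_starts(window_bytes, width=4):
    # Fixed-width ISA: every start offset is known up front,
    # so all decoders can begin work independently.
    return list(range(0, window_bytes, width))

def toy_length(first_byte):
    # Hypothetical encoding for illustration only:
    # instruction length is 1 + the low two bits of its first byte.
    return 1 + (first_byte & 0x03)

def variable_starts(code, length_of):
    # Variable-width ISA: the start of instruction k is unknown until the
    # lengths of instructions 0..k-1 have been determined, a serial chain.
    starts = []
    pos = 0
    while pos < len(code):
        starts.append(pos)
        pos += length_of(code[pos])
    return starts

code = bytes([0x02, 0xAA, 0xBB, 0x00, 0x01, 0xCC, 0x03, 0x11, 0x22, 0x33])
print(fixed_starts(16))                   # offsets computed without reading any bytes
print(variable_starts(code, toy_length))  # offsets found only by walking the stream
```

In the fixed-width case the offsets need no knowledge of the bytes at all. In the variable-width case a wide decoder has to either walk that chain serially or guess at many byte offsets at once and throw most of the work away, and that is where the extra power goes as the width grows.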
As for complex instructions, they basically exist only to maintain backwards compatibility for things no real code uses anymore, for "janitorial tasks" like managing CPU state (completely irrelevant for performance), or for string instructions, where decoding more than one per clock is useless anyway. Decoding more than one complex instruction per clock is completely pointless.