What I would really like to see is the underlying RISC operations of Intel and AMD processors exposed as ISAs. Newly compiled programs could then target the underlying ISAs directly, bypassing all the translation logic. Intel and AMD could introduce ways to ship multiple target executables in one binary, just like OS X fat binaries, plus something like Rosetta for when they start producing processors without the x86 translation facilities.
I know the exposed ISAs would probably be rife with IP, but I'm allowed to dream.
This would force them to freeze the underlying ISAs as they are. So far, Intel has changed theirs pretty much every tock, and as the falling price of transistors changes the costs and benefits of various tradeoffs, this is likely to continue. So, on a five-year outlook or more, revealing the underlying ISA would likely hurt performance.
Having your frontend decoupled from your backend is a good thing. I think stumbling into this by accident and necessity is probably one of the reasons x86 won. If only the frontend weren't as horrible as x86's.
If Intel wanted a new ISA, the best choice would probably be functionally almost equivalent to the parts of x86 that everyone uses, but with all the instruction encoding redone. Keep the variable instruction sizes, but make finding instruction boundaries cheaper by clamping the granularity at 2 bytes instead of 1, and either putting the length of the instruction into the first few bits of every instruction, or even putting a header into every group of 16 (or 32) bytes that marks the instruction boundaries.

Use no prefixes or other madness: have a very simple (length+opcode)(reg1)(reg2)[optional room for immediates/more complex addressing modes, in groups of 2 bytes] format that all the most widely used instructions fit into, with the less common ones taking 4 bytes. Remove all partial register updates (all non-64-bit operations zero- or sign-extend) and provide insert instructions in their place. Spend a bit to allow any instruction to skip updating the flags. Cut all the FPUs/SIMD other than AVX, and give it the same instruction-decoding overhaul (fit every AVX instruction into 4 bytes). At least initially, use microcode only to trap complex situations (like a page fault during an unaligned access) -- basically, if you can't implement the instruction in 3 uops or less, don't offer it at all.
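The boundary-finding idea can be sketched in a few lines. This is a toy model of a hypothetical encoding (not any real ISA): lengths come in 2-byte units, and the low two bits of each instruction's first byte declare its length, so a decoder can mark every boundary in a fetch block without parsing prefixes.

```python
# Toy sketch, assuming a hypothetical encoding where instructions are
# 2, 4, or 6 bytes long and the low two bits of the first byte give the
# length in 2-byte units (1..3). Because each instruction declares its
# own length, a decoder can walk a 16-byte fetch block and mark every
# instruction boundary with no prefix or modrm parsing.

def instruction_boundaries(block):
    """Return the byte offsets where instructions start in a fetch block."""
    boundaries = []
    offset = 0
    while offset < len(block):
        boundaries.append(offset)
        length_units = block[offset] & 0b11   # low 2 bits: 1..3
        assert 1 <= length_units <= 3, "reserved encoding"
        offset += 2 * length_units            # lengths are 2/4/6 bytes
    return boundaries

# A 16-byte block holding a 2-byte, a 4-byte, a 6-byte, and two more
# 2-byte instructions (first byte of each carries the length field):
block = bytes([0x01, 0xAA,                    # 1 unit  -> 2 bytes
               0x02, 0xBB, 0xCC, 0xDD,        # 2 units -> 4 bytes
               0x03, 0, 0, 0, 0, 0,           # 3 units -> 6 bytes
               0x01, 0xEE,                    # 2 bytes
               0x01, 0xFF])                   # 2 bytes
print(instruction_boundaries(block))          # [0, 2, 6, 12, 14]
```

In hardware the point is that each 2-byte slot can inspect its own length bits in parallel, then a quick prefix-sum-style combine yields all boundaries in a clock, instead of the serial scan this sequential sketch implies.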
That should fix the worst problems of x86 while actually remaining implementable. You should be able to decode a full block (either 16 or 32 bytes) with one-clock throughput (pipelined). This keeps the OOE hardware busy, and allows the frontend to shut itself down when the buffers are full, saving power. Code density should improve, as many instructions that now take 3 bytes or more would fit into 2. You can power-gate whichever frontend you are not currently using, and the new one should be so much smaller than the x86 one that it basically vanishes into the die. There'd be no need for a uop cache, as decode should more than keep up with execute. And of course, since it would be so much simpler, the new decode hardware would use much less power when it is in use.
Will they make something like this? As AMD64 has shown, the only thing that matters is performance on existing code, so probably not. We can dream.
OK, sorry for the misinformation. It sounds like that makes the compiler a hell of a lot more complicated than it already is, and means that a cache miss actually does stall the CPU in the meantime -- which doesn't sound sensible, considering that power efficiency isn't a forte of Itanium to start with.
IPF has the curious distinction of being perhaps the only still widely used ISA that is actually worse than x86. Shifting complexity from the processor to the compiler using VLIW is a fundamentally bad idea, and not just because compilers took a decade to catch up.
Most real code can be divided roughly into two categories. First, compute kernels, where you can get great IPC, and the challenge is decoding code fast enough for wide designs to take advantage of that IPC. Second, the rest, where you stall for memory and pointer-chase, where there are only occasional opportunities to execute more than one instruction per clock, and the challenge is using as few bits as possible to encode a program whose execution time is bounded by your memory system anyway. Most performance-critical programs spend most of their execution time in the first kind of segment, but by far most of their code consists of the second kind.
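The two categories differ in dependency structure, which a toy example (hypothetical code, just to make the shapes visible) shows clearly:

```python
# Toy illustration of the two kinds of code. The "kernel" is a pile of
# independent operations a wide machine could overlap; the "chase" is one
# long serial dependency chain where each load's address comes from the
# previous load, so width buys you nothing.

def compute_kernel(xs):
    # Every xs[i] * 2 is independent of the others: high ILP, easy to
    # pipeline or vectorize -- the bottleneck is feeding the core
    # instructions fast enough.
    return sum(x * 2 for x in xs)

def pointer_chase(next_index, start, steps):
    # Each step needs the result of the previous one: at most one useful
    # "load" per memory round-trip, no matter how wide the machine is.
    node = start
    for _ in range(steps):
        node = next_index[node]   # serial dependence: address <- prior load
    return node

ring = [1, 2, 3, 4, 0]            # a 5-node circular "linked list"
print(compute_kernel([1, 2, 3]))  # 12
print(pointer_chase(ring, 0, 7))  # 2
```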
IPF is truly great at the first kind -- give me any compute-related microbenchmark and otherwise equivalent IPF and x86 processors, and the IPF will win. It's the assembly tweaker's wet dream. But most real code is the second kind, at which IPF really, really sucks. Each VLIW bundle is 16 bytes, and the instructions within it cannot use each other's results. The most complex addressing mode is register-indirect, so for any memory operation fancier than that, every pointer chase you do costs 32 bytes, or even more. Sure, you'd have plenty of slots in those bundles to fit unrelated instructions into, but you don't have anything to put in them.
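The bloat arithmetic can be modeled in a few lines. This is a deliberate simplification of the argument above, not real IA-64 bundling rules: bundles are 16 bytes with up to 3 slots, and an instruction that depends on another in the current bundle has to start a new one.

```python
# Simplified model of the code-bloat argument (toy rules, not actual
# IA-64 templates/stops): bundles are 16 bytes and hold up to 3
# instructions, but an instruction that reads a result produced in the
# current bundle must start a new bundle.

BUNDLE_BYTES, SLOTS = 16, 3

def bundle_bytes(instrs):
    """instrs: list of sets naming which earlier instructions each one reads.
    Returns total code bytes under the toy bundling rule."""
    bundles = 1
    current = set()                      # instruction ids in the open bundle
    for i, deps in enumerate(instrs):
        if len(current) == SLOTS or deps & current:
            bundles += 1                 # full bundle or dependency: new one
            current = set()
        current.add(i)
    return bundles * BUNDLE_BYTES

# One pointer chase: insn 0 computes the address, insn 1 loads through it.
print(bundle_bytes([set(), {0}]))            # 32 bytes for two instructions
# Three independent instructions pack into a single 16-byte bundle.
print(bundle_bytes([set(), set(), set()]))   # 16
```

Under these toy rules a single add-then-load pointer chase eats 32 bytes of code; the same operation is a one-instruction `mov rax, [rax+8]` on x86, a few bytes.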
The end result is that real, normal code tends to swell up really badly and overflow any sensible instruction cache. Which is just great, because in the absence of OOE, on every miss you really wait. If you ever wondered why they put 1MB of L2 instruction cache per core on the 90nm Montecito, well, now you know. Any savings from the simpler decode and the lack of OOE, they paid back right there, with really heavy interest.
The only reason IPF even exists now is that Intel has segmented all the cool RAS features into Itanium servers -- expensive high-end servers that need reliability use Itanium despite the instruction set, not because of it.
Note that I do think there are places where VLIW shines -- notably, AMD GPUs use it, and it allows them much better compute density than nVidia. But that's because the workloads they get are basically tailor-made to their strengths: in AMD shaders, all memory operations are segregated into separate programs run individually from the shader ops, and pointer-chasing is more or less just banned. Unfortunately, you cannot really do that for general-purpose code. (Itanium should be pretty good at OpenCL, if somebody bothered to write a decent compiler for it.)