Intel CPU / x86: Why not compile straight to uOps?

Hugo Drax

Diamond Member
Nov 20, 2011
5,647
47
91
I wonder how much overhead is involved in the whole x86-ISA-to-uOps translation process, in terms of performance penalty and lowered IPC?

If Apple was able to transition from PowerPC to x86, why not offer a pure uOps mode? I.e., with the Intel compiler you target uOps instead of x86 and you get a clean executable that runs in native mode.

Offer this mode, and a future Intel CPU can scrap the legacy ISA and just offer processors where all the transistors are dedicated to running native uOp code with no overhead.

For Windows and Mac they can offer a Rosetta-type emulator that would be transparent to the end user; the software would handle the x86-to-uOp translation.
 

zir_blazer

Golden Member
Jun 6, 2013
1,150
389
136
What you're talking about comes from the RISC vs. CISC days and how current x86 processors are RISC-like internally but with a CISC-like frontend and ISA. Processors got dedicated logic to translate x86 instructions into microOps so they can be 100% x86 compatible, yet their inner workings can be totally different.
Remember that while the x86 ISA is standardized, I don't think the microOps are. They could quite possibly vary radically from one architecture to another. If you were capable of feeding microOps directly to the processor (which I don't think is even possible, considering that you have the x86 frontend for a reason), an executable would only work on a single architecture (Nehalem, Sandy Bridge, or Haswell), not all of them, and worse, a different vendor's microOps would probably be completely different. And keep in mind that we're where we are because people always preferred backwards compatibility, portability and ease of development over raw performance, which is why x86 assembler is barely heard about nowadays. And you want to go a step below it.
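To make the "one x86 instruction becomes several microOps" idea concrete, here is a purely illustrative sketch in C. The uop kinds and the exact three-way split are invented for illustration; the real micro-op encodings are undocumented and differ between microarchitectures.

#include <stdio.h>

/* Purely illustrative: how a single x86 read-modify-write instruction
   might be cracked into simpler micro-ops by the frontend. */
typedef enum { UOP_LOAD, UOP_ALU_ADD, UOP_STORE } uop_kind;

typedef struct {
    uop_kind kind;
    const char *dst;   /* symbolic operands, just for readability */
    const char *src;
} uop;

int main(void) {
    /* x86 source instruction:  add DWORD PTR [rax], ebx   (memory += register) */
    uop cracked[] = {
        { UOP_LOAD,    "tmp",   "[rax]" },  /* load the memory operand         */
        { UOP_ALU_ADD, "tmp",   "ebx"   },  /* do the actual addition          */
        { UOP_STORE,   "[rax]", "tmp"   },  /* write the result back to memory */
    };
    for (unsigned i = 0; i < sizeof cracked / sizeof cracked[0]; i++)
        printf("uop %u: kind=%d  %s <- %s\n", i, (int)cracked[i].kind,
               cracked[i].dst, cracked[i].src);
    return 0;
}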
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
In addition to what zir_blazer said about interoperability (going to binaries that work on only one processor line would be suicide): Intel uops aren't similar to some RISC ISA. They're probably very large and not necessarily a multiple of 32 bits in size, so storing them in main RAM wouldn't be efficient, and keeping them in L1/L2 cache isn't efficient either.

They're in a format that's useful for internal management within the processor. That's not going to be the same thing as something ideal for programs. Pretty much every processor has different representations of instructions at the decoders vs. how they're passed through the rest of the pipeline. Otherwise you wouldn't even need decoder stages, which, again, pretty much all processors have (even ones dealing with really simple ISAs like the original MIPS). That doesn't mean there's a big overhead problem that you're looking to get rid of, or that it's at all comparable to the huge cost of converting code from one ISA to another in software.
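As a purely hypothetical illustration of why that internal format is bulky (every field name and width below is invented, since Intel doesn't document its real micro-op layout), a post-decode uop record might have to drag along this kind of bookkeeping:

#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint16_t opcode;          /* internal operation selector                  */
    uint8_t  dest_phys_reg;   /* renamed physical register, not architectural */
    uint8_t  src1_phys_reg;
    uint8_t  src2_phys_reg;
    uint8_t  exec_port_mask;  /* which execution ports may run this uop       */
    uint16_t rob_entry;       /* reorder-buffer slot used at retirement       */
    uint16_t load_store_id;   /* memory-ordering bookkeeping                  */
    uint32_t immediate;       /* immediate operand, widened                   */
    uint64_t branch_info;     /* prediction / recovery information            */
    uint8_t  flags_written;   /* which condition flags this uop updates       */
} internal_uop;

int main(void) {
    /* Even this toy layout is several times larger than an average x86
       instruction (roughly 3-4 bytes), before padding is even considered. */
    printf("sizeof(internal_uop) = %zu bytes\n", sizeof(internal_uop));
    return 0;
}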
 

Hulk

Diamond Member
Oct 9, 1999
4,031
1,743
136
This is interesting. If the processor is decoding the instructions into uOps anyway, then why not recompile the software to do this so the processor doesn't have to? I understand that each type of processor would need its own compile, but it is an interesting idea. I wonder how much, if any, increase in performance there would be?
 

VirtualLarry

No Lifer
Aug 25, 2001
55,994
9,873
126
This is interesting. If the processor is decoding the instructions into uOps anyway, then why not recompile the software to do this so the processor doesn't have to? I understand that each type of processor would need its own compile, but it is an interesting idea. I wonder how much, if any, increase in performance there would be?

See: Transmeta
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
I've read that the NexGen Nx586 could be switched between executing x86 instructions and native RISC instructions. Not sure if that was true, and if true, whether there was a performance benefit to doing the latter.
 

SOFTengCOMPelec

Platinum Member
May 9, 2013
2,417
75
91
See: Transmeta

I've read that the NexGen Nx586 could be switched between executing x86 instructions and native RISC instructions. Not sure if that was true, and if true, whether there was a performance benefit to doing the latter.

I have (still) got a Transmeta tablet computer.

(From what I remember and understand) it grabs some of the available RAM for itself, and when a program is run, it stops for a fraction of a second (or even a few seconds) and converts the x86 on the fly to its own code (stored in that grabbed RAM).

The resultant code then seems to run (very, very approximately) about twice as fast as I would expect from a CPU of that period and at that price point.
(Unfortunately, even at that double speed it is still a fair bit slower than the Intel mobile processors of that era.)

N.B. Please treat my "double speed" comment as being hugely subjective, and totally non-scientific.
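For anyone curious, here is a very rough C sketch of that "translate once, then reuse from the grabbed RAM" idea. The function names and the trivial direct-mapped cache are made up for illustration and have nothing to do with Transmeta's actual Code Morphing Software.

#include <stdint.h>
#include <stdio.h>

#define CACHE_SLOTS 1024

typedef void (*native_block)(void);   /* a translated, directly runnable chunk of code */

/* Stand-in for real emitted native code. */
static void dummy_block(void) { }

/* Stand-in for the expensive step: decode the x86 at guest_pc, optimize it,
   and emit native (e.g. VLIW) code into the RAM the translator reserved.
   This is the part you notice as a pause the first time a program runs. */
static native_block translate_block(uint64_t guest_pc) {
    (void)guest_pc;
    return dummy_block;
}

/* Trivial direct-mapped translation cache: translate once, reuse afterwards. */
static struct { uint64_t guest_pc; native_block code; } cache[CACHE_SLOTS];

static native_block lookup_or_translate(uint64_t guest_pc) {
    uint64_t slot = guest_pc % CACHE_SLOTS;
    if (cache[slot].code == 0 || cache[slot].guest_pc != guest_pc) {
        cache[slot].guest_pc = guest_pc;
        cache[slot].code = translate_block(guest_pc);   /* slow, first time only */
    }
    return cache[slot].code;                            /* later hits are cheap  */
}

int main(void) {
    lookup_or_translate(0x400000);   /* miss: pays the translation cost  */
    lookup_or_translate(0x400000);   /* hit: runs the cached translation */
    printf("second lookup reused the cached block\n");
    return 0;
}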
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
Found a description here:

The NexGen/AMD Nx586 (early 1995) is unique by being able to execute its micro-ops (called RISC86 code) directly, allowing optimised RISC86 programs to be written which are faster than an equivalent x86 program would be, but this feature is seldom used.

This is the only description I've found that states this. I can't find any details on how to enter this non-x86 mode.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
The resultant code then seems to run (very, very approximately) about twice as fast as I would expect from a CPU of that period and at that price point.
(Unfortunately, even at that double speed it is still a fair bit slower than the Intel mobile processors of that era.)

N.B. Please treat my "double speed" comment as being hugely subjective, and totally non-scientific.

I doubt that. The performance was terrible and they didn't have the economy of scale that Intel or even AMD had at the time, nor could they reap die harvests from higher end parts. Efficeon in particular was larger than all of its contemporary CPUs, and performed much worse. You can read more here: http://www.vanshardware.com/reviews/2004/04/040405_efficeon/040405_efficeon.htm

If Transmeta was such a great value I'm sure they wouldn't have tanked so hard. They couldn't come close to delivering on any of the points they promised. Not performance, not even perf/W.

Bottom line: Transmeta was a failed idea and a dead end. It had some really cool hardware features for accelerating binary translation, like a sort of transactional memory system, but it wasn't enough; it was ultimately still limited in a lot of the ways software translation always is, and it brings huge overhead with it. You just can't beat pure hardware solutions with this. Being VLIW was a double whammy; it's just not the best approach for general-purpose code. Transmeta was really the ultimate in "let the compiler do everything somehow" naivety.
 

SOFTengCOMPelec

Platinum Member
May 9, 2013
2,417
75
91
I doubt that. The performance was terrible and they didn't have the economy of scale that Intel or even AMD had at the time, nor could they reap die harvests from higher end parts. Efficeon in particular was larger than all of its contemporary CPUs, and performed much worse. You can read more here: http://www.vanshardware.com/reviews/2004/04/040405_efficeon/040405_efficeon.htm

If Transmeta was such a great value I'm sure they wouldn't have tanked so hard. They couldn't come close to delivering on any of the points they promised. Not performance, not even perf/W.

Bottom line: Transmeta was a failed idea and a dead end. It had some really cool hardware features for accelerating binary translation, like a sort of transactional memory system, but it wasn't enough; it was ultimately still limited in a lot of the ways software translation always is, and it brings huge overhead with it. You just can't beat pure hardware solutions with this. Being VLIW was a double whammy; it's just not the best approach for general-purpose code. Transmeta was really the ultimate in "let the compiler do everything somehow" naivety.

I agree, it did have poor performance.
I was confused because, after it finishes converting (re-compiling), there is a noticeable speed improvement.
BUT it was still a lot slower than better processors available at the time, e.g. Intel.
(I tried to convey this in my original post, but apparently did not succeed).

This says it saved a decent amount of power
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,286
1,363
136
This is interesting. If the processor is decoding the instructions into uOps anyway, then why not recompile the software to do this so the processor doesn't have to? I understand that each type of processor would need its own compile, but it is an interesting idea. I wonder how much, if any, increase in performance there would be?

Almost none at all. Decode is done in its own pipeline stage, that is, *in parallel* with execution. While the CPU is decoding one set of instructions, it's executing the previous ones. All you'd gain from writing the internal format directly would be a branch-miss delay that's one or two cycles shorter -- and that is pretty well hidden by the branch predictor in most cases.

Designing a CPU without the translation would save you some area and power, but as long as it's there, there is very little cost in using it.
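A toy illustration of that overlap, not a model of any real CPU (the three-stage fetch/decode/execute pipeline below is deliberately simplistic):

#include <stdio.h>

#define N_INSTR 5   /* a short straight-line run of instructions */

int main(void) {
    /* Instruction i is fetched at cycle i, decoded at cycle i+1,
       executed at cycle i+2: decode of I(n+1) overlaps execute of I(n). */
    for (int c = 0; c < N_INSTR + 2; c++) {
        printf("cycle %d:", c);
        for (int i = 0; i < N_INSTR; i++) {
            if (c == i)     printf("  I%d:fetch  ", i);
            if (c == i + 1) printf("  I%d:decode ", i);
            if (c == i + 2) printf("  I%d:execute", i);
        }
        printf("\n");
    }
    /* Once the pipeline is full, one instruction still completes per cycle,
       so removing the decode stage wouldn't raise steady-state throughput;
       it would only shave a cycle or two off a pipeline refill. */
    return 0;
}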
 

SOFTengCOMPelec

Platinum Member
May 9, 2013
2,417
75
91
Another advantage (at least in theory, but I think I've heard Intel likes this) is that the chip companies can freely change the uOps instruction set between major CPU architecture releases, with NO worries about getting all the compilers/assemblers in the world changed over to the new instruction set.
Essentially, all they have to worry about is changing the on-chip x86 decoders/translators to the new scheme.

Even if all the compilers/assemblers in the world were changed, there would still be a huge code base of already-released software, so changing the uOps, if we used them directly, could well have been a nightmare.

It's possible that keeping the details of the uOps secret (if they are secret, I don't know) helps Intel keep a lead over their competitor(s).
 

Headfoot

Diamond Member
Feb 28, 2008
4,444
641
126
The Itanium tried to move processor-level complexity into the compiler, and that went over like a lead balloon.
 

aigomorla

CPU, Cases&Cooling Mod PC Gaming Mod Elite Member
Super Moderator
Sep 28, 2005
20,781
3,076
126
Microsoft will not support it, and it will be a repeat of IBM and Cyrix...
 

glugglug

Diamond Member
Jun 9, 2002
5,340
1
0
From what I've read this would reduce performance.


For tight loops that are already cached, the code has been kept as uOps instead of x86 instructions since the Pentium 4 (its trace cache), so it isn't being decoded again inside the loop.

For larger chunks of code, or code that isn't loops, memory bandwidth is now the bottleneck. The smaller number of denser x86 instructions takes less space than the equivalent micro-ops would, and gets fetched from RAM more quickly.
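A back-of-the-envelope sketch of that density argument; the byte counts and the crack ratio below are assumptions picked for illustration, not measurements:

#include <stdio.h>

int main(void) {
    /* All of these numbers are assumptions chosen for illustration. */
    const double x86_bytes_per_instr = 4.0;   /* assumed average x86 instruction size    */
    const double uops_per_instr      = 1.5;   /* assumed x86-to-uop crack ratio          */
    const double bytes_per_uop       = 12.0;  /* assumed size of an exposed uop encoding */
    const double instr_count         = 1e6;   /* one million x86 instructions            */

    double x86_bytes = instr_count * x86_bytes_per_instr;
    double uop_bytes = instr_count * uops_per_instr * bytes_per_uop;

    printf("x86 footprint: %.1f MB\n", x86_bytes / 1e6);
    printf("uop footprint: %.1f MB (%.1fx more to fetch and cache)\n",
           uop_bytes / 1e6, uop_bytes / x86_bytes);
    return 0;
}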
 

TuxDave

Lifer
Oct 8, 2002
10,572
3
71
Hugo Drax, you should be pleased that you are definitely not the only one asking this question. Removing the front end and exposing the execution engine to software (which will change from processor to processor, no more standard ISA) is a big win if you can outweigh the overhead of uop generation. Lots of good points addressed above, and lots of studies required.

I know there's research papers evaluating this, and it will definitely be one of those "mind blown" things if proven to be beneficial.

Another advantage (at least in theory, but I think I've heard Intel likes this) is that the chip companies can freely change the uOps instruction set between major CPU architecture releases, with NO worries about getting all the compilers/assemblers in the world changed over to the new instruction set.

You bet it happens. :)
 

zir_blazer

Golden Member
Jun 6, 2013
1,150
389
136
I think that most people here are forgetting that you already have a low-level method to talk directly to the processor: Assembly. The proposed microOps method would be an even lower level than ASM, but you can analyze what's good and what's wrong with Assembly first. Assembly was quite common up to the 90's. Usually games, but anything that required the highest achievable performance had to be optimized in ASM in some way or another. This includes games like Wolfenstein 3D and Doom from id Software, which had ASM routines, Transport Tycoon from Chris Sawyer, which was written entirely in ASM, and even applications like emulators such as ZSNES, NeoRageX and no$gba, which had great performance thanks to all the ASM code.
As hardware got faster, developers started to choose ease of learning and use, code maintainability, reusability and portability over raw performance, which is why resource hogs like Java are so popular right now. Most popular high-level languages seem to simply suck dry whatever hardware performance you throw at them, because software seems to be shipped as soon as it works instead of being polished to the last bit for better performance. That's because hardware is so cheap these days compared to development time, and because, thanks to the Internet, any bug or optimization can be handled after launch via patches.
If few people today bother to learn and/or use Assembly, even though the performance increase would be RIDICULOUSLY higher than it was during the 90's (at that time the comparison was between C/C++ and ASM, while now you have things like Java), it gives you an idea of how many people would actually bother with this microOps thing. Heck, ASM itself can be heavily hand-optimized with a specific architecture in mind.
 

Hulk

Diamond Member
Oct 9, 1999
4,031
1,743
136
I think that most people here are forgetting that you already have a low-level method to talk directly to the processor: Assembly. The proposed microOps method would be an even lower level than ASM, but you can analyze what's good and what's wrong with Assembly first. Assembly was quite common up to the 90's. Usually games, but anything that required the highest achievable performance had to be optimized in ASM in some way or another. This includes games like Wolfenstein 3D and Doom from id Software, which had ASM routines, Transport Tycoon from Chris Sawyer, which was written entirely in ASM, and even applications like emulators such as ZSNES, NeoRageX and no$gba, which had great performance thanks to all the ASM code.
As hardware got faster, developers started to choose ease of learning and use, code maintainability, reusability and portability over raw performance, which is why resource hogs like Java are so popular right now. Most popular high-level languages seem to simply suck dry whatever hardware performance you throw at them, because software seems to be shipped as soon as it works instead of being polished to the last bit for better performance. That's because hardware is so cheap these days compared to development time, and because, thanks to the Internet, any bug or optimization can be handled after launch via patches.
If few people today bother to learn and/or use Assembly, even though the performance increase would be RIDICULOUSLY higher than it was during the 90's (at that time the comparison was between C/C++ and ASM, while now you have things like Java), it gives you an idea of how many people would actually bother with this microOps thing. Heck, ASM itself can be heavily hand-optimized with a specific architecture in mind.


Great post. I don't know why some compute-intensive apps aren't coded in Assembly. I mean, I understand why: it's time consuming (expensive) and you've got to have talented programmers to get it done. But imagine a video editing package that could preview and encode 3 or 4 times faster than the competition. Seems like that would be a huge selling point to me.

Remember SAW (Software Audio Workshop) from the '90s? A hand-coded Assembly multitrack audio program by the genius Bob Lentini. It was light years ahead of anything else available at the time. It made a Pentium run multitrack audio with fx like it had no right to.