
Question Should Intel or AMD break backward compatibility and do a clean x86-64 reboot?


uzzi38

Golden Member
Oct 16, 2019
1,897
3,822
116
But in conclusion this means that Apple is a lot better at designing CPUs than Intel and AMD. Either x86 has a penalty or Intel/AMD are years behind Apple in CPU design, especially in efficiency, which I don't really think is the case. I think x86 "forces" you into certain design decisions which impact IPC, performance, and efficiency.
Congratulations, you figured it out.

Apple are seriously just way ahead of the rest of the industry as a whole. They're a generation ahead of everyone else in terms of architecture and really, node as well.
 
  • Like
Reactions: Tlh97 and Gideon

moinmoin

Platinum Member
Jun 1, 2017
2,902
3,880
136
Haven't both AMD and Intel largely moved away from the legacy x86 designs of the past to the extent that they're basically a RISC architecture internally with support for the older CISC instructions to be translated into some kind of microcode the processor actually operates on?
That's old news. That step happened with Pentium Pro back in 1995, that's only 25 years ago. ;)

The CISC decoder frontend is essentially the whole difference between x86 and ARM. For all we know the Zen core designs could be ARM based with an x86 frontend (not that I actually believe that, just food for thought).
 
  • Like
Reactions: Carfax83

zir_blazer

Golden Member
Jun 6, 2013
1,069
299
136
I'd like to start with that sentence, because it's mistaken, and that shows us where a realistic way forward for x86 as a whole could be. x86-64 (or x64 or AMD64) is not "too old and burdened with legacy instruction sets that nobody uses anymore"; it's just one of many modes, most of which actually are too old and unused. ;) x86 has so far kept the practice of perfect backward compatibility, which at this point includes honestly ridiculous support for real mode (so 8086 and 8088), 16-bit protected mode (so 80286) and 32-bit protected mode (so 80386).

A clean break could be removing support for those 16- and 32-bit modes and retaining just x86-64. If 32-bit applications are to continue running, the OS would need to offer an emulation layer. Since the introduction of 64-bit UEFI, POST no longer relies on anything older than 64-bit, so that part shouldn't be a problem.

I would not buy an x86 cpu that did not support 32 bit.

32-bit is the default compile target for Visual Studio, last I checked. From the software-writing perspective, write once and run anywhere for decades to come is very attractive.

I insist: Google what an Intel 80376 is. I said this in the first reply in this thread and everyone ignored it.
 

moinmoin

Platinum Member
Jun 1, 2017
2,902
3,880
136
I insist: Google what an Intel 80376 is. I said this in the first reply in this thread and everyone ignored it.
Should that search unearth anything other than the fact that the 80376, which dropped real mode support, was vastly less successful than the 80386EX, which was cut down in other regards but kept real mode? Those are embedded parts from the late 1980s and mid 1990s.
 

Cogman

Lifer
Sep 19, 2000
10,274
121
106
Intel literally did this about 20 years ago with Itanium or IA64. It did not go well.

It's all fine and dandy to want a new instruction set. However, the hard part is getting compilers, developers, and software companies to support your new architecture.

The only hope for a new architecture to take over from x86 is if something like ARM or RISC-V gains popularity. You'll never see (IMO) a proprietary architecture rise.
 

Schmide

Diamond Member
Mar 7, 2002
5,379
295
126
But in conclusion this means that Apple is a lot better at designing CPUs than Intel and AMD. Either x86 has a penalty or Intel/AMD are years behind Apple in CPU design, especially in efficiency, which I don't really think is the case. I think x86 "forces" you into certain design decisions which impact IPC, performance, and efficiency.
Not taking away from the M1, I do not share that conclusion. Framing it as an either-or situation does not take into account the nuances of the platforms.

The M1 trades blows with the competition, most of which came out year(s) before it. This at the very least removes the "lot better" from the above statement.

Further, looking at power: in some ways the M1 is extremely cut down. It has limited on-package RAM, a half to a quarter of the I/O lanes depending on how you look at it, and half the threads.

A better comparison is the Ampere Altra Q80-33 vs the EPYC 7742. For the most part they compete on equal ground: 128 PCIe 4.0 lanes, 8-channel memory, 200-250 W power, etc. (the Q80-33 does have less L3, but is monolithic with more cores). When the software matures, this will certainly be the fight to watch.

Regardless, this is all a distraction. The purpose of this thread is to quantify the cost/benefit of maintaining backwards compatibility in hardware. This may actually be answered by ARMv9, which will make a clean break from 32-bit, setting up otherwise similar CPUs with that one key difference.
 

zir_blazer

Golden Member
Jun 6, 2013
1,069
299
136
Should that search unearth anything other than the fact that the 80376, which dropped real mode support, was vastly less successful than the 80386EX, which was cut down in other regards but kept real mode? Those are embedded parts from the late 1980s and mid 1990s.
Exactly that. There is a reason why Intel never did an 80376 again. You can't go out and cut backwards compatibility like it's nothing; at that point it would be better to go with a new ISA that is better thought out from the ground up, assuming you can push the software ecosystem around as you please, like Apple seems able to do. Also, while real mode was more important back when the 80376 was new than IA-32 is in the current x86-64 realm, IA-32 virtualization is still important. Besides, removing it would achieve nothing beyond saving a few transistors, because even if you free up some opcodes, I can bet that you do NOT want to reuse them, to avoid compatibility issues, so you would just be leaving empty gaps (Intel did reuse the opcodes of a few removed instructions back with the 386/486). Plus, you're theoretically opening a Pandora's box, because you don't really know how much x86-64 compatibility mode relies on 386 protected mode circuitry that is going to be there using transistors anyway.
Just 3 years ago, first-generation Zen had borked VME support (an extension to virtual 8086 mode added in the original 1993 P5 Pentium), causing Windows XP VMs to implode. Are you telling me that AMD didn't even bother to test running Windows XP in a VM, which doesn't seem terribly niche? So if you ask me, there are things that are better left alone. If you're on x86, legacy is here to stay. I would rather throw everything away and start anew than toy too much with removing the baseline.
 
  • Like
Reactions: Carfax83

jeanlain

Member
Oct 26, 2020
126
99
61
The M1 trades blows with the competition, most of which came out year(s) before it.
Years before it? Which X86 CPU(s) do you have in mind?
And even if current X86 CPU cores compete with the M1 firestorm cores in raw perf, they do it at a significantly higher power.
As for I/O, that doesn't have much to do with CPU architecture, does it?
 

Schmide

Diamond Member
Mar 7, 2002
5,379
295
126
Years before it? Which X86 CPU(s) do you have in mind?
And even if current X86 CPU cores compete with the M1 firestorm cores in raw perf, they do it at a significantly higher power.
As for I/O, that doesn't have much to do with CPU architecture, does it?
The AMD Renoir 4800U, which came out in January. So basically a year.

I/O and memory have a lot to do with power. Every lane has to be terminated even if it isn't used. The longer the traces, the greater the power draw. On-board memory saves a lot.

We have a whole thread of M1 vs the world. I'll join you there if you want that to be a focus.

Which is why I put the Altra as the similar platform with similar software stacks.
 

Ajay

Diamond Member
Jan 8, 2001
9,814
4,175
136
The 'opportunity' for a clean break was there with x64, but with two competitors in the mix (AMD/Intel), neither could afford to write off 32-bit instructions and APIs without losing out terribly to the other company. They would also have had to convince Microsoft to create an emulation layer so that most 32-bit programs still worked on Windows. There are a lot of x86 instructions that are now 'emulated' in microcode; the Pentium Pro transitioned from native x86 CISC execution to an x86 decoder that produces proprietary instructions for a RISC back end. Instead, the x86 world evolved in a way that allowed it to keep increasing performance while engaging the largest developer network of any architecture at the time (pretty much killing off its early RISC competitors). The two main x86 rivals are still stuck in the same situation as they were in 2003.

Microsoft is now fully abstracting all their APIs to run over various emulation layers that only call to the kernel as necessary. They've fully embraced the original design goal of NT by tailoring the kernel to a variety of architectures (x86-64 and ARM right now) to remain relevant going forward. MS is also trying to move users and apps more heavily toward the cloud, which is much less platform dependent.
 

Leeea

Golden Member
Apr 3, 2020
1,180
1,414
96
That's what I understand too.

I do wonder if there could be some flag an application could pass to bypass/turn off the x64->microcode component, and just compile the app for the processor-specific uops (in and of itself this probably would make finding the proprietary uop code of each chip maker easier too...). Of course, enabling such a switch would in and of itself require another controller, and would need to ensure all dependencies are also not needing the uop decode. On top of that, I'm not sure how "integrated" the uop decode is into the chip at large for current x64 uarch, so I don't know how hard it would be to turn it off and still have the rest of the CPU work as usual, and I'm also not sure what kind of gains could be expected. It would seem to me that as long as the translation from x64->microcode isn't inherently slower than the rest of the pipeline, then it may be more fruitful to work on other aspects of the pipeline.
What you're suggesting is already here. Setting the x64 flag drops all the 32- and 16-bit instructions, and they get emulated:
"Under Windows 64-bit, 32-bit applications run on top of an emulation of a 32-bit operating system"

On the hobby app I am working on now, which rehashes SHA512 hashes, setting the x64 flag more than doubled the speed.
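Leeea's app isn't public, so as a hypothetical stand-in, the kernel of "rehashing SHA512 hashes" can be sketched in a few lines of Python (the function name and round count are invented for illustration):

```python
import hashlib

def rehash_sha512(seed: bytes, rounds: int) -> bytes:
    """Feed each SHA-512 digest back into SHA-512 for `rounds` iterations."""
    digest = seed
    for _ in range(rounds):
        digest = hashlib.sha512(digest).digest()
    return digest
```

The 64-bit speedup is plausible for this kind of workload because SHA-512 is defined over 64-bit words, so a 32-bit build needs several instructions per word operation.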


Congratulations, you figured it out.

Apple are seriously just way ahead of the rest of the industry as a whole. They're a generation ahead of everyone else in terms of architecture and really, node as well.
The M1 is expensive and loses to the competition. It does get some single-threaded wins against older x86-64 parts, but it only has 4 high-performance threads.

Single-threaded performance was all that mattered a decade ago, but these days even internet browsers are multi-threaded. Games, file compression software, media transcoders: just about anything that needs performance is multithreaded*.

In most applications that need performance, something like a 4800U with its 16 high-performance threads will outperform the M1 while being much cheaper.

If you compare an M1 to a top-of-the-line x86-64 part like a Ryzen 5900X, it is a one-sided slaughter. Each of the 5900X's 24 high-performance threads beats anything the M1 can bring. Yet the 5900X is still somehow cheaper than the M1.


The Altra Q80-33 is a far more interesting competitor. AnandTech recently did a review of it. It is not compelling enough to swap yet, but it has way more potential than the M1.

---------------------------------

*Two decades ago multithreading used to be more of a pain, but these days it is cake:
https://docs.microsoft.com/en-us/dotnet/standard/parallel-programming/task-based-asynchronous-programming
https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/keywords/lock-statement

It is literally built right into the tooling.
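The linked docs are .NET-specific (task-based parallelism and the `lock` statement), but the same pattern, a task pool plus a lock around shared state, looks much alike in any mainstream language; a minimal Python sketch with an invented toy workload:

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

def process(item: int) -> int:
    """Toy stand-in for one unit of real work (a file to compress, a track to encode)."""
    return item * item

results = []
results_lock = Lock()  # analogue of C#'s lock statement

def worker(item: int) -> None:
    value = process(item)
    with results_lock:  # serialize access to the shared results list
        results.append(value)

# The pool hands items to worker threads, like Task-based parallelism in .NET.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(worker, range(8)))
```

The point stands: the scaffolding for safe parallelism is now a library import rather than hand-rolled thread management.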
 
Last edited:

Schmide

Diamond Member
Mar 7, 2002
5,379
295
126
What you're suggesting is already here. Setting the x64 flag drops all the 32- and 16-bit instructions, and they get emulated:
"Under Windows 64-bit, 32-bit applications run on top of an emulation of a 32-bit operating system"
This could be misconstrued.

On an x86-64 OS the processor is in long mode; 32-bit and 16-bit programs run in sub-modes of compatibility mode. Instructions execute in hardware and are not emulated.

I guess the 32-bit OS is technically emulated, although it is more of a virtual environment.

The method by which they provide these OS calls is called thunking. I guess that term never caught on.
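As a side note to the compatibility-mode discussion: whether a given Windows binary runs 64-bit (long mode) or 32-bit (compatibility mode, under WOW64) is recorded in the Machine field of its PE header. A minimal sketch in Python (illustrative only; offsets follow the PE/COFF layout, and the test data is synthetic rather than a real executable):

```python
import struct

# Common Machine values from the PE/COFF image format.
MACHINE_NAMES = {0x014C: "x86 (32-bit)", 0x8664: "x86-64", 0xAA64: "ARM64"}

def pe_machine(data: bytes) -> str:
    """Return the target machine of a PE image given its raw bytes."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE file (missing MZ signature)")
    # e_lfanew at offset 0x3C points to the "PE\0\0" signature.
    pe_offset = struct.unpack_from("<I", data, 0x3C)[0]
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    machine = struct.unpack_from("<H", data, pe_offset + 4)[0]
    return MACHINE_NAMES.get(machine, hex(machine))
```

The loader reads exactly this field to decide which mode the process gets.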
 

amrnuke

Golden Member
Apr 24, 2019
1,169
1,739
106
What you're suggesting is already here. Setting the x64 flag drops all the 32- and 16-bit instructions, and they get emulated:
"Under Windows 64-bit, 32-bit applications run on top of an emulation of a 32-bit operating system"

On the hobby app I am working on now, which rehashes SHA512 hashes, setting the x64 flag more than doubled the speed.
What I meant was bypassing the micro-op decode stage by simply compiling for the CPU-specific "RISC"y micro-ops.

So rather than:
compile for x64 -> program sends x64 instruction with "x64" flag -> decode x64 to uop(s) -> execute uop(s) on core
You'd just go:
compile for uop -> program sends uop instruction with "uop" flag -> execute uop instruction
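Real micro-op encodings are proprietary and undocumented, so purely as a hypothetical illustration (all instruction and uop names below are invented), the decode stage being bypassed here is essentially a translation table:

```python
# Toy decode table: architectural instruction -> micro-ops (all names invented).
DECODE_TABLE = {
    "xor r, r": ["uop_zero r"],                     # common zeroing idiom
    "add r, m": ["uop_load t, m", "uop_add r, t"],  # load-op cracks into two uops
    "aaa":      ["uop_microcode aaa"],              # legacy BCD instruction
}

def decode(program):
    """Frontend sketch: translate each architectural instruction into micro-ops."""
    uops = []
    for insn in program:
        if insn not in DECODE_TABLE:
            raise ValueError(f"#UD: unknown instruction {insn!r}")
        uops.extend(DECODE_TABLE[insn])
    return uops
```

Compiling straight to the right-hand side would skip `decode()`, but it would also freeze those uop names into a de facto public ISA, which is exactly the flexibility the vendors preserve by keeping them hidden.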
 

wlee15

Senior member
Jan 7, 2009
312
26
91
What I meant was bypassing the micro-op decode stage by simply compiling for the CPU-specific "RISC"y micro-ops.

So rather than:
compile for x64 -> program sends x64 instruction with "x64" flag -> decode x64 to uop(s) -> execute uop(s) on core
You'd just go:
compile for uop -> program sends uop instruction with "uop" flag -> execute uop instruction
They don't want to do that because Intel and AMD want the freedom to change the internal ISA for optimization purposes.
 
  • Like
Reactions: Leeea

VirtualLarry

No Lifer
Aug 25, 2001
52,246
7,061
126
Has everyone forgotten that MORE transistors often, but not always, equate to MORE performance? Having "extra" transistors for decoding "obsolete" opcodes doesn't necessarily slow down CPUs. It may increase wattage by some minute fraction while the decoders are enabled (and not power-gated at that moment, during an L1 cache read), but honestly, I see it as a drop in the bucket. As an example of more transistors == more performance, look at AMD's Zen architecture and all of the "Machine Intelligence" in their CPUs. Those features certainly add quite a bit of transistor complexity, but they result in a CPU that performs BETTER than Intel's (at this point) and, hopefully, draws less power.

As far as compatibility for compatibility's sake: "legacy support" on x86/x64 platforms is both a boon and a boondoggle. The thing, as mentioned, is really "legacy corporate applications": things that were compiled twenty years ago and haven't changed since, where the source code may not even be available. It runs, and it works. That's the value of legacy compatibility to the existing ecosystem.

If one of the x86/x64 CPU manufacturers decided to radically "break compatibility", that might provide an avenue for ARM to take over, if applications are going to be re-compiled or emulated anyway in the hypothetical "new, streamlined x64 architecture". It could cost Intel and AMD (and VIA, to a lesser extent) the entire vast x86/x64 "kingdom".
 
  • Like
Reactions: Carfax83

jeanlain

Member
Oct 26, 2020
126
99
61
The AMD Renoir 4800U, which came out in January. So basically a year.

I/O and memory have a lot to do with power. Every lane has to be terminated even if it isn't used. The longer the traces, the greater the power draw. On-board memory saves a lot.

We have a whole thread of M1 vs the world. I'll join you there if you want that to be a focus.

Which is why I put the Altra as the similar platform with similar software stacks.
So that was 10 months before the M1, not years. And were there actual products available in January? Wikipedia indicates a release date in March, whereas M1 Macs were available in volume a mere week after the announcement.
As for I/O, the M1 may not have as many lanes, but it has a lot of modules that other SoCs don't have, like the neural engine, the image signal processing unit, the secure enclave, etc.
And I think that monitoring tools can differentiate between the power consumed by the cores, by the RAM, by the I/O ("uncore"), etc., so we can compare core against core.

I've already contributed to threads about the M1. We know that the M1 may lose against recent X86 CPUs with more than 4 SMT cores if more than 4 threads are used. But when discussing the potential of an architecture, it's important to consider the efficiency of cores and avoid drawing conclusions from comparisons between CPUs with different core counts.
We may extrapolate what an M1X with 8 performance cores could do, and that's probably better than its x86 competitors on the same power budget.
 
  • Like
Reactions: Viknet

Insert_Nickname

Diamond Member
May 6, 2012
4,279
821
126
Microsoft is now fully abstracting all their APIs to run over various emulation layers that only call to the kernel as necessary. They've fully embraced the original design goal of NT by tailoring the kernel to a variety of architectures (x86-64 and ARM right now) to remain relevant going forward. MS is also trying to move users and apps more heavily toward the cloud, which is much less platform dependent.
NT was originally designed to be portable between CPU architectures; it's not a new thing. The current ARM version is more of a re-enablement of this than a new concept.

Fun fact: original NT development was done on the i860, a RISC architecture. So you could say NT is returning to its roots with ARM.

Single-threaded performance was all that mattered a decade ago, but these days even internet browsers are multi-threaded. Games, file compression software, media transcoders: just about anything that needs performance is multithreaded*.
Even then, there are limits to how much single-threaded performance matters. You really can't get faster than instant.

An anecdote: I remember doing MP3 encoding on a Pentium 90 MHz. Back then, that took a while, to put it mildly. I recently had reason to redo some encoding on my Ryzen 1700. Seeing it rip through my music library 16 tracks at a time, with almost instant encoding, was impressive. My '90s self would have been blown away at the thought alone...
 

Schmide

Diamond Member
Mar 7, 2002
5,379
295
126
So that was 10 months before the M1, not years. And were there actual products available in January? Wikipedia indicates a release date in March, whereas M1 Macs were available in volume a mere week after the announcement.
As for I/O, the M1 may not have as many lanes, but it has a lot of modules that other SoCs don't have, like the neural engine, the image signal processing unit, the secure enclave, etc.
And I think that monitoring tools can differentiate between the power consumed by the cores, by the RAM, by the I/O ("uncore"), etc., so we can compare core against core.

I've already contributed to threads about the M1. We know that the M1 may lose against recent X86 CPUs with more than 4 SMT cores if more than 4 threads are used. But when discussing the potential of an architecture, it's important to consider the efficiency of cores and avoid drawing conclusions from comparisons between CPUs with different core counts.
We may extrapolate what an M1X with 8 performance cores could do, and that's probably better than its x86 competitors on the same power budget.
I'm not here to nitpick a few months, nor am I here to declare a holy metric that says one is greater than another. They all have their virtues. This is becoming a RISC vs. CISC debate when it should be about the cost of legacy.

I really feel like I got roped into this by playing the middle ground. I say I don't think the x86 legacy modes cost that much, and it turns into an Apple epeen fight.

You seem to gloss over everything else I said: similar ARM cores, and the dropping of 32-bit compatibility going from ARMv8.2 to ARMv9 (and no, I don't mean ARM9; that was decades ago... fine, it was 14 years ago, give or take a few months).

We could explore all day what makes the M1 super efficient. I will state for the record that IMO it isn't that Apple dropped 32-bit compatibility.
 

VirtualLarry

No Lifer
Aug 25, 2001
52,246
7,061
126
NT was originally designed to be portable between CPU architectures; it's not a new thing. The current ARM version is more of a re-enablement of this than a new concept.

Fun fact: original NT development was done on the i860, a RISC architecture. So you could say NT is returning to its roots with ARM.
They also had NT 3.1 for MIPS, Sparc, and Alpha. (I had beta copies of the version for DEC Alpha.)

See: POSIX.
 

beginner99

Diamond Member
Jun 2, 2009
4,895
1,274
136
Intel literally did this about 20 years ago with Itanium or IA64. It did not go well.
It's not about a new instruction set but about ditching old instructions that haven't been used in any new code in the last two decades.


MS is also trying to move users and apps more heavily toward the cloud, which is much less platform dependent.
Most importantly, it's much more profitable to sell subscriptions than actual software. I remember for Office 2010 (Word, Excel, PowerPoint) I got a family pack with 3 perpetual licenses (valid forever) for $150. Now a single such license costs around $120. Since Office 2010 doesn't get security updates anymore, one can say 10 years at $150 for 3 users = about $5 per user per year. They sell you Office 365 for $55 per year per user, or a family version for $120 per year. It's clear this is mostly about making money and not about the platform.
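The per-user arithmetic above can be checked in a couple of lines (prices as quoted in the post, so treat them as approximate):

```python
def cost_per_user_year(price: float, users: int, years: float) -> float:
    """Spread a one-time license price across its users and useful years."""
    return price / (users * years)

# Office 2010 family pack: $150, 3 perpetual licenses, ~10 years of updates.
perpetual = cost_per_user_year(150.0, 3, 10)
# Office 365: $55 per user per year.
subscription = 55.0
```

On these numbers, the subscription costs roughly eleven times as much per user per year as the old perpetual family pack.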
 

Shivansps

Diamond Member
Sep 11, 2013
3,404
1,004
136
But in conclusion this means that Apple is a lot better at designing CPUs than Intel and AMD. Either x86 has a penalty or Intel/AMD are years behind Apple in CPU design, especially in efficiency, which I don't really think is the case. I think x86 "forces" you into certain design decisions which impact IPC, performance, and efficiency.
Is it proven that ARM is actually that much more power efficient than x86 while actually doing stuff? For example, my RPi 4 and my BT tablet use a similar amount of power and have similar performance; although the RPi 4 is definitely faster in 64-bit, the BT CPU has other stuff in it, like a better iGPU and AES.

I don't know, I really need to see how ARM works in a desktop environment. I still have this impression that it is like my phone: excellent power management that allows for really good idle and near-idle power usage, which makes the battery last 2 or 3 days, but go and open a game, make a WiFi hotspot, or watch YouTube and the battery lasts 4 hours tops.

I really want to see some in-depth tests.
 

amrnuke

Golden Member
Apr 24, 2019
1,169
1,739
106
They don't want to do that because Intel and AMD want the freedom to change the internal ISA for optimization purposes.
Exactly. Additionally, I'm sure they don't want any more of their IP being public-facing than necessary.
 

Cogman

Lifer
Sep 19, 2000
10,274
121
106
Has everyone forgotten that MORE transistors often, but not always, equate to MORE performance? Having "extra" transistors for decoding "obsolete" opcodes doesn't necessarily slow down CPUs. It may increase wattage by some minute fraction while the decoders are enabled (and not power-gated at that moment, during an L1 cache read), but honestly, I see it as a drop in the bucket. As an example of more transistors == more performance, look at AMD's Zen architecture and all of the "Machine Intelligence" in their CPUs. Those features certainly add quite a bit of transistor complexity, but they result in a CPU that performs BETTER than Intel's (at this point) and, hopefully, draws less power.

As far as compatibility for compatibility's sake: "legacy support" on x86/x64 platforms is both a boon and a boondoggle. The thing, as mentioned, is really "legacy corporate applications": things that were compiled twenty years ago and haven't changed since, where the source code may not even be available. It runs, and it works. That's the value of legacy compatibility to the existing ecosystem.

If one of the x86/x64 CPU manufacturers decided to radically "break compatibility", that might provide an avenue for ARM to take over, if applications are going to be re-compiled or emulated anyway in the hypothetical "new, streamlined x64 architecture". It could cost Intel and AMD (and VIA, to a lesser extent) the entire vast x86/x64 "kingdom".
Correct. Most transistors aren't going to logic; rather, they are going to the on-chip cache. If they can throw more transistors at the logic to get higher performance, they will. The power increase isn't really anything to worry about in most cases (floating-point operations are about the only place where it can be somewhat power-intense).

It's not about a new instruction set but about ditching old instructions that haven't been used in any new code in the last two decades.
That's a new instruction set. You'll never convince a single person that it's not. If you are breaking backwards compatibility you really might as well go with an entirely new instruction set.

For example, one thing that would provide a ton of benefit would be fixed-width instructions. If you are going to drop "old" instructions, why not also re-encode the instructions you keep (such as xor) at a fixed 32 bits rather than variable encodings starting at 8 bits?

There is little benefit to dropping seldom-used instructions for instruction-dropping's sake. Modern x86 processors convert all x86 instructions into micro-ops anyway (and then further optimize those micro-ops). So by dropping an old instruction, all you are literally doing is dropping an entry in the x86-instruction-to-micro-op conversion table. That's not something that has any value; those tables require almost no power to operate and nearly zero transistors to maintain. Meanwhile, billions upon billions of transistors are being spent on L1-L3 caches and registers.
 
