Question: Should Intel or AMD break backward compatibility and do a clean x86-64 reboot?


Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
This is one of the greatest criticisms I hear about x86-64: that it is too old and burdened with legacy instruction sets that nobody uses anymore. The other major criticism has to do with the variable instruction length, but that's not nearly as clear cut in my opinion. From what I have read, it seems that having a variable instruction length can be advantageous at times and increase instruction density. Also, micro-op caches and op fusion seem to do a great job of mitigating the problem of having to add more decoders to achieve greater instruction level parallelism. So at least there are solid workarounds for that one. But the legacy stuff is another matter entirely... How disruptive would such a move be if Intel or AMD decided to throw out most or all of the legacy x86 stuff and design a cleaner, more modern architecture?

I doubt it would affect end consumers that much, more so the business and enterprise sectors. But maybe I'm wrong. There was also a rumor about this several years back from bitsandchips.it that perhaps Willow Cove would be the last x86-64 CPU designed by Intel with wide support for legacy instruction sets. So I guess the first without would be Golden Cove, assuming the rumor is true.
 

zir_blazer

Golden Member
Jun 6, 2013
1,160
400
136
The 'opportunity' for a clean break was there with x64 - but with two competitors in the mix (AMD/Intel), neither could afford to write off 32-bit instructions and APIs without losing out terribly to the other company.
Impossible. x64 without x86 compatibility would have gone nowhere in 2003, as at that point there were quite a lot of more modern 64-bit ISAs to choose from (at least those few that were still alive after the Itanium vaporware attack), and x64 for sure would have been inferior to any of them due to being built on top of x86's late-70s design idiosyncrasies. Besides, AMD did some cleanup, because Long Mode deprecates the Segmentation Unit (the 286 MMU had a Segmentation Unit while the 386 added a Paging Unit, and BOTH could be used at the same time, which is going bonkers. Early 32-bit virtualization software before Intel VT-x and AMD-V actually used both for a technique known as Ring Deprivileging), and removed the useless TSS (Hardware Task Switching, which for some reason was slower than doing the same via software, making it redundant). I think Long Mode also favours SSE over x87 FPU instructions, but if I recall correctly, they were still fully supported.

People are also forgetting that x86-64 was AMD's desperate attempt to create a 64-bit extension to x86 that would allow it to keep being relevant. If Intel had succeeded in moving the PC ecosystem to IA-64 as it wanted to, AMD's x86 license would have severely lost value, and you can bet that Intel wouldn't have licensed IA-64, so that it could create the total monopoly it couldn't with x86. And AMD won precisely due to backwards compatibility. If you can get the important parts of 64 bits (mainly the larger address space) in a processor that also runs standard 32-bit x86 faster than the previous generation, why bother with something else?
Intel actually had to catch up to AMD by keeping Yamhill, which was x86-64/EM64T support in Prescott, as a secret Plan B of sorts. And eventually, Intel began to enable it in an increasing number of models due to Opteron/A64 success.
 

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
NT was originally designed to be portable between CPU architectures. It's not a new thing. The current ARM version is more of a re-enablement of this than a new concept.

Fun fact: original NT development was on the i860, a RISC architecture. So you could say NT is returning to its roots with ARM.
Yeah, I do remember that! I even had an Alpha running NT 4.0 in our lab for doing compatibility tests (jeez, that sucker was fast).
 

Doug S

Platinum Member
Feb 8, 2020
2,203
3,405
136
You seem to gloss over every other thing I said. Similar ARM cores and the dropping of 32-bit compatibility in ARMv8.2 -> ARMv9 (and no, I don't mean ARM9, that was decades ago; fine, it was 14 years ago give or take a few months)

There is no ARMv9 (yet) and 32 bit compatibility was already dropped with ARMv8. The ARMv8 CPUs that are able to run 32 bit code support BOTH ARMv8 and ARMv7. The ARMv8 only CPUs (so far Apple has the only ones) run only 64 bit code.
 
  • Like
Reactions: Ajay

Schmide

Diamond Member
Mar 7, 2002
5,581
712
126
There is no ARMv9 (yet) and 32 bit compatibility was already dropped with ARMv8. The ARMv8 CPUs that are able to run 32 bit code support BOTH ARMv8 and ARMv7. The ARMv8 only CPUs (so far Apple has the only ones) run only 64 bit code.

The way I understand it, ARMv# is the licensed IP (the core version), while AArch64 and AArch32 are the instruction sets that run on it.

Either it's dropped or it exists. As far as I understand, all ARMv8.x-A cores support both AArch64 and AArch32, with the exception of ARMv8-R or -M, which are AArch32 only.

There are implementations that are 64 bit only, but I have no knowledge of whether they removed the AArch32 ability or just trapped it.

The rumors that pass around the nets say that when ARM moves to ARMv9, or whatever Matterhorn and Makalu are labeled as, at the very least Makalu is going to be 64-bit only.
 
  • Like
Reactions: Carfax83

moinmoin

Diamond Member
Jun 1, 2017
4,934
7,620
136
I'd like to get back to the starting point of this thread and summarize the discussion up to now: The original argument was that x86 is hampered by being x86 and its long legacy. The implication is that without that being solved it can't truly be competitive against (at this time) ARM.

The discussion then went on with the points that cutting off the legacy support wouldn't make much difference (since it affects little silicon, is mostly implemented in microcode nowadays where it has essentially no cost, etc.), that previous efforts to sell clean implementations without any legacy support failed in the market (80376 back in 1989...) and that all the legacy stuff can safely be ignored (until it isn't).

Another lesser strand of the discussion mentioned that x86 (since the Pentium Pro back in 1995) no longer is CISC internally, with x86 and ARM both effectively being RISC based. So the whole supposed competitive disadvantage of x86 compared to ARM would boil down to the CISC decoder frontend that x86 requires to run any code and ARM obviously doesn't need. The obvious solution of opening up the internal instruction set has been shot down as not being in the interest of Intel and AMD, as they want to be able to make internal optimizations without affecting any code. And even if that weren't the case, the x86 legacy compatibility would just end up being replaced with some unrelated instruction set not designed for public consumption.

To close this post I propose that the CISC decoder frontend is indeed the "only" disadvantage of x86 that Intel and AMD can actually do something about* without actually moving away from x86 itself. I imagine a solution to that would be extending x86 in a way that allows minimizing the penalty and limitations the existence of the CISC decoder frontend introduces for the core designs as a whole. Anybody have an idea what that could look like?

(* Intel and AMD can't do anything about e.g. Apple simply being a superior chip designer so that kind of argument doesn't add anything to the discussion at hand.)
 

dacostafilipe

Senior member
Oct 10, 2013
771
244
116
To close this post I propose that the CISC decoder frontend is indeed the "only" disadvantage of x86 that Intel and AMD can actually do something about* without actually moving away from x86 itself. I imagine a solution to that would be extending x86 in a way that allows minimizing the penalty and limitations the existence of the CISC decoder frontend introduces for the core designs as a whole. Anybody have an idea what that could look like?

I haven't done a lot of ASM in the past years, so please correct me if I'm wrong, but I seem to remember that the more modern instructions are the more complex ones to decode, whereas "legacy" ones tend to be simpler and require less decoding. No?
 

Nothingness

Platinum Member
Jul 3, 2013
2,371
713
136
There is no ARMv9 (yet) and 32 bit compatibility was already dropped with ARMv8. The ARMv8 CPUs that are able to run 32 bit code support BOTH ARMv8 and ARMv7. The ARMv8 only CPUs (so far Apple has the only ones) run only 64 bit code.
That's a mistake I often read: ARMv8-A didn't remove 32-bit code at all. It introduced 64-bit (AArch64) and improved 32-bit from ARMv7 to ARMv8 (AArch32). Now ARMv8 also allows implementations to have only AArch64, but that's not the same as saying that ARMv8 is 64-bit only, or as saying that 32-bit only goes up to ARMv7.
 

moinmoin

Diamond Member
Jun 1, 2017
4,934
7,620
136
I haven't done a lot of ASM in the past years, so please correct me if I'm wrong, but I seem to remember that the more modern instructions are the more complex ones to decode, where "legacy" ones tend to be more simple and require less decoding. No?
Generalizing, yes, but that doesn't mean all old instructions are less complex and all new ones more complex. Also, the need for decoding doesn't arise from the complexity of the instructions but from the fact that they are not all the same length; the incoming byte stream always needs to be parsed right away. x86 instructions can be between 1 and 15 bytes long, and the decoder aligns them all into the internal RISC representation (micro-ops) and also combines multiple instructions where possible (macro-ops in AMD Zen's parlance).
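To make the boundary problem concrete, here is a minimal C++ sketch with a made-up toy encoding (nothing like real x86 prefixes, ModRM or SIB bytes; the length is simply baked into the first byte). Even in this simplified form, the start of instruction N+1 is only known after instruction N has been examined, which is the serial dependency that makes wide parallel decode of a variable-length ISA hard:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy variable-length "ISA": the low 4 bits of each instruction's first byte
// give its total length (1..15 bytes). Real x86 length decoding is far messier,
// but the structural problem is the same.
std::vector<size_t> find_boundaries(const uint8_t* bytes, size_t n) {
    std::vector<size_t> starts;
    size_t pos = 0;
    while (pos < n) {
        starts.push_back(pos);
        size_t len = bytes[pos] & 0x0F;       // length is hidden inside the encoding itself
        if (len == 0 || pos + len > n) break; // malformed or truncated stream
        pos += len;                           // only now do we know where the next instruction starts
    }
    return starts;
}
```

A fixed-length ISA skips this walk entirely: every 4-byte boundary is an instruction start, so an arbitrarily wide decoder can grab its slice of the fetch block without waiting on its neighbours.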

 

Nothingness

Platinum Member
Jul 3, 2013
2,371
713
136
To close this post I propose that the CISC decoder frontend is indeed the "only" disadvantage of x86 that Intel and AMD can actually do something about* without actually moving away from x86 itself. I imagine a solution to that would be extending x86 in a way that allows minimizing the penalty and limitations the existence of the CISC decoder frontend introduces for the core designs as a whole. Anybody have an idea what that could look like?

(* Intel and AMD can't do anything about e.g. Apple simply being a superior chip designer so that kind of argument doesn't add anything to the discussion at hand.)
I can think of three things that can be done to ease decoding:
  1. additional bits in the instruction cache (to get instruction length earlier, for instance)
  2. reencoding of instructions in the icache
  3. use of a uop cache.
uop caches are extensively used nowadays, but they contain fewer instructions than icaches due to their wide encoding (and being more complex). I guess 1 is already in use. For 2, I'm not sure it's ever been tried on x86 CPUs.
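As a rough illustration of what option 1 could look like conceptually, here is a short C++ sketch (purely an assumption of a possible structure; the 64-byte line size and field names are invented, and real predecode bits are produced and consumed by hardware, not software):

```cpp
#include <array>
#include <bitset>
#include <cstddef>
#include <cstdint>

// Hypothetical predecode metadata kept alongside a 64-byte I-cache line: a
// "start-of-instruction" bitmap filled in on the first (serial) decode of the
// line. Later fetches of the same line can read all instruction starts in
// parallel instead of re-walking the byte stream. Instructions straddling a
// line boundary are ignored in this toy model.
struct PredecodedLine {
    std::array<uint8_t, 64> bytes{};  // cached instruction bytes
    std::bitset<64> starts;           // bit i set => an instruction begins at byte i
    bool predecoded = false;          // bitmap is valid only after the first pass
};

// First pass: walk the line serially (the expensive part) and record boundaries.
// 'length_of' stands in for whatever the real length-decode logic would be.
template <typename LengthFn>
void predecode(PredecodedLine& line, LengthFn length_of) {
    for (size_t pos = 0; pos < line.bytes.size(); ) {
        line.starts.set(pos);
        size_t len = length_of(&line.bytes[pos]);
        if (len == 0) break;          // guard against a bogus length result
        pos += len;                   // the serial dependency lives only in this pass
    }
    line.predecoded = true;           // subsequent fetches skip the serial walk
}
```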
 

beginner99

Diamond Member
Jun 2, 2009
5,208
1,580
136
To close this post I propose that the CISC decoder frontend is indeed the "only" disadvantage of x86 that Intel and AMD can actually do something about* without actually moving away from x86 itself.
I know little to nothing about electrical engineering / actual chip design, so be tolerant if what follows is nonsense. Isn't the fact that this decoder is needed for x86 while it's not needed for ARM inherently a disadvantage, especially in terms of efficiency, that simply can't be overcome? Also, doesn't the fact that the decoder is needed affect (limit) the chip design itself? Again, I know too little. But making the chip wider would also mean the decoder needs to decode faster (bigger, more transistors, more power)? So the trade-off between running at higher frequency vs being wider swings towards higher frequency?

Thinking in the complete opposite direction: can the decoder actually be an advantage in certain situations? Like how JIT compilation can be faster in some cases than statically compiled binaries?
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
There is no ARMv9 (yet) and 32 bit compatibility was already dropped with ARMv8. The ARMv8 CPUs that are able to run 32 bit code support BOTH ARMv8 and ARMv7. The ARMv8 only CPUs (so far Apple has the only ones) run only 64 bit code.

You got this wrong. Backwards compatibility is part of the ARMv8-A spec, as it contains the ARM32 mode along with the specification of switching between the modes. There is no special notion of an "ARMv8 only CPU".
Likewise, there are no ARMv8-A CPUs that are ARMv7-A compliant - it's just that ARMv8-A CPUs have, per spec, an ARM32 mode, which makes them compatible with the instruction set specification of ARMv7-A. The inclusion of the ARM32 mode has been made optional in later revisions of the ARMv8-A spec, which makes the Apple cores still ARMv8-A compliant.
 
Last edited:

Doug S

Platinum Member
Feb 8, 2020
2,203
3,405
136
To close this post I propose that the CISC decoder frontend is indeed the "only" disadvantage of x86 that Intel and AMD can actually do something about* without actually moving away from x86 itself. I imagine a solution to that would be extending x86 in a way that allows minimizing the penalty and limitations the existence of the CISC decoder frontend introduces for the core designs as a whole. Anybody have an idea what that could look like?

The only disadvantage of the CISC decoder is the pain of decoding wildly different lengths of instructions. Instructions that "do more than one thing" are simple to separate into multiple uops, but decoding x86 instructions is a problem when you don't know in advance how many bytes each one is - you don't know until you begin the decoding process.

There is no possible way to "extend" x86 with fixed or even relatively fixed instruction lengths while still being binary compatible with x86-64.

The one thing that makes x86 harder to decode is the one thing you cannot fix. If you tried to introduce some sort of fixed length x86 instruction set it would no longer be x86, and you might as well be switching to ARM or creating something new from scratch at that point.
 
  • Like
Reactions: Saylick and Vattila

Schmide

Diamond Member
Mar 7, 2002
5,581
712
126
Isn't the fact that this decoder is needed for x86 while it's not needed for ARM inherently a disadvantage, especially in terms of efficiency, that simply can't be overcome? Also, doesn't the fact that the decoder is needed affect (limit) the chip design itself? Again, I know too little. But making the chip wider would also mean the decoder needs to decode faster (bigger, more transistors, more power)? So the trade-off between running at higher frequency vs being wider swings towards higher frequency?

Saying one is inherently disadvantaged is short-sighted.

All modern OoO (out of order) processors need a decoder. How complex that decoder has to be is the crux of this argument. It's actually a statistical "value of the game" problem.

For RISC you have a fixed (usually 32-bit) instruction with limited functionality. Because of this you may need multiple instructions to provide the same functionality as a CISC instruction. For example: because some of the 32-bit instruction word has to encode the operation, you have a limited budget for immediate values. ARM solves this by encoding them as a 4-bit rotation with an 8-bit value. So to load an arbitrary 32-bit value using only immediate data you would need up to 4 instructions. The variable size of CISC allows this to be encoded in one instruction.

Because of the above, RISC uses a load/store model. This generally means that to get something done, memory is first loaded into a register, operated upon, then stored back to memory. Although all modern computers internally use this model, a CISC instruction can use immediate data as a value or as a memory reference. So what would be 2 instructions on RISC, a load and an operate, would be one instruction on CISC.

While RISC has to divide up some operations to get data into the registers, once it's there, it can do a bit more. ARM and most RISC implementations have the inherent ability to use 3-operand instructions (a = b + c), while on x86 it is 2-operand (a = a + b). So if you want to do a non-destructive operation, the destination value must be copied beforehand. This is the case where x86 takes 2 operations to do what ARM does in one. Note: in x86 there are vector instructions that have a 3-operand format.

The trade-offs really come down to complexity vs size.

If you are able to encode instructions that do more, you will have less memory pressure (more fits in the cache). Conversely, if you can load and decode simpler instructions at a greater rate, you can make up for their increased number.
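To make the immediate-budget point above concrete, here is a small C++ check (my own illustration) of whether a 32-bit constant fits the 8-bit-value-plus-4-bit-rotation form described above; constants that fail this test need extra instructions or a load from memory:

```cpp
#include <cstdint>
#include <cstdio>

// Rotate a 32-bit value right by n bits.
static uint32_t ror32(uint32_t v, unsigned n) {
    n &= 31;
    return n ? (v >> n) | (v << (32 - n)) : v;
}

// True if 'value' can be encoded as a classic A32 data-processing immediate:
// an 8-bit constant rotated right by an even amount (0, 2, ..., 30).
bool fits_arm_immediate(uint32_t value) {
    for (unsigned rot = 0; rot < 32; rot += 2) {
        uint32_t undone = ror32(value, 32 - rot);  // rotate left by 'rot' to undo the rotation
        if ((undone & ~0xFFu) == 0) return true;   // fits in 8 bits once un-rotated
    }
    return false;
}

int main() {
    // 0xFF000000 fits (0xFF rotated right by 8); 0x12345678 does not, so a
    // compiler has to build it from multiple instructions or load it from memory.
    std::printf("%d %d\n", fits_arm_immediate(0xFF000000u), fits_arm_immediate(0x12345678u));
}
```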
 
Last edited:

Thala

Golden Member
Nov 12, 2014
1,355
653
136
The only disadvantage of the CISC decoder is the pain of decoding wildly different lengths of instructions. Instructions that "do more than one thing" are simple to separate into multiple uops, but decoding x86 instructions is a problem when you don't know in advance how many bytes each one is - you don't know until you begin the decoding process.

Precisely. Different-length instructions make it very hard, if not impossible, to implement wide decoders efficiently. As this is a well-known fact, no architect these days would come up with an ISA with fully variable-length instruction encoding - that is like shooting yourself in the foot. However, back in the 70s this was not much of an issue, as the CPUs of those days were not superscalar at all and instruction density was a bigger concern.

The one thing that makes x86 harder to decode is the one thing you cannot fix. If you tried to introduce some sort of fixed length x86 instruction set it would no longer be x86, and you might as well be switching to ARM or creating something new from scratch at that point.

That's why Intel and to some extent AMD are sitting between a rock and a hard place. Likewise, increasing the GP register count to a more reasonable 32 is practically impossible - as is changing the memory model to a weakly ordered one. Total store ordering (TSO) is as relaxed as they can go.
 
Last edited:
  • Like
Reactions: Tlh97 and Vattila

wlee15

Senior member
Jan 7, 2009
313
31
91
To me the obvious path forward is with large op caches; after all, the op caches in the Zen architectures can already issue 8 decoded instructions per clock. Wouldn't be surprised if Zen 4 went to a 16K op cache and maybe even cut back on the x86 decoders.
 

Gideon

Golden Member
Nov 27, 2007
1,608
3,573
136
To me the obvious path forward is with large op caches; after all, the op caches in the Zen architectures can already issue 8 decoded instructions per clock. Wouldn't be surprised if Zen 4 went to a 16K op cache and maybe even cut back on the x86 decoders.
There is another approach that I hope more x86 architectures will adopt. Dual decode engines (working on different branches) like Tremont:
Each decode engine, when dealing with different branch predictions, can take a separate instruction stream. This allows for a higher average utilization across both of the 3-wide decode engines compared to a single 6-wide engine, but when a branch isn’t present it means that one of the decode engines can be clock gated to save power. For a single instruction stream, the Tremont design is actually only 3-wide decode, with a 4-wide dispatch.


(Technically Intel states that, through microcode, they can change the decode engines to act as a single 6-wide implementation rather than dual 3-wide engines. This won’t be configurable to the OEM, but based on demand Intel may make specific products for customers that request it.)


So just to clarify, Tremont does not have a micro-op cache. When discussing with Intel about the benefits of this dual decode engine design compared to having a micro-op cache, Intel stated that a micro-op cache can help utilize a wide-decode design better, but with a smaller per-engine decode size, they were able to see a performance uplift as well as save die area by using this dual-engine design. Intel declined to comment which one was better, but we were told that given the die size, power envelope of Atom, and the typical instruction flow of an Atom core, this design yielded the better combination of performance, power, and area.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
To me the obvious path forward is with large op caches; after all, the op caches in the Zen architectures can already issue 8 decoded instructions per clock. Wouldn't be surprised if Zen 4 went to a 16K op cache and maybe even cut back on the x86 decoders.

Pentium 4 did both. Its op cache (the trace cache) was different, but largish, at the cost of L1i size and cutting back on x86 decoders. We all know how it went.

At the end of the day you still need all 3: wide decoders, a large L1i and a huge uOP cache. Cold starts, context switches, syscalls, security constraints etc. will require clearing "cached state", and if the decoders are the bottleneck it will take ages to reach steady state again.

I vaguely remember that at some stage in the past the uOP cache entry size was ~90 bits already; no idea if it is larger or smaller now, but they don't come cheap. Whatever the entry size is now, it is easy to imagine even the simplest of operations, which take only a few bytes in the icache, getting "blown up" to uOP cache entry size.

The fact that ARM designs like the Neoverse stuff can get away with 4 decoders, no uOP cache and a comparatively tiny ROB and still have the performance they have is a testament to the woes of x86 frontends.
 
  • Like
Reactions: Vattila and Drazick

Doug S

Platinum Member
Feb 8, 2020
2,203
3,405
136
That's why Intel and to some extent AMD are sitting between a rock and a hard place. Likewise, increasing the GP register count to a more reasonable 32 is practically impossible - as is changing the memory model to a weakly ordered one. Total store ordering (TSO) is as relaxed as they can go.

I think you are overstating the difficulty. The difference between TSO and ARM's memory model is lost in the noise except in very specific circumstances. Likewise, the difference between 16 and 32 registers isn't that large - it was something like a few percent in the models I recall, and while there was a further, even smaller advantage at 64, you'd lose too many bits in a 32-bit instruction word so it wouldn't make sense. If someday someone adopted a 48- or 64-bit instruction word I'd expect at least 64 registers and perhaps more for FP/SIMD (the limitation starts to become save/restore during context/protection-domain switches, so if you want more you might add a bit to know which ones have been "touched" and only save those).

While x86 has some unnecessary complications from the variable-length instructions, I think people overstate the problem. It doesn't affect clock rate or instruction throughput - what it affects is decode latency. So when you branch/jump to a new location that's not already been decoded, it probably adds a cycle or two vs a fixed-length ISA. Branch latency outside of tight loops (where you are hitting the uop cache) doesn't have a big influence on overall performance, so this shouldn't either. Probably at worst another low single-digit hit like the smaller register set, and a few percent larger "core" (i.e. excluding all caches) from the added logic to tell the difference between all those different instruction lengths.

If all the x86 handicaps combined equal even 5% versus a fixed-length RISC ISA, I'd be shocked. And <0.5% core size w/ L1/L2. Would it be worth Intel/AMD tossing out decades of backwards compatibility for that small of a gain? Hardly.
 

moinmoin

Diamond Member
Jun 1, 2017
4,934
7,620
136
I can think of three things that can be done to ease decoding:
  1. additional bits in the instruction cache (to get instruction length earlier, for instance)
  2. reencoding of instructions in the icache
  3. use of a uop cache.
uop caches are extensively used nowadays, but they contain fewer instructions than icaches due to their wide encoding (and being more complex). I guess 1 is already in use. For 2, I'm not sure it's ever been tried on x86 CPUs.
Note that the uop cache and optimizations thereof are still a rather new development. While Intel already had a paper about it in 2001, the first implementation only appeared in Sandy Bridge in 2011. AMD only implemented it with Zen in 2017.

A patent by AMD toyed with the idea of storing the decoded representation of code in L1 and L2 (possibly also L3? RAM? don't remember) to avoid having to decode it again. This would introduce the possibility of all code eventually being available in the decoded state, assuming the size of all decoded code repeatedly run doesn't exceed the cache/memory size. Can't find said patent right now, but I'm sure @DisEnchantment shared it before.
Edit: This was the patent I thought of. Via post.

While that doesn't reduce the initial latency penalty of the CISC decoder, if Intel/AMD expand on these possibilities, the extent of caching over time could get an x86 system to a point where this latency happens only on the first run of a given piece of code. As prefetching and speculative execution happen as well, the effective latency may approach nil.

The only disadvantage of the CISC decoder is the pain of decoding wildly different lengths of instructions. Instructions that "do more than one thing" are simple to separate into multiple uops, but decoding x86 instructions is a problem when you don't know in advance how many bytes it is - you don't know until you begin the decoding process.

There is no possible way to "extend" x86 with fixed or even relatively fixed instruction lengths while still being binary compatible with x86-64.

The one thing that makes x86 harder to decode is the one thing you cannot fix. If you tried to introduce some sort of fixed length x86 instruction set it would no longer be x86, and you might as well be switching to ARM or creating something new from scratch at that point.
I disagree that that's something you can't fix (though our definitions of "fix" may well differ). Introducing new instructions doesn't break binary compatibility; the new code just runs on the CPUs that support said new instructions, so applications would contain respective code paths. And such new instructions could be used to help the decoder handle variable-length CISC code faster.

I'm sure this would be the stupidest possible implementation, but you could introduce a "header" instruction that does nothing other than tell the decoder that an ASM code loop follows that contains X instructions with A, B, C... lengths. This kind of info could be generated by the compiler at compile time. Instead of just lengths, one could also group instructions by length and purpose/affected area so that the code can be passed to the specific part of the core (think IU and FPU, possibly much finer grained) before decoding. I'm sure there are other, better possibilities for supporting the decoder at compile time.
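Just to sketch what such a hypothetical length-header hint might buy the front end (purely illustrative, not a real or proposed x86 extension; the struct and function names are invented):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical hint a compiler could emit ahead of a block of code: the byte
// lengths of the next N instructions. Nothing like this exists in any real ISA.
struct LengthHint {
    std::vector<uint8_t> lengths;  // lengths (in bytes) of the instructions that follow
};

// With the hint, every instruction start in the block is known from a simple
// running sum, without decoding a single instruction byte. The resulting slices
// could then be handed to independent decoders in parallel.
std::vector<size_t> starts_from_hint(const LengthHint& hint, size_t block_base) {
    std::vector<size_t> starts(hint.lengths.size());
    size_t offset = block_base;
    for (size_t i = 0; i < hint.lengths.size(); ++i) {
        starts[i] = offset;
        offset += hint.lengths[i];
    }
    return starts;
}
```

The obvious catches are that the hint costs code size, has to stay consistent with the bytes it describes, and the hardware would still need a fallback path for unhinted code - roughly the trade-off that predecode bits and uop caches already make today.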
 
Last edited:

naukkis

Senior member
Jun 5, 2002
702
571
136
I think you are overstating the difficulty. The difference between TSO and ARM's memory model is lost in the noise except in very specific circumstances.

And those specific circumstances are exactly what faster cores run into. TSO prevents writing to the cache efficiently: no matter how many store ports your CPU has, one store waiting on coherence/memory stalls the whole memory subsystem. And TSO even prevents reusing store-buffer slots, increasing that cache traffic. So with the ARM memory model a CPU core can have as many store ports as needed, each working as independently and effectively as a single store port would; with TSO only a single write to the cache per thread can happen freely, and all other writes have to be linked together, making the whole memory subsystem far less effective than with a relaxed memory model.

This is clearly seen in those memory benchmarks AnandTech does: even small Neoverse N1 cores can saturate the whole system's write bandwidth with only one thread, where x86 systems simply can't, as stores to the cache have to be strictly ordered.

And TSO isn't something useful. It's totally worthless: any ARM design could emulate TSO with its synchronization instructions, but no x86 design can offer the better performance of relaxed memory operations.
 
Last edited:

dacostafilipe

Senior member
Oct 10, 2013
771
244
116
And TSO isn't something useful. It's totally worthless: any ARM design could emulate TSO with its synchronization instructions, but no x86 design can offer the better performance of relaxed memory operations.

If weak models are the "best thing around", why are we not all running PowerPC?

PS: TSO does increase the predictability of your cache, no small feat. This task needs to be handled by the compilers on weaker memory models.
 
  • Like
Reactions: Vattila

naukkis

Senior member
Jun 5, 2002
702
571
136
PS: TSO does increase the predictability of your cache, no small feat. This task needs to be handled by the compilers on weaker memory models.

How? As I said, the only thing TSO does is guarantee that other cores see a core's stores in program order. That gives nothing to cache predictability, and if the programming is done right it actually does nothing at all. The ARM memory model can be programmed to TSO too: just replace every store with a store-release.
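For anyone wanting to see what "replace every store with a store-release" looks like in practice, here is a small C++ sketch of my own (not from the thread). With std::atomic, a release store compiles to an ordinary MOV on x86, because TSO already provides the ordering, and to a store-release instruction (STLR) on AArch64:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int>  data{0};
std::atomic<bool> ready{false};

void producer() {
    data.store(42, std::memory_order_relaxed);
    ready.store(true, std::memory_order_release);  // "store-release": all earlier stores become visible first
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { /* spin until the flag is visible */ }
    assert(data.load(std::memory_order_relaxed) == 42);  // guaranteed by the release/acquire pairing
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```

If the flag were written with memory_order_relaxed instead, a weakly ordered machine would be allowed to make ready visible before data, which is exactly the reordering TSO forbids by default.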
 
  • Like
Reactions: Vattila

Doug S

Platinum Member
Feb 8, 2020
2,203
3,405
136
I disagree that that's something you can't fix (though our definitions of "fix" may well differ). Introducing new instructions doesn't break binary compatibility; the new code just runs on the CPUs that support said new instructions, so applications would contain respective code paths. And such new instructions could be used to help the decoder handle variable-length CISC code faster.

Well sure, Intel could introduce a new CPU that executes the "x86+" ISA, which is a totally different instruction set that's fixed length, but this CPU can also execute "x86 classic". They tell everyone "in x years we will be selling CPUs that only execute x86+, so you will want to port by then". Do you think that has even the remotest chance of success?

After all, this is basically what Intel tried to do 20 years ago with Itanium. They deliberately withheld a 64 bit version of x86 because they wanted to use the 64 bit transition that the whole PC market would eventually have to go through as a wedge to migrate their customer base from the top down. However AMD gained a bit of life at just the wrong time for Intel's scheme, and was able to get Microsoft to support their 64 bit extension of x86 which was all that was needed to doom Intel's scheme.

Today Microsoft already supports ARM, so if developers are going to port to a totally different ISA why should they choose Intel's x86+ over ARM - or more to the point, why should they choose only one of those alternatives if they are going to bother porting at all? I mean, assuming AMD decides to stick with x86 classic either because they feel that's the best strategy or because Intel won't license x86+ to them, developers wouldn't need to port at all.

Those devs will know Microsoft isn't Apple, and will support some way of running x86 classic binaries for many many years to come, even if AMD went along with the x86+ plan, so what's the rush? Hell, Microsoft still supports running DOS stuff, they'd probably still have the x86+->x86 classic JIT in place in whatever the last version of Windows is decades from now even if hardly anyone ever used it.
 
  • Like
Reactions: Insert_Nickname