Question Should Intel or AMD break backward compatibility and do a clean x86-64 reboot?

Carfax83 · Dec 25, 2020

This is one of the greatest criticisms I hear about x86-64, that it is too old and burdened with legacy instruction sets that nobody uses anymore. The other major criticism has to do with the variable instruction length, but that's not nearly as clear cut in my opinion. From what I have read, it seems that having a variable instruction length can be advantageous at times, and increase instruction density. Also, micro op caches and ops fusion seem to do a great job of mitigating the problem of having to add more decoders to achieve greater instruction level parallelism. So at least there are solid workarounds for that one. But the legacy stuff is another matter entirely.... How disruptive would such a move be if Intel or AMD decided to throw out most or all of the legacy x86 stuff and design a cleaner more modern architecture?

I doubt it would affect end consumers that much, more so the business and enterprise sectors. But maybe I'm wrong. There was also a rumor about this several years back from bits n' chips.it, that perhaps Willow Cove would be the last x86-64 CPU designed by Intel that had wide support for legacy instruction sets. So I guess that would be Golden Cove, assuming the rumor is true.

wlee15 · Dec 31, 2020

zir_blazer said:
Reelevant: https://www.agner.org/optimize/blog/read.php?i=25

I'll note that SSE5 was never adopted rather it was separated into XOP, FMA4, and F16C with the first 2 deprecated by AMD while the last one was adopted by both Intel and AMD.

beginner99 · Jan 1, 2021

Doug S said:
After all, this is basically what Intel tried to do 20 years ago with Itanium.

Which might have worked if Itanium itself wasn't as bad as it was. I think the problem was more how bad itanium was than the breaking of backwards compatibility. At least in server world.

Hulk · Jan 1, 2021

I'm not an expert so go easy on me but since x86 are "converted" into micro-ops in CPU's since the Pentium would it be possible to somehow compile software directly into micro-ops?
I have read that decoding complex instructions into micro-ops is a power hungry and cycle hungry business.

If it were possible to compile straight to micro-ops then the entire decode to micro-ops part of the front end could be skipped all-together it seems?

So I guess what I'm asking is can the x86 instructions be converted into micro-ops by the compiler and would it help the CPU to have a front end that deals only with micro-ops?

JoeRambo · Jan 1, 2021

Hulk said:
So I guess what I'm asking is can the x86 instructions be converted into micro-ops by the compiler and would it help the CPU to have a front end that deals only with micro-ops?

Actual uOps in CPU are rather large, and optimized towards execution intricacies of exact micro architecture. I vaguely remember uOP size of ~90bits already during some old generation of 32bit code and register starvation. Who knows how large they are now?
So even if we take those 90bits as baseline, that is already 11 bytes and that is huge. Code density would go to hell, instructions caches would need to be enlarged and datapaths would need to rise into hundreds of bytes to even keep decode rates as they are now.

There are also always questions about how wise it is to expose your internal CPU workings, you would need to get compilers out before actual chip, now sure how to even keep old uOP format compatible with new CPUs without freezing it completely.
And freezing would mean getting stuck with horrible things.

Depending on compiler is bad, just ask Itanium guys and Mill guys also. Not exactly remembered with love.

zir_blazer · Jan 1, 2021

Hulk said:
I'm not an expert so go easy on me but since x86 are "converted" into micro-ops in CPU's since the Pentium would it be possible to somehow compile software directly into micro-ops?
I have read that decoding complex instructions into micro-ops is a power hungry and cycle hungry business.

If it were possible to compile straight to micro-ops then the entire decode to micro-ops part of the front end could be skipped all-together it seems?

So I guess what I'm asking is can the x86 instructions be converted into micro-ops by the compiler and would it help the CPU to have a front end that deals only with micro-ops?

x86 Instructions are converted into MicroOps since the P6 Pentium Pro, not P5 Pentium.

You would need a Processor that supports bypassing the x86 decoder frontend to feed it MicroOps directly. And also, the MicroOps would be pretty much a new ISA itself, and bound to THAT specific Processor generation. You have no way to do this since unless it is a currently available hidden secret functionality (Which I would not be surprised, yet don't expect to be found, either), you can't send MicroOps to the Processor directly. It is entirely internal.

Hulk · Jan 1, 2021

zir_blazer said:
x86 Instructions are converted into MicroOps since the P6 Pentium Pro, not P5 Pentium.

You would need a Processor that supports bypassing the x86 decoder frontend to feed it MicroOps directly. And also, the MicroOps would be pretty much a new ISA itself, and bound to THAT specific Processor generation. You have no way to do this since unless it is a currently available hidden secret functionality (Which I would not be surprised, yet don't expect to be found, either), you can't send MicroOps to the Processor directly. It is entirely internal.

Yes I forgot the Pentium was the first superscaler, not microo-ops.

cytg111 · Jan 1, 2021

With big.LITTLE designs, and the uuuge amount of cores we are getting, couldnt they just sport a few "old" x64 cores to run legacy apps on and the rest be new and fast cores? Like a slowold.NEWFAST design sort of thing?

.

zir_blazer · Jan 1, 2021

cytg111 said:
With big.LITTLE designs, and the uuuge amount of cores we are getting, couldnt they just sport a few "old" x64 cores to run legacy apps on and the rest be new and fast cores? Like a slowold.NEWFAST design sort of thing? .

Somewhat related... Check THIS.

Actually, when we created Merced (1st Itanic) it was designed to be able to be FULLY backwards compatible (i.e. boot MS-Dos 1.0). 25%-33% of the chip was actually a HARDWARE ia32 to ia64 translation engine.
You could put the chip is EPIC (ia64) mode and everything would run though the normal pipeline or ia32 mode and things 1st ran through the ia32 translator then most of the normal pipline. Yeah, you took a performance hit in ia32 mode, but it was the price you paid for "100%" backwards compatibility.

So, I am not sure why the change to a software emulator, unless:
1) they ditched the hardware emulator to get back some real estate of the die, or
2) they didn't like the switching the chip between ia32 & ia64 bit modes.

Basically, first Itanium had a x86 decoder in it that translated x86-to-IA64, which was obviously bypassed in IA64 mode (I recall reading that early Itaniums had a x86 core on them, but seems to be that translator decoder. I don't recall if early Itaniums were demoed running standard x86 Windows XP or something like that, but is theorically possible). Also, the 460GX Chipset for Itanium had a whole bunch of PC/AT baggage on it, like a PIC (Programmable Interrupt Controller) with a 8259A compatibility mode. Basically, Intel created a new architecture and platform from the ground up then tainted it with the legacy of the previous architecture and platform.
Just think: Who the hell is going to spend huge sums of money on a new ISA to use it with Software that performs poorly on it rather than spending the same money on a native ISA Processor that would do it faster? No point in getting the Hardware without the Software. I think it is either native, or Software emulation. That was a waste of transistors. Nowadays thanks to open source and high level languages, at least on Linux, the porting job of a whole ecosystem is at least realistic. Just see what Raptor Engineering did with the POWER9 Talos II.

Also, I think that the whole big.LITTLE thing is pretty much a disaster, since you pass a whole lot of complexity to the OS CPU Scheduler. Physical Cores, Logical Cores, Core-to-Core Latency, Cache L3 domains, NUMA Nodes, Core Parking, C-States, P-States, Turbo States... Is already complex enough, and they keep adding to it. Now do that with Cores that are on a different ISA. Didn't Intel removed the AVX-512 unit of the big Core on a big.LITTLE Processor (Alder Lake it was?) since all them had to support the same instruction sets?

JoeRambo · Jan 1, 2021

zir_blazer said:
Didn't Intel removed the AVX-512 unit of the big Core on a big.LITTLE Processor (Alder Lake it was?) since all them had to support the same instruction sets?

The rumor is that they did not remove it. Once you disable small cores, AVX512 is back. Outright horrible choice that will kill AVX512 on where it would matter the most.

I agree completely on big.LITTLE being fully stupid solution to the problem desktop does not have -> efficiency at all costs.

I have railed against it before, but even if cores had same feature set, when new runnable task appears - it is impossible to know on what type of core to run it without having time machine. Everything is heuristics, can't blame scheduler for not having data about load characteristics from the future. This CPU is designed to compete in Cinebench versus ZEN, when everyone who cares about those things is already on 32C and those who doesn't ask for 6-8C with as large as possible IPC increase.

moinmoin · Jan 1, 2021

Doug S said:
Well sure, Intel could introduce a new CPU that executes the "x86+" ISA, which is a totally different instruction set that's fixed length

Nothing of "such new instructions could be used to support the decoder handling variable length CISC code faster" you quoted suggests an "ISA, which is a totally different instruction set that's fixed length".

zir_blazer · Jan 1, 2021

As having something to show off for the new year, I decided to publish this.

Reelevant for this Thread:

4.1 - Intel 80286 CPU Overview, the MMU, Segmented Virtual Memory and Protected Mode, support chips
4.2 - Processor IPC and performance, compiler optimizations
4.3 - Screwing up x86 forward compatibility: 286 reset hacks, A20 Gate (HMA), 286 LOADALL, the Intel 80186 CPU missing link
6.1 - The Intel 80386 CPU main features, Virtual 8086 Memory, Flat Memory Model, Paged Virtual Memory
6.2 - The side stories of the 386: The 32 Bits bug recall, removed instructions, poorly known support chips, and the other 386 variants

Been writing and rewriting it on/off for like, 4 or 5 years. I hope someone else enjoys it. Note that I'm hotlinking a lot of heavy PDFs so you may not want to click too much, so I may have to cleanup it...

Don't forget that we're not only carrying legacy x86 stuff from 40 years ago, but also remants from the IBM PC platform. We're tainted with support chips from the Intel 8080/8085 CPU from BEFORE the 8086/8088.

Carfax83 · Jan 1, 2021

zir_blazer said:
Been writing and rewriting it on/off for like, 4 or 5 years. I hope someone else enjoys it. Note that I'm hotlinking a lot of heavy PDFs so you may not want to click too much, so I may have to cleanup it...

Don't forget that we're not only carrying legacy x86 stuff from 40 years ago, but also remants from the IBM PC platform. We're tainted with support chips from the Intel 8080/8085 CPU from BEFORE the 8086/8088.

You must have a lot of experience and knowledge (I'm sure it was researched as well) in the industry to be able to write such a detailed and exhaustive historical account of the x86 architecture.

zir_blazer · Jan 1, 2021

Carfax83 said:
You must have a lot of experience and knowledge (I'm sure it was researched as well) in the industry to be able to write such a detailed and exhaustive historical account of the x86 architecture.

It was actually intending to build a case that we have to burn the PC-x86 platform with a flamethrower and start from scratch again, heh.

There were more juicy details that were fun to research like the introduction of the CPUID instruction (Living before CPUID looked like this), a section that should be pretty much called "Indiana Jones and the Raiders of the Lost Supplement to the Pentium Processor User's Manual Appendix H", and how I almost became crazy when trying to explain in just ONE paragraph what ACPI is supposed to be and do. But I dropped them during last rewrite, which I left where you see it end.

naukkis · Jan 2, 2021

Some things that came to mind from 286's MMU that were usable with paging too was segmentation differentiation of data and code segments. Not until Athlon64 x86 cpu's got execution disable features for flat address range but with segmentation those were implemented by design.

x86 segmentation could have been used to provide execution disable properties to flat addressing with OS separation of instruction and data pages into code and data segments - but that would have come with performance penalty as most modern x86 cpu's have one cycle additional latency for address generation when segment registers aren't zero.

Actually I think that 386's segmentation + paging virtual memory scheme is about ultimate virtual memory concept, segmentation model biggest handicap was that to be able to use virtual memory efficiently just with segmentation segments need to be small in size unnecessarily complicating programming. But with added paging segment size is free as hardware can split segments to smaller size and whole segment doesn't need to be swapped in/out from memory.

But programmers still preferred full flat memory, for 386 that means that they used same segment for data, code and stack instead of separating them even though separate segments would have secured code from data, preventing buffer overflow attacks and similar data corruption.

I think that Intel did make a mistake when they allowed to use same segment for different purposes at same time, there would have also hardware benefits to know which segment holds which data - actually pretty much all cpu designs nowadays will separate code and data at least in L1-cache level.

zir_blazer · Jan 7, 2021

Bumping thisThread to post THIS.

Some guy bothered to dump the 8086 internal Microcode ROM and reverse engineered it, and found some interesing stuff...

Does the microcode have any hidden features, opcodes or easter eggs that have not yet been documented?

It does! Using the REP or REPNE prefix with a MUL or IMUL instruction negates the product. Using the REP or REPNE prefix with an IDIV instruction negates the quotient. As far as I know, nobody has discovered these before (or at least documented them).

One has to wonder what we will find hidden in later generations...

Hulk · Jan 7, 2021

zir_blazer said:
Bumping thisThread to post THIS.

Some guy bothered to dump the 8086 internal Microcode ROM and reverse engineered it, and found some interesing stuff...

One has to wonder what we will find hidden in later generations...

Amazing! Thanks for sharing. The grand daddy of them all!
Well I guess it could be argued that would be the 8008 but you know what I mean.

Search

Question Should Intel or AMD break backward compatibility and do a clean x86-64 reboot?

Carfax83

Diamond Member

wlee15

Senior member

beginner99

Diamond Member

Hulk

Diamond Member

JoeRambo

Golden Member

zir_blazer

Golden Member

Hulk

Diamond Member

cytg111

Lifer

zir_blazer

Golden Member

JoeRambo

Golden Member

moinmoin

Diamond Member

zir_blazer

Golden Member

Carfax83

Diamond Member

zir_blazer

Golden Member

naukkis

Senior member

zir_blazer

Golden Member

Hulk

Diamond Member

TRENDING THREADS