Question What's preventing AMD and Intel from widening their pipelines?

SsupernovaE · Jan 2, 2021

As we all know, Apple's newest architecture is able to retire eight instructions per clock under ideal circumstances. What is the technical reason preventing x86 from exceeding 4-5 instructions per clock?

Is it even possible? And if so why hasn't it been already implemented.

Feel free to ELI5 or go into detail.

scannall · Jan 2, 2021

SsupernovaE said:
As we all know, Apple's newest architecture is able to retire eight instructions per clock under ideal circumstances. What is the technical reason preventing x86 from exceeding 4-5 instructions per clock?

Is it even possible? And if so why hasn't it been already implemented.

Feel free to ELI5 or go into detail.

This guy goes into great detail on this subject in this article.

amrnuke · Jan 2, 2021

scannall said:
This guy goes into great detail on this subject in this article.

On that article he says, "AMD Ryzen Accelerated Processing Unit (APU) which combines CPU and GPU (Radeon Vega) on one silicon chip. Does however not contain other co-processors, IO-controllers, or unified memory."

Ryzen processors have IO controllers on-die for all APUs... and on-chip for all of their CPUs. It's odd for someone talking about the subject to not have even done that basic research, and then to make a clearly false error of commission.

Leeea · Jan 2, 2021

SsupernovaE said:
Is it even possible?

Yes

SsupernovaE said:
And if so why hasn't it been already implemented.

Diminishing returns.

CISC gets more done per instruction
RISC gets more instructions, and thereby needs to retire more

It has been that way since the beginning of time.

Why does not apple have 16 instructions retired per clock? Same reason as CISC, at a certain point your transistors are better spent elsewhere.

itsmydamnation · Jan 2, 2021

SsupernovaE said:
As we all know, Apple's newest architecture is able to retire eight instructions per clock under ideal circumstances. What is the technical reason preventing x86 from exceeding 4-5 instructions per clock?

Is it even possible? And if so why hasn't it been already implemented.

Feel free to ELI5 or go into detail.

So this is just pretty much ARMy BS, Andrei's A-14 article ( while really good) probably didn't help with this ....lol . what you need to look at is what an instruction is, across ARM ISA , X64 , SIMD extensions , things like scatter gather etc. X64 instructions are more dense , but ARM isn't pure RISC either.

So intel sunnycove has 1 complex decoder for emitting upto 4 uops and 4 simple decoders for 4 uops , an op cache for emitting upto 6 uops. What intel has is a limit after this point of only 6 uops into the micro op queue. I believe Intel can retire/commit 6 ops a cycle.

AMD zen 2/3 has 4 complex decoders emitting upto 8 uops and an op cache for emitting upto 8 uops. AMD has a limit further down the pipeline of dispatching 6 ops per clock. AMD can retire/commit 8 ops a clock.

ARM A-78 can emit 4 uops from decoders and 6 uops from uop cache and can dispatch 6 uops a cycle. A-78 can retire 6 ops a cycle

A-14 who knows because they dont tell you, 8 somethings decoded , loop/trace/op cache , queue / dispatch / retire widths etc no idea.

So what i would like to point out is that if instruction throughput was really such a big deal in limiting performance both AMD and Intel has bottlenecks post decode to hit the same width as A-14 and is no less wide the A-78.

Another interesting point is that Tremont has 6 decoders that can operate as either 6x1 instruction streams or 3x2 instructions streams. Intel has released it as 3x2 because only very specific workloads actually benefited from being able to decode 6 instructions from a single stream. Im guessing we will see those SKU's for things like ethernet switch control plane / packet punt* CPU's ( atom line is actually used a lot of this ).

What might be more interesting to look at is cold instruction latency, how long does it take and x86 core in cycles to decode an instruction vs an ARM core, but that's probably not super relevant and is more edge case, the entirety of a modern OOOE core with speculative execution is about avoiding exactly that.

packet punt is where the SOC ( switch on chip) SOC (lol) cant process the packet in its internal structures because its to complex, some switching at this point will just drop the packet , others will forward it out of the standard data plane to a general purpose CPU for processing.

Carfax83 · Jan 2, 2021

scannall said:
This guy goes into great detail on this subject in this article.

I'm just a layman but even I know a lot of what's in that article is B.S!

jamescox · Jan 2, 2021

Leeea said:
Yes

Diminishing returns.

CISC gets more done per instruction
RISC gets more instructions, and thereby needs to retire more

It has been that way since the beginning of time.

Why does not apple have 16 instructions retired per clock? Same reason as CISC, at a certain point your transistors are better spent elsewhere.

RISC and CISC are not really applicable anymore. ARM is not really RISC. RISC is reduced instruction set computing. ARM has a massive number of very specialized instructions that do not actually fit with the original RISC paradigm. It seems like RISC and CISC have become terms just used to differentiate between x86 derivatives and everything else, which means they are not actually useful anymore.

AMD64 still has a lot of baggage, but modern compilers generally aren’t going to issue instructions that do not perform well on modern processors, so a lot of the baggage is just sitting in microcode and never used unless you run some really old code. I suspect that a lot of AMD64 and ARM instruction streams would actually look very similar due to AMD64 compilers using the highest performance instructions to get the job done.

ARM still has the RISC-like, simpler instruction encoding (fixed length) and simpler addressing modes and such. I am not an expert, but I would say that the limitations on number of execution units has more to do with the scheduler than anything else. Integer execution units are actually very simple and take little die area. The decoders are more complex for x86 derivatives, but they are not actually decomposed into RISC-like instructions. They are decoded into a wide internal representation with a bunch of extra information attached.

The complex part is the schedulers. I believe they actually grow non-linearly with the number of execution units. The schedulers are ridiculously complex. They have to avoid data hazards due to dependent instructions. They have register renaming since AMD64 only has 16 GP registers. Zen 3 has 192 physical registers. You also have speculative execution. The internal representation and the schedulers have to track all of that information in addition to the actual instrution operands. There is going to be a point where adding more execution units increases the scheduler complexity significantly with zero performance return.

I suspect that Apple processors do very well mostly due to cache design; I would expect execution width has little to do with it. This includes automatic prefetch and such. The problem has been how to keep the core fed for a long time. I tried some test with bit vectors a long time ago since someone said that using a full byte per bit was faster. It wasn’t, even though the cpu has to do a bunch of bit masking which is extra instructions. The smaller cache footprint may have won. Instructions are cheap at 3 to 5 GHz though. A lot of code only reaches an IPC of around 1 due to waiting on the memory system, so the enthusiast idea that some super wide execution core is going to beat everything else is not reality.

CluelessOne · Jan 3, 2021

I would rather have Intel, AMD, Microsoft and the 5 largest PC OEM (Dell, Lenovo, HP, Asus and Acer) sit together and strategize together how to deal with Apple challenge to their existence. Make no mistake if the premium PC market goes all Apple all of them are dead.

Some of the things that can be done:
1. Deliver and sell CPUs that meets a certain minimum feature. And I don't mean like now where it's stopped at SSE 4. Minimum is AVX 2, AMD 64 only, VTd etc. A current relevant minimum. Think all Skylake features. Update the minimum every 5 years.
2. Have Microsoft demand all software to support the minimum CPU feature as above within 5 years to the agreement.
3. Microsoft should push usage of the modern API. Deprecate old API and overhaul their NET and VC ecosystem. Make the next NET and VC redistribution a superset of the previous one. It's ridiculous we need all the older NET and VC++ distribution to run an older software.
4. OEM should make better hardware drivers and distributed only via Windows Update mechanism. One stop for all updates.
5. Intel's Project Athena (minimum hardware configuration on laptop) is a good idea. Although it is restricted to Intel's hardware currently, no reason for not making it to be more open performance standard. Make OEM commit to build only computers that meets that kind of standard within the next 5 years.
Those guys have a lot of smart people and as long as they look for the long term, they can do a lot for the longevity of their products. At the minimum Intel, AMD and Microsoft should do this already.

Insert_Nickname · Jan 3, 2021

CluelessOne said:
1. Deliver and sell CPUs that meets a certain minimum feature. And I don't mean like now where it's stopped at SSE 4. Minimum is AVX 2, AMD 64 only, VTd etc. A current relevant minimum. Think all Skylake features. Update the minimum every 5 years.

Intel really shot themselves in the foot with their artificial segmentation between Celeron/Pentium and Core. Making AVX(2) Core exclusive mean we're stuck at the SSE4.2 baseline for a long time. Even Comet Lake Celeron/Pentium CPUs don't support AVX yet.

Thankfully, x86_64 has to use SSE2 minimum so at least if you're using x64 versions, you're not running legacy x87 code.

CluelessOne said:
2. Have Microsoft demand all software to support the minimum CPU feature as above within 5 years to the agreement.

Not gonna happen. You'd loose backwards compatibility, and thus the entire reason for sticking with x86. At that point you might as well design a new instruction set from scratch.

If you mean software released going forward, that might work, but you still have a problem with older systems not being able to execute new software. Especially desktop systems and servers have a long shelf life in enterprise.

CluelessOne said:
3. Microsoft should push usage of the modern API. Deprecate old API and overhaul their NET and VC ecosystem. Make the next NET and VC redistribution a superset of the previous one. It's ridiculous we need all the older NET and VC++ distribution to run an older software.

Agree completely on Visual C++ and NET. It is getting ridiculous. But then again, something is bound to not work correctly. At least the 2015, 2017 and 2019 versions share the same redist.

CluelessOne said:
4. OEM should make better hardware drivers and distributed only via Windows Update mechanism. One stop for all updates.

Already there for modern platforms.

DisEnchantment · Jan 3, 2021

jamescox said:
AMD64 still has a lot of baggage, but modern compilers generally aren’t going to issue instructions that do not perform well on modern processors, so a lot of the baggage is just sitting in microcode and never used unless you run some really old code. I suspect that a lot of AMD64 and ARM instruction streams would actually look very similar due to AMD64 compilers using the highest performance instructions to get the job done.

Actually, this cannot be any farther from the truth.
As others have said, Windows and Linux, in general are shipped with very generic libraries to cover a wide gamut of processors from much more than 10 years ago.
The compiler, although it can emit code for newer processors, is however being used to target the lowest common denominator.
AMD and Intel are generally very slow to update compilers because the old code would run on the new processors no problem.

This is one of many reasons many companies would be running their own distro so that they can extract performance out of their machines in their data centers.
Look at OpenSSL, if you want top performance on a modern machine you need to recompile so that you can take benefit of AES instructions on new processors.
Look at Linux netfilter, if you want orders of magnitude gain in performance you recompile with AVX.
This is what we do and I am certain most other big companies do.
If you use some infrastructure from Azure, They have their own distro. They are not using some generic Ubuntu distro, even though if you wish you could still do.

Windows is nerfed on a next level, you cannot recompile it and you are at the mercy of Microsoft.

If you look at Android, the OEM compiles the system libraries and the kernel using highly tuned options and extract best performance out of the processor because they own the HW and the SW.
During OTA the /boot, /vendor are always updated by OEM which means best case performance. AOSP updates are also compiled by OEM and delivered via OTA to update the /system
Apps made by third party are java based and are not influenced by the compiler options.

You can bet Apple does the same. I would even go far as to suggest they would not mind nerfing old HW in order to support current products.

.NET 6 has a chance of redeeming itself if the AOT can JIT the IL code to match the native arch, but thats probably stretching it a bit too far. But nevertheless I am excited about .NET 6 AOT.

MS has to make a new OS every few years and set the minimum requirements. But then again that has never been the MO for both Windows and Linux.

Leeea · Jan 3, 2021

CluelessOne said:
Make no mistake if the premium PC market goes all Apple all of them are dead.

Why would the premium PC market go to Apple?

Apple has been making premium PCs a long time. Back in the Power PC days they had a faster processor for a while. Before that they had the 68000 series which for a time was also faster.

Apple is not even close to making a faster CPU than its x86 counter parts.

The CPU they did make has more transistors and only 4 high performance threads, which hardly compares to the 12 to 16 high performance threads of the competition. It is true they get single threaded performance wins in some benchmarks. However, these days nearly all apps that need performance implement multithreading, an area where the m1 gets thrashed hard. Even your internet browser is multithreaded. It is difficult to think of an app that needs performance that is not multithreaded.

Factoring in Apple's prison ( the so called "walled garden" ), combined with poor value for the money for the hardware, the expensive Apple tax on software and content, no upgrade options, no repair options, and a lack of legacy support it is hardly a compelling product for high end computing.

DisEnchantment said:
Actually, this cannot be any farther from the truth.

I think it is more of a middle ground. Yes, the default setting in Visual Studio is to compile for the beginning of time.

But hit the x64 check box, and Visual Studio compiles for the much "newer" x64 standard, which at least gets us to 2003. One has to imagine the truly awful ancient stuff is left behind.

DisEnchantment said:
If you look at Android, the OEM compiles the system libraries and the kernel using highly tuned options and extract best performance out of the processor because they own the HW and the SW.

That is true. However then the android OS runs nearly all of the programs with a bytecode dalvik virtual machine similar to a java virtual machine. Very efficient, but not something a person would say is performance optimized.

DisEnchantment · Jan 3, 2021

Leeea said:
I think it is more of a middle ground. Yes, the default setting in Visual Studio is to compile for the beginning of time.

But hit the x64 check box, and Visual Studio compiles for the much "newer" x64 standard, which at least gets us to 2003. One has to imagine the truly awful ancient stuff is left behind.

Some of the gems about current Linux distros, from gcc mailing list to organize architecture levels for x86

New x86-64 micro-architecture levels

Most Linux distributions still compile against the original x86-64
baseline that was based on the AMD K8 (minus the 3DNow! parts, for Intel
EM64T compatibility).

They compiled it for K8 ( even taking out 3DNow! extensions )

Level A - CMPXCHG16B, LAHF/SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3. Barely going above the base x86_64 requirements.

Level B - Level A + AVX. The vintage of Intel Sandy Bridge and AMD Jaguar.

Level C - Level B + AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE. The point of roughly Intel Haswell era systems.

Level D - Level C + AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL. At this stage with the AVX-512 focus, just current Intel Xeon Scalable CPUs and Ice Lake.

Intel ClearLinux which is the fastest Linux distro out there, by a good margin, is compiled for an equivalent of Level B.

Leeea said:
That is true. However then the android OS runs nearly all of the programs with a bytecode dalvik virtual machine similar to a java virtual machine. Very efficient, but not something a person would say is performance optimized.

What I was trying to convey is that there is no Android image from Google for Handset manufacturers. Every OEM will customise and build the AOSP exactly for their SoC/HW. Windows/Linux distros on the other hand ...

Just a short comment on Android...
Android uses ART not Dalvik since a long time now.
The Android Framework Services contain mainly C/C++ code, the HALs and of course the OS is Linux 4.1x with custom OOM, binder IPC, ION, custom governors (with Android 10+ it will be 5.4x and in the very near future with a possibility to build it directly from upstream). A lot of the system libraries are the same from upstream
The apps are of course IL byte code, but whenever in memory HALs are used, like OpenGL/Vulkan it is all C/C++ code.

Google Play apps use standardized java/kotlin framework interfaces and can connect to standard HALs but OEMs implement a lot of apps as Native services and expose VHALs (Vendor HALs) which are binderized and the app only handle the intent/activity in java and connects to these services via binder IPC.
With APEX you can deploy apps containing a native .so instead of only plain apk.

Hulk · Jan 3, 2021

jamescox said:
RISC and CISC are not really applicable anymore. ARM is not really RISC. RISC is reduced instruction set computing. ARM has a massive number of very specialized instructions that do not actually fit with the original RISC paradigm. It seems like RISC and CISC have become terms just used to differentiate between x86 derivatives and everything else, which means they are not actually useful anymore.

AMD64 still has a lot of baggage, but modern compilers generally aren’t going to issue instructions that do not perform well on modern processors, so a lot of the baggage is just sitting in microcode and never used unless you run some really old code. I suspect that a lot of AMD64 and ARM instruction streams would actually look very similar due to AMD64 compilers using the highest performance instructions to get the job done.

ARM still has the RISC-like, simpler instruction encoding (fixed length) and simpler addressing modes and such. I am not an expert, but I would say that the limitations on number of execution units has more to do with the scheduler than anything else. Integer execution units are actually very simple and take little die area. The decoders are more complex for x86 derivatives, but they are not actually decomposed into RISC-like instructions. They are decoded into a wide internal representation with a bunch of extra information attached.

The complex part is the schedulers. I believe they actually grow non-linearly with the number of execution units. The schedulers are ridiculously complex. They have to avoid data hazards due to dependent instructions. They have register renaming since AMD64 only has 16 GP registers. Zen 3 has 192 physical registers. You also have speculative execution. The internal representation and the schedulers have to track all of that information in addition to the actual instrution operands. There is going to be a point where adding more execution units increases the scheduler complexity significantly with zero performance return.

I suspect that Apple processors do very well mostly due to cache design; I would expect execution width has little to do with it. This includes automatic prefetch and such. The problem has been how to keep the core fed for a long time. I tried some test with bit vectors a long time ago since someone said that using a full byte per bit was faster. It wasn’t, even though the cpu has to do a bunch of bit masking which is extra instructions. The smaller cache footprint may have won. Instructions are cheap at 3 to 5 GHz though. A lot of code only reaches an IPC of around 1 due to waiting on the memory system, so the enthusiast idea that some super wide execution core is going to beat everything else is not reality.

Thanks for the well thought-out and written explanation. So really the problem isn't with making the processors wider, it's the dependency of one instruction on one not yet decoded and stored that makes the extra "lanes" kind of useless. Is this essentially correct? There is only so much parallelism that can be extracted from the code and I get the feeling from your post that we are nearing that limit as the Out-of-Order scheduling structures are already unbelievably complex in modern processors.

Of course we are talking about hardware parallelism here as opposed to software parallelism, which would be more as to how the developer has coded the application for multiple processors rather than a wider pipelined processor?

Cardyak · Jan 3, 2021

We are nowhere near the width limit for decoding instructions, x86 or otherwise.

Apple’s Firestorm core already decodes 8 instructions per clock (but the exact implementation is unknown)

Intel’s Tremont is also a very innovative design, as it has 2 clusters of 3 wide decoders, and allows out of order decoding. So the first cluster decodes the current “batch” of code, and the second cluster works in tandem with the branch predictor to determine the next block of code, and decodes that. If I had to guess, I think this is the future direction of decoding and general front end improvements.

Why stop at 2 clusters? As transistor density increases we could have numerous different clusters of decoders and more ambitious designs:

2x4 -> 3x3 -> 3x4 -> 3x5 .... -> 4x6 (With enough die space and branch prediction accuracy and bandwidth this is all theoretically possible)

You would need an incredible branch predictor though, as you are now essentially speculatively executing code from start to finish.

Cardyak · Jan 3, 2021

Accidentally double posted.

gdansk · Jan 3, 2021

I suspect x64's variable length instructions makes it use some more space.

But, as others have mentioned, x64 will often yield more micro-operations from fewer instructions. So it may not be the key bottleneck in modern Intel/AMD designs that you think it is. I think it's telling that neither Sunny Cove nor Zen 3 pushed that limit and instead focused on other areas.

As for it being difficult: Intel's small core 'Tremont' can be 6 wide decode. Who knows why they did not ship it that way. Intel said it offered a better balance of performance, die size, and power.

I am curious to see if they adapt a similar approach in their big cores. It seems wasteful of die space to have a usually sleeping second cluster but it'd gladly wake up for benchmarks.

wlee15 · Jan 3, 2021

It's pretty clear that both Intel and AMD are going to rely on op Caches to bypass x86 decoder limitations. (Intel even calls the x86 decoders as the "Legacy Decode Pipeline").

CluelessOne · Jan 3, 2021

Leeea said:
Why would the premium PC market go to Apple?

Apple has been making premium PCs a long time. Back in the Power PC days they had a faster processor for a while. Before that they had the 68000 series which for a time was also faster.

Apple is not even close to making a faster CPU than its x86 counter parts.

The CPU they did make has more transistors and only 4 high performance threads, which hardly compares to the 12 to 16 high performance threads of the competition. It is true they get single threaded performance wins in some benchmarks. However, these days nearly all apps that need performance implement multithreading, an area where the m1 gets thrashed hard. Even your internet browser is multithreaded. It is difficult to think of an app that needs performance that is not multithreaded.

Factoring in Apple's prison ( the so called "walled garden" ), combined with poor value for the money for the hardware, the expensive Apple tax on software and content, no upgrade options, no repair options, and a lack of legacy support it is hardly a compelling product for high end computing.

For those casual users who uses their PC or laptop for:
1. light office work (word processing etc.).
2. presentation on client premises.
3. Browsing internet, email, media consumption etc.

Basically people whose computer needs can be met with a "tablet" or iPad experience but needs keyboard, mouse or better monitor the Apple M1 CPU can met their needs with longer battery life or less power than x86 laptops with Windows on it. In some cases with better user experience also.

These people are the majority and they don't care what OS they use, they just want to do their job or need with the least hassle. And no, for them Linux is not an option. Too geeky, fiddly etc in their mind.

Apple gives them machines that looks good, fast enough for their need, good user experience, reasonable security and privacy and that "cool, posh and expensive" factor. "Bling" is important. "Bling" is expensive. Why would they look at stodgy and cheaply built Windows machines that is a hassle to maintain?

itsmydamnation · Jan 4, 2021

wlee15 said:
It's pretty clear that both Intel and AMD are going to rely on op Caches to bypass x86 decoder limitations. (Intel even calls the x86 decoders as the "Legacy Decode Pipeline").

ARM as well,

i'm not sure what the feature holds in terms of static leakage vs dynamic power but so long as static leakage and vmin keep improving more then dynamic power , big cache structures and holding things for long will probably be better for power then doing something extra twice.

Keysplayr · Jan 8, 2021

See Pentium 4.

Midwayman · Jan 8, 2021

CluelessOne said:
For those casual users who uses their PC or laptop for:
1. light office work (word processing etc.).
2. presentation on client premises.
3. Browsing internet, email, media consumption etc.

Basically people whose computer needs can be met with a "tablet" or iPad experience but needs keyboard, mouse or better monitor the Apple M1 CPU can met their needs with longer battery life or less power than x86 laptops with Windows on it. In some cases with better user experience also.

These people are the majority and they don't care what OS they use, they just want to do their job or need with the least hassle. And no, for them Linux is not an option. Too geeky, fiddly etc in their mind.

Apple gives them machines that looks good, fast enough for their need, good user experience, reasonable security and privacy and that "cool, posh and expensive" factor. "Bling" is important. "Bling" is expensive. Why would they look at stodgy and cheaply built Windows machines that is a hassle to maintain?

All true and the volume market is really vulnerable to shifting to apple, especially as more apps get implemented as web versions anyways. However its not the premium market. I'm really interested to see how Apple replaces it's mac pro desktop line. The m1 has some severe limitations for that market.

CluelessOne · Jan 9, 2021

I suppose I should have written luxury (or something that perceived as luxury anyway) instead of premium. Nevertheless if volume market goes to Apple, where does it leave Microsoft and it's OEM? Scrapping for enterprise and bottom of barrel market? So it's a death sentence in the long run.
Either Microsoft and it's suppliers and OEM shape up or they die. My prediction is the second scenario because they are run by MBAs type.

Question What's preventing AMD and Intel from widening their pipelines?

Golden Member

Golden Member

Golden Member

Diamond Member

Diamond Member

Diamond Member

Senior member

Member

Diamond Member

Golden Member

Diamond Member

Golden Member

Diamond Member

Member

Member

Diamond Member

Senior member

Member

Diamond Member

Elite Member

Diamond Member

Member