Question RISC vs CISC in modern CPUs [Extremetech]

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
Very interesting article from Joel Hruska over at Extremetech

As a non-industry hardware enthusiast, the ARM vs x86 debate has me fascinated, I must say. The article goes into the historical aspects of both RISC and CISC designs and attempts to answer crucial questions, namely whether the ISA is relevant or irrelevant, and whether Intel's and AMD's x86-64 chips can compete successfully with advanced ARM designs like Apple's M1 in the future, just to name a few.

The article also cites the oft-quoted Agner Fog, to whom I've seen many references on this forum. Apparently Mr. Fog is of the mind that the ISA definitely matters, and that x86 is definitely encumbered, but also that it makes up for the legacy baggage by doing more work per instruction. Anyway, I still think that x86-64 has a long life ahead of it. Zen 3 isn't that far off from the M1 in terms of raw single-threaded performance, despite being one node behind. But I wouldn't mind seeing both AMD and Intel do a ground-up redesign of the x86-64 ISA that sacrifices some backward compatibility for performance and efficiency.
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
It's not so much RISC vs CISC in the sense of the complexity of the instruction set, but yes, the ISA matters a lot.
However, since there are fundamental issues with x86-64, I do not see how they could just sacrifice a bit of compatibility in order to create a modern ISA.
 
  • Like
Reactions: Carfax83

zir_blazer

Golden Member
Jun 6, 2013
1,164
406
136
The RISC vs CISC debate became secondary the moment that Processor designers figured out that it is preferable to have a completely different internal ISA (MicroOps exclusive to that Processor) with just a decoding frontend for the external ISA (x86).
The major remaining difference is that CISC has the potential advantage that you can have complex instructions that do many things at once in potentially a smaller size than if you fit the same operations into multiple simpler instructions, but that also makes decoding such complex instructions harder. Under ideal conditions and with comparable external CISC vs RISC ISAs, I would believe that CISC would use less Memory Bandwidth and fit more total operations in the Processor Cache, but the decoder unit would be bigger in size, clock slower and be less power efficient, whereas RISC would be the exact opposite. Can't really see any other meaningful differences.
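
To make the density point concrete, here is a minimal C sketch (my own illustration, not from the article or the post above): a single memory increment that a CISC compiler can lower to one read-modify-write instruction, while a load/store RISC compiler emits a load, an add and a store. The assembly in the comments is only a typical lowering; exact encodings and byte counts vary.

Code:
/* Minimal sketch: one memory increment.
 * A CISC compiler can lower this to a single read-modify-write instruction,
 * e.g. on x86-64:   add dword ptr [rdi], esi        (a couple of bytes)
 * A load/store RISC compiler emits three fixed 4-byte instructions,
 * e.g. on AArch64:  ldr w2, [x0]
 *                   add w2, w2, w1
 *                   str w2, [x0]
 * Same work; the RISC form costs more fetch/I-cache bytes, while the CISC
 * form pushes the complexity into the decoder. */
void bump(int *counter, int delta) {
    *counter += delta;
}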

When it comes to x86, you're talking about an ISA that carries baggage from 40 years ago, and where early design mistakes due to Intel not taking seriously the long-term evolution of the x86 ISA (the original 8086, but most of all the 80286, were considered pretty much filler products) ended up screwing every damn x86 Processor since then. There is a lot of extra complexity just to keep backwards compatibility (and not only at the Processor level, but also at the platform level). I wrote about that. Just by reading that, you may actually understand what actually makes x86 a rather bad base.
 

diediealldie

Member
May 9, 2020
77
68
61
There are x86 taxes indeed, but the point is whether it really matters in terms of end-user use cases or not. For example, x86 is paying a tax by adding a uOP cache (kind of an L0 I-cache) to store decoded x86 opcodes. These taxes are somewhat compensated by the higher opcode density of x86, which reduces DRAM-to-CPU transfers. Bandwidth is quite expensive, and higher density can also reduce the burden on inter-CPU communication (UPI), resulting in better performance and less power usage (for certain scenarios, etc.).

The other thing I want to point out is the history of the compilers themselves. We already know that Intel is not really adding additional ALUs (not vector ALUs but general-purpose ones) to their CPU cores. Note that the M1 uses 6 ALUs per core while Skylake has 5. Probably x86 is suffering more diminishing returns from additional ALUs compared to ARM due to how ISA usage (how compilers generate code for x86 binaries) has evolved? Maybe that's the reason why the x86 players are enthusiastically using SMT (no software is using more than 6 ALUs?).
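
To illustrate the diminishing-returns point, here is a toy C example of my own (nothing to do with how any particular compiler behaves): whether extra ALUs help at all depends on how much independent work the instruction stream exposes.

Code:
/* Toy example: summing an array.
 * In chain() every add depends on the previous one, so even a core with
 * 6+ ALUs retires roughly one add per cycle in this loop; the extra ALUs
 * sit idle (the kind of idle width SMT tries to fill with another thread).
 * In unrolled() there are four independent accumulators, so a wide core
 * can keep several ALUs busy in the same cycle. */
long chain(const long *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];                       /* serial dependency chain */
    return s;
}

long unrolled(const long *a, int n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {     /* four independent chains */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)                   /* leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}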
 
Last edited:
  • Like
Reactions: Carfax83

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
When it comes to x86, you're talking about an ISA that carries baggage from 40 years ago, and where early design mistakes due to Intel not taking seriously the long-term evolution of the x86 ISA (the original 8086, but most of all the 80286, were considered pretty much filler products) ended up screwing every damn x86 Processor since then. There is a lot of extra complexity just to keep backwards compatibility (and not only at the Processor level, but also at the platform level). I wrote about that. Just by reading that, you may actually understand what actually makes x86 a rather bad base.

Is there anything you think Intel and/or AMD should do to further lessen the legacy baggage? There was talk several years ago that Intel was working on a brand-new x86 architecture not based on the P6 that would sacrifice backward compatibility for more die space, efficiency and performance per watt. If this is true, I'd bet it would be Meteor Lake.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
Is there anything you think Intel and/or AMD should do to further lessen the legacy baggage? There was talk several years ago that Intel was working on a brand-new x86 architecture not based on the P6 that would sacrifice backward compatibility for more die space, efficiency and performance per watt. If this is true, I'd bet it would be Meteor Lake.

There'd have to be an agreement between AMD and Intel for this to happen; otherwise one of them will just use this "legacy baggage" as a weapon against the other corporation if it decides to voluntarily cut off functionality, making it less compatible and slower in applications with older codebases.

AMD64 was an attempt by AMD to clean up the x86 ISA a bit, while IA64 from Intel was completely incompatible with x86, and Intel suffered a massive beating from AMD during that time for their decision. Politics will continue to dictate hardware design ...
 

Gideon

Golden Member
Nov 27, 2007
1,621
3,645
136
AMD64 was an attempt by AMD to clean up the x86 ISA a bit, while IA64 from Intel was completely incompatible with x86, and Intel suffered a massive beating from AMD during that time for their decision. Politics will continue to dictate hardware design ...
While that's true, it's only part of the picture. IA64 was also a terrible, terrible VLIW ISA that required ridiculously complex compilers and should never have seen the light of day. VLIW didn't even work out for GPUs (AMD replaced it with GCN); it couldn't possibly have worked for much more general-purpose CPUs.

If Intel had created a decent 64-bit ISA and had actually allowed licensing it (otherwise what choice would AMD have had but to continue with x86?), I'm quite sure it would have been accepted by both.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
While that's true, it's only part of the picture. IA64 was also a terrible, terrible VLIW ISA that required ridiculously complex compilers and should never have seen the light of day. VLIW didn't even work out for GPUs (AMD replaced it with GCN); it couldn't possibly have worked for much more general-purpose CPUs.

If Intel had created a decent 64-bit ISA and had actually allowed licensing it (otherwise what choice would AMD have had but to continue with x86?), I'm quite sure it would have been accepted by both.

VLIW has been somewhat of a success in GPUs if we take Maxwell/Pascal as an example, and it has seen some use in CPUs too, like Nvidia's Denver ARM cores, both of which are seen in the Tegra X1 SoC, so I don't totally buy the argument that VLIW isn't a viable design ...

Even if an alternate IA64 weren't VLIW, there still would've been tons of performance and compatibility issues in the face of the more natural AMD64 successor. Would AMD even have agreed to accept IA64 as the new standard if Intel had offered it to them, or would AMD have taken the chance to retaliate against Intel with AMD64? Why assume AMD would take the road for the greater good when they could mercilessly beat Intel? Politics has always dictated hardware design and that won't change ...

Intel could've ditched x86 all by themselves but they didn't because there's too much money to be made!
 
  • Like
Reactions: Carfax83

Gideon

Golden Member
Nov 27, 2007
1,621
3,645
136
VLIW has been somewhat of a success in GPUs if we take Maxwell/Pascal as an example, and it has seen some use in CPUs too, like Nvidia's Denver ARM cores, both of which are seen in the Tegra X1 SoC, so I don't totally buy the argument that VLIW isn't a viable design ...

Even if an alternate IA64 weren't VLIW, there still would've been tons of performance and compatibility issues in the face of the more natural AMD64 successor. Would AMD even have agreed to accept IA64 as the new standard if Intel had offered it to them, or would AMD have taken the chance to retaliate against Intel with AMD64? Why assume AMD would take the road for the greater good when they could mercilessly beat Intel? Politics has always dictated hardware design and that won't change ...

Intel could've ditched x86 all by themselves but they didn't because there's too much money to be made!

I agree with the latter two paragraphs, but Maxwell/Pascal are not VLIW. AFAIK Nvidia has not used VLIW for GPUs since G80. And I consider Denver a resounding flop. Nvidia tried with Denver (2016) and Denver 2 (2018), but they failed to gain any foothold in the datacenter market and lost all of Tegra's gains in phones/tablets. There haven't been any updates since 2018 and the upcoming products are not using VLIW AFAIK. So I definitely wouldn't call it successful.


EDIT:
BTW I'm not saying VLIW is useless. It certainly can have its uses in some parallel (HPC) code. It's just that there are few examples that have been truly successful. This is because it's often surprisingly suboptimal even for many kinds of massively parallel code (e.g. graphics), let alone programs that are hard to thread. What I mean is that only a small subset of problems match VLIW architectures really well. On all the others you're wasting resources massively.

IA64 had a slim chance in the datacenter. It could never have truly replaced x86 (or something like ARMv9) for general-purpose consumer code, which even now is mostly single-threaded.
 
Last edited:

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
I agree with the latter two paragraphs, but Maxwell/Pascal are not VLIW. AFAIK Nvidia has not used VLIW for GPUs since G80. And I consider Denver a resounding flop. Nvidia tried with Denver (2016) and Denver 2 (2018), but they failed to gain any foothold in the datacenter market and lost all of Tegra's gains in phones/tablets. There haven't been any updates since 2018 and the upcoming products are not using VLIW AFAIK. So I definitely wouldn't call it successful.

Maxwell/Pascal are actually VLIW designs since they have explicit instruction encoding for dual-issue. Just like Intel Itanium they have "control codes" too, though Itanium is potentially capable of triple-issue. Explicit instruction encoding for multi-issue is a concept that underpins all VLIW designs. The Tegra X1 is also used in the Nintendo Switch, which is currently a consumer success, so I wouldn't write off VLIW altogether; even if there is an incoming dark age, it might reappear in the future ...

With that being said, I think we are largely in agreement regarding AMD and Intel. Taking the entire reign of x86 is way too attractive a thought for either of them. A single corporation deciding the future of an existing industry standard is ultimately a dream-come-true exclusive monopoly!
 

Gideon

Golden Member
Nov 27, 2007
1,621
3,645
136
Maxwell/Pascal are actually VLIW designs since they have explicit instruction encoding for dual-issue. Just like Intel Itanium they have "control codes" too, though Itanium is potentially capable of triple-issue. Explicit instruction encoding for multi-issue is a concept that underpins all VLIW designs. The Tegra X1 is also used in the Nintendo Switch, which is currently a consumer success, so I wouldn't write off VLIW altogether; even if there is an incoming dark age, it might reappear in the future ...
Thanks, I stand corrected. Learn something every day!
 
  • Like
Reactions: Carfax83

zir_blazer

Golden Member
Jun 6, 2013
1,164
406
136
Is there anything you think Intel and/or AMD should do to further lessen the legacy baggage? There was talk several years ago that Intel was working on a brand-new x86 architecture not based on the P6 that would sacrifice backward compatibility for more die space, efficiency and performance per watt. If this is true, I'd bet it would be Meteor Lake.
"brand new x86 architecture not based on the P6" is still x86. I do recall some mentions of a pure x86-64 Processor without the former x86 16/32 Bits Modes, but it is not a good idea if you know what an 80376 is. Actually, thanks to virtualization I think that dependency on the old modes has increased, not diminished, since it made easier to mix modern and legacy OSes in the same system. Don't forget that when first generation Ryzen launched, it didn't took long before reports that there was something broken because you couldn't run Windows XP in a VM because the VME bug.

x86 was never an elegant ISA to begin with, and it wouldn't be worth saving or keeping in use at all if it weren't for the massive Software ecosystem that has been built on its back during these 4 decades. But if you are planning to drop backwards compatibility, why bother with something reminiscent of x86 at all? By the point that you are not backwards compatible anymore, you may as well start from scratch with something based on modern paradigms. And because these days most developers focus on high-level languages, plus there is a lot of open source stuff, porting things en masse can be viable. Just look at the POWER-based Talos II. It is mostly those who are bound to Windows that are in shackles and forced to keep x86 going forward.
 

dr1337

Senior member
May 25, 2020
331
559
106
Maxwell/Pascal are actually VLIW designs since they have explicit instruction encoding for dual-issue. Just like Intel Itanium they have "control codes" too, though Itanium is potentially capable of triple-issue. Explicit instruction encoding for multi-issue is a concept that underpins all VLIW designs. The Tegra X1 is also used in the Nintendo Switch, which is currently a consumer success, so I wouldn't write off VLIW altogether; even if there is an incoming dark age, it might reappear in the future ...

With that being said, I think we are largely in agreement regarding AMD and Intel. Taking the entire reign of x86 is way too attractive a thought for either of them. A single corporation deciding the future of an existing industry standard is ultimately a dream-come-true exclusive monopoly!
I think you should do some more reading in your link there and in general, because having the ability to dual-issue instructions isn't really related to a Very Long Instruction Word architecture. The Maxwell whitepaper you have cited literally only mentions being able to issue a memory instruction at the same time as a compute one. The part that would actually give your argument some weight is right after it mentions dual issue, where it says single-issue instructions can still saturate the shaders.

Modern GPUs can be seen as very similar to VLIW but aren't considered as such due to how far they've distanced themselves from that design. True VLIW designs rely on very little hardware scheduling and very heavily on the compiler. While yes, Nvidia has been very software-centric in their GPUs' execution, even just a single Maxwell SM has as much scheduling capability as Terascale. And that's not considering that there are multiple of them on a die with yet another scheduling engine above controlling them.

IMO, while an argument can be made that Maxwell can be similar to VLIW at its lowest level, it's just a fact that there is still way more hardware overhead in any CUDA GPU than there is in any true VLIW design. The added hardware schedulers are why CUDA is able to be used as a GPGPU arch and has such general-purpose compute performance in every implementation. And this distinction is why nobody except random people on forums tries to argue that Maxwell and Kepler have anything remotely in common with Terascale and IA64.
 

Doug S

Platinum Member
Feb 8, 2020
2,252
3,483
136
Too much is made of the M1 when people try to use it to argue that ARM is superior to x86. Was x86 superior to RISC because it beat them all in the marketplace? No, it was simply economies of scale and process superiority that did that. And "x86" didn't beat RISCs, Intel designs did. AMD barely avoided bankruptcy during those years when Intel was snuffing out the RISCs in the workstation and almost all the server market despite using the same x86 ISA.

If Apple is able to scale M2 and so on to clearly beat x86 that doesn't prove ARM is superior, only that Apple's designs are. If there were multiple ARM implementations that were all beating x86 then you might be able to draw conclusions as to which ISA is better. If only one is better, why would that prove ARM is better? If Apple was better for a few years then Intel or AMD caught up and surpassed them, would that mean that x86 has become better than ARM? Or simply that Intel/AMD's designs had become superior to Apple's?

There's a bit of additional complexity required for x86 decoders to crack the instructions and turn them into nice fixed-length instructions internally that are easier to execute. It means an extra cycle or two of pipeline length, so mispredicted branches are penalized a bit. Other than that, it doesn't make much difference in today's world of chips with tens of billions of transistors.
 

Gideon

Golden Member
Nov 27, 2007
1,621
3,645
136
There's a bit of additional complexity required for x86 decoders to crack the instructions and turn them into nice fixed-length instructions internally that are easier to execute. It means an extra cycle or two of pipeline length, so mispredicted branches are penalized a bit. Other than that, it doesn't make much difference in today's world of chips with tens of billions of transistors.
That same complexity also means you can't really do an 8-wide decoder, which is a very real bottleneck in the frontend.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Too much is made of the M1 when people try to use it to argue that ARM is superior to x86.

This.

Is the M1 GPU beating competitors because it has a superior ISA or Apple's team is just executing better?

Is x86 the reason for Intel stumbling the past 5 years?

You can have the same ISA, same start and still end up with one CPU superior to the other. Not saying x86 is better, it's worse. But execution can more than make up for that.

Intel could have had a bigger share of the phone/tablet market, but they threw it away when Otellini allowed his finance background to override innovation. Is x86 the reason Otellini threw away the chance of being the manufacturer of chips for the iPhone/iPad?

Look at the innovation AMD is fielding with the virtual uop cache and stuff. I bet you such innovation is happening because they are trying to get around the limitations of x86. But you need ecosystem backing. x86 had the largest backing two decades ago. Now that's shifting to ARM.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
I think you should do some more reading in your link there and in general, because having the ability to dual-issue instructions isn't really related to a Very Long Instruction Word architecture. The Maxwell whitepaper you have cited literally only mentions being able to issue a memory instruction at the same time as a compute one. The part that would actually give your argument some weight is right after it mentions dual issue, where it says single-issue instructions can still saturate the shaders.

It's literally from Nvidia's own CUDA documentation, and you can tell that Maxwell practically falls into the classic VLIW category because it does multi-issue via instruction "bundles" containing 4 instructions (1 control, 3 regular) ...

Modern GPUs can be seen as very similar to VLIW but aren't considered as such due to how far they've distanced themselves from that design. True VLIW designs rely on very little hardware scheduling and very heavily on the compiler. While yes, Nvidia has been very software-centric in their GPUs' execution, even just a single Maxwell SM has as much scheduling capability as Terascale. And that's not considering that there are multiple of them on a die with yet another scheduling engine above controlling them.

"True VLIW" is nothing more than explicit instruction encoding (bundles) for multi-issue design at it's most basic concept so why are you changing this definition when it's widely accepted terminology in academia ?

IMO, while an argument can be made that Maxwell can be similar to VLIW at its lowest level, it's just a fact that there is still way more hardware overhead in any CUDA GPU than there is in any true VLIW design. The added hardware schedulers are why CUDA is able to be used as a GPGPU arch and has such general-purpose compute performance in every implementation. And this distinction is why nobody except random people on forums tries to argue that Maxwell and Kepler have anything remotely in common with Terascale and IA64.

The "lowest level" is the only level that mattered in this discussion and nobody else but you here is bringing up the narrative that there is commonality between Kepler/Maxwell and Terascale/IA64 ...

As far as I'm concerned there are only two ways to design a multi-issue system. There's implicit multi-issue, which is featured in superscalar architectures (none of the mainstream GPUs fall into this category), and there's "explicit" multi-issue, which is featured in VLIW architectures ...
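
As a rough C sketch of the "explicit" option being described (my own toy layout, not Nvidia's or Itanium's actual encoding): the grouping decision lives in the instruction stream itself, placed there by the compiler, rather than being rediscovered at run time by a superscalar front end.

Code:
#include <stdint.h>

/* Toy "explicit multi-issue" bundle, loosely modelled on the
 * 1-control-word + 3-instruction grouping mentioned above.
 * The control word carries the compiler's scheduling/dependency hints;
 * the hardware issues the slots as told. In an implicit (superscalar)
 * design there is no such word: the front end inspects plain instructions
 * and works out the groupings itself. */
typedef struct {
    uint64_t control;    /* issue/dependency info decided at compile time */
    uint64_t slot[3];    /* three regular instructions issued as a group  */
} vliw_bundle_t;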

I think you need to seriously review the fundamentals again, given the clear gaps you've demonstrated in what is supposed to be universal knowledge, and your disputing it just makes things worse ...
 
  • Like
Reactions: Carfax83

dr1337

Senior member
May 25, 2020
331
559
106
I think you need to seriously review the fundamentals again, given the clear gaps you've demonstrated in what is supposed to be universal knowledge, and your disputing it just makes things worse ...
I can dispel your entire argument with the fact that Nvidia has never referred, and will never refer, to any of their architectures since adopting CUDA as VLIW. Every architecture that is formally referred to as VLIW in the industry explicitly relies on software to schedule those instructions.

The "lowest level" is the only level that mattered in this discussion and nobody else but you here is bringing up the narrative that there is commonality between Kepler/Maxwell and Terascale/IA64 ...
No, not really. IA64 would have been a different story if Intel had put more overhead in the architecture and hadn't leaned so hard on programming. You do realize that schedulers and the front end have almost entirely been the source of performance improvements in semiconductors for the last 10 years or so? Even if you can argue that Maxwell is VLIW at the lowest level, it simply doesn't operate that way functionally. CUDA literally exists so developers don't have to spend all of their time optimizing their code at the instruction level to get good general-purpose compute performance. CUDA is made possible by all of that extra scheduling and overhead. Traditional VLIW archs like IA64 and Terascale never had all of that overhead, and because of this IA64 failed and the industry has been moving away from VLIW, especially in any instance where it could face the end consumer/developer. At the lowest level, yes, there are multiple instructions being issued and executed in a manner like VLIW; outside of the lowest level, Maxwell and Kepler have nothing in common with IA64 and the others.

Also, your first few posts about Nvidia using 'VLIW' were entirely about the architecture being used in SoCs like the Tegra. If you think the lowest level is the only thing that matters for an SoC then I'm afraid you're thinking too narrow-mindedly.
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
The article also cites the oft-quoted Agner Fog, to whom I've seen many references on this forum. Apparently Mr. Fog is of the mind that the ISA definitely matters, and that x86 is definitely encumbered, but also that it makes up for the legacy baggage by doing more work per instruction.
Agner Fog thinks that the fixed instruction length of ARM is unnecessarily restrictive, whereas the variable instruction length of x86 makes it impossible for the hardware to look ahead beyond already decoded instructions, as their length and format are not known until they are decoded.

His solution to both issues is his ideal ISA, which he calls ForwardCom. It is designed to be easily extensible through a set of instruction format templates, and every instruction starts with a header that contains the overall instruction length. Work on a first hardware implementation in an FPGA soft core is underway. See https://forwardcom.info/
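
For what it's worth, here is a small C model of that length-header idea (a sketch with a made-up 2-bit length field, not ForwardCom's real encoding): instruction boundaries can be found from the header alone, without fully decoding each instruction first, which is what makes a very wide decoder more tractable than with x86-style variable length.

Code:
#include <stddef.h>
#include <stdint.h>

/* Toy encoding: the low 2 bits of an instruction's first byte give its
 * length in 32-bit words (1..4). Finding the next boundary only needs
 * this header, not a full decode of the instruction. */
static size_t insn_len(uint8_t first_byte) {
    return ((size_t)(first_byte & 0x3u) + 1u) * 4u;
}

/* Collect up to max_out instruction start offsets from a code buffer.
 * Software walks this serially, but in hardware the cheap length
 * extraction lets a wide decoder resolve many boundaries per cycle;
 * with x86, the length is only known after (partially) decoding each
 * instruction, which is the look-ahead problem described above. */
size_t find_boundaries(const uint8_t *code, size_t code_len,
                       size_t *out, size_t max_out) {
    size_t off = 0, n = 0;
    while (off < code_len && n < max_out) {
        out[n++] = off;
        off += insn_len(code[off]);
    }
    return n;
}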
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
I can dispel your entire argument with the fact that Nvidia has never referred, and will never refer, to any of their architectures since adopting CUDA as VLIW. Every architecture that is formally referred to as VLIW in the industry explicitly relies on software to schedule those instructions.

Nvidia's marketing terminology is irrelevant, and no matter how much you use this argument, it isn't going to change years of established knowledge in academia. Good luck trying to spin the tons of lecture material in your favour ...

No, not really. IA64 would have been a different story if Intel had put more overhead in the architecture and hadn't leaned so hard on programming. You do realize that schedulers and the front end have almost entirely been the source of performance improvements in semiconductors for the last 10 years or so? Even if you can argue that Maxwell is VLIW at the lowest level, it simply doesn't operate that way functionally. CUDA literally exists so developers don't have to spend all of their time optimizing their code at the instruction level to get good general-purpose compute performance. CUDA is made possible by all of that extra scheduling and overhead. Traditional VLIW archs like IA64 and Terascale never had all of that overhead, and because of this IA64 failed and the industry has been moving away from VLIW, especially in any instance where it could face the end consumer/developer. At the lowest level, yes, there are multiple instructions being issued and executed in a manner like VLIW; outside of the lowest level, Maxwell and Kepler have nothing in common with IA64 and the others.

I don't think you've got a grasp on what you've just posted ...

IA64 had to run full-blown C++ along with other more powerful and fully featured programming languages. Meanwhile, Nvidia controlled the design of the CUDA source language, so how do you figure that it's impossible for Nvidia to make a good compiler even for VLIW architectures under that constraint?

Nvidia GPUs, for all of their hardware scheduler prowess, can never run real operating systems like Windows or Linux, yet somehow IA64 with its "traditional" VLIW design can. Now why is that? Could it be that the hardware schedulers you keep hyping up simply mean nothing in the grand scheme of things?!

Also, your first few posts about Nvidia using 'VLIW' were entirely about the architecture being used in SoCs like the Tegra. If you think the lowest level is the only thing that matters for an SoC then I'm afraid you're thinking too narrow-mindedly.

I think you need to be better educated instead of making reactionary posts ...
 
Last edited:
  • Like
Reactions: Carfax83