The RISC Advantage

Arachnotronic · Dec 17, 2014

You can't really benchmark an ISA independent of an implementation, so this discussion is pretty moot, IMO.

III-V · Dec 17, 2014

jhu said:
Because the scenario you outlined earlier is just as absurd. Unless you have gobs of money at your disposal, what you're proposing won't be possible and then you're left with discussing theory. And in theory there is no difference between theory and practice. In practice there is.

I've never met a group of people more opposed to to giving credit where credit is due, and admitting that I'm right.

jhu · Dec 17, 2014

III-V said:
I've never met a group of people more opposed to to giving credit where credit is due, and admitting that I'm right.

I think your perseptions are a bit off. For example, I see no instance where dmens was being rude. As for being right, please see Arachnotric's post right above yours.

witeken · Dec 17, 2014

Back in the real world, any advantage in power or transistor count is more than compensated by the difference in node and other architectural decisions. For example, A15 (at least the earlier versions of the core) was much less efficient than other ARM cores. I'd bet if ISA is neglected, Silvermont is still more efficient than Krait at 28nm.

Exophase · Dec 17, 2014

III-V said:
I've asserted that ARM, particularly its 32-bit variety, is a better ISA than x86, and that RISC-V is better than either, if we were to ignore compatibility as a factor. I suppose I should limit the scope of this claim to pertain to modern, commonly used consumer software only.

It seems ridiculous to assert that ISAs can't be better than others -- this would mean no other ISAs have improved over another, and that all ISAs after the very first ISA were redundant.

I don't think you're ever going to get a scenario where you can really say one ISA is objectively better than another in all possible cases.

You've said several times that RISC-V is objectively better than ARMv8. But RISC-V is missing a lot of operations that ARMv8 has, most of which are simple and have a low cost of implementation. I can guarantee you that there are algorithms that will benefit tremendously from having access to ARMv8's instructions that are missing on RISC-V. There are also algorithms that benefit from instruction idioms that are only on x86.

While some design decisions will be worse than others, pretty much everything has some kind of tradeoff. RISC-V is designed to be extensible so that it can be used for the next 50 years without breaking compatibility. And it's designed to be a good fit for the smallest microcontrollers all the way up to big workstations. Those two constraints have a tangible cost that may not apply to ARM, who has different ISAs for microcontrollers vs big 64-bit machines, and who probably expects replacing the ISA over the next 50 years to be a surmountable problem (as evidenced by the fact that they did it recently)

III-V · Dec 17, 2014

jhu said:
I think your perseptions are a bit off. For example, I see no instance where dmens was being rude. As for being right, please see Arachnotric's post right above yours.

What exactly does his post add? Speaking of being rude, you certainly were when you related the scenario I posited to searching for a mythological animal.

I've been continually told no, but no one has bothered delving into the subject. Everyone's simply tiptoed around it.

jhu · Dec 17, 2014

III-V said:
What exactly does his post add? Speaking of being rude, you certainly were when you related the scenario I posited to searching for a mythological animal.

What you proposed is the same mythological scenario.

III-V said:
I've been continually told no, but no one has bothered delving into the subject. Everyone's simply tiptoed around it.

Please read everyone else's posts again for the reasons why, especially Exophase's.

III-V · Dec 17, 2014

Exophase said:
I don't think you're ever going to get a scenario where you can really say one ISA is objectively better than another in all possible cases.

That's not what I was really arguing, though, and I agree that it's not likely.

You've said several times that RISC-V is objectively better than ARMv8. But RISC-V is missing a lot of operations that ARMv8 has, most of which are simple and have a low cost of implementation. I can guarantee you that there are algorithms that will benefit tremendously from having access to ARMv8's instructions that are missing on RISC-V. There are also algorithms that benefit from instruction idioms that are only on x86.

I updated my statement earlier, in regards to RISC-V, with the caveat that it superior for typical consumer workloads. I, of course, don't actually know this for certain.

I have no qualms with the idea that support for additional instructions can provide substantial performance improvements for the workloads they're targeted for -- that's the whole point behind them, of course.

While some design decisions will be worse than others, pretty much everything has some kind of tradeoff. RISC-V is designed to be extensible so that it can be used for the next 50 years without breaking compatibility. And it's designed to be a good fit for the smallest microcontrollers all the way up to big workstations. Those two constraints have a tangible cost that may not apply to ARM, who has different ISAs for microcontrollers vs big 64-bit machines, and who probably expects replacing the ISA over the next 50 years to be a surmountable problem (as evidenced by the fact that they did it recently)

So would it not be possible evaluate those tradeoffs, through benchmarking? I've given a pretty rough outline of how I'd think one would go about doing so, which I'll restate:

E.g., take however many iterations of MIPS there have been (or less as desired) and compare after doing the same for ARM (A-series, R-series, or M-series -- which ever would be most appropriate). These would often share the same process, from the same foundry, and have had similar design goals and target markets. Additionally, for the process, you would need to compare either between one of each product on each shared node, and do this for several node/product combinations and compare the calculated means, or normalize results with a simulated impact of the process used (process parameters are typically pretty easy to come by).

Thank you for being respectful, and actually taking me seriously.

jhu said:
What you proposed is the same mythological scenario.

If you're not going to listen, please refrain from commenting further.

jhu · Dec 17, 2014

III-V said:
If you're not going to listen, please refrain from commenting further.

Here's what you proposed first:

III-V said:
Say you have 100 programs, compiled for a processor utilizing ARMv8, and compare to a processor utilizing x86-64. Assume the target workloads for these processors are identical. Also assume their development budgets were identical. Assume they use the same manufacturing process. The compilers used were created with an equal development investment. Assume all personnel involved in the creation of the software and hardware mentioned are equally competent.

Measure the performance of those 100 programs, and compare between the two processors.

Not possible. Which ARMv8 processor should we use? Tegra K1 Denver? Snapdragon 810? Snapdragon 410? Apple A7? Which x64? Piledriver? Broadwell? Silvermont? And you want to assume the compilers are equivalent? They're not all that equivalent for x64 (icc vs. gcc) let alone between different ISAs.

III-V said:
take however many iterations of MIPS there have been (or less as desired) and compare after doing the same for ARM (A-series, R-series, or M-series -- which ever would be most appropriate). These would often share the same process, from the same foundry, and have had similar design goals and target markets. Additionally, for the process, you would need to compare either between one of each product on each shared node, and do this for several node/product combinations and compare the calculated means, or normalize results with a simulated impact of the process used (process parameters are typically pretty easy to come by).

This is feasible. Please report back with your results.

III-V · Dec 17, 2014

jhu said:
Here's what you proposed first:

Not possible. Which ARMv8 processor should we use? Tegra K1 Denver? Snapdragon 810? Snapdragon 410? Apple A7? Which x64? Piledriver? Broadwell? Silvermont? And you want to assume the compilers are equivalent? They're not all that equivalent for x64 (icc vs. gcc) let alone between different ISAs.

As with most/all surveys, the larger the sample size, the clearer and more reliable your findings. For ARM, it'd be most useful to stick to vanilla implementations, seeing as there are about two decades worth of iterations.

Using a compiler would influence results in a manner that cannot be accounted for. You'd have to hand code assembly. You'd need to write your own programs as well.

There is no doubt this all is well outside the realm of what any sane human being would ever want or even afford to do. That was not my point, though.

This is feasible. Please report back with your results.

I'm not going to do this, nor have I given any indication that I would. All of this has been conjecture.

witeken said:
Back in the real world, any advantage in power or transistor count is more than compensated by the difference in node and other architectural decisions. For example, A15 (at least the earlier versions of the core) was much less efficient than other ARM cores. I'd bet if ISA is neglected, Silvermont is still more efficient than Krait at 28nm.

I don't believe I or anyone else has argued otherwise.

FX2000 · Dec 17, 2014

witeken said:
Back in the real world, any advantage in power or transistor count is more than compensated by the difference in node and other architectural decisions. For example, A15 (at least the earlier versions of the core) was much less efficient than other ARM cores. I'd bet if ISA is neglected, Silvermont is still more efficient than Krait at 28nm.

I actually feel the inefficient, first gen A15 daily. My Nexus 10, from the ending of 2012, has a Samsung SoC. Two A15 cores at 32nm with a T604 GPU. Even through it has a much bigger than my LG G2. The SoC is drawing atleast twice the juice in the Nexus 10.

III-V · Dec 17, 2014

I really wonder what went so wrong with the A15. I'd imagine Tegra K1's revised iteration managed to solve the problem(s), but I guess it's a story that won't ever be revealed.

The A57 certainly seems to make up for its predecessors failure, that's for sure.

soresu · Dec 19, 2014

I thought that A15 was designed to be a server chip, or at least proof it could work for ARM ISA. Wasnt bigLITTLE introduced after Samsung made the Exynos 5 Dual and it became obvious that power consumption was too high for mobile without a more efficient core to back it up?

FX2000 · Dec 19, 2014

soresu said:
I thought that A15 was designed to be a server chip, or at least proof it could work for ARM ISA. Wasnt bigLITTLE introduced after Samsung made the Exynos 5 Dual and it became obvious that power consumption was too high for mobile without a more efficient core to back it up?

I think you're very right here about power consumption.
The A15 supposedly had 2x the processing power of 1 A9 core (2 A15 cores = 4 A9 cores.)
But after the exynos 5 dual (Nexus 10 SoC), Samsung kinda found out that a dual core A15 SoC, with a T604, and a 10" screen with a resolution of 2650x1600 would not be power efficient. However. When you look at the Snapdragon 800, it has 4 cores, that's stronger than 4 A15's, while being MUCH more efficient. My LG G2 has *2x the battery life of my Nexus 10*.

* 1080p screen, 4G LTE.

Seems to me, that the A15 was that core, that was rushed. It was not ready in 2012.

witeken · Dec 19, 2014

soresu said:
I thought that A15 was designed to be a server chip, or at least proof it could work for ARM ISA. Wasnt bigLITTLE introduced after Samsung made the Exynos 5 Dual and it became obvious that power consumption was too high for mobile without a more efficient core to back it up?

Big.Little was first used in the Tegra 3. It was an idea from ARM (so they could design many cores with a weak dynamic range with their tiny budget).

NTMBK · Dec 19, 2014

witeken said:
Big.Little was first used in the Tegra 3. It was an idea from ARM (so they could design many cores with a weak dynamic range with their tiny budget).

To be fair it wasn't big.LITTLE in Tegra 3- more like medium.MEDIUM.

Enigmoid · Dec 19, 2014

FX2000 said:
I think you're very right here about power consumption.
The A15 supposedly had 2x the processing power of 1 A9 core (2 A15 cores = 4 A9 cores.)
But after the exynos 5 dual (Nexus 10 SoC), Samsung kinda found out that a dual core A15 SoC, with a T604, and a 10" screen with a resolution of 2650x1600 would not be power efficient. However. When you look at the Snapdragon 800, it has 4 cores, that's stronger than 4 A15's, while being MUCH more efficient. My LG G2 has *2x the battery life of my Nexus 10*.

* 1080p screen, 4G LTE.

Seems to me, that the A15 was that core, that was rushed. It was not ready in 2012.

A15 is stronger clock for clock than krait (S800).

Nothingness · Dec 19, 2014

witeken said:
Big.Little was first used in the Tegra 3. It was an idea from ARM (so they could design many cores with a weak dynamic range with their tiny budget).

Tegra 3 didn't use big.LITTLE. NVIDIA solution is using the external PL310 (L2) cache-controller for both the 4-core cluster and the low power single core. ARM big.LITTLE works by having two clusters having their own L2 cache. This also means ARM needs a specific more or less complex interconnect to handle coherent requests across clusters, something NVIDIA didn't need for T3.

Nothingness · Dec 19, 2014

FX2000 said:
Seems to me, that the A15 was that core, that was rushed. It was not ready in 2012.

Indeed, and later revisions of Cortex-A15 have improved power efficiency.

FX2000 · Dec 19, 2014

Enigmoid said:
A15 is stronger clock for clock than krait (S800).

Might be. But in a device with a 5watt battery, a quad core SoC that is stronger than a dual core A15 SoC, and uses less power, is always preffered.
I could imagine my Nexus 10 would last 2 hours at most with a quad core A15 @ 2.2Ghz.

witeken · Dec 29, 2014

http://seekingalpha.com/article/1296521-intel-48-per-share-in-4-years#comment-16749321

Here's a comment (and the 2 next ones from him as well) about the topic that might be worth checking out.

TL;DR:

I'm afraid this is not correct. The instruction set CISC/RISC complexity argument comes up every single time but it's been irrelevant for over two decades.

The big winner, strangely karmic, is that the x86 instruction set is so freaking compact that it winds up being a benefit for modern-day cpu designs rather than a detractor, due to stalls being primarily a problem of cache misses in modern designs.

Exophase · Dec 29, 2014

Only x86 isn't really "so freaking compact." Average instruction size was around 3.5 bytes way back before x86-64 and before SSE2. Both have increased average instruction size, although they also increased average work/instruction. But starting with 3.5 bytes isn't really very impressive for a highly variable width instruction set.

These days the code density is probably somewhat better than ARMv8 and somewhat worse than Thumb-2.

Arachnotronic · Dec 29, 2014

witeken said:
Back in the real world, any advantage in power or transistor count is more than compensated by the difference in node and other architectural decisions. For example, A15 (at least the earlier versions of the core) was much less efficient than other ARM cores. I'd bet if ISA is neglected, Silvermont is still more efficient than Krait at 28nm.

We'll find out with SoFIA, won't we?

I'm actually very much looking forward to the comparison between Silvermont on 28nm v.s. comparable ARM/ARM ISA cores when SoFIA hits the market.

Nothingness · Dec 30, 2014

Exophase said:
Only x86 isn't really "so freaking compact." Average instruction size was around 3.5 bytes way back before x86-64 and before SSE2. Both have increased average instruction size, although they also increased average work/instruction. But starting with 3.5 bytes isn't really very impressive for a highly variable width instruction set.

These days the code density is probably somewhat better than ARMv8 and somewhat worse than Thumb-2.

You can't measure density by just measuring average instruction size, you also have to take into account the number of instructions required to do the job. Last time I measured that on S2K gcc, I got about a 10% advantage for x86-64 over ARMv8 AArch64 (IIRC the average instruction size for x86-64 was 3.5 as your figure for pre-x86-64). That was very early in ARMv8 life, so compilers might have improved.

Nothingness · Dec 30, 2014

witeken said:
http://seekingalpha.com/article/1296521-intel-48-per-share-in-4-years#comment-16749321

Here's a comment (and the 2 next ones from him as well) about the topic that might be worth checking out.

TL;DR:

Here is another quote:

So much of an advantage that ARM was forced to admit to the problem and have introduced a short-form instruction in their 64-bit architecture to try to address it.

Utterly wrong. That guy is an obvious Intel fanboy with little knowledge of the competition.

The RISC Advantage

Lifer

Senior member

Lifer

Diamond Member

Diamond Member

Senior member

Lifer

Senior member

Lifer

Senior member

Member

Senior member

Diamond Member

Member

Diamond Member

Lifer

Platinum Member

Diamond Member

Diamond Member

Member

Diamond Member

Diamond Member

Lifer

Diamond Member

Diamond Member