Discussion AWS Graviton2 64 vCPU Arm CPU Heightens War of Intel Betrayal

exquisitechar · Dec 3, 2019

https://www.servethehome.com/aws-graviton2-64-core-arm-cpu-heightens-war-of-intel-betrayal/

Pretty big deal for ARM in servers. Interested in seeing a comparison between this and Rome.

Richie Rich · Apr 3, 2020

coercitiv said:
Because he compared 12c/12t versus 6c/12t and showed the obvious: workloads don't scale with more cores as expected.

Assume SMT scaling is 25% and core scaling is 100%
1c/2t - resulting performance is 125%
2c/2t - resulting performance is 200%
--> relative performance between 2c/2t and 1c/2t: 1.6X

However in the real world relative performance between 12c/12t versus 6c/12t is only 1.24X , meaning either:
Option A - SMT gains are actually 60%

Aha, you take real measure SMT benefit of 1.25X then instantly raping it to 1.6X and saying SMT benefit is 1.6X. That's insane and wrong conclusion.

The only thing what you are right is that 1.6X is "relative performance between 2c/2t and 1c/2t". Yes, different number of physical cores and different SMT activation status.

If you want to calculate SMT benefit then you have to normalize to same number of physical cores:

relative performance 1.6X needs to be divided it by 2 cores... 1.6 / 2 = 0.8X
this 0.8X is performance hit when SMT is OFF (now that's correct comparison of 1c/1t vs. 1c/2t)
do inverse function 1/0.8= 1.25X ....... voila.... you get same number like on beginning (different percentage base only)

You and Armnuke redated cannot do even a basic school math. That's pretty sad. I'm wonder how you can do some basic university stuff like diffencial equations when you are in trouble with this basics. However I hope now it's clear.

I think you can find a better word to use in tech.

esquared
Anandtech Forum Director

DrMrLordX · Apr 3, 2020

Richie Rich said:
Aha, you take real measure SMT benefit of 1.25X then instantly raping it to 1.6X and saying SMT benefit is 1.6X. That's insane and wrong conclusion.

You sure like that word. And he has a point. If you want to really know what SMT does in a vacuum, you run it on one core and test software with two threads. Then you turn off SMT and run it with one thread. Compare results. It's exactly what Anandtech did in their EPYC 7601 review that I linked elsewhere.

coercitiv · Apr 3, 2020

Oh look, Tachyum joined the chat and drops the expectation bomb again:

Every core is faster than a Xeon core or an Epyc core, and it is smaller than an Arm core, and overall, our chip is faster than a GPU on HPC and AI.

Wow, i guess it's too late for ARM now, they never really stood a chance. /s

However, this is not the most interesting part, but rather the discussion about software compatibility. Remember when people argued x86 has the upper hand based on the existing ecosystem and backwards compatibility? Well... get ready for a severe case of deja vu.

https://twitter.com/x/status/1245967869428596736

Those who reinvented the wheel warn against reinventing the wheel. "No point in bringing a processor to market that isn't compatible with all the software".

Nothingness · Apr 3, 2020

coercitiv said:
Oh look, Tachyum joined the chat and drops the expectation bomb again:

Wow, i guess it's too late for ARM now, they never really stood a chance. /s

That Itanium look and feel is funny, talk about deja vu

It just misses the x86 compatibility HW/SW Itanium had.

However, this is not the most interesting part, but rather the discussion about software compatibility. Remember when people argued x86 has the upper hand based on the existing ecosystem and backwards compatibility? Well... get ready for a severe case of deja vu.

https://twitter.com/x/status/1245967869428596736

Those who reinvented the wheel warn against reinventing the wheel. "No point in bringing a processor to market that isn't compatible with all the software".

He's stating the obvious: it's unlikely Tachyum will get anywhere before a few years. OTOH work on making x86 work on an alien architecture such as ARM will help porting to new architectures. But please not a glorified DSP/vector engine again

amrnuke · Apr 3, 2020

Richie Rich said:
Aha, you take real measure SMT benefit of 1.25X then instantly raping it to 1.6X and saying SMT benefit is 1.6X. That's insane and wrong conclusion.

The only thing what you are right is that 1.6X is "relative performance between 2c/2t and 1c/2t". Yes, different number of physical cores and different SMT activation status.

If you want to calculate SMT benefit then you have to normalize to same number of physical cores:

relative performance 1.6X needs to be divided it by 2 cores... 1.6 / 2 = 0.8X

this 0.8X is performance hit when SMT is OFF (now that's correct comparison of 1c/1t vs. 1c/2t)

do inverse function 1/0.8= 1.25X ....... voila.... you get same number like on beginning (different percentage base only)

You and Armnuke are raping numbers all the time and cannot do even a basic school math. That's pretty sad. I'm wonder how you can do some basic university stuff like diffencial equations when you are in trouble with this basics. However I hope now it's clear.

You're not understanding this.

Let's start from the beginning.

1) 6c/12t vs 12c/12t is a comparison designed to see whether a virtual core (SMT) = real core at the 12 thread level. It is not designed to estimate the performance boost of SMT on a given chip. In your very nice and accurate calculation, you have said nothing about the benefit of turning SMT on or off on a given chip. You've told us something about the benefit of using an extra "real" core over a "virtual" SMT thread, though. But again, I think your desire to be right is getting in the way of you actually thinking about what you're looking at. That test has nothing to do with the benefit of taking a set chip and turning SMT on or off. It was done for fun, that is all.

To find the benefit of SMT on vs SMT off, we keep the baseline thread count and target thread count the same, and achieve the target thread count via two different methods - doubling cores or turning on SMT.

2) 6c/12t vs 12c/24t would be designed to see the raw benefit of doubling threads via doubling cores
- overall, on all the tests, doubling the thread count from 12 to 24 by adding extra cores added ~28% to the performance
3) 12c/12t vs 12c/24t would be designed to see the raw benefit of doubling threads via SMT
- overall, on all the tests, doubling the thread count from 12 to 24 by turning on SMT added ~12% to the performance

We can then take SMT improvement / double core improvement and get a relative SMT benefit when turned on, which we did, which is ~43% overall.

(In such a test the 12c/12t may see a larger benefit because it would have double the cache per thread of the other chips, hence we may actually see an artificially SMALLER benefit of SMT than expected. The perfect test I think would require us to be able to limit the SMT off chip to 1/2 the available cache, so that all the threads in both SMT on and off have the same cache amount to work with.)

As for trying to insult coercitiv and me via disparaging our math skills (which look wrong to you because you aren't comprehending what is being talked about), it would be wise for you to not throw stones in a glass house...

Richie Rich · Apr 5, 2020

amrnuke said:
3) 12c/12t vs 12c/24t would be designed to see the raw benefit of doubling threads via SMT
- overall, on all the tests, doubling the thread count from 12 to 24 by turning on SMT added ~12% to the performance

We can then take SMT improvement / double core improvement and get a relative SMT benefit when turned on, which we did, which is ~43% overall.

To sum up:

real measurements SMT on/off ......... 12% according to TechPowerUp data
armnuke's crazy conclusion: SMT benefit is 43%

I wish I can convince my apps to run faster on my 3700X because Armnuke calculated so. However real code runs still the same no matter which crazy number you come up every day (it used to be 66% one day, 80% another, today 43%....).

If you want to extract pure SMT benefit clear of Ahmdal's scaling penalty then do not do crazy calculations but real measurement with multiple ST instances. period.

soresu · Apr 5, 2020

Richie Rich said:
armnuke's crazy conclusion: SMT benefit is 43%

You really should try reading things properly.

He said relative benefit - as in perf of SMT relative to the performance from doubling the cores is 43%, NOT perf relative to the non SMT score at the same core count (which is 12% as he said and you even quoted him saying oddly).

amrnuke · Apr 5, 2020

Richie Rich said:
To sum up:

real measurements SMT on/off ......... 12% according to TechPowerUp data

armnuke's crazy conclusion: SMT benefit is 43%

I wish I can convince my apps to run faster on my 3700X because Armnuke calculated so. However real code runs still the same no matter which crazy number you come up every day (it used to be 66% one day, 80% another, today 43%....).

If you want to extract pure SMT benefit clear of Ahmdal's scaling penalty then do not do crazy calculations but real measurement with multiple ST instances. period.

We are talking about two different things.

You're talking about raw performance. I'm talking about normalized performance.

Are you against normalizing performance? (Rhetorical question, but I want to hear your response.)

Richie Rich · Apr 5, 2020

soresu said:
He said relative benefit - as in perf of SMT relative to the performance from doubling the cores is 43%, NOT perf relative to the non SMT score at the same core count (which is 12% as he said and you even quoted him saying oddly).

And what is his crazy number good for? Does it help me to decide if whether I should turn SMT ON or OFF in BIOS? No. Maybe helpful to decide whether to buy 3600X or 3900X for desktop application.... however it's obvious what's better to buy. But since CPU is bought and running we are back in 12% SMT benefit with almost half of IPC per thread (112 / 2 = 56%).

Graviton2 has a higher IPC per thread and that's what really matters in some loads. G2 server applications (web and SQL services) scale perfect. There is no such a problem as for typical x86 desktop SW heavily dealing with Ahmdal's law (single user, multiple cores). G2 application is multiple users per one core. These Armnuke's conclusions based on single user desktop load is ridiculous and cannot be used for server G2 predictions.

amrnuke · Apr 5, 2020

Richie Rich said:
And what is his crazy number good for? Does it help me to decide if whether I should turn SMT ON or OFF in BIOS? No. Maybe helpful to decide whether to buy 3600X or 3900X for desktop application.... however it's obvious what's better to buy. But since CPU is bought and running we are back in 12% SMT benefit with almost half of IPC per thread (112 / 2 = 56%).

Doubling the cores takes the IPC per thread down to 128 / 2 = 64%.
Which is why SMT benefit was normalized.
When you want to build a chip, and you're trying to determine if adding cores would be better, or just designing around SMT, such data are critically important.
Also if you are deciding between a 64C/64T chip and a 32C/64T chip, it is helpful.

You seem to be against normalization, yet you are happy to normalize data when it fits your argument, for example, in your signature.

coercitiv · Apr 5, 2020

Richie Rich said:
Graviton2 has a higher IPC per thread and that's what really matters in some loads. G2 server applications (web and SQL services) scale perfect. There is no such a problem as for typical x86 desktop SW heavily dealing with Ahmdal's law (single user, multiple cores). G2 application is multiple users per one core. These Armnuke's conclusions based on single user desktop load is ridiculous and cannot be used for server G2 predictions.

Shifting goals again: web applications and SQL services because they scale better with core count? Well, surprise surprise, they also scale better with SMT.

SMT4 brings SQL gains up to 80%, web apps up to 40%. And that's according to people who actually build ARM server chips.

Richie Rich · Apr 5, 2020

Look, I'm big fan of SMT4 at super wide core. I know that some very low ILP code can benefit from SMT4 a lot. But....

But at what cost? There is always some trade off.

If SMT2 cost 10% more transistors and brings avg 28% (50% in SQL)..... that's a good deal.
If SMT4 cost 20% more transistors and brings avg 35% (80% in SQL).....that's lower performance gain per transistor but still a good deal.
But if you can fit two A77 cores in the same area of one Zen2 core that's a game changer. Especially when that small A77 is wider and has higher IPC by 8%. That's a great deal for Amazon and other ARM server contenders.

To sum up: SMT helps x86 to be less garbage. Apple is able to extract 83% more IPC without any SMT. ARM with his Cortex cores tries to follow number one in CPU business which is currently Apple. Nuvia CPU can have SMT2 or SMT4 however it will make number one even better in some server workloads.

Markfw · Apr 5, 2020

Richie Rich said:
Look, I'm big fan of SMT4 at super wide core. I know that some very low ILP code can benefit from SMT4 a lot. But....

But at what cost? There is always some trade off.

If SMT2 cost 10% more transistors and brings avg 28% (50% in SQL)..... that's a good deal.

If SMT4 cost 20% more transistors and brings avg 35% (80% in SQL).....that's lower performance gain per transistor but still a good deal.

But if you can fit two A77 cores in the same area of one Zen2 core that's a game changer. Especially when that small A77 is wider and has higher IPC by 8%. That's a great deal for Amazon and other ARM server contenders.

To sum up: SMT helps x86 to be less garbage. Apple is able to extract 83% more IPC without any SMT. ARM with his Cortex cores tries to follow number one in CPU business which is currently Apple. Nuvia CPU can have SMT2 or SMT4 however it will make number one even better in some server workloads.

Apple does not get 83% more IPC. Also, you can't just calculate it "per ghz" since the Apple is designed for low speed, it appear to have more IPC, but it won;t clock to 4 ghz.

The overall throughput of Ryzen is better than Apple, but each chip was designed for a different use, and optimized for that use.

insertcarehere · Apr 6, 2020

Markfw said:
Apple does not get 83% more IPC. Also, you can't just calculate it "per ghz" since the Apple is designed for low speed, it appear to have more IPC, but it won;t clock to 4 ghz.

The overall throughput of Ryzen is better than Apple, but each chip was designed for a different use, and optimized for that use.

Bringing Apple's cores in just illustrates that there's plenty of IPC headroom for both other x86 & ARM cores without using SMT.

Touting x86-64 as achieving higher throughputs due to higher clocks is also a moot point since the high-core count Rome SKUs mostly clock at 2.5-3.0 ghz... Which is in line with what Qualcomm & Apple are clocking their SoCs, and where Gravitron 2 has been reportedly clocked at as well.

lobz · Apr 6, 2020

amrnuke said:
You're not understanding this.

Let's start from the beginning.

1) 6c/12t vs 12c/12t is a comparison designed to see whether a virtual core (SMT) = real core at the 12 thread level. It is not designed to estimate the performance boost of SMT on a given chip. In your very nice and accurate calculation, you have said nothing about the benefit of turning SMT on or off on a given chip. You've told us something about the benefit of using an extra "real" core over a "virtual" SMT thread, though. But again, I think your desire to be right is getting in the way of you actually thinking about what you're looking at. That test has nothing to do with the benefit of taking a set chip and turning SMT on or off. It was done for fun, that is all.

To find the benefit of SMT on vs SMT off, we keep the baseline thread count and target thread count the same, and achieve the target thread count via two different methods - doubling cores or turning on SMT.

2) 6c/12t vs 12c/24t would be designed to see the raw benefit of doubling threads via doubling cores
- overall, on all the tests, doubling the thread count from 12 to 24 by adding extra cores added ~28% to the performance
3) 12c/12t vs 12c/24t would be designed to see the raw benefit of doubling threads via SMT
- overall, on all the tests, doubling the thread count from 12 to 24 by turning on SMT added ~12% to the performance

We can then take SMT improvement / double core improvement and get a relative SMT benefit when turned on, which we did, which is ~43% overall.

(In such a test the 12c/12t may see a larger benefit because it would have double the cache per thread of the other chips, hence we may actually see an artificially SMALLER benefit of SMT than expected. The perfect test I think would require us to be able to limit the SMT off chip to 1/2 the available cache, so that all the threads in both SMT on and off have the same cache amount to work with.)

As for trying to insult coercitiv and me via disparaging our math skills (which look wrong to you because you aren't comprehending what is being talked about), it would be wise for you to not throw stones in a glass house...

I sincerely admire your patience for

insertcarehere said:
Bringing Apple's cores in just illustrates that there's plenty of IPC headroom for both other x86 & ARM cores without using SMT.

Touting x86-64 as achieving higher throughputs due to higher clocks is also a moot point since the high-core count Rome SKUs mostly clock at 2.5-3.0 ghz... Which is in line with what Qualcomm & Apple are clocking their SoCs, and where Gravitron 2 has been reportedly clocked at as well.

Maybe so, but that's not how Richie meant it smh

coercitiv · Apr 6, 2020

insertcarehere said:
Bringing Apple's cores in just illustrates that there's plenty of IPC headroom for both other x86 & ARM cores without using SMT.

This reminds me of the days when bringing Apple's cores in illustrated the lack of true purpose for big.LITTLE configurations in mobile.

Andrei. · Apr 6, 2020

Markfw said:
Apple does not get 83% more IPC. Also, you can't just calculate it "per ghz" since the Apple is designed for low speed, it appear to have more IPC, but it won;t clock to 4 ghz.

The overall throughput of Ryzen is better than Apple, but each chip was designed for a different use, and optimized for that use.

Apple does get 83% more IPC. Please don't spread technical nonsense which can be disproved easily by things such as performance counters. Bolding such false statements also isn't a good image from a CPU forum moderator.

Moderator call-outs are not allowed.
If you have an issue with a moderator,
you create a post in Moderator Discussions.

Additionally, Markfw was not posting as a moderator,
so that is something else you can't bring up in a thread.

AT Mod Usandthem

NTMBK · Apr 6, 2020

The review of Samsung's M5 CPU is up on the front page... What a disaster. A big, wide, Apple style core which loses to stock ARM cores. If anything it makes me even more impressed with Apple's achievements!

Andrei. · Apr 6, 2020

NTMBK said:
The review of Samsung's M5 CPU is up on the front page... What a disaster. A big, wide, Apple style core which loses to stock ARM cores. If anything it makes me even more impressed with Apple's achievements!

It's wide-ish on paper but falls apart on some regards, stuff like dispatch width is still 50% narrower than Apple. The design was botched in the M3 and they never recovered.

Nothingness · Apr 6, 2020

NTMBK said:
The review of Samsung's M5 CPU is up on the front page... What a disaster. A big, wide, Apple style core which loses to stock ARM cores.

And that's the one we're getting in my area. I was about to pick the S20 but I will definitely skip it.

If anything it makes me even more impressed with Apple's achievements!

It also highlights that ARM Ltd cores have been getting very competitive, both from a performance point of view and a from a power efficiency point of view.

amrnuke · Apr 6, 2020

Andrei. said:
Apple does get 83% more IPC. Please don't spread technical nonsense which can be disproved easily by things such as performance counters. Bolding such false statements also isn't a good image from a CPU forum moderator.

While I agree that the A13 is a hell of a feat of engineering, I want to make sure we are being accurate, and not just consistent, with our grading of things. Because it's more complex than that, as you know.

Here's my diatribe:

About IPC, work done, uops, macroops, etc.
We don't actually have IPC. We have SPECint2006 scores, and AT use the same flags x86 vs mobile so it's not important whether it's SPECint2006 base or not. These scores are normalized ratios as well, which I'm sure some people will hate about it (just kidding!).

The way SPECint2006 measures "speed" - it is not actually considering IPC. It is considering time it takes to complete a set of tasks relative to a reference computer. But it is NOT measuring IPC or anything really analogous to it. You know why, Andrei, but I'll spell it out a little more for others like Richie Rich who seem to conflate work done with IPC.

Many people conflate IPC with work per cycle, which is false. IPC is the number of instructions the CPU can process per cycle. But what do we mean by instruction? Is it how many register transfers can be done per cycle (quite pure)? Or is it more complex: is it how many micro-ops can be done in a cycle, or how many macro-ops? And by whose definition (since even the x86 vendors use different definitions of both uop and macroop)? Or is it just how many program instructions can be burned through?

Perhaps we are most concerned with "work done" as a function of a program asking the CPU to do a task. So, if the benchmark asks the CPU to multiply the number at location x by location y and store it back at x, it sends different instructions to different ISAs:

As an example:
CISC:
MULT x, y

RISC:
LOAD A, x
LOAD B, y
PROD A, B
STORE x, A

For CISC that's one instruction and for RISC it's 4 instructions (if we take a pure "issue-based" count). If it takes four cycles to complete the task, as you'd expect, then that's 0.25 IPC for CISC and 1 IPC for RISC. But they're doing the same exact amount of work, it's just that MULT x,y is a container for 4 smaller instructions, uops. If we instead count uops, then RISC IPC = CISC IPC. But in the case of counting pure CPU instruction issues, then the IPC on the RISC chip will be artificially four times higher than on CISC.

So let's ask ourselves what we mean by IPC, and if it's even relevant: i.e. are we counting instructions issued to the chip, decoded instructions ("uops"), or macro ops? Or something else?

In the end, with the SPECint2006 scores, we are counting work done, not IPC. I think that work done is a more valid comparison. But that's up to each person. What is clear is that SPECint2006 is NOT IPC.

So it is truly VERY difficult to say with any certainty at all what the true IPC of any chip is, based on the information we have, and it's even harder when comparing x86 vs ARM because of the complexities in comparison between chips that heavily use uops and those that don't. Granted, yes, some benchmarks will send instructions that don't need to be decoded, thus removing this limitation, but boy are we going to have to dive deep if we want to compare, on a program-by-program basis, what the true IPC is. And that's only after deciding what constitutes a true "instruction".

About SPECint2006 normalized to GHz
Per the basic calculations in Richie Rich's signature, which are accurate given what he claims to measure, the A13 can burn through 83% more work per clock than 9900K or 3900X.

However, this is so simplified and does not go into detail. The A13 can clock up to 2.66 GHz, but doesn't all the time. Same for the 9900K and 3900X with their "boost" speeds. And on the SPECint2006 benchmarks, we don't actually know what the average clockspeed was on a test that takes a very long time to run, and may be thermally limited. So even TRYING to find out what the real IPC OR work per clock is for these chips is immediately a fairly futile task unless we get more data.

My conclusion
1. We are all over the place with our discussion with respect to IPC vs SPECint2006 vs real work achieved vs work per socket vs work per watt.
2. Richie Rich's signature claims an IPC victory for A13, but it's completely false, as no one has compared anything analogous to IPC on those chips. It doesn't even accurately encompass "work done per cycle" because he uses boost frequency as the normalizer and doesn't consider average clock speed during the benchmark (which by the way isn't even published it seems, hence we cannot know for certain the true work done per cycle).
3. This is still very fun to talk about and very applicable to Graviton2's future.

Andrei. · Apr 6, 2020

I'm not even going to read your whole post because it's the same old stupid IPC story which is just wrong with absolutely no tether to reality. The whole CISC vs RISC thing being brought up is akin to no technical knowledge on the topic.

(For those people understanding the topic know the value of the table I'm posting above and what I just did in a forum thread)

The above are actual IPC figures for the A12 along with retired instruction count. The whole argument is void because the instruction count for the workload isn't wildly different than on x86. I had done the x86 vs AArch64 instruction comparison before as I said repeatedly over the last year where this keeps being brought up, it doesn't majorly differ much beyond a ~10% divergence depending on the test.

I'll run fresh IPC figures on desktops in a few months but you can refer to many other x86 resources for actual IPC figures, for sample https://dl.acm.org/doi/fullHtml/10.1145/3369383 / https://dl.acm.org/cms/attachment/3cb26a5a-f323-4a19-ba4b-d7f3cdd23fb7/taco1604-46-f05.jpg

The TLDR; is that yes Apple has 80%+ higher IPC than Intel. Get over it, stop trying to deny reality.

Nothingness · Apr 6, 2020

@amrnuke The IPC term use here is overloaded.

FWIW I made some x86-64 vs AArch64 instruction measurements some years ago. AArch64 is very competitive both in terms of number of instruction and in terms of total instruction size (that was to assess instruction density). People who think AArch64 lags behind x86 in terms of ISA just didn't study it.

EDIT: @Andrei. beat me to it. Anyway our measurements seem to give similar results.

name99 · Apr 6, 2020

Andrei. said:
I'm not even going to read your whole post because it's the same old stupid IPC story which is just wrong with absolutely no tether to reality. The whole CISC vs RISC thing being brought up is akin to no technical knowledge on the topic.

View attachment 19232

(For those people understanding the topic know the value of the table I'm posting above and what I just did in a forum thread)

Oh dude, we know! Thanks a million

And that hmmer number!!! Damn!!!

amrnuke · Apr 6, 2020

Andrei. said:
I'm not even going to read your whole post because it's the same old stupid IPC story which is just wrong with absolutely no tether to reality. The whole CISC vs RISC thing being brought up is akin to no technical knowledge on the topic.

View attachment 19232

(For those people understanding the topic know the value of the table I'm posting above and what I just did in a forum thread)

The above are actual IPC figures for the A12 along with retired instruction count. The whole argument is void because the instruction count for the workload isn't wildly different than on x86. I had done the x86 vs AArch64 instruction comparison before as I said repeatedly over the last year where this keeps being brought up, it doesn't majorly differ much beyond a ~10% divergence depending on the test.

I'll run fresh IPC figures on desktops in a few months but you can refer to many other x86 resources for actual IPC figures, for sample https://dl.acm.org/doi/fullHtml/10.1145/3369383 / https://dl.acm.org/cms/attachment/3cb26a5a-f323-4a19-ba4b-d7f3cdd23fb7/taco1604-46-f05.jpg

The TLDR; is that yes Apple has 80%+ higher IPC than Intel. Get over it, stop trying to deny reality.

I wish you had read my post instead of just assuming it was a "stupid" old argument.

FWIW, I believe you that Apple has 80%+ higher IPC than Intel. I also believe that Apple has 80%+ work per cycle lead. I don't question this, and to the best of my memory, I never have. Apple have done remarkable work on their chips.

However, the topic in question is not so much RISC vs CISC, as it is Richie Rich's signature, which derives "IPC" from SPECint2006 scores, which is basically what my entire post is about - discussing what is actually meant by IPC, since he seems to have no clue. I really wanted an answer from him before you answered, so I should have quoted him instead, but, so be it. At least maybe he can read this and learn about it more. I believe you just saw four RISC calculations and one CISC calculation and you just assumed I was going to claim IPC is skewed - which it could be, but probably isn't that much, as I stated ("some benchmarks will send instructions that don't need to be decoded, thus removing this limitation").

Long story short, he's claiming 83% IPC lead based on SPECint2006 data, which is false equivalence. Just because he arrived at a correct answer does not mean he is intellectually correct. At least, not until he has the data to back it up - which he doesn't. His data backup an 83% workload per cycle lead. He needs to update his signature, because his signature wrongly labels relative work per cycle as IPC. That's what the majority of my post was about - what does he mean by IPC?

In any case, until we have the A13 vs 3900X and 9900K data, one cannot claim 83% IPC lead, correct?

But if we are going to talk about IPC on RISC vs CISC or ARM vs Intel vs AMD... well, it's absolutely interesting. With respect to the IPC chart and references you give, thank you. Do you mind providing a chart on retired instructions for x86 that you have done previously? I don't really care if its Skylake or Zen1 or whatever. Just any data. I'm curious about it. Because in the paper you linked, I don't see references for the total number of retired instructions, and I worry about whether they and you have done the calculations similarly and on similar-enough setups to make it worthwhile.

Also, finally, can you speak on their statement: "Obviously, IPC cannot be used as a performance metric since two different ISAs are being evaluated." Is this pertinent since ARM/x86 are different ISAs? The authors state that in such a case of different ISAs, FLOPC is a more accurate way of comparing two ISAs. Does that apply here, and if not, why not?

Discussion AWS Graviton2 64 vCPU Arm CPU Heightens War of Intel Betrayal

Senior member

Senior member

Lifer

Diamond Member

Diamond Member

Golden Member

Senior member

Diamond Member

Golden Member

Senior member

Golden Member

Diamond Member

Senior member

Moderator Emeritus, Elite Member

Senior member

Platinum Member

Diamond Member

Senior member

Lifer

Senior member

Diamond Member

Golden Member

Senior member

Diamond Member

Senior member

Golden Member