Question for those in the know: does SPEC CPU 2006 run predominantly from the cache?
I've been doing some reading to try to understand how the A12X can be so performant in these benchmarks. My mind has tremendous difficulty accepting that such a low-power CPU can perform the way it does. Anyway, I came across an interesting Reddit thread during my research:
ARM vs x86 IPC
The OP of that post basically argues that Apple's ARM CPUs have surpassed Intel in terms of IPC, and spends most of the rest of the thread defending that assertion rather successfully... until another poster, EqualityofAutonomy, responds with a counterargument that, to me, finally renders the OP's assertion false in a logical manner.
I'll let EqualityofAutonomy's words speak for themselves rather than attempt to paraphrase him:
Lowering clocks increases IPC.
Raising clocks decreases IPC.
Most software, in practical use, is probably memory bound.
These are bad comparisons because you're likely just seeing a highly theoretical packed blob that fits neatly in L1 cache performing SIMD to maximize throughput.
Real world problems often don't fit in cache and aren't ridiculously simple and flawlessly optimized canned benchmarks and never get close to theoretical IPC.
You clearly don't even understand IPC. You can't manipulate it like that because it's not a linear relationship. It's a curve, as in higher frequencies produce less and less IPC. Lower frequencies produce greater and greater IPC. But it's not smooth. It's wild and rocky as boundaries of alignment are passed through. You'll see plateaus and seemingly unpredictable spikes. Because at the end of the day with all the background noise there's no reproducible test. Every run is slightly different. The scheduler dispatching similarly but differently. No run is truly identical to another.
The greater the frequency, the more stalls occur and the longer instructions can take to retire. That's the more important metric. Increasing frequency can (will) increase the number of cycles instructions take to retire, thus lowering IPC. The benefit is that sometimes that's okay, because the clock increase outweighs the IPC decrease.
Would you rather have a 1 GHz chip with 10 IPC or a 5 GHz chip with 3 IPC? That's 10 billion versus 15 billion instructions per second. That's the sad reality. Okay, those numbers are totally made up for an example. But underclocking is very real. Sometimes performance gains happen due to factors like better thermals and less throttling.
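Just to sanity-check that arithmetic (the IPC and clock numbers below are the made-up ones from the quote, not measurements of any real chip):

```python
# Throughput = IPC x clock frequency, in instructions retired per second.
# The numbers are the illustrative ones from the quote above.

def throughput(ipc, freq_hz):
    """Instructions retired per second for a given IPC and clock."""
    return ipc * freq_hz

slow_wide = throughput(ipc=10, freq_hz=1e9)   # 1 GHz, 10 IPC
fast_narrow = throughput(ipc=3, freq_hz=5e9)  # 5 GHz, 3 IPC

print(f"1 GHz @ 10 IPC: {slow_wide:.1e} instr/s")    # 1.0e+10, i.e. 10 billion
print(f"5 GHz @  3 IPC: {fast_narrow:.1e} instr/s")  # 1.5e+10, i.e. 15 billion

# IPC the 1 GHz design would need just to break even with the 5 GHz one:
print(f"Break-even IPC at 1 GHz: {fast_narrow / 1e9:.0f}")  # 15
```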
To me, his argument made sense. It's common knowledge that many PC workloads are sensitive to memory latency; gaming is a good example. In fact, that is exactly why Zen 2 comes furnished with such a large L3 cache: to reduce effective memory latency. What I did not know, however, assuming EqualityofAutonomy's assertion is true, was that raising clock speeds can decrease IPC. The reason he postulates is that the faster a CPU cycles, the more cycles it loses to memory latency and stalls. Both Intel and AMD have spent enormous numbers of transistors on minimizing the effects of memory latency and stalls, because the majority of code tends to be poorly written.
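Here's a rough back-of-the-envelope model of that effect (the base CPI, miss rate, and DRAM latency below are assumptions I picked for illustration, not measured figures): a cache miss costs a roughly fixed number of nanoseconds, so at a higher clock the same miss burns more core cycles, and the measured IPC drops even though the code hasn't changed.

```python
# Toy model: effective IPC when a fixed-nanosecond DRAM latency is charged
# in core cycles. All parameters are illustrative assumptions.

def effective_ipc(freq_ghz, base_cpi=0.5, miss_rate=0.01, dram_latency_ns=80):
    """base_cpi: cycles per instruction if memory were free.
    miss_rate: fraction of instructions that stall on a DRAM access.
    dram_latency_ns: time per miss, roughly fixed in wall-clock terms."""
    stall_cycles = dram_latency_ns * freq_ghz  # same nanoseconds = more cycles at a higher clock
    cpi = base_cpi + miss_rate * stall_cycles
    return 1.0 / cpi

for f in (1.0, 3.0, 5.0):
    ipc = effective_ipc(f)
    print(f"{f:.0f} GHz: IPC ~ {ipc:.2f}, throughput ~ {ipc * f:.2f} G instr/s")

# IPC falls as the clock rises, but total throughput still creeps up
# with diminishing returns, which is exactly the tradeoff described in the quote.
```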
But with Apple's closed ecosystem, they control both the hardware and the software from the top down, so they can realize an efficiency that Intel or AMD could never approach in the PC's open ecosystem. I mean, realistically, would an x86-64 CPU resembling the A12X's Vortex cores do well on real-world code that isn't hyper-optimized, compared to something like a Core i7? My gut tells me no.