Question: x86 and ARM architectures comparison thread


DavidC1

Golden Member
Dec 29, 2023
They aren't clock normalized by the way. You have to lop off 20% on Zen 4.

Also, the comparison to Gracemont is important: the doubled FP units are the single biggest boost to FP performance, accounting for the 60-70% gain versus the roughly 30% it would have been without them.

If you take the existing config (whatever it is) and double the number of units, performance in all existing applications would increase significantly, without recompiling code.

There are many talks in software circles about how technical complexity demands extra resources and adds development burden. Hence, every time you introduce a new feature that requires that kind of work, adoption drops significantly. AVX is much less of an advantage over SSE, AVX2 much less over AVX, and AVX512 much less over AVX2. And a new one every 2 years at that!

Note that Intel was hinting at even wider 1024-bit extensions. You know what stopped them? Losing their monopoly and ending up close to bankruptcy! If wider vectors are such a magic thing, why don't they keep expanding them?

This is stupid. Game developers especially keep talking about how much time and how many resources it takes just to address the technical side. More complexity means more resources and more compute power, which results in yet more complexity. AI too, with a recent article talking about how it has driven up electricity costs for residents in many parts of the US. How is that a benefit? It's not far from being a mass Ponzi scheme.
 

Geddagod

Golden Member
Dec 28, 2021
Anyone know, or have any speculation on, why the X925 has so much smaller an L1D cache than the ARM competition, despite running at lower clocks and having the same latency in cycles?
 

johnsonwax

Senior member
Jun 27, 2024
This is stupid. Game developers especially keep talking about how much time and how many resources it takes just to address the technical side. More complexity means more resources and more compute power, which results in yet more complexity. AI too, with a recent article talking about how it has driven up electricity costs for residents in many parts of the US. How is that a benefit? It's not far from being a mass Ponzi scheme.
The other problem is that you can't design a game around a subset of an architecture, unless that subset is in a major console, because doing so makes the game unportable to other architectures, and at least for AAA games you usually need all the sales you can get. Apart from platform exclusives, you have to design more or less to a lowest common denominator.
 
Jul 27, 2020
Apart from platform exclusives, you have to design more or less to a lowest common denominator.
It's called being lazy. There's nothing stopping game developers from detecting the ISA extensions available on the CPU and then using the appropriately optimized DLL (which may sometimes be as simple as just targeting the desired ISA extensions during compilation) to take full advantage of the CPU's capabilities.
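To be concrete, on GCC/Clang the runtime detection part is only a few lines. A minimal sketch of mine, assuming x86 and GCC/Clang; the sum_* kernel names and the trivial loop body are hypothetical placeholders, not anything from this thread:

// Same source loop compiled three ways, plus runtime dispatch via CPUID.
#include <cstddef>

__attribute__((target("avx512f")))
static float sum_avx512(const float* p, std::size_t n)
{
    float s = 0.f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];   // compiler can use 512-bit registers here
    return s;
}

__attribute__((target("avx2")))
static float sum_avx2(const float* p, std::size_t n)
{
    float s = 0.f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];   // same code, AVX2 codegen
    return s;
}

static float sum_scalar(const float* p, std::size_t n)
{
    float s = 0.f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];   // baseline fallback
    return s;
}

// Runtime dispatch: __builtin_cpu_supports checks CPUID (cache the choice in real code).
float sum_dispatch(const float* p, std::size_t n)
{
    if (__builtin_cpu_supports("avx512f")) return sum_avx512(p, n);
    if (__builtin_cpu_supports("avx2"))    return sum_avx2(p, n);
    return sum_scalar(p, n);
}

GCC also offers the target_clones attribute, which generates and dispatches the variants automatically, which is roughly the "just targeting the desired ISA extensions during compilation" case.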
 

511

Diamond Member
Jul 12, 2024
Games are already complicated, and if you have to support 3 different versions it's going to be a pain for developers.
 

MS_AT

Senior member
Jul 15, 2024
You do know it's not clock normalized right?
I do, but the LunarLake Skymont core and Zen5c have the same frequency. (In the second article, I mean the same max frequency, but that's the best we can count on under these circumstances.)

You will of course say that Skymont on LunarLake is handicapped by the lack of L3. That is why we have the first article, where desktop Skymont is clocked at 4.6GHz, has access to a much more generous cache setup and to lower-latency memory than Zen5c in Strix Point, and is still losing in some FP subtests to the Zen5c core.

Likewise with the Zen4 comparisons: if the FPU pipe mix were the deciding factor and the ability to do 4 FP adds or 4 FP muls per cycle dominated any benchmark, then Skymont would have pulled ahead of Zen4 despite the frequency disadvantage (a 20% frequency deficit vs a 100% per-cycle execution advantage). Yet it cannot dominate even the Zen5c mobile core, which suggests the pipe arrangement the other cores have is sufficient in practice.

Anyway, the point is, I do not know of a benchmark that would single out the Skymont FPU pipe arrangement and show it to be better in practice than LionCove/Zen5. So I was asking if you know of any?

Also, the comparison to Gracemont is important: the doubled FP units are the single biggest boost to FP performance, accounting for the 60-70% gain versus the roughly 30% it would have been without them.
Well, I was not saying doubling does not bring benefits...

If you take the existing config (whatever it is) and double the number of units, performance in all existing applications would increase significantly, without recompiling code.
That is a questionable statement. It is too broad, so it can easily be proven wrong if we assume cache to be the resource that is getting doubled. Still, that was not what I was asking about.

There are many talks in software circles about how technical complexity demands extra resources and adds development burden. Hence, every time you introduce a new feature that requires that kind of work, adoption drops significantly. AVX is much less of an advantage over SSE, AVX2 much less over AVX, and AVX512 much less over AVX2. And a new one every 2 years at that!
The reason for the low adoption rate is Intel's business strategy. If they had adopted new instruction sets as baselines across the whole stack (meaning a $5 Celeron and a Xeon would support the same ISA regardless of performance), AVX512 would be enjoying wide adoption by now.

But could you please link the talks you mentioned? They are usually interesting to watch.

You know what stopped them? Losing their monopoly and ending up close to bankruptcy! If wider vectors are such a magic thing, why don't they keep expanding them?
A vector register wider than a cache line does not make sense; even Intel engineers have commented as much. Marketing is, well, marketing. But from personal experience, having the register width match the cache-line width makes life easier.

The other problem is that you can't design a game around a subset of an architecture, unless that subset is in a major console, because doing so makes the game unportable to other architectures, and at least for AAA games you usually need all the sales you can get. Apart from platform exclusives, you have to design more or less to a lowest common denominator.
Teach people that object-oriented design is a tool, not a universal answer to every problem, and that data-oriented design also exists. If they have CPU-friendly code in the first place, then regardless of where they port the code it will do well, as the operating principles are the same across architectures.
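To illustrate what I mean by CPU-friendly, here is a toy sketch of the array-of-structs vs struct-of-arrays idea (my own example, not code from any engine):

// The same particle update in two layouts.
#include <cstddef>
#include <vector>

// Array-of-structs: the fields of each object interleaved in memory.
struct ParticleAoS { float x, y, vx, vy; /* ...plus whatever else the object carries */ };

void integrate_aos(std::vector<ParticleAoS>& ps, float dt)
{
    for (auto& p : ps) { p.x += p.vx * dt; p.y += p.vy * dt; }
}

// Struct-of-arrays: each field contiguous, so the loop below auto-vectorizes
// on SSE, AVX, NEON, SVE... without any source changes.
struct ParticlesSoA { std::vector<float> x, y, vx, vy; };

void integrate_soa(ParticlesSoA& ps, float dt)
{
    for (std::size_t i = 0; i < ps.x.size(); ++i) {
        ps.x[i] += ps.vx[i] * dt;
        ps.y[i] += ps.vy[i] * dt;
    }
}

The second version ports to any architecture with a vector unit and lets the compiler pick the width; the first leaves performance on the table everywhere.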
 

poke01

Diamond Member
Mar 8, 2022
The video shows why the Intel laptop is priced higher. Just a better overall experience if you are not married to an internet browser. And the ARM laptop emitting more fan noise? Wow. Trust Qualcomm to make ARM look bad!
Qualcomm should’ve held off till V3 was ready. Oh well, they totally rushed their WoA launch once again. Hopefully the 6th time’s the charm, right?
 

camel-cdr

Member
Feb 23, 2024
AVX is much less of an advantage over SSE, AVX2 much less over AVX, and AVX512 much less over AVX2
While I agree with the principle, AVX512 was a much bigger upgrade over AVX2 than AVX2 over SSE4 imo, because it added much more powerful instructions.

The better solution is obviously proper ISA design that doesn't require a full rewrite if you want to widen your vector units.
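For anyone unfamiliar, that is roughly what ARM SVE and RISC-V V do: the code never hard-codes a vector width. A rough sketch of mine using SVE intrinsics, assuming an AArch64 compiler with arm_sve.h:

// Vector-length-agnostic loop: the same binary uses 128-bit, 256-bit or wider
// vectors depending on what the hardware implements.
#include <arm_sve.h>
#include <stddef.h>

void scale(float* dst, const float* src, float k, size_t n)
{
    for (size_t i = 0; i < n; i += svcntw()) {      // svcntw() = 32-bit lanes per vector
        svbool_t pg = svwhilelt_b32(i, n);          // predicate also handles the loop tail
        svfloat32_t v = svld1_f32(pg, src + i);
        svst1_f32(pg, dst + i, svmul_n_f32_x(pg, v, k));
    }
}

Widening the vector units then becomes a hardware change rather than a software rewrite, which is the property the SSE -> AVX -> AVX512 path never had.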
 

johnsonwax

Senior member
Jun 27, 2024
Teach people that object-oriented design is a tool, not a universal answer to every problem, and that data-oriented design also exists. If they have CPU-friendly code in the first place, then regardless of where they port the code it will do well, as the operating principles are the same across architectures.
How do you port the physical SIMD unit that allows the game to function at the necessary performance in the first place? Note, there are two kinds of AAA games - platform exclusives designed to emphasize the differences between competitors and widely accepted games that need to avoid those differences so they can have a reasonable degree of parity. So if the new compute unit enables the game to do something new, it won't be able to do that new thing on other hardware without an equivalent to that compute unit. This is why features like ray tracing took a while to actually arrive in AAA, because while it was one thing to add it as a shader pack in Minecraft, it wasn't ubiquitous enough for Rockstar to design a game around - that took a few years. You can get away with some scaling issues - simpler geometry on weaker platforms, possibly shorter load distances provided that isn't crucial to the gameplay, etc. You can't lower FPS. You can't simplify physics, etc. So any specialized compute can't be used in an enabling capacity or else your game isn't portable. It has f-all to do with the nature of how you write the code, and everything to do with the design of the game.

I mentioned this before in how CP2077 failed on PS4 because the game was designed for the PS5 generation of consoles and assumed there would be a fast SSD to stream assets off of storage. But the PS4 had a fairly slow spinning drive that couldn't do that, and storage was the main bottleneck to the game working on the prior generation of consoles. It had nothing to do with the coding paradigm and everything to do with the assumptions being made about the capabilities of the underlying hardware. CDPR couldn't just massage their way around storage that was 10x slower than what their game needed. Had they assumed spinning drives, they would have had to consider design solutions like loading screens, greater asset reuse to keep it on the GPU, not having cars in the game that require loading the map at a certain rate, etc. These are design questions, not OO vs functional programming.

If you've never written code around hardware constraints like this you don't understand what's involved. A lot of us old guys learned to code in such an era - when platform A had a blitter that could support 8 sprites and platform B had one that could support 16 sprites, if you wanted your game to run on both, you had to design it around 8 sprites. You couldn't use the extra 8 sprites as part of your core gameplay. Maybe it could be aesthetic, but for any cross-platform design that additional hardware capability would have to go largely unused. Now, portability wasn't an issue back then - you were usually writing everything in assembly, and your 'port' was a complete rewrite. Thankfully applications were a lot simpler then.

Today most of that is handled by the game engine which handles 90% of the lift, but these engines are also designed to help the developer deal with hardware disparities by handling LOD issues, etc. But that also means that the game engine isn't going to run too far down some hardware rabbit hole that would unlock a new set of capabilities because much of the point of game engines is that they also handle most of the porting lift for you, and that rabbit hole makes that harder to impossible. It's great for platform exclusives, and a trap for everything else, and that's why developers avoid it. It's not worth boxing them into one platform. Once that bit of technology gets wide adoption and you see SIMD more broadly, etc. then you start to build around it. Again, the point of a platform exclusive is to show off the capabilities of the platform and so they will make the most of it (note the Wii games that were designed around the controllers, and in no universe could be ported to other platforms, not because it was 'too object oriented' whatever the f that means, but because nobody else had nunchucks that would let you stand in the middle of your living room which is what made those games fun.) But because the Wii rejected the emerging console design language around two triggers, two shoulder buttons, two analog sticks, 4 action buttons and a D pad, it was hard to port games to the Wii because the controls needed simply weren't there. Note the Switch adopts this design language, and games can be ported to it. Again, nothing to do with the coding paradigm. So AAA ported games are largely an exercise in maximizing your lowest common denominator hardware, allowing the game to scale where it doesn't impact gameplay, and refusing to utilize hardware features where it would.

And because the economics of the game industry have gotten so bad around AAA titles, they simply can't function as platform exclusives unless they are being heavily subsidized, now by TV and movie rights, merch, etc. And so it's reached a point where there is simply no room for hardware diversity. Xbox and PS are pretty close to the exact same hardware now because every time they introduce haptic triggers or whatever, that gets sanded off by the GTA VI/CP/Battlefield games and becomes irrelevant. Gaming hardware is fully a design-by-committee exercise now.
 

MS_AT

Senior member
Jul 15, 2024
How do you port the physical SIMD unit that allows the game to function at the necessary performance in the first place?
I think we have gone like two levels deeper than what my comment was talking about.

So bear with me as I try to understand. You are talking about a hypothetical situation where you have based a program (or a game?) around specific characteristics of a SIMD unit available on a particular microarchitecture (let's call it A), and are asking how to account for that case when you port to microarchitecture B, where that specific unit or functionality is not available? Do you have a real-life example of what that may be?

These are design questions, not OO vs functional programming.
I agree. According to Data Oriented Design principles, they would have designed their software around the data flow on the target platforms and their constraints. That means the PS4 was not considered at the time they were designing the game, as otherwise they would have accounted for the fact that they cannot stream assets from an HDD at the same speed as from an SSD. Still, I hope you agree that in this case it did not matter whether they were targeting cat cores or Zen cores. The problem was related to the storage subsystem; AVX2 vs AVX wouldn't make a difference.

If you've never written code around hardware constraints like this you don't understand what's involved.
I guess you are speaking about system level constraints, including memory subsystem, storage subsystem, etc. It's a wider scope than what I had in mind when responding initially to your message.
you were usually writing everything in assembly, and your 'port' was a complete rewrite. Thankfully applications were a lot simpler then.
Compilers are smarter; you can nudge them to do the right thing nowadays. I mean, I don't think we have to write something fully in asm to extract reasonable performance, even if that were maintainable. Inline asm here and there, proper data layout to nudge the compiler towards autovectorization, and solid libraries around the cases autovectorization cannot handle, with CPU-specific backends where you can use intrinsics freely, usually do the job.
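Something like this is what I have in mind for the backend part (a sketch of mine, not from any particular library): a portable loop the compiler can autovectorize, plus an intrinsics path behind a compile-time check.

// CPU-specific backend selected at build time, with a portable fallback.
#include <cstddef>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

void add_arrays(float* dst, const float* a, const float* b, std::size_t n)
{
#if defined(__AVX2__)
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {                       // 8 floats per 256-bit register
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i) dst[i] = a[i] + b[i];           // scalar tail
#else
    // Contiguous data and a simple loop: most compilers autovectorize this.
    for (std::size_t i = 0; i < n; ++i) dst[i] = a[i] + b[i];
#endif
}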
 

yottabit

Golden Member
Jun 5, 2008
You can't lower FPS. You can't simplify physics, etc. So any specialized compute can't be used in an enabling capacity or else your game isn't portable.
Your long rant contains some valuable insights but then also stuff like this that is nonsensical. Games target different FPS on different platforms all the time. Some console releases even offer a choice of high framerate or high resolution. Physics can very much be simplified from one platform to the other - very few games these days feature heavily “physics-based” gameplay, so the devs can choose how elaborate they want their cloth or foliage simulations to be based on the target platform or Low/Med/Ultra settings.

In a similar vein, developers and/or their engines can easily include vectorized SIMD instructions, either “manually” through intrinsics or automatically via the compiler. Platforms that support them will just end up having faster load times or lower CPU frame times; it's not going to break anything.

The bigger thing is really: what is the developer incentive to do so? Gamers seem happy to drop $100 on some Platinum Edition pre-order of a game that has terrible performance problems, so the developer incentive is not really there.

Most games are bottlenecked by GPU these days at the resolutions and framerates people care about.

SIMD is not a magic easy button for game engines like Unreal and Unity that are heavily structured around “game objects” or “actors” and a single threaded main update loop. It’s great and I’m sure it’s being used for contiguous blocks of data like loading assets and decompression, some procedural generation tasks, etc.

It’s things like all the virtual function calls (in the case of Unreal) and garbage collection (in the case of Unity) that wreck the game thread performance, as well as lazy developers slapping things they shouldn’t onto tick/update events. Going to an ECS, data-oriented style programming stack can greatly improve this, but there’s only a subset of developers that like working that way and it gets second-class treatment in the engines.
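For readers who haven't seen the contrast, a toy sketch of the two styles (my own illustration, not Unreal or Unity code):

// Per-object virtual Update() vs an ECS-style system over packed component arrays.
#include <cstddef>
#include <memory>
#include <vector>

// "Game object" style: one indirect call per object per frame, objects scattered on the heap.
struct GameObject { virtual ~GameObject() = default; virtual void Update(float dt) = 0; };

void tick_objects(std::vector<std::unique_ptr<GameObject>>& objs, float dt)
{
    for (auto& o : objs) o->Update(dt);
}

// ECS-ish style: a "movement system" walks contiguous component arrays.
struct Transform { float x, y; };
struct Velocity  { float dx, dy; };

void movement_system(std::vector<Transform>& t, const std::vector<Velocity>& v, float dt)
{
    for (std::size_t i = 0; i < t.size(); ++i) {
        t[i].x += v[i].dx * dt;
        t[i].y += v[i].dy * dt;
    }
}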

Where we tend to see developers focus CPU optimization effort (perhaps ironically) in the industry is on the part gamers never see - dedicated servers and backend. There are real dollars associated with savings there in terms of hosting cost.

I’m only a hobbyist at this stuff but have done some CPU/GPU profiling and watched a lot of GDC talks.
 

johnsonwax

Senior member
Jun 27, 2024
So bear with me as I try to understand. You are talking about a hypothetical situation where you have based a program (or a game?) around specific characteristics of a SIMD unit available on a particular microarchitecture (let's call it A), and are asking how to account for that case when you port to microarchitecture B, where that specific unit or functionality is not available? Do you have a real-life example of what that may be?
I gave you one - CP2077. It had the unfortunate timing of the PS5 being delayed and forcing it to ship on the PS4, which it very obviously could not run on, to the degree that Sony pulled it from the store and issued refunds. My understanding is that they were always planning on backporting to PS4, but made the design decision early enough that they only later realized they couldn't address the IO latency/bandwidth problem, at which point there wasn't much they could do, and their hope that people would have PS5s so they could back off on the PS4 launch didn't materialize.

MGS IV is an example of a game that couldn't make it for multiple reasons. It was an early PS3 release and got more performance off of Cell than the Xbox could manage, but it also utilized the pressure-sensitive buttons on the PS3 controller and was too big to fit on an Xbox DVD. Which of these was the dealbreaker is impossible to know, but the bottom line is it was designed around the hardware offering of the PS3/PC and the Xbox lacked that. Regardless, Hideo Kojima is also a difficult character to extrapolate game industry decision making from.

I agree. According to Data Oriented Design principles, they would have designed their software around the data flow on the target platforms and their constraints. That means the PS4 was not considered at the time they were designing the game, as otherwise they would have accounted for the fact that they cannot stream assets from an HDD at the same speed as from an SSD. Still, I hope you agree that in this case it did not matter whether they were targeting cat cores or Zen cores. The problem was related to the storage subsystem; AVX2 vs AVX wouldn't make a difference.

I guess you are speaking about system level constraints, including memory subsystem, storage subsystem, etc. It's a wider scope than what I had in mind when responding initially to your message.
Yeah, it doesn't matter. If you design to a given hardware profile, it doesn't matter if it's AVX or GPU or SSD. You can't fix hardware shortcomings in software.

Compilers are smarter; you can nudge them to do the right thing nowadays. I mean, I don't think we have to write something fully in asm to extract reasonable performance, even if that were maintainable. Inline asm here and there, proper data layout to nudge the compiler towards autovectorization, and solid libraries around the cases autovectorization cannot handle, with CPU-specific backends where you can use intrinsics freely, usually do the job.
Somewhat. Normally these asymmetric units aren't added because they give marginal performance gains; they're added because they give HUGE performance gains in a narrow set of compute conditions. So that sort of begs the question - do you have those conditions? If so, you won't be able to just #IFDEF your way out of it, because you aren't getting marginal gains. And if you don't have those conditions, then you aren't targeting the unit, but the compiler may shove some things over to it, or you may take some time to do some optimizations here and there because it helps with FPS stability, etc., but it's not like the presence of that unit gave a competitive advantage, enabled something new in gaming, etc.

You just don't get those step changes any more, not because they are technologically out of reach, but because you can only exploit them with platform exclusives, and that's an economically difficult problem to solve given that a game like GTA VI has an estimated development cost north of $1B. That game needs to launch everywhere to pay that off. Even Microsoft's acquisitions which looked like they were going to result in a load of platform exclusives now look like they can't afford to do that. We'll get limited exclusives so you need an Xbox to play it first, but they'll make their way to other platforms later, and as a result can't be hardware constrained to the Xbox. So if the Xbox has a new SIMD unit, etc. even Microsoft probably can't afford to use it to its potential. Note, there are aspects of this which scale and aspects that don't. Raytracing is a good example of a sort of all-or-nothing technological step, as opposed to general compute where you can maybe LOD your way out of a performance disparity.
 

johnsonwax

Senior member
Jun 27, 2024
Your long rant contains some valuable insights but then also stuff like this that is nonsensical. Games target different FPS on different platforms all the time. Some console releases even offer a choice of high framerate or high resolution. Physics can very much be simplified from one platform to the other - very few games these days feature heavily “physics-based” gameplay, so the devs can choose how elaborate they want their cloth or foliage simulations to be based on the target platform or Low/Med/Ultra settings.
I meant at the margins. Sure, you can offer 30FPS 4K vs 60FPS HD, but you can't offer a 25FPS baseline, not for most games. And if the game utilizes physics as a core mechanic, you can't just degrade that. You can't have your spaceship in orbit just fall out of orbit because you can't afford to run the algorithm. You have specific constraints based on your design decisions. There's a nice example of this in Factorio, which I've mentioned elsewhere, and it's why the game is so heavily single-core constrained. They have as a design requirement that the game be deterministic, and maintaining thread coherency in a deterministic simulation is very hard and expensive to do, to the extent that you can pretty quickly consume all of your gains in overhead management.

Note, the issue here isn't whether the hardware or programming language can or can't multithread; the problem is that you set a constraint that is extremely unsuited to threading and multiple cores, so you get rapidly diminishing returns on threading/cores. Change that constraint like other games do and you can thread to your heart's content and rely on stochastic mechanisms to cover the coherency gaps.

But I think you also highlight why AAA games are what they are - because they allow for a lot of smearing. There aren't a lot of hard constraints on a game like Skyrim, so you can adapt it to pretty much any environment by shoving LOD, frame rates, resolution, and the like. That's kind of an indictment of the industry, by the way, because it means entire other categories of games go by the wayside not so much because they are harder to implement, but because they are less flexible when the marketing folks walk into the room and say 'hey, let's port this to iPhone'. And that's really what drives the industry, not making interesting or novel games.

In a similar vein, developers and/or their engines can easily include vectorized SIMD instructions, either “manually” through intrinsics or automatically via the compiler. Platforms that support them will just end up having faster load times or lower CPU frame times; it's not going to break anything.

The bigger thing is really: what is the developer incentive to do so? Gamers seem happy to drop $100 on some Platinum Edition pre-order of a game that has terrible performance problems, so the developer incentive is not really there.

Most games are bottlenecked by GPU these days at the resolutions and framerates people care about.
Right, but it's not like the GPU doesn't have similar constraints on RAM, how quickly you can stream assets to it, and so on. Someday someone will develop a game that requires raytracing as a gameplay element. I dunno, you can only see enemies in reflections or something. Understand, I'm not a game developer by profession, but I've been a gamer for half a century, I did write games in the ancient times, and I continued to write software through my career. I've seen all this stuff come and go, and the long and short of it is that the industry is extremely lowest-common-denominator oriented now, and my friend group includes a load of Blizzard guys that have confirmed this. It's about a 6-year development cycle for a AAA game, so quite often your launch platform doesn't even exist yet on paper apart from the compute target. It's not like the 90s when you could knock out a game in 18 months, often with prototype hardware in hand, and it was much easier to target specific hardware/compute features.

SIMD is not a magic easy button for game engines like Unreal and Unity that are heavily structured around “game objects” or “actors” and a single threaded main update loop. It’s great and I’m sure it’s being used for contiguous blocks of data like loading assets and decompression, some procedural generation tasks, etc.

It’s things like all the virtual function calls (in the case of Unreal) and garbage collection (in the case of Unity) that wreck the game thread performance, as well as lazy developers slapping things they shouldn’t onto tick/update events. Going to an ECS, data-oriented style programming stack can greatly improve this, but there’s only a subset of developers that like working that way and it gets second-class treatment in the engines.
Except the virtual function calls are how you make the economics of the industry work. You call it lazy, but without that you don't get portability. And that kind of abstraction is EVERYWHERE. We moved up from hand-written assembly to C to C + Lua, from direct code to game libraries that carry a lot of overhead, and so on and so on. You're always trading potential performance for portability, for time to market, etc. because games have gotten so large in terms of code, assets, platform reach and so on. You can bypass all of that, of course, and you get wonderful games like Factorio, which is WAY better in terms of revenue per employee and things like that, and is likely much better optimized, but isn't going to turn $500M in sales.

One of the things I don't really like about the games industry is that the really big AAA games are visually very impressive and generally kind of sh*t games, and they aren't moving the underlying mechanics of games forward - storytelling, etc. They're mainly just bigger and shinier. And that's largely because there is this economic driver that pulls everything along with it, including the game engines, etc.

Where we tend to see developers focus CPU optimization effort (perhaps ironically) in the industry is on the part gamers never see - dedicated servers and backend. There are real dollars associated with savings there in terms of hosting cost.
Yep. But that's not just because there are real dollars, it's because they have absolute control over that hardware. A friend of mine was the lead for Battlenet when WOW first launched. I didn't see him for an entire year. But he had staff waiting at the HP (I think it was) factory to pull blade servers off the assembly line and drive them directly over to the datacenter to install them. They didn't need to write to a lowest common denominator like the WOW developers did; they could write to utilize every ounce of compute that specific bit of hardware had, because they specced it, and they were buying it in the hundreds or thousands and didn't need to port. They had control. The game developers didn't.
 

MS_AT

Senior member
Jul 15, 2024
I gave you one - CP2077.
No, you gave me a system-constraint reason. While valid, it's not what we were talking about previously. You have generally broadened the subject. While I am not saying you are wrong, I don't think a discussion about the constraints and difficulties of game design belongs in this thread. We could have a new thread for that if you would like to create one.