Question: x86 and ARM architectures comparison thread.


DavidC1

Golden Member
Dec 29, 2023
1,743
2,823
96
They aren't clock normalized by the way. You have to lop off 20% on Zen 4.

Also, the comparison to Gracemont is important because the doubled FP units are the single biggest boost to FP performance: they account for the 60-70% gain, whereas FP would have stayed at around 30% without them.

If you take an existing config (whatever it is) and double the number of units, performance in all existing applications would increase significantly, without recompiling code.

There are many talks in software circles about how technical complexity demands extra resources and adds development burden. Hence, every time you introduce a new feature that requires that kind of effort, adoption drops significantly. AVX is much less of an advantage over SSE, AVX2 is much less over AVX, and AVX512 is much less over AVX2. And a new one every 2 years at that!

Note that Intel was hinting at even wider 1024-bit extensions. You know what stopped them? Losing their monopoly and coming close to bankruptcy! If wider vectors are such a magic thing, why don't they keep expanding them?

This is stupid. Game developers especially keep talking about how much time and resources it takes just to address the technical side. More complexity means more resources and more compute power, resulting in yet more complexity. AI too, with a recent article talking about how it has driven up electricity bills for residents in many parts of the US. How is that a benefit? It's not far from being a mass Ponzi scheme.
 

Geddagod

Golden Member
Dec 28, 2021
1,493
1,588
106
Does anyone know, or have any speculation, as to why the X925's L1D cache is so much smaller than the ARM competition's, despite it running at lower clocks and having the same latency in cycles?
 

johnsonwax

Senior member
Jun 27, 2024
309
474
96
This is stupid. Game developers especially keep talking about how much time and resources it takes just to address the technical side. More complexity means more resources and more compute power, resulting in yet more complexity. AI too, with a recent article talking about how it has driven up electricity bills for residents in many parts of the US. How is that a benefit? It's not far from being a mass Ponzi scheme.
The other problem is that you can't design a game for a subset of an architecture unless it's on a major console, because that makes it unportable to other architectures, and at least for AAA games you usually need all the sales you can get. Apart from platform exclusives, you have to design more or less to a lowest common denominator.
 
Jul 27, 2020
26,825
18,471
146
Apart from platform exclusives, you have to design more or less to a lowest common denominator.
It's called being lazy. There's nothing stopping game developers from detecting the ISA extensions available on the CPU and then using the appropriately optimized DLL (which may sometimes be as simple as just targeting the desired ISA extensions during compilation) to take full advantage of the CPU's capabilities.
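For illustration, a minimal sketch of that kind of runtime dispatch using GCC/Clang builtins on x86-64; the backend names here are made up, not from any real engine:

```cpp
// Minimal runtime-dispatch sketch (GCC/Clang builtins on x86-64).
// In practice the backends would be compiled with -mavx512f / -mavx2,
// or live in separately built DLLs as described above.
#include <cstdio>

static void physics_step_avx512() { std::puts("AVX-512 path"); }
static void physics_step_avx2()   { std::puts("AVX2 path"); }
static void physics_step_sse2()   { std::puts("SSE2 baseline path"); }

static void (*physics_step)() = nullptr;

static void select_physics_backend() {
    __builtin_cpu_init();                        // populate CPU feature flags
    if (__builtin_cpu_supports("avx512f"))
        physics_step = physics_step_avx512;
    else if (__builtin_cpu_supports("avx2"))
        physics_step = physics_step_avx2;
    else
        physics_step = physics_step_sse2;        // SSE2 is the x86-64 baseline
}

int main() {
    select_physics_backend();                    // once at startup
    physics_step();                              // cheap indirect call per use
}
```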
 

511

Diamond Member
Jul 12, 2024
3,578
3,394
106
Games are already complicated, and if you have to support three different versions it's going to be a pain for developers.
 

MS_AT

Senior member
Jul 15, 2024
804
1,623
96
You do know it's not clock normalized right?
I do, but the LunarLake Skymont core and Zen5c have the same frequency. (In the second article I mean the same max frequency, but that's the best we can count on under these circumstances.)

You will of course say that Skymont on LunarLake is handicapped by the lack of L3. That is why we have the first article, where desktop Skymont is clocked at 4.6GHz, has access to a much more generous cache setup and lower latency memory than Zen5c in Strix Point, and is still losing in some FP subtests to the Zen5c core.

Likewise with the Zen4 comparisons: if the FPU pipe mix were the deciding factor and the ability to do 4 FP adds or 4 FP muls per cycle dominated any benchmark, then Skymont would have pulled ahead of Zen4 despite the frequency disadvantage (a 20% frequency deficit versus a 100% per-cycle execution advantage). Yet it cannot dominate even the Zen5c mobile core, which suggests the pipe arrangement the other cores have is sufficient in practice.

Anyway, the point is that I do not know of a benchmark that would single out and show the Skymont FPU pipe arrangement to be better in practice than LionCove/Zen5. So I was asking if you know of any.

Also, the comparison to Gracemont is important because the doubled FP units are the single biggest boost to FP performance: they account for the 60-70% gain, whereas FP would have stayed at around 30% without them.
Well, I was not saying doubling does not bring benefits...

If you take an existing config (whatever it is) and double the number of units, performance in all existing applications would increase significantly, without recompiling code.
That is a questionable statement. It is too broad, so it can easily be proven wrong if we take cache to be the resource that gets doubled. Still, that was not what I was asking about.

There are many talks in software circles about how technical complexity demands extra resources and adds development burden. Hence, every time you introduce a new feature that requires that kind of effort, adoption drops significantly. AVX is much less of an advantage over SSE, AVX2 is much less over AVX, and AVX512 is much less over AVX2. And a new one every 2 years at that!
The reason for the low adoption rate is Intel's business strategy. If they adopted new instruction sets as a baseline across the whole stack (meaning a $5 Celeron and a Xeon would support the same ISA regardless of performance), AVX512 would be enjoying wide adoption by now.

But could you please link the talks you mentioned? They are usually interesting to watch.

You know what stopped them? Losing their monopoly and coming close to bankruptcy! If wider vectors are such a magic thing, why don't they keep expanding them?
A vector register wider than a cacheline does not make sense; even Intel engineers have commented as much. Marketing is, well, marketing. But from personal experience, having the register width match the cacheline width makes life easier.

The other problem is that you can't design a game for a subset of an architecture unless it's on a major console, because that makes it unportable to other architectures, and at least for AAA games you usually need all the sales you can get. Apart from platform exclusives, you have to design more or less to a lowest common denominator.
Teach people that object-oriented design is a tool and not a universal answer to every problem, and that data-oriented design also exists. If they have CPU-friendly code in the first place, then regardless of where they port the code it will do well, as the operating principles are the same across architectures.
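To make the data-oriented point concrete, here is a toy sketch in plain C++ (type and field names are made up for illustration): once the data is laid out as struct-of-arrays, the same update loop streams through memory and autovectorizes on whatever SIMD width the target has (SSE, AVX, NEON, ...).

```cpp
#include <cstddef>
#include <vector>

// Array-of-structs: each particle's fields are interleaved in memory,
// so a loop that only touches positions wastes cache and vector lanes.
struct ParticleAoS { float x, y, z, vx, vy, vz; bool alive; };

// Struct-of-arrays: each field is contiguous, which is what SIMD wants.
struct ParticlesSoA {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
};

void integrate(ParticlesSoA& p, float dt) {
    const std::size_t n = p.x.size();
    for (std::size_t i = 0; i < n; ++i) {   // trivially vectorizable on any ISA
        p.x[i] += p.vx[i] * dt;
        p.y[i] += p.vy[i] * dt;
        p.z[i] += p.vz[i] * dt;
    }
}
```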
 
  • Like
Reactions: CouncilorIrissa

poke01

Diamond Member
Mar 8, 2022
3,989
5,309
106
The video shows why the Intel laptop is priced higher. Just a better overall experience if you are not married to an internet browser. And the ARM laptop emitting more fan noise? Wow. Trust Qualcomm to make ARM look bad!
Qualcomm should’ve held off till V3 was ready. Oh well, they totally rushed their WoA launch once again. Hopefully the 6th time’s the charm, right?
 

camel-cdr

Member
Feb 23, 2024
31
97
51
AVX is much less of an advantage over SSE, AVX2 is much less over AVX, and AVX512 is much less over AVX2
While I agree with the principle, AVX512 was a much bigger upgrade over AVX2 than AVX2 over SSE4 imo, because it added much more powerful instructions.

The better solution is obviously proper ISA design that doesn't require a full rewrite if you want to widen your vector units.
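For example, with a vector-length-agnostic ISA like SVE (or RVV), something like the sketch below runs unchanged on 128-, 256- or 512-bit hardware. The function is purely illustrative and assumes a compiler with SVE support (e.g. -march=armv8-a+sve):

```cpp
// Written once; the loop asks the hardware how many 32-bit lanes it has,
// so widening the vector units needs no rewrite and no recompile.
#include <arm_sve.h>
#include <cstdint>

void add_arrays(float* dst, const float* a, const float* b, std::int64_t n) {
    for (std::int64_t i = 0; i < n; i += svcntw()) {    // lanes per vector
        svbool_t pg = svwhilelt_b32_s64(i, n);          // predicate masks the tail
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, dst + i, svadd_f32_x(pg, va, vb));
    }
}
```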
 

johnsonwax

Senior member
Jun 27, 2024
309
474
96
Teach people that object-oriented design is a tool and not a universal answer to every problem, and that data-oriented design also exists. If they have CPU-friendly code in the first place, then regardless of where they port the code it will do well, as the operating principles are the same across architectures.
How do you port the physical SIMD unit that allows the game to function at the necessary performance in the first place? Note, there are two kinds of AAA games - platform exclusives designed to emphasize the differences between competitors, and multiplatform games that need to avoid those differences so they can have a reasonable degree of parity. So if the new compute unit enables the game to do something new, it won't be able to do that new thing on other hardware without an equivalent to that compute unit. This is why features like ray tracing took a while to actually arrive in AAA: while it was one thing to add it as a shader pack in Minecraft, it wasn't ubiquitous enough for Rockstar to design a game around - that took a few years. You can get away with some scaling issues - simpler geometry on weaker platforms, possibly shorter load distances provided that isn't crucial to the gameplay, etc. You can't lower FPS, you can't simplify physics, etc. So any specialized compute can't be used in an enabling capacity, or else your game isn't portable. It has f-all to do with the nature of how you write the code, and everything to do with the design of the game.

I mentioned this before with how CP2077 failed on the PS4: the game was designed for the PS5 generation of consoles and assumed there would be a fast SSD to stream assets off of storage. But the PS4 had a fairly slow spinning drive, couldn't do that, and that was the main bottleneck to the game working on the prior generation of consoles. It had nothing to do with the coding paradigm and everything to do with the assumptions being made about the capabilities of the underlying hardware. CDPR couldn't just massage their way around storage that was 10x slower than what their game needed. Had they assumed spinning drives, they would have had to consider design solutions like loading screens, greater asset reuse to keep it on the GPU, not having cars in the game that require loading the map at a certain rate, etc. These are design questions, not OO vs functional programming.

If you've never written code around hardware constraints like this you don't understand what's involved. A lot of us old guys learned to code in such an era - when platform A had a blitter that could support 8 sprites and platform B had one that could support 16 sprites, if you wanted your game to run on both, you had to design it around 8 sprites. You couldn't use the extra 8 sprites as part of your core gameplay. Maybe it could be aesthetic, but for any cross-platform design that additional hardware capability would have to go largely unused. Now, portability wasn't an issue back then - you were usually writing everything in assembly, and your 'port' was a complete rewrite. Thankfully applications were a lot simpler then.

Today most of that is handled by the game engine, which does 90% of the lift, but these engines are also designed to help the developer deal with hardware disparities by handling LOD issues, etc. That also means the game engine isn't going to run too far down some hardware rabbit hole that would unlock a new set of capabilities, because much of the point of game engines is that they handle most of the porting lift for you, and that rabbit hole makes that harder to impossible. It's great for platform exclusives, and a trap for everything else, and that's why developers avoid it: it's not worth boxing themselves into one platform. Once that bit of technology gets wide adoption and you see SIMD more broadly, etc., then you start to build around it. Again, the point of a platform exclusive is to show off the capabilities of the platform, so they will make the most of it (note the Wii games that were designed around the controllers and in no universe could be ported to other platforms, not because they were 'too object oriented', whatever the f that means, but because nobody else had Nunchuks that would let you stand in the middle of your living room, which is what made those games fun). But because the Wii rejected the emerging console design language of two triggers, two shoulder buttons, two analog sticks, four action buttons and a D-pad, it was hard to port games to the Wii because the controls needed simply weren't there. Note the Switch adopts this design language, and games can be ported to it. Again, nothing to do with the coding paradigm. So AAA ported games are largely an exercise in maximizing your lowest-common-denominator hardware, allowing the game to scale where it doesn't impact gameplay, and refusing to utilize hardware features where it would.

And because the economics of the game industry have gotten so bad around AAA titles, they simply can't function as platform exclusives unless they are heavily subsidized, now by TV and movie rights, merch, etc. And so it's reached the point that there is simply no room for hardware diversity. Xbox and PS are pretty close to the exact same hardware now, because every time they introduce haptic triggers or whatever, it gets sanded off by the GTA VI/CP/Battlefield games and becomes irrelevant. Gaming hardware is fully a design-by-committee exercise now.
 

MS_AT

Senior member
Jul 15, 2024
804
1,623
96
How do you port the physical SIMD unit that allows the game to function at the necessary performance in the first place?
I think we have gone like two levels deeper than what my comment was talking about.

So bear with me as I try to understand. You are talking about a hypothetical situation where you have based a program (or a game?) around specific characteristics of a SIMD unit available on a particular microarchitecture (let's call it A), and are asking how to account for that case when you port to microarchitecture B, where that specific unit or functionality is not available? Do you have a real-life example of what that might be?

These are design questions, not OO vs functional programming.
I agree. According to Data Oriented Design principles, they would have designed their software around the data flow on the target platforms and their constraints. That means the PS4 was not being considered at the time they were designing the game, since otherwise they would have accounted for the fact that they cannot stream assets at the same speed from an HDD as they can from an SSD. Still, I hope you agree that in this case it did not matter whether they were targeting cat cores or Zen cores. The problem was the storage subsystem; AVX2 vs AVX wouldn't make a difference.

If you've never written code around hardware constraints like this you don't understand what's involved.
I guess you are speaking about system level constraints, including memory subsystem, storage subsystem, etc. It's a wider scope than what I had in mind when responding initially to your message.
you were usually writing everything in assembly, and your 'port' was a complete rewrite. Thankfully applications were a lot simpler then.
Compilers are smarter these days; you can nudge them to do the right thing. I mean, I don't think we have to write something fully in asm to extract reasonable performance, even if that were maintainable. Inline asm here and there, proper data layout to nudge the compiler towards autovectorization, and solid libraries around the cases autovectorization cannot handle, with CPU-specific backends where you can use intrinsics freely, usually do the job.
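E.g. a tiny sketch of such a library routine with a CPU-specific backend chosen at compile time and a portable fallback; the function name is made up, and this assumes x86 and the compiler-defined __AVX2__ macro:

```cpp
// One public function, two backends selected at compile time.
#include <cstddef>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

void scale(float* __restrict dst, const float* __restrict src,
           std::size_t n, float k) {
#if defined(__AVX2__)
    const __m256 vk = _mm256_set1_ps(k);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)                   // 8 floats per AVX2 register
        _mm256_storeu_ps(dst + i,
                         _mm256_mul_ps(_mm256_loadu_ps(src + i), vk));
    for (; i < n; ++i)                           // scalar tail
        dst[i] = src[i] * k;
#else
    for (std::size_t i = 0; i < n; ++i)          // portable path; a decent
        dst[i] = src[i] * k;                     // compiler autovectorizes this
#endif
}
```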
 
  • Like
Reactions: yottabit

yottabit

Golden Member
Jun 5, 2008
1,651
817
146
You can't lower FPS, you can't simplify physics, etc. So any specialized compute can't be used in an enabling capacity, or else your game isn't portable.
Your long rant contains some valuable insights, but then also stuff like this that is nonsensical. Games target different FPS on different platforms all the time. Physics can very much be simplified from one platform to another; very few games these days feature heavily “physics-based” gameplay, so the devs can choose how elaborate they want their cloth or foliage simulations to be based on the target platform or Low/Med/Ultra settings.

In a similar vein, developers and/or their engines can easily include vectorized SIMD instructions, either “manually” through intrinsics or automatically via the compiler. Platforms that support them will just end up having faster load times or lower CPU frame times; it's not going to break anything.

The bigger question is really: what is the developer incentive to do so? Gamers seem happy to drop $100 on some Platinum Edition pre-order of a game that has terrible performance problems, so the developer incentive is not really there.

Most games are bottlenecked by GPU these days at the resolutions and framerates people care about.

SIMD is not a magic easy button for game engines like Unreal and Unity that are heavily structured around “game objects” or “actors” and a single-threaded main update loop. It's great, and I'm sure it's being used for contiguous blocks of data like loading assets and decompression, some procedural generation tasks, etc.

It’s things like all the virtual function calls (in the case of Unreal) and garbage collection (in the case of Unity) that wreck game-thread performance, as well as lazy developers slapping things they shouldn’t onto tick/update events. Going to an ECS, data-oriented style programming stack can greatly improve this, but only a subset of developers like working that way, and it gets second-class treatment in the engines.
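Roughly the contrast I mean, as a toy sketch in plain C++ with made-up type names (not the actual Unreal/Unity APIs):

```cpp
#include <cstddef>
#include <vector>

// "Game object" style: one virtual call per object per frame,
// objects scattered across the heap, poor cache behaviour.
struct Actor {
    virtual ~Actor() = default;
    virtual void Tick(float dt) = 0;
};

// ECS / data-oriented style: components live in packed arrays and one
// plain loop updates them all, which is cache- and SIMD-friendly.
struct MoveComponents {
    std::vector<float> x, vx;
    void tick(float dt) {
        for (std::size_t i = 0; i < x.size(); ++i)
            x[i] += vx[i] * dt;
    }
};
```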

Where we tend to see developers focus CPU optimization effort (perhaps ironically) is on the part gamers never see: dedicated servers and backend. There are real dollars associated with savings there in terms of hosting cost.

I’m only a hobbyist at this stuff, but I have done some CPU/GPU profiling and watched a lot of GDC talks. Also stayed at a Holiday Inn Express last night.