News Intel GPUs - Battlemage officially announced, evidently not cancelled


blckgrffn

Diamond Member
May 1, 2003
9,179
3,141
136
www.teamjuchems.com
-I've always been perplexed by AMD's unwillingness to add IC to their APUs. Figure it would have an outsized effect in the most bandwidth-constrained scenarios.

Guess we'll start seeing IC after AMD's APUs get squeezed by Intel.

3D cache APU? Who's here for that? :D

Heck, AFAIK they don't have LLC/IC in the console APUs either, where you would think it would make so much sense: it's so power efficient, and that's a huge boon to getting performance out of a tiny box while reducing costs elsewhere in size, cooling capacity, power delivery, etc.

Thinking next gen is when we'll see it.
 

ToTTenTranz

Member
Feb 4, 2021
103
152
86
-I've always been perplexed by AMD's unwillingness to add IC to their APUs. Figure it would have an outsized effect in the most bandwidth-constrained scenarios.

Guess we'll start seeing IC after AMD's APUs get squeezed by Intel.

Word out there says they had IC in Strix Point until Microsoft mandated NPUs for everyone for Copilot+ compliance.

Of course, changes like these would have happened years ago, so there's no way to know for sure.
The one thing we do know is how small the iGPU portion ended up being in Strix Point.
And I bet the performance gains at sub-1080p would have been a lot larger had AMD put in, say, 8 to 16MB of Infinity Cache instead of those extra 2 WGPs. That lack of balance between processing throughput and effective bandwidth suggests the IC removal might have been a "last-minute" change, within what "last minute" really means for a chip design.
 

DavidC1

Senior member
Dec 29, 2023
319
518
96
IMO the plan is probably to match the previous generation's peak performance in the U series while using much less area to do so. And the H series gets a wider GPU to keep up the 2x gains gen-on-gen while also creeping up into Nvidia xx50 performance territory.
This is always the case. Do not assume SIMD32 will save a huge amount of space. We knew RDNA3 wouldn't perform well after Angstronomics revealed how compact the die was. Performance requires transistors, and it is hardware, so if a design is too compact, that's suspicious. It might save some space, but if it were, say, 40% more compact, that would mean features had been removed.
AFAIK Intel had a perfect storm on their hands: a piece of hardware that required special software attention... and a software team paralyzed by the recent war.
I think this is a better way of looking at it than just politics. They are still inexperienced.

This isn't the first time the driver team was blamed nearly entirely for problems.

Remember the X3000? It took them forever to add hardware T&L support to their drivers. Yes, that was a big mistake. However, we found out the performance was low, which turned out to be due to a lack of hardware. In the 4000 series they doubled the performance of their geometry unit, and they went further in GMA HD, adding a much better culling engine, which further increased the performance of the geometry unit.
Let's hope they take out this stupid Resizable BAR requirement this time. For older systems, the performance uplift between Arc1 and Arc2 would be +130%.
Resizable BAR requirements exist because they had the iGPU mentality for so long. They didn't need to care. Now that they have a dGPU, they'll really understand what is needed.

Theory is one thing, but absolutely nothing substitutes experience for real world.
 
Jul 27, 2020
17,466
11,259
106
Especially with the manager of that time, who seemed to mostly play stupid politics games instead of delivering.
CEO Brian Krzanich was responsible for the complacency of the company during his tenure, and Intel TMG's Sohail, the borderline criminal executive who was forced out a couple of years ago, was responsible for the 10nm delays.

Hopefully Raptor Lake Refresh is the last bad decision that Intel makes in this decade.
 

NTMBK

Lifer
Nov 14, 2011
10,264
5,115
136
3D cache APU? Who's here for that? :D

Heck, AFAIK they don't have LLC/IC in the console APUs either, where you would think it would make so much sense: it's so power efficient, and that's a huge boon to getting performance out of a tiny box while reducing costs elsewhere in size, cooling capacity, power delivery, etc.

Thinking next gen is when we'll see it.
I was honestly surprised the PS5 Pro didn't add an Infinity Cache. They blew the transistor budget on extra shaders, raytracing etc, but didn't give it any more memory bandwidth.
 

ToTTenTranz

Member
Feb 4, 2021
103
152
86
I was honestly surprised the PS5 Pro didn't add an Infinity Cache. They blew the transistor budget on extra shaders, raytracing etc, but didn't give it any more memory bandwidth.
There's more memory bandwidth in the form of higher-clocked GDDR6, from 448GB/s to 576GB/s.

Regardless, the PS5 Pro doesn't need a lot more memory bandwidth because it will actually target a lower base resolution: it'll render at 1080p and upscale to 4K with the AI-based PSSR, whereas the PS5 renders at 1440p and upscales to 4K with temporal FSR2 or similar.
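Back-of-the-envelope, the two claims above line up. Assuming a 256-bit GDDR6 bus on both consoles (a common assumption here, not something stated in this thread), the bandwidth figures fall out of the per-pin data rates, and the pixel-count drop from 1440p to 1080p more than offsets the resolution side:

```python
# Rough bandwidth and pixel-count arithmetic for the PS5 vs. PS5 Pro figures.
# Assumption: a 256-bit (32 bytes per transfer) GDDR6 bus on both consoles.
bus_bytes = 256 // 8  # 32 bytes moved across the bus per pin-transfer

ps5_gbps = 14  # GDDR6 at 14 Gbit/s per pin
pro_gbps = 18  # GDDR6 at 18 Gbit/s per pin

ps5_bw = ps5_gbps * bus_bytes  # GB/s
pro_bw = pro_gbps * bus_bytes  # GB/s
print(ps5_bw, pro_bw)  # 448 576

# Meanwhile the base render resolution drops:
pixels_1440p = 2560 * 1440
pixels_1080p = 1920 * 1080
print(f"{pixels_1080p / pixels_1440p:.2f}")  # 0.56 -> ~56% of the pixels per frame
```

So the Pro gets ~29% more raw bandwidth while shading ~44% fewer base pixels, which supports the "doesn't need a lot more bandwidth" argument.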
 

blckgrffn

Diamond Member
May 1, 2003
9,179
3,141
136
www.teamjuchems.com
There's more memory bandwidth in the form of higher-clocked GDDR6, from 448GB/s to 576GB/s.

Regardless, the PS5 Pro doesn't need a lot more memory bandwidth because it will actually target a lower base resolution: it'll render at 1080p and upscale to 4K with the AI-based PSSR, whereas the PS5 renders at 1440p and upscales to 4K with temporal FSR2 or similar.

Which is funny, because I was referring to it in terms of PPW, which is a big deal: less power spent on bandwidth leaves more juice for, presumably, the GPU. With a fixed power budget, it would be interesting to see the data they received that put the product on this path.

It's doubly interesting because lower resolutions (like 1080p) are where you get the biggest percentage gains in LLC hit rate from even a "paltry" 16MB of IC. AMD has some pretty graphs of this that I linked in another thread; it's when you want to benefit higher resolutions that you really need a lot of IC. So 1080p upscaled to 4K seems like a sweet spot for a budget IC implementation.
 

DavidC1

Senior member
Dec 29, 2023
319
518
96
There's more memory bandwidth in the form of higher-clocked GDDR6, from 448GB/s to 576GB/s.
Yes, but a cache will also speed things up where lower latency matters, such as instruction fetches. And a cache is much better at extracting the theoretical bandwidth, for that same reason.

Intel said of the eDRAM that, because it's a cache, it behaves like memory with twice the bandwidth.
 

cherullo

Member
May 19, 2019
48
115
106
Resizable BAR requirements exist because they had the iGPU mentality for so long. They didn't need to care. Now that they have a dGPU, they'll really understand what is needed.
The Draw/Execute Indirect speed-up for Battlemage is another one of these cases.

For those who don't know, Draw/Execute Indirect is a mechanism which allows a draw command or compute shader to be dispatched based on the results of a previous shader.

For example, you may have a large list of asteroids in your scene, and you'd like to do culling using a compute shader. This culling shader writes the list of visible asteroids to a buffer. Next, you'd like to draw each one of the visible asteroids. Without Draw Indirect, the CPU would have to read the number of visible asteroids from the GPU's buffer to then dispatch the draw command for the correct number of asteroids. With Draw Indirect, you can prepare the draw command as soon as possible and have it read the number of visible asteroids directly from the buffer in the GPU memory.
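The control-flow difference can be sketched in plain Python, with ordinary objects standing in for GPU buffers and commands (all names here are invented for illustration, not any real graphics API):

```python
# Toy illustration of Draw Indirect vs. a CPU readback. Plain Python objects
# stand in for GPU buffers and commands; every name here is invented.

class GpuBuffer:
    """Stands in for a buffer living in GPU memory."""
    def __init__(self):
        self.visible_count = 0

def culling_shader(asteroids, out_buf):
    # "Compute shader": writes the number of visible asteroids to a GPU buffer.
    out_buf.visible_count = sum(1 for a in asteroids if a["visible"])

def draw_with_readback(asteroids, buf):
    # Without Draw Indirect: the CPU must read the count back from GPU memory
    # before it can dispatch the draw (a slow round-trip on a dGPU).
    culling_shader(asteroids, buf)
    count = buf.visible_count  # CPU <- GPU readback happens here
    return f"draw({count})"

def draw_indirect(asteroids, buf):
    # With Draw Indirect: the draw command references the buffer directly and
    # the count is resolved on the "GPU" side, with no CPU round-trip.
    culling_shader(asteroids, buf)
    return lambda: f"draw({buf.visible_count})"  # deferred until execution

asteroids = [{"visible": True}, {"visible": False}, {"visible": True}]
buf = GpuBuffer()
print(draw_with_readback(asteroids, buf))  # draw(2)
cmd = draw_indirect(asteroids, buf)
print(cmd())                               # draw(2)
```

Both paths produce the same draw; the difference is *who* reads the count and when, which is exactly the round-trip the post describes.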

Now, in an iGPU, all memory can be accessed by the CPU. So to implement Draw Indirect, the iGPU raises an interrupt, and the driver copies the number of asteroids from the buffer into the draw command using the CPU and dispatches it. Pretty fast. On a dGPU you have to do the same copy from the buffer into the draw command, but you need some dedicated hardware (or an onboard processor, like GCN's Asynchronous Compute Engine) to do it in order to avoid the CPU round-trip.

Arc probably doesn't have such an onboard processor. So yeah, all this amazing Draw/Execute Indirect speed-up is really Intel getting to grips with dGPU development. Remember, GCN is 12 years old now.
 
Last edited:

DavidC1

Senior member
Dec 29, 2023
319
518
96
Alchemist needs hand-tuning by the driver writers to optimize for weak APIs and engines such as Unreal Engine 5. That's because, as they said, Alchemist emulates a feature widely used by UE5.

Hardware team has been bottlenecking the driver team.
NOT
Driver team has been bottlenecking the hardware team.
 

KompuKare

Golden Member
Jul 28, 2009
1,047
1,049
136
Hardware team has been bottlenecking the driver team.
NOT
Driver team has been bottlenecking the hardware team.
Well as long as the hardware team were able to blame the driver team!

Intel internal politics being what it is, and the former Intel GPU boss being someone who came across as a keen player of internal politics!
 
Jul 27, 2020
17,466
11,259
106
The blaming game may have been real, but I doubt they were convincing enough.
He blamed Lisa, the name that is a success even when it's a failure!



I hope he learned his lesson and will stay away from all LISAs in the future.
 

ToTTenTranz

Member
Feb 4, 2021
103
152
86
Alchemist needs hand-tuning by the driver writers to optimize for weak APIs and engines such as Unreal Engine 5. That's because, as they said, Alchemist emulates a feature widely used by UE5.

Hardware team has been bottlenecking the driver team.
NOT
Driver team has been bottlenecking the hardware team.

The idea I get from Chips and Cheese's microbenchmarks on the A770 is that execution latencies are high and bandwidth at low workgroup counts is low.

So it does look like the hardware is highly dependent on hand-tuned driver optimizations to keep many ALUs occupied and thus hide the low effective bandwidth. It does look a bit like the same problems GCN used to have.
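The link between latency and occupancy follows from Little's law: to sustain a given bandwidth at a given memory latency, a proportional number of bytes must be in flight at all times, which is why high-latency hardware needs many ALUs (and many workgroups) kept busy. The numbers below are illustrative; the 512GB/s figure is the one quoted later in the thread and the latency is an assumed value:

```python
# Little's law applied to memory traffic: sustaining B bytes/s at a round-trip
# latency of L seconds requires B * L bytes outstanding at any instant.
# Numbers are illustrative, not measurements.
target_bw = 512e9   # 512 GB/s, roughly the A770-class figure quoted below
latency = 600e-9    # 600 ns assumed round-trip memory latency

bytes_in_flight = target_bw * latency
print(f"{bytes_in_flight / 1024:.0f} KiB must be in flight")  # 300 KiB must be in flight
```

If the scheduler can't keep that many outstanding requests going (for example, at low workgroup counts), achieved bandwidth falls well short of the paper number, which matches the microbenchmark behavior described above.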
 

blckgrffn

Diamond Member
May 1, 2003
9,179
3,141
136
www.teamjuchems.com
The idea I get from Chips and Cheese's microbenchmarks on the A770 is that execution latencies are high and bandwidth at low workgroup counts is low.

So it does look like the hardware is highly dependent on hand-tuned driver optimizations to keep many ALUs occupied and thus hide the low effective bandwidth. It does look a bit like the same problems GCN used to have.
Almost immediately, Intel stated that the design suffered from memory bandwidth issues. I am pretty sure Raja said that out loud in a post-launch interview. Based on that, I assume it was already being addressed in the hardware design of the next-generation parts.

It seems like a sophomore effort that addresses this (many details were probably baked in very close to the retail launch of the current-gen cards) and fixes some of the biggest gotchas while moving to a new node could create something much more desirable.
 

DavidC1

Senior member
Dec 29, 2023
319
518
96
I wonder which CPU benefits Arc the most, helping to keep it busy.
Ironically, it would be the one that is best at games: the Ryzen X3D series.
Almost immediately, Intel stated that the design suffered from memory bandwidth issues. I am pretty sure Raja said that out loud in a post-launch interview. Based on that, I assume it was already being addressed in the hardware design of the next-generation parts.
Saying that is akin to saying Vega suffered from memory bandwidth issues. It's just that both have a difficult time utilizing said bandwidth.

The C&C tests show this clearly: it may have 512GB/s of bandwidth, but is that available at all workload sizes? They found a high workgroup count was required to fully utilize it. Vega tried to counter this with HBM; it wasn't sufficient, and it doesn't help with small-workload performance, which is entirely dependent on the caching system, for example.

This is why the A770 was rumored to have x70-level performance. The MLID-type leakers have almost no technical knowledge, so shaders × clock speed plus memory speed looked x70-level to them, and they concluded "the A770 is an x70!!!"

Same conclusion as the folks who believed in 2x performance for RDNA3. Of course, for AMD it was slightly different: the dual-issue design is very different from actually spending transistors on double the number of shaders. It literally did not have enough transistors to perform at 2x. But the leakers thought "2x flops = 2x performance".
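The naive leaker arithmetic, and why dual-issue breaks it, can be shown with made-up numbers (none of these are official specs for any particular SKU):

```python
# The naive "peak flops = performance" estimate vs. a utilization-adjusted one.
# All figures are illustrative, not official specs.
def tflops(shaders, clock_ghz, ops_per_clock=2):
    # 2 ops/clock corresponds to one fused multiply-add per shader per cycle
    return shaders * clock_ghz * ops_per_clock / 1000

base = tflops(2560, 2.0)            # a hypothetical single-issue part
dual_issue = tflops(2560, 2.0, 4)   # dual-issue doubles the *peak* number
print(base, dual_issue)             # 10.24 20.48

# But if the compiler only manages to dual-issue, say, 20% of instructions,
# the realized throughput gain is nowhere near 2x:
realized = base * (1 + 0.2)
print(realized)                     # 12.288, far short of 20.48
```

Peak flops doubled on paper, realized throughput did not, which is the gap between the "2x flops = 2x performance" leaks and what shipped.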
 
Last edited:

blckgrffn

Diamond Member
May 1, 2003
9,179
3,141
136
www.teamjuchems.com
Saying that is akin to saying Vega suffered from memory bandwidth issues. It's just that both have a difficult time utilizing said bandwidth.

Yes, as they said at the time: bandwidth issues. They stated that raw bandwidth was available but that architecturally they were not able to exploit it.

It's actually pretty neat to get into the details on that. Having Raja talk about it and having Vega suffer from similar issues is ironic as well.

If Intel fumbles the ball this gen in the same way, that will be pretty disappointing. Also, since I am assuming it is resolved to a large degree, it will be interesting to see what the next performance bottleneck on their side is. It seems likely there will be more hardware dedicated to scheduling and to some of the most costly functions the drivers are doing in software now; let's hope there isn't another "bandwidth"-type issue that ties Battlemage's shoelaces together before the race starts.