Info AMD CDNA Compute GPU architecture

soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
Seems like AMD is truly creating a permanent separation between gaming- and compute-focused GPU uArchs - the new compute-focused uArch family is called CDNA.

[attached image: 1QXw1RdMYEkpb9Yq.jpg]

Hopefully this doesn't mean any significant drop in async compute and general Vulkan compute performance with RDNA.
 

Krteq

Senior member
May 22, 2015
991
671
136
One snapped by me:
[attached photo: cdnawjkk6.jpg]
 

Veradun

Senior member
Jul 29, 2016
564
780
136
I guess for now CDNA is just a label put on GCN (Vega). The true departure will happen somewhere down the road, my guess is with CDNA3.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
Not quite, RDNA still has general compute capabilities, it's just not focused on that, so likely no double-precision-focused RDNA ever, and it will lack CDNA's tensor acceleration too, so ML will not run nearly as well or as efficiently on RDNA.
I can imagine the first thing they will do is nerf Navi10's fp64 capabilities and a lot of the IF function blocks. Navi10 has a notably higher fp64 rate than Turing (1:16 of fp32 vs Turing's 1:32).
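For reference, the gap can be sketched from the published paper specs (boost-clock numbers for RX 5700 XT and RTX 2080 Ti; an illustration, not a benchmark):

```python
# Back-of-envelope fp64 comparison from published peak fp32 TFLOPS and
# each architecture's fp64:fp32 rate (paper specs, not measurements).
def peak_fp64(fp32_tflops, fp64_ratio):
    """Peak fp64 throughput given fp32 peak and the fp64:fp32 rate."""
    return fp32_tflops * fp64_ratio

navi10_fp64 = peak_fp64(9.75, 1 / 16)   # RX 5700 XT: fp64 at 1/16 rate
turing_fp64 = peak_fp64(13.45, 1 / 32)  # RTX 2080 Ti: fp64 at 1/32 rate

print(f"Navi10 ~{navi10_fp64:.2f} TFLOPS fp64")
print(f"TU102  ~{turing_fp64:.2f} TFLOPS fp64")
```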
 

GodisanAtheist

Diamond Member
Nov 16, 2006
6,719
7,016
136
I guess for now CDNA is just a label put on GCN (Vega). The true departure will happen somewhere down the road, my guess is with CDNA3.

-AMD seems to draw a distinction in their own slides and there are some serious changes under the hood as well.

They're ripping out all the rasterization HW used for pumping out graphics and replacing it with Tensor cores and other compute focused stuff.
 
  • Like
Reactions: Tlh97 and Stuka87

Hitman928

Diamond Member
Apr 15, 2012
5,182
7,633
136
Is there a write up of this anywhere?

Short write-up here:


It's faster than A100 in 'traditional' fp32 and fp64 but slower in pure matrix/bfloat calculations. In mixed workloads, MI100 may have the advantage as well. A100 has more VRAM but MI100 should be considerably cheaper unless Nvidia adjusts price in response.
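The vendor spec-sheet peaks behind that comparison, sketched quickly (dense, non-sparse tensor numbers for A100; treat these as paper TFLOPS, not measured performance):

```python
# Published peak throughput (TFLOPS) for MI100 vs A100 (40 GB version),
# taken from the vendors' spec sheets. matrix_fp16 is MI100's Matrix Core
# fp16 rate vs A100's dense Tensor Core fp16 rate.
specs = {
    "MI100": {"fp64": 11.5, "fp32": 23.1, "matrix_fp16": 184.6, "vram_gb": 32},
    "A100":  {"fp64": 9.7,  "fp32": 19.5, "matrix_fp16": 312.0, "vram_gb": 40},
}

for metric in ("fp64", "fp32", "matrix_fp16"):
    winner = max(specs, key=lambda gpu: specs[gpu][metric])
    print(f"{metric}: {winner} leads "
          f"({specs['MI100'][metric]} vs {specs['A100'][metric]})")
```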
 

Qwertilot

Golden Member
Nov 28, 2013
1,604
257
126
I did think I'd seen a few people here say that the A100 could somehow dual-purpose its tensor cores to give it a bunch more effective FP performance?

Probably depends somewhat on the details of specific workloads though.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,744
3,078
136
I did think I'd seen a few people here say that the A100 could somehow dual-purpose its tensor cores to give it a bunch more effective FP performance?
Yes, and they are wrong; I kept asking them to prove it and got crickets.

If you understand how a tensor core actually works, it's easy to understand why.

But the thing to also remember is memory/register pressure: execution is largely the easy part, it's the data movement that costs. So A100 / MI100 etc. will largely be designed so their execution resources match or exceed their bandwidth/register/cache read/write capability, because execution resources are cheap and easy while data movement is expensive and hard, so they just aren't going to leave that performance on the table.

If you could rewrite your FMA code as GEMM, then of course that is a different situation.
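The data-movement point can be made concrete with arithmetic intensity (flops per byte moved). A toy fp32 model, assuming ideal data reuse for GEMM:

```python
# Arithmetic intensity (flops per byte moved) is why GEMM can feed wide
# matrix units while streaming FMA kernels cannot. fp32 assumed (4 bytes).
BYTES = 4

def fma_intensity():
    # y[i] = a[i] * x[i] + y[i]: 2 flops per element; read a, x, y and
    # write y, so 4 memory accesses of 4 bytes each.
    return 2 / (4 * BYTES)

def gemm_intensity(n):
    # C = A @ B (n x n): 2*n^3 flops over 3*n^2 matrix elements, each
    # touched once from memory under ideal caching/reuse.
    return (2 * n ** 3) / (3 * n ** 2 * BYTES)

print(f"streaming FMA: {fma_intensity():.3f} flops/byte")
print(f"1024x1024 GEMM (ideal reuse): {gemm_intensity(1024):.1f} flops/byte")
```

The streaming kernel is stuck at a fraction of a flop per byte no matter how many ALUs exist, while GEMM's intensity grows with matrix size, which is the regime tensor/matrix units are built for.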
 

gdansk

Golden Member
Feb 8, 2011
1,988
2,357
136
On a similar node, it must be at least twice as big as Vega 20. Would this be the closest AMD has ever come to a reticle-limit chip?

They seem to be winning some big deals, but I guess if you're building a very expensive supercomputer you also have the money to hand-tune the software to run well on the machine. Where this will suffer is selling to cloud vendors who rent them out, since their clients will still prefer CUDA to HIP.
 
  • Like
Reactions: lightmanek

soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
and their clients will still prefer CUDA to HIP.
The entire point of HIP is portability from CUDA.

Not just to AMD hardware, but also back to CUDA from HIP if you wish, so you can keep a dual-hardware codebase if you don't mind it lagging the CUDA state of the art a bit.
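For context on what that portability looks like in practice: the hipify tooling is, at its simplest (hipify-perl), essentially a textual translation of CUDA API names to their HIP equivalents. A toy Python sketch of the idea (the mapping here is a tiny illustrative subset, not the real tools' table, and the real tools handle far more than renaming):

```python
# Toy sketch of hipify-perl style source translation: substitute CUDA
# runtime API names with their HIP equivalents. Word-boundary matching
# keeps "cudaMemcpy" from clobbering "cudaMemcpyHostToDevice".
import re

CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
}

def hipify(source: str) -> str:
    """Rename CUDA runtime calls/enums to HIP ones (illustrative subset)."""
    for cuda_name, hip_name in CUDA_TO_HIP.items():
        source = re.sub(rf"\b{cuda_name}\b", hip_name, source)
    return source

print(hipify("cudaMalloc(&d_a, n); cudaMemcpy(d_a, a, n, cudaMemcpyHostToDevice);"))
```

Because HIP names mirror CUDA names one-for-one where the APIs overlap, the same translation idea also works in reverse, which is what makes the dual-hardware codebase feasible.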
 

gdansk

Golden Member
Feb 8, 2011
1,988
2,357
136
The entire point of HIP is portability from CUDA.

Not just to AMD hardware, but also back to CUDA from HIP if you wish it = so you can keep a dual hardware codebase if you don't mind it lagging the CUDA state of the art a bit.
Have you tried using it? There are many corner cases where HIPify doesn't actually work and you'll have to dig through and fix things manually. And, as you say, you pretty much have to restrict yourself to CUDA 8-level features, which hasn't been a big deal but might be a step back for some people.

It's basically their only shot at getting Nvidia customers over to their side, but they need people to actually use it. And no one wants to use it, because Nvidia hardware isn't prohibitively expensive and the code we have written works as-is.
 

gdansk

Golden Member
Feb 8, 2011
1,988
2,357
136
Why is there supposedly VCN still included in these? Are they expected to be used for video decode/encode at all? Seems strange when all graphics capability has been stripped out.
For machine learning applications which need to decode video, per the overview:
"the AMD CDNA family retains dedicated logic for HEVC, H.264, and VP9 decoding that is sometimes used for compute workloads that operate on multimedia data, such as machine learning for object detection"
 

Saylick

Diamond Member
Sep 10, 2012
3,084
6,184
136
This thing must be *massive*
I think people were already estimating it to be in the low-to-mid-700 mm2 range for die size based on the size of the HBM PHYs. That seems kind of large to be honest: it's got 50% more CUs than Big Navi, which is estimated to be in the low-500 mm2 range with a huge 128 MB LLC, yet it has all of the graphics pipeline stripped out (no TMUs, ROPs, geometry engines, etc.). Are the tensor cores and doubled register files really that space hungry? I wouldn't imagine so.
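A quick sanity check on that estimate, using the thread's own guesses (the area and CU figures below are the estimates quoted in this thread, not measured numbers):

```python
# Back-of-envelope scaling behind the die-size estimate above.
# All inputs are the thread's guesses, not measurements.
big_navi_area_mm2 = 520     # "low-500 mm2" estimate, incl. the 128 MB LLC
big_navi_cus = 80
mi100_cus = 120             # 50% more CUs

# Naive linear scaling of the whole die with CU count (ignores that the
# graphics pipeline and LLC don't scale with CUs):
naive_mi100 = big_navi_area_mm2 * (mi100_cus / big_navi_cus)
print(f"naive linear CU scaling: ~{naive_mi100:.0f} mm^2")
```

Even this crude linear scaling, which doesn't credit CDNA for stripping the graphics hardware, lands near the estimated range, which is why a low-to-mid-700 mm2 die with no graphics pipeline reads as surprisingly large.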
 

Stuka87

Diamond Member
Dec 10, 2010
6,240
2,559
136
On a similar node, it must be at least twice as big as Vega 20. Would this be the closest AMD has ever come to a reticle-limit chip?

They seem to be winning some big deals, but I guess if you're building a very expensive supercomputer you also have the money to hand-tune the software to run well on the machine. Where this will suffer is selling to cloud vendors who rent them out, since their clients will still prefer CUDA to HIP.

Vega20 still had rasterization hardware in it. It was a full-blown GPU.

CDNA cards aren't video cards. They don't have the ability to output video. So yes, it will be larger than Vega20, but unlikely to be double, since lots of the stuff Vega20 had has been removed.
 
  • Like
Reactions: prtskg