Question Speculation: RDNA2 + CDNA Architectures thread

Page 75

uzzi38

Platinum Member
Oct 16, 2019
2,705
6,427
146
All die sizes are within 5mm^2. The poster here has been right on some things in the past afaik, and to his credit was the first to say 505mm^2 for Navi21, which other people have backed up. Even still, take the following with a pinch of salt.

Navi21 - 505mm^2

Navi22 - 340mm^2

Navi23 - 240mm^2

Source is the following post: https://www.ptt.cc/bbs/PC_Shopping/M.1588075782.A.C1E.html
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,166
15,311
136
Anyone else passing the time by reading the insane theories online?
I am getting bored with:
1) Hearing about what AMD is going to do with their next video cards...
2) Hearing about what Intel is GOING to do with anything!
3) Hearing about what AMD is doing with Zen3...
4) Hearing about when we might be able to buy a 3000 series Ampere.

Come on! I want something to discuss!!! The 3000 series is faster, but you can't buy it, so that leaves.....


NOTHING!!
 

reb0rn

Senior member
Dec 31, 2009
240
73
101
I must say there is so much misinformation that no one can even speculate on memory bandwidth as a starting point:
like 16GB can only be 256/512-bit or HBM2,
and 12GB is 384-bit.

If it's just 256-bit, I can't see it being any faster than the 3070, if at all.
 

A///

Diamond Member
Feb 24, 2017
4,351
3,158
136
I am getting bored with:
1) Hearing about what AMD is going to do with their next video cards...
2) Hearing about what Intel is GOING to do with anything!
3) Hearing about what AMD is doing with Zen3...
4) Hearing about when we might be able to buy a 3000 series Ampere.

Come on! I want something to discuss!!! The 3000 series is faster, but you can't buy it, so that leaves.....


NOTHING!!
I personally love, or maybe even loathe, the threads on other sites about what AMD should do: how AMD can counter bots, how AMD has failed already based on leaks (namely the 256-bit bus engineering sample card), or how they should just sell RTG off to NVIDIA.
 

Saylick

Diamond Member
Sep 10, 2012
3,531
7,858
136
Found this on Reddit. 20% IPC gains incoming?

https://adwaitjog.github.io/docs/pdf/sharedl1-pact20.pdf

Abstract:
Graphics Processing Units (GPUs) concurrently execute thousands of threads, which makes them effective for achieving high throughput for a wide range of applications. However, the memory wall often limits peak throughput. GPUs use caches to address this limitation, and hence several prior works have focused on improving cache hit rates, which in turn can improve throughput for memory-intensive applications. However, almost all of the prior works assume a conventional cache hierarchy where each GPU core has a private local L1 cache and all cores share the L2 cache. Our analysis shows that this canonical organization does not allow optimal utilization of caches because the private nature of L1 caches allows multiple copies of the same cache line to get replicated across cores.
We introduce a new shared L1 cache organization, where all cores collectively cache a single copy of the data at only one location (core), leading to zero data replication. We achieve this by allowing each core to cache only a non-overlapping slice of the entire address range. Such a design is useful for significantly improving the collective L1 hit rates but incurs latency overheads from additional communications when a core requests data not allowed to be present in its own cache. While many workloads can tolerate this additional latency, several workloads show performance sensitivities. Therefore, we develop lightweight communication optimization techniques and a run-time mechanism that considers the latency-tolerance characteristics of applications to decide which applications should execute in private versus shared L1 cache organization and reconfigures the caches accordingly. In effect, we achieve significant performance and energy efficiency improvements, at a modest hardware cost, for applications that prefer the shared organization, with little to no impact on other applications.
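The mechanism is easy to sketch. Here's a minimal toy model in Python, assuming a simple modulo mapping of cache lines to cores (the paper's actual slicing scheme may differ):

Code:
# Toy model of the paper's shared-L1 idea: each core may cache only a
# non-overlapping slice of the address space, so a line is never
# replicated across cores.
LINE_SIZE = 128   # bytes per cache line (typical for GPUs)
NUM_CORES = 16    # number of GPU cores (CUs/SMs)

def home_core(address):
    """Map a byte address to the one core allowed to cache its line."""
    line = address // LINE_SIZE
    return line % NUM_CORES   # simple modulo slice; the real hash may differ

def access(core, address):
    """A core either looks in its own slice or pays an interconnect hop."""
    owner = home_core(address)
    if owner == core:
        return "local L1 lookup"
    return f"forward to core {owner}'s L1 slice (extra interconnect latency)"

print(access(3, 0x180))   # line 3  -> home core 3 -> local
print(access(3, 0x2000))  # line 64 -> home core 0 -> remote hop

The runtime piece is then deciding, per application, whether the higher collective hit rate is worth the remote-hop latency, and flipping between private and shared organization accordingly.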
 

blckgrffn

Diamond Member
May 1, 2003
9,299
3,440
136
www.teamjuchems.com
I must say there is so much misinformation that no one can even speculate on memory bandwidth as a starting point:
like 16GB can only be 256/512-bit or HBM2,
and 12GB is 384-bit.

If it's just 256-bit, I can't see it being any faster than the 3070, if at all.

Fine. I promise that Big Navi is Hawaii reborn with a 512-bit bus. And infinity cache. Pinky swear.

Don’t ask for sources because I don’t have any. I just can’t let this train slow down.

If only I had a YouTube channel where I got paid per view 🤔
 

DiogoDX

Senior member
Oct 11, 2012
747
279
136
I must say there is so much misinformation that no one can even speculate on memory bandwidth as a starting point:
like 16GB can only be 256/512-bit or HBM2,
and 12GB is 384-bit.

If it's just 256-bit, I can't see it being any faster than the 3070, if at all.
12GB can be 192-bit too.
 

eek2121

Diamond Member
Aug 2, 2005
3,100
4,398
136
That's fair to ask - but if I could sell today and get nearly $400, and then in five weeks there is such a bountiful crop of AMD Navi cards that I lose ~$200 on resale, then that seems like lost money to me. Like deciding when to sell a stock...

And I've got a year's use out of this thing, so I could look at it as ~no cost per month of usage (sell now) or ~$20 per month (sell post launch), OR just pass it down to my son like I intended to and, like you, just get years of functional use out of it.

If used GPU prices hadn't been so crazy last year I probably would have tried to find a Vega 56 or something to nurse myself into RDNA2. I was so close to buying a Fury Nano on eBay for ~$105 shipped - I am kind of annoyed I didn't because of how niche that card was :tearsofjoy: (I put in an offer for $100 and he countered at $105 and I let it expire)

Maybe if you consider GPUs an investment? I usually end up giving them away. I routinely rebuild PCs and give them to family, friends, and those less fortunate (not necessarily in that order). To me it's a sunk cost. A part of my hobby.

I must say there is so much misinformation that no one can even speculate on memory bandwidth as a starting point:
like 16GB can only be 256/512-bit or HBM2,
and 12GB is 384-bit.

If it's just 256-bit, I can't see it being any faster than the 3070, if at all.
What's funny is that the bus width can actually be almost anything. Most people don't realize this, but yes, it's possible to have 16GB of GDDR6 on a 352- or 384-bit bus. There are a number of ways to do this (though to be fair, as far as I'm aware they aren't used). I'll leave it to your imagination to figure out how.
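To spoil the imagination exercise a little, here's the napkin math, with mixed chip densities as my assumed mechanism (illustrative only, not something any vendor has confirmed for these cards):

Code:
# GDDR6 chips are 32 bits wide and come in 8Gb (1GB) and 16Gb (2GB)
# densities, so capacity options follow from the channel count.
def chips(bus_width_bits):
    return bus_width_bits // 32

for bus in (192, 256, 352, 384):
    n = chips(bus)
    uniform = sorted(n * d for d in (1, 2))
    print(f"{bus}-bit bus -> {n} chips, uniform capacities: {uniform} GB")

# Mixing densities reaches 16GB on "odd" bus widths:
#   352-bit: 11 chips -> 5 x 2GB + 6 x 1GB = 16GB
#   384-bit: 12 chips -> 4 x 2GB + 8 x 1GB = 16GB
# (The GTX 550 Ti mentioned below did something similar in its day.)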

Found this on Reddit. 20% IPC gains incoming?

https://adwaitjog.github.io/docs/pdf/sharedl1-pact20.pdf

Abstract:

It is my understanding that the actual "IPC" gain (in quotes because can one really use the term 'IPC' for a GPU?) of the architecture, including everything (rendering, shaders, etc.), is closer to 7%. We will see, however. My information is based mostly on console-related stuff. I've seen numerous rumors and leaks indicating that PC RDNA2 parts are at least somewhat different from console parts, but I'm not sure those changes will help "IPC". AMD is going to reach performance by scaling CU count upwards. An "IPC" increase isn't needed; it's just icing on the cake. Coincidentally, a 50% perf/watt increase would go a long way toward letting a 72CU part run at roughly the same TDP as the RX 5700 XT (napkin math below). Food for thought. Assuming that they are able to scale performance up with CU count, well...
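Sanity-checking that with some napkin math of my own, assuming performance and power scale linearly with CU count and clock speed (crude, but that's forum speculation for you):

Code:
# What does +50% perf/watt buy at the RX 5700 XT's TDP?
BASE_CUS = 40   # RX 5700 XT
GAIN = 1.5      # claimed +50% perf/watt -> 1.5x performance at the same TDP

for cus in (60, 72, 80):
    # implied per-CU throughput (proxy for clocks) relative to the 5700 XT
    per_cu = GAIN / (cus / BASE_CUS)
    print(f"{cus} CUs -> ~{per_cu:.0%} of 5700 XT clocks at the same TDP")
# 60 CUs -> ~100%, 72 CUs -> ~83%, 80 CUs -> ~75%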

I know some people here may not understand the concept of AMD delivering solid execution, but they've been literally "executing" Intel. Anyone who claims they can't do the same thing to NVIDIA should stop posting here and short AMD stock. :D

EDIT: As an addendum to why "IPC" isn't really valid for GPUs: the "TFLOPs" measurement is the closest you'd get to IPC, and as you can see it is wildly abused (NVIDIA claims double the FP32 TFLOPs for the 3080 over the 2080 Ti, yet as we've witnessed, it performs only 20-30% faster). Once you start factoring in geometry, textures, clocks, shaders, etc., all bets are off.

EDIT 2: As an example of why IPC can't really be measured: the Vega 64 has 12.66 TFLOPs of compute power, nearly 30% more than the RX 5700 XT. However, you'll note that the RX 5700 XT beats the Vega 64 soundly in gaming. No, AMD isn't making up the TFLOPs number; Vega has really strong compute performance, just not-so-great gaming performance.
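For reference, the paper-spec TFLOPs figure is just shader count x clock x 2 (an FMA counts as two ops). Plugging in the public specs shows the gap:

Code:
# FP32 TFLOPs = shaders x boost clock x 2 ops (fused multiply-add = 2 flops)
def tflops(shaders, clock_mhz):
    return shaders * clock_mhz * 1e6 * 2 / 1e12

vega64   = tflops(4096, 1545)   # ~12.66 TFLOPs
rx5700xt = tflops(2560, 1905)   # ~9.75 TFLOPs
print(f"Vega 64:    {vega64:.2f} TFLOPs")
print(f"RX 5700 XT: {rx5700xt:.2f} TFLOPs")
print(f"Vega 64 has {vega64 / rx5700xt - 1:.0%} more paper compute, yet loses in games")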
 

Tup3x

Golden Member
Dec 31, 2016
1,086
1,084
136
Well, the new Xbox does have this weird memory layout. It might work there, since the CPU will mainly use the slower pool, but I'm not so sure how well it would work for a top-end discrete GPU. The GTX 550 Ti also had a similar arrangement.
 

pandemonium

Golden Member
Mar 17, 2011
1,777
76
91
It really is.

If I'm understanding what they're laying out in theory, they want to basically AI the entire pipeline from the start, by task.

Given their wide range of compute tests they used, I can see this having an impact on real-time rendering. Like DLSS improving vastly over a generation, this could have broad ramifications for how efficiently GPGPUs handle their tasks.
 

GodisanAtheist

Diamond Member
Nov 16, 2006
7,166
7,666
136
It really is.

If I'm understanding what they're laying out in theory, they want to basically AI the entire pipeline from the start, by task.

Given their wide range of compute tests they used, I can see this having an impact on real-time rendering. Like DLSS improving vastly over a generation, this could have broad ramifications for how efficiently GPGPUs handle their tasks.

- But is it going to be ready for a top-to-bottom RDNA2 stack? This type of radical technological shift looks like a prime candidate for a pipe-cleaner product or a mid-gen refresh, not a top-to-bottom stack launch.

Wonder if this is the kind of thing being kept in the pipe for an RDNA3 launch, or even further down the line.

After all the promises of new pathways and discard accelerators etc. in Vega and Polaris, I would be less surprised if AMD managed to bork the physical design so the feature is useless than if it works as advertised.
 

Zstream

Diamond Member
Oct 24, 2005
3,395
277
136
It really is.

If I'm understanding what they're laying out in theory, they want to basically AI the entire pipeline from the start, by task.

Given their wide range of compute tests they used, I can see this having an impact on real-time rendering. Like DLSS improving vastly over a generation, this could have broad ramifications for how efficiently GPGPUs handle their tasks.
This will likely not help FPS in games by 20% IMO, but rather compute workloads. I could see it being used in certain game engines where textures and meshes are reused.
 

Zstream

Diamond Member
Oct 24, 2005
3,395
277
136
What about hybrid ray tracing? Just wondering is all.
I believe ray tracing is unique at every turn, or at least it was when I ran simulations on my CPU 10 years or so ago. However, if meshes and textures are reused, it will help with tessellated scenes while offloading compute to RT.
 

Zstream

Diamond Member
Oct 24, 2005
3,395
277
136
Well...Yes.

Ray Tracing (BVH + Intersections) is quite heavy on cache and bandwidth
In what fluid scenes would you cache RT objects? The moment the camera moves, the light object changes course and the rays must be computed all over again. Maybe static scenes.
 

Kenmitch

Diamond Member
Oct 10, 1999
8,505
2,249
136
In what fluid scenes would you cache RT objects?

I'm not really into the technical aspects of the way things work.

What about when you're panning up, down, left, and right looking for your next victim? At least it sounds like it would be better to cache than to do it all over again.
 

Krteq

Senior member
May 22, 2015
993
672
136
In what fluid scenes would you cache RT objects? The moment the camera moves, the light object changes course and the rays must be computed all over again. Maybe static scenes.
What? I'm not talking about caching any objects; that doesn't make any sense.

I'm talking about BVH + ray intersection calculations being heavy on cache and bandwidth, because you have to store the BVH data in cache for ray intersection testing, etc.
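To make the cache pressure concrete, here's a toy traversal over a hand-built BVH (my own sketch, nothing to do with any vendor's implementation). Every ray walks the tree from the root, so the upper levels get fetched over and over again, which is exactly the data you'd want resident in a big cache:

Code:
# Toy BVH over 1D intervals (stand-in for 3D bounding boxes), counting
# node fetches. The fetches are the cache/bandwidth traffic in question.
from collections import Counter

bvh = {
    0: {"lo": 0, "hi": 8, "left": 1, "right": 2},   # root
    1: {"lo": 0, "hi": 4, "left": 3, "right": 4},
    2: {"lo": 4, "hi": 8, "left": 5, "right": 6},
    3: {"lo": 0, "hi": 2, "prim": "A"},             # leaves hold primitives
    4: {"lo": 2, "hi": 4, "prim": "B"},
    5: {"lo": 4, "hi": 6, "prim": "C"},
    6: {"lo": 6, "hi": 8, "prim": "D"},
}
fetches = Counter()

def traverse(x, node_id=0):
    """Find the leaf whose interval contains x (stand-in for a ray hit)."""
    fetches[node_id] += 1          # every visit = one node fetch from memory
    node = bvh[node_id]
    if not (node["lo"] <= x < node["hi"]):
        return None                # "ray" misses this node's bounds
    if "prim" in node:
        return node["prim"]        # leaf: do the actual intersection test
    return traverse(x, node["left"]) or traverse(x, node["right"])

for ray in (0.5, 1.5, 3.0, 5.5, 7.2):   # five "rays"
    traverse(ray)
print(fetches)   # node 0 (the root) is fetched by every single ray

The per-ray results differ (that part really is unique at every turn), but the BVH nodes themselves are shared, read-only data, which is why caching them helps even in fully dynamic scenes.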
 

A///

Diamond Member
Feb 24, 2017
4,351
3,158
136
And by writing needless replies to people that have differing opinions and 0% likelihood of changing said opinions? Yup.
I also like the angry ones. Spotted one on another forum yesterday where two AMD fans were threatening to murder each other because one said AMD drivers have always been crap and the other said he was lying, and then they realized they lived in the same country, and then the threats began. It was quite the show. I wish I had chips to snack on as it unfolded.
 

A///

Diamond Member
Feb 24, 2017
4,351
3,158
136
- But is it going to be ready for a top-to-bottom RDNA2 stack? This type of radical technological shift looks like a prime candidate for a pipe-cleaner product or a mid-gen refresh, not a top-to-bottom stack launch.

Wonder if this is the kind of thing being kept in the pipe for an RDNA3 launch, or even further down the line.

After all the promises of new pathways and discard accelerators etc. in Vega and Polaris, I would be less surprised if AMD managed to bork the physical design so the feature is useless than if it works as advertised.

This is AMD we're talking about. It's not unheard of for them to drop everything for something new, even if it means leaving older customers flailing in the wind. A mid-gen refresh doesn't seem like the kind of thing you'd do with a so-called MCM GPU coming after it. You'd want to set up your foundations well in advance and gather remotely collected data over time to improve your design process.
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
- But is it going to be ready for a top-to-bottom RDNA2 stack? This type of radical technological shift looks like a prime candidate for a pipe-cleaner product or a mid-gen refresh, not a top-to-bottom stack launch.

Wonder if this is the kind of thing being kept in the pipe for an RDNA3 launch, or even further down the line.

After all the promises of new pathways and discard accelerators etc. in Vega and Polaris, I would be less surprised if AMD managed to bork the physical design so the feature is useless than if it works as advertised.

They do need some outside-of-the-box solution to compete with Nvidia. I don't think they will compete well if they just make a bigger GPU. They are increasing efficiency significantly, but that seems necessary rather than sufficient. They may have done better than Nvidia, though, with TSMC's process versus Samsung's.

The cache rumors are still a bit odd. Such a cache may help significantly with ray tracing, though. If they are saying that it will perform like it has a 384-bit bus with only 256 bits, then it seems like it needs 4 IFOP-style links to deliver that kind of bandwidth, given the speed of GDDR6. I guess it could actually be 4x single-link devices, which would be very interesting, but expensive. That might still be cheap, though, especially if it was made at GF or something.
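Rough numbers behind that (my arithmetic, with an assumed 16 Gbps GDDR6 speed):

Code:
# GDDR6 bandwidth = (bus width / 8) x per-pin data rate
def gddr6_bw(bus_bits, gbps=16):
    return bus_bits / 8 * gbps   # GB/s

for bus in (256, 384):
    print(f"{bus}-bit @ 16 Gbps: {gddr6_bw(bus):.0f} GB/s")
# 256-bit -> 512 GB/s, 384-bit -> 768 GB/s: roughly a 256 GB/s deficit
# for the rumored cache (or extra fabric links) to hide.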

This had me wondering how such a device could be reused with Epyc processors. For Epyc, it would make the most sense as an additional chip that fits between the IO die and the CPU dies without really needing to change either. If they could fit such a cache chip on either side of the Epyc IO die, they could have 2x 128 MB transparent L4 caches per Epyc package, provided the cache chip has 8 IFOP links rather than just 4: four connecting to the IO die and the other four to the CPU dies. The problem is that such a device would be quite large, possibly 200 to 250 square mm or so. That probably wouldn't fit the existing Epyc package layout.

This is wild speculation, but it led me to wonder if this could be a "pipe-cleaning" stacked device. With Zen 4, they will want to move to an active interposer for the IO die, with the CPU and possibly memory stacked on top. That is a big change to make all at once. Could this cache device be a precursor, with 8x IFOP in one layer and 128 MB of cache stacked on top? Maybe later the cache die stacks with the CPU. That would probably fit on an existing Epyc package with essentially the same layout. They could make some parts without cache and some with; Intel wouldn't even be able to compete with the cheaper parts that have no L4.

This is continuing wild speculation, but if an RDNA GPU has 4 IFOP links, then they could technically connect two GPUs together directly, or with one of these caches in between, with something like 150 to 200 GB/s or so in each direction. There have been some Infinity Architecture slides that show CDNA GPUs with what looks like 6 links, connecting up to 8 GPUs to each other. That isn't actually fully connected, but the slide may not be representative, or they may not support full connectivity with 8 GPUs. It may be possible to connect GPUs with IFOP on the same board. The dies used in current AMD MCMs are really just BGA packages, which is why they shouldn't be called chiplets; that term should be reserved for devices on silicon interposers. For HPC, I could see them mounting 4 to 8 HBM GPUs very close to each other, with IFOP connecting adjacent GPUs and IFIS-style links for the longer runs. You would definitely need water cooling, but a lot of HPC has already gone to water cooling. If they are IFIS links rather than IFOP, then it wouldn't be quite as fast or power efficient, but it would allow multiple GPUs on the same board with larger spacing between them.
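A quick link-budget check on that fully-connected point (just counting ports):

Code:
# Fully connecting n GPUs point-to-point needs n-1 links per GPU.
for n in (4, 8):
    print(f"{n} GPUs: {n - 1} links per GPU for full connectivity")
# With 6 links per GPU, 4 GPUs are easy (3 needed) but 8 GPUs are not
# (7 needed), consistent with the slide not being fully connected.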

If they are using it in their CDNA architecture, then it wouldn't be much of a stretch for it to show up in consumer cards, if they can figure out how to make good use of it. Multi-GPU works fine for some compute applications; I have seen cases where 2 GPUs, each with half the compute and half the memory bandwidth, perform almost the same as one large GPU. That doesn't necessarily work well for rendering, though. Although, at that link speed, they could do unified memory and some other things that might be interesting. AMD has fully virtualized memory for their GPUs, similar to CPU memory, which should facilitate sharing.

This brings me back around to wondering whether the 128 MB cache rumor is someone's misunderstanding. We seem to have only one source for it. Could it actually be a 128-bit infinity fabric connection (4x IFOP) rather than a 128 MB "infinity cache"? The whole infinity cache thing could just be made up, or it could refer to something completely different, like sharing memory across infinity fabric.

I think this has been my journey for this weekend. I have things to do.
 

A///

Diamond Member
Feb 24, 2017
4,351
3,158
136
I haven't bothered to look into it as much as you have, @jamescox, but when Andrei mentioned it last week, he did state (from memory) that it would benefit their entire range of processors as a simple cache system, and that it wouldn't be costly at scale.

Zen 4 could be larger. Ideally you wouldn't want it too large with too much space, as that will begin to affect the little things.

This is continuing wild speculation, but if an RDNA GPU has 4 IFOP, then they could technically connect two GPUs together directly, or with one of these caches in between, with something like 150 to 200 GB/s or so in each direction. There has been some infinity architecture slides that show CDNA GPUs with what looks like 6 links, connecting up to 8 GPUs to each other.
Sorry, are you referring to the patent going around with the crossbar in the middle? It's only suspected to be for CDNA, not for both CDNA and RDNA. It's a stepping stone towards MCM, which is rumored to be slated for RDNA3.
 