Question Speculation: RDNA3 + CDNA2 Architectures Thread


uzzi38

Platinum Member
Oct 16, 2019

Olikan

Platinum Member
Sep 23, 2011
Can the smarty pants in this thread break down this VLIW news a bit? I recall AMD was on VLIW5/4 with their original DX10/11 architectures, but moved away from VLIW for GCN onward because of occupancy issues, the arch's weakness with compute workloads, etc.

What does VLIW "2" do for AMD in modern workloads, and how does it overcome the issues that made AMD move away from it in the first place?
It's a follow-up to the "super-SIMD" patent. RDNA3 can (apparently) chain multiple instructions in its dual ALUs, pretty much bouncing the results of one ALU to another... it's nothing like the old GPUs.

This latest patent describes a compiler that keeps track of any dependencies, and if it finds one, it can delay one ALU while it waits for the results of another... i.e., it makes the GPU lose 1 or 2 cycles to avoid a complete GPU stall.
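A minimal sketch of that chaining idea, assuming a forwarding path between the two ALUs; the names and two-ALU arrangement here are illustrative assumptions, not anything from the patent:

```python
# Toy model of "super-SIMD" chaining: the second ALU consumes the first
# ALU's result over a forwarding path, with no register-file round trip.
# Purely illustrative; names and structure are assumptions.

def alu(op, a, b):
    """One single-cycle ALU operation."""
    return {"add": a + b, "mul": a * b}[op]

def chained_issue(op0, a, b, op1, c):
    forwarded = alu(op0, a, b)     # ALU0's result, never written back
    return alu(op1, forwarded, c)  # ALU1 uses it in the same slot

# v = (v0 + v1) * v2 issued as one chained pair instead of two serialized
# instructions with a register write and read between them.
print(chained_issue("add", 2, 3, "mul", 4))  # prints 20
```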
 

GodisanAtheist

Diamond Member
Nov 16, 2006
It's a follow-up to the "super-SIMD" patent. RDNA3 can (apparently) chain multiple instructions in its dual ALUs, pretty much bouncing the results of one ALU to another... it's nothing like the old GPUs.

This latest patent describes a compiler that keeps track of any dependencies, and if it finds one, it can delay one ALU while it waits for the results of another... i.e., it makes the GPU lose 1 or 2 cycles to avoid a complete GPU stall.

- Gotcha, so it basically reduces stalling and increases occupancy, so the CUs are doing more productive work more often.
 

Ajay

Lifer
Jan 8, 2001
It's a follow-up to the "super-SIMD" patent. RDNA3 can (apparently) chain multiple instructions in its dual ALUs, pretty much bouncing the results of one ALU to another... it's nothing like the old GPUs.

This latest patent describes a compiler that keeps track of any dependencies, and if it finds one, it can delay one ALU while it waits for the results of another... i.e., it makes the GPU lose 1 or 2 cycles to avoid a complete GPU stall.
The whole GPU? Or just one or more CUs?
 

moinmoin

Diamond Member
Jun 1, 2017
The whole GPU? Or just one or more CUs?
From the patent:
An ALU pipeline has a length that corresponds to a predetermined number of cycles, such as a four-cycle long pipeline. A dependent instruction can therefore stall if it is sent to the pipeline before completion of the instruction it depends upon. For example, if an add instruction is dependent upon a move instruction and the ALU pipeline is four cycles long, the add instruction stalls for three cycles if it is sent to the pipeline one cycle after the move instruction. A conventional GPU includes a hardware instruction scoreboard to store information (e.g., in one or more flops) that is used to delay transmission of dependent instructions to the ALU pipeline until completion of the instructions that the dependent instructions depend upon. For example, in some cases, the instruction scoreboard includes six registers (entries) to store information indicating the processing status of six instructions that were previously issued to the pipeline. Every instruction compares its source registers to the destination registers of the instructions in the instruction scoreboard to identify any dependencies. If an instruction is dependent upon one or more of the instructions in the instruction scoreboard, the corresponding entry in the instruction scoreboard is monitored to determine when to send the dependent instruction to the pipeline. This process involves circuitry to perform instruction decoding and numerous comparisons of the registers. Consequently, the hardware instruction scoreboard incurs high costs in both power consumption and area on the chip.
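A rough software model of the hardware scoreboard described there, using the patent's own numbers (six entries, four-cycle pipeline); everything else is assumed for illustration:

```python
# Each new instruction compares its source registers against the
# destination registers of in-flight instructions in the scoreboard.
PIPELINE_DEPTH = 4  # four-cycle ALU pipeline, per the patent's example
ENTRIES = 6         # six scoreboard entries, per the patent's example

def stall_cycles(instr, scoreboard, now):
    """Cycles the instruction must wait before it can issue safely.
    scoreboard: list of (dest_reg, issue_cycle) for in-flight instrs."""
    wait = 0
    for dst, issued_at in scoreboard[-ENTRIES:]:
        if dst in instr["src"]:  # source matches in-flight destination
            wait = max(wait, issued_at + PIPELINE_DEPTH - now)
    return max(wait, 0)

# The patent's example: an add depends on a move issued one cycle
# earlier, so it stalls for three cycles.
sb = [("v0", 0)]                          # move -> v0 issued at cycle 0
add = {"src": ["v0", "v1"], "dst": "v2"}
print(stall_cycles(add, sb, now=1))       # prints 3
```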
The patent is all about moving that process from hardware to software:
FIGS. 1-4 illustrate an instruction scoreboard for the ALU pipeline that is implemented in software to reduce power consumption and area consumed by hardware in the GPU. The software-based instruction scoreboard indicates dependencies between instructions issued to the ALU pipeline with a separation between instructions smaller than the pipeline duration (referred to as “closely-spaced” instructions). The software-based instruction scoreboard selectively inserts one or more delay instructions, referred to as “control words”, into the command stream between the dependent instructions in the program code, which is then executed by the GPU. The control words identify the instruction(s) upon which the dependent instructions depend (referred to herein as “parent instructions”) so that the GPU hardware does not issue the dependent instruction to the ALU pipeline and cause the ALU pipeline to stall because the parent instruction has not yet completed.

In some embodiments, the software-based instruction scoreboard inserts a control word into the command stream immediately prior to the dependent instruction and the control word indicates the previous instruction from which the dependent instruction depends. For example, the control word indicates that the next instruction in the command stream depends on the Nth previous vector ALU (VALU) instruction. In some embodiments, the software-based instruction scoreboard implements a control word compression technique to include two or more delay values identifying two or more dependencies of upcoming instructions in a single control word to reduce instruction stream overhead. For example, a single control word identifies a parent instruction to the next instruction in the command stream and further includes a “skip” indicator identifying an instruction issuing subsequent to the next instruction as dependent on another instruction in the command stream. This control word compression technique can apply to any number of dependency specifiers per control word. In some embodiments, the control word indicates a dependency of one instruction on two or more parent instructions executing at more than one ALU pipeline. For example, in some embodiments the control word indicates a dependency on instructions executing at both a scalar ALU pipeline and at a vector ALU pipeline, or on both a special function unit (e.g., sine/cosine) ALU pipeline and one of the scalar ALU pipeline and the vector ALU pipeline.
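For illustration, a compressed control word of the kind described could carry fields like the sketch below; the layout is invented, since the patent text here doesn't specify an encoding:

```python
# Hypothetical control-word fields: a delay specifier for the next
# instruction plus a "skip" specifier covering a later instruction.
from dataclasses import dataclass

@dataclass
class ControlWord:
    dep_n: int           # next instruction depends on the Nth previous VALU op
    skip: int = 0        # instructions to skip before the second dependency
    skip_dep_n: int = 0  # that later instruction's dependency, if any

# One word encoding two dependencies: "the next instruction depends on the
# 2nd previous VALU op, and the instruction after it on the 4th previous."
cw = ControlWord(dep_n=2, skip=1, skip_dep_n=4)
print(cw)
```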

The software-based instruction scoreboard generates the control words based on a dependency graph maintained by a compiler. The dependency graph identifies all dependencies within a program. However, not every dependent instruction requires the delay occasioned by a control word. Depending on the depth of the ALU pipeline and the number of independent instructions issuing between a parent instruction and a dependent instruction, it may not be necessary to insert extra idle cycles between the parent instruction and the dependent instruction. In some embodiments, the software-based instruction scoreboard only inserts control words as necessary, based on the number of independent instructions between dependent instructions and the number of stages of the ALU pipeline. For example, if the dependent instruction issues more than a threshold number of cycles based on the length of the ALU pipeline after its parent instruction, the parent instruction will have completed before the dependent instruction issues, and no additional idle cycles will be needed to avoid a stall of the dependent instruction. Thus, the software-based instruction scoreboard only inserts a control word in the command stream if the dependent instruction issues within the threshold number of cycles after its parent instruction. In some embodiments, the threshold number of cycles is based on the number of stages of the ALU pipeline. The techniques described herein are, in different embodiments, employed at any of a variety of parallel processors (e.g., vector processors, graphics processing units (GPUs), general-purpose GPUs (GPGPUs), non-scalar processors, highly-parallel processors, artificial intelligence (AI) processors, inference engines, machine learning processors, other multithreaded processing units, and the like).
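Putting the pieces together, the compiler-side policy amounts to something like this sketch; the `s_delay` mnemonic and the data structures are my own illustration, not AMD's encoding:

```python
PIPELINE_DEPTH = 4  # issue-distance threshold tied to pipeline length

def insert_control_words(instrs, deps):
    """instrs: instruction names in issue order (assumed unique).
    deps: {dependent: parent} edges from the compiler's dependency graph."""
    out = []
    for i, name in enumerate(instrs):
        parent = deps.get(name)
        if parent is not None:
            gap = i - instrs.index(parent)  # issue-slot distance to parent
            if gap < PIPELINE_DEPTH:        # parent still in flight
                out.append(f"s_delay(dep=-{gap})")  # emit control word
        out.append(name)
    return out

# add_c depends on mov_a only two issue slots earlier -> needs a delay;
# with four or more slots between them, no control word would be emitted.
print(insert_control_words(["mov_a", "mul_b", "add_c"], {"add_c": "mov_a"}))
# ['mov_a', 'mul_b', 's_delay(dep=-2)', 'add_c']
```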
 

Saylick

Diamond Member
Sep 10, 2012
Thanks for linking the patent. This is starting to sound like a move back to software-based scheduling for the graphics side of things, much like what Nvidia did with Kepler, and they got perf/W and perf/area gains that generation. Maybe we will see 50% perf/W gains for the 6nm variant of RDNA 3...

https://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/3


The end result is an interesting one, if only because by conventional standards it’s going in reverse. With GK104 NVIDIA is going back to static scheduling. Traditionally, processors have started with static scheduling and then moved to hardware scheduling as both software and hardware complexity has increased. Hardware instruction scheduling allows the processor to schedule instructions in the most efficient manner in real time as conditions permit, as opposed to strictly following the order of the code itself regardless of the code’s efficiency. This in turn improves the performance of the processor.

However based on their own internal research and simulations, in their search for efficiency NVIDIA found that hardware scheduling was consuming a fair bit of power and area for few benefits. In particular, since Kepler’s math pipeline has a fixed latency, hardware scheduling of the instruction inside of a warp was redundant since the compiler already knew the latency of each math instruction it issued. So NVIDIA has replaced Fermi’s complex scheduler with a far simpler scheduler that still uses scoreboarding and other methods for inter-warp scheduling, but moves the scheduling of instructions in a warp into NVIDIA’s compiler. In essence it’s a return to static scheduling.
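A toy illustration of what static scheduling buys when latencies are fixed and known: the compiler can hoist independent instructions into a dependent instruction's wait slots at compile time, instead of a hardware scoreboard discovering the stall at run time. A greedy sketch under assumed names:

```python
LATENCY = 4  # fixed math-pipe latency, known to the compiler

def schedule(instrs, deps):
    """Greedy list scheduler. instrs: names in program order;
    deps: {instruction: parent it must wait for}."""
    done_at, slots, cycle = {}, [], 0
    ready = list(instrs)
    while ready:
        for ins in ready:  # first instruction whose parent has finished
            p = deps.get(ins)
            if p is None or done_at.get(p, float("inf")) <= cycle:
                break
        else:
            ins = None
        if ins is None:
            slots.append("nop")             # nothing independent to hoist
        else:
            ready.remove(ins)
            slots.append(ins)
            done_at[ins] = cycle + LATENCY  # result ready at this cycle
        cycle += 1
    return slots

# c depends on a; independent b and d are moved into the latency shadow.
print(schedule(["a", "c", "b", "d"], {"c": "a"}))
# ['a', 'b', 'd', 'nop', 'c']
```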
 

GodisanAtheist

Diamond Member
Nov 16, 2006
AMD's driver team has been firing on all cylinders recently.

Moving back to software scheduling sounds like a high-risk/high-reward strategy for them.

Sounds like game-level performance optimizations will skew more heavily to the software side of things for AMD with RDNA 3.
 

Mopetar

Diamond Member
Jan 31, 2011
The reason that the industry moved away from VLIW in the first place was that the compilers weren't good enough to be able to fully leverage the hardware.

Also, is this AMD adding support for the instruction format so that it can be used in cases where it makes sense, or are they jumping in fully and restructuring the architecture around that format?

There's no real problem with the former. It's not really any different from CPUs adding support for SIMD instructions so that programs that can take advantage of them get better performance. However, it's an entirely different matter to completely overhaul a CPU so that everything has to be done with those SIMD instructions.

I don't know what AMD's long term plans are with respect to their GPU architecture, but I think it would be better for them to dip their toes in and evolve this rather than jumping in to such a big change.
 

Saylick

Diamond Member
Sep 10, 2012
The reason that the industry moved away from VLIW in the first place was that the compilers weren't good enough to be able to fully leverage the hardware.

Also, is this AMD adding support for the instruction format so that it can be used in cases where it makes sense, or are they jumping in fully and restructuring the architecture around that format?

There's no real problem with the former. It's not really any different from CPUs adding support for SIMD instructions so that programs that can take advantage of them get better performance. However, it's an entirely different matter to completely overhaul a CPU so that everything has to be done with those SIMD instructions.

I don't know what AMD's long term plans are with respect to their GPU architecture, but I think it would be better for them to dip their toes in and evolve this rather than jumping in to such a big change.
I agree, especially from a driver standpoint. I suppose RDNA 1 and RDNA 2 will be unaffected if AMD shifts to a software/static scheduler, largely because those architectures can already manage on their own with their hardware schedulers, but the drivers would need to be rebuilt for RDNA 3 if it relies heavily on software scheduling. I've seen MLID say stuff about AMD wanting to really polish their drivers for RDNA 3's launch, and this could be one additional reason for that.

With that said, VLIW does have its complications, but perhaps not so bad when it's just VLIW2, especially when the GPU has the fallback option of operating in "VLIW1" mode. It seemed like the biggest hurdle with TeraScale (VLIW4 and VLIW5) was getting the compiler good enough to keep all of the execution units fed; I don't feel like that is a problem with RDNA because the width of the SIMD is sized to match the wave. Like Nvidia figured out with Kepler, if you never have to worry about splitting a wave across multiple clock cycles and you also know the latency of each instruction, software scheduling is vastly simplified. As far as I am aware, even Nvidia's HPC GPUs still use software scheduling. AMD may want to stick with GCN/hardware scheduling for their HPC GPUs, if only because there might be some tangible benefits to having one in the HPC space, but at least for consumer graphics a hardware-based scheduler is likely unnecessary. AMD went from software-based scheduling to hardware-based scheduling because GCN was their foray into GPU computing; now that they have two GPU architectures, they can afford to specialize each architecture further for its intended purpose.
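A sketch of why VLIW2 pairing is a much easier compiler problem than TeraScale's 4- or 5-wide packing: the compiler only has to find one independent partner per slot, and can always fall back to single issue. The pairing rule here is invented for illustration:

```python
def pack_vliw2(instrs):
    """instrs: (dest, sources) tuples in program order. Pairs adjacent
    independent ops; falls back to "VLIW1" single issue otherwise."""
    slots, i = [], 0
    while i < len(instrs):
        if i + 1 < len(instrs) and instrs[i][0] not in instrs[i + 1][1]:
            slots.append((instrs[i], instrs[i + 1]))  # dual issue
            i += 2
        else:
            slots.append((instrs[i],))                # single issue
            i += 1
    return slots

prog = [("v0", ("v1", "v2")),
        ("v3", ("v4", "v5")),   # independent -> pairs with the first op
        ("v6", ("v3", "v0")),
        ("v7", ("v6", "v6"))]   # reads v6 -> can't pair, single issue
print([len(s) for s in pack_vliw2(prog)])  # [2, 1, 1]
```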
 

Frenetic Pony

Senior member
May 1, 2012
With that said, VLIW does have its complications, but perhaps not so bad when it's just VLIW2, especially when the GPU has the fallback option of operating in "VLIW1" mode. It seemed like the biggest hurdle with TeraScale (VLIW4 and VLIW5) was getting the compiler good enough to keep all of the execution units fed; I don't feel like that is a problem with RDNA because the width of the SIMD is sized to match the wave. Like Nvidia figured out with Kepler, if you never have to worry about splitting a wave across multiple clock cycles and you also know the latency of each instruction, software scheduling is vastly simplified. As far as I am aware, even Nvidia's HPC GPUs still use software scheduling. AMD may want to stick with GCN/hardware scheduling for their HPC GPUs, if only because there might be some tangible benefits to having one in the HPC space, but at least for consumer graphics a hardware-based scheduler is likely unnecessary. AMD went from software-based scheduling to hardware-based scheduling because GCN was their foray into GPU computing; now that they have two GPU architectures, they can afford to specialize each architecture further for its intended purpose.

AMD seems really bent on pushing their perf per mm² with RDNA3, and shrinking the CUs will go a long way towards that, i.e. via software scheduling. Other than better RT performance, which might come from that much earlier hybrid texture-fetch/RT-accelerator patent, I can't think of any features they really need to add, so getting more perf per transistor might be doable in RDNA3, even beyond the higher clock speeds from going to 5nm and higher power draw. I wonder if the VLIW addition has anything to do with that.

I'd still be surprised if that 50% perf-per-watt increase came from almost the same node, i.e. the vague, maybe false notion of a 6nm-only RDNA3 that AMD didn't acknowledge at all in their presentation, for whatever reason. They're already doing great on perf per watt: judging from the Samsung collab, they're on par with Qualcomm, the second-best mobile GPU vendor out there in terms of perf per watt. But I'm willing to be proven wrong; after all, RDNA2 isn't on par with Apple's incredibly performant arch, so there's already an example in mass production that better can still be accomplished. After all, shouldn't software scheduling eliminate a bunch of reading and writing to registers, giving lower power usage right there?
 

Olikan

Platinum Member
Sep 23, 2011
Other than better RT performance, which might come from that much earlier hybrid texture-fetch/RT-accelerator patent, I can't think of any features they really need to add.

Pure speculation: for the next generation, AMD might improve the "dual vector issue" to be more flexible... like 4 lanes of SIMD16 for a "quad vector issue": do 1 wave64, 2 wave32, or 4 wave16 per clock... maybe even unusual issues like wave48 + wave16.
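The lane math behind that speculation, as a quick sketch (entirely hypothetical, following the post above):

```python
SIMD_WIDTH, PIPES = 16, 4
LANES = SIMD_WIDTH * PIPES  # 64 lanes available per clock

# Different wave mixes that would fill the four SIMD16 pipes in one clock.
for mix in ([64], [32, 32], [16, 16, 16, 16], [48, 16]):
    assert sum(mix) <= LANES
    label = " + ".join("wave%d" % w for w in mix)
    print(label, "->", sum(mix), "of", LANES, "lanes filled")
```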
 

Saylick

Diamond Member
Sep 10, 2012
Pure speculation: for the next generation, AMD might improve the "dual vector issue" to be more flexible... like 4 lanes of SIMD16 for a "quad vector issue": do 1 wave64, 2 wave32, or 4 wave16 per clock... maybe even unusual issues like wave48 + wave16.
There's an old patent that people speculated on for a bit before Vega came out, suggesting AMD might have a variable-width SIMD design:

In essence, if implemented, it would help with keeping the whole SIMD bank fed with instructions. Not sure how complicated the logic needs to be to get it to work, but Armv9 does it if I'm not mistaken.
 

Frenetic Pony

Senior member
May 1, 2012
There's an old patent that people speculated on for a bit before Vega came out, suggesting AMD might have a variable-width SIMD design:

In essence, if implemented, it would help with keeping the whole SIMD bank fed with instructions. Not sure how complicated the logic needs to be to get it to work, but Armv9 does it if I'm not mistaken.

AFAIK the guy behind this went off to do his own startup, one which Jim Keller is a major backer of and is now CTO of. Might be a missed opportunity on AMD's part, not that they can't just buy the startup.
 

Leeea

Diamond Member
Apr 3, 2020
I will believe this whole software scheduler thing when I see it.

I have seen the speculation on RDNA3 go all over the place. Right now this is just wild rumors.
 

eek2121

Platinum Member
Aug 2, 2005
I will believe this whole software scheduler thing when I see it.

I have seen the speculation on RDNA3 go all over the place. Right now this is just wild rumors.

Same, though it would not surprise me. AMD has already shown with Zen 4 that they are willing to use their competitors' techniques (higher power limits, higher frequencies) in their own products. A move to software scheduling would just be heading in the direction of what NVIDIA does. It also has the advantage that game-specific or engine-specific optimizations can be applied.
 

Karnak

Senior member
Jan 5, 2017
Interview with Naffziger from AMD:

“It's really the fundamentals of physics that are driving this,” Naffziger explained. "The demand for gaming and compute performance is, if anything, just accelerating, and at the same time, the underlying process technology is slowing down pretty dramatically — and the improvement rate. So the power levels are just going to keep going up. Now, we've got a multi-year roadmap of very significant efficiency improvements to offset that curve, but the trend is there.”
"Performance is king," stated Naffziger, "but even if our designs are more power-efficient, that doesn't mean you don't push power levels up if the competition is doing the same thing. It's just that they'll have to push them a lot higher than we will."
We asked at one point whether the chiplets would be similar to Aldebaran (two large dies with a fast interface linking them) or more like the Ryzen CPUs with an I/O chiplet and multiple compute chiplets. The best we could get out of him was a statement that the latter approach was "a reasonable inference" and that AMD would be doing its chiplet-based GPU architecture in "a very graphics-specific way."

 

DisEnchantment

Golden Member
Mar 3, 2017
Based on Sam's statements, it seems chiplets are not only for memory/cache. So does that mean multiple GCDs are back on the table?
We asked at one point whether the chiplets would be similar to Aldebaran (two large dies with a fast interface linking them) or more like the Ryzen CPUs with an I/O chiplet and multiple compute chiplets. The best we could get out of him was a statement that the latter approach was "a reasonable inference" and that AMD would be doing its chiplet-based GPU architecture in "a very graphics-specific way."
Any hints from the usual tipsters? Or is everybody just rolling dice, and whatever comes up is how they leaked the chiplet count? :D

When I checked David's presentation, it seems MI300 will have "3D Chiplet Packaging", whereas RDNA3 will have "Advanced Chiplet Packaging".
So it seems 3D chiplets are out of the question for Navi 3x, probably for good reasons too, like cost etc.; probably just EFB.

So far the consensus I have seen is 96 CUs, which would seem very conservative to me if they are going the chiplet route.
1 CU = ~2.05 mm² for N21, and CP+L2+ACE+etc. (excluding IC, PHY, IO, multimedia) is around 110 mm² for N21.
As a rough estimate, even with a 30% bigger CU, 2x the L2 for 6 SEs, and a 20% bigger ACE/CP etc., it would still be around 300 mm² for 96 CUs.

What's the new consensus after the reveal?

Are we gonna get a real Ryzen-style chiplet layout (hinted at by Sam) like Layout 1 below, or a more conservative Layout 2?
[Attachment 63617: mock-up of Layout 1 vs. Layout 2]
400 mm² just for the SEs and ACE/CP would be plenty of MTr on N5 if we are not getting 2x GCDs.
 

maddie

Diamond Member
Jul 18, 2010
Based on Sam's statements, it seems chiplets are not only for memory/cache. So does that mean multiple GCDs are back on the table?

Any hints from the usual tipsters? Or is everybody just rolling dice, and whatever comes up is how they leaked the chiplet count? :D

When I checked David's presentation, it seems MI300 will have "3D Chiplet Packaging", whereas RDNA3 will have "Advanced Chiplet Packaging".
So it seems 3D chiplets are out of the question for Navi 3x, probably for good reasons too, like cost etc.; probably just EFB.

So far the consensus I have seen is 96 CUs, which would seem very conservative to me if they are going the chiplet route.
1 CU = ~2.05 mm² for N21, and CP+L2+ACE+etc. (excluding IC, PHY, IO, multimedia) is around 110 mm² for N21.
As a rough estimate, even with a 30% bigger CU, 2x the L2 for 6 SEs, and a 20% bigger ACE/CP etc., it would still be around 300 mm² for 96 CUs.

What's the new consensus after the reveal?

Are we gonna get a real Ryzen-style chiplet layout (hinted at by Sam) like Layout 1 below, or a more conservative Layout 2?
[Attachment 63617: mock-up of Layout 1 vs. Layout 2]
400 mm² just for the SEs and ACE/CP would be plenty of MTr on N5 if we are not getting 2x GCDs.
N21 = 80 CU & N31 = 96 CU? Seems wrong if everything else is assumed correct.

A 96 CU N31 on N7, using the inflated areas above to compare against N21, comes to roughly 388 mm² (96 × 2.05 × 1.3 + 110 × 1.2).

Surely this design on N5 would shrink to at most ~60% of that, ~232 mm², or ~200 mm² at 2x scaling.

So, definitely room for much more than 96 CUs. The WGPs are supposed to be 2x the previous ones. Maybe people are mixing up CU and WGP numbers, especially if N7-to-N5 scaling is closer to 2x.
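The two estimates above, worked through; every input (per-CU area, the +30%/+20% growth factors, the N5 scaling guesses) is the posters' assumption, not a known figure:

```python
cu_area_n21 = 2.05   # mm^2 per CU on N21 (N7), estimated above
uncore_n21 = 110.0   # mm^2 for CP+L2+ACE etc. on N21, estimated above

# 96 CUs grown 30%, uncore grown 20%, still at N7 density.
n7_equiv = 96 * cu_area_n21 * 1.3 + uncore_n21 * 1.2
print(f"96 CU at N7-equivalent area: {n7_equiv:.0f} mm^2")  # ~388 mm^2

for label, scale in [("~60% shrink factor", 0.60), ("2x scaling", 0.50)]:
    print(f"  on N5 at {label}: {n7_equiv * scale:.0f} mm^2")
# ~233 mm^2 and ~194 mm^2, matching the ~232 / ~200 figures above
```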
 

beginner99

Diamond Member
Jun 2, 2009
Based on Sam's statements, it seems chiplets are not only for memory/cache. So does that mean multiple GCDs are back on the table?

No. It's just typical PR speak, saying something so vague it doesn't mean anything. "Graphics-specific way": what is that supposed to mean? Exactly nothing. If you want to interpret it, you could also say it's too hard to separate the compute dies in graphics, so only everything else will be on separate chips. Honestly, it makes a lot of sense if AMD went the evolutionary way and not the revolutionary one, also for risk management. First step: take the non-compute stuff out into separate die(s) on a cheaper process. That seems at least somewhat straightforward compared to 2 or more compute dies.
 

itsmydamnation

Platinum Member
Feb 6, 2011
I place my bets on Bondrewd from Beyond3D, from before he stopped talking about this stuff. He was bang on about RDNA2, MI200, Trento, etc. He took so much crap because he only answers in one-liners, but when everything turned out 90-100% as he said, to the letter, all the NV shills started to leave him alone.

He was very much adamant about two compute tiles and 2.5x to 3x performance, and that AMD are going to charge so much it's all irrelevant anyway.

edit: lines like this seem especially interesting now with all the leaks around RDNA 2 vs 3 WGPs:

120 per GCD, but they're 30WGP and you should count them as such.
 

DiogoDX

Senior member
Oct 11, 2012
I place my bets on Bondrewd from Beyond3D, from before he stopped talking about this stuff. He was bang on about RDNA2, MI200, Trento, etc. He took so much crap because he only answers in one-liners, but when everything turned out 90-100% as he said, to the letter, all the NV shills started to leave him alone.

He was very much adamant about two compute tiles and 2.5x to 3x performance, and that AMD are going to charge so much it's all irrelevant anyway.

edit: lines like this seem especially interesting now with all the leaks around RDNA 2 vs 3 WGPs:
I think that was him; people (me included) thought he was crazy for saying that RDNA2 would reach 2.5 GHz overclocked.

Even if the 2-GCD GPU exists, maybe it's just for the pro market, and that wouldn't make all the other leakers wrong.
 

Frenetic Pony

Senior member
May 1, 2012
N21 = 80 CU & N31 = 96 CU? Seems wrong if everything else is assumed correct.

A 96 CU N31 on N7, using the inflated areas above to compare against N21, comes to roughly 388 mm² (96 × 2.05 × 1.3 + 110 × 1.2).

Surely this design on N5 would shrink to at most ~60% of that, ~232 mm², or ~200 mm² at 2x scaling.

So, definitely room for much more than 96 CUs. The WGPs are supposed to be 2x the previous ones. Maybe people are mixing up CU and WGP numbers, especially if N7-to-N5 scaling is closer to 2x.

With 2 of those you're reaching max bandwidth and power capabilities. Without being able to choose between 1 or 2 compute chiplets, there's no way you can cover the market from >=$1600 all the way down to $199 with just three designs.