[Rumor, Tweaktown] AMD to launch next-gen Navi graphics cards at E3


maddie

Diamond Member
Jul 18, 2010
5,204
5,613
136
And how will the game see it as one GPU? ;)
How does a game see the individual transistors in the GPU as one unit? You act as if signals don't route from one part of the die to another. Do you really think it matters, from an operational standpoint, whether the circuits are close together or far apart? The real problems are power consumption and lower clocks.

The ability of the command processor to assign work to the various CUs is, I think, key.

So, speculating here, we could have this solution: a shared command processor for the various GPU processing chiplets, to maintain cohesion versus a traditional CF solution. 3D stacking is an obvious way for the chiplets to communicate with the command processor, but we all know the thermal barriers to using it. From a signaling standpoint, 3D stacking would be more efficient than planar GPUs.
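A minimal sketch (Python, all names hypothetical, nothing like AMD's actual command processor) of what that idea looks like conceptually: one shared submission queue in front, CUs spread across several chiplets behind it, so whatever submits work never sees a chiplet boundary.

```python
from collections import deque

# Toy model: one shared command processor feeding CUs that happen to live on
# different chiplets. All names are illustrative, not real hardware interfaces.

class ComputeUnit:
    def __init__(self, chiplet_id, cu_id):
        self.chiplet_id = chiplet_id
        self.cu_id = cu_id
        self.completed = []

    def execute(self, wavefront):
        # Pretend to run the wavefront; just record where it ran.
        self.completed.append(wavefront)

class SharedCommandProcessor:
    """Single front end: work is submitted here and never sees chiplets."""
    def __init__(self, chiplets, cus_per_chiplet):
        self.cus = [ComputeUnit(c, u)
                    for c in range(chiplets) for u in range(cus_per_chiplet)]
        self.queue = deque()
        self.rr = 0  # naive round-robin pointer

    def submit(self, wavefronts):
        self.queue.extend(wavefronts)

    def dispatch(self):
        while self.queue:
            cu = self.cus[self.rr % len(self.cus)]
            cu.execute(self.queue.popleft())
            self.rr += 1

gpu = SharedCommandProcessor(chiplets=4, cus_per_chiplet=10)
gpu.submit([f"wave{i}" for i in range(1000)])
gpu.dispatch()

per_chiplet = {}
for cu in gpu.cus:
    per_chiplet[cu.chiplet_id] = per_chiplet.get(cu.chiplet_id, 0) + len(cu.completed)
print(per_chiplet)  # work spread across chiplets from one submission point
```

Of course this ignores exactly the hard part: the cost of moving wavefronts and data over whatever link sits between the command processor and the chiplets.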
 

beginner99

Diamond Member
Jun 2, 2009
5,320
1,768
136
And how will the game see it as one GPU? ;)

I'm gonna argue that's a rather simple problem. The real problems are performance/latency across the GPU chiplets and consistency of performance. For compute/deep learning it doesn't matter if there are performance spikes, but in graphics/gaming that is a no-go.
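A quick illustration of that point with made-up numbers: two GPUs can have identical average throughput while one of them is unusable for gaming because of frame-time spikes, which is a metric compute workloads never look at.

```python
# Synthetic traces: two GPUs with the same average throughput, but one has
# latency spikes (say, from cross-chiplet traffic). Compute cares about the
# average; gaming cares about the worst frames.

frames_smooth = [16.7] * 100               # ms per frame, steady
frames_spiky  = [14.0] * 90 + [41.0] * 10  # same mean (16.7 ms), 10% spikes

def avg_fps(frame_times_ms):
    return 1000.0 / (sum(frame_times_ms) / len(frame_times_ms))

def p99(frame_times_ms):
    s = sorted(frame_times_ms)
    return s[int(0.99 * (len(s) - 1))]

for name, trace in [("smooth", frames_smooth), ("spiky", frames_spiky)]:
    print(f"{name}: {avg_fps(trace):.1f} avg FPS, "
          f"{p99(trace):.1f} ms 99th-percentile frame time")
```

Both print roughly 60 average FPS, but the spiky one stalls for 41 ms on its worst frames.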
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
And how will the game see it as one GPU? ;)
As I understand it, and I'm by no means an expert, getting games to see it as one GPU is not hard, though it could have its own difficulties.

The problem we're trying to tackle is how to increase GPU performance, and as I understand it there are three currently feasible "physical" ways: increase die size, shrink the process, or go to a chiplet design to increase yields and decrease cost. Maybe someone could add other possible routes; obviously improving the overall microarchitecture can play a huge role too.

The limits, as best I can tell, are fairly well known. Process shrinks will only be an answer for a few more cycles on current electronics; perhaps photonics or quantum is the next route, but that's going to be expensive. You could also just wait until process and silicon costs drop and increase die size, but there are a lot of heat and topology issues which mean each additional transistor doesn't add linearly to the power of the processor (as you add more transistors at the edges, communication between one edge and the other takes longer, which is exacerbated on a larger process, and heat dissipation gets harder). A much harder answer, of course, is a chiplet design, because of the importance of latency and bandwidth and the lack of a current robust communication link that works at low latency over longer distances (i.e. inter-chiplet).

For GPU chiplets, unlike the Zen 2 chiplet design, they'd almost certainly require a direct chiplet-to-chiplet communication link (as best I can tell), because the latency from the extra distance of two hops over a chiplet-I/O-chiplet path would be unacceptable in GPUs. I think it's already been theorized that they'd need active interposers or some other more robust link directly between chiplets.

So there might be two possible routes for chiplets to become feasible for GPUs: 1) someone solves the inter-chiplet latency/bandwidth problem in a cost-effective and robust way, making it feasible to use chiplets; or 2) some new development makes SLI/Crossfire-style work delegation between chiplets more feasible. Rather than relying on drivers/software, perhaps a hardware scheduler-like solution would be more efficient.

Another consideration is to decrease the die size of the execution section. 70-75% of the die of current GPUs is the actual execution hardware (unlike CPUs, which devote a smaller share of the die to execution), but moving the 25-30% front end off-die (with a 5-10% die-size tax for communication between the front end and the execution units) might still allow reasonable performance gains and some cost reduction. Latency, again, is a big issue once the scheduler and other front-end work are done from afar.
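Back-of-the-envelope arithmetic for that front-end split, using the percentages from the paragraph above (the 250 mm^2 baseline is made up):

```python
# Rough arithmetic for the front-end-off-die idea, using the percentages from
# the post above. Purely illustrative numbers.

total_die      = 250.0                 # mm^2, hypothetical monolithic GPU
exec_fraction  = 0.72                  # ~70-75% of die is execution hardware
front_fraction = 1.0 - exec_fraction   # ~25-30% front end
link_tax       = 0.075                 # ~5-10% extra area for the off-die link

exec_die  = total_die * exec_fraction * (1 + link_tax)   # execution chiplet
front_die = total_die * front_fraction * (1 + link_tax)  # separate front-end die

print(f"monolithic: {total_die:.0f} mm^2")
print(f"execution chiplet: {exec_die:.0f} mm^2, front-end die: {front_die:.0f} mm^2")
# Smaller individual dies yield better, but total silicon goes up (~269 mm^2
# here) and the front end now talks to the shaders over a slower off-die link.
```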

I'd like to hear more from those with more knowledge of GPU design; the little bit I've read about it is really interesting.
 

Glo.

Diamond Member
Apr 25, 2015
5,930
4,991
136
Read this post as a whole.

How does a game see the individual transistors in the GPU as one unit? You act as if signals don't route from one part of the die to another. Do you really think it matters, from an operational standpoint, whether the circuits are close together or far apart? The real problems are power consumption and lower clocks.

The ability of the command processor to assign work to the various CUs is, I think, key.

So, speculating here, we could have this solution: a shared command processor for the various GPU processing chiplets, to maintain cohesion versus a traditional CF solution. 3D stacking is an obvious way for the chiplets to communicate with the command processor, but we all know the thermal barriers to using it. From a signaling standpoint, 3D stacking would be more efficient than planar GPUs.
What I said, and what you responded to, is perfectly summed up by Beginner99's post:
I'm gonna argue that's a rather simple problem. The real problems are performance/latency across the GPU chiplets and consistency of performance. For compute/deep learning it doesn't matter if there are performance spikes, but in graphics/gaming that is a no-go.

This is the problem GPU chiplets face, and it is why I say that in order to diminish it you have to make chiplet-based GPUs look like a single one. For compute there is no problem; apps simply see ALUs. But for rasterization, and from this moment on RAY TRACING(!), it's not as easy.

In theory, Split Frame Rendering could be an answer. But it has its own problems, and it does not solve the problems of a chiplet-based gaming GPU.

Chiplets are the future for gaming, but the very far future.
 

Stuka87

Diamond Member
Dec 10, 2010
6,240
2,559
136
A GPU is already made up of thousands of cores. Adding more cores, even if they are broken out into chiplets, has no impact on games. SLI/CF had issues with games because there were two physical devices that the OS had to call down to. With a bunch of chiplets on a single card, the OS has no clue that it's multiple chiplets instead of one giant chip.
 

Glo.

Diamond Member
Apr 25, 2015
5,930
4,991
136
A GPU is already made up of thousands of cores. Adding more cores, even if they are broken out into chiplets, has no impact on games. SLI/CF had issues with games because there were two physical devices that the OS had to call down to. With a bunch of chiplets on a single card, the OS has no clue that it's multiple chiplets instead of one giant chip.
That's not how GPU chiplets would work. Do not think for a second that you can make chiplets that only accommodate Shader Engines or ALU clusters. That's not how it works.
 

maddie

Diamond Member
Jul 18, 2010
5,204
5,613
136
That's not how GPU chiplets would work. Do not think for a second that you can make chiplets that only accommodate Shader Engines or ALU clusters. That's not how it works.
When the first computers were designed out of discrete transistors, capacitors, etc., do you think they were not a full computer as seen by the software?

When we then advanced to integrated circuits, but still with separate components such as cache, ALUs, FPUs, etc., do you think they were not a full computer as seen by the software?

When we then advanced to fully integrated SoCs, is that the point where you first see them as a full computer?

The fact that circuitry is not integrated does not mean that it cannot function as a unified whole from the software angle. You are mistaking inefficient operation for inability to function.

Remember when the L2 cache was outside the CPU? What horror, how could it ever work?
 

Stuka87

Diamond Member
Dec 10, 2010
6,240
2,559
136
That's not how GPU chiplets would work. Do not think for a second that you can make chiplets that only accommodate Shader Engines or ALU clusters. That's not how it works.

You misunderstood what I said. SLI/CF were an issue because of software. The OS saw two discrete devices, and it required a special layer of software to divide up the work.

When it comes to multiple chiplets on a single discrete device, the software (i.e. the OS and drivers) has no clue what is going on inside that device. All it knows is that it sends work and gets a response back. There could be one big unified GPU, or a whole bunch of chiplets that divide up the work. The OS won't care.
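A minimal sketch (Python, hypothetical interfaces, not any real driver model) of that abstraction: the OS/game only ever holds a single device handle, and how work is split across chiplets is an internal policy of that device.

```python
# Sketch of the "single device" argument: the driver exposes one device object;
# whether work runs on one die or several chiplets is hidden behind it.
# All classes and methods here are invented for illustration.

class Chiplet:
    def __init__(self, cid):
        self.cid = cid

    def render(self, draw_call):
        return f"chiplet{self.cid}:{draw_call}"

class GpuDevice:
    """What the OS / game enumerates: exactly one device."""
    def __init__(self, n_chiplets):
        self._chiplets = [Chiplet(i) for i in range(n_chiplets)]

    def submit(self, draw_calls):
        # Internal distribution policy; the caller never sees it.
        results = []
        for i, call in enumerate(draw_calls):
            results.append(self._chiplets[i % len(self._chiplets)].render(call))
        return results

device = GpuDevice(n_chiplets=4)   # could just as well be 1, the caller can't tell
print(device.submit(["sky", "terrain", "characters", "ui"]))
```

The counter-argument made later in the thread is that hiding the split behind submit() does not remove the cross-chiplet latency and load-balancing problems; it only moves them below the API.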
 

Glo.

Diamond Member
Apr 25, 2015
5,930
4,991
136
You misunderstood what I said. SLI/CF were an issue because of software. The OS saw two discrete devices, and it required a special layer of software to divide up the work.
In the current state of things, a similar issue will happen with chiplet GPUs.
 

Glo.

Diamond Member
Apr 25, 2015
5,930
4,991
136
When the first computers were designed out of discrete transistors, capacitors, etc., do you think they were not a full computer as seen by the software?

When we then advanced to integrated circuits, but still with separate components such as cache, ALUs, FPUs, etc., do you think they were not a full computer as seen by the software?

When we then advanced to fully integrated SoCs, is that the point where you first see them as a full computer?

The fact that circuitry is not integrated does not mean that it cannot function as a unified whole from the software angle. You are mistaking inefficient operation for inability to function.

Remember when the L2 cache was outside the CPU? What horror, how could it ever work?
Have you read anything that David Wang has said about MCM modules?

Like here: https://www.pcgamesn.com/amd-navi-monolithic-gpu-design

Chiplets are not happening for a long time. It's an issue of both hardware AND software.
 
Mar 11, 2004
23,444
5,852
146
More recently, an AMD person said the I/O chiplet of Zen 2 lets them bypass NUMA issues because it makes the CPU look like a single monolithic one to the OS. If that can be done for GPUs as well, that would seemingly remove a big roadblock.

As for devs not wanting to put in the work to support mGPU: what if the next consoles do chiplets, so they have to do it anyway? (We got lots of dev bellyaching about multicore on the 360/PS3, but they still did it, and by the time the next consoles came out they were actually so happy to have x86 that they didn't really complain and had already done a lot for multi-core support; now it's not an issue at all and they're outright eager for even more cores/threads in the next-gen consoles.) I'm not talking about the likelihood of CPU and GPU chiplets with an I/O die; I'm talking about multiple GPU chiplets.

I'd guess the next Xbox is at least double the TF of the One X, which would put it around 12 TF, about 20% higher than Navi 10. I highly doubt they put in a much bigger single GPU die. Plus, the 5700 XT is clocked quite high (and thus uses a lot of power and puts out quite a bit of heat) to get its 9.75 TF, so a lower-clocked GPU going into the consoles will need to be even bigger than just ~20% over Navi 10 to offer ~20% better performance; it will likely need to be more like 33-50% larger, since it will probably run 10-33% lower clocks than the 5700 XT. There have been some rumors suggesting the Xbox is closer to 15 TF, which means we're potentially talking about something quite a lot bigger (nearly double the size, in order to offer that performance at the clock speeds a console can support).
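Working through that TFLOPS arithmetic (2 FLOPs per shader per clock; the 5700 XT figures are public, the console clocks below are just guesses for illustration):

```python
# TFLOPS = 2 FLOPs/clock * shader count * clock. The 5700 XT numbers are real
# (2560 shaders, ~1.9 GHz boost); the console clocks are hypothetical.

def tflops(shaders, clock_ghz):
    return 2 * shaders * clock_ghz / 1000.0

print(f"5700 XT: {tflops(2560, 1.905):.2f} TF")

target_tf = 12.0
for clock in (1.9, 1.7, 1.5):                     # hypothetical console clocks
    shaders_needed = target_tf * 1000 / (2 * clock)
    growth = shaders_needed / 2560 - 1
    print(f"{target_tf} TF at {clock} GHz -> {shaders_needed:.0f} shaders "
          f"(~{growth:.0%} more than Navi 10)")
```

At 1.5-1.7 GHz the shader count has to grow roughly 38-56% over Navi 10 to reach 12 TF, which is the basis for the "33-50% larger" guess above.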

Google is touting mGPU support with Stadia, and that's on Linux to boot, and Microsoft will almost certainly be trying to sort out mGPU game rendering to get the best utilization and performance out of its cloud gaming efforts.
 

Glo.

Diamond Member
Apr 25, 2015
5,930
4,991
136
More recently, an AMD person said the I/O chiplet of Zen 2 lets them bypass NUMA issues because it makes the CPU look like a single monolithic one to the OS. If that can be done for GPUs as well, that would seemingly remove a big roadblock.

As for devs not wanting to put in the work to support mGPU: what if the next consoles do chiplets, so they have to do it anyway? (We got lots of dev bellyaching about multicore on the 360/PS3, but they still did it, and by the time the next consoles came out they were actually so happy to have x86 that they didn't really complain and had already done a lot for multi-core support; now it's not an issue at all and they're outright eager for even more cores/threads in the next-gen consoles.) I'm not talking about the likelihood of CPU and GPU chiplets with an I/O die; I'm talking about multiple GPU chiplets.

I'd guess the next Xbox is at least double the TF of the One X, which would put it around 12 TF, about 20% higher than Navi 10. I highly doubt they put in a much bigger single GPU die. Plus, the 5700 XT is clocked quite high (and thus uses a lot of power and puts out quite a bit of heat) to get its 9.75 TF, so a lower-clocked GPU going into the consoles will need to be even bigger than just ~20% over Navi 10 to offer ~20% better performance; it will likely need to be more like 33-50% larger, since it will probably run 10-33% lower clocks than the 5700 XT. There have been some rumors suggesting the Xbox is closer to 15 TF, which means we're potentially talking about something quite a lot bigger (nearly double the size, in order to offer that performance at the clock speeds a console can support).

Google is touting mGPU support with Stadia, and that's on Linux to boot, and Microsoft will almost certainly be trying to sort out mGPU game rendering to get the best utilization and performance out of its cloud gaming efforts.
Nobody is questioning the doability of this for compute, which is the CPU's domain.

But everybody should be questioning GRAPHICS, and the possibility of making separate chiplets look like a single GPU to the game. This is the issue every company is facing.

You can make MCM/multi-GPU compute work perfectly well; it has been done for ages. You cannot, right now, make MCM/multi-GPU work for graphics. There is a reason why companies are dropping multi-GPU support.
 

Stuka87

Diamond Member
Dec 10, 2010
6,240
2,559
136
Nobody is questioning the doability of this for compute, which is the CPU's domain.

But everybody should be questioning GRAPHICS, and the possibility of making separate chiplets look like a single GPU to the game. This is the issue every company is facing.

You can make MCM/multi-GPU compute work perfectly well; it has been done for ages. You cannot, right now, make MCM/multi-GPU work for graphics. There is a reason why companies are dropping multi-GPU support.

Again, this is because in the past each GPU has been its own device. In the case of multiple chiplets on a SINGLE DEVICE, the OS only ever interacts with that one device. The game has no clue that the single device has multiple chiplets on it.
 

maddie

Diamond Member
Jul 18, 2010
5,204
5,613
136
Again, this is because in the past each GPU has been its own device. In the case of multiple chiplets on a SINGLE DEVICE, the OS only ever interacts with that one device. The game has no clue that the single device has multiple chiplets on it.
Key here. Multi-chiplets, not multi-GPU.
 

maddie

Diamond Member
Jul 18, 2010
5,204
5,613
136
Have you read anything that David Wang has said about MCM modules?

Like here: https://www.pcgamesn.com/amd-navi-monolithic-gpu-design

Chiplets are not happening for a long time. It's an issue of both hardware AND software.
Not saying it's happening next gen or very soon.

Wang:
"So, is it possible to make an MCM design invisible to a game developer so they can address it as a single GPU without expensive recoding?

“Anything’s possible…” says Wang."

Anyhow, enough of this, as no hard proof can be offered. Time will tell.
 

Glo.

Diamond Member
Apr 25, 2015
5,930
4,991
136
Again, this is because in the past each GPU has been its own device. In the case of multiple chiplets on a SINGLE DEVICE, the OS only ever interacts with that one device. The game has no clue that the single device has multiple chiplets on it.
What makes you believe you can make chiplets that contain only Shader Engines and are connected to a separate I/O die, and that this will make it look like one device?

For gaming you cannot make this happen. For compute, possibly. But there is a more realistic way of achieving chiplet scalability.
Key here. Multi-chiplets, not multi-GPU.
With the current state of hardware and software, each chiplet will be a separate GPU in gaming scenarios.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
When it comes to multiple chiplets on a single discrete device, the software (i.e. the OS and drivers) has no clue what is going on inside that device. All it knows is that it sends work and gets a response back. There could be one big unified GPU, or a whole bunch of chiplets that divide up the work. The OS won't care.

That's not trivial at all.

Even with CPUs, OS support is needed to get the full performance out of Ryzen CPUs, because communicating off-die is slow. Basically, the OS patches are saying "try to ignore the off-die portion as much as possible".
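As a concrete, Linux-only illustration of what "ignore the off-die portion" looks like in practice: a process can be pinned to the cores of one die so its threads never pay the cross-die latency. The core IDs below are made up.

```python
import os

# Linux-only illustration of "keep the work on one die": pin this process to a
# set of cores that (hypothetically) share a die/CCX, so its threads don't
# bounce across the slower off-die link. The core IDs are invented; real
# topology would come from /sys or hwloc, and Windows uses different APIs.

SAME_DIE_CORES = {0, 1, 2, 3}   # hypothetical: cores 0-3 live on one die

if hasattr(os, "sched_setaffinity"):
    allowed = SAME_DIE_CORES & os.sched_getaffinity(0)
    os.sched_setaffinity(0, allowed or os.sched_getaffinity(0))
    print("now pinned to cores:", sorted(os.sched_getaffinity(0)))
else:
    print("sched_setaffinity is not available on this platform")
```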

You also said GPUs have thousands of cores. Does that somehow make mGPU a solved problem? Why doesn't SLI/XFire work properly, then? That's Glo's point.

GPUs have "thousands of cores" in the same way a 16-core Ryzen has "64 cores" if you count each ALU as a core. The difference may seem irrelevant, but it's not.


Going from current GPUs to mGPU is like going from making CPUs wider (scalar, which benefits EVERYTHING) to multi-threading (which only helps when software supports it). The latency and bandwidth requirements are extreme when you want to make a "chiplet" GPU work like a single one.

Remember "Reverse Hyperthreading" rumor a few years ago? Where a multi-core chip can work as a super wide scalar one? Well, Intel did have a research paper about it, but its just that. Research. Who knows whether it'll bring real benefits?

Chiplets are a compromise in an era where process scaling is being seriously challenged. Don't make them a fad.
 

maddie

Diamond Member
Jul 18, 2010
5,204
5,613
136
That's not trivial at all.

Even with CPUs, OS support is needed to get the full performance out of Ryzen CPUs, because communicating off-die is slow. Basically, the OS patches are saying "try to ignore the off-die portion as much as possible".

You also said GPUs have thousands of cores. Does that somehow make mGPU a solved problem? Why doesn't SLI/XFire work properly, then? That's Glo's point.

GPUs have "thousands of cores" in the same way a 16-core Ryzen has "64 cores" if you count each ALU as a core. The difference may seem irrelevant, but it's not.

Going from current GPUs to mGPU is like going from making CPUs wider (scalar, which benefits EVERYTHING) to multi-threading (which only helps when software supports it). The latency and bandwidth requirements are extreme when you want to make a "chiplet" GPU work like a single one.

Remember the "Reverse Hyperthreading" rumor from a few years ago, where a multi-core chip could work as a super-wide scalar one? Well, Intel did have a research paper about it, but it's just that: research. Who knows whether it will bring real benefits?

Chiplets are a compromise in an era where process scaling is being seriously challenged. Don't make them a fad.
Concerning "reverse hyperthreading": Intel very quietly bought the company doing the research, Soft Machines, in 2016. Previously they were funded by Mubadala, AMD, and others. They did show gains in IPC, but everything has gone dark since. I do believe it has promise.

With regards to Glo's point:
It's chiplets, not mGPU, and I notice no one replied to this post of mine:

"When the 1st computers were designed out of discrete transistors, capacitors, etc, do you think they were not a full computer when seen by the software?

When we then advanced to integrated circuitry but still with separate components such as cache, ALUs, FPUs, etc, do you think they were not a full computer when seen by the software?

When we then advanced to full integrated SOCs, is this the point where you first see them as a full computer?

The fact that circuitry is not integrated does not mean that it cannot function as a unified whole from the software angle. You are mistaking inefficient operation with inability to function.

Remember when L2 cache was outside of the CPU? What horror, how could it work."
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Concerning "reverse hyperthreading": Intel very quietly bought the company doing the research, Soft Machines, in 2016.

I'm not talking about Soft Machines. Intel had their own project at some point before that. All of the large chip companies have likely researched such solutions at some point.

Most research projects show very nice gains, but then don't pan out in the real world, or turn out to be impractical. Remember the spinning-heatsink cooler based on research by Sandia Labs? It was adopted by Cooler Master, but it wasn't impressive at all. They also had to give up on it floating completely above the die and add metal supports, I believe, all likely for practical reasons (like cost and ease of manufacturability).

Actually, I remember Intel showing the concept once (probably at IDF), and later a research paper that took a 4-issue CPU and split it into a 2x 2-issue dual-core chip. The concept is from a really long time ago, around the Core 2 days.

Hmm, now that I think about it, they might have had a total of three research projects based on it.

The Tri-Gate 22nm transistor (known today by the common name FinFET) was in development for 10 years or so, and that's not even a long time. Current technologies are based on research and concepts from the '60s, '70s, and '80s.
 

Olikan

Platinum Member
Sep 23, 2011
2,023
275
126
Just to point out: looking at the Navi diagrams, each Navi "cluster" seems very independent of the others; each cluster even looks to have its own memory controller.
 

Stuka87

Diamond Member
Dec 10, 2010
6,240
2,559
136
That's not trivial at all.

Even with CPUs, OS support is needed to get the full performance out of Ryzen CPUs, because communicating off-die is slow. Basically, the OS patches are saying "try to ignore the off-die portion as much as possible".

You also said GPUs have thousands of cores. Does that somehow make mGPU a solved problem? Why doesn't SLI/XFire work properly, then? That's Glo's point.

GPUs have "thousands of cores" in the same way a 16-core Ryzen has "64 cores" if you count each ALU as a core. The difference may seem irrelevant, but it's not.

Going from current GPUs to mGPU is like going from making CPUs wider (scalar, which benefits EVERYTHING) to multi-threading (which only helps when software supports it). The latency and bandwidth requirements are extreme when you want to make a "chiplet" GPU work like a single one.

Remember the "Reverse Hyperthreading" rumor from a few years ago, where a multi-core chip could work as a super-wide scalar one? Well, Intel did have a research paper about it, but it's just that: research. Who knows whether it will bring real benefits?

Chiplets are a compromise in an era where process scaling is being seriously challenged. Don't make them a fad.

Because with SLI/CF, the OS has to interact with multiple devices. The OS's hardware scheduler has to divide up the work, and in order for this to work properly, the game/drivers have to know which work to split up.

With multicore CPUs, each core shows up as its own device to the hardware scheduler. This is why the OS needs to be able to handle multiple threads, and why it has needed patches over the years as core counts have grown.

From the perspective of the OS, the chiplet device should show up as a single device. Hardware scheduling between the chiplets should be handled by the card, not the OS. The OS has no reason to know how the work is being done, or how many chips are handling it, as there is a single path of input and output.
 

Glo.

Diamond Member
Apr 25, 2015
5,930
4,991
136
Because with SLI/CF, the OS has to interact with multiple devices. The OS's hardware scheduler has to divide up the work, and in order for this to work properly, the game/drivers have to know which work to split up.

With multicore CPUs, each core shows up as its own device to the hardware scheduler. This is why the OS needs to be able to handle multiple threads, and why it has needed patches over the years as core counts have grown.

From the perspective of the OS, the chiplet device should show up as a single device. Hardware scheduling between the chiplets should be handled by the card, not the OS. The OS has no reason to know how the work is being done, or how many chips are handling it, as there is a single path of input and output.
With regards to Glo's point:
It's chiplets, not mGPU, and I notice no one replied to this post of mine:

"When the first computers were designed out of discrete transistors, capacitors, etc., do you think they were not a full computer as seen by the software?

When we then advanced to integrated circuits, but still with separate components such as cache, ALUs, FPUs, etc., do you think they were not a full computer as seen by the software?

When we then advanced to fully integrated SoCs, is that the point where you first see them as a full computer?

The fact that circuitry is not integrated does not mean that it cannot function as a unified whole from the software angle. You are mistaking inefficient operation for inability to function.

Remember when the L2 cache was outside the CPU? What horror, how could it ever work?"
Because scenes render unevenly between the hardware sides (one side(!) of a scene can have more geometry and the other more compute), you cannot do the load balancing properly, which is why you will not get perfect scaling of the work scheduled between GPUs in a multi-GPU configuration. In the current state of software, and that includes EVERY SINGLE PART OF IT, even the OS, chiplet GPUs will be seen as a multi-GPU configuration, not one gigantic GPU, even if they are connected through fabric or any other internal link. And that is before we even get to latency, which is extremely important for graphics.
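A toy model of that load-balancing argument, assuming split-frame rendering across two chiplets that are each half as fast as one big GPU. The per-half costs are invented; the point is only that the frame always waits on the slower half.

```python
import random
from statistics import mean

# Toy model of the load-balancing problem described above for split-frame
# rendering: each chiplet gets half the screen, but geometry/compute load is
# not spread evenly across the screen, so the frame only finishes when the
# slower half does. All costs are invented for illustration.

random.seed(1)

def half_costs():
    top    = random.uniform(4, 20)    # ms on a full-speed GPU: sky, distant terrain
    bottom = random.uniform(8, 28)    # ms on a full-speed GPU: dense geometry, effects
    return top, bottom

single_gpu, split_frame = [], []
for _ in range(1000):
    top, bottom = half_costs()
    single_gpu.append(top + bottom)            # one big GPU renders everything
    split_frame.append(2 * max(top, bottom))   # two half-speed chiplets in parallel:
                                               # the frame waits on the slower half

print(f"one big GPU          : {mean(single_gpu):.1f} ms/frame")
print(f"two chiplets with SFR: {mean(split_frame):.1f} ms/frame "
      f"(perfect balancing would match the big GPU)")
```

With the same total ALU throughput, the split-frame configuration is consistently slower whenever the two halves of the scene are not equally expensive.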

Again, we are talking about graphics. For compute it is not a problem, but for graphics it is a no-go.
 