The solution to midrange vs. high-end markets... Why can't AMD make a scalable GPU?

Irenicus

Member
Jul 10, 2008
94
0
0
This applies to Nvidia as well, but I don't care about them as much.

People always talk about yields being lower on a new process, hence the cost of larger dies being much greater with a new generation. This leads companies like AMD and Nvidia to choose smaller die sizes initially while yields improve over time.



Multi-GPU often sucks; it's dependent on developer support, and even with that it often has issues.


So why can't AMD create a Megazord-like GPU? A GPU where several smaller GPU dies come together to form a single larger die and, most importantly, BEHAVE like a single larger die? What is the technical constraint there?


Is it the proximity to the memory? The inability to properly share the same data across the different smaller GPU pieces at the same time?

But can't there be a common pool of memory stuck somewhere on the board, with optical interconnects to the individual GPU components? I thought that was the whole point of all that talk about The Machine HP was building and later backed away from?

Would optical interconnects between the different GPU dies be insufficient to provide enough bandwidth and speed to work as a single unit? What else is the constraint here?

There must be an answer. Or is this the kind of thing AMD means on that Navi slide when talking about a more scalable GPU?


If AMD had this tech, there would be no need to choose between targeting the midrange and the high end; they could create smaller dies that are less expensive due to the way wafer defects affect the economics of chip production, and scale them up to whatever they wanted.
 

Bacon1

Diamond Member
Feb 14, 2016
3,430
1,018
91
GPUs are basically already like that. Unlike CPUs, which have 2-8 cores, GPUs have thousands. What makes a big GPU big is having 2-3x the number of "cores" of the smaller GPUs.

Think of a 2-core CPU as a 1,000-core GPU and an 8-core CPU as a 4,000-core GPU.

The reason mGPU doesn't work well is that the GPUs have to pass each other information about what work each is going to do and what work the other has done (and often share texture data and everything else).
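
As a toy illustration of why that overhead hurts scaling (all numbers here are made up): the compute work divides across GPUs, but the per-frame cost of exchanging work assignments and shared data does not.

```python
# Toy scaling model: compute splits across GPUs, but the per-frame cost of
# synchronizing and sharing data between them is paid regardless of GPU count.
COMPUTE_MS = 16.0  # hypothetical per-frame compute time on a single GPU
SYNC_MS = 4.0      # hypothetical per-frame cost of exchanging work/texture data

for n_gpus in (1, 2, 4):
    frame_ms = COMPUTE_MS / n_gpus + (SYNC_MS if n_gpus > 1 else 0.0)
    print(f"{n_gpus} GPU(s): {frame_ms:5.1f} ms/frame, {COMPUTE_MS / frame_ms:.2f}x speedup")
```

Under those made-up numbers, four GPUs only get you a 2x speedup.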
 

Fallen Kell

Diamond Member
Oct 9, 1999
6,009
417
126
Most GPUs have a small portion of the die that is designed to tolerate a manufacturing defect, so the chip can be binned as the next card down in the performance/market segment. But no one does this for the entire card. The constraint is the product design, testing, and cost associated with it. While in theory it sounds all well and good to sell larger dies that had more of their portions fail during manufacturing as a smaller/less powerful product line, the reality is that the manufacturing process eventually matures and you end up selling fully working dies with a portion intentionally disabled, at a lower price than you could get for that exact same part. Your fixed costs on that piece are still the same, and you are now intentionally saying, "I am going to sell this for less money than I could otherwise get for the product," to fill a specific market segment.

It also costs a lot more money to design such a product, money that AMD quite frankly does not have. If you haven't looked at their financials, they have been bleeding cash for the last 5 years!
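
To put some rough numbers behind the yield side of this, here is a minimal sketch using the textbook Poisson yield model; the die sizes, defect density, and wafer cost below are made-up round numbers, only the 300 mm wafer diameter is real.

```python
import math

def dies_per_wafer(die_area_mm2, wafer_diameter_mm=300):
    """Standard approximation: gross candidate dies minus edge loss."""
    r = wafer_diameter_mm / 2
    return (math.pi * r ** 2 / die_area_mm2
            - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))

def yield_poisson(die_area_mm2, defect_density_per_mm2):
    """Poisson yield model: fraction of dies that catch zero defects."""
    return math.exp(-die_area_mm2 * defect_density_per_mm2)

WAFER_COST = 7000        # hypothetical wafer cost in dollars
DEFECT_DENSITY = 0.002   # hypothetical defects per mm^2 (0.2 per cm^2)

for area in (150, 600):  # illustrative small die vs. big die
    gross = dies_per_wafer(area)
    good = gross * yield_poisson(area, DEFECT_DENSITY)
    print(f"{area} mm^2: {gross:.0f} gross dies, {good:.0f} good dies, "
          f"${WAFER_COST / good:.0f} per good die")
```

Under those assumptions the big die costs roughly ten times as much per good unit, which is why harvesting partly defective big dies is attractive early on, and why it stops making sense once yields mature.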
 
Last edited:

richaron

Golden Member
Mar 27, 2012
1,357
329
136
AMD has been working on similar tech for ages, and is years ahead of the competition.

The answer is Navi.
 

Glo.

Diamond Member
Apr 25, 2015
5,658
4,417
136
Everything you are talking about is the HSA 2.0 foundation principle. Nothing new. Next-generation GPUs from AMD will be designed with this in mind from the start.

The problems to overcome are at the software level. That is the biggest hurdle here.

This is the last node where we will see 600 mm² die sizes from either vendor (Nvidia or AMD). Everything will simply migrate to smaller die sizes, like 80 mm² and below. So the high end in the future will not be one single big 250 W GPU, but two 350 mm² dies combined on a single PCB with 300 W of TDP total. The high-end market becomes the enthusiast market, the mainstream market becomes high-end, and the low-end market becomes mainstream. At least that is the picture from a production cost perspective.
 

tonyfreak215

Senior member
Nov 21, 2008
274
0
76
AMD has been working on similar tech for ages, and is years ahead of the competition.

The answer is Navi.

Everything you are talking about is the HSA 2.0 foundation principle. Nothing new. Next-generation GPUs from AMD will be designed with this in mind from the start.

The problems to overcome are at the software level. That is the biggest hurdle here.

I believe this is part of AMD's long-term strategy with DirectX 12.

AdoredTV did a great video on it.
https://www.youtube.com/watch?v=aSYBO1BrB1I
 

maddie

Diamond Member
Jul 18, 2010
4,723
4,628
136
This applies to Nvidia as well, but I don't care about them as much.

People always talk about yields being lower on a new process, hence the cost of larger dies being much greater with a new generation. This leads companies like AMD and Nvidia to choose smaller die sizes initially while yields improve over time.

Multi-GPU often sucks; it's dependent on developer support, and even with that it often has issues.

So why can't AMD create a Megazord-like GPU? A GPU where several smaller GPU dies come together to form a single larger die and, most importantly, BEHAVE like a single larger die? What is the technical constraint there?

Is it the proximity to the memory? The inability to properly share the same data across the different smaller GPU pieces at the same time?

But can't there be a common pool of memory stuck somewhere on the board, with optical interconnects to the individual GPU components? I thought that was the whole point of all that talk about The Machine HP was building and later backed away from?

Would optical interconnects between the different GPU dies be insufficient to provide enough bandwidth and speed to work as a single unit? What else is the constraint here?

There must be an answer. Or is this the kind of thing AMD means on that Navi slide when talking about a more scalable GPU?

If AMD had this tech, there would be no need to choose between targeting the midrange and the high end; they could create smaller dies that are less expensive due to the way wafer defects affect the economics of chip production, and scale them up to whatever they wanted.
There is nothing technical preventing a segmented GPU design. Just look at SoC designs: each of their component blocks once stood alone. So technical possibility is not the problem.

My view is that it needed some method of transferring the vast data volumes associated with GPUs. Now that interposers are becoming cheap enough for general use, that blocker is removed, and the new problem is the energy consumed over the increased distances the data has to travel. Energy use rises steeply as data transport distances increase.
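
A crude sketch of that effect, assuming the energy to move a bit scales with wire capacitance (roughly proportional to length) times V²; the capacitance-per-mm figure is an illustrative placeholder, not a process number:

```python
# Crude wire-energy model: energy per bit ~ C_wire * V^2, with C_wire
# proportional to wire length. The constants are illustrative only.
CAP_PER_MM_PF = 0.2  # hypothetical wire capacitance per mm, in picofarads
VOLTAGE = 1.0        # volts

def energy_per_bit_pj(length_mm, volts=VOLTAGE):
    return CAP_PER_MM_PF * length_mm * volts ** 2  # picojoules (pF * V^2)

for length_mm in (1, 10, 40):  # local hop vs. cross-die vs. off-package distance
    print(f"{length_mm:>2} mm wire: ~{energy_per_bit_pj(length_mm):.1f} pJ per bit")
```

Even in this simple model, going far off-die costs an order of magnitude more energy per bit than staying local, before any interface overhead.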

My guess is that you can downclock the entire system, make it much wider, and get overall savings. HBM, as developed by AMD, follows this philosophy. We'll get to see what Navi brings in this regard.

Here we have the two GPU companies pursuing polar opposite designs: slower and wider versus faster and narrower. It will be interesting to watch the next few years.

My view is also that this wider, slower, segmented design has been in the making for years, and Navi will be the first implementation of this new branch of GPU design. Read the research papers and patents.

AMD also appears to be leveraging the benefits of Vulkan and DX12 in multi-GPU to help with some of the practical problems of a pure hardware solution.
 

Concillian

Diamond Member
May 26, 2004
3,751
8
81
Consider that GPU silicon dedicates most, if not all, of the outer "ring" of the die to pads that connect I/O and power between the package and the silicon. Logistically, these must be on the outside; you cannot interconnect with the center of a piece of silicon, so they need to be on the outer ring of the die. You need a lot of I/O pathways for the wide memory buses that video cards use, so many that the lower limit on die size for a given memory bus width is determined by the amount of space you need for these pads/interconnects. Usually your lowest-end GPUs are close to this minimum size, maybe 10% larger, something in that range.

If you were to design a GPU with the option of using 1, 2, or 4 dies, then you couldn't have a full ring of these pads; you'd be limited to 2 sides on the smallest version, because the other sides would need the option of connecting to other dies in your multi-die setup. Since you'd have HALF the linear space for pads on the smallest GPU (2 sides instead of 4), and the area is the product of the sides, each of those 2 sides needs to be double the length to fit all the interconnects you could have fit using all 4 sides, so the minimum area of the die QUADRUPLES.

This is not at all economically feasible for what should be very obvious reasons. All of your product die sizes become HUGE.

Let's use the GTX 750 as an example: that silicon is 148 mm². I don't know how close to the minimum size this is, so let's say the actual minimum for 75 W and a 128-bit memory bus to have enough pad space is an even 100 mm², i.e. a 10 mm x 10 mm die with 4 sides of 10 mm, or 40 mm of linear space to make all your interconnects. Your design gives up 2 of those sides for die-to-die connections, so the remaining 2 sides need to be 20 mm long each... and now your smallest possible silicon size becomes 400 mm²?! That's roughly the size of a GTX 980 die, for a 128-bit memory bus. How do you think a card with the execution units and price of a GTX 980 would fare when coupled with the memory bandwidth of a GTX 750 Ti?
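
The same back-of-the-envelope math in code form (using the hypothetical 100 mm² / 40 mm-of-pad-edge minimum from above):

```python
# If a square die needs a fixed total edge length for I/O and power pads,
# halving the number of usable sides doubles the required edge length and
# quadruples the minimum die area.
REQUIRED_PAD_LENGTH_MM = 40.0  # hypothetical: 4 sides x 10 mm on a 100 mm^2 die

for usable_sides in (4, 2):
    edge_mm = REQUIRED_PAD_LENGTH_MM / usable_sides  # length each usable side must provide
    area_mm2 = edge_mm ** 2                          # area of a square die with that edge
    print(f"{usable_sides} usable sides -> {edge_mm:.0f} mm edge, {area_mm2:.0f} mm^2 minimum die")
```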

This, of course, ignores the engineering hurdles of building high-speed data paths that work when you don't cut the thing into quarters but are inactive when you do. Plus there's the wasted silicon space ($$$) you'd need to leave on your high-end GPUs to make room for the cuts that produce your low-end GPUs. Silicon is very expensive, and wasted area is a good way to put your operating expenses way above your competition's. This idea doesn't work on many levels.
 
Last edited:

dark zero

Platinum Member
Jun 2, 2015
2,655
138
106
It's a similar story for why Nvidia can't deliver lower-tier dies... the process is not mature enough for either vendor.
 

maddie

Diamond Member
Jul 18, 2010
4,723
4,628
136
Consider that GPU silicon dedicates most, if not all, of the outer "ring" of the die to pads that connect I/O and power between the package and the silicon. Logistically, these must be on the outside; you cannot interconnect with the center of a piece of silicon, so they need to be on the outer ring of the die. You need a lot of I/O pathways for the wide memory buses that video cards use, so many that the lower limit on die size for a given memory bus width is determined by the amount of space you need for these pads/interconnects. Usually your lowest-end GPUs are close to this minimum size, maybe 10% larger, something in that range.

If you were to design a GPU with the option of using 1, 2, or 4 dies, then you couldn't have a full ring of these pads; you'd be limited to 2 sides on the smallest version, because the other sides would need the option of connecting to other dies in your multi-die setup. Since you'd have HALF the linear space for pads on the smallest GPU (2 sides instead of 4), and the area is the product of the sides, each of those 2 sides needs to be double the length to fit all the interconnects you could have fit using all 4 sides, so the minimum area of the die QUADRUPLES.

This is not at all economically feasible for what should be very obvious reasons. All of your product die sizes become HUGE.

Let's use the GTX 750 as an example: that silicon is 148 mm². I don't know how close to the minimum size this is, so let's say the actual minimum for 75 W and a 128-bit memory bus to have enough pad space is an even 100 mm², i.e. a 10 mm x 10 mm die with 4 sides of 10 mm, or 40 mm of linear space to make all your interconnects. Your design gives up 2 of those sides for die-to-die connections, so the remaining 2 sides need to be 20 mm long each... and now your smallest possible silicon size becomes 400 mm²?! That's roughly the size of a GTX 980 die, for a 128-bit memory bus. How do you think a card with the execution units and price of a GTX 980 would fare when coupled with the memory bandwidth of a GTX 750 Ti?

This, of course, ignores the engineering hurdles of building high-speed data paths that work when you don't cut the thing into quarters but are inactive when you do. Plus there's the wasted silicon space ($$$) you'd need to leave on your high-end GPUs to make room for the cuts that produce your low-end GPUs. Silicon is very expensive, and wasted area is a good way to put your operating expenses way above your competition's. This idea doesn't work on many levels.
While this would be true for a traditional design, the use of interposers and the associated microbump technology allows orders-of-magnitude improvements in connection density.

You can also use the area inside the edge of the die and not be limited to the edges.

How else do you explain the use of a 4096-bit memory bus on a ~600 mm² GPU? And that is not even close to the microbump limit, which is around 400/mm², with research pushing toward even higher densities.

Xilinx has had a 4-die, ~600 mm² processor with more than 10,000 connections between each pair of adjacent sub-units for some years now.

This means that you CAN use very small dies for a distributed GPU without the old worries of pad space limitations.
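
A rough comparison of the two approaches (the 400/mm² microbump density is the figure cited above; the 100 mm² die and ~50 µm edge-pad pitch are hypothetical round numbers):

```python
# Edge-pad vs. interposer-microbump connection counts for a small square die.
DIE_AREA_MM2 = 100.0      # hypothetical small die
PAD_PITCH_MM = 0.05       # hypothetical ~50 um edge-pad pitch
MICROBUMPS_PER_MM2 = 400  # density figure cited above

edge_mm = DIE_AREA_MM2 ** 0.5
perimeter_pads = int(4 * edge_mm / PAD_PITCH_MM)          # pads limited to the die edge
area_microbumps = int(DIE_AREA_MM2 * MICROBUMPS_PER_MM2)  # bumps spread across the whole face

print(f"Edge pads:       ~{perimeter_pads}")
print(f"Area microbumps: ~{area_microbumps}")
```

Even with generous edge-pad assumptions, area microbumps give tens of thousands of connections where the perimeter gives hundreds.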

I think you're giving out false information.
 

xpea

Senior member
Feb 14, 2014
429
135
116
My view is that it needed some method of transferring the vast data volumes associated with GPUs. Now that interposers are becoming cheap enough for general use, that blocker is removed, and the new problem is the energy consumed over the increased distances the data has to travel. Energy use rises steeply as data transport distances increase.
^^ THIS
multi-GPU will never be competitive in terms of power efficiency, because calculation (FLOPs) is cheap in terms of energy but moving data is not...
 

maddie

Diamond Member
Jul 18, 2010
4,723
4,628
136
^^ THIS
multi-GPU will never be competitive in terms of power efficiency, because calculation (FLOPs) is cheap in terms of energy but moving data is not...
Don't you think that is a bit harsh? Localization of data is the key to reducing energy use, and this applies to a monolithic GPU as well as a multi-die one. As with all things, the key is to find cost-effective solutions.

At the upper limit, a traditional monolithic die appears to max out at around 600 mm². A multi-die design, however, can go past that limit easily.

For the roughest of comparisons, say a 900 mm² multi-die GPU is made to compete with a 600 mm² monolithic die. We could assume 2/3 the clock speed gives equivalent shader performance. Power would be much lower.
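
A first-order sketch of that comparison, assuming dynamic power goes roughly as area x f x V², that voltage tracks frequency near the top of the curve, and that shader throughput goes as area x clock (all of which are simplifications):

```python
# First-order model: power ~ area * f * V^2, throughput ~ area * f.
# Crude assumption: voltage scales roughly with frequency in this regime.
mono_area, mono_clock = 600.0, 1.0        # monolithic die, normalized clock
multi_area, multi_clock = 900.0, 2.0 / 3  # wider multi-die part at 2/3 clock

def throughput(area, clock):
    return area * clock

def power(area, clock):
    volts = clock  # V tracks f (crude)
    return area * clock * volts ** 2

print("throughput ratio:", throughput(multi_area, multi_clock) / throughput(mono_area, mono_clock))
print("power ratio:     ", power(multi_area, multi_clock) / power(mono_area, mono_clock))
```

Under those assumptions the 900 mm² part at 2/3 clock matches the 600 mm² part's throughput at well under half the dynamic power, which is the same wide-and-slow argument HBM makes for memory.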

Look at HBM memory: very wide, with a much lower clock rate, leading to 70%+ savings in power consumed, and this is directly related to data movement, not processing.

Finally, with through-silicon vias providing electrical pathways and a way to transport heat from the internal dies to the top, maybe we will see true 3D integration. This is by far the best way to reduce signal lengths and thus the power lost to data movement. Multi-die with 3D integration might be a solution.
 

tonyfreak215

Senior member
Nov 21, 2008
274
0
76
That is freaking awesome. If he is right about MS and Sony going multi-GPU in the next consoles, it is about to change everything in a big way.

AMD is already confirmed for the next generation of consoles. If it doesn't happen then, it will certainly happen in the generation after that. If only AMD can hold on until then.
 

Mopetar

Diamond Member
Jan 31, 2011
7,797
5,899
136
AMD is already confirmed for the next generation of consoles. If it doesn't happen then, it will certainly happen in the generation after that. If only AMD can hold on until then.

AMD will still be around in ten years even if they're still horribly unsuccessful up to that point. Without AMD around, Intel would be considered a de facto monopoly, which Intel does not want to deal with at all.

Zen should be good enough that AMD can start making money again, and eventually they'll be completely free of GlobalFoundries, which is another millstone removed from their neck.
 

tonyfreak215

Senior member
Nov 21, 2008
274
0
76
AMD will still be around in ten years even if they're still horribly unsuccessful up to that point. Without AMD around, Intel would be considered a de facto monopoly, which Intel does not want to deal with at all.

Zen should be good enough that AMD can start making money again, and eventually they'll be completely free of GlobalFoundries, which is another millstone removed from their neck.

But would Intel be a monopoly? It's been facing a lot of competition from ARM recently. Even then, it's not like Intel would give AMD cash to keep them afloat.

I'm very excited for Zen. I've always been an AMD fan, but their recent processors have been horrible.
 

Mat3

Junior Member
Jun 3, 2016
4
0
0
Rather than two identical GPUs on an interposer, what about separating different parts of one GPU?

Specifically, I mean separating the shader array from the rest of the chip. So we'd have one processor, the shader array, connected to a second processor (ROPs, memory controllers and interface, misc. logic), which is itself connected to a couple of stacks of HBM RAM. The two chips, especially the middle one that connects to both the memory and the shader chip, would be long and rectangular to maximize the perimeter available for all the interfaces.

This idea wouldn't be completely unheard of: the Xbox 360 is somewhat similar in concept. It keeps the ROPs (the biggest bandwidth user?) closest to the memory, and it lets you split a 600 mm² GPU like Fiji into two more manageable chips. The interposer should allow the shader array enough bandwidth, and you wouldn't have to worry about how to share memory.

Thoughts? Feasible? Terrible idea?
 

maddie

Diamond Member
Jul 18, 2010
4,723
4,628
136
Rather than two identical GPUs on an interposer, what about separating different parts of one GPU?

Specifically, I mean separating the shader array from the rest of the chip. So we'd have one processor, the shader array, connected to a second processor (ROPs, memory controllers and interface, misc. logic), which is itself connected to a couple of stacks of HBM RAM. The two chips, especially the middle one that connects to both the memory and the shader chip, would be long and rectangular to maximize the perimeter available for all the interfaces.

This idea wouldn't be completely unheard of: the Xbox 360 is somewhat similar in concept. It keeps the ROPs (the biggest bandwidth user?) closest to the memory, and it lets you split a 600 mm² GPU like Fiji into two more manageable chips. The interposer should allow the shader array enough bandwidth, and you wouldn't have to worry about how to share memory.

Thoughts? Feasible? Terrible idea?
Post #7 in this thread has a link you should find most interesting.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Cost is as important as anything else, and it's going up in all aspects. You need very different memory that's hard to produce, like HBM. You need complicated physical structures in your process, like FinFETs.

There are so many issues that come up that we really have no idea what's feasible and what's not.

We have transistors in the billions. While a lot of the work is automated, a lot is still done by humans. Very smart humans, but they have limits. Having made my own circuits at a very small scale, I still need to worry about how the layout is going to turn out. Even with circuits that simple, the most important thing for me is the same thing that matters most to the big circuit designers at Intel/AMD/Nvidia: reliable production.

What works out in a schematic and in theory doesn't necessarily translate to the real world. It's extremely easy to disregard all the advances and the sweat and tears that went into making these things and be like the guys who just plot the advances on a graph and say, "Oh, it'll continue like that for 50 more years and we'll have computers that replace humans." All that really tells us is how little those who speculate know about the very things they speculate about.
 
Last edited: