Multiple dies acting as one on interposer

hrga225

Member
Jan 15, 2016
This topic was discussed in a few threads before, but always as a side discussion.
I find this tech really interesting, and I think it deserves its own thread.

Since GPUs are the only consumer devices where increasing the transistor count of a chip gives an immediate benefit that is easily observable to the user, I believe this tech will debut in the consumer space with GPUs.

Now, the tech was pioneered by Xilinx in both their 28nm and 20nm high-performance FPGAs. I don't expect it to be ready, or worth the effort, for the consumer space until 10nm (2H 2019), especially since 14nm is still manageable. Later nodes will prove considerably more difficult because EUV will not be viable for quite some time.
Research is already being done, as seen here: http://www.eecg.toronto.edu/~enright/Kannan_MICRO48.pdf (thanks bsp2020). I would appreciate it if anyone can find more links about the topic so that we can have a more educated discussion.
 

maddie

Diamond Member
Jul 18, 2010
Can I assume that reasoned speculation is welcomed?

I have had too much hostile criticism directed towards me for advocating this subject. It gets tedious.
 

hrga225

Member
Jan 15, 2016

Of course.

I am also interested in what the "software guys" think. Will the hardware (and by hardware I mean a sufficiently robust and scalable interconnect) hide the latency well enough that, from their perspective, there are no special cases?
 

maddie

Diamond Member
Jul 18, 2010
From the U of T paper you linked, we get this "the impedance across the interposer is identical to conventional on chip interconnects".

This paper from Xilinx [pg 6] claims a latency of 1 ns between adjacent dies on the interposer. That should remove the claim of greatly increased latency from the Luddites' argument.

http://www.xilinx.com/support/docum...0_Stacked_Silicon_Interconnect_Technology.pdf
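
To put that 1 ns in perspective, here is a crude back-of-envelope sketch. The core clock and the DRAM latency are ballpark figures I am assuming for illustration, not numbers from the paper:

```python
# Back-of-envelope: how large is a 1 ns die-to-die hop, really?
# All numbers below are assumed ballpark figures, purely for illustration.

gpu_clock_hz = 1.0e9           # assume a ~1 GHz GPU core clock
cycle_time_ns = 1e9 / gpu_clock_hz

interposer_hop_ns = 1.0        # the adjacent-die figure quoted above
dram_access_ns = 300.0         # assumed off-package DRAM access latency

print(f"1 interposer hop ~= {interposer_hop_ns / cycle_time_ns:.1f} core clock cycle(s)")
print(f"1 DRAM access    ~= {dram_access_ns / cycle_time_ns:.0f} core clock cycles")
print(f"hop / DRAM       ~= {interposer_hop_ns / dram_access_ns:.1%}")
```

If those assumptions are anywhere near right, a request that has to cross the interposer pays on the order of one extra clock, which is noise next to a memory access.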
 

Headfoot

Diamond Member
Feb 28, 2008
From Paul's link above:
"In his famous 1965 paper, Gordon Moore wrote, “It may prove to be more economical to build large systems out of smaller functions, which are separately packaged and interconnected. The availability of large functions, combined with functional design and construction, should allow the manufacturer of large systems to design and construct a considerable variety of equipment both rapidly and economically.”

I think the overblown skepticism is hilarious considering Moore himself (like everyone who has ever written code and thought "hey, decoupling these routines makes sense") could see the theoretical benefits back in the '60s. The problem has never been why; it's always been how.

I personally wonder if they will couple in some small FPGAs off-die so that they can be refreshed with new video codecs and the like without having to do entire new chip spins, with the ASIC implementation of it following in later chips (e.g. to get a TTM advantage on codecs). Somewhat like how nVidia is handling G-Sync.
 

hrga225

Member
Jan 15, 2016
It's more than just an interposer but other 2.5d solutions. http://semiengineering.com/thinking-outside-the-chip/

Thank you for the link. I find the idea of using panels for interposers quite interesting. You can make a HUGE interposer that way.

From the U of T paper you linked, we get this "the impedance across the interposer is identical to conventional on chip interconnects".

I don't know how I missed it.
So the answer to my question (from a programmer's point of view) would be: I would be none the wiser.
 

MrTeal

Diamond Member
Dec 7, 2003
What I'd like to hear from someone with some knowledge of the subject is how much information is shared between shader engines in AMD's current Fiji layout.
[Images: FijiBlockDiagram.png, AMDGeoFront.jpg]

We know GCN1-3 was limited to four shader engines and 16 render backends; it remains to be seen whether GCN 4 is similarly limited.

If AMD (using GCN1.2/GCN3) were to implement a multi-die approach on something like Fiji, how would they implement it? Would you place the front and back ends (MC, PCIe, XDMA, Display, Command Processors, Geometry Processors, Rasterizer and RBs) on a single chip, and then have four ports from the four geometry processors to four (or two) separate dies that house the compute units?

Or would you try to move the entire shader engine to a separate die connected to the global data share, and link the SE dies together as needed?
 

hrga225

Member
Jan 15, 2016

From my understanding, both AMD and Nvidia build scalable architectures, so only the display engine, PCIe, multimedia block and XDMA should be put on a different die.
 

MrTeal

Diamond Member
Dec 7, 2003

Interesting. How would you handle command splitting and memory access? Would each additional chip have its own command processor, L2 bank and memory controllers? Additionally, how would memory access be shared between separate chips to prevent the need for duplication of memory resources?
 

maddie

Diamond Member
Jul 18, 2010
What I'd like to hear from someone with some knowledge of the subject is how much information is shared between shader engines in AMD's current Fiji layout.
[Images: FijiBlockDiagram.png, AMDGeoFront.jpg]

We know GCN1-3 was limited to four shader engines and 16 render backends; it remains to be seen whether GCN 4 is similarly limited.

If AMD (using GCN1.2/GCN3) were to implement a multi-die approach on something like Fiji, how would they implement it? Would you place the front and back ends (MC, PCIe, XDMA, Display, Command Processors, Geometry Processors, Rasterizer and RBs) on a single chip, and then have four ports from the four geometry processors to four (or two) separate dies that house the compute units?

Or would you try to move the entire shader engine to a separate die connected to the global data share, and link the SE dies together as needed?
Barring detailed knowledge, I tend toward them fabbing everything except the shader engine blocks on a single central unit, with the shader engine blocks as separate units. This shader unit die could be multiplied as needed for the performance target. Four shader units for the maximum-performance product would allow a wide performance range from fabbing just two dies.

We get a 1-, 2- and 4-shader-cluster lineup. From analysing the Fiji die, we can estimate that around 85% of the die is taken up by the shader units. Going with this, each of the 4 sub-units could have at least half the compute area of Fiji, assuming double density. You end up with a 1/2 Fiji+, a Fiji+ equivalent and a double Fiji+. Any further increase in shader efficiency, as claimed by Koduri, would apply to all of them. That's three market segments from fabbing two dies, with the largest one being economically unattainable as a monolithic die for the desktop GPU market at present yields.
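
To put some rough numbers on that yield point, a toy sketch. The defect density, wafer cost and die sizes are all assumptions of mine, and it ignores the central die, the interposer and packaging:

```python
import math

# Rough cost sketch for the "several small dies vs. one big die" point.
# Defect density, wafer cost and die sizes are all assumed numbers.

WAFER_DIAMETER_MM = 300.0
WAFER_COST = 6000.0        # assumed cost of a processed wafer, $
DEFECT_DENSITY = 0.2       # assumed defects per cm^2 (D0)

def silicon_cost(die_area_mm2, dies_per_product=1):
    """Simple Poisson yield model, Y = exp(-D0 * A).
    Ignores wafer-edge loss, the interposer, packaging and test."""
    wafer_area = math.pi * (WAFER_DIAMETER_MM / 2) ** 2
    candidates = wafer_area / die_area_mm2                          # crude die count
    die_yield = math.exp(-DEFECT_DENSITY * die_area_mm2 / 100.0)    # mm^2 -> cm^2
    good_dies = candidates * die_yield
    return dies_per_product * WAFER_COST / good_dies

print(f"600 mm^2 monolithic:      ~${silicon_cost(600.0):.0f} of silicon per product")
print(f"4 x 150 mm^2 shader dies: ~${silicon_cost(150.0, 4):.0f} of silicon per product")
```

With those assumed numbers, four small dies come out at well under half the silicon cost of the one big die, which is the whole attraction.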

I would think the cross-communication between shader unit blocks would only be used when there is an imbalance in load, i.e. when some shaders become free because their simpler work finished faster. I do not see most of the data being routed this way on a continuous basis; it only handles the overflow.

What would happen in this layout is that the command processor would be oversized for the smaller GPUs with a lower shader unit count. Also, the number of ACEs would be fixed for the entire range. In any case, the command processor is a tiny portion of the die, and the overall cost saving from fabbing higher-yielding sub-units should cover that small hit.

Everything has compromises.
 

hrga225

Member
Jan 15, 2016
Interesting. How would you handle command splitting and memory access? Would each additional chip have its own command processor, L2 bank and memory controllers? Additionally, how would memory access be shared between separate chips to prevent the need for duplication of memory resources?

Well, I don't think there would be much difference from monolithic designs. L2 banks and memory controllers are either fixed to a particular shader engine (Nvidia) or "flexible" and connected through a crossbar (AMD). So there would be some penalty in additional circuitry compared with a chip of the same size that was not built with interposer connectivity in mind, and some parts of the dies would be beefier than their own shader array needs.
There is no free lunch.
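
To illustrate where that extra circuitry comes from, a toy sketch of fine-grained address interleaving when the memory channels are spread across several shader-engine dies. The channel count, stride and channel-to-die assignment are all made up for the example:

```python
# Toy model: fine-grained address interleaving across memory channels
# that are physically spread over several shader-engine dies.
# Channel count, interleave stride and channel->die mapping are assumptions.

NUM_CHANNELS = 8         # e.g. 8 HBM channels in total
STRIDE = 256             # assumed interleave granularity in bytes
CHANNELS_PER_DIE = 2     # assumed: 2 channels local to each of 4 dies

def route(address, requesting_die):
    """Return (channel, owning_die, is_remote) for one physical address."""
    channel = (address // STRIDE) % NUM_CHANNELS
    owning_die = channel // CHANNELS_PER_DIE
    return channel, owning_die, owning_die != requesting_die

# A die streaming linearly through memory mostly hits channels that live
# on the other dies, so the crossbar has to span the interposer.
addresses = range(0, 1 << 20, STRIDE)
remote = sum(route(a, requesting_die=0)[2] for a in addresses)
print(f"remote accesses: {remote}/{len(addresses)} ({remote / len(addresses):.0%})")
```

With 4 dies and that kind of interleaving, roughly 3/4 of a die's accesses land on someone else's channel, and that is exactly the traffic the crossbar and any coherence logic have to carry.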
 

Paul98

Diamond Member
Jan 31, 2010
What will be interesting is if they are able to break the GPU up into pieces, removing the need to manufacture a single large die, and then be able to bring out a large GPU more cheaply and quickly.
 

maddie

Diamond Member
Jul 18, 2010
Ha, we are speculating on different approaches to the same problem and ended up with almost the same sentence. Funny.
Noticed that. 1 minute apart

Look at these two die shots to visualize what would need to be re-positioned. Amazing the amount of space saved by using HBM.

[Two die shot images]
 

MrTeal

Diamond Member
Dec 7, 2003
Well, I don't think there would be much difference from monolithic designs. L2 banks and memory controllers are either fixed to a particular shader engine (Nvidia) or "flexible" and connected through a crossbar (AMD). So there would be some penalty in additional circuitry compared with a chip of the same size that was not built with interposer connectivity in mind, and some parts of the dies would be beefier than their own shader array needs.
There is no free lunch.

Having all the memory controllers on the central die has the disadvantage of using that die space even when the class of GPU doesn't warrant the extra stacks of HBM (or channels of GDDR5X). I can see a lot of advantages, though, in having a single point-to-point link between the central die and each shader engine die and implementing the crossbar on the central die, rather than having all the SE dies linked to each other.
I would imagine you'd still want the command processor on the central die, though I profess no real experience here and could easily be swayed by a well-reasoned argument otherwise.
 

MrTeal

Diamond Member
Dec 7, 2003
Noticed that. 1 minute apart

Look at these two die shots to visualize what would need to be re-positioned. Amazing the amount of space saved by using HBM.

It's not just HBM and Fiji; Hawaii's 512-bit memory controller was 30% smaller than Tahiti's 384-bit MC. Tahiti's was just really large for whatever reason.
 

hrga225

Member
Jan 15, 2016
Having all the memory controllers on the central die has the disadvantage of using that die space even when the class of GPU doesn't warrant the extra stacks of HBM (or channels of GDDR5X). I can see a lot of advantages, though, in having a single point-to-point link between the central die and each shader engine die and implementing the crossbar on the central die, rather than having all the SE dies linked to each other.
I would imagine you'd still want the command processor on the central die, though I profess no real experience here and could easily be swayed by a well-reasoned argument otherwise.
I think we misunderstood each other. I would take the amorphous approach; the penalty would be coherence circuitry in both the caches and dispatch.
Edit: Am I making sense?
 

tential

Diamond Member
May 13, 2008
What will be interesting is if they are able to break the GPU up into pieces, removing the need to manufacture a single large die, and then be able to bring out a large GPU more cheaply and quickly.
Basically you could build one chip and then scale it into what you need: low end is 1, mid-range is 2 and high end is 4 chips. Pretty cool if possible.
 

maddie

Diamond Member
Jul 18, 2010
Just thought of something. A modification to the layout I'm favoring.

Assuming 4 shader sub-units for the top-end product, what about including one HBM module interface and memory controller on each of the shader units, instead of placing them all on the central unit as I originally suggested? If we can use the interposer as the lower routing layers of a traditional monolithic die, and we can, this should work and allow for less wasted space on lower-end parts that use less HBM. Looking at the die shots, the distances should be similar to Fiji.

What do both of you think? This should answer MrTeal's concern.
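
A quick sanity check on the latency side of that layout, assuming the kind of fine-grained interleaving sketched earlier in the thread (so roughly 3/4 of accesses land on another die's stack). The HBM latency and hop figures are assumptions:

```python
# Crude average-latency estimate for "one HBM stack per shader die".
# The latencies and the remote-access fraction are assumptions.

LOCAL_HBM_NS = 150.0        # assumed latency to the local HBM stack
INTERPOSER_HOP_NS = 1.0     # adjacent-die hop, per the Xilinx figure
REMOTE_FRACTION = 0.75      # 4 dies with fine-grained interleaving

avg_ns = ((1 - REMOTE_FRACTION) * LOCAL_HBM_NS
          + REMOTE_FRACTION * (LOCAL_HBM_NS + 2 * INTERPOSER_HOP_NS))  # request + reply hop
print(f"average access latency ~= {avg_ns:.1f} ns "
      f"({avg_ns / LOCAL_HBM_NS - 1:.1%} worse than all-local)")
```

If those assumptions hold, spreading the stacks across the shader dies costs almost nothing in average memory latency.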
 

maddie

Diamond Member
Jul 18, 2010
Basically you could build one chip and then scale it into what you need: low end is 1, mid-range is 2 and high end is 4 chips. Pretty cool if possible.
I would like your thoughts on this. I see you as a neutral observer.

Per Koduri, two Polaris versions this year. One has been demoed and is for the GTX950 - GTX960 market segment + laptops. The other is being used to reclaim the top end from Nvidia [a lofty goal].

What do you see happening mid-market, the real money spinner segment?
 

MrTeal

Diamond Member
Dec 7, 2003

The NVIDIA GeForce GTX 660 Review: GK106 Fills Out The Kepler Family
As our regular readers are well aware, NVIDIA’s 28nm supply constraints have proven to be a constant thorn in the side of the company. Since Q2 the message in financial statements has been clear: NVIDIA could be selling more GPUs if they had access to more 28nm capacity. As a result of this capacity constraint they have had to prioritize the high-profit mainstream mobile and high-end desktop markets above other consumer markets, leaving holes in their product lineups. In the intervening time they have launched products like the GK104-based GeForce GTX 660 Ti to help bridge that gap, but even that still left a hole between $100 and $300.

Now nearly 6 months after the launch of the first Kepler GPUs – and 9 months after the launch of the first 28nm GPUs – NVIDIA’s situation has finally improved to the point where they can finish filling out the first iteration of the Kepler GPU family. With GK104 at the high-end and GK107 at the low-end, the task of filling out the middle falls to NVIDIA’s latest GPU: GK106.

In the intervening time, the lineup was filled with Fermi. The GTX 570 provided 85% of the $300 660 Ti's performance at $250, and the 560 Ti provided 67% of it at closer to $200. Hell, the 550Ti was still available at $119 vs the $109 of the GK107-based GT 640, and it was way, way faster than the 640.

If P10 is ~100mm² and P11 is ~300mm², I'd expect them to continue to sell existing 28nm designs in the missing price points.
 

maddie

Diamond Member
Jul 18, 2010
This time we have the DX12 factor. If this happens and Nvidia ends up with three new dies to AMD's two, then AMD, in my opinion, has lost the race again. Who will buy 28nm AMD parts then? The biggest share of profits [units sold x margin] is in the mid-range.

The purpose of the halo cards, although profitable in their own right, is to stimulate other sales. Why have a halo if you have nothing else except the low end? Also, most of the development cost would have gone into all the redesigning, learning about 14nm, etc. Developing a mid-range die would have been a small additional cost compared to what had already been spent.

This wait will be difficult.