Multiple dies acting as one on interposer

hrga225

Member
Jan 15, 2016
This topic was discussed in a few threads before, but always as a side discussion.
I find this tech really interesting, and I think it deserves its own thread.

Since GPUs are the only consumer devices where increasing the transistor count of a chip gives an immediate benefit that is easily observable to the user, I believe this tech will debut in the consumer space with GPUs.

Now, the tech was pioneered by Xilinx in both their 28nm and 20nm high-performance FPGAs. I don't expect it to be ready, or worth the effort, for the consumer space until 10nm (2H 2019), especially since 14nm is still manageable. Later nodes will prove considerably more difficult because EUV will not be viable for quite some time.
Research is already being done, as seen here: http://www.eecg.toronto.edu/~enright/Kannan_MICRO48.pdf (thanks bsp2020). I would appreciate it if anyone can find more links about the topic so that we can have a more educated discussion.
 

maddie

Diamond Member
Jul 18, 2010
Can I assume that reasoned speculation is welcomed?

I have had too much hostile criticism directed towards me for advocating this subject. It gets tedious.
 

hrga225

Member
Jan 15, 2016

Of course.

I am also interested in what the "software guys" think. Will the hardware (and by hardware I mean a sufficiently robust and scalable interconnect) hide the latency well enough that, from their perspective, there are no special cases?
 

maddie

Diamond Member
Jul 18, 2010
From the U of T paper you linked, we get this "the impedance across the interposer is identical to conventional on chip interconnects".

This paper from Xilinx [pg 6] claims a latency of 1 ns between adjacent dies on the interposer. That should remove the claim of greatly increased latency from the Luddites' argument.

http://www.xilinx.com/support/docum...0_Stacked_Silicon_Interconnect_Technology.pdf
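
To put that 1 ns in perspective, here is a crude back-of-envelope sketch. The core clock and the DRAM latency are ballpark figures I am assuming for illustration, not numbers from the paper:

```python
# Back-of-envelope: how large is a 1 ns die-to-die hop, really?
# All numbers below are assumed ballpark figures, purely for illustration.

gpu_clock_hz = 1.0e9           # assume a ~1 GHz GPU core clock
cycle_time_ns = 1e9 / gpu_clock_hz

interposer_hop_ns = 1.0        # the adjacent-die figure quoted above
dram_access_ns = 300.0         # assumed off-package DRAM access latency

print(f"1 interposer hop ~= {interposer_hop_ns / cycle_time_ns:.1f} core clock cycle(s)")
print(f"1 DRAM access    ~= {dram_access_ns / cycle_time_ns:.0f} core clock cycles")
print(f"hop / DRAM       ~= {interposer_hop_ns / dram_access_ns:.1%}")
```

If those assumptions are anywhere near right, a request that has to cross the interposer pays on the order of one extra clock, which is noise next to a memory access.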
 

Headfoot

Diamond Member
Feb 28, 2008
From Paul's link above:
"In his famous 1965 paper, Gordon Moore wrote, “It may prove to be more economical to build large systems out of smaller functions, which are separately packaged and interconnected. The availability of large functions, combined with functional design and construction, should allow the manufacturer of large systems to design and construct a considerable variety of equipment both rapidly and economically.”

I think the overblown skepticism is hilarious considering Moore himself (like everyone who has ever written code and thought "hey, decoupling these routines makes sense") could see the theoretical benefits back in the '60s. The problem has never been why; it's always been how.

I personally wonder if they will couple in some small FPGAs off-die so that they can be refreshed with new video codecs and the like without having to do entire new chip spins, with the ASIC implementation of it following in later chips (e.g. to get a TTM advantage on codecs). Somewhat like how nVidia is handling G-Sync.
 

hrga225

Member
Jan 15, 2016
It's more than just an interposer but other 2.5d solutions. http://semiengineering.com/thinking-outside-the-chip/

Thank you for the link. I find the idea of using panels for interposers quite interesting. You can make a HUGE interposer that way.

From the U of T paper you linked, we get this "the impedance across the interposer is identical to conventional on chip interconnects".

I don't know how I missed it.
So the answer to my question (from a programmer's point of view) would be: I would be none the wiser.
 

MrTeal

Diamond Member
Dec 7, 2003
What I'd like to hear from someone with some knowledge of the subject is how much information is shared between shader engines in AMD's current Fiji layout.
[Images: FijiBlockDiagram.png, AMDGeoFront.jpg]

We know GCN1-3 was limited to four shader engines and 16 render backends; it remains to be seen whether GCN 4 is similarly limited.

If AMD (using GCN1.2/GCN3) were to implement a multi-die approach on something like Fiji, how would they implement it? Would you place the front and back ends (MC, PCIe, XDMA, Display, Command Processors, Geometry Processors, Rasterizer and RBs) on a single chip, and then have four ports from the four geometry processors to four (or two) separate dies that house the compute units?

Or would you try to move the entire shader engine to a separate die connected to the global data share, and link the SE dies together as needed?
 

hrga225

Member
Jan 15, 2016

From my understanding, both AMD and Nvidia build scalable architectures, so only the display engine, PCIe, multimedia block and XDMA should be put on a different die.
 

MrTeal

Diamond Member
Dec 7, 2003

Interesting. How would you handle command splitting and memory access? Would each additional chip have its own command processor, L2 bank and memory controllers? Additionally, how would memory access be shared between separate chips to prevent the need for duplication of memory resources?
 

maddie

Diamond Member
Jul 18, 2010
What I'd like to hear from someone with some knowledge of the subject is how much information is shared between shader engines in AMD's current Fiji layout.
[Images: FijiBlockDiagram.png, AMDGeoFront.jpg]

We know GCN1-3 was limited to four shader engines and 16 render backends; it remains to be seen whether GCN 4 is similarly limited.

If AMD (using GCN1.2/GCN3) were to implement a multi-die approach on something like Fiji, how would they implement it? Would you place the front and back ends (MC, PCIe, XDMA, Display, Command Processors, Geometry Processors, Rasterizer and RBs) on a single chip, and then have four ports from the four geometry processors to four (or two) separate dies that house the compute units?

Or would you try to move the entire shader engine to a separate die connected to the global data share, and link the SE dies together as needed?
Barring detailed knowledge, I tend toward them fabbing everything except the shader engine blocks on a single central unit, with the shader engine blocks as separate units. This shader unit die could be multiplied as needed for the performance target. Four shader units for the maximum-performance product would allow a wide performance range from fabbing just two dies.

We get a 1-, 2- and 4-shader-cluster lineup. From analysing the Fiji die, we can estimate that around 85% of the die is taken up by the shader units. Going with this, each of the 4 sub-units could have at least half the compute area of Fiji, assuming double density. You end up with a 1/2 Fiji+, a Fiji+ equivalent and a double Fiji+. Any further increase in shader efficiency, as claimed by Koduri, would apply to all of them. That's three market segments from fabbing two dies, with the largest one being economically unattainable as a monolithic die for the desktop GPU market at present yields.
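
To put some rough numbers on that yield point, a toy sketch. The defect density, wafer cost and die sizes are all assumptions of mine, and it ignores the central die, the interposer and packaging:

```python
import math

# Rough cost sketch for the "several small dies vs. one big die" point.
# Defect density, wafer cost and die sizes are all assumed numbers.

WAFER_DIAMETER_MM = 300.0
WAFER_COST = 6000.0        # assumed cost of a processed wafer, $
DEFECT_DENSITY = 0.2       # assumed defects per cm^2 (D0)

def silicon_cost(die_area_mm2, dies_per_product=1):
    """Simple Poisson yield model, Y = exp(-D0 * A).
    Ignores wafer-edge loss, the interposer, packaging and test."""
    wafer_area = math.pi * (WAFER_DIAMETER_MM / 2) ** 2
    candidates = wafer_area / die_area_mm2                          # crude die count
    die_yield = math.exp(-DEFECT_DENSITY * die_area_mm2 / 100.0)    # mm^2 -> cm^2
    good_dies = candidates * die_yield
    return dies_per_product * WAFER_COST / good_dies

print(f"600 mm^2 monolithic:      ~${silicon_cost(600.0):.0f} of silicon per product")
print(f"4 x 150 mm^2 shader dies: ~${silicon_cost(150.0, 4):.0f} of silicon per product")
```

With those assumed numbers, four small dies come out at well under half the silicon cost of the one big die, which is the whole attraction.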

I would think the cross-communication between shader unit blocks would only be used when there is an imbalance in load, i.e. when some shaders become free because their simpler work finished faster. I do not see most of the data being routed this way on a continuous basis; it only handles the overflow.

What would happen in this layout is that the command processor would be oversized for the smaller GPUs with a lower shader unit count. Also, the number of ACEs would be fixed for the entire range. In any case, the command processor is a tiny portion of the die, and the overall cost saving from fabbing higher-yielding sub-units should cover that small hit.

Everything has compromises.
 

hrga225

Member
Jan 15, 2016
Interesting. How would you handle command splitting and memory access? Would each additional chip have its own command processor, L2 bank and memory controllers? Additionally, how would memory access be shared between separate chips to prevent the need for duplication of memory resources?

Well, I don't think there would be much difference from monolithic designs. L2 banks and memory controllers are either fixed to a particular shader engine (Nvidia) or "flexible" and connected through a crossbar (AMD). So there would be some penalty in additional circuitry compared with a chip of the same size that was not built with interposer connectivity in mind, and some parts of the dies would be beefier than their own shader array needs.
There is no free lunch.
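
To illustrate where that extra circuitry comes from, a toy sketch of fine-grained address interleaving when the memory channels are spread across several shader-engine dies. The channel count, stride and channel-to-die assignment are all made up for the example:

```python
# Toy model: fine-grained address interleaving across memory channels
# that are physically spread over several shader-engine dies.
# Channel count, interleave stride and channel->die mapping are assumptions.

NUM_CHANNELS = 8         # e.g. 8 HBM channels in total
STRIDE = 256             # assumed interleave granularity in bytes
CHANNELS_PER_DIE = 2     # assumed: 2 channels local to each of 4 dies

def route(address, requesting_die):
    """Return (channel, owning_die, is_remote) for one physical address."""
    channel = (address // STRIDE) % NUM_CHANNELS
    owning_die = channel // CHANNELS_PER_DIE
    return channel, owning_die, owning_die != requesting_die

# A die streaming linearly through memory mostly hits channels that live
# on the other dies, so the crossbar has to span the interposer.
addresses = range(0, 1 << 20, STRIDE)
remote = sum(route(a, requesting_die=0)[2] for a in addresses)
print(f"remote accesses: {remote}/{len(addresses)} ({remote / len(addresses):.0%})")
```

With 4 dies and that kind of interleaving, roughly 3/4 of a die's accesses land on someone else's channel, and that is exactly the traffic the crossbar and any coherence logic have to carry.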
 

Paul98

Diamond Member
Jan 31, 2010
What will be interesting is if they are able to break the GPU up into pieces, removing the need to manufacture a single large die, and then be able to bring out a large GPU more cheaply and quickly.
 

maddie

Diamond Member
Jul 18, 2010
Ha, we are speculating on different approaches to the same problem and ended up with almost the same sentence. Funny.
Noticed that. 1 minute apart

Look at these two die shots to visualize what would need to be re-positioned. Amazing the amount of space saved by using HBM.

[Two die shot images]
 

MrTeal

Diamond Member
Dec 7, 2003
Well, I don't think there would be much difference from monolithic designs. L2 banks and memory controllers are either fixed to a particular shader engine (Nvidia) or "flexible" and connected through a crossbar (AMD). So there would be some penalty in additional circuitry compared with a chip of the same size that was not built with interposer connectivity in mind, and some parts of the dies would be beefier than their own shader array needs.
There is no free lunch.

Having all the memory controllers on the central die has the disadvantage of using that die space even when the class of GPU doesn't warrant the extra stacks of HBM (or channels of GDDR5X). I can see a lot of advantages, though, in having a single point-to-point link between the central die and each shader engine die and implementing the crossbar on the central die, rather than having all the SE dies linked to each other.
I would imagine you'd still want the command processor on the central die, though I profess no real experience here and could easily be swayed by a well-reasoned argument otherwise.
 

MrTeal

Diamond Member
Dec 7, 2003
Noticed that. 1 minute apart

Look at these two die shots to visualize what would need to be re-positioned. Amazing the amount of space saved by using HBM.

It's not just HBM and Fiji; Hawaii's 512-bit memory controller was 30% smaller than Tahiti's 384-bit MC. Tahiti's was just really large for whatever reason.
 

hrga225

Member
Jan 15, 2016
Having all the memory controllers on the central die has the disadvantage of using that die space even when the class of GPU doesn't warrant the extra stacks of HBM (or channels of GDDR5X). I can see a lot of advantages, though, in having a single point-to-point link between the central die and each shader engine die and implementing the crossbar on the central die, rather than having all the SE dies linked to each other.
I would imagine you'd still want the command processor on the central die, though I profess no real experience here and could easily be swayed by a well-reasoned argument otherwise.
I think we misunderstood each other. I would take the amorphous approach; the penalty would be coherence circuitry in both the caches and dispatch.
Edit: Am I making sense?
 

tential

Diamond Member
May 13, 2008
What will be interesting is if they are able to break the GPU up into pieces, removing the need to manufacture a single large die, and then be able to bring out a large GPU more cheaply and quickly.
Basically you could build one chip and then scale it into what you need: low end is 1, mid-range is 2 and high end is 4 chips. Pretty cool if possible.
 

maddie

Diamond Member
Jul 18, 2010
Just thought of something. A modification to the layout I'm favoring.

Assuming 4 shader sub-units for the top-end product, what about including one HBM module interface and memory controller on each of the shader units, instead of placing them all on the central unit as I originally suggested? If we can use the interposer as the lower routing layers of a traditional monolithic die, and we can, this should work and allow for less wasted space on lower-end parts that use less HBM. Looking at the die shots, the distances should be similar to Fiji.

What do both of you think? This should answer MrTeal's concern.
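
A quick sanity check on the latency side of that layout, assuming the kind of fine-grained interleaving sketched earlier in the thread (so roughly 3/4 of accesses land on another die's stack). The HBM latency and hop figures are assumptions:

```python
# Crude average-latency estimate for "one HBM stack per shader die".
# The latencies and the remote-access fraction are assumptions.

LOCAL_HBM_NS = 150.0        # assumed latency to the local HBM stack
INTERPOSER_HOP_NS = 1.0     # adjacent-die hop, per the Xilinx figure
REMOTE_FRACTION = 0.75      # 4 dies with fine-grained interleaving

avg_ns = ((1 - REMOTE_FRACTION) * LOCAL_HBM_NS
          + REMOTE_FRACTION * (LOCAL_HBM_NS + 2 * INTERPOSER_HOP_NS))  # request + reply hop
print(f"average access latency ~= {avg_ns:.1f} ns "
      f"({avg_ns / LOCAL_HBM_NS - 1:.1%} worse than all-local)")
```

If those assumptions hold, spreading the stacks across the shader dies costs almost nothing in average memory latency.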
 

maddie

Diamond Member
Jul 18, 2010
Basically you could build one chip and then scale it into what you need: low end is 1, mid-range is 2 and high end is 4 chips. Pretty cool if possible.
I would like your thoughts on this. I see you as a neutral observer.

Per Koduri, two Polaris versions this year. One has been demoed and is for the GTX950 - GTX960 market segment + laptops. The other is being used to reclaim the top end from Nvidia [a lofty goal].

What do you see happening mid-market, the real money spinner segment?
 

MrTeal

Diamond Member
Dec 7, 2003

The NVIDIA GeForce GTX 660 Review: GK106 Fills Out The Kepler Family
As our regular readers are well aware, NVIDIA’s 28nm supply constraints have proven to be a constant thorn in the side of the company. Since Q2 the message in financial statements has been clear: NVIDIA could be selling more GPUs if they had access to more 28nm capacity. As a result of this capacity constraint they have had to prioritize the high-profit mainstream mobile and high-end desktop markets above other consumer markets, leaving holes in their product lineups. In the intervening time they have launched products like the GK104-based GeForce GTX 660 Ti to help bridge that gap, but even that still left a hole between $100 and $300.

Now nearly 6 months after the launch of the first Kepler GPUs – and 9 months after the launch of the first 28nm GPUs – NVIDIA’s situation has finally improved to the point where they can finish filling out the first iteration of the Kepler GPU family. With GK104 at the high-end and GK107 at the low-end, the task of filling out the middle falls to NVIDIA’s latest GPU: GK106.

In the intervening time, the lineup was filled with Fermi. The GTX 570 provided 85% of the $300 660 Ti's performance at $250, and the 560 Ti provided 67% of it at closer to $200. Hell, the 550Ti was still available at $119 vs the $109 of the GK107-based GT 640, and it was way, way faster than the 640.

If P10 is ~100mm² and P11 is ~300mm², I'd expect them to continue to sell existing 28nm designs in the missing price points.
 

maddie

Diamond Member
Jul 18, 2010
This time we have the DX12 factor. If this happens and Nvidia ends up with three new dies to AMD's two, then AMD, in my opinion, has lost the race again. Who will buy 28nm AMD parts then? The biggest share of profits [units sold x margin] is in the mid-range.

The purpose of the halo cards, although profitable in their own right, is to stimulate other sales. Why have a halo if you have nothing else except the low end? Also, most of the development cost would have gone into all the redesigning, learning about 14nm, etc. Developing a mid-range die would have been a small additional cost compared to what had already been spent.

This wait will be difficult.