Multiple dies acting as one on interposer

MrTeal · Jan 18, 2016

maddie said:
This time we have the DX12 factor. If this happens and Nvidia has three for AMD two, then AMD, in my opinion has lost the race again. Who will buy AMD 28nm then? The biggest % of profits [units sold x margin] is the mid-range.

The purpose of the halo cards although profitable in their own right, is to stimulate other sales. Why have a halo if you have nothing else except the low end. Also most of the development cost would have been all the redesigning, learning about 14nm, etc. The additional cost to develop a mid-range die would have been a small additional cost compared to what had been spent already.

This wait will be difficult.

Do you have a source for the red? The steam hardware survey shows the 970 comprising almost twice as many GPUs as the 960.

Keep in mind they said two GPUs this year, and they're aiming for back to school. That doesn't mean they won't follow up with another GPU early in 2017 to fill in any gaps.

maddie · Jan 18, 2016

MrTeal said:
Do you have a source for the red? The steam hardware survey shows the 970 comprising almost twice as many GPUs as the 960.

Keep in mind they said two GPUs this year, and they're aiming for back to school. That doesn't mean they won't follow up with another GPU early in 2017 to fill in any gaps.

I class the GTX970 as mid-range. I remember reading the numbers. Where, can't remember.

We'll have to see what happens. Leaving that gap in the center to be possibly filled with present products is a big failure for me. Very poor strategy in my view.

MrTeal · Jan 18, 2016

maddie said:
I class the GTX970 as mid-range. I remember reading the numbers. Where, can't remember.

We'll have to see what happens. Leaving that gap in the center to be possibly filled with present products is a big failure for me. Very poor strategy in my view.

Well, if you class the x70 as midrange, that will make sense. Considering that x70s are Gx104 and will likely compete with Polaris 11, then AMD will have GPUs in the midrange market segment.

You might think it poor strategy, but it is standard procedure. Introducing many GPUs in the span of a couple months is almost unheard of. This is what Ryan Smith wrote for AT when AMD filled out the 7xxx series.

In 2009-2010, AMD launched the entire 4 chip Evergreen series in 6 months. By previous standards this was a quick pace for a new design, especially since AMD had not previously attempted a 4 chip launch in such a manner. Now in 2012 AMDs Southern Islands team is hard at work at wrapping up their own launch with new aspirations on quickness. Evergreen may have launched 4 chips in 6 months, but this month AMD will be completing the 3 chip Southern Islands launch in half the time 3 chips in a mere 3 months.

It took nVidia 6 months with that transition to get out their 3 main GPUs. Since then, AMD launched two new GCN 1.1 dies in the 2xx series and filled in the rest with GCN 1.0, and then launched another two new GCN1.2 dies with the 3xx series and filled in the rest with GCN1.1 and GCN1.0. I'd love it if they launched 3 new dies and 6-8 new SKUs in the span of a couple months (along with the same from nVidia), but that's hardly standard practice historically.

maddie · Jan 18, 2016

MrTeal said:
Well, if you class the x70 as midrange, that will make sense. Considering that x70s are Gx104 and will likely compete with Polaris 11, then AMD will have GPUs in the midrange market segment.

You might think it poor strategy, but it is standard procedure. Introducing many GPUs in the span of a couple months is almost unheard of. This is what Ryan Smith wrote for AT when AMD filled out the 7xxx series.

It took nVidia 6 months with that transition to get out their 3 main GPUs. Since then, AMD launched two new GCN 1.1 dies in the 2xx series and filled in the rest with GCN 1.0, and then launched another two new GCN1.2 dies with the 3xx series and filled in the rest with GCN1.1 and GCN1.0. I'd love it if they launched 3 new dies and 6-8 new SKUs in the span of a couple months (along with the same from nVidia), but that's hardly standard practice historically.

I see your argument, but for this I'll have to see it happen to believe. Remember it's small and then top-end. Small I get, Oem and laptop are critical. The other release has confused me.

I'm off now.

3DVagabond · Jan 18, 2016

maddie said:
I see your argument, but for this I'll have to see it happen to believe. Remember it's small and then top-end. Small I get, Oem and laptop are critical. The other release has confused me.

I'm off now.

What would you suggest they put HBM2 into? Or do you think they should shelf it and take a step back in their tech? That would seem like a colossal waste of R&D.

hrga225 · Jan 19, 2016

tential said:
Basically you could build 1 chip then scale it into what you need. Low end is 1, mid range is 2 and High end is 4 chips. Pretty cool if possible

maddie said:
Just thought of something. A modification to the layout I'm favoring. Assuming 4 shader sub-units for the top end product, what about including one HBM module interface and memory controller on each of the shader units instead of placing all on the central unit as I originally suggested. If we can use the interposer as the lower routing layers of a traditional monolithic die, and we can, this should work, and allow for less wasted space on lower end parts that use less HBM. Looking at the die shots, the distances should be similar to Fiji. What do both of you think? This should answer Mrteal's concern.

This is what I meant when I said amorphous approach.Blame 10 hour working day.
So in essence you do this on new difficult node:
1.Entry level GPU ~100mm2 - Do it traditional way.
2.Mid range GPU ~200mm2 - Build it for multi die approach with additional circuitry for data hopping to far die but without parts like multimedia,PCIe etc.It is full GPU in traditional sense.Build die with multimedia etc.Put dies with HBM on interposer.
3. Enthusiast level GPU ~400mm2 - Double 2. step
4. High end level GPU 600-1000 mm2 - Repeat 2. step until you run out of space,power,market or any combination of those three.
Success of this approach is how smart you are with interposer.Interposer itself becomes extension to both EACH DIE AND CLUSTER itself.
Or to put it this way one CANNOT function without the other.

@maddie May I point out that you should not get too much hanged on layout of current chips.I know that data locality is what bothers you,but you have to remember that interposer on Fiji routed same stuff that Hawaii routed through PCB in basically same way.It was in no way extension of Fiji itself,"just fancy PCB".

maddie · Jan 19, 2016

3DVagabond said:
What would you suggest they put HBM2 into? Or do you think they should shelf it and take a step back in their tech? That would seem like a colossal waste of R&D.

I definitely think HBM2 should and will be used. This thread depends on it.

You needed to read earlier posts, like this one, to understand why the confusion.

This time we have the DX12 factor. If this happens and Nvidia has three for AMD two, then AMD, in my opinion has lost the race again. Who will buy AMD 28nm then? The biggest % of profits [units sold x margin] is the mid-range. [I included GTX970 here]

The purpose of the halo cards although profitable in their own right, is to stimulate other sales. Why have a halo if you have nothing else except the low end. Also most of the development cost would have been all the redesigning, learning about 14nm, etc. The additional cost to develop a mid-range die would have been a small additional cost compared to what had been spent already.

This wait will be difficult.

Why would AMD forgo spending that additional mid-range development sum and risk loosing most of the middle ranges sales? Everyone on the Arstechnica thread are looking for traditional solutions to explain the performance gap. This thread is pursuing the possibility of another solution. Koduri said two versions of Polaris.

maddie · Jan 19, 2016

hrga225 said:
This is what I meant when I said amorphous approach.Blame 10 hour working day.
So in essence you do this on new difficult node:
1.Entry level GPU ~100mm2 - Do it traditional way.
2.Mid range GPU ~200mm2 - Build it for multi die approach with additional circuitry for data hopping to far die but without parts like multimedia,PCIe etc.It is full GPU in traditional sense.Build die with multimedia etc.Put dies with HBM on interposer.
3. Enthusiast level GPU ~400mm2 - Double 2. step
4. High end level GPU 600-1000 mm2 - Repeat 2. step until you run out of space,power,market or any combination of those three.
Success of this approach is how smart you are with interposer.Interposer itself becomes extension to both EACH DIE AND CLUSTER itself.
Or to put it this way one CANNOT function without the other.

@maddie May I point out that you should not get too much hanged on layout of current chips.I know that data locality is what bothers you,but you have to remember that interposer on Fiji routed same stuff that Hawaii routed through PCB in basically same way.It was in no way extension of Fiji itself,"just fancy PCB".

Yes, the data locality is a concern of mine. In an ideal world, the best designs will maximize this, but we both know about ideal worlds.

My steps as compared to yours:
1.Entry level GPU ~100mm2 - Do it traditional way. GDDR5 [ agree with you = Polaris 10]
Polaris 11 = 2 die
Die [A] = central unit common to all versions ~ 75mm^2
Die = Shader engine unit with a single stack HBM memory controller ~ 125mm^2

2.Mid range GPU ~200mm^2 - [A] + 1 stack HBM2
3. Enthusiast level GPU ~325mm^2 - [A] + 2 x 2 stacks HBM2
4. High end level GPU~ 575mm^2 - [A] + 4 x 4 stacks HBM2

The interesting thing here is the max Die size of 125mm^2. Depending on the actual present yields of the 14nm process, the cost savings relative to a 300mm^2 monolithic Die, might make the interposer and HBM cost the same as GDDR5 or even less.

I think we are in close agreement. That step 4 you mentioned 600-1000mm^2 is missed by many. The interposer base itself is NOT limited by the reticle limit. Only the traces must fit into a single mask. With the microbump density being around 500/mm^2, I can see a mask design overlaying all the possible trace paths needed for different parts of the interposer and only the ones needed for the individual specific die-die connection being used. A Fiji sized multi-Die interposer would need 2 mask exposures, left and right.

Do you propose having the [Compute engines, Command processor,Scheduler, Global Data Share and L2 Cache] in the central shared unit.
If these were split into the sub-units, the data transfers to maintain the single GPU characteristics could cost too much in energy use.

hrga225 · Jan 19, 2016

maddie said:
Do you propose having the [Compute engines, Command processor,Scheduler, Global Data Share and L2 Cache] in the central shared unit. If these were split into the sub-units, the data transfers to maintain the single GPU characteristics could cost too much in energy use.

No,I believe that distributed design is better solution.Thing is, it bugged me how to achieve coherency without wasting to much power but solution is already there:
http://www.hsafoundation.com/html/HSA_Library.htm#SysArch/Topics/02_Details/_chpStr_SysArch_details.htm%3FTocPath%3DHSA%2520Platform%2520System%2520Architecture%2520Specification%2520Version%25201.0%2520%7CChapter%25202.%2520System%2520Architecture%2520Requirements%253A%2520Details%7C_____0
http://www.hsafoundation.com/html/H....%20HSA%20Memory%20Consistency%20Model|_____0
Edit: And you can hide behind compiler and API!

maddie · Jan 19, 2016

hrga225 said:
No,I believe that distributed design is better solution.Thing is, it bugged me how to achieve coherency without wasting to much power but solution is already there:
http://www.hsafoundation.com/html/H...chitecture%20Requirements%3A%20Details|_____0
http://www.hsafoundation.com/html/H....%20HSA%20Memory%20Consistency%20Model|_____0
Edit: And you can hide behind compiler and API!

That document is some serious reading, but I'm appreciating your preferred layout.

hrga225 · Jan 19, 2016

maddie said:
That document is some serious reading, but I'm appreciating your preferred layout.

Mind you I can be completely wrong!
But to me it makes most sense.As chip you build grows, also grow Compute engines, Command processor,Scheduler, Global Data Share and L2 Cache.So you are going to waste more energy shuffling data across chip no matter what.So why waste resources on building multiple mask sets when you can use already researched methods that will also come in handy for different product (example: APU).

tential · Jan 19, 2016

hrga225 said:
This is what I meant when I said amorphous approach.Blame 10 hour working day.
So in essence you do this on new difficult node:
1.Entry level GPU ~100mm2 - Do it traditional way.
2.Mid range GPU ~200mm2 - Build it for multi die approach with additional circuitry for data hopping to far die but without parts like multimedia,PCIe etc.It is full GPU in traditional sense.Build die with multimedia etc.Put dies with HBM on interposer.
3. Enthusiast level GPU ~400mm2 - Double 2. step
4. High end level GPU 600-1000 mm2 - Repeat 2. step until you run out of space,power,market or any combination of those three.
Success of this approach is how smart you are with interposer.Interposer itself becomes extension to both EACH DIE AND CLUSTER itself.
Or to put it this way one CANNOT function without the other.

@maddie May I point out that you should not get too much hanged on layout of current chips.I know that data locality is what bothers you,but you have to remember that interposer on Fiji routed same stuff that Hawaii routed through PCB in basically same way.It was in no way extension of Fiji itself,"just fancy PCB".

I think the real question is what evidence can we find of amd looking into this approach? By all evidence it seems amd isn't looking into this and is content using 2 chips and harvest dies for next gen.

Given a new architecture and very new advanced process, I really imagine the new chips will entice buyers easily. I'm buying a new high end chip on a new node. I know many others will, just because.

I like the idea, I just am trying to see the evidence of amd approaching this. Unless it's heavily guarded but I also think it would give amd such and advantage that we would be hearing similar rumors out of nvidia camp Eben if they were behind in bringing the tech to market.

MrTeal · Jan 19, 2016

hrga225 said:
Mind you I can be completely wrong!
But to me it makes most sense.As chip you build grows, also grow Compute engines, Command processor,Scheduler, Global Data Share and L2 Cache.So you are going to waste more energy shuffling data across chip no matter what.So why waste resources on building multiple mask sets when you can use already researched methods that will also come in handy for different product (example: APU).

So, it seems you'd have each chip with it's own memory controllers, a couple HBM stacks, and L2 slices. With GCN (at least early GCN) each MC has its own L2 block and 64 byte wide connection to the crossbar. Additionally, some of the other parts of the chip have direct access to memory, though my limited understanding doesn't see a reason it's infeasible that a new version could make the L2 fully inclusive and all to the memory controllers go through it. I don't have a good feel for how wide the crossbar would have to be though, or is in current chips. Does it just need to be 512 bits wide?

If the command processor is replicated on every chip, what exactly does the PCIe phy talk to?

maddie · Jan 19, 2016

tential said:
I think the real question is what evidence can we find of amd looking into this approach? By all evidence it seems amd isn't looking into this and is content using 2 chips and harvest dies for next gen.

Given a new architecture and very new advanced process, I really imagine the new chips will entice buyers easily. I'm buying a new high end chip on a new node. I know many others will, just because.

I like the idea, I just am trying to see the evidence of amd approaching this. Unless it's heavily guarded but I also think it would give amd such and advantage that we would be hearing similar rumors out of nvidia camp Eben if they were behind in bringing the tech to market.

A non-confrontational question.

Have you read the AMD research papers and presentations about using interposers to dis-integrate complex chips?

hrga225 · Jan 19, 2016

tential said:
I think the real question is what evidence can we find of amd looking into this approach? By all evidence it seems amd isn't looking into this and is content using 2 chips and harvest dies for next gen. Given a new architecture and very new advanced process, I really imagine the new chips will entice buyers easily. I'm buying a new high end chip on a new node. I know many others will, just because. I like the idea, I just am trying to see the evidence of amd approaching this. Unless it's heavily guarded but I also think it would give amd such and advantage that we would be hearing similar rumors out of nvidia camp Eben if they were behind in bringing the tech to market.

You should reread OP and few post after that.I don't expect this to hit consumer market till H2 2019 and that was me being optimistic.

MrTeal said:
So, it seems you'd have each chip with it's own memory controllers, a couple HBM stacks, and L2 slices. With GCN (at least early GCN) each MC has its own L2 block and 64 byte wide connection to the crossbar. Additionally, some of the other parts of the chip have direct access to memory, though my limited understanding doesn't see a reason it's infeasible that a new version could make the L2 fully inclusive and all to the memory controllers go through it. I don't have a good feel for how wide the crossbar would have to be though, or is in current chips. Does it just need to be 512 bits wide?

Yes, that's right. Although I think Xbar won't cut it.Some type of hybrid interconnect fabric should be used.

MrTeal said:
If the command processor is replicated on every chip, what exactly does the PCIe phy talk to?

I assume simple occupancy mechanism could be used.

tential · Jan 19, 2016

No I haven't. But I mean, I'm not saying it's not possible I'm just wondering if amd will actually do it this gen?

It's clear that amd is interested in using multiple smaller chips to give more performance and I'm sure once I read that paper or a summary of it I'll see even more so how interested amd is in it.

We've seen amd use an interposer already I mean, I'm not doubting pieces are there. I'm just wondering if there is anything concrete, or a list of all of the bread crumbs that leads to this.

maddie · Jan 19, 2016

tential said:
No I haven't. But I mean, I'm not saying it's not possible I'm just wondering if amd will actually do it this gen?

It's clear that amd is interested in using multiple smaller chips to give more performance and I'm sure once I read that paper or a summary of it I'll see even more so how interested amd is in it.

We've seen amd use an interposer already I mean, I'm not doubting pieces are there. I'm just wondering if there is anything concrete, or a list of all of the bread crumbs that leads to this.

In my case, nothing concrete as in proof. Unlike hrga225, I see this being imminent. Every single piece of tech needed have been researched and used. The advantages to cost reduction and performance increase is so large that once this is possible, competitive pressures will force its adoption. An arms race situation. I think that time has arrived, at least for AMD.

Portable IP blocks within a monolithic unit.
When Ryan Smith wrote the Polaris article he specifically mentioned that AMD did not like the GCN 1, 2 ,etc naming, as they have the ability to swap the IP blocks as necessary leading to intermediary designs/layouts. They don't have to redesign a whole GPU to modify part of the design.

Ability to use interposers with microbumps and at cost effective points for desktop GPUs.
Fiji.

Once an interposer is in use, it is cheaper to fab smaller sub-units and use the interposer to reintegrate them. Based on rough estimates I am tending to believe that very early in a node, it might even allow the interposer + HBM to cost the same as GDDR5. A 300mm^2 die vs 2 150mm^2 ones in yield, much less a 600mm^2 + one.
First research paper below.

Allows bigger parts earlier in a node than the traditional monolithic methods.

One design needed to cover a very wide range of performance and thus products.

You can go past the traditional 600mm^2 monolithic limit.

The only big question is if AMD can design a split GPU.
The IP blocks spoken about earlier have to be pretty self contained with limited as possible overall integration, in order for AMD to be able to mix them easily and not redesign the entire chip.
This leads to the multi-die concept having some ip blocks on 1st die layout and the rest on a 2nd die layout. You need at least one of each to assemble a GPU.
The interposer in AMD's words: "Interposer Will Be The SOC With Multiple 3D Components."

They have the experience now.
They have the technology now.
They have the need now.
Why wait?

Links:

Enabling Interposer-Based Disintegration of Multi-core Processors
http://www.eecg.toronto.edu/~enright/Kannan_MICRO48.pdf

NoC Architectures for Silicon Interposer Systems
http://www.eecg.toronto.edu/~enright/micro14-interposer.pdf

Die Stacking is Happening
http://theconfab.com/wp-content/uploads/2014/confab_jun14_die_stacking_is_happening.pdf

Xilinx Stacked Silicon Interconnect Technology Delivers Breakthrough FPGA Capacity Bandwidth and Power Efficiency
http://www.xilinx.com/support/docum...0_Stacked_Silicon_Interconnect_Technology.pdf

maddie · Jan 19, 2016

hrga225 said:
You should reread OP and few post after that.I don't expect this to hit consumer market till H2 2019 and that was me being optimistic.

Yes, that's right. Although I think Xbar won't cut it.Some type of hybrid interconnect fabric should be used.

I assume simple occupancy mechanism could be used.

This might explain our different approaches to the multi-die solution.

I think we have a good chance of this tech being introduced this year, thus I'm looking at rough and ready ways of achieving it.

You, I think, are being more elegant in your approach. Much more symmetry.

MrTeal · Jan 19, 2016

maddie said:
Portable IP blocks within a monolithic unit.
When Ryan Smith wrote the Polaris article he specifically mentioned that AMD did not like the GCN 1, 2 ,etc naming, as they have the ability to swap the IP blocks as necessary leading to intermediary designs/layouts. They don't have to redesign a whole GPU to modify part of the design.

When he said that he was talking about the combinations of different bits like new encoders, etc in relation to naming conventions, where GCN1.2 Tonga might not have the same features as GCN1.2 Fiji. The SP might be the same, but the overall package of IP (what they call the macro-architecture) might not be. They do use a lot of automatic placement tools and would have cells for things like individual compute units, but you make producing a new GPU sound a little simpler than it would be in reality.

Techhog · Jan 19, 2016

maddie said:
In my case, nothing concrete as in proof. Unlike hrga225, I see this being imminent. Every single piece of tech needed have been researched and used. The advantages to cost reduction and performance increase is so large that once this is possible, competitive pressures will force its adoption. An arms race situation. I think that time has arrived, at least for AMD.

Portable IP blocks within a monolithic unit.
When Ryan Smith wrote the Polaris article he specifically mentioned that AMD did not like the GCN 1, 2 ,etc naming, as they have the ability to swap the IP blocks as necessary leading to intermediary designs/layouts. They don't have to redesign a whole GPU to modify part of the design.

Ability to use interposers with microbumps and at cost effective points for desktop GPUs.
Fiji.

Once an interposer is in use, it is cheaper to fab smaller sub-units and use the interposer to reintegrate them. Based on rough estimates I am tending to believe that very early in a node, it might even allow the interposer + HBM to cost the same as GDDR5. A 300mm^2 die vs 2 150mm^2 ones in yield, much less a 600mm^2 + one.
First research paper below.

Allows bigger parts earlier in a node than the traditional monolithic methods.

One design needed to cover a very wide range of performance and thus products.

You can go past the traditional 600mm^2 monolithic limit.

The only big question is if AMD can design a split GPU.
The IP blocks spoken about earlier have to be pretty self contained with limited as possible overall integration, in order for AMD to be able to mix them easily and not redesign the entire chip.
This leads to the multi-die concept having some ip blocks on 1st die layout and the rest on a 2nd die layout. You need at least one of each to assemble a GPU.
The interposer in AMD's words: "Interposer Will Be The SOC With Multiple 3D Components."

They have the experience now.
They have the technology now.
They have the need now.
Why wait?

Links:

Enabling Interposer-Based Disintegration of Multi-core Processors
http://www.eecg.toronto.edu/~enright/Kannan_MICRO48.pdf

NoC Architectures for Silicon Interposer Systems
http://www.eecg.toronto.edu/~enright/micro14-interposer.pdf

Die Stacking is Happening
http://theconfab.com/wp-content/uploads/2014/confab_jun14_die_stacking_is_happening.pdf

Xilinx Stacked Silicon Interconnect Technology Delivers Breakthrough FPGA Capacity Bandwidth and Power Efficiency
http://www.xilinx.com/support/docum...0_Stacked_Silicon_Interconnect_Technology.pdf

Here's a link for you to look at:
http://www.sciencedaily.com/terms/confirmation_bias.htm

maddie · Jan 19, 2016

MrTeal said:
When he said that he was talking about the combinations of different bits like new encoders, etc in relation to naming conventions, where GCN1.2 Tonga might not have the same features as GCN1.2 Fiji. The SP might be the same, but the overall package of IP (what they call the macro-architecture) might not be. They do use a lot of automatic placement tools and would have cells for things like individual compute units, but you make producing a new GPU sound a little simpler than it would be in reality.

I disagree. Here is the exact quote from the article.

To that end the Polaris architecture will encompass a few things: the fourth generation Graphics Core Next core architecture, RTG's updated display and video encode/blocks, and the next generation of RTG's memory and power controllers. Each of these blocks is considered a seperate IP by RTG, and as a result they can and do mix and match various versions of these blocks across different GPUs

The core architecture is specifically mentioned as well as memory and power as separate IP blocks.

If I was making it sound simple, then my fault. What I'm saying is that it's not quite as difficult as most are claiming. Not quite the same thing. Especially as AMD has developed all the tech for this to happen.

Some even say practically impossible without the slightest examination of what it entails. You get these ignorant cries of latency being a problem from people who never read a single research paper. The good thing is that every criticism motivates me to learn more and for the latency issue, I found a Xilinx paper from 2012 which claimed a 1 ns latency between die on the interposer. Latency can no longer be a barrier for the pessimists.

hrga225 · Jan 19, 2016

maddie said:
This might explain our different approaches to the multi-die solution.

I think we have a good chance of this tech being introduced this year, thus I'm looking at rough and ready ways of achieving it.

You, I think, are being more elegant in your approach. Much more symmetry.

Like I said in one of previous post,I get where you're aiming at.

Personally,I think this tech will be used on the very end of 14nm node as a pipe cleaner for next one,but only in enterprise solutions.That's why I think solution has to be robust from the get go.You call my approach elegant,I call it bloody hard.
And thank you!

MrTeal · Jan 19, 2016

maddie said:
I disagree. Here is the exact quote from the article.

The core architecture is specifically mentioned as well as memory and power as separate IP blocks.

You're misunderstanding me. AMD's discomfort with the naming conventions isn't because they have the ability to swap blocks in and out of chips, but because around the same basic compute unit version they might have different IP blocks and so "GCN1.2" chips might not be homogeneous in their feature set. The example given is that while Tonga and Fiji are both GCN1.2 (or GCN3), one has HVEC while the other doesn't. It's the extending of the CU version to cover the whole chip that is the problem. While AMD definitely does have the ability to match GCN1, 2 , etc CUs with different other blocks, the next sentence saying they don't have to redesign to whole GPU to modify part of it is a pretty big simplification.

Yes, they're separate IP. You could mix and match GCN CU's from different versions, different other features, say leave off a XDMA from a low end part, etc while building a new chip. That doesn't necessarily mean that you can do something like grab the new MC and L2 from Polaris, quickly toss it on a respin of Hawaii, and sell that into the midrange market. Maybe you could back port it, or say update Hawaii with HVEC and HDMI2.0, but that doesn't mean it'd be a quick job.

boozzer · Jan 19, 2016

my question is this, can this be done for cpu cores also?

hrga225 · Jan 19, 2016

boozzer said:
my question is this, can this be done for cpu cores also?

Yes.Paper in link in 1# talks of building 64 core CPU using same approach.

Multiple dies acting as one on interposer

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Lifer

Member

Diamond Member

Diamond Member

Member

Diamond Member

Member

Diamond Member

Diamond Member

Diamond Member

Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Platinum Member

Diamond Member

Member

Diamond Member

Golden Member

Member