Discussion Next Navi GPU specifications (Hypothetical!)

extide · Jul 31, 2019

Nvidia has a whilepaper about chiplet GPU's -- it's definitely possible, for graphics workloads. I am sure we will see it sooner or later. Check it out: https://research.nvidia.com/publication/2017-06_MCM-GPU:-Multi-Chip-Module-GPUs it's a pretty good read, honestly.
That paper is from 2017, and I am sure AMD knows about it.

Yotsugi · Jul 31, 2019

extide said:
Nvidia has a whilepaper about chiplet GPU's

MCM != chiplet.

extide said:
it's definitely possible, for graphics workloads.

Did you even read the paper (and workloads in it)?

extide · Aug 1, 2019

Yotsugi said:
MCM != chiplet.

Not all MCM's are chiplet but all chiplet packages are MCM's. They are talking about four compute dies on a single package (along with HBM in this example but that's not strictly necessary). So, your statement here is pointless.

Yotsugi said:
Did you even read the paper (and workloads in it)?

Yes, I have read the whole thing, it's been a while but yes. It does seem they are using all compute benchmarks here, I thought there were some graphics workload examples too. In any case, the optimizations that they discussed in the article should apply just fine to graphics workloads as well.

You are basically just crapping all over this thread and not actually contributing anything useful, so it's probably best to just ignore your posts on this thread anyways.

Yotsugi · Aug 1, 2019

extide said:
Not all MCM's are chiplet but all chiplet packages are MCM's.

W-rong.
I can do chiplets in 3DSoIC and it's not gonna be MCM.

extide said:
It does seem they are using all compute benchmarks here

They didn't do that just because.
They did that for a reason.

extide said:
In any case, the optimizations that they discussed in the article should apply just fine to graphics workloads as well.

Graphics isn't FP64 in vacuum, hammer that into your head already.

DisEnchantment · Aug 1, 2019

https://lists.freedesktop.org/archives/amd-gfx/2019-August/037821.html

Navi12 Support coming soon.
No new info in the commits, basically similar config with other Navi products

Navi10 = GFX10_1_0
Navi12 = GFX10_1_1 --> Does not have LDS Misaligned Bug? What does it mean?
Navi14 = GFX10_1_2

NostaSeronx · Aug 2, 2019

DisEnchantment said:
What does it mean?

Doesn't have "Some GFX10 bug with misaligned multi-dword LDS access in WGP mode."

Ottonomous · Aug 2, 2019

NostaSeronx said:
Doesn't have "Some GFX10 bug with misaligned multi-dword LDS access in WGP mode."

Workgroup processing problem with LDS allocation? Aren't workgroups effectively limited by the LDS size? Why would there be misalignment in some forms of addressing?

extide · Aug 2, 2019

Yotsugi said:
W-rong.
I can do chiplets in 3DSoIC and it's not gonna be MCM.

TSMC's press release literally says "Bonding chips together in a 3D structure will allow chip makers to utilise a multi-chip design while benefitting from low latency interconnects and fewer of the performance downsides that are seen in some of today's multi-chip products. "
You are really grasping at straws there. I mean MCM is literally an assembly with multiple chips on it.

So what's your point here, I mean even you said there will be multi chip GPU's in 5-7 years+ Obviously these guys are working on them right now. We wouldn't have a 2 year old research paper in the public otherwise. I really think we will see something a fair bit sooner than that -- more in the 3-4ish year time and I wouldn't be surprised if it wasn't AMD that brings out the first product. Or is your point that we will never see a MCM graphics GPU? "Never" is a pretty bold claim to make, and you'd need to bring some pretty solid evidence to back a statement like that. I mean, honestly, 3dfx was doing multi-chip graphics back in the 90's, although it wasn't the most elegant solution, it did work. Compared to compute, graphics adds some fixed function stuff like geometry, texturing, rasterization, which sure, they make it harder but not impossible. I mean honestly all you are doing in this thread is telling people they are wrong and making snide comments while not adding anything useful or thoughtful. Why don't you try that out?

Yotsugi · Aug 2, 2019

extide said:
Bonding chips together in a 3D structure will allow chip makers to utilise a multi-chip design while benefitting from low latency interconnects and fewer of the performance downsides that are seen in some of today's multi-chip products

MCM in OSAT terminology is multiple chips on organic carrier.

extide said:
I mean even you said there will be multi chip GPU's in 5-7 years

That's me being very optimistic.

extide said:
I mean, honestly, 3dfx was doing multi-chip graphics back in the 90's, although it wasn't the most elegant solution, it did work

We're not talking mGPU.
We're talking fully coherent multidie solutions.

Glo. · Aug 2, 2019

Yotsugi said:
We're not talking mGPU.
We're talking fully coherent multidie solutions.

And this is the only way we can get chiplet based graphics for games.

extide · Aug 2, 2019

Yotsugi said:
MCM in OSAT terminology is multiple chips on organic carrier.

So those guys wouldn't consider Pentium Pro or POWER5 to be MCM?

Splitting hairs here, surely.

Yotsugi said:
We're talking fully coherent multidie solutions.

Glo. said:
And this is the only way we can get chiplet based graphics for games.

And, funnily enough, that nvidia research paper fully covered this, including coming up with some basic optimizations to help overcome the reduced bandwidth and additional latency to 'remote' caches and memory.

Yotsugi · Aug 2, 2019

extide said:
So those guys wouldn't consider Pentium Pro or POWER5 to be MCM?

These are too ancient for modern OSAT terminology.

extide said:
funnily enough, that nvidia research paper fully covered this

Unfortunately for you, it didn't cover jobs that don't easily scale to n chips/boards/nodes/you name it.
Jobs like realtime computer graphics.

extide · Aug 2, 2019

Yotsugi said:
jobs that don't easily scale to n chips/boards/nodes/you name it.
Jobs like realtime computer graphics.

Why don't they? Specifically

Yotsugi · Aug 2, 2019

extide said:
Why don't they? Specifically

Very tangible synchronisation and memory access overhead for anything graphics that could easily kill perf.
It's kind of hurts compute too, particularly training (hence why we use xboxhueg dies like V100/Spring Crest and weird scale up setups like DGX-2/whatever Nervana is doing), but it's manageable there.

extide · Aug 2, 2019

Yotsugi said:
Very tangible synchronisation and memory access overhead for anything graphics that could easily kill perf.
It's kind of hurts compute too, particularly training (hence why we use xboxhueg dies like V100/Spring Crest and weird scale up setups like DGX-2/whatever Nervana is doing), but it's manageable there.

Compute suffers just as much from synchronisation and memory access overhead. (Love how you added "Very tangible" to your statement, ) This is all discussed in the paper and like I have mentioned at least two times before this there are some simple optimizations that they presented in the paper to assist with overcoming those issues. You don't need to entirely eliminate the overhead, just reduce it enough that you net more performance because of the greater overall GPU resources you have available. The paper presents it very well -- they talk about the performance of the largest possible gpu that you could physically build and then ways to get more performance than that.

If you are trying to come up with a good argument as to why compute is easy to do on multi-chip, but graphics is not, you need to talk about problems that are unique to (or at least more difficult on) graphics. You keep mentioning that this is "impossible for graphics" but you haven't said anything specifically about how the additional aspects of graphics (stuff like geometry, rasterization, texturing, etc) cannot scale across multiple dies. Rasterization is already done in tiles on a lot of architectures, and each tile can be small and worked on independently. In fact, this approach was initially thought of to save on memory access and improve data locality -- exactly what we need here. Geometry and texturing can similarly be done by splitting the scene up into chunks. I do agree that graphics is more difficult, but I do not agree that is impossible.

Not sure why you mention DGX as an example of avoiding synchronisation and memory access overhead, as that uses multi gpu with NVLink and has much worse synchronisation and memory access overhead than even a hypothetical multi-chip gpu using on-package interconnects.

Also, I just want to add the whole discussion about what is or isn't MCM is pointless. The point is whether a GPU manufacturer has made the choice to invest into solutions to overcoming the hurdles that building a GPU out of multiple dies presents. For the purposes of this discussion it doesn't really matter if the dies are on an organic substrate, a ceramic one, a silicon interposer, or even a more exotic solution. Sure, the more exotic ones will probably allow better interconnects but that's not the point here, remember your argument is that multi die graphics GPU's are essentially impossible and that it will be 5-7+ years before we see any multi chip gpu solution.

Yotsugi · Aug 2, 2019

extide said:
Compute suffers just as much from synchronisation and memory access overhead

Nowhere near as much as graphics.
Try scaling anything real-time to at least multiple GPUs.

extide said:
If you are trying to come up with a good argument as to why compute is easy to do on multi-chip, but graphics is not, you need to talk about problems that are unique to (or at least more difficult on) graphics

Easily but not here and you can always annoy the likes of sebbbi on Twitter if you want to.

extide said:
specifically about how the additional aspects of graphics (stuff like geometry, rasterization, texturing, etc)

Computer graphics isn't FF stuff only.

extide said:
Geometry and texturing can similarly be done by splitting the scene up into chunks.

Congrats you're doing good old mGPU.
Next.

extide said:
Not sure why you mention DGX as an example of avoiding synchronisation and memory access overhead

Where did I say "avoiding"?
Your headcanon doesn't work here.

extide · Aug 2, 2019

Yotsugi said:
Nowhere near as much as graphics.
Try scaling anything real-time to at least multiple GPUs.

No substance there.

Yotsugi said:
Easily but not here and you can always annoy the likes of sebbbi on Twitter if you want to.

Nope, none here either.

Yotsugi said:
Where did I say "avoiding"?
Your headcanon doesn't work here.

Let's take a look at the statement:

Yotsugi said:
Very tangible synchronisation and memory access overhead for anything graphics that could easily kill perf.
It's kind of hurts compute too, particularly training (hence why we use xboxhueg dies like V100/Spring Crest and weird scale up setups like DGX-2/whatever Nervana is doing), but it's manageable there.

You basically say "There are synchronization and memory access overhead for graphics, that kind of hurt compute, too, so we use big monolithic gpus like V100" This part makes sense, because big monolithic gpus avoid synchronization and memory access overhead you get from multi die. Then you add "and weird scale up setups like DGX-2" which flys completely against the first half of the statement because it is pretty much worst case scenarios for synchronization and memory access overhead.

In any case, NO SUBSTANCE! You are basically making a hand wavey argument that it doesn't work "because" -- and it's really starting to seem like you don't understand enough about exactly how graphics workloads work to make a solid argument as to exactly why they can't be scaled across multiple dies. But oh the latency!!! Then make sure you have good data locality and build intelligent local caches to hide that deficit. (Again, some good techniques are discussed in the whitepaper).

Yotsugi · Aug 2, 2019

extide said:
No substance there.

No, genuinely go try it.
Then try some CUDA.

extide said:
Nope, none here either

Do you expect me to write you some examples for free?
Jeez.

extide said:
which flys completely against the first half of the statement because it is pretty much worst case scenarios for synchronization and memory access overhead

It's like you never did anything that already scales to multiple nodes.
High bandwidth and coherent > p2p weirdness > going off the node.

extide said:
But oh the latency!!!

Wow, your headcanon is getting stronger.
Impressive in your desperation.
Very cute though.

extide · Aug 2, 2019

Yotsugi said:
No, genuinely go try it.
Then try some CUDA.

Do you expect me to write you some examples for free?
Jeez.

An example of how it is difficult to scale to multiple discrete GPUs is not an example of why it is harder or impossible to scale graphics than compute to multiple dies. Nvidia's research paper already proves that it is possible to scale compute across multiple dies, that's not even a question. You keep forgetting your argument, that graphics is somehow impossible to do on multiple dies.

Yotsugi said:
It's like you never did anything that already scales to multiple nodes.
High bandwidth and coherent > p2p weirdness > going off the node.

This is random gibberish. What are you even trying to say here? And why do you bring DGX-2 into the situation when you are trying to say that multi die graphics is impossible yet DGX-2 has more latency and less bandwidth than any multi-chip solution would ever have. People use DGX-2 because they figure out ways to mitigate the shortcomings of the platform in exchange for it's greater performance, which is, funnily enough, exactly what you'd have to do for multi-die. Kind of shooting yourself in the foot there.

Yotsugi said:
Wow, your headcanon is getting stronger.
Impressive in your desperation.
Very cute though.

I mean I suppose if you can't even formulate a cohesive argument you can resort to this... Or is this some sort of real-life example of the Dunning–Kruger effect where I am the one bringing a reference into the argument, and you are using nothing but hand wavey magic, but firing 'headcanon' at me, nice.

Yotsugi · Aug 2, 2019

extide said:
An example of how it is difficult to scale to multiple discrete GPUs is not an example of why it is harder or impossible to scale graphics than compute to multiple dies

It's the very same thing, except with tons more bandwidth between GPUs.

extide said:
This is random gibberish

You never doing any mGPU jobs doesn't make it gibberish.
What kills mGPU graphics hurts compute too.
Fortunately enough we mitigate that for some workloads by replacing PCIe p2p with something faster and coherent and then just making a fatter node.

extide said:
People use DGX-2 because they figure out ways to mitigate the shortcomings of the platform in exchange for it's greater performance

People use DGX-2 for those jobs that scale better with fatter nodes instead of node count.
You know, the good old scale-up versus scale-out.

NostaSeronx · Aug 3, 2019

imho, Graphics and Compute with chiplets would be rather easy to implement.

384-bit GDDR6 on 7nm Common die = 150 mm squared die
48 CUs on two 7nm GPU dies = 2x200 mm squared die
Total area = 550 mm squared

96 CUs @ 1.5 GHz + 384-bit @ 14 GHz
(2x3072)6144 ALUs + (2x192)384 TMUs + (2x64)128 ROPs

https://www.renesas.com/us/en/produ...y-dram/low-latency-high-bandwidth-memory.html

Guru · Aug 3, 2019

There is no need for chiplet design for gpu's as of right now for gaming workloads, because the drawbacks are quite big and the advantages small. In terms of compute, AI, chiplet design does make sense and it could be done without much of the negatives that are there for gaming.

Discussion Next Navi GPU specifications (Hypothetical!)

Senior member

Golden Member

Senior member

Golden Member

Golden Member

Diamond Member

Senior member

Senior member

Golden Member

Diamond Member

Senior member

Golden Member

Senior member

Golden Member

Senior member

Golden Member

Senior member

Golden Member

Senior member

Golden Member

Diamond Member

Senior member