Zen 6 Speculation Thread


Joe NYC

Diamond Member
Jun 26, 2021
L3 <-> L3 transfer perhaps?

I wonder if that will ever be solved in a way that fits satisfactorily into how AMD handles L3...

It would be a "nice to have" on desktop, but on server it would only make sense if the entire CPU worked as one unit, rather than as various virtualized subsets. So there are probably too few big use cases in the overall picture.
 

adroc_thurston

Diamond Member
Jul 2, 2023
Joe NYC said:
I wonder if that will ever be solved in a way that fits satisfactorily into how AMD handles L3...

It would be a "nice to have" on desktop, but on server it would only make sense if the entire CPU worked as one unit, rather than as various virtualized subsets. So there are probably too few big use cases in the overall picture.
NUCA is nice and elegant.
Forget about that.
 

Joe NYC

Diamond Member
Jun 26, 2021
Yeah, I don't get the obsession with making your L3 crappy and making coherency a nightmare.

Also, it is becoming a moot point after AMD moved:
- from a 16 MB L3 pool in Bergamo
- to a 32 MB L3 pool in Turin
- to a 128 MB L3 pool in Venice
- to maybe a >200 MB L3 pool in Florence
 

LightningZ71

Platinum Member
Mar 10, 2017
More bumps = more connections. Either wider data pathways for existing layouts, or the layouts are changing and they need more wires to connect more chiplets. Maybe there will be a 4-chiplet package on desktop with a separate iGPU chiplet?
 

Joe NYC

Diamond Member
Jun 26, 2021
LightningZ71 said:
More bumps = more connections. Either wider data pathways for existing layouts, or the layouts are changing and they need more wires to connect more chiplets. Maybe there will be a 4-chiplet package on desktop with a separate iGPU chiplet?

Or a separate NPU chiplet.
 

BorisTheBlade82

Senior member
May 1, 2020
Or daisy-chaining of CCDs </s>

But yes, they tripled RAM bandwidth to around 1.6 TByte/s for the top end. With 8 CCDs you'd need an interconnect at least 200 GByte/s wide per CCD in order to saturate that, and that is with each CCD demanding an equal share. Current GMI-Wide delivers 128 GByte/s (read) IIRC.
So 256 GByte/s per CCD or even more doesn't seem like overkill to me.
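
A quick back-of-the-envelope check of that share (a minimal sketch; the 1.6 TByte/s and 128 GByte/s figures are the estimates quoted above, not confirmed specs):

```python
# Back-of-the-envelope check of the per-CCD bandwidth share quoted above.
# The 1.6 TByte/s and 128 GByte/s figures are the poster's estimates,
# not confirmed specs.

dram_bw_gb_s = 1600         # rumored top-end DRAM bandwidth (1.6 TByte/s)
ccd_count = 8               # CCDs assumed to share it equally
gmi_wide_read_gb_s = 128    # current GMI-Wide read bandwidth (IIRC figure)

per_ccd_gb_s = dram_bw_gb_s / ccd_count
print(f"Equal share per CCD: {per_ccd_gb_s:.0f} GByte/s")        # -> 200
print(f"Current GMI-Wide (read): {gmi_wide_read_gb_s} GByte/s")  # falls short
print(f"Shortfall per CCD: {per_ccd_gb_s - gmi_wide_read_gb_s:.0f} GByte/s")
```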
 

mmaenpaa

Member
Aug 4, 2009
AFAIK NPUs are also better regarding time to first token, i.e. execution latency, and they work better with small batch sizes. For many applications and single-customer use cases this is helpful. But for big number crunching it should be better to move toward the GPU in the long term. The GPU also has massive support from a big and wide memory system; replicating that for an NPU is a waste of sand.

But funnily enough, doesn't Qualcomm add a better link between the GPU and the NPU to move matrix computations to the NPU (the GPU does not support such acceleration)?

Regarding software:
HW differences could be abstracted away by HALs and APIs.
I have been thinking about getting a Copilot+ laptop, but frankly, are there any "real" uses/programs for the NPU yet? So far I have not spotted anything useful. For example, I would like the Copilot app on Windows to actually use the NPU, or the Copilot add-ins in Office.
 

marees

Golden Member
Apr 28, 2024
mmaenpaa said:
I have been thinking about getting a Copilot+ laptop, but frankly, are there any "real" uses/programs for the NPU yet? So far I have not spotted anything useful. For example, I would like the Copilot app on Windows to actually use the NPU, or the Copilot add-ins in Office.
Spellcheck
 

adroc_thurston

Diamond Member
Jul 2, 2023
Spellcheck doesn't need any sort of AI. Checking grammar needs to be a bit smarter (though nowhere near needing the 50 TOPS that Copilot requires), but spellcheck has been around since before CPUs went 32-bit.
yeah but they're gonna do spellcheck using a hugeass xformer eating 4GB of your DRAM just for that.
welcome to the future, gramps.
 

basix

Senior member
Oct 4, 2024
Joe NYC said:
L3 <-> L3 transfer perhaps?
Compared to IFOP, you can do that now through a wider, faster, lower-latency interface to the IOD ;)

BorisTheBlade82 said:
Or daisy-chaining of CCDs </s>

But yes, they tripled RAM bandwidth to around 1.6 TByte/s for the top end. With 8 CCDs you'd need an interconnect at least 200 GByte/s wide per CCD in order to saturate that, and that is with each CCD demanding an equal share. Current GMI-Wide delivers 128 GByte/s (read) IIRC.
So 256 GByte/s per CCD or even more doesn't seem like overkill to me.
That is a very interesting idea, indeed. For Zen 6 I do not expect something like that to happen, and for Zen 7 I think not as well (16/33C CCDs, a bigger L3$, and simply faster cores are already a decent enough update). But Zen 7 could still introduce it (core-count mania). Would be sick to see a 512C Zen 7 SKU ;)

As the beachfront of the IOD is limited, daisy-chaining makes a lot of sense in the mid to long term. It is only a few hundred GByte/s if you put 2x CCDs in series (see the sketch after this list). Such a concept opens the door to very high core-count scaling without adding too much cost (much bigger CCDs, much more IOD area, ...).
  • Even at 512 GByte/s it is not an issue; the power draw is still much lower than 128 GByte/s over an existing IFOP interface (~10x less power required).
  • RDNA3 MCDs already delivered ~900 GByte/s per chiplet.
  • Zen 7 will probably introduce an outsourced L3$ on a bottom 3D-stacked die. Adding 2x IF PHYs on two sides of this base die (for daisy-chaining), which gets manufactured in an older node like N4, would not hurt regarding costs.
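
To make the daisy-chain load concrete, a minimal sketch (the 200 GByte/s per-CCD demand reuses the equal-share figure from earlier in the thread; chain depths are illustrative assumptions). The point is that the hop nearest the IOD has to carry the traffic of every CCD behind it:

```python
# Why daisy-chaining is hungriest on the hop nearest the IOD: that link
# must carry the traffic of every CCD behind it. The 200 GByte/s per-CCD
# demand is the equal-share figure from earlier in the thread; the chain
# depths are illustrative assumptions, not a leaked topology.

per_ccd_demand_gb_s = 200

for chain_depth in (1, 2, 3):
    # Hop i (0 = nearest the IOD) carries all CCDs at positions i..end.
    hop_loads = [per_ccd_demand_gb_s * (chain_depth - i)
                 for i in range(chain_depth)]
    print(f"{chain_depth} CCD(s) in series -> per-hop load: {hop_loads} GByte/s")
# 2 CCDs in series -> first hop ~400 GByte/s, second ~200 GByte/s,
# i.e. "a few hundred GByte/s" as stated above.
```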
 

511

Diamond Member
Jul 12, 2024
basix said:
That is a very interesting idea, indeed. For Zen 6 I do not expect something like that to happen, and for Zen 7 I think not as well (16/33C CCDs, a bigger L3$, and simply faster cores are already a decent enough update). But Zen 7 could still introduce it (core-count mania). Would be sick to see a 512C Zen 7 SKU ;)
512C? I totally doubt this; with the meager density gains they'd have to make the package significantly larger. 384C seems possible.
 

ToTTenTranz

Senior member
Feb 4, 2021
mmaenpaa said:
I have been thinking about getting a Copilot+ laptop, but frankly, are there any "real" uses/programs for the NPU yet?

You have AMD's GAIA and Intel's OpenVINO, which can integrate Ollama. Both can get you using an NPU for running LLMs on Windows, and with that you can also build agents.

If you're willing to dedicate a couple of hours to setting this up, you can get an NPU to run LLMs for you locally with lower power consumption than running them on the iGPU.
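
For illustration, once one of those stacks serves a model through Ollama, talking to it is just a local HTTP call. A minimal sketch, assuming an Ollama-compatible server on the default port and an illustrative model name; whether the NPU actually services the request depends on the GAIA/OpenVINO backend doing the offload:

```python
# Minimal sketch: query a locally served LLM via Ollama's REST API.
# Assumes an Ollama-compatible server on the default port (11434).
# The model name is illustrative; NPU offload depends on the backend
# (e.g. GAIA or OpenVINO) actually routing inference to the NPU.
import json
import urllib.request

payload = {
    "model": "llama3.2",   # any model you have pulled locally
    "prompt": "Summarize why NPUs suit low-batch inference.",
    "stream": False,       # return one JSON blob instead of chunks
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```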


EDIT: even easier than using OpenVINO, Intel has the AI Playground app that also makes use of its NPUs:

Check out a demo of the app here, at the timestamp:



mmaenpaa said:
For example, I would like the Copilot app on Windows to actually use the NPU, or the Copilot add-ins in Office.
You could try wiring a locally running LLM into Outlook and Word; it's supposedly possible, but IIRC it's not easy. Microsoft isn't super interested in letting people off the hook for the $30/month full Copilot M365 experience.
 