Zen 6 Speculation Thread


Joe NYC

Diamond Member
Jun 26, 2021
L3 <-> L3 transfer perhaps?

I wonder if that will ever be solved in a way that fits satisfactorily into how AMD handles L3...

It would be a "nice to have" on desktop, but on server it would only make sense if the entire CPU worked as one unit, rather than as various virtualized subsets. So there are probably too few big use cases in the overall picture.
 

adroc_thurston

Diamond Member
Jul 2, 2023
Joe NYC said:
I wonder if that will ever be solved in a way that fits satisfactorily into how AMD handles L3...

It would be a "nice to have" on desktop, but on server it would only make sense if the entire CPU worked as one unit, rather than as various virtualized subsets. So there are probably too few big use cases in the overall picture.
NUCA is nice and elegant.
Forget about that.
 

Joe NYC

Diamond Member
Jun 26, 2021
Yeah, I don't get the obsession with making your L3 crappy and making coherency a nightmare.

Also, it is becoming a moot point after AMD moved:
- from a 16 MB L3 pool in Bergamo
- to a 32 MB L3 pool in Turin
- to a 128 MB L3 pool in Venice
- to maybe a >200 MB L3 pool in Florence
 

LightningZ71

Platinum Member
Mar 10, 2017
More bumps = more connections. Either wider data pathways for existing layouts, or the layouts are changing and they need more wires to connect more chiplets. Maybe there will be a 4-chiplet package on desktop with a separate iGPU chiplet?
 

Joe NYC

Diamond Member
Jun 26, 2021
LightningZ71 said:
More bumps = more connections. Either wider data pathways for existing layouts, or the layouts are changing and they need more wires to connect more chiplets. Maybe there will be a 4-chiplet package on desktop with a separate iGPU chiplet?

Or a separate NPU chiplet.
 

BorisTheBlade82

Senior member
May 1, 2020
Or daisy-chaining of CCDs </s>

But yes, they tripled RAM bandwidth to around 1.6 TByte/s for the top end. With 8 CCDs you'd need an interconnect at least 200 GByte/s wide per CCD in order to saturate that, and that is with each CCD demanding an equal share. Current GMI-Wide delivers 128 GByte/s (read) IIRC.
So 256 GByte/s per CCD or even more doesn't seem like overkill to me.
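
A quick back-of-the-envelope check of that share (a minimal sketch; the 1.6 TByte/s and 128 GByte/s figures are the estimates quoted above, not confirmed specs):

```python
# Back-of-the-envelope check of the per-CCD bandwidth share quoted above.
# The 1.6 TByte/s and 128 GByte/s figures are the poster's estimates,
# not confirmed specs.

dram_bw_gb_s = 1600         # rumored top-end DRAM bandwidth (1.6 TByte/s)
ccd_count = 8               # CCDs assumed to share it equally
gmi_wide_read_gb_s = 128    # current GMI-Wide read bandwidth (IIRC figure)

per_ccd_gb_s = dram_bw_gb_s / ccd_count
print(f"Equal share per CCD: {per_ccd_gb_s:.0f} GByte/s")        # -> 200
print(f"Current GMI-Wide (read): {gmi_wide_read_gb_s} GByte/s")  # falls short
print(f"Shortfall per CCD: {per_ccd_gb_s - gmi_wide_read_gb_s:.0f} GByte/s")
```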
 

mmaenpaa

Member
Aug 4, 2009
AFAIK NPUs are also better regarding time to first token, i.e. execution latency, and they work better with small batch sizes. For many applications and single-customer use cases this is helpful. But for big number crunching it should be better to move toward the GPU in the long term. The GPU also has massive support from a big and wide memory system; replicating that for an NPU is a waste of sand.

But funnily enough, doesn't Qualcomm add a better link between the GPU and the NPU to move matrix computations to the NPU (the GPU does not support such acceleration)?

Regarding software:
HW differences could be abstracted away by HALs and APIs.
I have been thinking about getting a Copilot+ laptop, but frankly, are there any "real" uses/programs for the NPU yet? So far I have not spotted anything useful. For example, I would like the Copilot app on Windows to actually use the NPU, or the Copilot add-ins in Office.
 

marees

Golden Member
Apr 28, 2024
mmaenpaa said:
I have been thinking about getting a Copilot+ laptop, but frankly, are there any "real" uses/programs for the NPU yet? So far I have not spotted anything useful. For example, I would like the Copilot app on Windows to actually use the NPU, or the Copilot add-ins in Office.
Spellcheck
 

adroc_thurston

Diamond Member
Jul 2, 2023
Spellcheck doesn't need any sort of AI. Checking grammar needs to be a bit smarter (though nowhere near needing the 50 TOPS that Copilot requires), but spellcheck has been around since before CPUs went 32-bit.
yeah but they're gonna do spellcheck using a hugeass xformer eating 4GB of your DRAM just for that.
welcome to the future, gramps.
 

basix

Senior member
Oct 4, 2024
Joe NYC said:
L3 <-> L3 transfer perhaps?
Compared to IFOP, you can do that now through a wider, faster, lower-latency interface to the IOD ;)

BorisTheBlade82 said:
Or daisy-chaining of CCDs </s>

But yes, they tripled RAM bandwidth to around 1.6 TByte/s for the top end. With 8 CCDs you'd need an interconnect at least 200 GByte/s wide per CCD in order to saturate that, and that is with each CCD demanding an equal share. Current GMI-Wide delivers 128 GByte/s (read) IIRC.
So 256 GByte/s per CCD or even more doesn't seem like overkill to me.
That is a very interesting idea, indeed. For Zen 6 I do not expect something like that to happen, and for Zen 7 I think not as well (16/33C CCDs, a bigger L3$, and simply faster cores are already a decent enough update). But Zen 7 could still introduce it (core-count mania). Would be sick to see a 512C Zen 7 SKU ;)

As the beachfront of the IOD is limited, daisy-chaining makes a lot of sense in the mid to long term. It is only a few hundred GByte/s if you put 2x CCDs in series (see the sketch after this list). Such a concept opens the door to very high core-count scaling without adding too much cost (much bigger CCDs, much more IOD area, ...).
  • Even at 512 GByte/s it is not an issue; the power draw is still much lower than 128 GByte/s over an existing IFOP interface (~10x less power required).
  • RDNA3 MCDs already delivered ~900 GByte/s per chiplet.
  • Zen 7 will probably introduce an outsourced L3$ on a bottom 3D-stacked die. Adding 2x IF PHYs on two sides of this base die (for daisy-chaining), which gets manufactured in an older node like N4, would not hurt regarding costs.
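
To make the daisy-chain load concrete, a minimal sketch (the 200 GByte/s per-CCD demand reuses the equal-share figure from earlier in the thread; chain depths are illustrative assumptions). The point is that the hop nearest the IOD has to carry the traffic of every CCD behind it:

```python
# Why daisy-chaining is hungriest on the hop nearest the IOD: that link
# must carry the traffic of every CCD behind it. The 200 GByte/s per-CCD
# demand is the equal-share figure from earlier in the thread; the chain
# depths are illustrative assumptions, not a leaked topology.

per_ccd_demand_gb_s = 200

for chain_depth in (1, 2, 3):
    # Hop i (0 = nearest the IOD) carries all CCDs at positions i..end.
    hop_loads = [per_ccd_demand_gb_s * (chain_depth - i)
                 for i in range(chain_depth)]
    print(f"{chain_depth} CCD(s) in series -> per-hop load: {hop_loads} GByte/s")
# 2 CCDs in series -> first hop ~400 GByte/s, second ~200 GByte/s,
# i.e. "a few hundred GByte/s" as stated above.
```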
 

511

Diamond Member
Jul 12, 2024
basix said:
That is a very interesting idea, indeed. For Zen 6 I do not expect something like that to happen, and for Zen 7 I think not as well (16/33C CCDs, a bigger L3$, and simply faster cores are already a decent enough update). But Zen 7 could still introduce it (core-count mania). Would be sick to see a 512C Zen 7 SKU ;)
512C? I totally doubt this; with the meager density gains they'd have to make the package significantly larger. 384C seems possible.
 

ToTTenTranz

Senior member
Feb 4, 2021
mmaenpaa said:
I have been thinking about getting a Copilot+ laptop, but frankly, are there any "real" uses/programs for the NPU yet?

You have AMD's GAIA and Intel's OpenVINO, which can integrate Ollama. Both can get you using an NPU for running LLMs on Windows, and with that you can also build agents.

If you're willing to dedicate a couple of hours to setting this up, you can get an NPU to run LLMs for you locally with lower power consumption than running them on the iGPU.
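
For illustration, once one of those stacks serves a model through Ollama, talking to it is just a local HTTP call. A minimal sketch, assuming an Ollama-compatible server on the default port and an illustrative model name; whether the NPU actually services the request depends on the GAIA/OpenVINO backend doing the offload:

```python
# Minimal sketch: query a locally served LLM via Ollama's REST API.
# Assumes an Ollama-compatible server on the default port (11434).
# The model name is illustrative; NPU offload depends on the backend
# (e.g. GAIA or OpenVINO) actually routing inference to the NPU.
import json
import urllib.request

payload = {
    "model": "llama3.2",   # any model you have pulled locally
    "prompt": "Summarize why NPUs suit low-batch inference.",
    "stream": False,       # return one JSON blob instead of chunks
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```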


EDIT: even easier than using OpenVINO, Intel has the AI Playground app that also makes use of its NPUs:

Check out a demo of the app here, at the timestamp:



mmaenpaa said:
For example, I would like the Copilot app on Windows to actually use the NPU, or the Copilot add-ins in Office.
You could try wiring a locally running LLM into Outlook and Word; it's supposedly possible, but IIRC it's not easy. Microsoft isn't super interested in letting people off the hook for the $30/month full Copilot M365 experience.
 