64-core EPYC Rome (Zen2) Architecture Overview?


amd6502

Senior member
Apr 21, 2017
Why would they ever do that?

They need a high-volume, mainstream-to-value APU, primarily for mobile and secondarily for desktop. Something plentiful and cheap to produce, similar to RR at ~5B transistors.

They could ride it out with an RR optimization and binning refresh, but that means no significant frequency improvements and no IPC improvements over a 2400G.

It would be nice if they could take some of those Zen2 improvements, namely the integer-core-related ones, and port them to 12nm. An RR successor really doesn't need to waste transistors on massively overkill FPUs or a large number of huge multilevel caches that need to talk to one another.
 

Atari2600

Golden Member
Nov 22, 2016
Even if you retired the request, the memory won't be ready to receive another request for many more cycles, and the controller knows this. I suppose you could have logic to remove entries from a queue in the controller on a cache hit. That sort of logic would have to work globally with all caches, which seems to unnecessarily mess with the normal operation of the cache-memory hierarchy.

You could be right - I had been under the impression that the newer DRAM schedulers allowed the CPU memory controller to send retirement signals.

Now, upon searching, I see no hard evidence of this.
 

Yotsugi

Golden Member
Oct 16, 2017
They need a high-volume, mainstream-to-value APU, primarily for mobile and secondarily for desktop.
So they will make a value N7 APU somewhere in late 2019 to early 2020.
It would be nice if they could take some of those Zen2 improvements, namely, the integer core related ones and port it to 12nm.
That's, that's not how it works.
 

Beemster

Member
May 7, 2018
Centaur barely has any eDRAM.
Also expensive.

It's a 16MB L4 buffer on EACH of the (8) Centaur chips (one for each memory channel), or 128MB if they did it on one hub chip.

https://www.anandtech.com/show/9567/the-power-8-review-challenging-the-intel-xeon-/7

This was on POWER8 and done in 22nm SOI. POWER9 was done in 14nm FinFET; I suspect the L4 buffer was doubled to 32MB. So I'd guess 256MB or perhaps larger on the ONE hub chip on Rome. POWER9 (SU) was released this summer, so AMD could have followed by designing their memory hub chip about a year after IBM. The Centaur chip was likely fabbed in East Fishkill, and IF AMD went with eDRAM, it is still likely produced there. GlobalFoundries has owned that process since mid-2016, and the timing would have allowed AMD to use it. It seems IBM chose eDRAM even though they disintegrated the memory controllers onto 8 separate chips for (SU) POWER9. Each chip can't be that large, so why did they not go with faster SRAM? It's packed with CMOS logic anyway.
CDIMM_scheme.jpg
 

Vattila

Senior member
Oct 22, 2004
For me, the really big question is whether AMD has the packaging capacity, and low enough cost, to take the chiplet design into the mainstream. If so, then we'll see reuse of their 7nm CPU chiplet for the Ryzen 3000 series. They only need a mini IO chiplet.

What is the easiest way for AMD to create this IO chiplet? They already have Raven Ridge. It has the dual-channel memory controller and IO needed. Just rip out the 4-core CCX and replace it with some external IF ports to connect the CPU chiplet. Except for that, and some adjustment to the layout, no redesign is needed. With this seemingly simple solution, Ryzen 3000 would come with a powerful iGPU as well, thus matching Intel's feature set and increasing sales with OEMs.

Drivers for this thing would be mostly based on Raven Ridge and ready to go.

Would it fit on the package? 14LP Raven Ridge sans CCX is 166 mm² (210 - 44). With the announced 15% density improvement for 12LP we get 144 mm², so let's say 150 mm² for good measure.
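For anyone who wants to play with the numbers, here's the same back-of-the-envelope arithmetic as a quick sketch. The die sizes and the 15% density figure are the ones quoted above; everything else is just division:

```python
# Back-of-the-envelope estimate for a 12LP IO die derived from Raven Ridge.
# Die sizes (mm^2) are the rough figures quoted in the post; treat as estimates.
RAVEN_RIDGE_14LP = 210.0     # full Raven Ridge die on 14LP
CCX_AREA = 44.0              # the 4-core CCX block to be removed
DENSITY_GAIN_12LP = 0.15     # GloFo's announced 15% density improvement

io_die_14lp = RAVEN_RIDGE_14LP - CCX_AREA            # 166 mm^2
io_die_12lp = io_die_14lp / (1 + DENSITY_GAIN_12LP)  # ~144 mm^2

print(f"IO die on 14LP: {io_die_14lp:.0f} mm^2")
print(f"IO die on 12LP: {io_die_12lp:.0f} mm^2 (call it ~150 mm^2 with margin)")
```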

9114301_c454443b5feb4f8465a65f148c87a626.png
 

Beemster

Member
May 7, 2018
With this seemingly simple solution, Ryzen 3000 would come with a powerful iGPU as well, thus matching Intel's feature set and increasing sales with OEMs.


That sounds good. And perhaps a different I/O chip for high-end gaming, replacing the iGPU with L4 cache and using an 8-core chiplet. All 14nm I/O chip designs would be pretty cheap these days at GF, I suspect.
 

Tuna-Fish

Golden Member
Mar 4, 2011
That sounds good. And perhaps a different I/O chip for high-end gaming, replacing the iGPU with L4 cache and using an 8-core chiplet. All 14nm I/O chip designs would be pretty cheap these days at GF, I suspect.

Unless there is a lot of that cache, I suspect it wouldn't be worth the latency hit for gaming.

Note that since the GPU/IO chip also needs at least 16x worth of PCIe interface (to match RR: 8x for the GPU connection, 4x for the southbridge connection and 4x for the M.2) and the IF interface is quite small, the exact same chip could also do double duty as a new lowest-end GPU in the AMD GPU lineup. Even though it would only have a 128-bit DDR4 memory interface, it would presumably have all the newest multimedia stuff, and so would make a nice chip for, for example, TR systems that don't need a lot of GPU power but do require something to run a desktop.
 

Zapetu

Member
Nov 6, 2018
Charlie's article (at SemiAccurate) has some interesting rumours about AMD Rome that might well be true. I made a new image of Rome's organic packaging based on it:
uTxAX7w.png

Basically he says that all IO is in the IOX chip, there is only one direct IF 2.0 connection from each chiplet to the IOX (no direct links between chiplets), and each chiplet has an 8-core CCX. He also says that there are no interposers of any kind and it's all organic packaging. We probably have to wait for Zen3 or even Zen4 for any new packaging techniques.

Because there is very little room to do any routing on the organic package, it's safe to say that the new IFOP 2.0 links still utilize some type of SerDes (serializer/deserializer). That adds some latency but might also provide power savings (compared to a parallel interface). Also, as many have said before, Infinity Fabric 2.0 might run at a higher clock speed than the memory clock. L4 cache is still pure speculation, and we don't even know whether AMD is using 14HP for the IOX or not.
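To see why SerDes is attractive on an organic package, here's a toy pin-count comparison. The link widths and clock rates below are made-up illustrative values, not actual IFOP 2.0 parameters:

```python
# Toy comparison: signal pins needed for a given one-way link bandwidth.
# All numbers are illustrative assumptions, not real IFOP 2.0 specs.
TARGET_GBPS = 50.0        # hypothetical per-link bandwidth target

# Wide parallel interface: many slow single-ended wires.
PARALLEL_GBPS_PER_PIN = 1.0
parallel_pins = TARGET_GBPS / PARALLEL_GBPS_PER_PIN

# SerDes: a few fast lanes, each a differential pair (2 pins).
SERDES_GBPS_PER_LANE = 12.5
serdes_pins = (TARGET_GBPS / SERDES_GBPS_PER_LANE) * 2

print(f"Parallel: ~{parallel_pins:.0f} signal pins")
print(f"SerDes:   ~{serdes_pins:.0f} signal pins (at the cost of SerDes latency)")
```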

Canard PC had the first rumour that Rome was going to be 64 cores, 256MB of L3 cache and 128 PCIe 4.0 lanes. Since they were right about the first and the third point, why wouldn't there be 256MB of L3?

Here are a couple of questions to think about:
  • If each chiplet has a local 32MB L3 cache, is an L4 cache (on the I/O die) smaller than 256MB (128MB or 64MB) even beneficial or needed? How much does it improve memory latency? Does it reduce cache misses by a significant amount? (See the sketch just after this list.)
  • If each chiplet contains just two 4-core CCXs, why didn't AMD use four 146 mm² chiplets with four 4-core CCXs each, instead of what they have done now (eight 73 mm² chiplets)? Do chiplets have to be 8-core CCXs for the current Rome design to make any sense? AMD didn't choose 8-core chiplets just because of yields or binning for higher clocks / better power efficiency. They must have had some architectural reasons behind it, right? 7nm Vega is a relatively big chip, so a 146 mm² chiplet would have been OK.
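On the first bullet, a crude average-latency model gives a feel for how big the L4 hit rate would have to be. Every latency and hit rate below is an invented placeholder, not a measured Rome number:

```python
# Crude model: is an L4 on the IO die worth it after an L3 miss?
# All latencies (ns) and hit rates are assumed placeholders.
L4_LATENCY = 30.0    # assumed: L3 miss that hits in an eDRAM L4 on the IO die
DRAM_LATENCY = 90.0  # assumed: L3 miss that goes all the way to DRAM

def post_l3_miss_latency(l4_hit_rate):
    """Average cost of an L3 miss when an L4 with this hit rate sits in the path."""
    miss = 1.0 - l4_hit_rate
    return l4_hit_rate * L4_LATENCY + miss * (L4_LATENCY + DRAM_LATENCY)

print(f"No L4:           {DRAM_LATENCY:.0f} ns per L3 miss")
for hit in (0.2, 0.4, 0.6):
    print(f"L4 hit rate {hit:.0%}: {post_l3_miss_latency(hit):.0f} ns per L3 miss")
```

With these placeholder numbers the L4 only pays for itself above roughly a one-third hit rate (a small or poorly-hit L4 actually adds latency), which is exactly why the size question matters.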
Here's a clean picture of Rome that can be used for routing diagrams or any other similar presentations. Please use any images freely as I have stated before.
ZC70Trm.png

Addition:
Information marked with brackets in the first picture (8C CCX, 256MB L3, L4 eDRAM cache) is still not 100% confirmed, and some of it may not be confirmed at all, but one can hope. In case anyone is wondering where I got the specs for the SP3 socket, AMD has this nice thermal design guide (page 8 gives the exact dimensions).

Edit: Uploaded the correct versions of the images. Removed the signature from the simpler image to lower the threshold for reuse by other people.
 

Vattila

Senior member
Oct 22, 2004
Halo products matter, even if they don't sell that many of them (why else would Intel struggle so hard with them?). Just imagine the headlines if the mid/lower-range AM4 product had 8 cores, and the halo one had 16.

If AMD goes for core-count advantage with Zen 2 against Skylake/CFL, I see that as a poor sign for competitive single-thread performance. Personally, I still expect 7nm Zen 2 to beat 14nm Skylake/CFL. Anything less would be poor planning against 10nm Ice Lake, which would have been here by now, had Intel not stumbled so badly. I may be disappointed, but if so, AMD must have fallen short of their targets. Remember, Lisa Su ended the recent "Next Horizon" event by pointing out that they are not aiming to play in a niche nor be a second source. They are in high-performance compute to lead. So being a floating-point monster in parts of the data centre is not enough.

IMHO that halo product would find a better home on the enthusiast platform though.

I tend to agree. Intel is not going beyond 8-core on their socket. If AMD really needs core-count advantage (which, again, is a bad sign for Zen 2 performance, and hopefully won't be necessary), then lower cost-of-entry to the HEDT platform instead. This is already taking place, with dramatic changes since the launch of Threadripper.

If they want to pressure Intel on core count, they can do that starting with an even more price competitive Threadripper. In fact... they can start with TR 3 first, then follow with Zen 3000 series.

Yeah. I wouldn't be surprised if we see Threadripper 3000 before Ryzen 3000. All the components are there for Threadripper, since it is all reuse of the EPYC platform. Ryzen 3000 either needs an IO chiplet or a monolithic design. I bet on the former.

Best they can do is to up the core count as a means to make a worthy differentiation

I think it would be plenty worthy differentiation if 7nm Zen 2 beats 14nm Skylake/CFL as I expect it to — and at lower power to boot. If AMD fails me and does not, it may make more sense to fight the high end with Threadripper 3000. If it has UMA and L4 cache as presumed, it may even perform better on games and other latency-sensitive applications than Ryzen 3000.

Halo products matter, true. Threadripper is there for that reason

Exactly. As I've speculated (here), it makes sense for enthusiasts to move to a dedicated high-performance platform, while mainstream is served by a platform optimised for cost and power.

My guess is that remaining on a maximum 8 cores for now is the more reasonable option for AMD financially.

Agree. There is no real demand for 16 cores in the mainstream. And it sure would be lower cost to use a single 7nm CPU chiplet rather than two. The package space left over is better used to include a GPU to match Intel's feature set in the mainstream.

Besides demolishing the 9900K in benches, there's a price point between top-end mainstream and low-end enthusiast that they could fill reasonably well with 12 to 16 core parts.

AMD should close that gap by bringing the cost of the HEDT platform further down. I presume this will happen as Threadripper 3000 obsoletes the previous generation, while refreshed motherboards reduce the prices of the old models.
 

Vattila

Senior member
Oct 22, 2004
AMD didn't choose 8-core chiplets just because of yields or binning for higher clocks / better power efficiency.

Agree. While yields and binning are a big part of it, they may also have chosen 8-core for reuse in the mainstream, as in my hypothetical 8-core Ryzen 3000 APU.
 

Zapetu

Member
Nov 6, 2018
There is no real demand for 16 cores in the mainstream.

Agree. While yields and binning are a big part of it, they may also have chosen 8-core for reuse in the mainstream, as in my hypothetical 8-core APU.

I also agree that there's probably not enough of a market for 12 and 16 core CPUs on the AM4 platform to warrant development of a separate chip for 12 and 16 core Ryzen 3k, but you never know. As has been brought up many times before, chip development isn't exactly cheap on any recent node.

I actually really like the idea of an 8-core 7nm chiplet and one 12nm IO+GPU die, even if it's virtually the same idea Intel had with Clarkdale, as has also been mentioned before. Still, a large 32MB L3 cache should alleviate some memory latency concerns, while an L4 cache is not needed with just one CPU chiplet. Whether Rome's IO die uses the 14HP process and eDRAM for L4 cache is a totally separate question. AMD also still has a WSA (Wafer Supply Agreement) with GlobalFoundries, and while they want to renegotiate it later, they still have to order some amount of wafers from GloFo. 14nm IO dies for Rome and a 12nm IO+GPU die would help with that.

Edit: Please note that the following is an oversimplification; design and wafer costs are a much more complex matter than just taking some numbers from a chart. High-end designs usually cost even more, and mask costs are only a small part of the total cost (see pages 9 and 11 here).

AMD could choose to do the following (cost estimations might be more or less wrong):
  • GloFo 14nm HP (IBM) or 14nm LPP (Samsung)
    • 420 mm² IO die for Rome - >$100M, <$150M (?) ($200M±$100M)
    • Optionally a 1/4-size IO die for Ryzen 3k (2x 7nm CPU chiplet design) - <$100M ($60M (-$10M...+$60M)) (reuses IO die IP)
  • GloFo 12 nm LP
    • Polaris refresh (confirmed) - <<$100M ($15M (-$5M...+$25M)) (?) (14 nm Polaris remasked)
    • Optionally an IO+GPU die for mainstream (1x 7nm CPU chiplet design) - <$100M ($50M (-$10M...+$50M)) (new Infinity Fabric 2.0, improved DDR4, Vega 14LPP to 12LP (or Polaris-based 12LP?))
  • TSMC 7nm HPC
    • Vega Instinct 7nm (confirmed) - <$300M ($400M±$150M)
    • 7nm 8 core chiplet (confirmed) - <$300M ($400M±$150M)
    • Navi GPU (later in 2019) - <$300M ($400M±$200M)
    • Mobile APU (later in 2019 or 2020, 7nm+ (?)) - <$300M ($400M±$250M)
 

dnavas

Senior member
Feb 25, 2017
If AMD goes for core-count advantage with Zen 2 against Skylake/CFL, I see that as a poor sign for competitive single-thread performance. Personally, I still expect 7nm Zen 2 to beat 14nm Skylake/CFL. Anything less would be poor planning against 10nm Ice Lake, which would have been here by now, had Intel not stumbled so badly.

I'm going to try and take a different angle to this.
What I see is a company that is planning a roll-out strategy for a number of different technologies over the course of a number of different advances in production capability. On the production side we've got a 2:1 shrink, a follow-on that reduces power use but doesn't do much for size, and another shrink, though one that isn't nearly as large as the move from 14 to 7nm. On the technology side we've got moar-cores, power budgets, signal routing/connectivity, frequency -- feel free to add your own.

A company that doesn't go for core count on the 2:1 shrink isn't targeting well. That means they've got to think about how to deliver the core count without blowing the budget (chiplets), how to connect all the cores, how to feed the cores (tech they haven't yet disclosed), etc. These are the engineering challenges that are front of mind. Now add marketing. Marketing comes in and says, ok, how to best sell more cores -- what market are we hitting? You ask about single-thread performance. Where are frequency concerns in servers? Well, we've already heard them say that they have customers that want high frequency, low-core count parts. That isn't the gaming market, sure, but it's a concern that at least isn't lost in this round. You heard the AMD server guy who said "hey, this part isn't a desktop part that's pretending to be a server chip" (or words to that effect). While it wasn't clear to me whether he was talking about Intel or AMD parts, the point is that THIS is the server chip. It isn't targeting the desktop. It remains to be seen whether the desktop part is a repurposed server chip, or a purpose-built desktop chip, but given Intel's missteps, if AMD has gotten decent IPC and frequency uplifts, they could get away with it. Big time.

I think it would be hard to be disappointed that AMD is targeting the server market with higher core counts while still recognizing the importance of more performant, lower core-count parts. They're leading with the strengths of the node, adding the floating-point vectors they need to compete, but not going overboard with a full 512-bit implementation. A power-reduction cycle would be next, so I half expect laptops to be targeted for 7nm+. I don't know whether you create a fused 7nm design this go-around, but it seems smart for the low-power market, so I'd certainly expect such a development on 7nm+, regardless of whether it happens this time. Maybe you add the half-width 512 ops in that round, and in the 5nm shrink you go wider on the floating-point units. Or maybe you redesign your GPU to better integrate with your CPUs and provide something that's more flexible. Worrying about this in a couple of years allows Lisa to steer the Radeon ship safely back into port.

So far, honestly, I'm pretty impressed. I don't know if they'll deliver the parts *I* want, but they're targeting the market that the node move will have the biggest impact on, and they're focusing their engineering effort there, rather than being distracted by the rest of the market.

Yeah. I wouldn't be surprised if we see Threadripper 3000 before Ryzen 3000. All the components are there for Threadripper, since it is all reuse of the EPYC platform. Ryzen 3000 either needs an IO chiplet or a monolithic design. I bet on the former.

Ditto. I also wouldn't be surprised if there was a little something extra on TR3. They might wait until TR4, but I don't think you try to feed a 64-core TR4 through 4 memory channels. Max it out at 48, and sprinkle a little pixie dust in between. GPU cores? FPGA chiplets? What sort of crazy hackathon projects have the engineers prepared this go-around? Could get interesting....
 

beginner99

Diamond Member
Jun 2, 2009
If each chiplet has a local 32MB L3 cache, is an L4 cache (on the I/O die) smaller than 256MB (128MB or 64MB) even beneficial or needed? How much does it improve memory latency? Does it reduce cache misses by a significant amount?

32 * 8 = 256MB. So if you have a 256MB "inclusive" L4 cache on the IO die and each chiplet has only 1 connection to the IO die (but none between chiplets), then by having the L4 cache you save the time (and bandwidth) needed to get data from the cache of another chiplet. Basically all cores have access to all the data of the other cores over 1 IF hop. I would say that makes a whole lot of sense to do.

I'm not sure how else it would even work. Check all the other chiplets first, then memory? That would mean an average of 3.5 IF hops. Would that even be worth it vs. going directly to memory? The L4 cache would have huge benefits if all cores are working on the same problem. On the other hand, when you virtualize based on chiplets, the L4 would be useless.
 

naukkis

Senior member
Jun 5, 2002
32 * 8 = 256MB. So if you have a 256MB "inclusive" L4 cache on the IO die and each chiplet has only 1 connection to the IO die (but none between chiplets), then by having the L4 cache you save the time (and bandwidth) needed to get data from the cache of another chiplet. Basically all cores have access to all the data of the other cores over 1 IF hop. I would say that makes a whole lot of sense to do.

I'm not sure how else it would even work. Check all the other chiplets first, then memory? That would mean an average of 3.5 IF hops. Would that even be worth it vs. going directly to memory? The L4 cache would have huge benefits if all cores are working on the same problem. On the other hand, when you virtualize based on chiplets, the L4 would be useless.

Think again. Duplicating L3 into L4 won't have any benefit for reducing bandwidth, as L3 either has to be write-through or L3 still needs to be checked for dirty lines.

They can replicate the (inclusive) L3 tags in the IO hub, so a check of the other chiplets won't be necessary on a memory access if the tags miss. If they implement an L4 cache in the IO hub, its only useful operating mode is to prefetch data from memory into the L4 buffer, in which case it will lower memory latencies. That's the biggest disadvantage Zen has compared to ringbus Intels: L3 acts as a victim buffer, whereas the ringbus implementation can prefetch from memory into L3, which greatly reduces memory latencies. Skylake-X's L3 acts like a victim buffer too, and look how its gaming performance goes.
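That replicated-tag idea is simple enough to sketch. The structure below is purely illustrative (a dict standing in for a real tag array), not AMD's actual implementation:

```python
# Toy snoop filter: the IO hub keeps a copy of every chiplet's L3 tags, so an
# L3 miss can go straight to DRAM when no other chiplet holds the line.
class IOHubDirectory:
    def __init__(self):
        self.replicated_tags = {}  # cache-line address -> owning chiplet id

    def on_l3_fill(self, line, chiplet):
        self.replicated_tags[line] = chiplet  # mirror the chiplet's tag update

    def on_l3_evict(self, line):
        self.replicated_tags.pop(line, None)

    def route_miss(self, line, requester):
        owner = self.replicated_tags.get(line)
        if owner is None or owner == requester:
            return "DRAM"                     # tags miss: no snoops needed
        return f"chiplet {owner} L3"          # one targeted IF hop, no broadcast

hub = IOHubDirectory()
hub.on_l3_fill(0x1000, chiplet=3)
print(hub.route_miss(0x1000, requester=0))    # -> chiplet 3 L3
print(hub.route_miss(0x2000, requester=0))    # -> DRAM
```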
 

beginner99

Diamond Member
Jun 2, 2009
That's the biggest disadvantage Zen has compared to ringbus Intels: L3 acts as a victim buffer, whereas the ringbus implementation can prefetch from memory into L3, which greatly reduces memory latencies. Skylake-X's L3 acts like a victim buffer too, and look how its gaming performance goes.

Zen1 yes, but maybe not Zen2. But anyway, it was just a suggestion.

For Skylake-X: owners here on AnandTech have shown that if you overclock the uncore (which is very conservatively clocked), the gaming performance can be restored.
 

Spartak

Senior member
Jul 4, 2015
After letting the news sink in, I agree it's likely they will use the same MCM concept with 8-core CCX chiplets for the desktop. I also think it will have a cut-down NB that includes a small 3-5 CU GPU. AMD needs the mainstream desktop office market, and including a smallish 14nm GPU is almost free. They would be crazy not to include one.

I don't see them including a more beefed-up separate GPU chiplet, as there's no market for it on the desktop. You either need a bare-bones 2D-capable GPU or a serious dedicated GPU.

For mobile the situation is different, and an integrated die with a beefed-up iGPU is most definitely the best of both worlds. These chips would also be useful for the bottom of the lineup.

I can imagine the desktop line-up to look something like this:
R9 16c/32t 130W MCM 2*8CCX + 1*NB w/3-5CU ~$600
R7 12c/24t 90W MCM 2*8CCX + 1*NB w/3-5CU ~$450
R5 8c/16t 65W SOC 1*8CCX + 8-12CU ~$300
R3 6c/12t 50W SOC 1*8CCX + 8-12CU ~$200
and maybe even an
R1 6c/6t 45W SOC 1*8CCX + 8-12CU ~$150
 

Glo.

Diamond Member
Apr 25, 2015
I can imagine the desktop line-up to look something like this:
R9 16c/32t 130W 2*8CCX + 1*NB ~$600
R7 12c/24t 90W 2*8CCX + 1*NB ~$450
R5 8c/16t 65W 1*8CCX + 1*NB ~$300
R3 6c/12t 50W 1*8CCX + 1*NB ~$200
and maybe even an
R1 6c/6t 45W 1*8CCX + 1*NB ~$150
SKU prices do not move. If there is a Ryzen 3 3200, it will cost $109. A Ryzen 5 3600 will cost $199. I can see a Ryzen 7 3800X for $499. I cannot see a Ryzen 9 lineup, however.
 

Spartak

Senior member
Jul 4, 2015
SKU prices do not move. If there is a Ryzen 3 3200, it will cost $109. A Ryzen 5 3600 will cost $199. I can see a Ryzen 7 3800X for $499. I cannot see a Ryzen 9 lineup, however.

Maybe so, but doubling the cores gives you a very wide spread of good-performing chips; it would be silly not to make use of that.
 

JDG1980

Golden Member
Jul 18, 2013
After letting the news sink in, I agree it's likely they will use the same MCM concept with 8-core CCX chiplets for the desktop. I also think it will have a cut-down NB that includes a small 3-5 CU GPU. AMD needs the mainstream desktop office market, and including a smallish 14nm GPU is almost free. They would be crazy not to include one.

If they include an iGPU at all, I would be surprised to see one smaller than the one in Raven Ridge (11 CUs). Below that, they quickly run into diminishing returns. Polaris 11 (16 CUs) is a 123 mm² chip; Polaris 12 (8 CUs) is a 103 mm² chip. That's a lot of lost functionality for not much saved space. No matter how few CUs, they still need the video output hardware and the fixed-function encode/decode hardware. Of course, there's also some stuff that can be omitted on an iGPU because it's shared with the CPU, most notably the memory controller.
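A quick calculation on those quoted die sizes separates the per-CU cost from the fixed overhead, which is the diminishing-returns point in numbers:

```python
# Marginal area per CU vs fixed overhead, from the quoted Polaris die sizes.
p11_area, p11_cus = 123.0, 16   # Polaris 11: mm^2, CUs
p12_area, p12_cus = 103.0, 8    # Polaris 12

area_per_cu = (p11_area - p12_area) / (p11_cus - p12_cus)  # ~2.5 mm^2 per CU
fixed_overhead = p12_area - p12_cus * area_per_cu          # ~83 mm^2

print(f"~{area_per_cu:.1f} mm^2 per CU")
print(f"~{fixed_overhead:.0f} mm^2 of display/multimedia/uncore paid regardless")
```

By this (linear, so rough) model, halving the CU count again would save only ~10 mm², which is why tiny iGPUs don't buy much.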

I dont see them including a more beefed up separate GPU chiplet as there's no market for it on the desktop. You either need a bare bones 2D capable GPU or a serious dedicated GPU.

Unless Navi's touted "scalability" means that it's also going to be a chiplet design. In that case, they could use 1 chiplet with Ryzen for iGPU functionality, and for the discrete cards they could have a 14nm glue die with the memory controller and uncore plus either 2 or 4 chiplets. Maybe 16 CUs per chiplet.

For mobile the situation is different, and an integrated die with a beefed-up iGPU is most definitely the best of both worlds. These chips would also be useful for the bottom of the lineup.

There will no doubt eventually be a monolithic Zen 2 die, similar to a beefed-up Raven Ridge, for mobile and low-end/low-power desktops, but that may wait another year or so until wafer costs go down and yields go up at TSMC for larger 7nm parts.
 

Spartak

Senior member
Jul 4, 2015
Because the APUs are currently AMD's bottom-of-the-barrel offerings that come last every generation. We are still waiting for the 12nm APU, and we are expecting Zen 2 based Ryzen chips in 1H 2019.

I believe Charlie said EPYC2 release in 19Q2 and Ryzen3 in 19Q3/4?

If the top-of-the-line Ryzen3 R7/R9 (which is usually released first) uses the same MCM chiplet approach, they could release the lower-end desktop chips a bit later, in 19Q4/20Q1, alongside their mobile offerings using the same SOC die.
 

Gideon

Golden Member
Nov 27, 2007
Some people here mentioned that Intel's Clarkdale (2010, Nehalem arch) also has a separate northbridge on the package. Thanks for pointing that out; I had totally forgotten about that CPU.

The overall layout of Clarkdale is actually a bit similar to the one Vattila described above (with the GPU being on the NB), though not exactly, as the CPU still does some of the (legacy) I/O:

clarkdaledie.jpg


That's a much better comparison than the old Core 2 Duos, as the memory controller is on the package and it uses DDR3.

Bjorn3D did some nice tests with a Clarkdale based i5 661:

@stock it was running DDR3-1333 at CAS 9.
@4.3GHz OC it was running DDR3-1600 at CAS 8
(Everest is the predecessor of AIDA64)
C_E_Mem_Lat.jpg


The latency on the i5 661 is indeed nearly 2x worse than on the i5s with the on-die controller. Curiously, the i5 661 did even worse than the Core 2 Q6600 (which has the memory controller on the motherboard and uses an FSB), probably because Clarkdale's memory controller wasn't tuned to run through an FSB-style link.

From this it's pretty clear that AMD does indeed need to tune the memory controller to avoid a latency regression. Hopefully the I/O die has an L4, which would mitigate it. Or even better, if the rumoured 32MB L3 is unified and inclusive (no more victim cache), prefetching from memory into it should be possible (exactly like Intel does).

Such a 4x cache increase with improved prefetching alone would already increase gaming performance significantly, even if memory latency itself doesn't really improve.
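As a rough sanity check on what 4x more L3 could buy, here's the old rule of thumb that miss rate scales with the inverse square root of capacity. It's an empirical approximation, and the baseline miss rate is an invented placeholder:

```python
# Rule-of-thumb estimate: miss_rate ~ 1/sqrt(capacity). Not a law, just a
# common approximation; the 10% baseline miss rate is an assumed placeholder.
import math

baseline_l3_mb = 8.0        # Zen 1 CCX L3
bigger_l3_mb = 32.0         # rumoured Zen 2 chiplet L3
baseline_miss_rate = 0.10   # assumed L3 miss rate for some workload

scaled = baseline_miss_rate * math.sqrt(baseline_l3_mb / bigger_l3_mb)
print(f"Estimated miss rate with 4x the L3: {scaled:.1%} (down from 10.0%)")
```

Halving the misses that actually go to DRAM would hide a lot of the extra hop through the I/O die.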
 

Glo.

Diamond Member
Apr 25, 2015
Maybe so, but doubling the cores gives you a very wide spread of good-performing chips; it would be silly not to make use of that.
If the manufacturing costs are low enough, we could easily see an 8C/8T part with one chiplet, a Ryzen 3 3200, for $109.
 

Gideon

Golden Member
Nov 27, 2007
Did some digging around, and unfortunately it seems that AIDA64 rewrote their cache benchmark from scratch in 2013, meaning the scores aren't really comparable (read: they worsened by >10 ns).

Fiery said:
We've also replaced the old cache and memory latency benchmark with a brand new one that uses a different approach, recommended by processor architecture engineers. The old memory latency benchmark used the classic forward-linear solution, so it "walked" the memory continuously, in forward direction. Unfortunately that classic approach was sometimes over-optimized by "too smart" memory controllers, that led to unrealistically low latency scores. It was a constant fight for us to get around those over-optimizations, to make sure AIDA64 provides stable and reliable latency results. With the new latency benchmark we've switched to a block-random solution, that keeps "jumping" to random addresses inside a memory block for a period of time, and then skips to a new block and continues "jumping" to random places inside there as well. With this new solution memory controllers cannot find a pattern anymore in the latency measurement, and so they cannot over-optimize the benchmark. The block-random approach however means that latency results will be higher, and since the scores are in nanosec, it means the results will be worse than what you got used to. For example:

Core i7-3960X with X79 chipset and 4-channel DDR3-1600:

- AIDA64 v2.85 Memory Latency: 55.9 ns [ old ]

- AIDA64 v3.00 Memory Latency: 67.5 ns [ new ]

Unfortunately there don't seem to be any good reviews using the newer code on old hardware to compare against. The last time anyone bothered to fire up a Nehalem was in the Sandy Bridge reviews (e.g. Tech Report in 2011) :(
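For the curious, the block-random pattern Fiery describes is easy to sketch. A real benchmark would do this in C with raw pointers and cycle counters; Python's interpreter overhead swamps the actual cache latency, so this only demonstrates the access pattern:

```python
# Sketch of a block-random pointer-chase latency test, per the quote above.
# Each load's address depends on the previous load, and the walk jumps
# randomly within one block at a time, so prefetchers can't find a pattern.
import random
import time

ARRAY_WORDS = 1 << 21      # working set of ~2M entries
BLOCK_WORDS = 1 << 14      # jump randomly inside one block, then move on
JUMPS_PER_BLOCK = 4096

# Build a random cyclic permutation inside each block (classic pointer chase).
chain = list(range(ARRAY_WORDS))
for base in range(0, ARRAY_WORDS, BLOCK_WORDS):
    idx = list(range(base, base + BLOCK_WORDS))
    random.shuffle(idx)
    for a, b in zip(idx, idx[1:] + idx[:1]):
        chain[a] = b

hops = 0
start = time.perf_counter()
for base in range(0, ARRAY_WORDS, BLOCK_WORDS):
    pos = base
    for _ in range(JUMPS_PER_BLOCK):
        pos = chain[pos]   # serially dependent load: measures latency, not bandwidth
        hops += 1
elapsed = time.perf_counter() - start
print(f"{elapsed / hops * 1e9:.1f} ns per dependent access (interpreter-inflated)")
```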
 

dnavas

Senior member
Feb 25, 2017
On the other hand, when you virtualize based on chiplets, the L4 would be useless.

Take it one step further, though. If you've got a write-heavy, shared block of memory, you either have race conditions or most of your time is taken up by atomics anyway. One fundamental problem is that the link between locks and the memory they're trying to guard is not made explicit; otherwise you could transfer delta updates along with the lock ownership. Of course, the other is that Windows is designed to move work around cores (presumably so that none of them overheats?) -- it's not a fabulous strategy for retaining locality. We're stuck with what we're stuck with, though.

If you intend to address the write-heavy case well, then you need write-through caches. I don't see how that's even plausible with 64 cores / 8 chiplets (I am assuming the multi-socket case is NUMA; otherwise it's just worse). As your quote above implies, the useful case for L4 is when it holds shared, read-heavy data -- stuff that's likely to be read many times and written quite a bit less frequently. You don't try to save the hop to read from another L3 when the data is overwritten; you just mark the line dirty in L4 and point to the L3 with the most up-to-date data. One way I might think of arranging an L4 is an L4 "TLB" that holds row locations, plus a cache that only fills when a read happens from another L3. I might then iterate on the branch predictor and lean on it to predict memory access characteristics -- whether a read from RAM is likely to be read again -- and pre-cache into L4 if an access looks likely to get reused. Something as simple as a Bloom filter on the PC might wind up pretty accurate. You could arrange memory prefetching in a similar manner.

There are ways to make an L4 useful, and while the TLB (is there a better name for an L3 row-assignment translation buffer?) would need to be full-size, not all rows need to be in L4. In fact, you could throw out the L4 entirely, as I suspect even fairly small prefetch buffers driven by a simple learning system could help....

Warning: this has been random thoughts from not-a-CPU-designer....
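Since the Bloom-filter idea above is concrete enough to sketch, here's a toy version. Everything in it (table size, hash count, keying on the missing load's PC) is hypothetical, in the same not-a-CPU-designer spirit:

```python
# Toy reuse predictor: a Bloom filter over the program counters (PCs) of
# loads whose lines were later re-read. All sizes and hashes are made up.
import hashlib

class PCReusePredictor:
    def __init__(self, bits=1 << 16, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.table = bytearray(bits // 8)

    def _positions(self, pc):
        for i in range(self.hashes):
            h = hashlib.blake2b(f"{pc}:{i}".encode(), digest_size=8)
            yield int.from_bytes(h.digest(), "little") % self.bits

    def train(self, pc):
        """Call when a line fetched by this PC turned out to get reused."""
        for p in self._positions(pc):
            self.table[p // 8] |= 1 << (p % 8)

    def predict_reuse(self, pc):
        """True -> worth pre-caching into L4. Can give false positives but
        never false negatives, which suits a prefetch hint."""
        return all(self.table[p // 8] & (1 << (p % 8)) for p in self._positions(pc))

bf = PCReusePredictor()
bf.train(0x401A2C)                    # this load's data got reused; remember its PC
print(bf.predict_reuse(0x401A2C))     # True
print(bf.predict_reuse(0x7F00BEEF))   # almost certainly False
```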
 

Jan Olšan

Senior member
Jan 12, 2017
Some people here mentioned that Intel's Clarkdale (2010, Nehalem arch) also has a separate northbridge on the package. Thanks for pointing that out; I had totally forgotten about that CPU.

The overall layout of Clarkdale is actually a bit similar to the one Vattila described above (with the GPU being on the NB), though not exactly, as the CPU still does some of the (legacy) I/O:

clarkdaledie.jpg


That's a much better comparison than the old Core 2 Duos, as the memory controller is on the package and it uses DDR3.

The thing to note is that the previous generation from Intel (Core 2 Duo) still had an FSB and a memory controller in the chipset. You have to remember that context. So this memory-controller-on-a-separate-die arrangement might have been an improvement back then, but if Intel had been coming from a platform with an on-die memory controller at the time, it would likely have been a regression performance-wise. And very likely also for power consumption on a mobile platform. The memory interface is very busy, and anything that crosses the substrate usually eats power. Not a good combo.