AMD "Next Horizon Event" Thread


DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
I linked a source while the other just talked speculation :p

And I typoed. Oops.

But yeah linking a source is a good idea. Regardless it looks like 12FDX is delayed until at least 2020. Maybe later because, you know, GlobalFoundries.
 

Gideon

Golden Member
Nov 27, 2007
1,608
3,571
136
Please.

Physical transmission distances make up a very, very small fraction of overall latency. It's virtually all in the signal processing, not the signal transmission.

You can beat on the I/O Controller for possibly increasing latency, but don't use propagation delay as the reason.

(Appreciable) Delays will be the result of the clock rate of the infinity fabric and the internal speed of the memory controller.


You have a transmission delay of around 1ns for every 15 cm travelled. So the extra latency added (due to transmission route lengths) from deviating from core, through an on-socket memory controller rather than directly from core-located memory controller will be measured in pico-seconds... if not femto-seconds.
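A quick sanity check of those numbers (assuming ~0.5c propagation speed in a PCB trace, which is where the ~1 ns per 15 cm figure comes from; the 10 cm detour and 65 ns total latency are just illustrative values):

```python
# Rough propagation-delay check. Signals in PCB traces travel at
# roughly half the speed of light.
C = 3e8                  # speed of light, m/s
TRACE_SPEED = 0.5 * C    # assumed propagation speed in a trace, m/s

def propagation_delay_ns(distance_cm: float) -> float:
    """Nanoseconds for a signal to travel distance_cm along a trace."""
    return (distance_cm / 100.0) / TRACE_SPEED * 1e9

detour_ns = propagation_delay_ns(10)   # generous 10 cm of extra routing
total_ns = 65.0                        # typical measured DRAM latency

print(f"extra routing delay: {detour_ns:.2f} ns")          # ~0.67 ns
print(f"share of total:      {detour_ns / total_ns:.1%}")  # ~1%
```

Even with a deliberately pessimistic detour, the wire itself contributes about one percent of the measured latency; the rest is protocol and controller overhead.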

Yeah, I can't understand why people keep bringing up the "distance" argument. I'm tired of refuting it constantly:

As I mentioned, the Core 2 Duo had DDR2-800 with very similar transfer times to DDR4-3200, and could achieve ~65 ns memory latency even with a lowly 1066 MHz FSB.


Some really dumb MoBo schematics to get the point across:

On the Intel motherboard (D975XBX) used in the review, the memory controller is on the 82975X Northbridge chip just to the left of the socket.
So the red line represents the shortest path a hypothetical signal must travel: from the DIMM slots to the NB, and from there onwards (via the FSB running at a lowly 1 GHz) to the processor.
[Image: D975XBX motherboard schematic with the signal path marked]



Now look at the distance covered on a B450 motherboard:
[Image: B450 motherboard schematic with the signal path marked]


The CPU is certainly closer to the DIMMs than the Northbridge of old. And the distance from the IO chiplet to the CPU chiplet can be measured in millimeters (not centimeters, as was the case with the northbridge).

On top of that, Zen 1 already runs the Infinity Fabric clock at 1.6 GHz with 3200 MHz RAM, and it will probably be higher with Zen 2 (they mentioned they have substantially refined it).

Let me repeat that: Core 2 has shown AIDA latencies between 60-70 ns with memory with similar (actually slightly worse) transfer times than DDR4-3200 CL16. Zen 1, in comparison, can't really do that, not without much tighter timings.
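For reference, the "similar transfer times" point works out like this (CL5 for DDR2-800 is an assumed typical timing, not a measured kit):

```python
# First-word CAS latency in wall-clock time: CL cycles divided by the
# memory clock, which for DDR is half the effective transfer rate.
def cas_latency_ns(transfer_rate_mts: float, cl_cycles: int) -> float:
    mem_clock_mhz = transfer_rate_mts / 2.0
    return cl_cycles / mem_clock_mhz * 1000.0

print(cas_latency_ns(800, 5))    # DDR2-800 CL5   -> 12.5 ns
print(cas_latency_ns(3200, 16))  # DDR4-3200 CL16 -> 10.0 ns
```

So the DRAM array itself answers in roughly the same wall-clock time now as in the Core 2 era; what changed is everything between the core and the DIMM.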

Yet ... Hurr Durr ... AMD can't possibly improve the latency with Zen 2 if it's a chiplet design. It's impossible due to "distance"!

TL;DR:
As Atari already mentioned, the whole distance argument is flawed from the start. Signal processing is the bottleneck, not transmission distance.
 
Last edited:

PeterScott

Platinum Member
Jul 7, 2017
2,605
1,540
136
Yeah, I can't understand why people keep bringing up the "distance" argument. I'm tired of refuting it constantly:

Agreed, it's not distance in this context.

It's on die, vs off die. Then it's on package, vs off package where you typically incur latency/power penalties.

All else being equal, you have lower latency and less power usage on die. Next best is on package, and worse is off package.

Optimal laptop solutions will be monolithic, and really so would desktops, though the differences matter less on the desktop.
 

mattiasnyc

Senior member
Mar 30, 2017
356
337
136
I am skeptical that lower main memory access latencies past a certain point translate into significant performance gains in desktop workloads including gaming.

That's probably true. Hopefully though we'll see some nice gains in some content creation where some work requires other work to complete first. In audio for example we have a large number of signals that all need to be processed at the same time, but they then also feed other processing in series. I think to an extent it may be similar for some image processing.

So I'm cautiously optimistic that Ryzen could improve for 'us', all things considered, and that Threadripper will at the very least improve the situation for the versions with compute dies by giving all chiplets the same delay.
 

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
What IS on a mature process is the IO chiplet.

This is the biggest of all the Epyc chiplets and the one that's supposed to have the lowest defect density. Any word on its dimensions yet, so we can use the Die Per Wafer Calculator to make some estimates?

Because of its size, the yields will not be as good as with smaller dies and, on top of that, the chance of a defect killing the chip entirely should be higher, as opposed to the CCX chiplets, where a defect may kill part of the chip while the rest stays fully active. However, the defect density should be far lower, and that should mitigate the yield issue.

Had AMD made the IO chiplet in 7 nm as well, it could have ended in disaster because of the much worse defect density combined with the higher chance that a defect would kill the entire IO chiplet. Smart decision IMO to go with a very mature process for such a critical component.

I suspect that the I/O chiplet is actually mostly L4 cache. If that's true, then the probability is high that defects on that chip will land in individual SRAM blocks. Those blocks can be fused/mapped out and the rest of the chip can still be usable in lower-spec EPYC processors, or even Threadripper processors if those wind up being just low-spec EPYC processors. The other large-area portion of the chip will be the DRAM channels. It is highly likely that I/O chiplets with defective DRAM channels (assuming some of those channels are EPYC-only) can be reused in Threadripper parts. The third-largest amount of real estate would be the IF links to the CPU chiplets. Again, those can be mixed and matched as needed (within certain constraints, to be sure) to ALSO be reusable in Threadripper parts.

With the design that they have, there is a lot of opportunity for recovering defective parts. I suspect that even the lowest-priced Threadripper sold from this platform will still be priced significantly above break-even.
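On the die-per-wafer question raised above: nothing official yet, but the standard approximation gives a feel for it. The 450 mm² area is the rumoured figure, and the defect densities below are pure guesses:

```python
import math

def dies_per_wafer(die_area_mm2: float, wafer_diameter_mm: float = 300) -> int:
    """Standard die-per-wafer approximation (ignores scribe lines)."""
    r = wafer_diameter_mm / 2
    return int(math.pi * r**2 / die_area_mm2
               - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))

def poisson_yield(die_area_mm2: float, d0_per_cm2: float) -> float:
    """Fraction of defect-free dies at defect density d0 (defects/cm^2)."""
    return math.exp(-die_area_mm2 / 100 * d0_per_cm2)

AREA = 450  # rumoured Epyc I/O die size, mm^2
for d0 in (0.1, 0.2, 0.5):
    good = dies_per_wafer(AREA) * poisson_yield(AREA, d0)
    print(f"D0={d0}: {dies_per_wafer(AREA)} candidates, ~{good:.0f} defect-free")
```

Even before counting partially-defective dies that can be salvaged, a mature low-D0 process makes a huge difference for a die this size.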
 

exquisitechar

Senior member
Apr 18, 2017
655
862
136
IIRC Charlie Demerjian said on Twitter that the turbo on a few cores will be significantly higher than before for Ryzen 3xxx CPUs. I wonder about the final 64c Epyc 2 clocks; the one they benchmarked probably wasn't clocked all that high.
https://twitter.com/CDemerjian/status/1035967242306039809
Got curious and I looked it up. I somewhat remembered wrong, he was talking about the base/turbo. Hope his prediction pans out. :p
 

Gideon

Golden Member
Nov 27, 2007
1,608
3,571
136
It's on die, vs off die. Then it's on package, vs off package where you typically incur latency/power penalties.

Precisely!

At the very least, AMD should be able to do no worse than what Intel managed with an off-package controller (i.e. at least a ~5-10 ns improvement in AIDA64 over gen 1). Higher CPU, IF, and especially memory clocks would probably help somewhat.

The first two of those (CPU and IF clocks) should be a given. The last one is needed anyway once DDR5 ships (it starts at 4400 MHz and goes all the way to 6400 MHz). I really hope that with this generation they have finally canned their subpar memory controller for good and can manage decent memory clocks.
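For a feel of what those DDR5 rates mean, peak bandwidth is just transfer rate times bus width (a toy calculation that ignores DDR5's split into two 32-bit subchannels per DIMM, which doesn't change the total width):

```python
# Peak theoretical bandwidth: transfers/s x bus width in bytes x channels.
def peak_bandwidth_gbs(transfers_mts: float, bus_width_bits: int = 64,
                       channels: int = 2) -> float:
    return transfers_mts * 1e6 * (bus_width_bits // 8) * channels / 1e9

print(peak_bandwidth_gbs(3200))  # dual-channel DDR4-3200 -> 51.2 GB/s
print(peak_bandwidth_gbs(6400))  # dual-channel DDR5-6400 -> 102.4 GB/s
```

A straight doubling of peak bandwidth on the same 128-bit dual-channel layout, which is exactly why the memory controller can't stay the weak link.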
 

H T C

Senior member
Nov 7, 2018
549
395
136
I am thinking that Threadripper will have two Ryzen IO dies.
Care to elaborate?

I thought the whole idea was to have one IO chiplet design across their whole lineup, from AM4 to Epyc, where what varies is the size: AMD may possibly use a smaller version of the IO chiplet for the AM4 platform.

However, I'm not so sure anymore because of this:

I suspect that the I/O chiplet is actually mostly L4 cache. If that's true, then the probability is high that defects on that chip will land in individual SRAM blocks. Those blocks can be fused/mapped out and the rest of the chip can still be usable in lower-spec EPYC processors, or even Threadripper processors if those wind up being just low-spec EPYC processors. The other large-area portion of the chip will be the DRAM channels. It is highly likely that I/O chiplets with defective DRAM channels (assuming some of those channels are EPYC-only) can be reused in Threadripper parts. The third-largest amount of real estate would be the IF links to the CPU chiplets. Again, those can be mixed and matched as needed (within certain constraints, to be sure) to ALSO be reusable in Threadripper parts.

With the design that they have, there is a lot of opportunity for recovering defective parts. I suspect that even the lowest-priced Threadripper sold from this platform will still be priced significantly above break-even.

Makes a lot of sense, assuming it can work like this.
 

jpiniero

Lifer
Oct 1, 2010
14,509
5,159
136
Care to elaborate?

I thought the whole idea was to have one IO chiplet design across their whole lineup, from AM4 to Epyc, where what varies is the size: AMD may possibly use a smaller version of the IO chiplet for the AM4 platform.

At 450 mm², the IO die is too big for anything other than Epyc, even on 14 nm.
 

maddie

Diamond Member
Jul 18, 2010
4,722
4,627
136
At 450 mm², the IO die is too big for anything other than Epyc, even on 14 nm.
That's GPU-sized, so easy enough to fab, and it can be harvested for TR as needed (detailed in the post above). TR is probably the lowest-volume of the Zen products.

What you're proposing (2 IO dies) entails a new package for the lowest-selling variant, whereas the present Rome package can be reused for TR if the same IO die is harvested.
 

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
I suspect that the I/O chiplet is actually mostly L4 cache. If that's true, then the probability is high that defects of that chip will happen in individual SRAM blocks. Those blocks can be fused/mapped out and the rest of the chip can still be usable in lower spec EPYC processors, or even Threadripper processors if those wind up being just low spec EPYC processors. The other large area portion of the chip will be DRAM channels. It is highly likely that I/O chiplets with defective DRAM channels (assuming that those channels are EPYC use only) can be reused in Threadripper parts. The third largest amount of real estate would be the IF links to the CPU chiplets. Again, those can be mixed and matched as needed (within certain constraints to be sure) to ALSO be reusable as threadripper parts.

With the design that they have, there is a lot of chance for recovering defective parts. I suspect that even their lowest priced threadripper sold from this platform will still be significantly above break even in pricing.

When dealing with SRAM blocks at that scale, you could probably include some redundancy. If you need, let's say, 256 MB of memory, you actually build something like 300 MB; the extra 44 MB wouldn't be that much more expensive (and the actual backup amount would likely be much less anyway). That gives you headroom for bad blocks, and no matter what, you always fuse it down to 256 MB. That increases effective fully-functional yields, because you never expect or use a chip with more than 256 MB. I think this is why Starship was specced at 48C instead of the 64C we are going to see: AMD was planning to save face if 7 nm didn't turn out well by only using 6 of 8 cores per die to pump up yields, and only moved to 64C with Zen 2 (instead of waiting for Zen 3 and a more refined 7 nm) once actual yields looked great.
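The spare-block idea can be put into a toy binomial model (the block count, spare count, and per-block defect probability below are all made-up illustrative numbers):

```python
from math import comb

def sram_yield(n_blocks: int, n_spares: int, p_block_defect: float) -> float:
    """Probability that at most n_spares of (n_blocks + n_spares) SRAM
    blocks are defective, so the part still fuses down to n_blocks good
    blocks. Assumes independent per-block defects (binomial model)."""
    total = n_blocks + n_spares
    return sum(comb(total, k)
               * p_block_defect**k
               * (1 - p_block_defect)**(total - k)
               for k in range(n_spares + 1))

# 256 blocks needed (e.g. 1 MB each); 1% chance any given block is bad.
print(f"no spares: {sram_yield(256, 0, 0.01):.1%}")
print(f"8 spares:  {sram_yield(256, 8, 0.01):.1%}")
```

Even a handful of spare blocks takes the cache-array yield from hopeless to near-perfect, which is why fusing down is standard practice for large SRAM arrays.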
 

H T C

Senior member
Nov 7, 2018
549
395
136
At 450 mm², the IO die is too big for anything other than Epyc, even on 14 nm.

Actually ...

That's a GPU size, so easy to fab and can be harvested for TR as needed (detailed in above post). TR sales are probably the least of Zen die.

What you're talking about (2 IO die), entails a new package for the lowest selling variant, whereas the present Rome package can be reused for TR if the same IO is harvested.

The difference between the TR and Epyc sockets is minimal, other than the RAM connectivity: TR is quad-channel while Epyc is octa-channel. This means the exact same IO chiplet design could be used for both, but while TR's can in theory have parts fused off (whether defective or not), Epyc's IO chiplet needs to be fully functional.
 

Mopetar

Diamond Member
Jan 31, 2011
7,797
5,899
136
The difference between the TR and Epyc sockets is minimal, other than the RAM connectivity: TR is quad-channel while Epyc is octa-channel. This means the exact same IO chiplet design could be used for both, but while TR's can in theory have parts fused off (whether defective or not), Epyc's IO chiplet needs to be fully functional.

The alternative is that they make a different IO die for Threadripper and Ryzen that only supports up to quad-channel RAM. Even though it's a large die, it's unlikely that on a mature process they'll wind up with enough naturally defective dies to bin down. The only reason not to is that they need to hit a certain number of wafers at GF under the WSA terms, and making loads of this monster IO die is an easy way of doing it. Still seems like a bit of a waste, though.

Did they ever mention if the IO die included cache? I can't imagine it being that large with just the memory interfaces and the infinity fabric interconnects. If that were the case, there could be some advantage to using just the one design as having a really fat cache would be great for a lot of applications.
 

krumme

Diamond Member
Oct 9, 2009
5,952
1,585
136
Please.

Physical transmission distances make up a very, very small fraction of overall latency. It's virtually all in the signal processing, not the signal transmission.
You can beat on the I/O Controller for possibly increasing latency, but don't use propagation delay as the reason.

(Appreciable) Delays will be the result of the clock rate of the infinity fabric and the internal speed of the memory controller.


You have a transmission delay of around 1ns for every 15 cm travelled. So the extra latency added (due to transmission route lengths) from deviating from core, through an on-socket memory controller rather than directly from core-located memory controller will be measured in pico-seconds... if not femto-seconds.

Explain this to me. And I am no engineer :)
In my world, signal length influences signal processing and you can't separate the two. Protocols, PLLs, what not.
Another factor: when you go from 100 MHz to, say, 3000 MHz off-die, you put enormously more stress on signal integrity. It's not just 0s and 1s being sent here; that's just the interpretation. It's an analogue signal, and signal-to-noise becomes a huge concern. Error correction, what not. Different filters. Takes time.

That said, I am positively sure the new chiplet approach is outright brilliant.
It will show that the old full integration was more about being smart and making an elegant solution from a technical perspective than about what is, overall, the better approach.
 

Atari2600

Golden Member
Nov 22, 2016
1,409
1,655
136
In my world, signal length influences signal processing and you can't separate the two. Protocols, PLLs, what not.

That will be in the spec issued from CPU manufacturer to motherboard manufacturer - the signal degradation is a design factor.

Impedance and its cousin, resistance, are carefully controlled so that signals are coherent when they reach their destination. It's much more complicated than just etching a few traces on some PCB!
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Please.

Physical transmission distances make up a very, very small fraction of overall latency. It's virtually all in the signal processing, not the signal transmission.

You can beat on the I/O Controller for possibly increasing latency, but don't use propagation delay as the reason.

That is, ultimately, down to distance. When the controller moves on-package or on-die, they don't just put it there; they take advantage of the move by changing the way the signalling is processed to optimize for it. Being closer allows you to do things you can't do otherwise.

In rare cases they do leave it alone: despite integrating the memory controller on the earlier Atoms, they kept using the FSB protocol. But on Silvermont they used the much faster internal interconnect to speed it up.

At the very least, AMD should be able to do no worse than what Intel managed with an off-package controller (e.g. at least a ~5-10 ns improvement in AIDA64 over gen-1).

Core 2 had very low latency in some testing applications, but in others AMD had better latency. Core 2's prefetchers and architectural changes allowed hiding latency in some tests, but not all. AnandTech had a test about that, where Nehalem was 60% faster than Core 2: https://www.anandtech.com/show/2045/5

I agree. I don't think that a hypothetical 20% reduction in latency would yield anywhere near a 20% improvement in performance. The gains are clearly more marginal as latencies improve.

I don't think so either, but clearly, all things equal, integration is the faster route, as we've seen many times. They can do it differently and offset the penalty for sure, but they would be able to do even better if integrated. Core 2 was good, but Nehalem did it a lot better. AMD gave up its low-power cores and only has one architecture to worry about, so I can't see why it's such a big deal to also make a monolithic die on the same uarch.
 
Last edited:

JDG1980

Golden Member
Jul 18, 2013
1,663
570
136
At 450 mm², the IO die is too big for anything other than Epyc, even on 14 nm.

I don't see why it would be a problem for Threadripper. The marginal cost of these dies is virtually zero (because the WSA requires AMD to pay GloFo for wafers anyway) and TR is a low-volume product. Besides, they can use salvaged dies if needed.

For AM4 Ryzen products, there are two possibilities. One would be to create a separate, monolithic die with 8 CPU cores, integrated I/O, and an iGPU (probably 16 CUs or so). This could then be sold in various configurations. The other possibility would be to create a smaller 14nm I/O die that is basically 1/4 of the Epyc I/O die, and incorporate a single 7nm CPU chiplet, plus a GPU chiplet. This would create a Crystalwell-like configuration with the I/O's L4 cache, which could dramatically increase GPU performance. The question is whether this would be too complicated or expensive for low-end systems. Or maybe they'll just leave low and midrange laptops on Raven Ridge until 7nm becomes cheaper down the road.
 

amd6502

Senior member
Apr 21, 2017
971
360
136
I think it would be a lot smaller than 1/4 of Epyc's, maybe just 40 mm² worth of memory interface and Infinity Fabric. So you end up with something like a 100 mm² die and get to pack in a lot of little extras (all of which can be disabled for certain configurations), like a small iGPU.
 

beginner99

Diamond Member
Jun 2, 2009
5,208
1,580
136
The other possibility would be to create a smaller 14nm I/O die that is basically 1/4 of the Epyc I/O die, and incorporate a single 7nm CPU chiplet, plus a GPU chiplet.

Forget the GPU. That won't happen. They are selling Ryzen well enough without a GPU, so why invest the additional money?

But I agree with the rest. The Rome IO die can easily be used for TR; it doesn't make sense to make another die for TR, and the huge cache can be a selling point over desktop Ryzen. In fact, TR3 might actually be attractive for gamers too, as all the NUMA issues should be gone, and we saw with Broadwell (the 5775C, or whatever it was called) that the L4 helped a lot in some games, letting it beat the 4790K.

I mean, do we really expect a huge revolution in single-threaded CPU performance? Especially with process costs rising more and more? A 32-core Rome-based Threadripper could easily last a decade. Strange times. Same for GPUs: the real winners were everyone buying the 1000 series, especially the 1080 and 1080 Ti early on. Future-proofing can actually be worth it nowadays. (The only problem with TR3 is that it won't have DDR5, so maybe this makes more sense with TR4.)
 

NTMBK

Lifer
Nov 14, 2011
10,208
4,940
136
Forget the GPU. That won't happen. They are selling Ryzen well enough without a GPU, so why invest the additional money?

But I agree with the rest. The Rome IO die can easily be used for TR; it doesn't make sense to make another die for TR, and the huge cache can be a selling point over desktop Ryzen. In fact, TR3 might actually be attractive for gamers too, as all the NUMA issues should be gone, and we saw with Broadwell (the 5775C, or whatever it was called) that the L4 helped a lot in some games, letting it beat the 4790K.

I mean, do we really expect a huge revolution in single-threaded CPU performance? Especially with process costs rising more and more? A 32-core Rome-based Threadripper could easily last a decade. Strange times. Same for GPUs: the real winners were everyone buying the 1000 series, especially the 1080 and 1080 Ti early on. Future-proofing can actually be worth it nowadays. (The only problem with TR3 is that it won't have DDR5, so maybe this makes more sense with TR4.)

A small GPU just to drive the display for office work would open up more markets to them: the massive pile of OEM office PCs that all ship with Intel integrated graphics. Even a 3-CU part would do the job.
 

naukkis

Senior member
Jun 5, 2002
701
569
136
Forget the GPU. That won't happen. They are selling Ryzen well enough without a GPU, so why invest the additional money?

The IO chip has to be quite big to fit all the needed IO connections anyway. They can either fill that die space with L4 cache, or with an iGPU like Intel did with their northbridge chipsets. That way they can have a Raven Ridge-like iGPU essentially for free, so it's pretty obvious that they would implement it.