AMD “Next Horizon Event" Thread

Dayman1225 · Nov 9, 2018

DrMrLordX said:
@NostaSeronx has already weighed in on the subject, but I'll add this as well:

https://forums.anandtech.com/thread...ies-stops-all-7nm-development.2553184/page-13

Read to the bottom of that page. Nosta thinks 2022-2023, while @Dayman1225 says 2020. Makes of it what you will.

I linked a source while the other just talked speculation

DrMrLordX · Nov 9, 2018

Dayman1225 said:
I linked a source while the other just talked speculation

And I typoed. Oops.

But yeah linking a source is a good idea. Regardless it looks like 12FDX is delayed until at least 2020. Maybe later because, you know, GlobalFoundries.

Gideon · Nov 9, 2018

Atari2600 said:
Please.

Physical transmission distances make up a very, very small fraction of overall latency. Its virtually all in signal processing, not the signal transmission.

You can beat on the I/O Controller for possibly increasing latency, but don't use propagation delay as the reason.

(Appreciable) Delays will be the result of the clock rate of the infinity fabric and the internal speed of the memory controller.

You have a transmission delay of around 1ns for every 15 cm travelled. So the extra latency added (due to transmission route lengths) from deviating from core, through an on-socket memory controller rather than directly from core-located memory controller will be measured in pico-seconds... if not femto-seconds.

Yeah, I can't understand why people keep bringing up the "distance" argument. I'm tired of refuting it constantl:

As I mentioned The Core 2 Duo had 800 MHz DDR2 with very similar transfer times to 3200 MHz DDR4, and can achieve ~65ns memory latency with a low 1066 MHz FSB.

Some really dumb MoBo schematics to get the point across:

On the Intel's motherboard (D975XBX) used in the review: The Memory Controller is on the 82975X Northbride chip just to the left of the socket.
So the red line represents the shortest path a hypothetical singal must travel - From the DIMM slots to the NB and from there onwards (via the FSB running at a lowly 1 GHz) to the processor.

Now look at the distance covered on a B450 motherboard:

The CPU is certainly closer than the Northbridge of old. And then from the IO chiplet the CPU Chiplet, you could measure it in millimeters (not centimeters as is the case with the northbridge).

Not only Zen 1 already has the Infinity Fabric clock-speed at 1.6 GHz in case of 3200MHz RAM. It will probably be higher with Zen 2 (as they mentioned they have substantially refined it).

Let me repeat that: Core 2 Has shown AIDA latencies between 60-70ns with memory with similar (actually a bit worse) transfer times to DDR4 3200MHz CL16 . ZEN 1 in comparison, can't really do that, not without much tighter timings.

Yet ... Hurr Durr ... AMD can't possibly improve the Latency with Zen2, if it's a chiplet designe. It's impossible due to "distance"!

TL;DR:
As Atari already mentioned. The whole distance argument is flawed from the start. Signal Processing is the bottleneck, not Transmission distances.

PeterScott · Nov 9, 2018

Gideon said:
Yeah, I can't understand why people keep bringing up the "distance" argument. I'm tired of refuting it constantl:

Agreed, It's not distance in this context.

It's on die, vs off die. Then it's on package, vs off package where you typically incur latency/power penalties.

All else being equal, you have lower latency and less power usage on die. Next best is on package, and worse is off package.

Optimal laptop solutions will be monolithic, and really so would desktops though the differences are less important on the desktop.

mattiasnyc · Nov 9, 2018

Arzachel said:
I am skeptical that lower main memory access latencies past a certain point translate into significant performance gains in desktop workloads including gaming.

That's probably true. Hopefully though we'll see some nice gains in some content creation where some work requires other work to complete first. In audio for example we have a large number of signals that all need to be processed at the same time, but they then also feed other processing in series. I think to an extent it may be similar for some image processing.

So I'm hopefully optimistic that Ryzen could improve for 'us' all things considered, and that Threadripper will at the very least improve the situation for the versions with compute dies by giving all chiplets the same delay.

LightningZ71 · Nov 9, 2018

H T C said:
What IS on a mature process is the IO chiplet.

This is the biggest of all the Epyc chiplets and the one that's supposed to have the lowest defect density. Any word on it's dimensions yet so we can use the Die Per Wafer Calculator to make some estimates?

Because of it's size, the yields will not be as good as with smaller sizes and, add to that, the chances of defects killing the chip entirely should be higher, as opposed to the CCX chiplets where a defect may kill part of it with the rest being fully active. However, the defect density should be far smaller and that should mitigate the yields issue.

Had AMD made the IO chiplet also in 7 nm could end up in disaster because of a much worse defect density aligned with the higher possibility whatever defects could kill the entire IO chiplet. Smart decision IMO to go with a very mature process for such a critical component.

I suspect that the I/O chiplet is actually mostly L4 cache. If that's true, then the probability is high that defects of that chip will happen in individual SRAM blocks. Those blocks can be fused/mapped out and the rest of the chip can still be usable in lower spec EPYC processors, or even Threadripper processors if those wind up being just low spec EPYC processors. The other large area portion of the chip will be DRAM channels. It is highly likely that I/O chiplets with defective DRAM channels (assuming that those channels are EPYC use only) can be reused in Threadripper parts. The third largest amount of real estate would be the IF links to the CPU chiplets. Again, those can be mixed and matched as needed (within certain constraints to be sure) to ALSO be reusable as threadripper parts.

With the design that they have, there is a lot of chance for recovering defective parts. I suspect that even their lowest priced threadripper sold from this platform will still be significantly above break even in pricing.

jpiniero · Nov 9, 2018

I am thinking that Threadripper will have two Ryzen IO dies.

exquisitechar · Nov 9, 2018

exquisitechar said:
IIRC Charlie Demerjian has said on Twitter that the turbo on a few cores will be significantly higher than before for Ryzen 3xxx CPUs. I wonder about final 64c Epyc 2 clocks, the one that they benchmarked probably wasn't clocked all that high.

https://twitter.com/CDemerjian/status/1035967242306039809
Got curious and I looked it up. I somewhat remembered wrong, he was talking about the base/turbo. Hope his prediction pans out.

Gideon · Nov 9, 2018

PeterScott said:
It's on die, vs off die. Then it's on package, vs off package where you typically incur latency/power penalties.

Precisely!

At the very least, AMD should be able to do no worse than what Intel managed with an off-package controller (e.g. at least a ~5-10 ns improvement in AIDA64 over gen-1). Higher CPU- , IF- and especially memory would probably help somewhat.

The first two of those should be a given (CPU and IF). The last one is needed anyway once DDR5 ships (starts from 4400 Mhz and goes all the way to 6400 Mhz). I really hope that with this generation they finally canned their subpar memory controller for good and can finally manage decent memory clocks.

H T C · Nov 9, 2018

jpiniero said:
I am thinking that Threadripper will have two Ryzen IO dies.

Care to elaborate?

I thought the whole idea was to have one IO chiplet across their whole lineup, from AM4 to Epyc and what may vary is the size of the IO chiplet because AMD may possibly use a smaller version of this IO chiplet for their AM4 platform.

However, i'm not so sure anymore because of this:

LightningZ71 said:
I suspect that the I/O chiplet is actually mostly L4 cache. If that's true, then the probability is high that defects of that chip will happen in individual SRAM blocks. Those blocks can be fused/mapped out and the rest of the chip can still be usable in lower spec EPYC processors, or even Threadripper processors if those wind up being just low spec EPYC processors. The other large area portion of the chip will be DRAM channels. It is highly likely that I/O chiplets with defective DRAM channels (assuming that those channels are EPYC use only) can be reused in Threadripper parts. The third largest amount of real estate would be the IF links to the CPU chiplets. Again, those can be mixed and matched as needed (within certain constraints to be sure) to ALSO be reusable as threadripper parts.

With the design that they have, there is a lot of chance for recovering defective parts. I suspect that even their lowest priced threadripper sold from this platform will still be significantly above break even in pricing.

Makes a lot of sense, assuming it can work like this.

jpiniero · Nov 9, 2018

H T C said:
Care to elaborate?

I thought the whole idea was to have one IO chiplet across their whole lineup, from AM4 to Epyc and what may vary is the size of the IO chiplet because AMD may possibly use a smaller version of this IO chiplet for their AM4 platform.

At 450mm2, the IO die is too much for anything other than Epyc, even on 14 nm.

Shivansps · Nov 9, 2018

If distance does not matter the 2990WX says hi.

maddie · Nov 9, 2018

jpiniero said:
At 450mm2, the IO die is too much for anything other than Epyc, even on 14 nm.

That's a GPU size, so easy to fab and can be harvested for TR as needed (detailed in above post). TR sales are probably the least of Zen die.

What you're talking about (2 IO die), entails a new package for the lowest selling variant, whereas the present Rome package can be reused for TR if the same IO is harvested.

Topweasel · Nov 9, 2018

LightningZ71 said:
I suspect that the I/O chiplet is actually mostly L4 cache. If that's true, then the probability is high that defects of that chip will happen in individual SRAM blocks. Those blocks can be fused/mapped out and the rest of the chip can still be usable in lower spec EPYC processors, or even Threadripper processors if those wind up being just low spec EPYC processors. The other large area portion of the chip will be DRAM channels. It is highly likely that I/O chiplets with defective DRAM channels (assuming that those channels are EPYC use only) can be reused in Threadripper parts. The third largest amount of real estate would be the IF links to the CPU chiplets. Again, those can be mixed and matched as needed (within certain constraints to be sure) to ALSO be reusable as threadripper parts.

With the design that they have, there is a lot of chance for recovering defective parts. I suspect that even their lowest priced threadripper sold from this platform will still be significantly above break even in pricing.

When dealing with SRAM blocks in that big of a setting you could probably include some redundancy. If you are doing lets say 256 MB of memory you really do something like 300MB the extra 44MB wouldn't be that much more expensive (and actual backup amount would be much less anyways). Giving yourself some space for bad blocks. No matter the situation you would always fuse it down to 256. That would increase theoretical fully functional yields because you never are expecting or using a chip with more than 256MB. I think this is why Starship was specced at 48C instead of the 64C we are going to see. I think AMD was planning on saving face if 7nm didn't turn out well to start with by only using 6 of 8 cores per die to pump up yields and only when actual yields right away looked great moving to 64C with Zen 2 instead of waiting for Zen 3 and more refinement on 7nm.

H T C · Nov 9, 2018

jpiniero said:
At 450mm2, the IO die is too much for anything other than Epyc, even on 14 nm.

Actually ...

maddie said:
That's a GPU size, so easy to fab and can be harvested for TR as needed (detailed in above post). TR sales are probably the least of Zen die.

What you're talking about (2 IO die), entails a new package for the lowest selling variant, whereas the present Rome package can be reused for TR if the same IO is harvested.

The difference between the TR and Epyc sockets is minimal, other then the RAM connectivity: TR is quad channel while Epyc is octo channel. This means the exact same IO chiplet design could be used for both but, while TR's can in theory have parts fused off (if not defective), Epyc's IO chiplet needs to be fully functional.

Mopetar · Nov 9, 2018

H T C said:
The difference between the TR and Epyc sockets is minimal, other then the RAM connectivity: TR is quad channel while Epyc is octo channel. This means the exact same IO chiplet design could be used for both but, while TR's can in theory have parts fused off (if not defective), Epyc's IO chiplet needs to be fully functional.

The alternative is that they make a different IO die for Threadripper and Ryzen that only supports up to quad channel RAM. Even though it's a large die, it's unlikely that on a mature process they'll wind up with so many that are naturally defective, but can be binned down. The only reason not to is that they need to hit a certain number of wafers at GF as per the WSA terms and making loads of this monster IO die is just an easy way of doing it. Still seems like a bit of a waste though.

Did they ever mention if the IO die included cache? I can't imagine it being that large with just the memory interfaces and the infinity fabric interconnects. If that were the case, there could be some advantage to using just the one design as having a really fat cache would be great for a lot of applications.

krumme · Nov 9, 2018

Atari2600 said:
Please.

Physical transmission distances make up a very, very small fraction of overall latency. Its virtually all in signal processing, not theignal transmission.
You can beat on the I/O Controller for possibly increasing latency, but don't use propagation delay as the reason.

(Appreciable) Delays will be the result of the clock rate of the infinity fabric and the internal speed of the memory controller.

You have a transmission delay of around 1ns for every 15 cm travelled. So the extra latency added (due to transmission route lengths) from deviating from core, through an on-socket memory controller rather than directly from core-located memory controller will be measured in pico-seconds... if not femto-seconds.

Explain this to me. And I am no engineer

In my world signal length influence signal processing and you can't separate it. Protocols. Pll what not.
Another factor when you go from 100MHz to says 3000MHz off die you put enormous more stress on the signal integrity. Its just not 0 and 1 being send here. That's just the interpretation. Its an analogue signal and signal to noise becomes a huge concern. Error correction what not. Different filters. Takes time.

That said I am positively sure the new way to do it with chiplets is outright brilliant.
It will prove the old integration is more about being smart and make an elegant solution from a technical perspective than what is overall a better approach.

Despoiler · Nov 9, 2018

Rome is exactly what we said it was in July, a monster with nine die, eight 8C CCXs on 7nm, and one IOX built on 14nm.

https://semiaccurate.com/2018/11/09/amds-rome-is-indeed-a-monster/

Atari2600 · Nov 9, 2018

krumme said:
In my world signal length influence signal processing and you can't separate it. Protocols. Pll what not.

That will be in the spec issued from CPU manufacturer to motherboard manufacturer - the signal degradation is a design factor.

Impedance and its cousin, resistance, are carefully controlled so that signals are coherent when they reach their destination. Its much more complicated than just etching a few traces on some PCB!

IntelUser2000 · Nov 9, 2018

Atari2600 said:
Please.

Physical transmission distances make up a very, very small fraction of overall latency. Its virtually all in signal processing, not the signal transmission.

You can beat on the I/O Controller for possibly increasing latency, but don't use propagation delay as the reason.

That is, ultimately due to the distance. When it goes to on-package, or on-die, they don't just put it there, they take advantage of it by changing the way its processed to optimize for it. Being closer allows you to do things you can't do otherwise.

Rare times, they do leave it alone. Like despite integrating the memory controller with the earlier Atoms, they used the FSB. But on Silvermont they used the much faster internal interconnect to speed it up.

At the very least, AMD should be able to do no worse than what Intel managed with an off-package controller (e.g. at least a ~5-10 ns improvement in AIDA64 over gen-1).

Core 2 had very low latency in some testing applications, but in others AMD had better latency. Core 2's prefetchers and architectural changes allowed hiding latency in some testing, but not all. Anandtech had a test about that and Nehalem was 60% faster than Core 2. https://www.anandtech.com/show/2045/5

I agree. I don't think that a hypothetical 20% reduction in latency would yield anywhere near a 20% improvement in performance. The gains are clearly more marginal as latencies improve.

I don't think so either, but clearly all things equal integration is the faster route as we've seen many times. They can do it better and offset it for sure, but they will be able to do even better if integrated. Core 2 was good, but Nehalem did it a lot better. AMD gave up the low power cores and only has one architecture to worry about. I can't see why its such a big deal to make a different die that's monolithic on the same uarch.

JDG1980 · Nov 9, 2018

jpiniero said:
At 450mm2, the IO die is too much for anything other than Epyc, even on 14 nm.

I don't see why it would be a problem for Threadripper. The marginal cost of these is virtually zero (because of the WSA that requires AMD to pay GloFo for wafers anyway) and TR is a low-volume product. Besides, they can use salvage dice if needed.

For AM4 Ryzen products, there are two possibilities. One would be to create a separate, monolithic die with 8 CPU cores, integrated I/O, and an iGPU (probably 16 CUs or so). This could then be sold in various configurations. The other possibility would be to create a smaller 14nm I/O die that is basically 1/4 of the Epyc I/O die, and incorporate a single 7nm CPU chiplet, plus a GPU chiplet. This would create a Crystalwell-like configuration with the I/O's L4 cache, which could dramatically increase GPU performance. The question is whether this would be too complicated or expensive for low-end systems. Or maybe they'll just leave low and midrange laptops on Raven Ridge until 7nm becomes cheaper down the road.

amd6502 · Nov 10, 2018

I think it would be a lot smaller than 1/4 of Epyc's. Maybe just 40mm2 worth of memory interface and inf fabric. So you end up with like a 100mm2 die and get to pack in a lot of little extras (all of which you can disable for certain other configuration) like a small iGPU.

beginner99 · Nov 10, 2018

JDG1980 said:
The other possibility would be to create a smaller 14nm I/O die that is basically 1/4 of the Epyc I/O die, and incorporate a single 7nm CPU chiplet, plus a GPU chiplet.

Forget the GPU. That won't happen. They are selling ryzen well enough without a gpu so why invest that additional money?

But agree with the rest. Rome IO die can easily be used for TR. Doesn't make sense to make another die for TR and the huge cache can be a selling argument over desktop Ryzen. In fact TR3 might actually be also attractive for gamers as all the NUMA issues should be gone and we saw with broadwell 5775k or how it was called that the L4 helped a lot in some games making it beat the 4790k.

I mean do we really expect a huge revolution in single threaded cpu performance? Especially with the process cost rising more and more? A 32-core thread-ripper rome based could easily last 1 decade. Strange times. Same for gpus.The real winners waw everyone one buying 1000-series especially 1080 and 1080 ti early on. Future proofing can actually be worth it nowadays. (Only problem with TR3 is that it won't have ddr5 so maybe this makes more sense with tr4).

NTMBK · Nov 10, 2018

beginner99 said:
Forget the GPU. That won't happen. They are selling ryzen well enough without a gpu so why invest that additional money?

But agree with the rest. Rome IO die can easily be used for TR. Doesn't make sense to make another die for TR and the huge cache can be a selling argument over desktop Ryzen. In fact TR3 might actually be also attractive for gamers as all the NUMA issues should be gone and we saw with broadwell 5775k or how it was called that the L4 helped a lot in some games making it beat the 4790k.

I mean do we really expect a huge revolution in single threaded cpu performance? Especially with the process cost rising more and more? A 32-core thread-ripper rome based could easily last 1 decade. Strange times. Same for gpus.The real winners waw everyone one buying 1000-series especially 1080 and 1080 ti early on. Future proofing can actually be worth it nowadays. (Only problem with TR3 is that it won't have ddr5 so maybe this makes more sense with tr4).

A small GPU to just drive the display for office work would open up more markets to them- massive piles of OEM office PCs, that all ship with Intel integrated graphics. Even a 3CU thing would do the job.

naukkis · Nov 10, 2018

beginner99 said:
Forget the GPU. That won't happen. They are selling ryzen well enough without a gpu so why invest that additional money?

IO-chip has to be quite big to fit all needed IO-connections. They either fill that chip space with L4-cache, or with iGPU like Intel did with their northbridge chipsets. So they can have Raven-ridge like iGPU for free so it's pretty obvious that they would implement it.

AMD “Next Horizon Event" Thread

Golden Member

Lifer

Platinum Member

Platinum Member

Senior member

Platinum Member

Lifer

Senior member

Platinum Member

Senior member

Lifer

Diamond Member

Diamond Member

Diamond Member

Senior member

Diamond Member

Diamond Member

Golden Member

Golden Member

Elite Member

Golden Member

Senior member

Diamond Member

Lifer

Golden Member