【der8auer】Threadripper 2990X Preview - aka EPYC 7601 overclocking


PeterScott

Platinum Member
Jul 7, 2017
2,605
1,540
136
I'll explain it simply since you seem hell-bent on not understanding basic things.

Let's assume that the socket's pins 1 to 256 are used for the RAM signals.

In TR1 the organic (or ceramic) package substrate below the two dies has 2 x 64 copper traces that run from each die to the relevant pins. All they need to do is route 64 pins to each of the four dies; seen from the motherboard, which IMCs are actually connected is invisible: the first die would have a single IMC connected to pins 1-64, the second die to pins 65-128, and so on...

There's really no rocket science here...

Yes, but that is 128+ traces that would have to be moved to non-optimal locations, which is obviously re-wiring the package. You claimed it was "wired the same way", which it clearly would NOT be. When you start out stating incorrect things, the opposite of reality actually, your clarity is less than mud.

And beyond that you have the issues I stated, with the loss of manufacturing flexibility.

Time will tell which option AMD chose. Manufacturing simplicity and flexibility, or higher performance. Don't underestimate the importance of manufacturing flexibility.
 

moinmoin

Diamond Member
Jun 1, 2017
4,933
7,619
136
The whole discussion rides on the assumption that the binned dies for Threadripper are the same as for Epyc, and that the decision whether to use a completed package as the former or the latter is made as one of the last stages. Personally I think that's already wrong, as the dies for Threadripper are binned for absolute performance and overclockability (so above 1800X and 2700X), whereas I expect the dies for Epyc to be binned for power efficiency in the sweet-spot frequency range. Never mind that there are no Zen+/12nm based Epyc chips anyway.
 

Abwx

Lifer
Apr 2, 2011
10,847
3,297
136
Yes, but that is 128+ traces that would have to be moved to non-optimal locations.

The pins' function on the socket will be the same, apart from being distributed across four dies.

Besides, there's no such thing as an optimal location. If you look closely at a motherboard's RAM traces you'll notice that they don't use the shortest physical path and that there are some U-shaped sections in the traces. This is because the paths are designed as transmission lines, and those shapes are actually inductances that compensate for the capacitive part of the trace so that the IMC sees the RAM impedance as a purely resistive load. FTR, the most common transmission lines known and used by the general public are coaxial cables...
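A minimal sketch of the underlying idea, with made-up per-unit-length values rather than anything measured from a real board: for a lossless line the characteristic impedance Z0 = sqrt(L/C) is purely resistive, which is why adding series inductance can offset a trace's capacitance.

```python
import math

# Illustrative per-unit-length values for a PCB trace (assumed, not measured)
L_per_m = 3.0e-7   # series inductance in henries per metre
C_per_m = 1.2e-10  # shunt capacitance in farads per metre

# For a lossless transmission line the characteristic impedance is sqrt(L/C),
# a purely real (resistive) value, so a matched IMC sees a resistive load.
Z0 = math.sqrt(L_per_m / C_per_m)
print(f"Z0 = {Z0:.0f} ohms")  # 50 ohms with the values above
```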
 

PeterScott

Platinum Member
Jul 7, 2017
2,605
1,540
136
The pins' function on the socket will be the same, apart from being distributed across four dies.

Besides, there's no such thing as an optimal location. If you look closely at a motherboard's RAM traces you'll notice that they don't use the shortest physical path and that there are some U-shaped sections in the traces. This is because the paths are designed as transmission lines, and those shapes are actually inductances that compensate for the capacitive part of the trace so that the IMC sees the RAM impedance as a purely resistive load. FTR, the most common transmission lines known and used by the general public are coaxial cables...

There certainly are designs that are significantly more, and significantly less, optimal than others. A lot of time is spent optimizing routing.

If you try to cross over 128+ traces to locations that are much more difficult to reach, you can very quickly run into significant routing issues, which may force you to add another layer to the PCB to resolve them.

Edit: I just noticed another thread is already discussing the 2,2,0,0 and 1,1,1,1 MC options for TR:
https://forums.anandtech.com/thread...-top-tdp-of-250w.2547899/page-5#post-39453436

Most of this thread should probably have just been in that thread.
 

Abwx

Lifer
Apr 2, 2011
10,847
3,297
136
If you try to cross over 128+ traces to locations that are much more difficult to reach, you can very quickly run into significant routing issues, which may force you to add another layer to the PCB to resolve them.

That's still half of an Epyc's interface trace count, so certainly much easier to route.

Btw, even if they use a TR2-dedicated routing we are still talking about 1-2 cents of extra cost; seems to me that the €1500 expected retail price still allows for some decent margin...
 

PeterScott

Platinum Member
Jul 7, 2017
2,605
1,540
136
That's still half of an Epyc's interface trace count, so certainly much easier to route.

Btw, even if they use a TR2-dedicated routing we are still talking about 1-2 cents of extra cost; seems to me that the €1500 expected retail price still allows for some decent margin...

You keep seeing this as completely one-sided, when there are two viable options.

You are ignoring the benefit of manufacturing flexibility that having the same package for Epyc, TR (2 dies active) and TR (4 dies) provides.

You are also ignoring that memory channel connections of 2,2,0,0 offer performance advantages in some scenarios.

Such as any time you are running with 16 cores or fewer, you still get full-speed dual-channel memory to each die. In the 1,1,1,1 config you are only getting half the bandwidth in those scenarios.
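For a rough sense of scale, a back-of-the-envelope sketch (assumed DDR4-2933 speed, ignoring Infinity Fabric overhead and real-world efficiency losses):

```python
# Back-of-the-envelope sketch; assumptions: DDR4-2933, 64-bit channels,
# no Infinity Fabric overhead or efficiency losses.
transfers_per_s = 2933e6       # per channel
bytes_per_transfer = 8
channel_gbs = transfers_per_s * bytes_per_transfer / 1e9   # ~23.5 GB/s

# 2,2,0,0: a <=16-core load sits on the two memory-attached dies,
# each of which keeps two local channels.
local_per_die_2200 = 2 * channel_gbs

# 1,1,1,1: each die has a single local channel; the rest of its bandwidth
# has to come over Infinity Fabric from the other dies.
local_per_die_1111 = 1 * channel_gbs

print(f"per channel:      {channel_gbs:.1f} GB/s")
print(f"2,2,0,0 per die:  {local_per_die_2200:.1f} GB/s local")
print(f"1,1,1,1 per die:  {local_per_die_1111:.1f} GB/s local")
```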

If you are getting 32 cores for some full 32-core, server-type load, you should probably just get Epyc, but workstation/HEDT users may have a lot more mixed loads that might benefit from superior 16-core performance at the expense of slightly lower 32-core performance. One synthetic benchmark isn't enough to really quantify those differences.

It's too bad der8auer didn't test 2,2,0,0 vs 1,1,1,1 in a bunch of scenarios.
 

eek2121

Platinum Member
Aug 2, 2005
2,904
3,906
136
You guys are over-complicating this WAY too much. The only 'rewiring' that needs to be done is adding the additional dies in place of the dummy dies. No modification needs to be made to the design of the chip itself. Requests for memory that the local controller doesn't manage are sent over Infinity Fabric.

AMD can choose to optimize this a bit, or they can choose to just slap 4 dies together and call it a day. The advantage of #2 is that almost no R&D is involved: they can make the chips for $300-$350 a pop, sell them to distributors for $600-$700 a pop, and let the distributors deal with the rest. They can also bin all the way down to the lowest Ryzen tier without overly complicating things. That is the approach they will take.

They could have released Threadripper back in April, but they wanted to bin higher-quality dies (each die must have a TDP of 65 watts or less), and they just released the 19xx series last year around this time.

For me personally it's going to come down to the trade-offs between the 2990X and the 2950X. If the 2950X clocks higher for single-core workloads, I'll go with that. 16 cores do the job just fine for what I do, and it games just fine... better than review sites make it out to be (because they don't understand BIOS options).
 

maddie

Diamond Member
Jul 18, 2010
4,722
4,626
136

"You are also ignoring that memory channel connections of 2,2,0,0 offer performance advantages in some scenarios

Such as any time you are running with 16 cores or fewer, you still get full-speed dual-channel memory to each die. In the 1,1,1,1 config you are only getting half the bandwidth in those scenarios."



I've seen this argument before without any explanation of the practicality of it all. Sounds so easy, but...

Can you explain to me how software could do this easily, as in not being specifically tuned for 32C TR2 in each application?
Will it depend on an OS or microcode automating this?
Will a program have to treat each core as unique, or at least half of them?
 

Bouowmx

Golden Member
Nov 13, 2016
1,138
550
146
Can you explain to me how software could do this easily, as in not being specifically tuned for 32C TR2 in each application?
Will it depend on an OS or microcode automating this?
Will a program have to treat each core as unique, or at least half of them?

An update to the operating system scheduler, like with Intel Turbo Boost Max Technology 3.0: a Windows driver or a Linux kernel patch (sched_itmt_enabled).
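For illustration only, a minimal Linux-side sketch using the standard procfs/sysfs paths; whether an ITMT-style priority knob would apply to TR2 at all is pure speculation:

```python
from pathlib import Path

def read(path):
    """Return stripped file contents, or None if the file doesn't exist."""
    p = Path(path)
    return p.read_text().strip() if p.exists() else None

# Scheduler knob added for Intel Turbo Boost Max 3.0; it only exists on
# kernels built with ITMT support and on CPUs exposing per-core priorities.
print("sched_itmt_enabled:", read("/proc/sys/kernel/sched_itmt_enabled"))

# NUMA nodes the kernel sees. On a hypothetical 2,2,0,0 TR2 two of these
# nodes would be CPU-only, which is exactly what a scheduler update would
# need to take into account.
for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    print(node.name, "cpus:", read(node / "cpulist"))
```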
 

PeterScott

Platinum Member
Jul 7, 2017
2,605
1,540
136
An update to the operating system scheduler, like with Intel Turbo Boost Max Technology 3.0: a Windows driver or a Linux kernel patch (sched_itmt_enabled).

You could also have something like game mode that temporarily disables the dies without memory controllers.
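A hedged sketch of what such a mode could look like in software on Linux, pinning the current process to CPUs on NUMA nodes that actually report local memory (purely illustrative, not anything AMD has announced):

```python
import os
from pathlib import Path

def parse_cpulist(text):
    """Expand a sysfs cpulist such as '0-7,16-23' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return cpus

def cpus_with_local_memory():
    """CPUs belonging to NUMA nodes that report non-zero local memory."""
    cpus = set()
    for node in Path("/sys/devices/system/node").glob("node[0-9]*"):
        meminfo = (node / "meminfo").read_text()
        mem_total_kb = next(int(line.split()[-2])
                            for line in meminfo.splitlines()
                            if "MemTotal" in line)
        if mem_total_kb > 0:
            cpus |= parse_cpulist((node / "cpulist").read_text())
    return cpus

# Pin this process (and future children) to the dies that own a memory
# channel; roughly what a software "game mode" toggle could do.
os.sched_setaffinity(0, cpus_with_local_memory())
print("running on CPUs:", sorted(os.sched_getaffinity(0)))
```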
 

Gideon

Golden Member
Nov 27, 2007
1,608
3,570
136
You could also have something like game mode that temporarily disables the dies without memory controllers.
Or they could just load up 1 CCX per die first; memory BW and latency would be the same. The only thing that would be worse is the inter-die latency vs the inter-CCX latency. And considering how awful the inter-CCX latency on Ryzen already is (nearly twice as bad as memory latency on good DIMMs), it won't really matter; it already sucks.

Let's not forget that this is a company that uses *interposers* for niche products. A few cents of extra cost and some manufacturing complexity for a >$1000 product look like a no-brainer to me.

I'm not saying it will happen for sure, but constantly dissing this idea is IMO rather stupid. Why are they wasting so much time validating and releasing the damn thing otherwise? They already have a Ryzen 2xxx series and the original Threadripper; it should have been a drop-in replacement then.
 

PeterScott

Platinum Member
Jul 7, 2017
2,605
1,540
136
I'm not saying it will happen for sure, but constantly dissing this idea is IMO rather stupid.

Don't start throwing around "stupid", particularly when you don't have your facts straight about what I wrote.

I have just been pointing out there are TWO options, with pros/cons for each, and stating we have no idea which will be chosen.
 

Gideon

Golden Member
Nov 27, 2007
1,608
3,570
136
Don't start throwing around "stupid", particularly when you don't have your facts straight about what I wrote.

I have just been pointing out there are TWO options, with pros/cons for each, and stating we have no idea which will be chosen.
Sorry, 'stupid' was indeed an over-exaggeration; all I was trying to say is that totally ruling out the other solution doesn't seem right. I also get your point that you think two dies having no connection to memory is by far the most likely option (after all, Ian pointed this out first as the most likely version).

IMO you were just a bit too one-sided: increasingly pointing out faults of the all-dies-connected solution (even those that aren't all that relevant) and mentioning strengths of the other that IMO aren't really there. How is the 2+2+0+0 significantly better (other than economics) than 1+1+1+1, if 1 CCX per die is loaded first?

This is how bad the cross-CCX latency of the chips is (courtesy of The Stilt, here). It's already twice as bad as memory latency (yet still isn't that big of a problem even on desktop Ryzen).
[Image: Ryzen cross-CCX latency chart, courtesy of The Stilt]


I seriously doubt the additional (~100 ns) inter-die latency could really change things all that much, at least compared to how much performance would tank if all the memory accesses were routed through just 2 dies.

[Image: AMD Epyc Infinity Fabric DDR4-2666 idle latencies in ns]
 

beginner99

Diamond Member
Jun 2, 2009
5,208
1,580
136
Never mind that there are no Zen+/12nm based Epyc chips anyway.

Fair point. But do we know the refresh is actually based on Zen+?

If they go with 1 channel per die, then all TR2s would need 4 active dies which means the core count config is very inflexible:

8, 16, 24 and 32 cores would be the only possibilities. Especially the gap between 8 and 16 is simply too big to price them right.

So we either get 2 dies with indirect access or 2 types of packages.

Can someone explain how CPU packaging is done? I tried Google but wasn't successful. What would be the implications of having 2 packages, especially in cost?

To me it sounds rather unlikely they would go to all this trouble for 2 SKUs (you can only have a 24- and a 32-core version above 16 if all 4 dies are active). Doesn't look very likely. 2 indirect dies seem much more logical.
 

naukkis

Senior member
Jun 5, 2002
701
569
136
You keep seeing this as completely one-sided, when there are two viable options.

You are ignoring the benefit of manufacturing flexibility that having the same package for Epyc, TR (2 dies active) and TR (4 dies) provides.

Epyc won't even use the same dies as TR2.

You are also ignoring that memory channel connections of 2,2,0,0 offer performance advantages in some scenarios.

This is the main point against the 2,2,0,0 configuration: it's a unique design, whereas the 1+1+1+1 configuration is the same NUMA arrangement used for Epyc and the rest of the multi-socket CPUs. Will they really want to release a one-of-a-kind CPU which needs special OS support?

Such as any time you are running with 16 cores or fewer, you still get full-speed dual-channel memory to each die. In the 1,1,1,1 config you are only getting half the bandwidth in those scenarios.

This is wrong. The NUMA scheduler balances the load evenly, so software can use the full cache, full power and full memory bandwidth of the CPU. The only cases where anyone would want to load just one die of a multi-die system are the minority of cases where many threads share the same data, but those are loads where TR sucks anyway.

And even with the 1+1+1+1 configuration you aren't losing half the bandwidth by loading only one die; a second die can provide almost the same bandwidth as directly connected memory when using an interleaved memory model.
 

Bouowmx

Golden Member
Nov 13, 2016
1,138
550
146
What sort of time-frame for this, rough idea?
In the case of Intel Turbo Boost Max Technology 3.0, which debuted with Broadwell-E (May 2016):
For Windows, I recall Intel provided the driver on day one at Intel's download center.
For Linux, see this news item: Phoronix. Patches began in August 2016.
 

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
Why is this so complicated? The line for TR2 is dedicated to TR2. Those chips aren't going to ever be EPYC chips. They aren't "recovered" Epyc chips. They are their own product. As far as has been released, there will be continued production of TR1 using existing 14nm production. There will also be TR2 using 12nm production. It will be assembled separately because of that. There is no practical reason that AMD could not choose to do a different MCM for TR2 that would still work with the existing DDR4 channels on the motherboards and still route the channels to different dies on the MCM. Looking at the way the DDR4 channels are distributed on the motherboard, there's no logistical reason that they couldn't come from each die. Looking at the AM4 pinout, the two channels appear to share no pins. I can't find a TR socket pinout, but it has the same memory controllers, so it should be the same way.

I will say, the MCM will likely need an additional layer or two to get those traces together on the package, though. Epyc packages have the channels spread out well enough not to have that issue. Given that a TR2 will need to work in a board designed for TR1, it needs to get those traces to those pins, and that's not going to happen on the same MCM with respect to layer count. In summary, it's definitely doable. It's also definitely not trivial. The "interesting" routing may adversely impact the ability of the memory to clock high enough.
 

PeterScott

Platinum Member
Jul 7, 2017
2,605
1,540
136
Sorry, 'stupid' was indeed an over-exaggeration; all I was trying to say is that totally ruling out the other solution doesn't seem right. I also get your point that you think two dies having no connection to memory is by far the most likely option (after all, Ian pointed this out first as the most likely version).

Your interpretation is wrong; I was close to 50:50 on which option is more likely, and I am open to new information. I am definitely not invested in either outcome.

How is the 2+2+0+0 significantly better (other than economics) than 1+1+1+1, if 1 CCX per die is loaded first?

Good point. It was late and I hadn't considered the option of disabling all the CCXs with no memory connected, or using them last. If that is possible, that definitely favors the 1+1+1+1 design and makes it the more likely option.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Good point. It was late and I hadn't considered the option of disabling all the CCXs with no memory connected, or using them last.

2,2,0,0 is better in that situation because in a UMA configuration only 50% of requests on average need to go to a different die and eat 100+ ns of extra latency, whereas in the 1,1,1,1 config only 25% of requests stay on the local die. That's the lightly loaded UMA scenario, where a modern OS is aware of the layout. Lightly loaded NUMA also benefits more from 2,2,0,0.

It is memory benchmarking and full-CPU loads with memory bandwidth requirements (like Linpack and friends) that break down horribly with the 2,2,0,0 configuration.

So it is a case of optimizing for the best case by doing nothing, or focusing on mitigating the worst case by investing in a new substrate, etc.
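Rough numbers to illustrate that trade-off (the local/remote latencies below are placeholders in the ballpark of the Epyc idle-latency chart posted above, not measurements):

```python
# Back-of-the-envelope average load latency under 4-channel UMA interleave.
# Latency figures are illustrative placeholders, not measurements.
LOCAL_NS = 90    # channel on the same die
REMOTE_NS = 200  # channel reached over Infinity Fabric on another die

# 2,2,0,0: a thread on a memory die has 2 of the 4 interleaved channels
# locally, so about half of its requests cross to the other memory die.
avg_2200 = 0.50 * LOCAL_NS + 0.50 * REMOTE_NS   # 145 ns

# 1,1,1,1: only 1 of the 4 channels is local, so 75% of requests go remote.
avg_1111 = 0.25 * LOCAL_NS + 0.75 * REMOTE_NS   # 172.5 ns

print(f"2,2,0,0 lightly loaded average: {avg_2200:.0f} ns")
print(f"1,1,1,1 lightly loaded average: {avg_1111:.0f} ns")
```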
 

PeterScott

Platinum Member
Jul 7, 2017
2,605
1,540
136
Why is this so complicated? The line for TR2 is dedicated to TR2. Those chips aren't going to ever be EPYC chips. They aren't "recovered" Epyc chips. They are their own product. As far as has been released, there will be continued production of TR1 using existing 14nm production. There will also be TR2 using 12nm production. It will be assembled separately because of that. There is no practical reason that AMD could not choose to do a different MCM for TR2 that would still work with the existing DDR4 channels on the motherboards and still route the channels to different dies on the MCM. Looking at the way the DDR4 channels are distributed on the motherboard, there's no logistical reason that they couldn't come from each die. Looking at the AM4 pinout, the two channels appear to share no pins. I can't find a TR socket pinout, but it has the same memory controllers, so it should be the same way.

I seriously doubt they are still producing GF "14nm" Ryzen dies. Why keep producing them on the old, unrefined process, when GF 12nm is really just a slight tweak of GF 14nm? Really it's 14nm+. There hasn't been a new tapeout; the "14nm" and "12nm" dies are identical in size. It's most likely they converted the Ryzen lines from "14nm" to "12nm" rather than starting parallel lines. It's a small tweak, not a major change.

I expect there are stockpiles of the old "14nm" Ryzen dies, but I expect every Ryzen die produced right now is "12nm".

If old stock runs down, expect an announcement of Epyc using "12nm".
 

maddie

Diamond Member
Jul 18, 2010
4,722
4,626
136
This is an amazing post with fantastic claims. It goes against everything known so far about the product line roadmap.

I understand your need to win every argument, but in the process, you should not descend into utter fantasy.

The last line especially is hilarious. Such a simple announcement.

[If old stock runs down, expect an announcement of Epyc using "12nm".]

[Image: AMD Rome/Milan (Zen 2/Zen 3) server roadmap slide]


[Image: AMD CPU architecture roadmap]
 

jpiniero

Lifer
Oct 1, 2010
14,509
5,159
136
I do think it's going to be 12 and 16 cores with two active dies (same as TR1), and then the 24- and 32-core models with 2+2+0+0. Rewiring it to work as 1+1+1+1 (if it's even possible) sounds like way too much work for such a low-volume product.

The ideal solution, of course, would be to make Threadripper models on SP3 and just disable ECC to segment them, which is pretty much what Intel is doing with the Super HEDT.
 

moinmoin

Diamond Member
Jun 1, 2017
4,933
7,619
136
Fair point. But do we know the refresh is actually based on Zen+?
Fair point as well; I don't think we got confirmation either way. Personally I'd be disappointed if TR2 were still based on Zen, without the Precision Boost 2 improvements introduced in Zen+ that should be very useful within the TDP limits.