【der8auer】Threadripper 2990X Preview - aka EPYC 7601 overclocking

Page 2 - AnandTech Forums

maddie

Platinum Member
Jul 18, 2010
2,220
89
136
#26
Epyc already has all (8) memory channels wired into the package base. Can we assume that TR uses the same package as Epyc, but simply ignores the connectors that enable the absent memory controllers? I've never read anything detailed on this, but it makes the most sense.

The problem is that in TR sockets they go to the (2) dies, but you would need them to go to (4) dies for TR2 while using the original pins on the socket. The only way I see is to modify the package, or...

Assuming that there was a long lead time in these designs, can we say that TR2 was planned quite a while ago, even before TR was released? Is it possible that there is some way to change the package routing without making a custom one? Sever (2) paths and reroute the signals? Additional paths already baked into the package, but unused until now?
 

Markfw

CPU Moderator, VC&G Moderator, Elite Member
Super Moderator
May 16, 2002
16,940
336
136
#27
Epyc already has all (8) memory channels wired into the package base. Can we assume that TR uses the same package as Epyc, but simply ignores the connectors that enable the absent memory controllers? I've never read anything detailed on this, but it makes the most sense.

The problem is that in TR sockets they go to the (2) dies, but you would need them to go to (4) dies for TR2 while using the original pins on the socket. The only way I see is to modify the package, or...

Assuming that there was a long lead time in these designs, can we say that TR2 was planned quite a while ago, even before TR was released? Is it possible that there is some way to change the package routing without making a custom one? Sever (2) paths and reroute the signals? Additional paths already baked into the package, but unused until now?
I believe that the main difference between socket TR4 and SP3 is simply that TR4 does NOT have 8 channel memory wired.
 

moinmoin

Senior member
Jun 1, 2017
624
143
96
#28
Assuming that there was a long lead time in these designs, we can say that TR2 was planned quite a while ago, even before TR was released?
I'd say the fact that even TR1 used four dies, two of which were supposedly placeholders/dummies/rocks but were actually deactivated dies, tells me that they did plan ahead for TR2; otherwise AMD could have optimized for two dies.
 

PeterScott

Platinum Member
Jul 7, 2017
2,486
63
96
#29
I'd say the fact that even TR1 used four dies, two of which were supposedly placeholders/dummies/rocks but were actually deactivated dies, tells me that they did plan ahead for TR2; otherwise AMD could have optimized for two dies.
If anything, that says AMD will make non-optimal choices rather than doing a redesign, which points to it being more likely that two of the dies won't have active memory channels.
 

maddie

Platinum Member
Jul 18, 2010
2,220
89
136
#30
If anything, that says AMD will make non-optimal choices rather than doing a redesign, which points to it being more likely that two of the dies won't have active memory channels.
Your interpretation of "non optimal choices" disregards the overall costs to the company. One design, or at least as few as possible to rule all markets, has many advantages. They started with 1 design, Ryzen, then added Raven Ridge. Now we have the 3rd one, Pinnacle Ridge. With the introduction of Zen 3, we might see a return to 2 CPU designs.

Spending extra on wasted resources for some of your products might have an overall lower cost than optimizing for all parameters in individual products.

AFAIK, none of us have access to any detailed financial info to say what was or is optimal. I am a firm believer in the Pareto rule and try to structure many things I do with this in mind. Best bang for the buck, so to speak.
 

eek2121

Senior member
Aug 2, 2005
289
0
116
#31
CCX-to-CCX latency is not the issue in question on TR2, but rather the die-to-die latency, in case half of the dies lack the memory controllers.
If the package has been redesigned for a single memory channel per die, then the latency issue pretty much disappears.
I think we just said the same thing.

Not what I meant. I meant physically faking out the pin arrangements.

Right now you are getting 2 memory channels from say Die 1, and the memory controller/data pins on the package for those will be located right near die one to avoid crossovers.

If you activate 4 dies and instead use 1 memory channel on each, you now have to route all your memory controller/data pins from die 2 across the package, over to where they originally connected to die 1.

So a 4-die TR has a unique cross-wired package. That makes it appear to the MB that you are accessing 2 channels on 2 dies, when in reality you are doing 1 channel on 4 dies.

It's possible, but it's a spider's web of high pin count crossover connections that may not happen.

Time will tell if they cross-wire the package for 1 channel/die access, or if they just leave 2 channels on 2 dies and suffer the latency penalties.
Not really, it depends on how the socket and CPU were designed to begin with. As I stated before, Threadripper works just fine in single-channel mode. That just makes both dies use a single memory channel on one die. Both AM4 and TR4 already support this. In theory you could probably build a socket adapter that would allow a 32-core EPYC on X399.
 

mattiasnyc

Senior member
Mar 30, 2017
266
79
76
#32
Threadripper works just fine in single-channel mode. That just makes both dies use a single memory channel on one die.
With no significant degradation in performance on apps that rely on low-latency access to main RAM?
 

Charlie22911

Senior member
Mar 19, 2005
519
26
116
#33
IIRC each die has 2 memory controllers (1 for each CCX). All they would need to do is disable 1 controller per die. Memory bandwidth per die would be halved, but this would NOT affect 32-core workloads (except those that need the extra memory bandwidth). Since Zen+ features latency improvements, they can probably run the Infinity Fabric at a set speed and you won't notice any latency issues at all. It's the die-to-die communication that causes latency; inter-CCX latencies, especially under Zen+, aren't that bad.

Edit: Oh, and even though der8auer generally knows his stuff, I'm not sure he had his RAM set up correctly. In order to properly enable quad-channel mode on an EPYC CPU, for instance, you must install the RAM in the correct slots. His EPYC was showing oddball latencies for everything, which makes me think that something was off with his setup. By comparison, here are my 1950X latencies with CL16 RAM:


[Attached screenshot: 1950X cache and memory latencies.]
Notice how my cache latencies are much lower. I'm not sure what is causing his latencies to go through the roof like they are, but IIRC they should be similar to what you see here (for L1, L2, and L3).
He was using RDIMMs at a much lower speed than your 3200 kit; between this and platform differences, that is probably why things seem a bit wonky.
 

PeterScott

Platinum Member
Jul 7, 2017
2,486
63
96
#34
I think we just said the same thing.

Not really, it depends on how the socket and CPU were designed to begin with. As I stated before, Threadripper works just fine in single-channel mode. That just makes both dies use a single memory channel on one die. Both AM4 and TR4 already support this. In theory you could probably build a socket adapter that would allow a 32-core EPYC on X399.
Since there were two active dies on TR and 4 channels, it's pretty obvious that the socket/CPU were designed to connect 2 channels to each of the two active dies, and NOTHING to the two inactive dies.
 

Gideon

Senior member
Nov 27, 2007
417
14
136
#35
Since there were two active dies on TR and 4 channels, it's pretty obvious that the socket/CPU were designed to connect 2 channels to each of the two active dies, and NOTHING to the two inactive dies.
While this was true, it doesn't have to hold for Threadripper 2.
 

Charlie22911

Senior member
Mar 19, 2005
519
26
116
#36
You know, I don't think there is a technical reason why AMD couldn't design the package to connect a memory channel to each DIMM slot, allowing for a 1-DIMM-per-channel config for octa-channel. I'd love to hear opinions on this from someone who is more knowledgeable than I am.
 

PeterScott

Platinum Member
Jul 7, 2017
2,486
63
96
#37
While this was true, this doesn't have to hold for threadripper 2.
It holds for TR2 MBs which are TR1 MBs as well.

If you read the thread back, I was saying you have to extensively cross-wire the package to get a memory channel on each die. Then someone claimed it depended on how TR1 was designed.

Well NO, it doesn't, because we know how TR1 was designed: with dual-channel access to each of two dies, and nothing to the others.

So we are back to needing an extensively hacked/cross-wired package to get a memory channel on each die.

This could happen, or AMD could just use the same package.

Those are the two possibilities. Given AMD's propensity to just reuse things as is, it wouldn't surprise me they just use the same standard package and live with the latency penalty.

Time will tell on this.
 

mattiasnyc

Senior member
Mar 30, 2017
266
79
76
#38
we know how TR1 was designed: with dual-channel access to each of two dies, and nothing to the others.

So we are back to needing an extensively hacked/cross-wired package to get a memory channel on each die.

This could happen, or AMD could just use the same package.

Those are the two possibilities. Given AMD's propensity to just reuse things as is, it wouldn't surprise me they just use the same standard package and live with the latency penalty.
Now, if they were to try to get channels evenly distributed, would that mean that the signals would still be "bottlenecked" through the socket/motherboard regardless of what's done on the CPU? In other words, not only is there a potential latency issue but also one of bandwidth.... (?)
 

The Stilt

Golden Member
Dec 5, 2015
1,709
64
106
#39
You know, I don't think there is a technical reason why AMD couldn't design the package to connect a memory channel to each DIMM slot, allowing for a 1 DIMM per channel config for octal channel. I'd love to hear opinions on this from someone who is more knowledgeable than I am.
The existing boards are 2 DPC.
For example memory slots A1 & A2 share nearly all of the signals (only CAD are separate for the slots of the same channel).
 

Charlie22911

Senior member
Mar 19, 2005
519
26
116
#40
The existing boards are 2 DPC.
For example memory slots A1 & A2 share nearly all of the signals (only CAD are separate for the slots of the same channel).
I see. The slots in each channel share traces to the socket, so such a package change would require a corresponding motherboard layout change which would break compatibility.

Thanks for the clarification.
 

Abwx

Diamond Member
Apr 2, 2011
8,747
65
126
#41
It holds for TR2 MBs which are TR1 MBs as well.

This has nothing to do with what is possible..

The routing from the CPU IMC I/Os to the MB is implemented in the socket; it costs a few man-hours to design a layout that gets from each of the 4 dies to the corresponding pins connected to the RAM. It has likely already been done and tested by AMD with previous-gen chips, since there are 4 of those in a TR1; as said, the cost is minimal...
 

maddie

Platinum Member
Jul 18, 2010
2,220
89
136
#42
This has nothing to do with what is possible..

The routing from the CPU IMC I/Os to the MB is implemented in the socket; it costs a few man-hours to design a layout that gets from each of the 4 dies to the corresponding pins connected to the RAM. It has likely already been done and tested by AMD with previous-gen chips, since there are 4 of those in a TR1; as said, the cost is minimal...
Exactly.

All this talk of extensive work to do this is pure FUD. This is not IC lithographic work here, people, and TR sales appear to be much higher than AMD expected for a HEDT part, justifying a cheaply made custom part.

The Stilt did some fast and raw tests, with the memory results much better for (1) channel/die vs the alternative. The question is, will AMD leave this performance off the table for a small investment cost?

https://forums.anandtech.com/thread...-top-tdp-of-250w.2547899/page-4#post-39451806
1950X at fixed 3.4GHz frequency, 2933MHz MEMCLK CL14-14-14-1T.

3RA

2CPD (NUMA) = 85961MB/s (Read), 86643MB/s (Write), 81097MB/s (Copy), 78.33ns
1CPD (NUMA) = 44458MB/s (Read), 43449MB/s (Write), 40789MB/s (Copy), 78.80ns
2+0 CPD (LEECH) = 34495MB/s (Read), 37059MB/s (Write), 34823MB/s (Copy), 127.00ns
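To make the tradeoff concrete, here is a quick sketch in Python that works out the relative read-bandwidth and latency penalties, using only the figures quoted from The Stilt's run above:

```python
# Relative bandwidth/latency of each config vs the 2-channels-per-die
# baseline, using the figures quoted from The Stilt's 1950X run above.
configs = {
    "2CPD (NUMA)":     (85961, 78.33),   # (read MB/s, latency ns)
    "1CPD (NUMA)":     (44458, 78.80),
    "2+0 CPD (LEECH)": (34495, 127.00),
}

base_bw, base_lat = configs["2CPD (NUMA)"]
for name, (bw, lat) in configs.items():
    print(f"{name}: {100 * bw / base_bw:.0f}% read bandwidth, "
          f"{100 * lat / base_lat:.0f}% latency")
```

In other words, 1CPD keeps latency essentially flat but roughly halves bandwidth (~52%), while the 2+0 "LEECH" config both loses bandwidth (~40%) and pays about 62% extra latency.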
 

PeterScott

Platinum Member
Jul 7, 2017
2,486
63
96
#43
This has nothing to do with what is possible..
And I stated later in the post the two possible options, and that we really won't know which option AMD chose until it ships.

Option one is easy, in that they can essentially use the same package across the board for TR1/TR2/Epyc. But doing this for a 32-core TR2 causes a latency issue.

Option two is more difficult. It requires a specially wired package just for the TR2 high core counts with 4 active dies. It also reduces manufacturing flexibility, as you have to know ahead of time that you are building a 4-active-die TR. But this reduces the latency penalty, and is obviously the nicer arrangement for customers.

I have no idea which option it will be. I certainly wouldn't bet money on either outcome, because there is a rationale for both: option one is ease of manufacturing and design, option two is better performance.

As I said time will tell which option was chosen.
 

Abwx

Diamond Member
Apr 2, 2011
8,747
65
126
#44
Option Two: is more difficult. It requires a specially wired package just the TR2 high core counts with 4 active dies.
It's wired the same way as TR1. What you seem unable to understand is that it's only a matter of pins routed to dies using copper traces; that's akin to motherboard design, which is to say it costs the same whatever way it is wired.

Exactly.

All this talk of extensive work to do this is pure FUD.
Dunno how such simple things are not understood....
 

PeterScott

Platinum Member
Jul 7, 2017
2,486
63
96
#45
It's wired the same way as TR1. What you seem unable to understand is that it's only a matter of pins routed to dies using copper traces; that's akin to motherboard design, which is to say it costs the same whatever way it is wired.



Dunno how such simple things are not understood....
:rolleyes: Changing the routing of the memory controller pins to different dies == wired differently. Not sure how such a simple thing is not understood.

Which means it's a different package, not compatible with Epyc/TR1.

Right now TR and Epyc use exactly the same package.

With a uniquely wired package for TR2 4-die parts, you lose manufacturing flexibility.

Also, it may not be an easy routing job. Routing traces on MBs depends on logical placement of components to make routing easier. In a normal Epyc/TR package, routing of memory controllers to pins will be done logically close to the die, to keep traces short and avoid collisions with other traces.

Now if you want to route the memory controller of formerly inactive die #3 to the pins for die #2, you have created a very non-optimal routing problem, with longer traces and multiple collision issues, and memory controllers have a LOT of pins.
 

Markfw

CPU Moderator, VC&G Moderator, Elite Member
Super Moderator
May 16, 2002
16,940
336
136
#46
:rolleyes:

Right now TR and Epyc use exactly the same package.

No. EPYC has the 8-channel memory setup, and it's also been confirmed that one pin is wired differently just to signal that it's the SP3 socket, not TR4. If it were the same package, you could put an EPYC in a TR4 and a TR in an SP3.
 

PeterScott

Platinum Member
Jul 7, 2017
2,486
63
96
#47
No. EPYC has the 8-channel memory setup, and it's also been confirmed that one pin is wired differently just to signal that it's the SP3 socket, not TR4. If it were the same package, you could put an EPYC in a TR4 and a TR in an SP3.
It's the same package. Having one pin grounded as an ID doesn't change that.

Right now they can run them all down the same assembly line and, after they get tested, decide which ones are going to be TR and which are Epyc.

After they decide on the type of chip, they can set the signalling pin and load the appropriate microcode.
 

Markfw

CPU Moderator, VC&G Moderator, Elite Member
Super Moderator
May 16, 2002
16,940
336
136
#48
It's the same package. Having one pin grounded as an ID doesn't change that.

Right now they can run them all down the same assembly line and, after they get tested, decide which ones are going to be TR and which are Epyc.

After they decide on the type of chip, they can set the signalling pin and load the appropriate microcode.
No. They already did a test where they found the signal pin and tried to trick it, and it would not POST. Something else is different. They said it's most likely due to the memory: 8-channel vs 4.
 

PeterScott

Platinum Member
Jul 7, 2017
2,486
63
96
#49
No. They already did a test where they found the signal pin and tried to trick it, and it would not POST. Something else is different. They said it's most likely due to the memory: 8-channel vs 4.
Just because AMD blocked Epyc from booting on a TR MB doesn't mean it's a different package. That's like arguing 2600X and 2700X use different dies because they have different core counts. IOW, it's a facetious argument.

The "they" you are referring to is der8auer, and he concluded it was the same package, after going as far as to X-ray an Epyc and a TR, and they are the same:
https://youtu.be/GWQ74Fuyl4M?t=15m1s

It's the same PCB. They still have traces going to the inactive dies, which are totally pointless unless you are just re-using the same PCB from Epyc.
 

Abwx

Diamond Member
Apr 2, 2011
8,747
65
126
#50
:rolleyes: Changing the routing of the memory controller pins to different dies == wired differently. Not sure how such a simple thing is not understood.

Which means it's a different package, not compatible with Epyc/TR1.

I'll explain it simply, since you seem hell-bent on not understanding basic things.

Let's assume that the socket's pins 1 to 256 are used for the RAM signals.

In TR1 the organic (or ceramic) interface below the two dies has 2 x 64 copper traces that run from each die to the relevant pins. All they need to do is route 64 pins to each of the four dies. That is, seen from the MB, the exact connected IMCs are invisible: the first die would have a single IMC connected to pins 1-64, the second die to pins 65-128, and so on...

That's really no rocket science here....
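To make that scheme concrete, the hypothetical routing can be sketched as a trivial lookup. The pin numbers and 64-pins-per-die count are purely illustrative, taken from the example above, not actual SP3/TR4 socket data:

```python
# Sketch of the hypothetical pin plan described above: 256 RAM pins split
# into contiguous 64-pin blocks, one block routed to each of the 4 dies.
# Pin ranges and counts are illustrative, not real socket data.
PINS_PER_DIE = 64
NUM_DIES = 4

def die_for_ram_pin(pin: int) -> int:
    """Return the die index (0-3) a hypothetical RAM pin routes to."""
    if not 1 <= pin <= NUM_DIES * PINS_PER_DIE:
        raise ValueError("pin outside the hypothetical RAM pin range")
    return (pin - 1) // PINS_PER_DIE

# Pins 1-64 -> die 0, pins 65-128 -> die 1, and so on.
```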
 
