【der8auer】Threadripper 2990X Preview - aka EPYC 7601 overclocking

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,542
14,496
136
Not that exciting, since too much of it is "inferred" and it's not the actual 2990X.
 
  • Like
Reactions: Drazick

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Gives a pretty good idea of the potential memory limitations and VRM issues.
With the exception that the VRM issues are almost certainly worse, as 12nm LP features higher leakage and a lower nominal voltage.
The frequencies on > 16 core parts will be rather low, and therefore the voltage will be very low. High current draw at < 1.20V (i.e. < 10% duty cycle) results in extremely poor VRM efficiency, which is a major issue aside from the actual power draw itself (which will be high).

Because of that I expect the issues, at least on existing X399 boards, to be similar to or even worse than what we saw with Skylake-X.
On X299 boards the VRM can maintain quite high efficiency, as the VRM duty cycle is always >= 15% due to the FIVR. And they still cook unless the VRM is extremely well built and well cooled.
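As a rough illustration of the duty-cycle point, here is some back-of-the-envelope buck-converter arithmetic. The 12 V input rail, ~250 W package power and ~0.95 V core voltage are my assumptions for the sketch, not measured figures:

```python
# Rough buck-converter arithmetic for the comparison above.
# Assumptions (not measurements): 12 V VRM input rail, ~250 W package power.
V_IN = 12.0    # motherboard VRM input voltage (V)
P_CPU = 250.0  # assumed CPU package power (W)

def vrm_numbers(v_out):
    duty = v_out / V_IN       # ideal buck duty cycle is roughly Vout / Vin
    current = P_CPU / v_out   # output current the phases must deliver (A)
    return duty, current

for label, v_out in [("SKL-X (FIVR input, ~1.80 V)", 1.80),
                     ("TR2 (assumed ~0.95 V core)", 0.95)]:
    duty, amps = vrm_numbers(v_out)
    print(f"{label}: duty ~{duty:.0%}, output current ~{amps:.0f} A")
# SKL-X (FIVR input, ~1.80 V): duty ~15%, output current ~139 A
# TR2 (assumed ~0.95 V core): duty ~8%, output current ~263 A
```

The lower the output voltage, the lower the duty cycle and the higher the current for the same load power, which is why the same load cooks a TR-style VRM harder than an X299 one.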
 
  • Like
Reactions: Drazick

tamz_msc

Diamond Member
Jan 5, 2017
3,772
3,592
136
Gives a pretty good idea of the potential memory limitations and VRM issues.
With the exception that the VRM issues are almost certainly worse, as 12nm LP features higher leakage and a lower nominal voltage.
The frequencies on > 16 core parts will be rather low, and therefore the voltage will be very low. High current draw at < 1.20V (i.e. < 10% duty cycle) results in extremely poor VRM efficiency, which is a major issue aside from the actual power draw itself (which will be high).

Because of that I expect the issues, at least on existing X399 boards, to be similar to or even worse than what we saw with Skylake-X.
On X299 boards the VRM can maintain quite high efficiency, as the VRM duty cycle is always >= 15% due to the FIVR. And they still cook unless the VRM is extremely well built and well cooled.
Would that mean that in order to have a certain efficiency for the VRMs at a given power consumption, Threadripper motherboards would require more phases than Skylake-X?
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Would that mean that in order to have a certain efficiency for the VRMs at a given power consumption, Threadripper motherboards would require more phases than Skylake-X?

An identical VRM, at identical load power consumption, would dissipate more heat on a TR2 system than on an X299 system.
That's due to the significant voltage difference (SKL-X >= 1.80V Vout, 2990X most likely < 1.00V Vout), and therefore lower efficiency due to higher currents and a very low converter duty cycle.

The best way to decrease the power density of the VRM is to add more phases, i.e. more surface area to dissipate the heat from.
VRM cooling solutions cannot increase infinitely in size, so at some point active cooling will become mandatory. After all, we're talking about > 75W of power dissipation for the VRM alone.

One of the key aspects with TR2 will be the actual power consumption.
Meaning whether AMD has configured the actual power limits significantly higher than the stated TDP (35% higher on AM4 Pinnacle Ridge CPUs) or not. 250W is already pretty hard to handle, but 337W would be just insane.
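For reference, a quick sketch of where that 337W figure comes from, assuming AMD were to reuse the same ~35% ratio as Pinnacle Ridge (purely hypothetical, not a confirmed spec):

```python
# Hypothetical TR2 package power limit if AMD reuses the Pinnacle Ridge ratio.
TDP = 250.0        # rumored 2990X TDP (W)
PPT_RATIO = 1.35   # AM4 Pinnacle Ridge allows ~35% above TDP

print(f"Power limit at 135% of TDP: {TDP * PPT_RATIO:.1f} W")
# -> Power limit at 135% of TDP: 337.5 W
```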
 

Charlie22911

Senior member
Mar 19, 2005
614
228
116
On the Epyc part each die has an active memory controller, whereas the Threadripper part will have half its cores accessing memory through Infinity Fabric. We can see from at least one data point in that video that 32 cores are adversely impacted by only 4 active memory channels, and I imagine the Threadripper SKU would be slightly worse due to the way half the cores access RAM indirectly. It seems to me that Epyc 7601 in quad-channel mode will be a decent ballpark for what to expect from the upcoming halo Threadripper SKU.

I'd love to see this explored further, since not all workloads will be as heavily impacted as highly threaded workloads like Cinebench.
 

PeterScott

Platinum Member
Jul 7, 2017
2,605
1,540
136
On the Epyc part each die has an active memory controller, whereas the Threadripper part will have half its cores accessing memory through Infinity Fabric. We can see from at least one data point in that video that 32 cores are adversely impacted by only 4 active memory channels, and I imagine the Threadripper SKU would be slightly worse due to the way half the cores access RAM indirectly. It seems to me that Epyc 7601 in quad-channel mode will be a decent ballpark for what to expect from the upcoming halo Threadripper SKU.

I'd love to see this explored further, since not all workloads will be as heavily impacted as highly threaded workloads like Cinebench.

This does make for a strange part. It will be interesting to see if AMD has anything up its sleeve to mitigate this.
 

tamz_msc

Diamond Member
Jan 5, 2017
3,772
3,592
136
On the Epyc part each die has an active memory controller, whereas the Threadripper part will have half its cores accessing memory through Infinity Fabric. We can see from at least one data point in that video that 32 cores are adversely impacted by only 4 active memory channels, and I imagine the Threadripper SKU would be slightly worse due to the way half the cores access RAM indirectly. It seems to me that Epyc 7601 in quad-channel mode will be a decent ballpark for what to expect from the upcoming halo Threadripper SKU.

I'd love to see this explored further, since not all workloads will be as heavily impacted as highly threaded workloads like Cinebench.
I'm pretty sure that all memory access is through the IF, as evidenced by this diagram on wikichip based on AMD presentations:
[Image: AMD Summit Ridge SoC block diagram]
 
  • Like
Reactions: lightmanek

eek2121

Platinum Member
Aug 2, 2005
2,930
4,026
136
This does make for a strange part. It will be interesting to see if AMD has anything up its sleeve to mitigate this.

They likely do. I didn't get to watch the video, but the 1950X had UMA and NUMA memory modes that could drastically increase memory bandwidth or lower latency, depending on the mode.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
I'd love to see this explored further, since not all workloads will be as heavily impacted as highly threaded workloads like Cinebench.

Isn't the reverse true? If anything Cinebench is very insensitive to memory bw/latency.
 

maddie

Diamond Member
Jul 18, 2010
4,738
4,667
136
On the Epyc part each die has an active memory controller, whereas the Threadripper part will have half its cores accessing memory through Infinity Fabric. We can see from at least one data point in that video that 32 cores are adversely impacted by only 4 active memory channels, and I imagine the Threadripper SKU would be slightly worse due to the way half the cores access RAM indirectly. It seems to me that Epyc 7601 in quad-channel mode will be a decent ballpark for what to expect from the upcoming halo Threadripper SKU.

I'd love to see this explored further, since not all workloads will be as heavily impacted as highly threaded workloads like Cinebench.
The 2nd part of that statement is not confirmed. It is a speculative opinion that has been accepted as fact.
 

PeterScott

Platinum Member
Jul 7, 2017
2,605
1,540
136
The 2nd part of that statement is not confirmed. It is a speculative opinion that has been accepted as fact.

It does seem to be the most likely case, though.

Ryzen: full dual-channel memory controller on one chip: 2 channels
TR: full dual-channel memory controller on two chips: 4 channels
Epyc: full dual-channel controller on 4 chips: 8 channels.

TR2: Options:

Option one: Same as TR1; seems most likely given that the exact same motherboards can be used.

Option two: Somehow rewiring the chips to use only half of the memory controller on each core chip, so you can use one channel on each chip, seamlessly on the same MBs? That seems a lot less likely.

So it does seem most likely that two chips have full dual-channel memory and the two other chips have no direct memory access.
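A tiny sketch of what those two hypothetical layouts would look like (the die-to-channel mapping is purely illustrative, not a confirmed configuration):

```python
# Two hypothetical channel layouts for a 4-die TR2 in the existing TR4 socket.
# Option one: TR1-style wiring, two dies own two channels each.
# Option two: package rewired so every die owns one channel.
option_one = {"die0": 2, "die1": 2, "die2": 0, "die3": 0}
option_two = {"die0": 1, "die1": 1, "die2": 1, "die3": 1}

for name, layout in (("Option one", option_one), ("Option two", option_two)):
    memoryless = [die for die, channels in layout.items() if channels == 0]
    print(f"{name}: {sum(layout.values())} channels total, "
          f"dies with no local memory: {memoryless or 'none'}")
# Option one: 4 channels total, dies with no local memory: ['die2', 'die3']
# Option two: 4 channels total, dies with no local memory: none
```

Either way the board sees 4 channels; the difference is how many dies have to reach memory through another die.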
 

tamz_msc

Diamond Member
Jan 5, 2017
3,772
3,592
136
  • Like
Reactions: CatMerc

maddie

Diamond Member
Jul 18, 2010
4,738
4,667
136
It does seem to be the most likely case, though.

Ryzen: full dual-channel memory controller on one chip: 2 channels
TR: full dual-channel memory controller on two chips: 4 channels
Epyc: full dual-channel controller on 4 chips: 8 channels.

TR2: Options:

Option one: Same as TR1; seems most likely given that the exact same motherboards can be used.

Option two: Somehow rewiring the chips to use only half of the memory controller on each core chip, so you can use one channel on each chip, seamlessly on the same MBs? That seems a lot less likely.

So it does seem most likely that two chips have full dual-channel memory and the two other chips have no direct memory access.

You don't need to rewire the chip, just the package.

[Image: Ryzen die shot]


Each Ryzen die has (2) 64-bit memory controllers.
The present Threadripper has (4) 64-bit memory channels.

If you can rewire the traces in the package, and it is possible, you can have each Threadripper 2 die use (1) of its memory controllers, with each of the (4) dies in the package connecting to its own channel.

The resulting memory performance is superior to using just (2) dies for (4)-channel memory access.

Will it happen? We'll see soon.
 

Charlie22911

Senior member
Mar 19, 2005
614
228
116
You don't need to rewire the chip, just the package.

[Image: Ryzen die shot]


Each Ryzen die has (2) 64-bit memory controllers.
The present Threadripper has (4) 64-bit memory channels.

If you can rewire the traces in the package, and it is possible, you can have each Threadripper 2 die use (1) of its memory controllers, with each of the (4) dies in the package connecting to its own channel.

The resulting memory performance is superior to using just (2) dies for (4)-channel memory access.

Will it happen? We'll see soon.

This is something I hadn't considered. If each die has its own RAM channel, then I expect we can anticipate performance similar to the Epyc 7601 in quad-channel mode; in that case fast memory will be important for maximizing performance potential.
It really didn't make sense to me that he was using gaming benchmarks to measure performance, since these sorts of parts are not meant for those workloads, though to be fair I suppose he is catering to his audience. I'd like to see some encoding and rendering numbers.
 

beginner99

Diamond Member
Jun 2, 2009
5,210
1,580
136
You don't need to rewire the chip, just the package.

Each Ryzen die has (2) 64-bit memory controllers.
The present Threadripper has (4) 64-bit memory channels.

If you can rewire the traces in the package, and it is possible, you can have each Threadripper 2 die use (1) of its memory controllers, with each of the (4) dies in the package connecting to its own channel.

The resulting memory performance is superior to using just (2) dies for (4)-channel memory access.

Will it happen? We'll see soon.

The only way this will happen is if it affects all Threadrippers, or else the 32-core part would have to use a different package, which is IMHO very unlikely. So if all dies must be active on all TR2 parts, we could have:

- 8-core: 1 active core per CCX (2 per die)
- 12-core: 2 dies with 2 cores per CCX and 2 dies with 1 core per CCX
- 16-core: 2 active cores per CCX (4 per die)
- 20-core: 2 dies with 3 cores per CCX and 2 dies with 2 cores per CCX
- 24-core: 3 active cores per CCX (6 per die)
and so forth

But given EPYC, I'm not sure the 12/20-core versions are actually possible. There are no such EPYC variants. So as far as I can tell, in this case there could only be 8-, 16-, 24- and 32-core TR2 parts. I consider this rather unlikely; it would be difficult to price these.
 
  • Like
Reactions: PeterScott

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
The only way this will happen is if it affects all Threadrippers, or else the 32-core part would have to use a different package, which is IMHO very unlikely. So if all dies must be active on all TR2 parts, we could have:

- 8-core: 1 active core per CCX (2 per die)
- 12-core: 2 dies with 2 cores per CCX and 2 dies with 1 core per CCX
- 16-core: 2 active cores per CCX (4 per die)
- 20-core: 2 dies with 3 cores per CCX and 2 dies with 2 cores per CCX
- 24-core: 3 active cores per CCX (6 per die)
and so forth

But given EPYC, I'm not sure the 12/20-core versions are actually possible. There are no such EPYC variants. So as far as I can tell, in this case there could only be 8-, 16-, 24- and 32-core TR2 parts. I consider this rather unlikely; it would be difficult to price these.

Only symmetric configurations are possible on Zen.

MCM4 config (EPYC / TR2), listed as cores per CCX x CCXs per die x dies:

8 cores (1x2x4)
16 cores (2x2x4)
24 cores (3x2x4)
32 cores (4x2x4)

Other (single-CCX) configurations are technically possible; however, AMD has never shipped any SKUs with a complete CCX disabled.
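For clarity, a quick enumeration of those symmetric configurations:

```python
# Symmetric MCM4 configurations: the same number of cores enabled in every CCX.
CCX_PER_DIE = 2
DIES = 4

for cores_per_ccx in range(1, 5):
    total = cores_per_ccx * CCX_PER_DIE * DIES
    print(f"{cores_per_ccx}x{CCX_PER_DIE}x{DIES} -> {total} cores")
# 1x2x4 -> 8, 2x2x4 -> 16, 3x2x4 -> 24, 4x2x4 -> 32
```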
 
Last edited:

eek2121

Platinum Member
Aug 2, 2005
2,930
4,026
136
You don't need to rewire the chip, just the package.

Each Ryzen die has (2) 64-bit memory controllers.
The present Threadripper has (4) 64-bit memory channels.

If you can rewire the traces in the package, and it is possible, you can have each Threadripper 2 die use (1) of its memory controllers, with each of the (4) dies in the package connecting to its own channel.

The resulting memory performance is superior to using just (2) dies for (4)-channel memory access.

Will it happen? We'll see soon.

IIRC each die has 2 memory controllers (1 for each CCX). All they would need to do is disable 1 controller per die. Memory bandwidth per die would be halved, but this would NOT affect 32-core workloads (except where memory bandwidth is the limiting factor). Since Zen+ features latency improvements, they can probably set the Infinity Fabric at a set speed and you won't notice any latency issues at all. It's the die-to-die communication that causes latency; inter-CCX latencies, especially under Zen+, aren't that bad.
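Rough bandwidth arithmetic for that scenario, using DDR4-2933 purely as an assumed example speed:

```python
# Peak theoretical DDR4 bandwidth: transfer rate (MT/s) x 8 bytes per channel.
MT_PER_S = 2933            # assumed DDR4-2933
BYTES_PER_TRANSFER = 8     # one 64-bit channel

per_channel = MT_PER_S * BYTES_PER_TRANSFER / 1000  # GB/s
print(f"Per channel:           {per_channel:.1f} GB/s")
print(f"Per die, 2 channels:   {2 * per_channel:.1f} GB/s")
print(f"Per die, 1 channel:    {per_channel:.1f} GB/s (halved locally)")
print(f"Package total (4 ch):  {4 * per_channel:.1f} GB/s (unchanged either way)")
```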

Edit: Oh, and even though der8auer generally knows his stuff, I'm not sure he had his RAM set up correctly. In order to properly enable quad-channel mode on an EPYC CPU, for instance, you must install the RAM in the correct slots. His EPYC was showing oddball latencies for everything, which makes me think that something was off with his setup. By comparison, here are my 1950X latencies with CL16 RAM:

[Image: 1950X memory and cache latency screenshot]

Notice how my cache latencies are much lower. I'm not sure what is causing his latencies to go through the roof like they are, but IIRC they should be similar to what you see here (for L1, L2, and L3).
 
Last edited:
  • Like
Reactions: lightmanek

PeterScott

Platinum Member
Jul 7, 2017
2,605
1,540
136
You don't need to rewire the chip, just the package.

Each Ryzen die has (2) 64-bit memory controllers.
The present Threadripper has (4) 64-bit memory channels.

If you can rewire the traces in the package, and it is possible, you can have each Threadripper 2 die use (1) of its memory controllers, with each of the (4) dies in the package connecting to its own channel.

The resulting memory performance is superior to using just (2) dies for (4)-channel memory access.

Will it happen? We'll see soon.

That is what I meant, but I assumed it would be rather difficult.

The package would have originally been designed so that the routing of the data lines in the package is co-located with the physical die they connect to. Data lines from the dies and pins on the package are the most numerous type. Cross-wiring them from their original locations to locations co-located with the other die positions could be something of a small nightmare.

I agree having 1 channel/die is the ideal outcome, but my assumption is that wiring the package cross-ways to fake out the 2-active-die package will be a nightmare.

Time will tell which way it was done.
 

eek2121

Platinum Member
Aug 2, 2005
2,930
4,026
136
That is what I meant, but I assumed it would be rather difficult.

The package would have originally been designed so that the routing of the data lines in the package is co-located with the physical die they connect to. Data lines from the dies and pins on the package are the most numerous type. Cross-wiring them from their original locations to locations co-located with the other die positions could be something of a small nightmare.

I agree having 1 channel/die is the ideal outcome, but my assumption is that wiring the package cross-ways to fake out the 2-active-die package will be a nightmare.

Time will tell which way it was done.

They don't need to do any rewiring or faking out. This type of setup is already supported by Zen. By default you get 1 channel per CCX. If CCX0 needs data from CCX1's channel, it asks for it. There is a slight latency penalty for this, but not nearly as high as inter-die communication. By merely disabling a memory controller, you simply force both CCXes to share 1 controller. The latencies for memory will be SLIGHTLY higher, but not much. Inter-die latency, however, will be through the roof. On Zen1 (Threadripper) it was around 97ns with CL16 RAM on my 1950X; I expect 32 cores will increase this by around 30%. However, once again, a local vs. distributed memory mode will help greatly here.
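Putting that estimate into numbers (the ~30% increase is a guess, not a measurement):

```python
# Zen1 1950X remote-die latency (~97 ns with CL16 RAM) plus a guessed ~30% bump.
zen1_remote_latency_ns = 97.0
assumed_increase = 0.30

estimate = zen1_remote_latency_ns * (1 + assumed_increase)
print(f"Guesstimated TR2 die-to-die latency: ~{estimate:.0f} ns")
# -> ~126 ns
```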
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
CCX-to-CCX latency is not the issue in question on TR2; it's the die-to-die latency, in case half of the dies lack memory controllers.
If the package has been redesigned for a single memory channel per die, then the latency issue pretty much disappears.
 

PeterScott

Platinum Member
Jul 7, 2017
2,605
1,540
136
They don't need to do any rewiring or faking out. This type of setup is already supported by Zen. By default you get 1 channel per CCX. If CCX0 needs data from CCX1's channel, it asks for it. There is a slight latency penalty for this, but not nearly as high as inter-die communication. By merely disabling a memory controller, you simply force both CCXes to share 1 controller. The latencies for memory will be SLIGHTLY higher, but not much. Inter-die latency, however, will be through the roof. On Zen1 (Threadripper) it was around 97ns with CL16 RAM on my 1950X; I expect 32 cores will increase this by around 30%. However, once again, a local vs. distributed memory mode will help greatly here.

Not what I meant. I meant physically faking out the pin arrangements.

Right now you are getting 2 memory channels from, say, Die 1, and the memory controller/data pins on the package for those will be located right near Die 1 to avoid crossovers.

If you activate 4 dies and instead use 1 memory channel on each, you now have to route all of the memory controller/data pins from Die 2 across the package, over to where they originally connected to Die 1.

So a TR2 with 4 active dies would have a unique cross-wired package. That makes it appear to the motherboard that you are accessing 2 channels on 2 dies, when in reality you are using 1 channel on each of 4 dies.

It's possible, but it's a spider's web of high-pin-count crossover connections, and it may not happen.

Time will tell if they cross-wire the package for 1-channel/die access, or if they just leave 2 channels on 2 dies and suffer the latency penalties.
 
  • Like
Reactions: maddie