[pcgamesn] AMD is giving Threadripper 2 moar cores and a top TDP of 250W


mattiasnyc

Senior member
Mar 30, 2017
356
337
136
It amazes me how many people here browse the forums, but not the site.

Remember that time when you wrote:

You actually read wccftech?

Threadripper 2 will feature 16 cores. Any increase in cores would have to be done in the CCX due to socket compatibility.

And then someone else wrote:


And then you were like:

It amazes me how many people here browse the forums, but not the site.

….. I liked that time...
 

eek2121

Platinum Member
Aug 2, 2005
2,904
3,906
136
Remember that time when you wrote:



And then someone else wrote:



And then you were like:



….. I liked that time...

...and remember when you questioned my post where the 32-core/64-thread chip would be 3.0/3.4 GHz... as posted in AnandTech's own overview? My posts were written before AMD themselves announced a 32-core version. A 250 W TDP means that several motherboards, including certain ASUS boards, would indeed not be compatible... hence my post. A 32-core Threadripper will likely require dual 8-pin CPU plugs unless they stick with the 3.0/3.4 design... and even so, my original statement stands.

EDIT: Also, I'm not going to continue this due to the risk of moderator intervention. I only point out that all of the information is available on Anandtech's homepage. You simply failed to read it and then blamed me for your failure. I'm not here to cover for your mistakes.
 
  • Like
Reactions: ryan20fun

beginner99

Diamond Member
Jun 2, 2009
5,208
1,580
136
Slots on the same memory channel (e.g. A0 & A1) share all but 10 signals (CAD).
You can check any 2 DPC board schematic you can find to verify that.

Slots on different channels (e.g. A0 & B0) share none, except the SMBus signals for the SPD device.

All of the existing X399 boards with eight slots are 2 DPC (A0 & A1, B0 & B1, C0 & C1, D0 & D1), while an EPYC or 8-channel TR would need to be 1 DPC (or 2 DPC with 16 slots) with an A0, B0, C0, D0, E0, F0, G0, H0 configuration.

Can you ELI5 this for non-electrical engineers? naukkis' explanation makes sense to me in a "common-sense" kind of way, but I don't understand why what he suggests would not be possible.

A0 & A1 would be served by one channel from die 1, B0 & B1 by one channel from die 2, and so forth. Instead of having 2 dies with dual-channel access, the config naukkis suggests would have 4 dies, each with single-channel access. So why doesn't this work, in layman's terms?
 
  • Like
Reactions: Jan Olšan

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Can you ELI5 this for non-electrical engineers? naukkis' explanation makes sense to me in a "common-sense" kind of way, but I don't understand why what he suggests would not be possible.

A0 & A1 would be served by one channel from die 1, B0 & B1 by one channel from die 2, and so forth. Instead of having 2 dies with dual-channel access, the config naukkis suggests would have 4 dies, each with single-channel access. So why doesn't this work, in layman's terms?

I don't think it can be explained any simpler.

It has nothing to do with the CPU, but with how the motherboard is wired.
You cannot have two memory channels on one set of signals.

It's like having an extension cord with two outlets and expecting a totally different voltage from each of them (even though they come from the same input).
 

Jan Olšan

Senior member
Jan 12, 2017
273
276
136
I don't think what was suggested was to drive two channels through one set of signals; I know the way mobos are made makes that impossible.

The question is: why is it, or would it be, impossible to use single-channel mode from all four dies? There are four of those "sets of signals" as far as I can see, and they are independent, right? So the proposal is to use the first set of signals to drive one channel (two DIMMs) from a memory controller in die 1, the second set of signals with its two DIMMs from one channel of the memory controller in die 2, the third set of signals (two DIMMs) from one channel of the memory controller in die 3, and finally the fourth set of signals with its two DIMMs from one channel of the memory controller in die 4.
Why wouldn't this be possible? You still run just one channel through each instance of that "one set of signals". All the rewiring could be implemented in the substrate, so that compatibility could be maintained, from a high-level POV?

Or are there some things that prevent this too?
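
To make the proposal concrete, here is a rough sketch (Python; the slot and die labels are just the ones used in this thread, everything else is purely illustrative) of the current X399 wiring versus the per-die single-channel idea being asked about:

# Existing X399 boards: two dies with memory, two channels each,
# 2 DIMMs per channel. A0 & A1 sit on the same set of CA/data signals.
current_x399 = {
    "die0": {"channel A": ["A0", "A1"], "channel B": ["B0", "B1"]},
    "die1": {"channel C": ["C0", "C1"], "channel D": ["D0", "D1"]},
}

# The proposal: all four dies active, each driving one channel, with the
# rerouting done in the CPU package/substrate so the board wiring is untouched.
proposed_one_channel_per_die = {
    "die0": {"channel A": ["A0", "A1"]},
    "die1": {"channel B": ["B0", "B1"]},
    "die2": {"channel C": ["C0", "C1"]},
    "die3": {"channel D": ["D0", "D1"]},
}

for die, channels in proposed_one_channel_per_die.items():
    print(die, "drives", channels)

As the replies below note, the sticking point is not the board wiring but the SP3r2 package pad arrangement, which would have to be reworked.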
 
  • Like
Reactions: beginner99

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
I don't think what was suggested was to drive two channels through one set of signals; I know the way mobos are made makes that impossible.

The question is: why is it, or would it be, impossible to use single-channel mode from all four dies? There are four of those "sets of signals" as far as I can see, and they are independent, right? So the proposal is to use the first set of signals to drive one channel (two DIMMs) from a memory controller in die 1, the second set of signals with its two DIMMs from one channel of the memory controller in die 2, the third set of signals (two DIMMs) from one channel of the memory controller in die 3, and finally the fourth set of signals with its two DIMMs from one channel of the memory controller in die 4.
Why wouldn't this be possible? You still run just one channel through each instance of that "one set of signals". All the rewiring could be implemented in the substrate, so that compatibility could be maintained, from a high-level POV?

Or are there some things that prevent this too?

If the package were modified then 1 CH per controller should be possible; however, I'd imagine it would result in pretty hideous bandwidth and latency (potentially even worse than the leech die configuration).
25.6GB/s memory bandwidth per die @ 3200MHz, unless GMI is crossed once for dual channel or twice for quad channel.
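
For anyone wondering where the 25.6GB/s comes from: it is just the peak theoretical rate of a single 64-bit DDR4-3200 channel. A quick back-of-the-envelope sketch in Python (my own arithmetic, not from the post):

# Peak theoretical rate of one 64-bit DDR4 channel, ignoring all overhead.
transfers_per_second = 3200 * 10**6   # DDR4-3200 = 3200 MT/s
bytes_per_transfer = 8                # 64-bit channel
peak_bytes = transfers_per_second * bytes_per_transfer
print(peak_bytes / 10**9, "GB/s per channel")   # -> 25.6, i.e. per die if each die gets one channel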
 

maddie

Diamond Member
Jul 18, 2010
4,722
4,627
136
The performance of one memory channel per die should be very easy to test on a present-day Threadripper. Anyone willing?
 

mattiasnyc

Senior member
Mar 30, 2017
356
337
136
...and remember when you questioned my post where the 32-core/64-thread chip would be 3.0/3.4 GHz... as posted in AnandTech's own overview?

I wasn't questioning that you were right, I was asking whether or not AMD said that.

Secondly, even if I was questioning you, it would have been a valid question, seeing as you stated unequivocally and without doubt that Threadripper 2 wouldn't have more than 16 cores for technical reasons. You were wrong once, and so statistically it seemed likely to me that you could be wrong again.

Thirdly, you could have replied directly to me instead of to someone who replied to me and gave me the info I was looking for.

My posts were written before AMD themselves announced a 32-core version.

If I quoted WCCFTech and they quoted the launch announcement then it's impossible for you to have commented on my quote before the announcement. That's just simple logic... unless you have a time machine at your disposal.

A 250 W TDP means that several motherboards, including certain ASUS boards, would indeed not be compatible... hence my post. A 32-core Threadripper will likely require dual 8-pin CPU plugs unless they stick with the 3.0/3.4 design... and even so, my original statement stands.

Your original statements were:

"Threadripper 2 will feature 16 cores." …. which... no...
" Any increase in cores would have to be done in the CCX due to socket compatibility."... again... no....

Current motherboards not supplying enough juice doesn't change the above.
 

mattiasnyc

Senior member
Mar 30, 2017
356
337
136
If the package were modified then 1 CH per controller should be possible; however, I'd imagine it would result in pretty hideous bandwidth and latency (potentially even worse than the leech die configuration).
25.6GB/s memory bandwidth per die @ 3200MHz, unless GMI is crossed once for dual channel or twice for quad channel.

Forgive me for this poorly phrased layman's question; if the CPUs will use only two memory controllers from two dies, yet consist of four dies, will the resulting 'problem' be mostly bandwidth issues or latency?

Is it that the path is longer / more work needs to be done to get data - or that more data now has to feed more dies within the same path?
 

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
Forgive me for this poorly phrased layman's question; if the CPUs will use only two memory controllers from two dies, yet consist of four dies, will the resulting 'problem' be mostly bandwidth issues or latency?

Is it that the path is longer / more work needs to be done to get data - or that more data now has to feed more dies within the same path?

It can be both. But my guess is that the jobs that feel choked by bandwidth will be rare and limited to tools expected to run on servers, whereas I think a lot more apps could feel the negative effect of the latency of memory calls going through a different die.
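
As a toy illustration of why latency tends to bite more broadly (Python; the latency figures here are placeholders I picked for the example, not measurements — actual local/remote numbers show up in The Stilt's tests later in the thread):

# Toy model: average memory latency as a function of how many accesses
# have to cross to another die. Latency values are assumed placeholders.
local_ns, remote_ns = 80.0, 130.0

for remote_fraction in (0.0, 0.25, 0.5, 1.0):
    avg = (1 - remote_fraction) * local_ns + remote_fraction * remote_ns
    print(f"{remote_fraction:.0%} remote accesses -> ~{avg:.0f} ns average")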
 

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
(continuing the discussion of a possible "4 DRAM channels configured as one channel per die")

When you consider the base, certified configuration for TR1, which had a rated "speed" of DDR4-2667, a single channel at 3200 does seem like it could significantly restrict single-thread local memory performance for heavily memory-bound applications.
However, as the ship has sailed on the TR2 configuration, I'll focus my thoughts on TR3.

The rough idea out there is that TR3 will be largely based on EPYC 2, featuring 4 dies, each at 7nm. It is speculated that EPYC 2 could have as many as 64 cores with 4 × 16-core dies. In the process of shrinking everything, they will either expand the current CCXs or just "paste" in two additional CCX units. Either way, it seems logical that the L3 cache per die would double to 32MB (8MB per CCX in a four-CCX config, or 16MB per CCX in a two-CCX config). That's a lot of L3 cache to mitigate immediate DRAM channel demands. It is also possible that they could make a significant change in direction and use the CCX from Raven Ridge for compactness, configuring 4 CCXs with 4MB of L3 cache each, but then use the extra space on the die to create an L4 cache. Either way, there will likely be more cache per die in TR3 and EPYC 2. That additional cache should help keep demands on the DRAM channels under control. That being the case, a slight reduction in DRAM bandwidth per die may not be as apparent in system performance metrics.

If most of the above is true, when compared to the original certified TR1 configuration, would a TR3 that has a single enabled DDR4-3600 (or even approaching 4000) spec channel per die be significantly hampered in performance? Local cores have a mountain of cache to help with memory demands and a pretty good amount of local DRAM bandwidth. If we assume that IF throughput will be increased in EPYC 2/TR3, then access to remote DRAM channels will also be at or near full bandwidth, with only initial transaction setup latencies to contend with. Given how spread out the accesses to remote DRAM would be, inter-die data transfers wouldn't be as disrupted as they are in the current TR2 setup, where remote-die memory calls can heavily saturate the IF links between the dies. Increased IF bandwidth can also serve to reduce the latency penalty for remote-die memory calls.

Looking at the numbers for EPYC 1 from the latency and bandwidth testing that Serve The Home did, you can see a relative drop in latency of around 11ns from going from 2400 to 2667 DRAM. That's 11ns for 266 MHz. While the scaling is not linear, going from 2667 to 3200 is a jump of 533 MHz. That should be good enough to shave another 18ns or so off of remote DRAM access latencies. The local die has a latency for memory calls (when configured as an EPYC core, for what that's worth) of 81ns at 2667, and remote-die DRAM sits at ~135ns. Running at 3200, you'd expect that latency to be, again roughly, 117ns. If the trend line continues, then at 3600 it should be vaguely 105ns, and at 4000 the latency should be around 100ns or less. Now, none of us have any idea if AMD is capable of getting the IF links between dies to run that fast on their MCM. However, if they can, then those are NOT bad memory latency numbers for remote-node DRAM accesses. I suspect, however, that the addition of more cores in the 7nm die will incur some sort of latency penalty, as the routing of memory requests will have a bigger table to look through for each new transaction, so those numbers are likely optimistic.
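
For what it's worth, here is the straight-line version of that extrapolation in Python (the input numbers are the ones quoted above from the Serve The Home EPYC testing; since the scaling is assumed to be sub-linear, the ~117/105/100ns guesses above are deliberately more conservative than these):

# Figures quoted above: ~135 ns remote-die latency at DDR4-2667, and roughly
# an 11 ns improvement for the 266 MT/s step from 2400 to 2667.
remote_latency_at_2667 = 135.0      # ns
ns_saved_per_mts = 11.0 / 266.0     # naive linear slope

for speed in (3200, 3600, 4000):
    estimate = remote_latency_at_2667 - (speed - 2667) * ns_saved_per_mts
    print(f"DDR4-{speed}: ~{estimate:.0f} ns remote (straight-line estimate)")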

The idea with having a local DRAM channel per die is to make sure that there is always an opportunity for a die to make a memory call with the lowest latency. At DDR4-3600/4000 (again, I'm speculating on the certified DRAM speed for TR3 here, using AMD's demonstrated DDR4-3200 on TR2 as a baseline), that's a latency that can be as low as the low-60s to high-50s ns range, compared to a remote latency that's easily twice that. I think that, for a task that can be heavily threaded and can be managed to keep its working memory local to its own NUMA node, this can make a noticeable difference, especially in cases where there are lots of small transactions instead of large blocks streamed in bulk. In cases where one die needs maximum bandwidth, having a single channel at the end of each inter-die IF link means that it should be better able to sustain maximum bandwidth across those links, as that data transfer should not saturate the link (whereas it can now).

Remember, this is not an EPYC processor where we're going all out for maximum performance in all cases. TR is in the middle of the stack. Having reduced bandwidth is OK. It's about having a LOT of cores to throw at a problem, but without some of the unneeded server features, validation, etc that can make the platform too expensive. I just think that this works out better for general use cases.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Based on the quick synthetic tests, the best option would be having 1 CH per die, which of course would require active IMCs on each of the dies and a totally reworked SP3r2 package (memory pad arrangement-wise), which is very unlikely to happen.

1950X at fixed 3.4GHz frequency, 2933MHz MEMCLK CL14-14-14-1T.

3RA

2CPD (NUMA) = 85961MB/s (Read), 86643MB/s (Write), 81097MB/s (Copy), 78.33ns
1CPD (NUMA) = 44458MB/s (Read), 43449MB/s (Write), 40789MB/s (Copy), 78.80ns
2+0 CPD (LEECH) = 34495MB/s (Read), 37059MB/s (Write), 34823MB/s (Copy), 127.00ns
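
Putting those three results in relative terms (Python, using the read bandwidth and latency numbers above):

# Read bandwidth (MB/s) and latency (ns) from the results above.
results = {
    "2CPD (NUMA)":     (85961, 78.33),
    "1CPD (NUMA)":     (44458, 78.80),
    "2+0 CPD (LEECH)": (34495, 127.00),
}
base_bw, base_lat = results["2CPD (NUMA)"]
for config, (bw, lat) in results.items():
    print(f"{config}: {bw / base_bw:.0%} of baseline read BW, "
          f"{lat / base_lat:.2f}x baseline latency")

So 1 CPD keeps latency essentially flat at roughly half the read bandwidth, while the leech configuration loses a lot of both.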
 

maddie

Diamond Member
Jul 18, 2010
4,722
4,627
136
Based on the quick synthetic tests, the best option would be having 1 CH per die, which of course would require active IMCs on each of the dies and a totally reworked SP3r2 package (memory pad arrangement-wise), which is very unlikely to happen.

1950X at fixed 3.4GHz frequency, 2933MHz MEMCLK CL14-14-14-1T.

3RA

2CPD (NUMA) = 85961MB/s (Read), 86643MB/s (Write), 81097MB/s (Copy), 78.33ns
1CPD (NUMA) = 44458MB/s (Read), 43449MB/s (Write), 40789MB/s (Copy), 78.80ns
2+0 CPD (LEECH) = 34495MB/s (Read), 37059MB/s (Write), 34823MB/s (Copy), 127.00ns
Good work. I wonder how this translates into empirical data from real-world applications.

Do you really think AMD would allow the 50%+ latency increase plus lower memory speeds on their premium client product when a better alternative exists?
 
  • Like
Reactions: CatMerc

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Good work. I wonder how this translates into empirical data from real-world applications.

Do you really think AMD would allow the 50%+ latency increase plus lower memory speeds on their premium client product when a better alternative exists?

To be perfectly honest, I do :(
Or at least it wouldn't surprise me one bit.

AMD still operates on a shoestring budget and they're juicing the existing designs for all they're worth, as they should. However, from time to time they go too far with it, in my opinion.
 

CatMerc

Golden Member
Jul 16, 2016
1,114
1,149
136
If the package were modified then 1 CH per controller should be possible; however, I'd imagine it would result in pretty hideous bandwidth and latency (potentially even worse than the leech die configuration).
25.6GB/s memory bandwidth per die @ 3200MHz, unless GMI is crossed once for dual channel or twice for quad channel.
Right, in the leech method at least 16 cores get normal full access, so if only a part of your software is latency-sensitive, a smart algorithm could basically treat the leech dies as a sort of coprocessor for the connected ones. Maybe the IF has some capability for handling that which will be enabled in TR2 but not in TR1?

But with the one-channel-per-die method you get consistent performance on all dies, but all dies will get consistently AWFUL performance.
 

CatMerc

Golden Member
Jul 16, 2016
1,114
1,149
136
Based on the quick synthetic tests, the best option would be having 1 CH per die, which of course would require active IMCs on each of the dies and a totally reworked SP3r2 package (memory pad arrangement-wise), which is very unlikely to happen.

1950X at fixed 3.4GHz frequency, 2933MHz MEMCLK CL14-14-14-1T.

3RA

2CPD (NUMA) = 85961MB/s (Read), 86643MB/s (Write), 81097MB/s (Copy), 78.33ns
1CPD (NUMA) = 44458MB/s (Read), 43449MB/s (Write), 40789MB/s (Copy), 78.80ns
2+0 CPD (LEECH) = 34495MB/s (Read), 37059MB/s (Write), 34823MB/s (Copy), 127.00ns
Any chance for testing in some real workload? I'm curious how it's handled there.
 

tamz_msc

Diamond Member
Jan 5, 2017
3,710
3,554
136
Any chance for testing in some real workload? I'm curious how it's handled there.
The Stilt's benchmarking suite has a couple of CFD workloads, some of which I imagine would perform abysmally in either of these two configurations. As a side note, for Zen 2 it is imperative that AMD adds support for full >= 256-bit vector SIMD ops. Either of these memory configurations would severely limit the potential of a future TR3 based on Zen 2 if it does indeed support such operations.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Any chance for testing in some real workload? I'm curious how it's handled there.

I cannot easily test anything CPU-intensive as I'm lacking a proper cooler. Currently I have a big heatsink WT'd on the CPU.
It will probably be easier for some of the guys running a proper TR system here at AT to pull the DIMMs for the other die and run a few tests.
 
  • Like
Reactions: Drazick and CatMerc

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
Actually, given the topic of discussion, I'd like some time with a 32 core EPYC to test some of this. Obviously it wouldn't be a perfect analog, but it would give us a general idea.

I do feel that having fast memory will be absolutely imperative for decent performance in a leech die configuration. Scaling should be heavily memory frequency dependent in any test case that overflows cache even occasionally.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,478
14,434
136
Actually, given the topic of discussion, I'd like some time with a 32 core EPYC to test some of this. Obviously it wouldn't be a perfect analog, but it would give us a general idea.

I do feel that having fast memory will be absolutely imperative for decent performance in a leech die configuration. Scaling should be heavily memory frequency dependent in any test case that overflows cache even occasionally.
When I get mine, I am ordering the fastest memory the motherboard supports in OC mode. At least 3600, but maybe up to 4266.
 
  • Like
Reactions: Drazick

Sable

Golden Member
Jan 7, 2006
1,127
97
91
When I get mine, I am ordering the fastest memory the motherboard supports in OC mode. At least 3600, but maybe up to 4266.
"People" say that @VirtualLarry spends an inordinate amount of funds on crazy PC stuff. But you sir are the polar opposite on the other end of the cost spectrum!!! (And I love it)

On a side note, one of my colleagues is putting together a beast of a multicore watercooled monstrosity at the moment, and he's going HEDT, but he's going Intel, for NO other reason than that's what he knows. In the days of the Athlon he would have gone P4. I actually tried to extol the virtues of Threadripper, but you just can't get through.

Basically, I hope these design wins (because that really IS what this is) work out to more IDENTITY for AMD.
:/ If they could just fix up their graphics cards section they could really DO the combined "GO AMD BOTH WAYS". (no comments on that last bit please, totally off topic)
 

Mopetar

Diamond Member
Jan 31, 2011
7,797
5,899
136
Neither is finalized. Both turbo numbers were listed as WIP. And they were listed as ALL CORE turbo, btw.

Yeah, I think I read one source that said the given numbers were for the engineering samples. I'm guessing that they'll get a respectable bump by the time they launch.

A 2700 can push 3.2/4.1 GHz base/boost for 8C16T in a 65 W power envelope. I know that it's not just a matter of gluing 4 of those together, but I think it gives us a reasonable idea of what we can expect. Assuming that these are a better bin, I think that 3.4/4.0 base/boost is around where it should land.
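
A hypothetical back-of-the-envelope version of that reasoning (Python; the even per-die split is my own assumption for illustration, not anything AMD has stated):

# Assumed: the 250 W top TDP split roughly evenly across four 2700-class dies,
# ignoring the extra uncore/IF power a four-die package burns.
package_tdp = 250          # W, the reported top TDP
dies = 4
per_die_budget = package_tdp / dies
ryzen_2700 = {"tdp_w": 65, "base_ghz": 3.2, "boost_ghz": 4.1}

print(f"{per_die_budget:.1f} W per die vs {ryzen_2700['tdp_w']} W for a 2700")
# ~62.5 W per die is a bit under a 2700's 65 W, so a ~3.4 GHz base would lean
# on the better binning assumed above.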
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,478
14,434
136
"People" say that @VirtualLarry spends an inordinate amount of funds on crazy PC stuff. But you sir are the polar opposite on the other end of the cost spectrum!!! (And I love it)

On a side note, one of my colleagues is putting together a beast of a multicore watercooled monstrosity at the moment, and he's going HEDT, but he's going Intel, for NO other reason than that's what he knows. In the days of the Athlon he would have gone P4. I actually tried to extol the virtues of Threadripper, but you just can't get through.

Basically, I hope these design wins (because that really IS what this is) work out to more IDENTITY for AMD.
:/ If they could just fix up their graphics cards section they could really DO the combined "GO AMD BOTH WAYS". (no comments on that last bit please, totally off topic)
BTW, all of this hardware runs 24/7 to try and cure cancer in the DC world. I HAVE cancer, and in the next couple of months I will lose my bladder to it; I just found out today.

So there's a real reason behind my spending insane amounts on hardware. Feel free to contribute here: https://foldingathome.org/

Sorry for the OT.