Question Will there be a 16-core TRX40 Threadripper?


Kedas

Senior member
Dec 6, 2018
355
339
136
I wanted a 4-memory-channel 16-core, but it seems the leaks only show 24 cores and up.
A 16-core I could ALSO use for games, but 24 cores? I don't think that will work well.
Buying one just to disable 8 cores isn't really acceptable.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,480
14,434
136
Everybody keeps ignoring the below:

Want 16 cores and don't need the extra I/O or PCIe lanes? Get a 3950X, $750.
Need more PCIe lanes, quad-channel memory, or something else the Threadripper platform offers? Get a 1950X, $400 (used).
More horsepower than that? Get a 2950X, $600 (used).
Even more? A 2970WX, $800 (used; just passed on one that went for $670 today).
Even more still? A 2990WX, $1,200 (used; just passed one up for $1,150 today).

Or if you need the ultimate in power, PCIe lanes (and 4.0), and the fastest cores:
24 cores for $1,400
Or 32 cores for $2,000

Used prices are based on close to the lowest "Buy It Now" prices on eBay; I have bought many of mine there.

So you can have almost any configuration of Threadripper to meet your needs and budget, since the 2xxx series and X399 motherboards will still be produced.

So what is the problem here? Please, somebody straighten me out.
 

DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
as the 2xxx series and the x399 motherboard will still be produced.

So what is the problem here? Please somebody straighten me out.

They are? I thought those were going to be phased out, just like the first-gen Threadripper products were phased out.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,480
14,434
136
They are? I thought those were going to be phased out, just like the first-gen Threadripper products were phased out.
I read it somewhere in one of these threads... Too late tonight to go back and re-read everything, but I am pretty sure that's true. At least for the next year or so, by which time TRX40 or whatever will be old news.
 

beginner99

Diamond Member
Jun 2, 2009
5,208
1,580
136
The problem is that as core count goes up, Intel's clock speed advantage goes away, and Cascade Lake-X is just Skylake-X++, which Zen 2 is already better than. The end result, I think, is that they will be nearly a wash on price/perf.

With TR2 it was clear that TR2 was the better buy; here, with Intel greatly lowering prices, it's not clear cut anymore. In fact, the low-end workstation segment is Intel territory, as AMD doesn't offer anything comparable there. Intel also has AVX-512, which in some workloads like encoding can matter a lot, so the 18-core could easily reach the 24-core TR for a lower price.

I'm looking to get a workstation at work and pushing for a 3970X, but if availability or budget becomes an issue, a Core i9-10980XE would be a good alternative, even if most of our usage doesn't include AVX-512.

EDIT:

And also this.

If you use Python, and hence NumPy, heavily, the Intel CPU will have a tremendous advantage due to AVX-512.
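As context for how much this matters in practice, here is a minimal timing sketch (not from the thread; the matrix size is an arbitrary assumption). The `@` matmul is dispatched to whatever BLAS NumPy was linked against (MKL, OpenBLAS, BLIS), so the very same script can differ severalfold between installs on the same CPU:

```python
# Minimal sketch: time a BLAS-backed matmul in NumPy. The linked BLAS does
# the heavy lifting, so results vary strongly with the BLAS build in use.
import time
import numpy as np

rng = np.random.default_rng(0)
n = 1000
a = rng.standard_normal((n, n))
b = rng.standard_normal((n, n))

start = time.perf_counter()
c = a @ b                          # dispatched to the linked BLAS (dgemm)
elapsed = time.perf_counter() - start

gflops = 2 * n**3 / elapsed / 1e9  # a matmul costs ~2*n^3 floating point ops
print(f"{n}x{n} matmul: {elapsed:.3f}s, ~{gflops:.1f} GFLOP/s")
```

Comparing the reported GFLOP/s across environments (e.g. an MKL-based install vs. an OpenBLAS-based one) makes the backend effect discussed in this thread directly visible.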
 
Last edited:
  • Like
Reactions: lightmanek

Kedas

Senior member
Dec 6, 2018
355
339
136
So you can have almost any configuration of threadripper to meet your needs and budget, as the 2xxx series and the x399 motherboard will still be produced.

So what is the problem here? Please somebody straighten me out.
They are much less power efficient and slower (which means louder), have much lower FP performance, and on top of that are not really good for VR/games.
So the selection pool is Zen 2 only. Besides, why would I buy such a motherboard knowing it only fits the slower old CPUs? My budget isn't that small.
 

Kedas

Senior member
Dec 6, 2018
355
339
136
And also this.

If you use python and hence numpy heavily, the intel cpu will have a tremendous advantage due to avx512.
Yes, I saw it a few weeks ago: AVX2 vs AVX-512 is about twice the speed for certain NumPy tasks, and I assume in the coming years more software will be optimized for AVX-512.
BUT on the other hand, if you have a really big workload (the kind you have to wait on), you may be much faster (>10x) using CUDA on the GPU (via Numba), which makes the AVX-512 point not really important anymore.

P.S. The Core i9-10980XE even struggles to beat the 12-core 3900X in Handbrake, and it is slower in games. (So no Intel option there.)
 
Last edited:
  • Like
Reactions: lightmanek

beginner99

Diamond Member
Jun 2, 2009
5,208
1,580
136
BUT on the other hand, if you have a really big workload (the kind you have to wait on), you may be much faster (>10x) using CUDA on the GPU (via Numba), which makes the AVX-512 point not really important anymore.

I disagree. In my link you can see a Ryzen 3900X on MKL is almost 10x slower than a Xeon, because Intel MKL defaults to SSE code paths on AMD. This is for basic tasks which don't need a GPU at all. Even for small tasks it matters whether something takes 2 minutes or 15 seconds.

In fact, since I have a 3900X on Windows (my previous link used Ubuntu), I tried to recreate the results, and they pretty much match what that guy got on Ubuntu. However, getting NumPy installed with OpenBLAS instead of MKL was a real pain on Windows with Anaconda; it took me an hour to get it working, and that was just NumPy.

If I try to install scikit-learn on top, it fails when using the default channel, and it "downgrades" to MKL when using conda-forge, meaning you're back to the slow speed. Put another way: for Python/NumPy on Windows, better to buy an Intel CPU...
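As a hedged aside: one way to notice the silent "downgrade" described above is to check at runtime which BLAS the NumPy build in the active environment actually linked:

```python
# Sketch: inspect which BLAS/LAPACK libraries this NumPy build was linked
# against, so a conda solver quietly swapping OpenBLAS back to MKL is caught.
import numpy as np

print(np.__version__)
np.show_config()  # prints the BLAS/LAPACK build info to stdout
```

At the time of this thread, setting the (undocumented) environment variable `MKL_DEBUG_CPU_TYPE=5` before launching Python was a commonly reported workaround to force MKL's AVX2 code path on AMD CPUs; it was later removed from MKL, so treat it as a historical detail rather than a supported switch.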
 

Kedas

Senior member
Dec 6, 2018
355
339
136
I disagree. In my link you can see a Ryzen 3900X on MKL is almost 10x slower than a Xeon, because Intel MKL defaults to SSE code paths on AMD. This is for basic tasks which don't need a GPU at all. Even for small tasks it matters whether something takes 2 minutes or 15 seconds.

In fact, since I have a 3900X on Windows (my previous link used Ubuntu), I tried to recreate the results, and they pretty much match what that guy got on Ubuntu. However, getting NumPy installed with OpenBLAS instead of MKL was a real pain on Windows with Anaconda; it took me an hour to get it working, and that was just NumPy.

If I try to install scikit-learn on top, it fails when using the default channel, and it "downgrades" to MKL when using conda-forge, meaning you're back to the slow speed. Put another way: for Python/NumPy on Windows, better to buy an Intel CPU...
But that is some bug in Anaconda that needs to be fixed, or may already have been fixed.
It's not as if this is the intended way to work with OpenBLAS. By the time I have the hardware it will probably not be a problem anymore, so it's hard to take it into account for the hardware decision. In the worst case I can still use pip.

But you are right that it's annoying having to set up something that isn't the default.
If MKL were also optimized (read: "not sabotaged") for AMD, this wouldn't be a problem: MKL performance on AMD would match Intel's AVX2 path, leaving only the factor of ~2 for AVX-512. So you know who to blame for your trouble.
 
Last edited:

beginner99

Diamond Member
Jun 2, 2009
5,208
1,580
136
But that is some bug in anaconda that needs to be fixed, or may already have been fixed.

It's not a bug, it's just how it works on Windows. It is probably better on Linux, but even there one needs to jump through hoops to avoid the default MKL. These are the real uphill battles AMD is facing: it's not just about making their own BLAS library, but also about making it easily available.
 

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
The Ryzen 3000s have the eco-mode feature (105 W -> 65 W, 65 W -> 45 W), which is not present in the new TR3; it could be a great setting for making the machine quieter when doing heavy work. (It's not like I have a separate room to put the PC in.)
A 280 W TR3 could really use something similar, like 280 W -> 180 W -> 105 W.
At release, all Epyc 7002 SKUs came with two (edit: or three) cTDPs supported by the firmware (though with a far smaller step between the two cTDPs than Ryzen's eco mode). Perhaps Threadripper 3000 firmware will support cTDP too.

I don't think there are too many 2 or 4 core dies out there. The supply of those chips would mean they are either fighting Epyc for the crippled dies, or they are crippling dies that could go into higher margin products.
At this time there is no evidence for or against the assumption that a TR3 processor must be manufactured with exactly four CCDs.

the 2xxx series and the x399 motherboard will still be produced.
They are? I thought those were going to be phased out, just like the first-gen Threadripper products were phased out.
My understanding of the reporting so far was that they will remain available for some time. (This is not necessarily the same as still being manufactured.)

And also this.

If you use python and hence numpy heavily, the intel cpu will have a tremendous advantage due to avx512.
Yes I saw it a few weeks ago, AVX2 vs AVX512 is about twice the speed for certain numpy tasks, I assume in the coming years more software will be optimized for AVX512.
People need to stop perpetuating these misunderstandings about AVX512.
  • Only some problems, namely those whose data can be organized for effective operation on 256 b vector width in the first place, can be optimized further to make even more effective use of 512 b vector width.
  • The special AVX512 instruction subsets, whose support varies confusingly between processor SKUs, have rather narrow use cases.
  • AVX2-optimized applications which fall into neither of the above two cases can sometimes nevertheless be ported to AVX512. In that case, they may see a notable increase in computing throughput and power efficiency on processor SKUs which feature a second AVX512 unit. (Only the use of AVX512 instructions makes this extra unit available to software.)
    Such ports will, however, run slower than the original AVX2 version on AVX512-enabled processors which don't have a second AVX512 unit.
  • Large gains from AVX2- or AVX512-enabled software can only be seen if the problem size fits into the processor caches; otherwise, computing throughput becomes constrained by RAM bandwidth. In such and other cases, it may be worthwhile to look into a port to GPUs.
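The last bullet can be illustrated with a deliberately crude NumPy sketch (not from the thread; array sizes are arbitrary assumptions, and absolute numbers depend entirely on the machine): the same vectorized reduction achieves far more effective bandwidth when the working set is cache-resident than when every pass streams from RAM.

```python
# Rough sketch: effective bandwidth of a vectorized sum over a working set
# that fits in cache vs. one that must stream from RAM on every pass.
import time
import numpy as np

def gbytes_per_s(n_elems, repeats=10):
    x = np.ones(n_elems, dtype=np.float64)
    x.sum()                                   # warm-up pass (also faults pages in)
    start = time.perf_counter()
    for _ in range(repeats):
        s = x.sum()
    elapsed = time.perf_counter() - start
    return n_elems * 8 * repeats / elapsed / 1e9, s

small_bw, _ = gbytes_per_s(256 * 1024)        # ~2 MiB working set: cache-resident
large_bw, _ = gbytes_per_s(16 * 1024 * 1024)  # ~128 MiB: streams from RAM each pass
print(f"cache-resident: {small_bw:.1f} GB/s, RAM-bound: {large_bw:.1f} GB/s")
```

On typical hardware the first figure is several times the second, which is the point of the bullet: once the data no longer fits in cache, wider vectors stop helping because RAM bandwidth is the bottleneck.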

________
Edit Nov 10: Epyc cTDP options corrected
 
Last edited:
  • Like
Reactions: moinmoin

beginner99

Diamond Member
Jun 2, 2009
5,208
1,580
136
People need to stop perpetuating these misunderstandings about AVX512.

It's not just AVX-512: Intel MKL, which is used by default if you use Anaconda and NumPy, runs SSE code paths (or even slower ones) on Ryzen. It doesn't even use AVX2. This is what you get with Ryzen if you follow the normal route according to the tutorials.
 

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
It's not just avx512, intel mkl which is used by default if you use anaconda and numpy runs sse-code or even slower path on ryzen.
Intel MKL is not alone in this. Further, Zen 2 tends to require different optimizations than Zen/Zen+ (with Zen 2's optimal or near-optimal code path being more in line with that of recent Intel microarchitectures, thanks to the reorganized floating point and vector execution units). This is completely orthogonal to what I said people need to fix about their understanding of AVX512.
 

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
AT this time there is no evidence for or against the assumption that a TR3 processor must be manufactured with exactly four CCDs.

There is good reason to believe they are limited to either 4 or 8 CCDs, for several reasons. Until we see anything different, any configuration of cores will be based on that.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,480
14,434
136
Everybody keeps ignoring the below:

Want 16 cores and don't need the extra I/O or PCIe lanes? Get a 3950X, $750.
Need more PCIe lanes, quad-channel memory, or something else the Threadripper platform offers? Get a 1950X, $400 (used).
More horsepower than that? Get a 2950X, $600 (used).
Even more? A 2970WX, $800 (used; just passed on one that went for $670 today).
Even more still? A 2990WX, $1,200 (used; just passed one up for $1,150 today).

Or if you need the ultimate in power, PCIe lanes (and 4.0), and the fastest cores:
24 cores for $1,400
Or 32 cores for $2,000

Used prices are based on close to the lowest "Buy It Now" prices on eBay; I have bought many of mine there.

So you can have almost any configuration of Threadripper to meet your needs and budget, since the 2xxx series and X399 motherboards will still be produced.

So what is the problem here? Please, somebody straighten me out.
Replying to my own post with an update: I saw a 7551 Epyc ES for $300 on eBay and could not resist. I got 2 of them, plus a motherboard and 128 GB of 2666 ECC RAM, for $2,500. Just an example of what you can get for your money if you want a lot of performance: 16 channels of RAM and 128 threads for $2,500! Granted, they only run at 2.5 GHz, but with that many cores/threads and that much RAM, who cares! (Not a gamer.)

Soon to be an EPYC guy!
 

Kedas

Senior member
Dec 6, 2018
355
339
136
There is good reason to believe that they are limited to either 4 or 8 CCDs for several reasons. Until we see anything different then any configuration of cores will be based on that.
Rome does have an 8-core version:
https://www.amd.com/en/products/cpu/amd-epyc-7232p (120 W)

How do you do that with 4 dies? One core per CCX? Then the cache doesn't add up to 32 MB, since you have 16 MB of L3 per CCX.
32 MB means 2 CCX, and you can do that with 1 or 2 CCDs.

Their 12-core does have 64 MB total L3 cache, meaning 4 CCX. Three active cores on each of 4 dies? I don't think so; I think 2 CCDs here.

So no, I don't think they need a minimum of 4 dies connected to the I/O die, and I don't see any technical reason why such a minimum would exist. (Ignoring whether it's a good idea to do it or not.)

I think the main reason now is that 32 and 24 cores, hence always 4 dies, means a single assembly/production line for TR3.
Also, heat and mechanical stress would need to be re-tested with only 2 dies, and the production line would need changes, or a second one would need to be set up.

Anything Rome can do, TR3 can do as well, if they want to.
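The cache arithmetic in this post can be sketched as a quick script (figures as quoted in the thread: a Zen 2 CCD carries 2 CCX, each CCX up to 4 cores and 16 MB L3). Note it bakes in the assumption that L3 is never partially fused off per CCX, which is exactly the point under debate here, so treat it as a model of this argument rather than of AMD's actual binning:

```python
# Sketch: which CCD counts are arithmetically consistent with a quoted core
# count and total L3, assuming full 16 MB L3 per active CCX (the disputed point).
CCX_L3_MB = 16
CORES_PER_CCX = 4

def possible_ccd_counts(cores, l3_mb, max_ccds=8):
    """CCD counts that can host `cores` cores and the quoted L3, given at
    most 2 CCX per CCD and 4 cores / 16 MB L3 per CCX."""
    assert l3_mb % CCX_L3_MB == 0  # assumes no partially fused L3
    active_ccx = l3_mb // CCX_L3_MB
    return [ccds for ccds in range(1, max_ccds + 1)
            if active_ccx <= 2 * ccds and cores <= active_ccx * CORES_PER_CCX]

# Epyc 7232P: 8 cores, 32 MB L3 -> 2 active CCX; 1 CCD already suffices arithmetically
print(possible_ccd_counts(8, 32))
# 12 cores with 64 MB L3 -> 4 active CCX; at least 2 CCDs under these assumptions
print(possible_ccd_counts(12, 64))
```

Under these assumptions the script reproduces the post's conclusion: nothing in the quoted specs forces a 4-die minimum. Dropping the no-partial-fusing assumption (half the L3 disabled per CCX) is what re-opens the 4-die configurations argued for later in the thread.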
 
Last edited:

StefanR5R

Elite Member
Dec 10, 2016
5,459
7,718
136
While I agree that AMD is leaving a certain gap between the Ryzens and TR 3000s (at least for the time being), that's how things are, and you have plenty of alternatives buying new or 2nd hand, from AMD or from Intel.
Of course all of the alternatives come with various compromises. Whether or not these compromises are tolerable is a case by case consideration.
  • Virtually nobody would consider engineering samples from an auction platform to provide a valid alternative.
  • All in all, while the 2nd hand market should be given more consideration generally, the value that can be had there is often not all that good compared with new kit.
  • People who use Windows may not be interested in NUMA machines.
  • People who run floating point heavy workloads may not be interested in Zen/Zen+.
  • People who have decked themselves out with 14 nm CPUs already may not be interested in further 14 nm CPUs.
Anything Rome can do TR3 can do also, if they want to do it.
And possibly more. At this point we don't have solid info on how far AMD customized the I/O die and the PCB for TR3. Those who make guesses about possible and economically viable TR3 configurations should admit to themselves that these are just guesses, and wild guesses at that, unless they have direct info from inside AMD.
 

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
How do you do that with 4 dies? 1 core per CCX? then the cache doesn't add up to 32MB since you have 16MB L3 per CCX
32MB means 2 CCX you can do that with 1 or 2 CCDs
2+0, 2+0, 2+0, 2+0

Their 12 core does have 64MB total L3 cache meaning 4 CCX, 3 active cores on 1 die, I don't think so. I think 2 CCDs here.
3+0, 3+0, 3+0, 3+0

So, No, I don't think they need minimum 4 dies connected to the I/O die and I don't see any technical reason why this min. 4 would exist. (ignoring if it's a good idea to do it or not)

I think the main reason now is that 32 and 24 cores hence always 4 dies is the same 1 assembly/production line for TR3 now.
Also heat and mechanical stress would need to be tested again with only 2 dies and the production line will need changes or a second need to be set-up.

Anything Rome can do TR3 can do also, if they want to do it.

It's all 4 or 8 dies, 99.9% sure. That would give us a reason why lower-core-count versions of Threadripper could become available. It was also part of my reasoning that the few such dies with terrible configs would give AMD more variation on their much-higher-margin EPYC platform, rather than being wasted on a no-holds-barred version of Threadripper. On top of that, Ryzen as a consumer CPU already has products in these markets. What it doesn't help them get is an 18c, 20c, or 22c CPU that they could price closer to the 10980XE.
 

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
But the L3 cache doesn't add up, 4 active CCX is 64MB L3 cache and the spec says 32MB (meaning only 2 CCXs)
They have done half cache before. The 1400 had only 8 MB of L3 even though it was a 2x2 CCX design.

This is a bare-bones "I need the I/O" CPU. So they take CPUs with significant problems in one of the CCXs, possibly with issues in the L3 of one of the slivers. AMD has gone out of its way to keep things as uniform as possible, with only slight variations. There might not be a strictly technical reason why they can't go unbalanced with the dies, and they did so with first- and second-gen TR. But for 85-90% of the entire lineup they have been very keen on keeping to it. On the server end they would be 100% against it (which is why I am so sure those are 4-die chips). On TR, the only reason to do it would be to sell a chip that is really expensive to manufacture (considering the packaging and the I/O die) for nearly the same price as their consumer CPUs, with little upside; it's not like TR even really allows for a clock bonus, since 7 nm already looks like it lets the 2 CCDs hit their limit on the high-core-count AM4 chips.

I mean, think about it: if AMD were willing to run with uneven dies, given what the competitor is selling, where is the 56-core part that they could compare straight up against Intel's, adding another SKU in the super-high but still reasonable price range? That would be a pretty golden option. But no; it's really easy to see that AMD is restricted by design decisions that preclude doing anything other than 4 or 8 dies.
 

Kedas

Senior member
Dec 6, 2018
355
339
136
L3 is cut in half?
The 8-core Rome has 32 MB of L3 cache (half of the 12-core Rome's).

FYI, Zen 3 may not be 2x16 MB per die (16 MB for each CCX).
 

Kedas

Senior member
Dec 6, 2018
355
339
136
That's what I mean, half of the L3 is fused off. So the 7232P is presumably four dies with one CCX enabled per die, each with 2 cores each and half of the L3 enabled. The 7252 has the full L3.
I don't think they will sell an 8 MB/CCX die in EPYC 7002 or TR3; maybe in a very low-end Ryzen.
 

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
So you are saying it's easier to believe that they have broken their uniform design for EPYC than that they did something they have done several times before in disabling L3?