Question: CPUs for shared-memory parallel computing

Hermetian

Member
Sep 1, 2024
64
45
46
frostconcepts.org
If it's easy to do a test run for one of your intensive programs, I could do it on my 128-thread Zen 2 Epyc ES and see if your program is able to take advantage of all those threads to run faster. 32 threads of a 9950X would easily beat my old-gen 128 threads, so that could make your decision to get the 9950X easy.
Thank you for your kind offer. It would be complicated, plus you'd need a license for Mathematica 13 or later.
 
Reactions: igor_kavinski
igor_kavinski

Jul 27, 2020
19,482
13,357
146
Thank you for your kind offer. It would be complicated, plus you'd need a license for Mathematica 13 or later.
I'm willing to try if your program will work with a trial version of Mathematica. Can't be so complicated that I can't follow some steps, can it? :)
 

Gideon

Golden Member
Nov 27, 2007
1,765
4,108
136
Given the memory quotes I received today, I'm going to upgrade to 128GB of DDR4-2933 (what the processor is designed for). This might shake up my benchmarks a bit. At the very least I'll be able to increase the number of compute kernels. My license currently allows up to 14.
This indeed seems to be the best course of action right now. With some CPUs there are issues running the fastest RAM bins (2933 MT/s JEDEC in your case) at max RAM capacity (128 GB), but judging by what's available online, the 10900K should handle that just fine.

There seems to be little reason to use XMP with 128 GB anyway, as people report mixed experiences with 3200-3600 MT/s kits being stable 24/7.

Regarding new CPUs: Intel has hybrid cores that will limit their usefulness, and AMD uses two physically separate CCD dies above 8 cores as well (as explained above). Strix Halo could be an option, but its desktop availability is a huge unknown (it's a soldered BGA CPU intended foremost for laptops). It will probably show up in some mini-PCs early next year, but there are no guarantees.

Another interesting (though ultra-expensive) choice is Apple hardware, which to my knowledge supports Mathematica well, though you need to double-check that it works for all of your extensions.

Mac Studios have fast CPUs with unified on-package RAM that offers an absurd amount of bandwidth by CPU standards (400 GB/s for the base model, 800 GB/s for the M2 Ultra version), but even the cheapest 12-core variant with 64 GB of RAM costs $2,600. Upgrading to the M2 Ultra is $4,000, and adding 192 GB of RAM pushes it to $5,599. These machines are also rumored to be upgraded to M4 versions soon. But all of this only matters if macOS even works for you.

Going back to x86 land, Intel will upgrade its lineup soon, but still with hybrid cores. The only other unified-memory option, the spiritual successor to Strix Halo (AMD's Medusa lineup), is still at least 1.5 years away.

So all in all, it might indeed be best to just upgrade the RAM and wait.
 
igor_kavinski

Jul 27, 2020
19,482
13,357
146
Given the memory quotes I received today, I'm going to upgrade to 128GB of DDR4-2933 (what the processor is designed for).
Would you like to share the model of the memory kit you were quoted? If it's some typical business, they'd give you a bog-standard RAM kit with high CAS latency (possibly Samsung, Crucial, or Corsair) for the same or even a higher price than something much better from Kingston or G.Skill. For example, you could easily run a DDR4-3600 128 GB kit in your system at DDR4-2933 with lower latencies. Lower latencies would help a lot if your application hits the memory subsystem frequently. Which mobo do you have?
 
igor_kavinski

Jul 27, 2020
19,482
13,357
146
Here's an idea: how about writing something quick and dirty in Python to benchmark a use case on your PC, which I could then try too? In fact, if it's a particularly intensive workload, it would serve as a cool benchmark for the community as well. The results from me and possibly other members here would show you what kind of performance improvement to expect from different system configurations. A minimal sketch of what I have in mind is below.
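Something along these lines, assuming a synthetic chromosome and random stand-in markers (every name and constant here is hypothetical, just lifted from the numbers quoted earlier in the thread):

```python
# Minimal sketch: time a batch of regex searches over a synthetic ~30 MB
# "chromosome", spread across a pool of worker processes.
import random
import re
import time
from multiprocessing import Pool

CHROMOSOME_MB = 30   # medium-size chromosome per the thread
N_PATTERNS = 72      # total searches for one "marker"
N_KERNELS = 9        # worker count under discussion

_rng = random.Random(42)
CHROM = "".join(_rng.choices("ATGC", k=CHROMOSOME_MB * 1_000_000))

def search_one(pattern: str) -> int:
    # Count the matches; a real run would also record coordinates and
    # annotate them to disk for post-processing.
    return sum(1 for _ in re.finditer(pattern, CHROM))

if __name__ == "__main__":
    # Random 16-letter probes stand in for the real 16-512 letter markers.
    patterns = ["".join(_rng.choices("ATGC", k=16)) for _ in range(N_PATTERNS)]
    t0 = time.perf_counter()
    with Pool(N_KERNELS) as pool:
        hits = pool.map(search_one, patterns)
    dt = time.perf_counter() - t0
    print(f"{sum(hits)} matches across {N_PATTERNS} searches, "
          f"{N_KERNELS} workers, {dt:.2f} s")
```

On Linux the workers inherit CHROM via fork; under Windows each worker rebuilds it at import time, which inflates the measured wall time somewhat.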
 

Nothingness

Diamond Member
Jul 3, 2013
3,017
1,945
136
This is fascinating.

I know nothing about that domain, so sorry in advance if I'm completely off topic and irrelevant.

One of my areas of study is the DNA chromosomes of perennial plants. My idea of a medium-size chromosome is 20 to 30 MB -- literally a string of that length composed of the letters A, T, G, C. Per chromosome and per "marker", I need to perform 70 to 210 regular-expression searches for substrings of 16 to 512 letters, each of which will produce many match coordinates, all of which are then annotated and written to disk per search for post-processing. All of this is regular-expression dependent and thus happens asynchronously.
Is this dominating your running time or is the post-processing done to matched strings much more expensive? Could this matching be done with hand-written code outside of Mathematica? How complex are the regular expressions?

For this application on a 30 MB chromosome and a single "marker", I clock the shortest total run time by parallelizing the searches across 9 "kernels" on my 10-core CPU. For example, a total of 72 searches would be calculated as 8 iterations per kernel (a Mathematica-dispatched process). The memory utilization for the first few minutes is about 80% of my 16 GB (according to Task Manager, incl. the OS), then drops to about 70% for an hour or so, and then drops further as each kernel finishes its task. I estimate the OS and Mathematica memory overhead at about 17%.
You make it sound like there's no memory capacity issue. I'm sure I missed something, but if that's correct then you should first look for faster* memory rather than more capacity (unless you increase the number of processes, as you hint later in your post; but then I wonder if going beyond 9 processes and starting to use hyper-threaded processes would be a win). One way to test that empirically is sketched after the footnote below.

*Faster here might mean higher bandwidth or lower latency, it depends on the specific bottlenecks of your Mathematica code.
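A hypothetical worker-count sweep, reusing search_one and patterns from the Python sketch earlier in the thread, would just rerun the identical workload at growing process counts; if the times keep dropping past the physical-core count, SMT is paying off for this workload:

```python
# Hypothetical sweep; names follow the earlier benchmark sketch.
import os
import time
from multiprocessing import Pool

def sweep(work_fn, items):
    for n in (1, 2, 4, 8, 9, 10, 16, 20):
        if n > (os.cpu_count() or 1):   # stop at the logical-core count
            break
        t0 = time.perf_counter()
        with Pool(n) as pool:
            pool.map(work_fn, items)
        print(f"{n:2d} workers: {time.perf_counter() - t0:6.2f} s")

# e.g. sweep(search_one, patterns) with the names from the earlier sketch
```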
 
igor_kavinski

Jul 27, 2020
19,482
13,357
146
He has a 10-core CPU and limits the number of processes to 9. So if he goes above that, and provided Mathematica is able to distinguish HT threads from physical cores, he'll start using HT beyond that limit.
Not sure how Mathematica enforces the license limit. If it detects more than 10 cores, does it permanently refuse to run until a payment is made for the additional core licenses? Or is it just something that depends on the user's honor? I once heard an actual enterprise developer tell me that some of their customers use Oracle DB WITHOUT a license, so we shouldn't need to worry about the license cost. I recommended to my COO that the developer be blacklisted from any further time-wasting meetings.
 

Hermetian

Member
Sep 1, 2024
64
45
46
frostconcepts.org
I used Mathematica in my student years and I can't say good things performance-wise.
Efficiency in Mathematica depends heavily on programming style. Code written with control structures (e.g., C-language style) will execute slower than code written in a functional (e.g., Lisp-like) style. I've seen speedups from hours to seconds by switching to a functional model. The sketch below illustrates the principle.
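A rough Python analogue of the same idea (the gap in Mathematica is typically far larger, since top-level Wolfram loops are interpreted while functional primitives run inside the optimized kernel); the sequence and task here are invented for illustration:

```python
# Control-structure style vs. functional/built-in style on the same task:
# counting "GC" pairs in a synthetic DNA string.
import time

seq = "ATGC" * 5_000_000  # ~20 MB synthetic sequence

t0 = time.perf_counter()
n_loop = 0
for i in range(len(seq) - 1):      # C-style explicit loop
    if seq[i] == "G" and seq[i + 1] == "C":
        n_loop += 1
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
n_builtin = seq.count("GC")        # one functional/built-in call
t_builtin = time.perf_counter() - t0

assert n_loop == n_builtin         # "GC" cannot overlap itself, so counts agree
print(f"loop: {t_loop:.2f} s, built-in: {t_builtin:.3f} s")
```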
 

Nothingness

Diamond Member
Jul 3, 2013
3,017
1,945
136
Not sure how Mathematica enforces the license limit. If it detects more than 10 cores, does it permanently refuse to run until a payment is made for the additional core licenses? Or is it just something that depends on the user's honor? I once heard an actual enterprise developer tell me that some of their customers use Oracle DB WITHOUT a license, so we shouldn't need to worry about the license cost. I recommended to my COO that the developer be blacklisted from any further time-wasting meetings.
I guess Mathematica just allows spawning 14 threads (I think that's the number @Hermetian quoted); that's unrelated to the exact number of cores the user has.

That's very easy to do in software, with no need for complex license management: the maximum number of threads is likely read when Mathematica checks the license at startup. Of course, there are always ways to get around such license limitations; no protection is 100% secure.
 

Hermetian

Member
Sep 1, 2024
64
45
46
frostconcepts.org
Is this dominating your running time or is the post-processing done to matched strings much more expensive? Could this matching be done with hand-written code outside of Mathematica? How complex are the regular expressions?

You make it sound like there's no memory capacity issue. I'm sure I missed something, but if that's correct then you should first look for faster* memory rather than more capacity

*Faster here might mean higher bandwidth or lower latency, it depends on the specific bottlenecks of your Mathematica code.
The post-processing run time is less than 20 minutes, sometimes less than 3. It involves assimilating hundreds of files, vetting their contents, and then writing a single output file. I have a high-end SSD, and I'm using a binary file format, so the I/O portion takes under a second. A rough sketch of that step follows.
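Purely as a hypothetical sketch of that assimilation step (pickle standing in for whatever binary format is actually used; all names invented):

```python
# Gather per-search result files, vet their contents, write one output file.
import pickle
from pathlib import Path

def assimilate(result_dir: str, out_file: str) -> int:
    records = []
    for path in sorted(Path(result_dir).glob("*.pkl")):   # hundreds of files
        with path.open("rb") as fh:
            batch = pickle.load(fh)
        # Vetting: keep only well-formed (start, end, annotation) triples.
        records += [r for r in batch
                    if isinstance(r, tuple) and len(r) == 3 and r[0] <= r[1]]
    with open(out_file, "wb") as fh:                      # single binary output
        pickle.dump(records, fh, protocol=pickle.HIGHEST_PROTOCOL)
    return len(records)
```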

Of course, one could write this code in a number of other languages and compile it to a binary executable. But that would require building or acquiring compiler-compatible libraries of set-theoretic functions; i.e., a lot of software-management overhead.

The regular expressions are not complex. In fact, I initially tried using a single complex regex per marker but quickly found that the recursion limit is exceeded on chromosome-size strings. Consequently I replaced the complex regex with many individual cases (illustrated below). Soon after that I realized that parallelization would be necessary.
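Purely as a hypothetical illustration of that restructuring (Python's re rather than the actual Mathematica code; markers invented): the one-big-alternation form next to the split form. Besides sidestepping recursion limits, the split searches are natural independent units for parallel dispatch, and they also report overlapping matches that a single alternation scans past.

```python
import re

chrom = "ATGCATTGCA" * 1000
markers = ["ATTG", "GCAT", "TGCA"]   # invented stand-ins for the real cases

# One complex regex: a single pass, but the pattern grows with every case.
combined = re.compile("|".join(map(re.escape, markers)))
combined_hits = [(m.start(), m.group()) for m in combined.finditer(chrom)]

# Many simple regexes: each search is an independent unit of work.
split_hits = []
for mk in markers:
    pat = re.compile(re.escape(mk))
    split_hits += [(h.start(), mk) for h in pat.finditer(chrom)]

# The split form finds matches the non-overlapping alternation skips.
print(len(combined_hits), len(split_hits))
```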

The on-board RAM in my system is the most efficient available for the processor.
 
igor_kavinski

Jul 27, 2020
19,482
13,357
146
The regular expressions are not complex. In fact, I initially tried using a single complex regex per marker but quickly found that the recursion limit is exceeded on chromosome-size strings. Consequently I replaced the complex regex with many individual cases. Soon after that I realized that parallelization would be necessary.
If you could provide a simple example of one whole calculation, I think I could whip up a quick C program, because this looks like text processing more than anything else, unless some indexed lookups are involved in there too?
 

Hermetian

Member
Sep 1, 2024
64
45
46
frostconcepts.org
Man, seems like the kind of workload a 9950X3D would be great at, if AMD releases it.
I appreciate your answer, but I also believe several members here have misconstrued the computation. A "thread-eating" machine will be helpful only up to a point. The real bottleneck in a production run of this DNA search code is the sheer number of data-dependent iterations. Hyperthreading can exacerbate the problem by generating too many memory references -- thus my concern about memory channels and bandwidth, along with the comments of @Nothingness regarding cache sizes.