Looks like the challenge is for
Sierpinski/Riesel Base 5 LLR (SR5) workunits. Does this look like the right one?
It's the right one, but too many at once for it to run optimally. Each SR5-LLR instance leaves a footprint of
about 8 MB in the processor caches.
The Core i7-12700KF has got 1.25 MB L2$ on each of the 8 P cores, 2.0 MB L2$ on the cluster of E cores, and 25 MB L3$ shared between all P cores and the E cluster. (I don't recall the L3$ policy; my guess is it's non-inclusive.) That's a rather complex topology with hard-to-predict performance characteristics. I suppose that cache misses become very frequent once 4 or more SR5-LLR instances run together, which means the processor's execution units will sit partially idle, waiting for reads and writes to main memory.
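A back-of-the-envelope tally, assuming the ~8 MB per-instance footprint holds and taking the shared L3$ as the limit:

   1 task:  1 x 8 MB =  8 MB  (fits comfortably into the 25 MB L3$)
   3 tasks: 3 x 8 MB = 24 MB  (just about fits)
   4 tasks: 4 x 8 MB = 32 MB  (exceeds the L3$, so misses spill to main memory)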
The solution, short version:
Try
"Multi-threading: Max # of threads for each task"=7 in the PrimeGrid web preferences.
Long version:
You can use either an app_config.xml (a minimal sketch follows below this list), or, perhaps more conveniently but with less fine-grained control, the PrimeGrid web preferences to control the SR5-LLR workload and behavior. The
PrimeGrid preferences webpage has got these two relevant options:
- You can specify that the program should use more than 1 program thread. After you change this, and after your boinc client has downloaded a new task, the client will see that this task is going to occupy correspondingly many logical CPUs, and will launch correspondingly fewer such tasks. And of course the application itself will pick up this setting too and spin up that many program threads.
- Less importantly, you can also configure the server-enforced per-host limit of "tasks in progress". (That's the server's perspective of "in progress", i.e. the span between the server assigning a task to the host and the host reporting the result back.) This option is somewhat misleadingly called "Max # of simultaneous PrimeGrid tasks" on the web page.
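For the app_config.xml route, here is a minimal sketch. I'm assuming llrSR5 as the application name and mt as the plan class; verify both against your client_state.xml, and treat the numbers as placeholders:

   <app_config>
      <app>
         <name>llrSR5</name>
         <!-- client-side cap on concurrently running SR5-LLR tasks -->
         <max_concurrent>2</max_concurrent>
      </app>
      <app_version>
         <app_name>llrSR5</app_name>
         <plan_class>mt</plan_class>
         <!-- logical CPUs which the client budgets for one task -->
         <avg_ncpus>7</avg_ncpus>
         <!-- program threads which the LLR wrapper actually starts -->
         <cmdline>-t 7</cmdline>
      </app_version>
   </app_config>

Activate it with Options > "Read config files" in the BOINC manager; tasks which are already running keep their old thread count until they are restarted.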
So, purely based on cache sizes, it seems plausible that running 1, 2, or 3 of those SR5-LLR tasks at once is OK, while at 4 or more the host's throughput will degrade. The next step would be to figure out how many program threads to configure per task:
- The general goal would be to use all cores.
- The more threads are used per task, the shorter the run time of an individual task (up to a point of diminishing returns, or even negative scaling beyond that).
- On the other hand, the higher the thread count, the more processing time and power budget go into synchronization overhead, which is detrimental to host throughput.
- LLR makes heavy use of the AVX units. On a hyperthreaded core, both hyperthreads compete for access to one and the same AVX unit; hence Hyperthreading does not scale at all for LLR.
And then there is a specific complication with Alder Lake-S:
The performance of P cores and E cores differs a lot. But I suspect that multithreaded LLR works best if all program threads run at the same speed.
I don't have such a processor myself, but it seems that many workloads get about the same performance out of an E core as out of one Hyperthread of a P core, provided that both Hyperthreads of the P core are fully loaded. So
maybe it would be prudent to configure the number of concurrently running SR5-LLR tasks and the number of program threads per task such that all, or almost all, P-core Hyperthreads and E cores get used.
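To put numbers on that, counting each E core as roughly one P-core Hyperthread:

   8 P cores x 2 Hyperthreads + 4 E cores = 20 logical CPUs
   2 tasks x 7 threads  --> 14 of 20 logical CPUs busy, ~16 MB cache footprint
   3 tasks x 6 threads  --> 18 of 20 logical CPUs busy, ~24 MB cache footprint
   2 tasks x 10 threads --> 20 of 20 logical CPUs busy, but each task straddles P and E cores

This is mere arithmetic; which combination actually wins is something only measurement can tell (see the PS).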
Another promising, but more complicated to implement, configuration would be to run 2 quick tasks with 4 threads each, "pinning" those tasks to P cores (without using Hyperthreading on them), and in addition 1 slow task with 4 threads, "pinned" to the E core cluster.
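BOINC itself cannot pin tasks to cores, so the pinning part would have to be done with an external tool (e.g. taskset on Linux, or Process Lasso on Windows) applied to the running LLR processes. The client-side half could look like this, again with llrSR5/mt assumed and the numbers as placeholders:

   <app_config>
      <app>
         <name>llrSR5</name>
         <!-- 2 quick tasks on P cores + 1 slow task on the E cluster -->
         <max_concurrent>3</max_concurrent>
      </app>
      <app_version>
         <app_name>llrSR5</app_name>
         <plan_class>mt</plan_class>
         <avg_ncpus>4</avg_ncpus>
         <cmdline>-t 4</cmdline>
      </app_version>
   </app_config>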
And a further config to explore would be to leave the E cores unused (perhaps even deactivate them in the BIOS) and let just the P cores do the work. This would reduce the number of available execution units, but perhaps that loss would be offset by the whole power budget being available to the P cores. Due to its heavy AVX usage, LLR scales quite reasonably with per-core power budget.
PS: Putting all this guesswork aside, the optimum config could be found empirically with systematic benchmarking of the LLR program outside of BOINC. But how to do that is a whole other story, for another day. I
wrote about that elsewhere a while ago.