Well for instance I was wondering if it might not be better on a 6C/12T to do 1x11 instead of 2x6.
It's unfortunate that I didn't find time to run my usual tests before this challenge.
Looking at my older tests, llrCUL's CPU times (95 h) are between llrPSP (153 h) on one side and llrGCW (67 h), llrESP (59 h), and llr321 (48 h) on the other. Longer CPU times imply a larger memory footprint of the working set, and therefore larger cache demands. (I am referring to PrimeGrid's global average CPU times, which are listed on the "Edit PrimeGrid preferences" web page.)
On the E5-2690 v4, which has a 35 MB shared inclusive L3 cache, I ran
- the larger llrPSP in April 2017 with 2 tasks per socket (alas, didn't try 3 tasks per socket),
- the smaller llrGCW in August 2017 with 3.5 tasks per socket,
- the even smaller llrESP in April 2018 with 3 tasks per socket. Here I also tested 1, 2, 3.5, 4, and 5 tasks per socket, all of which were inferior to 3 tasks per socket.
(As an aside, early in 2017 I never used HT; only later in 2017 did I learn that HT can be beneficial.)
Since I don't know what's best for llrCUL, I simply run 3 per socket now. Some of the tasks finish with longer run times than others, which makes me suspect that I am off the actual optimum. I do intend to find a time slot to test llrWOO before the next challenge. llrWOO is at 117 h average CPU time, i.e. sits between llrCUL and llrPSP.
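To make the ordering easier to see, here is a small sketch that ranks the subprojects by the average CPU times quoted above, using CPU time as a rough stand-in for cache demand. The dictionary name and layout are my own choice for illustration; only the hour figures come from this post.

```python
# Global average CPU times in hours, as quoted in this post.
avg_cpu_hours = {
    "llrPSP": 153,
    "llrWOO": 117,
    "llrCUL": 95,
    "llrGCW": 67,
    "llrESP": 59,
    "llr321": 48,
}

# Rank from presumably most to least cache demanding
# (assumption: longer CPU time means a larger working set).
ranked = sorted(avg_cpu_hours, key=avg_cpu_hours.get, reverse=True)

for name in ranked:
    print(f"{name}: {avg_cpu_hours[name]} h")
```

llrCUL and llrWOO both land between llrPSP (tested with 2 per socket) and llrGCW (3.5 per socket), which is why neither earlier result can simply be carried over.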
Back to the 6C/12T which you are referring to: If it is a Coffee Lake i7, then it has 12 MB of inclusive L3 cache, i.e. about one third of the E5-2690 v4's. So if we knew the optimum number of concurrent tasks for the E5, and assumed that L3 cache size is the most influential factor for this number between architectures as similar as BDW-EP and CFL, we could make a guess for the optimum number on CFL.
But we do know that the presumably less cache-demanding llrESP was best with 3/socket on the E5. This at least indicates that 1/socket may be better than 2/socket for the probably more cache-demanding llrCUL on a processor with one third of the cache.
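As a back-of-the-envelope check, the guess can be written out as simple proportional scaling. This is only a sketch under the assumption stated above, namely that the optimum task count scales linearly with L3 cache size between the two architectures; the function name is made up for illustration, and nothing here has been measured on CFL.

```python
def scale_tasks_by_cache(ref_tasks: float, ref_cache_mb: float, target_cache_mb: float) -> float:
    """Scale a known-good task count by the ratio of L3 cache sizes.

    Assumption: the optimum number of concurrent tasks is proportional
    to L3 cache size. This is a rough guess, not a measured rule.
    """
    return ref_tasks * target_cache_mb / ref_cache_mb


# llrESP optimum on the E5-2690 v4 (35 MB L3): 3 tasks per socket.
# Coffee Lake i7 (12 MB L3) is the target.
estimate = scale_tasks_by_cache(ref_tasks=3, ref_cache_mb=35, target_cache_mb=12)
print(f"{estimate:.2f} tasks")  # prints "1.03 tasks"
```

Since llrCUL is presumably more cache demanding than llrESP, the real optimum would, if anything, be lower than this figure, which again points to 1 task rather than 2.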