I decided to mess around with LM Studio on my 64C/128T Zen 2 server, just to see if having this many cores would help with performance. It also has a 1080 Ti installed to keep it company. Here's what I found:
LM Studio has a hard limit of 32 physical cores, and it ignores virtual (SMT) threads entirely. So if you have a 16C/32T CPU, you only get to use 16 threads for inference.
Decided to try a "preview" BF16 model, 65 GB in size: FUSEO1-Deepseek-Qwen_33B_something. Went with BF16 because it's supposed to be the most accurate format. The GPU did not like that format at all and slowed to a crawl (several seconds per token), so I had to take it out of the equation. The CPU cores didn't mind it as much, but the speed was still barely 2 tokens per second. It "thought" and showed its reasoning on how it arrived at the final solution after 57 minutes and 6 seconds. I don't have a compiler installed, so I can't judge how good the solution actually is. Need to test this model with a better GPU to see if it can be sped up.
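Back-of-envelope, that 57:06 run at roughly 2 tokens per second works out to a few thousand tokens of chain-of-thought. A quick sanity check (the 2 tok/s figure is my rough observation, not a measured average):

```python
# Rough estimate of how many tokens the 57:06 "thinking" run produced.
elapsed_s = 57 * 60 + 6      # 57 minutes 6 seconds = 3426 seconds
tok_per_s = 2                # approximate observed CPU throughput
total_tokens = elapsed_s * tok_per_s
print(total_tokens)          # → 6852, i.e. ~7k tokens of reasoning output
```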
Went scurrying back to Deepseek V2 Coder Lite at "only" 16 GB in size with 8-bit quantization. The problem with this model is that it throws a lot of assumptions in the user's face, even when you give it documentation to work with. This time the GPU seemed a lot happier at around 11 tokens per second, but the solution was crap. And even though the CPU was supposed to share the workload, it didn't; it remained mostly idle while the GPU thought at 95%+ utilization.
Tried one more 33B Q8 model, Everyone_coder_V2, 35 GB in size. It's supposed to be a mixture of three different models that work collaboratively. The GPU again couldn't handle it. Turned to the CPU, which ran at least three times faster, but still only at something like 1 token per second. It tells me that what I'm asking it to do is impossible, since the Windows API and the Standard C++ library supposedly don't provide any function to get the frequencies of all CPU cores at once. It needs a better and more positive attitude to be useful.
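For the record, the model's claim isn't quite right: while the C++ standard library indeed has nothing for this, the Windows power API can fill an array of per-core frequency records in a single call via `CallNtPowerInformation` with the `ProcessorInformation` level. A sketch using Python's ctypes (guarded so it only actually calls the API on Windows; struct layout follows the documented `PROCESSOR_POWER_INFORMATION`):

```python
import ctypes
import os
import platform

class PROCESSOR_POWER_INFORMATION(ctypes.Structure):
    # Layout of the documented PROCESSOR_POWER_INFORMATION struct (winnt/powerbase).
    _fields_ = [("Number", ctypes.c_ulong),
                ("MaxMhz", ctypes.c_ulong),
                ("CurrentMhz", ctypes.c_ulong),
                ("MhzLimit", ctypes.c_ulong),
                ("MaxIdleState", ctypes.c_ulong),
                ("CurrentIdleState", ctypes.c_ulong)]

def per_core_mhz():
    """Return {core_number: current_mhz} for all logical cores in one API call."""
    n = os.cpu_count()
    buf = (PROCESSOR_POWER_INFORMATION * n)()
    # 11 == ProcessorInformation in the POWER_INFORMATION_LEVEL enum.
    status = ctypes.windll.powrprof.CallNtPowerInformation(
        11, None, 0, buf, ctypes.sizeof(buf))
    if status != 0:  # non-zero NTSTATUS means failure
        raise OSError(f"CallNtPowerInformation failed: {status:#x}")
    return {p.Number: p.CurrentMhz for p in buf}

if platform.system() == "Windows":
    print(per_core_mhz())
```

So all core frequencies really do come back at once; no polling one core at a time.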
Gonna need to check AnythingLLM next and see if it can use the CPU better than LM Studio does.