Just as a follow-up, since it took me a while to sort this out so that my old brain could understand it:

It appears the programmer of an app has to make it "NUMA aware" for it to use all NUMA nodes (all cores on multi-processor boards).
For those few here that actually have multi-processor workstations and actually use them to work: Windows has a concept called "Processor Groups".
Each group has a limit of 64 logical cores.
So on systems with a single processor of 64 logical cores or less, there is only one group and 1 NUMA node.
But on say a 2696v3 dual processor board, there are 72 logical cores.
So there will be two groups of 36 logical cores each, one group per NUMA node.
It is 36 in each group and node because that is what each processor has in this example.
(The BIOS can further divide each processor into multiple NUMA nodes before Windows "sees" them. The Gigabyte board I used for dual 2696v3 Xeons let me run each processor either as a single NUMA node of 36 logical processors or as 2 NUMA nodes of 18 logical processors. Divided that way, there were 4 NUMA nodes total for the 2 processors, each consisting of 9 "real" cores and their 9 "hyper" cores.)
If a dual (or more) processor board had, say, 72 logical cores in each processor, then each processor would have 2 groups and 2 nodes (4 Processor Groups and 4 NUMA nodes total for a dualie), because remember, Windows Processor Groups have a limit of 64 logical cores per group.
(The same basic principle applies to a single-processor board with more than 64 logical processors.)
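To make the arithmetic above concrete, here is a rough Python sketch that reproduces the group counts from these examples. It is only a simplified model under my own assumptions, not how Windows actually assigns groups (Windows also takes physical locality into account and may pack several small processors into one group). The function name estimate_group_count is made up for illustration.

```python
import math

MAX_GROUP_SIZE = 64  # Windows limit of logical processors per Processor Group


def estimate_group_count(logical_per_socket):
    """Estimate the number of Processor Groups for a board.

    logical_per_socket: logical processor count of each physical processor.
    Simplified assumption: if everything fits in 64, there is one group;
    otherwise each processor gets as many groups as it needs on its own.
    """
    total = sum(logical_per_socket)
    if total <= MAX_GROUP_SIZE:
        return 1
    return sum(math.ceil(n / MAX_GROUP_SIZE) for n in logical_per_socket)


print(estimate_group_count([32]))      # single processor, 32 logical cores
print(estimate_group_count([36, 36]))  # dual 2696v3, 72 logical cores total
print(estimate_group_count([72, 72]))  # hypothetical dual board, 72 per processor
```

Under those assumptions this prints 1, 2 and 4, matching the three cases described above.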
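But by default, programs are only able to use 1 group (in the above example, 1 group of 36 logical cores).
That is how Windows 10 and earlier behave; per Microsoft's documentation, Windows 11 and Server 2022 let a process span all groups by default.
On the older versions, in order to use multiple groups, the program needs to be specially programmed for it.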
(MS does explain how to do that in the link above, and while I'm not a programmer, it does seem rather straightforward and makes one wonder why it isn't routinely done.)
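For anyone curious what "programmed for it" roughly involves, here is a minimal Python sketch. It is not the method from the MS link, just an illustration under my own assumptions. It asks Windows how many logical processors each group has (via kernel32's GetActiveProcessorGroupCount and GetActiveProcessorCount), then builds one (group, affinity mask) pair per worker thread. The names active_group_sizes and plan_workers are made up, and on non-Windows systems the sketch simply pretends there is a single group.

```python
import ctypes
import os
import sys


def active_group_sizes():
    """Return the number of active logical processors in each Processor Group."""
    if sys.platform != "win32":
        # Assumption for non-Windows systems: treat all CPUs as one group.
        return [os.cpu_count() or 1]
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetActiveProcessorGroupCount.restype = ctypes.c_ushort  # WORD
    kernel32.GetActiveProcessorGroupCount.argtypes = []
    kernel32.GetActiveProcessorCount.restype = ctypes.c_uint32       # DWORD
    kernel32.GetActiveProcessorCount.argtypes = [ctypes.c_ushort]    # WORD group
    group_count = kernel32.GetActiveProcessorGroupCount()
    return [kernel32.GetActiveProcessorCount(g) for g in range(group_count)]


def plan_workers(group_sizes):
    """Plan one worker per logical processor as (group, affinity_mask) pairs.

    The mask has a single bit set: the processor's index within its group.
    """
    plan = []
    for group, size in enumerate(group_sizes):
        for cpu in range(size):
            plan.append((group, 1 << cpu))
    return plan


if __name__ == "__main__":
    sizes = active_group_sizes()
    print("logical processors per group:", sizes)
    print("workers planned:", len(plan_workers(sizes)))
```

A native app would then hand each pair to SetThreadGroupAffinity for the matching thread, which is the step that actually moves work onto the second group.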
As a workaround, Process Lasso Pro can "force" apps to use multiple groups with its Process Group Extender option.
I don't believe it is as efficient as the app would be if it were properly programmed to run on multiple NUMA nodes/Processor Groups,
but I have found it does reduce my encode and transcode times by roughly 35%-40% (depending on the original container, of course).
Hopefully that makes things a little clearer for anyone stumbling on this thread.
