(Speaking theoretically, not out of own experience...)
I think the underlying problem which you want to solve is that a GPU feeder process should allocate its memory on the NUMA node of the CPU to which the GPU is attached. That way, DMA to/from the GPU stays local to this CPU. Otherwise, such DMA would involve both CPUs and the QPI link between them.
(By the way, some server mainboard makers offer special "single PCIe root" boards for multi-GPU computing applications in which the GPUs need to communicate with each other. I suppose these are simply boards with PCIe switches on them. However, GPU--GPU communication is not required in any Distributed Computing project, as far as I know.)
So you'd like to configure processor affinity of GPU feeder processes. But I am not aware of direct support for processor affinity in boinc-client (and boincmgr, boinccmd, boinctasks etc.), which means you need an external tool.
For controlling processor affinity on Windows, I have several times seen people recommend Process Lasso. I haven't tried it myself yet, and haven't researched its precise capabilities. Notably, I wonder whether Process Lasso can detect without manual intervention which GPU feeder processes should run on CPU0 and which on CPU1. Maybe this is possible if you run separate boinc-client instances for this: Process Lasso would then be instructed that all new processes launched from one client are bound to CPU0, and all new processes launched from the other client go to CPU1.
If multiple client instances are indeed part of the solution for processor affinity of GPU feeder processes, then I would go a step further and run those two client instances for GPU projects plus a third client instance for CPU projects. This would have at least one benefit: You would no longer have to write an app_config.xml for each and every GPU application just to set
<cpu_usage>0.001</cpu_usage>. Instead, simply restrict boinc-client instance #3 to as few CPUs as desired (e.g. all minus four). Instances #1 and #2 need no restriction of how many CPUs they can use, as long as you let these instances run only GPU projects.
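For reference, such an app_config.xml would look roughly like this (the application name here is only a placeholder; the real one has to be taken from the project, e.g. from client_state.xml):
Code:
<app_config>
  <app>
    <name>example_gpu_app</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>0.001</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
With the three-instance setup, none of these files are needed anymore for this purpose.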
However, you still need to tell instance #1 to use only GPU0...GPU2, and instance #2 to use only GPU3 (or however these are numbered). And here comes a catch: boinc-client does have support to configure this, and it works most of the time, but not all of the time. In the working directory of client instance #1, add something like this to cc_config.xml:
Code:
<cc_config>
  <options>
    <ignore_nvidia_dev>3</ignore_nvidia_dev>
  </options>
</cc_config>
And in the working directory of instance #2:
Code:
<cc_config>
  <options>
    <ignore_nvidia_dev>0</ignore_nvidia_dev>
    <ignore_nvidia_dev>1</ignore_nvidia_dev>
    <ignore_nvidia_dev>2</ignore_nvidia_dev>
  </options>
</cc_config>
More info about this:
https://boinc.berkeley.edu/wiki/Client_configuration
But as I said, this works mostly, not always. Some GPU applications ignore it: One application of a project whose name escapes me right now is hardwired to use GPU0 only, and Moo!Wrapper is hardwired to use all GPUs at once, fed by a single process. I think the only way to run such applications as intended is on correspondingly simple host hardware, or maybe inside virtual machines with dedicated GPU pass-through (if that is possible at all for GPU computing; again, I speak purely theoretically, not from experience).
Some more thoughts about CPU affinity: Besides binding a process to the desired CPU socket, you can also use CPU affinity to avoid negative effects of Hyperthreading when that becomes a problem, without having to switch HT off in the BIOS. E.g. if you run one of those rare CPU projects which scale negatively with HT, such as most PrimeGrid LLR applications, you can first tell boinc-client to run only half as many application threads as you have logical cores, and then use Process Lasso to bind the application threads to only one of the two logical cores which reside on the same physical core, for each CPU. From what I read, I hope this can be automated in Process Lasso.
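To illustrate what "one of the two logical cores per physical core" means as an affinity mask, here is a small sketch. It assumes that HT siblings are numbered in adjacent pairs (0/1, 2/3, ...), which is the common layout on Windows but should be verified for your machine; the function name is my own:

```python
# Sketch: build a CPU affinity mask which selects the first logical core
# of each Hyperthreading pair, assuming siblings are numbered 0/1, 2/3, ...
# (verify this pairing for your CPU before using such a mask).

def one_core_per_ht_pair(logical_cores: int) -> int:
    """Return a bitmask with one bit set per physical core."""
    mask = 0
    for core in range(0, logical_cores, 2):  # every second logical core
        mask |= 1 << core
    return mask

# On a 16-thread (8-core) CPU this selects cores 0, 2, 4, ..., 14:
print(hex(one_core_per_ht_pair(16)))  # prints 0x5555
```

Such a mask is what an affinity tool ultimately applies to the process; on a dual-socket machine you would additionally shift it to the logical cores of the desired socket.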
Similarly, it may benefit GPU throughput if each GPU feeder process has an entire physical core to itself, in other words if Process Lasso is used to prevent other processes from running on the logical core which is paired with the GPU feeder's logical core. Whether or not this helps surely depends on the actual mix of CPU applications and GPU applications.
Oh, and another thought about Process Lasso: Another of its features is changing the scheduling priority of processes. By default, boinc-client tries to run CPU applications at "lowest" priority and GPU applications at "below normal" priority. This can be modified in cc_config.xml with <process_priority> for CPU workers and <process_priority_special> for GPU feeders (see https://boinc.berkeley.edu/wiki/Client_configuration). But again, some applications ignore this, notably the various wrapper-based applications. It seems to me that Process Lasso could also be used to force the desired scheduling priority onto such uncooperative wrappers.
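For illustration, this is how these options look in cc_config.xml. If I read the wiki correctly, the values range from 0 (lowest) to 4 (high); the example below keeps CPU workers at lowest and GPU feeders at below normal, i.e. the defaults, just made explicit:
Code:
<cc_config>
  <options>
    <process_priority>0</process_priority>
    <process_priority_special>1</process_priority_special>
  </options>
</cc_config>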