(Speaking theoretically, not out of own experience...)
I think the underlying problem which you want to solve is that a GPU feeder process should allocate its memory on the NUMA node of the CPU to which the GPU is attached. That way, DMA to/from the GPU stays local to this CPU. Otherwise, such DMA would involve both CPUs and the QPI link between them.
(By the way, some server mainboard makers offer special "single PCIe root" boards for multi-GPU computing applications in which the GPUs need to communicate with each other. I suppose these are simply boards with PCIe switches on them. However, GPU--GPU communication is not required in any Distributed Computing project, as far as I know.)
So you'd like to configure processor affinity of GPU feeder processes. But I am not aware of direct support for processor affinity in boinc-client (and boincmgr, boinccmd, boinctasks etc.), which means you need an external tool.
For controlling processor affinity on Windows, I have several times seen people recommend Process Lasso. I haven't tried it myself yet, and haven't researched its precise capabilities. Notably, I wonder whether Process Lasso can detect without manual intervention which GPU feeder processes should run on CPU0 and which on CPU1. Maybe this is possible if you run separate boinc-client instances for this: Process Lasso would then be instructed that all new processes launched from one client are bound to CPU0, and all new processes launched from the other client go to CPU1.
If multiple client instances are indeed part of the solution for processor affinity of GPU feeder processes, then I would go a step further and run those two client instances for GPU projects plus a third client instance for CPU projects. This would have at least one benefit: You would no longer have to write an app_config.xml for each and every GPU application just to set
<cpu_usage>0.001</cpu_usage>. Instead, simply restrict boinc-client instance #3 to as few CPUs as desired (e.g. all minus four). Instances #1 and #2 need no restriction of how many CPUs they can use, as long as you let these instances run only GPU projects.
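For reference, such an app_config.xml would look roughly like this (the application name here is only a placeholder; the real one has to be taken from the project, e.g. from client_state.xml):
Code:
<app_config>
  <app>
    <name>example_gpu_app</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>0.001</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
With the three-instance setup, none of these files are needed anymore for this purpose.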
However, you still need to tell instance #1 to use only GPU0...GPU2, and instance #2 to use only GPU3 (or however these are numbered). And here comes a catch: boinc-client does have support to configure this, and it works most of the time, but not all of the time. In the working directory of client instance #1, add something like this to cc_config.xml:
Code:
<cc_config>
  <options>
    <ignore_nvidia_dev>3</ignore_nvidia_dev>
  </options>
</cc_config>
And in the working directory of instance #2:
Code:
<cc_config>
  <options>
    <ignore_nvidia_dev>0</ignore_nvidia_dev>
    <ignore_nvidia_dev>1</ignore_nvidia_dev>
    <ignore_nvidia_dev>2</ignore_nvidia_dev>
  </options>
</cc_config>
More info about this:
https://boinc.berkeley.edu/wiki/Client_configuration
But as I said, this works mostly, not always. Some GPU applications ignore it: One application of a project whose name escapes me right now is hardwired to use GPU0 only, and Moo!Wrapper is hardwired to use all GPUs at once, fed by a single process. I think the only way to run such applications as intended is on correspondingly simple host hardware, or maybe inside virtual machines with dedicated GPU pass-through (if that is possible at all for GPU computing; again, I speak purely theoretically, not from experience).
Some more thoughts about CPU affinity: Besides binding a process to the desired CPU socket, you can also use CPU affinity to avoid negative effects of Hyperthreading when that becomes a problem, without having to switch HT off in the BIOS. E.g. if you run one of those rare CPU projects which scale negatively with HT, such as most PrimeGrid LLR applications, you can first tell boinc-client to run only half as many application threads as you have logical cores, and then use Process Lasso to bind the application threads to only one of the two logical cores which reside on the same physical core, for each CPU. From what I read, I hope this can be automated in Process Lasso.
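To illustrate what "one of the two logical cores per physical core" means as an affinity mask, here is a small sketch. It assumes that HT siblings are numbered in adjacent pairs (0/1, 2/3, ...), which is the common layout on Windows but should be verified for your machine; the function name is my own:

```python
# Sketch: build a CPU affinity mask which selects the first logical core
# of each Hyperthreading pair, assuming siblings are numbered 0/1, 2/3, ...
# (verify this pairing for your CPU before using such a mask).

def one_core_per_ht_pair(logical_cores: int) -> int:
    """Return a bitmask with one bit set per physical core."""
    mask = 0
    for core in range(0, logical_cores, 2):  # every second logical core
        mask |= 1 << core
    return mask

# On a 16-thread (8-core) CPU this selects cores 0, 2, 4, ..., 14:
print(hex(one_core_per_ht_pair(16)))  # prints 0x5555
```

Such a mask is what an affinity tool ultimately applies to the process; on a dual-socket machine you would additionally shift it to the logical cores of the desired socket.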
Similarly, it may benefit GPU throughput if each GPU feeder process has an entire physical core to itself, in other words if Process Lasso is used to prevent other processes from running on the logical core which is paired with the GPU feeder's logical core. Whether or not this helps surely depends on the actual mix of CPU applications and GPU applications.
Oh, and another thought about Process Lasso: Another of its features is changing the scheduling priority of processes. By default, boinc-client tries to run CPU applications at "lowest" priority and GPU applications at "below normal" priority. This can be modified in cc_config.xml with <process_priority> for CPU workers and <process_priority_special> for GPU feeders (see https://boinc.berkeley.edu/wiki/Client_configuration). But again, some applications ignore this, notably the various wrapper-based applications. It seems to me that Process Lasso could also be used to force the desired scheduling priority onto such uncooperative wrappers.
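For illustration, this is how these options look in cc_config.xml. If I read the wiki correctly, the values range from 0 (lowest) to 4 (high); the example below keeps CPU workers at lowest and GPU feeders at below normal, i.e. the defaults, just made explicit:
Code:
<cc_config>
  <options>
    <process_priority>0</process_priority>
    <process_priority_special>1</process_priority_special>
  </options>
</cc_config>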