Limits of shared memory random access multi-threading?

tromp · Feb 21, 2014

How many threads can usefully access global shared memory at completely random addresses before memory is saturated? Is this a simple function of the memory speed and timings?

The reason I ask is that I have this proof-of-work system called Cuckoo Cycle at https://github.com/tromp/cuckoo, in which each thread performs a cheap siphash computation to access (on average) 3.3 random words in memory. Even when running with 32 threads, I still don't see memory being saturated.

Can anyone predict, based on memory timings, the maximum number of threads this application can effectively use?

Or does anyone here have access to systems that support more threads? If so, could you run a test like

    for i in {32,40,48,56,64}; do
      echo $i
      cc -o cuckoo -DNTHREADS=$i -DSIZEMULT=1 -DSIZESHIFT=30 cuckoo.c -O3 -std=c99 -m64 -Wall -Wno-deprecated-declarations -pthread -l crypto
      (time for j in {0..9}; do ./cuckoo $j; done) 2>&1
    done > output

to see where the timings ("real" time) flatten out?
Thanks!

Soulkeeper · Feb 21, 2014

The limit will likely come when you run out of memory: each pthread_mutex_t/pthread_t takes a certain number of bytes.
Also, the OS might have a fixed limit on the number of threads, beyond which it refuses to create more.
Then there's the scheduler overhead, which will start killing performance as you increase the thread count a lot.

You'd have to do some research on the limitations of your platform to know for sure.

This is more of a programming/theoretical question than a hardware/timings question.

tromp · Feb 22, 2014

Soulkeeper said:
the limit will likely come when you run out of memory
each pthread_mutex_t/pthread_t takes a certain amount of bytes

You misunderstood the question. Let me try an answer myself, and maybe people can point out mistakes in my reasoning.
I timed how long a single thread needs to generate a random address: about 60ns. I think that for a memory bank to service a random read, it needs the CAS latency (CL), maybe other delays, and then needs to transfer 8 bits. Let's say that takes 15ns. Then a single memory bank is saturated by 60/15 = 4 threads. With the above compilation options, my program reads 32-bit (aligned) words from 4GB of shared memory, which amounts to 64 banks (I think a bank is 512 Mbit, or 64 MB). So that would put the total limit at 4 × 64 = 256 threads.
