OK, then, I wouldn't do LLR. It's not an easy problem to solve.
As you may know, I wrote a PrimeGrid GPU sieve application, but it only works on large ranges of K's. I've wanted to write an application that would work on individual K's, like sr2sieve, but for a GPU.
Now, here's how a fixed-K sieve works: for each candidate prime p, you solve a discrete logarithm. sr2sieve uses baby-step giant-step, which is fast, but it needs a large lookup table, and that kind of memory is slow to access on a GPU.
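To make the memory point concrete, here's a rough CPU-side sketch of baby-step giant-step (nothing like sr2sieve's actual code, and all the names are made up). It solves b^n ≡ target (mod p), and the baby-step table is the part that grows like sqrt(p), which is exactly what hurts on a GPU:

/* Baby-step giant-step sketch: find n with b^n ≡ target (mod p).
 * Illustrative only; assumes p is an odd prime below 2^63. */
#include <stdint.h>
#include <stdlib.h>
#include <math.h>

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p) {
    return (uint64_t)((unsigned __int128)a * b % p);   /* 128-bit intermediate */
}

static uint64_t powmod(uint64_t b, uint64_t e, uint64_t p) {
    uint64_t r = 1;
    for (; e; e >>= 1) {
        if (e & 1) r = mulmod(r, b, p);
        b = mulmod(b, b, p);
    }
    return r;
}

typedef struct { uint64_t value, index; } entry_t;

static int cmp_entry(const void *x, const void *y) {
    uint64_t a = ((const entry_t *)x)->value, b = ((const entry_t *)y)->value;
    return (a > b) - (a < b);
}

/* returns n with b^n ≡ target (mod p), or -1 if there is none */
static int64_t bsgs(uint64_t b, uint64_t target, uint64_t p) {
    uint64_t m = (uint64_t)sqrt((double)p) + 1;   /* step size, about sqrt(p) */
    while (m * m < p) m++;                        /* guard against rounding error */

    /* baby steps: table of b^j for j = 0..m-1 -- this is the memory hog;
     * for p near 2^63 it would be tens of gigabytes */
    entry_t *baby = malloc(m * sizeof *baby);
    uint64_t cur = 1;
    for (uint64_t j = 0; j < m; j++) {
        baby[j].value = cur;
        baby[j].index = j;
        cur = mulmod(cur, b, p);
    }
    qsort(baby, m, sizeof *baby, cmp_entry);

    /* giant steps: look for target * b^(-i*m) in the table */
    uint64_t factor = powmod(b, (p - 1) - m % (p - 1), p);  /* b^(-m) via Fermat */
    uint64_t gamma = target % p;
    int64_t n = -1;
    for (uint64_t i = 0; i < m && n < 0; i++) {
        entry_t key = { gamma, 0 };
        entry_t *hit = bsearch(&key, baby, m, sizeof *baby, cmp_entry);
        if (hit) n = (int64_t)(i * m + hit->index);   /* then b^(i*m + j) = target */
        gamma = mulmod(gamma, factor, p);
    }
    free(baby);
    return n;
}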
I was thinking of using Pollard rho instead, since it uses very little memory. The problem with Pollard rho is that its runtime is unpredictable, but I was thinking the work could be divided into chunks of roughly the average runtime, and once a p is solved on one process, that p could be swapped out for a new one.
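And here is the kind of low-memory replacement I have in mind: Pollard's rho for discrete logarithms. Again, this is only a sketch with made-up names, not sieve code. Each walk needs just a handful of registers, and the main loop is the part that could be chopped into fixed-size chunks and rotated across different p's:

/* Pollard's rho for discrete logs: find n with g^n ≡ h (mod p).
 * Illustrative sketch; assumes p is an odd prime below 2^63 and that
 * g generates the multiplicative group mod p. */
#include <stdint.h>

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p) {
    return (uint64_t)((unsigned __int128)a * b % p);
}

static uint64_t powmod(uint64_t b, uint64_t e, uint64_t p) {
    uint64_t r = 1;
    for (; e; e >>= 1) {
        if (e & 1) r = mulmod(r, b, p);
        b = mulmod(b, b, p);
    }
    return r;
}

static uint64_t gcd64(uint64_t a, uint64_t b) {
    while (b) { uint64_t t = a % b; a = b; b = t; }
    return a;
}

/* inverse of a mod m by extended Euclid; assumes gcd(a, m) == 1, m < 2^63 */
static uint64_t invmod(uint64_t a, uint64_t m) {
    int64_t t = 0, nt = 1, r = (int64_t)m, nr = (int64_t)(a % m);
    while (nr) {
        int64_t q = r / nr, tmp;
        tmp = t - q * nt; t = nt; nt = tmp;
        tmp = r - q * nr; r = nr; nr = tmp;
    }
    return (uint64_t)(t < 0 ? t + (int64_t)m : t);
}

/* classic 3-way partitioned walk: x = g^a * h^b (mod p) */
static void step(uint64_t *x, uint64_t *a, uint64_t *b,
                 uint64_t g, uint64_t h, uint64_t p, uint64_t ord) {
    switch (*x % 3) {
    case 0:  *x = mulmod(*x, *x, p); *a = (*a * 2) % ord; *b = (*b * 2) % ord; break;
    case 1:  *x = mulmod(*x, g, p);  *a = (*a + 1) % ord;                      break;
    default: *x = mulmod(*x, h, p);                       *b = (*b + 1) % ord; break;
    }
}

/* returns n with g^n ≡ h (mod p), or -1 on a bad collision (restart the walk) */
static int64_t rho_dlog(uint64_t g, uint64_t h, uint64_t p) {
    uint64_t ord = p - 1;                      /* group order */
    uint64_t x = 1, a = 0, b = 0;              /* tortoise */
    uint64_t X = 1, A = 0, B = 0;              /* hare */

    /* in a GPU version this loop is what gets cut into fixed-size chunks */
    for (uint64_t i = 0; i < ord; i++) {
        step(&x, &a, &b, g, h, p, ord);
        step(&X, &A, &B, g, h, p, ord);
        step(&X, &A, &B, g, h, p, ord);
        if (x != X) continue;

        /* collision: g^a h^b = g^A h^B, so n*(B - b) ≡ a - A (mod ord) */
        uint64_t r = (B + ord - b) % ord;
        uint64_t s = (a + ord - A) % ord;
        if (r == 0) return -1;                 /* degenerate collision */
        uint64_t d = gcd64(r, ord);
        if (s % d) return -1;                  /* no solution from this collision */
        uint64_t od = ord / d;
        uint64_t n0 = mulmod(s / d, invmod(r / d, od), od);
        /* d candidates n0 + k*od; d is usually tiny, so just verify them */
        for (uint64_t k = 0; k < d && k < 64; k++) {
            uint64_t cand = n0 + k * od;
            if (powmod(g, cand, p) == h % p) return (int64_t)cand;
        }
        return -1;
    }
    return -1;
}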
The other problem with sieves is that you need to do modular multiplication while almost never computing an actual modulus (a division), if you want your program to run fast. I used Montgomery multiplication in my sieve, which I mostly understood. Another guy used Barrett reduction in his factoring program, which is not a sieve; I never quite understood how to use it, but it seemed to require 128-bit numbers.
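For what it's worth, this is roughly what the Montgomery arithmetic looks like on the CPU side in a 64-bit version (again a sketch with made-up names, not my actual sieve code; it assumes an odd p below 2^63). The point is that once values are converted into Montgomery form, every multiplication costs a few multiplies plus one conditional subtraction, and the only real division by p happens in the conversion step. The 128-bit intermediates show up here too, as the a*b product:

/* Montgomery multiplication sketch for odd p < 2^63, with R = 2^64.
 * Illustrative only; a real sieve keeps values in Montgomery form
 * across whole chains of multiplications. */
#include <stdint.h>

/* pinv = -p^(-1) mod 2^64, by Newton's iteration (p must be odd) */
static uint64_t mont_pinv(uint64_t p) {
    uint64_t inv = p;                  /* correct to 3 bits, since p*p ≡ 1 (mod 8) */
    for (int i = 0; i < 5; i++)        /* each pass doubles the number of correct bits */
        inv *= 2 - p * inv;
    return 0 - inv;                    /* negate to get -p^(-1) */
}

/* REDC: return a*b*R^(-1) mod p -- no division by p anywhere */
static uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t pinv) {
    unsigned __int128 t = (unsigned __int128)a * b;        /* 128-bit product */
    uint64_t m = (uint64_t)t * pinv;                       /* m = t * (-p^-1) mod 2^64 */
    uint64_t r = (uint64_t)((t + (unsigned __int128)m * p) >> 64);
    return r >= p ? r - p : r;                             /* one conditional subtract */
}

/* enter Montgomery form: a -> a*R mod p (the one place a real modulus is computed) */
static uint64_t mont_enter(uint64_t a, uint64_t p) {
    return (uint64_t)(((unsigned __int128)a << 64) % p);
}

/* leave Montgomery form: multiply by 1 to strip the extra factor of R */
static uint64_t mont_exit(uint64_t a, uint64_t p, uint64_t pinv) {
    return mont_mul(a, 1, p, pinv);
}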
Overall, anything in this area seems like it should be a multi-year graduate project, not an undergraduate project. I believe llrCUDA was written by a math professor at a Tokyo university.