What is the Java "Enterprise-y" way of queuing long-running processes?

Bulldog13

Golden Member
Jul 18, 2002
1,655
1
81
I am doing a bit of Java development (long-time C# programmer) and I am unsure how to ask this question.

I have created a simple Spring Boot web app, but I do not know the correct method or library to scale out the processor-intensive pieces of it.

Below is the gist of how it currently works:

  1. A User or system posts a file to the web application
  2. The web application runs some code against the posted file and a result .txt file is created
  3. The web application reads the .txt file and inserts the results into a database
  4. The web application responds with where the results are located (http://localhost:8080/someRestStuff/15)

The problem is that, depending on the size or complexity of the posted file, step 2 can take anywhere from 10 seconds to about an hour to process. It also maxes out the CPU while doing the calculations. This really becomes a problem if more than one file is posted before the previous one has finished processing. So what do I need to do in order to create a global queue for step 2? RabbitMQ? JMS?

The way I want it to work is

  1. A User or system posts a file to the web application
  2. The web application responds with "Hey, when your file is done being processed, it can be found here AT SOME POINT IN THE FUTURE.(http://localhost:8080/someRestStuff/15)" - http://farazdagi.com/blog/2014/rest-long-running-jobs/
  3. The web application adds the file to the processing queue (different thread, so the web app is always responding to requests)
  4. The web application checks the queue, sees there is a file to process. It creates the .txt result file and updates the database with the results. It checks the queue again when it is done and repeats...
  5. The User or system waits a little bit and hits the URL it was initially provided with, and either gets the results or sees that there is no result yet, so it has to wait longer.

I do not want to roll my own basic queue because this application will have to scale out eventually, and I do not want to rewrite it.
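
For illustration only, here is a minimal single-JVM sketch of the five-step flow above. It assumes a single-threaded ExecutorService standing in for the job queue and an in-memory map standing in for the database; JobService, submit and lookup are hypothetical names, not part of Spring or any library. It would not scale out on its own, but it shows the shape a broker-backed version would keep: submit returns an id at once, and the caller polls for the result later.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

// Hypothetical in-memory job service (all names here are illustrative).
class JobService {
    enum Status { PENDING, DONE, FAILED }

    // Immutable snapshot of a job's state; result is null while PENDING.
    static final class Job {
        final long id;
        final Status status;
        final String result;
        Job(long id, Status status, String result) {
            this.id = id;
            this.status = status;
            this.result = result;
        }
    }

    private final AtomicLong nextId = new AtomicLong(1);
    private final Map<Long, Job> jobs = new ConcurrentHashMap<>();
    // One worker thread: CPU-heavy jobs run one at a time, in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final Function<String, String> processor;

    JobService(Function<String, String> processor) {
        this.processor = processor;
    }

    // Steps 1-3: record the job, queue it, and return its id immediately.
    // A web layer would typically answer with 202 Accepted and a URL built from the id.
    long submit(String fileContents) {
        long id = nextId.getAndIncrement();
        jobs.put(id, new Job(id, Status.PENDING, null));
        worker.submit(() -> {
            try {
                // Step 4: the long-running work happens off the request thread.
                jobs.put(id, new Job(id, Status.DONE, processor.apply(fileContents)));
            } catch (RuntimeException e) {
                jobs.put(id, new Job(id, Status.FAILED, e.getMessage()));
            }
        });
        return id;
    }

    // Step 5: the caller polls; an empty Optional means the id is unknown.
    Optional<Job> lookup(long id) {
        return Optional.ofNullable(jobs.get(id));
    }

    // Stops accepting jobs and waits (up to a minute) for queued work to finish.
    void shutdown() {
        worker.shutdown();
        try {
            worker.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A caller would do something like `long id = service.submit(contents);` in the POST handler and `service.lookup(id)` in the GET handler, treating a PENDING status as "come back later".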

RabbitMQ and JMS seem to almost be what I want, but I really do not want to message between applications, just have a global job queue.

Any advice?
 

beginner99

Diamond Member
Jun 2, 2009
5,210
1,580
136
I would say go with a plain producer-consumer approach and use a Queue between them. The users are the producers and you have one consumer (e.g. step 2). Which implementation to use depends on the exact use case, but ConcurrentLinkedQueue is probably the way to go.

The trickier part is how to inform the user that the job has finished. Is this a company-internal app? Then the easiest option would probably be an email; otherwise you would need some form of polling. Also, what do you do if the queue grows too much, e.g. you create more jobs than can be processed?
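
A rough sketch of that producer-consumer setup follows. Note the swap: it uses a bounded ArrayBlockingQueue rather than ConcurrentLinkedQueue, on the assumption that you want the consumer to block while idle instead of spinning, and want a definite answer when the queue is full. BoundedJobQueue, the "__stop__" marker and the file names are made up for the example.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical bounded job queue with a single consumer thread.
class BoundedJobQueue {
    // Marker that tells the consumer to exit (illustrative; must not be a real job name).
    private static final String STOP = "__stop__";

    private final BlockingQueue<String> queue;
    private final List<String> processed = new CopyOnWriteArrayList<>();
    private final Thread consumer = new Thread(this::consume, "job-consumer");

    BoundedJobQueue(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    void start() {
        consumer.start();
    }

    // Producer side: returns false at once if the queue is full,
    // so the web app can tell the caller to retry later.
    boolean enqueue(String fileName) {
        return queue.offer(fileName);
    }

    // Consumer side: take() blocks while the queue is empty, so no busy-waiting.
    private void consume() {
        try {
            while (true) {
                String job = queue.take();
                if (STOP.equals(job)) {
                    return;
                }
                processed.add(job); // stand-in for the real CPU-heavy step
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Queues the stop marker behind any pending jobs and waits for the consumer to finish.
    void stopAndWait() {
        try {
            queue.put(STOP);
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    List<String> processed() {
        return processed;
    }
}
```

Rejecting with offer() is only one answer to the overflow question; blocking the producer with put() or spilling jobs to disk are alternatives, depending on what callers can tolerate.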
 

Merad

Platinum Member
May 31, 2010
2,586
19
81
In .NET I would use a tool like Hangfire to queue a job in the background (probably on a different server) for the processing. When complete send an email or whatever you want for notification. I would also try to break the task up into smaller steps, so that if something fails, server dies, etc, you can resume the partially completed work instead of having to repeat an hour's worth of calculations.

In short, you want something similar to Hangfire, but for Java. Google says Quartz might work.
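
To make the "smaller steps" idea concrete, here is a minimal sketch, assuming each job can be expressed as an ordered list of steps and that a checkpoint (the index of the next step to run) is stored somewhere durable. The in-memory map below only stands in for that durable store, and ResumableJob is a hypothetical name; this is not how Hangfire or Quartz do it internally.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical runner that resumes a job from its last completed step.
class ResumableJob {
    // Stand-in for a durable checkpoint store (e.g. a database row per job).
    private final Map<String, Integer> checkpoints = new ConcurrentHashMap<>();

    // Runs the remaining steps for jobId; returns true once every step has completed.
    boolean run(String jobId, List<Runnable> steps) {
        int next = checkpoints.getOrDefault(jobId, 0);
        for (int i = next; i < steps.size(); i++) {
            try {
                steps.get(i).run();
            } catch (RuntimeException e) {
                // The checkpoint still points at the failed step, so a retry starts there.
                return false;
            }
            checkpoints.put(jobId, i + 1);
        }
        return true;
    }
}
```

Calling run again with the same jobId after a failure skips the steps that already finished, which is the point: an hour of work split into six steps loses at most one step's worth of time.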
 

Cogman

Lifer
Sep 19, 2000
10,277
125
106
Ah, this is my bread and butter actually.

Your solution is a sound one; my only suggestion is that you separate your processing application from your web app. Do this now, because it may be harder to do later. Also, the JVM's GC settings for a web app and a processing app are very different: for a web app you want G1GC (lower response time), and for a processing app you want the parallel collector (higher throughput). Our current setup for something similar is to have our processing app spin up a processing thread pool and then join on a Jetty web app (because it is convenient to be able to pull status information down easily from the individual processors).

I've dealt with Rabbit, but I also have coworkers that have worked with and really enjoy Apache Kafka.

Two other solutions that might be easier to get up and running are Hazelcast and Akka. However, I can't really speak to how good they are. Hazelcast is dead simple to set up and get a queue going between multiple boxes. However, if you end up wanting a more complex config, you'll have to look into getting the enterprise version of it. Akka has a ton of support available and is relatively complete, but I haven't personally used it.

The solution we currently use is, well, not the greatest and I would not recommend it (in fact, I've wanted to get rid of it for a while now). We basically hand-rolled our own queuing system. Unfortunately, if you want to do things like deduplication of work and complex prioritization, or you need to reprioritize something that is already queued, there isn't really a great way to do that AFAIK without having some secondary tracking system. We have all those problems with our current queuing system. However, simple prioritization can be done just by having multiple queues: the fast and the slow queues (or however many levels you want to have). The deduplication and reprioritization problems, however, are much trickier to solve.
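
As a rough illustration of the fast/slow queue idea, here is a single-process sketch. TwoLevelJobQueue and the job keys are made up, and the HashSet is only a toy version of the "secondary tracking system" mentioned above: it rejects a job that is already waiting, but does nothing for reprioritization or for duplicates across machines.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Optional;
import java.util.Queue;
import java.util.Set;

// Hypothetical two-level queue: fast jobs are always served before slow ones.
class TwoLevelJobQueue {
    private final Queue<String> fast = new ArrayDeque<>();
    private final Queue<String> slow = new ArrayDeque<>();
    // Keys of jobs currently waiting in either queue, used for simple deduplication.
    private final Set<String> queued = new HashSet<>();

    // Returns false if a job with the same key is already waiting.
    synchronized boolean submit(String jobKey, boolean highPriority) {
        if (!queued.add(jobKey)) {
            return false;
        }
        (highPriority ? fast : slow).add(jobKey);
        return true;
    }

    // Hands out the next job: drain the fast queue first, then fall back to the slow one.
    synchronized Optional<String> next() {
        String job = fast.poll();
        if (job == null) {
            job = slow.poll();
        }
        if (job != null) {
            queued.remove(job);
        }
        return Optional.ofNullable(job);
    }
}
```

One caveat of this strict ordering: slow jobs can starve if the fast queue never empties, so a real setup might reserve some workers for the slow queue.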

Let me know if you have other questions.
 

Bulldog13

I went lazy with it and am trying to use a ScheduledExecutorService... this will change in the future. Thanks Cogman!
 

Bulldog13

Decided to go with Kafka... it is pretty straightforward and dead simple to set up.

How do I set different garbage collectors, G1GC vs. parallel? Is that done in code, in some kind of settings file, or with a command-line argument?

Is it better to run the processing-intensive Java class as a .jar from the command line, or to import it into my code and call its main class?
http://www-lium.univ-lemans.fr/diarization/doku.php/quick_start I have had success running it from the command line and from code... just curious whether there is a difference?
 

Cogman

Garbage collectors are set up on the command line. You can read about the options in the Oracle docs (I'd give this one a read through: http://www.oracle.com/webfolder/technetwork/tutorials/obe/java/gc01/index.html )

You'll want to avoid (especially with G1GC) doing any settings tweaks beyond just setting the min/max heap size. With G1GC you might additionally set the target pause time (default is something like 200ms). But the other settings are not to be toyed with lightly. Before touching any of them you should have collected GC logs and studied them.
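
To check that the flags took effect, a small sketch like the one below can print which collectors the running JVM is actually using. GcReport is a made-up class name and the heap sizes in the comments are placeholder values; the flags themselves (-XX:+UseG1GC, -XX:+UseParallelGC, -Xms, -Xmx, -XX:MaxGCPauseMillis) are standard HotSpot options.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that reports the active collectors and max heap.
//
// Example launches (heap sizes are placeholders, not recommendations):
//   web app:        java -XX:+UseG1GC -Xms512m -Xmx2g -XX:MaxGCPauseMillis=200 GcReport
//   processing app: java -XX:+UseParallelGC -Xms512m -Xmx2g GcReport
class GcReport {
    // Names of the collectors the JVM registered at startup.
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    // Maximum heap the JVM will attempt to use, in MiB (reflects -Xmx when set).
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Collectors: " + collectorNames());
        System.out.println("Max heap (MiB): " + maxHeapMiB());
    }
}
```

On HotSpot the names typically start with "G1" under G1GC and with "PS" under the parallel collector, though the exact strings can vary between JDK versions.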

I'm not sure what you mean in regard to how to run it, but with Java, it is better to have a long-lived application than a short-lived one. Ideally, your long-running processing code will just sit listening to the queue and will start processing as soon as it is notified. That isn't always doable, but that is the ideal. The JVM makes code faster the longer it runs. Additionally, the JVM has a somewhat long startup time, which is less than ideal.

Good luck with Kafka and let me know how it turns out!