
File writing and multiple threads.

I have an interesting problem I just encountered and could use a bit of advice. I don't normally work with writing files (zip files, in this case) and am wondering about the best way to approach this situation.

I have an API supplied by a company that lets me retrieve folders from their system (document management software with no way for my company to back up our files once they're entered or, as in our case, move all our documents out to a new system). I'm working in C#, with which I have limited experience, but it's going along pretty well for a single instance.

However, they want this executable to be runnable from multiple locations, saving to a central file repository (a network drive), without overwriting each other's files. These multiple instances of the program aren't aware of each other. The problem is that I'm unsure how to handle the multiple files. I know the order in which the files will be created (there are 70,000+ of them). Doing something like
Code:
Directory.GetFiles(networkLocation).Contains(<file I am trying to save>)
won't work, as fetching the data to build the zip from can take upwards of 20 seconds to return from their server before it even begins creating the zip file itself.

Is there a way, outside of possibly listing the files in a text file or something, to do this without communication between instances?

This doesn't have to be an entirely elegant solution, as it's an internal program used by really just me at the moment, but I'd like it to be somewhat scalable beyond two instances.

Thanks.
 
1. Create a 0-byte placeholder file in your network location to "reserve" the file name. You would also check for duplicate file names here and ensure that your file name is unique... there are a thousand ways to do this, so you'll have to pick one that works with your requirements.

2. Download the file from the remote server to a LOCAL temporary file on the client computer.

3. Move the file from #2 to its final location on the network share under the file name selected in step 1.
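A minimal C# sketch of the three steps above (all paths and names here are illustrative, not from the actual program). The key trick is that `FileMode.CreateNew` fails if the file already exists, so when two instances race for the same name, exactly one wins the "reserve" step and the other simply tries the next name:

```csharp
using System;
using System.IO;

static class Reserver
{
    // Step 1: atomically "reserve" a name on the network share.
    // FileMode.CreateNew throws IOException if the file already exists,
    // so only one instance can ever claim a given name.
    public static string ReserveName(string networkDir, string baseName)
    {
        for (int attempt = 0; ; attempt++)
        {
            string candidate = attempt == 0 ? $"{baseName}.zip"
                                            : $"{baseName}_{attempt}.zip";
            string fullPath = Path.Combine(networkDir, candidate);
            try
            {
                // Create the 0-byte placeholder and close it immediately.
                using (new FileStream(fullPath, FileMode.CreateNew)) { }
                return fullPath;
            }
            catch (IOException)
            {
                // Another instance created it first; try the next suffix.
            }
        }
    }
}

// Usage (hypothetical share and folder name):
//   string target = Reserver.ReserveName(@"\\server\share\export", "Folder0001");
//   ... download into a local temp file (step 2) ...
//   File.Copy(tempPath, target, overwrite: true);  // step 3: replace the placeholder
```

Overwriting the placeholder with `File.Copy(..., overwrite: true)` at the end is safe because the name already belongs to this instance.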
 
Well, first, why not something like git or svn? These systems pretty elegantly handle the file rewrite/overwrite problem.

Another option is to have a central lock manager server which handles dealing out the locks. I know you don't want this, but that really would nicely solve the problem while offering pretty high scalability.

An alternative to the lock server is a database which keeps track of file locks (though really this is pretty much serving the same purpose of a lock manager).

uclabachelor's method would probably work, but I just feel like it could be pretty easy to get wrong (for example, I would be really worried about race conditions).
 
I don't think git or svn would work in this instance. I am forced to use their API (which is bad and incredibly obtuse, but there's nothing I can do about that). Using an additional service such as git or svn to write a file from their system (if it even supports that) would still have the same problem: the time it takes to pull the data from their server before I can write the file anywhere.

Realistically, this isn't going to be a problem in a few months, as we are moving to a different document management solution that offers a lot more in terms of actually being usable. Our current product, the one I am pulling from, doesn't even allow us to run a query of our files. Their DB calls literally time out due to the number of documents we have, and there is nothing we ourselves can do about it. They are more than willing, however, to charge us per report if we have them run it internally and send us the file. They manage the database of the files, and we simply get access via a web interface or their SOAP API. They don't even have any method for returning only, say, the first 1000 rows when the result set is larger than that, and then querying for the next thousand when we need to view it.

The issue is that, since this is happening over the network, getting all the files takes a ton of time. We want a way to run multiple instances, on different boxes, to shorten that time.

I am open to any way to realistically do this that isn't going to take 20 hours, though.
 
Hmm. Well that stinks 🙂.

If you are moving over to a new system soon then either using uclabachelor's solution or just waiting for the new system is probably the right solution.
 
uclabachelor's method would probably work, but I just feel like it could be pretty easy to get wrong (for example, I would be really worried about race conditions).

Race conditions wouldn't be an issue if the code first checks for an existing file and then attempts to create the new file on the server inside a try/catch loop. This assumes that network folder/file permissions are set up properly, so that only the owner of a file has write permission while other instances get an error when trying to create the same file.

If two instances do race and BOTH try to create the same file name, one will succeed while the other will fail inside the try/catch loop and will have to try again with a new file name.

However, doing the above could potentially put the program in a long loop while it hunts for a unique file name, but that issue can be minimized or eliminated by structuring the folders/files and/or ensuring that the procedure used to generate the file names results in minimal collisions.

Another way of approaching this: if the instance can GUARANTEE a unique file name (i.e., by appending its instance GUID to the file name), there wouldn't be a need to check for dupes or race conditions at all.
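A quick C# sketch of the GUID idea (the class and method names are made up for illustration). Each program run generates one GUID, so two instances can never produce the same name, and no existence check or retry loop is needed:

```csharp
using System;
using System.IO;

static class Namer
{
    // One GUID per program run -- the "instance GUID" mentioned above.
    private static readonly string InstanceId = Guid.NewGuid().ToString("N");

    // Within one instance the base names are already unique (one per
    // document), so baseName + InstanceId is globally unique.
    public static string UniqueName(string networkDir, string baseName) =>
        Path.Combine(networkDir, $"{baseName}_{InstanceId}.zip");
}
```

The trade-off is purely cosmetic: every file name carries a 32-character suffix, which a cleanup script can strip off later.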
 
Another way of approaching this: if the instance can GUARANTEE a unique file name (i.e., by appending its instance GUID to the file name), there wouldn't be a need to check for dupes or race conditions at all.

Nice. A similar idea is for each PC to have an INI text file that gives it a number from 1 to (say) 99 and then use that ID as part of the file name, but the GUID can be created automatically on each PC. If it makes the file names too ugly, you could always write a script to strip off the GUID prefix or suffix before uploading to the new system.

I have an API supplied by a company that lets me retrieve folders from their system (document management software with no way for my company to back up our files once they're entered or, as in our case, move all our documents out to a new system)

I suppose another way to avoid collisions would be to have each instance ignore all files except its own subset -- one PC grabs document names starting with a-c, one grabs d-g, one h-l, .... one grabs q-z,0-9, other symbols.
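The partitioning idea above could be sketched like this in C# (the helper name and the way the slice is passed in are assumptions; the real program would wire this into however it enumerates the documents):

```csharp
using System;

static class Slicer
{
    // Each instance is started with a range of first letters (e.g. 'a'..'c')
    // and simply skips any document outside its slice, so instances never
    // contend for the same files.
    public static bool InMySlice(string docName, char from, char to)
    {
        if (string.IsNullOrEmpty(docName)) return false;
        char first = char.ToLowerInvariant(docName[0]);
        // Digits and other symbols go to whichever instance owns 'z'.
        if (first < 'a' || first > 'z') return to == 'z';
        return first >= from && first <= to;
    }
}

// Usage: run instance 1 with ('a','c'), instance 2 with ('d','g'), etc.
//   if (Slicer.InMySlice(doc.Name, 'a', 'c')) Download(doc);
```

This avoids the collision problem entirely, at the cost of the slices being unevenly sized if the document names cluster around certain letters.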
 