creating a script to delete TONS of files

Homerboy

Lifer
Mar 1, 2000
30,856
4,974
126
We have millions and millions of scanned images on our system that are saved in:

E:\IMAGES\date\whatever\something\filename###.tif

We need to purge some of these files.

I can identify the exact path of the files that need to be purged (they have entries in a SQL database I can query against with some filters), but I'm going to end up with 100s of thousands, if not a million+ files to delete.

Does anyone have a "best practice" on how to create a script to delete these?

I could just create a .bat file that says:

Del E:\IMAGES\date\whatever\something\filename001.tif
Del E:\IMAGES\date\whatever\something\filename002.tif
...
Del E:\IMAGES\date\whatever\something\filename5438975439857.tif

but that just seems ridiculous.

Thoughts? Suggestions?
 

Nothinman

Elite Member
Sep 14, 2001
30,672
0
0
I don't see why that's ridiculous if it's something you can automate. If you have a task that queries the database and builds the batch file for you why does it matter if it's a straight list of filenames or a fancy glob or regular expression?

If they're all within one directory it would probably be quicker that way too as any globbing against a directory with millions of files will be very slow.
 

Homerboy

I guess it seemed "ridiculous" to have a .bat file with potentially a million lines in it.

This would be a one-time-shot.
 

Nothinman

debian0001 said:
Use PowerShell.

And that will speed up the process or make it in any way better how?

Homerboy said:
I guess it seemed "ridiculous" to have a .bat file with potentially a million lines in it.

This would be a one-time-shot.

Then I would say it's even less ridiculous since you don't plan on doing it regularly. Without doing any trials, I would think the individual deletion of the files by full path name would move much quicker than trying to use a wild card and causing the system to have to read in all of the filenames within the directory before even starting the delete.

One thing I'm not sure about is how cmd will handle such a large batch file; I've never seen one anywhere near that size.
 

MerlinRML

Senior member
Sep 9, 2005
207
0
71
If your only goal is to delete the files, then pulling the data from the database one-by-one and creating individual delete commands will probably work.

If you also need to understand the difference between what you're deleting vs. what you're not deleting, what exists on the filesystem vs. what is in the database, etc., then you will need to do some preprocessing to understand that.

I would recommend that you do not create one large script with a million entries. If you run into an error, or if it takes too long and you have to stop and restart, or for some other reason, you have to start from the beginning and get past all the original deletes (now erroring because the files are already deleted) again. It might make sense to create a series of scripts with 100 or 200 deletes each, or whatever number makes sense. You will then end up with a number of different script files that each delete a batch of so many files.

This way, if your SQL query takes a while to generate the scripts, you can parallelize script generation and deletion by running some of the scripts while others are still being generated. Plus, if each script writes its own unique log, you don't have to go through a million lines to figure out what happened. If you have a fairly capable storage system, you could even run some of the deletes in parallel, although I don't recommend doing that if it's just a single spinning disk.
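The preprocessing step described in this post (comparing what the database lists against what actually exists on disk) could be sketched roughly as follows in Python. This is only an illustration: the function name `reconcile` and the three result keys are made up for the example, and it assumes the database paths have already been exported into a plain list.

```python
import os

def reconcile(db_paths, root):
    """Compare paths listed in the database with files actually under root."""
    # Normalise case and separators so Windows paths compare consistently.
    wanted = {os.path.normcase(os.path.normpath(p)) for p in db_paths}
    on_disk = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            on_disk.add(os.path.normcase(os.path.normpath(full)))
    return {
        "delete": sorted(wanted & on_disk),   # listed and present
        "missing": sorted(wanted - on_disk),  # listed but already gone
        "keep": sorted(on_disk - wanted),     # present but not listed
    }
```

Walking a tree with millions of files takes a while, so this is something to run once up front rather than before every batch.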
 

Nothinman

Homerboy said:
Interesting.
Though I'm missing how I'd use that with a huge list of files and their paths.

And that scans the directory to match the names so it would most likely be many orders of magnitude slower than a straight 'del blah' over and over within a batch file.

I agree with what MerlinRML said about breaking the list up into batch files of around 1,000 files each, so that it's easier to debug and you can run several of them in parallel, but I think any more than that would be over-engineering for something you're going to do once.
 

Homerboy

Like I said, "forfiles" is an interesting command, but I don't see how it'd be usable let alone truly functional in this situation.

I do like the idea of breaking up the .bat files as well. Though I'm guessing I will have a million+ or so to delete so probably something like 10,000 per .bat file and have the last line of the .bat file call the next .bat file so if it doesn't error, it just keeps rolling.
 

Charles Kozierok

Elite Member
May 14, 2012
6,762
1
0
Directory Toolkit will probably do something like this too, but might choke on that number.

I could help generate some batch files for you if you need.
 

Homerboy

Thanks for the offer.
I think I can get SQL to churn out the command lines 1 by 1 exactly as I need. (literally have the resulting value in each row be "del E:\IMAGES\whatever\something\filename###.tif")

I'd just have to then break that 1,000,000 or so results into manageable .bat files of 10,000 or so rows. Not sure of an automatic way to do that.
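One possible way to automate the split is a short Python script that takes the exported command lines and writes them into numbered .bat files, with each file calling the next one as suggested earlier in the thread. This is a sketch under assumptions: the function name `write_batches`, the `purge_NNNN.bat` naming, and the default chunk size are all invented for the example, and the commands are assumed to be already loaded into a list of strings.

```python
from pathlib import Path

def write_batches(commands, out_dir, chunk_size=10_000, prefix="purge"):
    """Write commands into numbered .bat files of at most chunk_size lines."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = [commands[i:i + chunk_size]
              for i in range(0, len(commands), chunk_size)]
    written = []
    for n, chunk in enumerate(chunks, start=1):
        lines = ["@echo off"] + list(chunk)
        if n < len(chunks):
            # Chain to the next file; %~dp0 expands to this script's folder.
            lines.append(f'call "%~dp0{prefix}_{n + 1:04d}.bat"')
        path = out_dir / f"{prefix}_{n:04d}.bat"
        # newline="" keeps the explicit CRLF line endings untouched.
        with open(path, "w", newline="") as fh:
            fh.write("\r\n".join(lines) + "\r\n")
        written.append(path)
    return written
```

Starting the first file then runs the whole chain. Leaving out the chaining line would instead give independent files that can be run separately or in parallel.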
 

Homerboy

Hmm interesting.
I just realized that I may have to use a wildcard of some nature.

Say files are stored in:

E:\IMAGES\IMAGES\CM\200606\

There are then multiple file names of the form AAATW###.tif

My SQL query doesn't show what the ### is. But I WOULD want to delete EVERYTHING that was E:\IMAGES\IMAGES\CM\200606\AAATW###.tif
So I'd have to do:

del E:\IMAGES\IMAGES\CM\200606\AAATW*.*

So I would have to use a wildcard.
Not too horrific though, as the number of files in a sub-folder that deep isn't outrageous. Maybe several thousand max.
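For comparison, the same prefix-based wildcard delete could be sketched in Python, with a dry-run mode to check what would be matched before anything is removed. The function name `delete_by_prefix` and the `dry_run` flag are made up for this example. It also matches only `.tif` files, which is a narrower pattern than the `*.*` in the del command above.

```python
from pathlib import Path

def delete_by_prefix(folder, prefix, dry_run=True):
    """Find (and optionally delete) .tif files in folder starting with prefix."""
    # glob still has to scan the folder, just like a wildcard del would.
    matches = sorted(Path(folder).glob(f"{prefix}*.tif"))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling it once with `dry_run=True` and reviewing the returned list is a cheap way to test on a small folder first.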
 

Homerboy

Well, partially good news. Initial SQL query returns "only" 854K rows.

Granted, each one of those rows, due to the wildcard listed above, could be a single .tif that needs to be deleted or a 50-page scanned file stored as 50 individual .tifs.
 

piasabird

Lifer
Feb 6, 2002
17,168
60
91
There is probably some way to build an engine in Java that reads the file's address and just clears out the address space one at a time. Some programs can just delete all data at a specific address. It would be even neater if you could rename all the files with a prefix and then del everything with that prefix.

Don't forget to back up, and limit the number of files you delete at a time to test it first.
 

Markbnj

Elite Member, Moderator Emeritus
Moderator
Sep 16, 2005
15,682
13
81
www.markbetz.net
I'm a little speechless, I have to say.
 

Charles Kozierok

Homerboy, if you want, send me the file (with the rows) so I can take a look at it.

Email is my first name squished next to the first three letters of my last name, at the email service Google runs.
 

Homerboy

I don't have my definitive list yet.
I'm waiting on some middle management type people to make their final decisions on what exactly is to get purged.... so it should be 3-4 months! :)
 

sourceninja

Diamond Member
Mar 8, 2005
8,805
65
91
If I'm writing a script, I personally wouldn't want to use that to write an intermediate script. I'd use something like Python to query the database and delete the files while writing an output log.

But I'm not a Windows guy, I'm a *nix guy.
 

degibson

Golden Member
Mar 21, 2008
1,389
0
0
How many files are you keeping? Maybe it'd be simpler to copy out all the files you're keeping then del the entire remaining directory?
 

beginner99

Diamond Member
Jun 2, 2009
5,210
1,580
136
Agree 100%.

I would run the SQL statement, then iterate over the results and delete file by file directly from the same script. Also, if you do proper exception handling, that will lead to far fewer problems if something goes wrong.

In addition to a log, I would even set a flag in the database for each file deleted (after it actually was deleted), so that the row is not selected again on the next try if something goes wrong. Or, if the row is no longer needed, delete it.
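A minimal sketch of that approach in Python might look like the following. It makes several assumptions: `sqlite3` stands in for whatever database is really in use, the table `purge_list` with columns `path` and `deleted` is hypothetical, and treating an already-missing file as done is a choice made for this example rather than something stated above.

```python
import logging
import os
import sqlite3

def purge(conn, log=logging.getLogger("purge")):
    """Delete each unflagged file, then flag its row; return files removed."""
    rows = conn.execute(
        "SELECT rowid, path FROM purge_list WHERE deleted = 0"
    ).fetchall()
    removed = 0
    for rowid, path in rows:
        try:
            os.remove(path)
        except FileNotFoundError:
            # Already gone: log it, but still flag the row below.
            log.warning("already gone: %s", path)
        except OSError as exc:
            # Leave the flag unset so a rerun retries this row.
            log.error("failed: %s (%s)", path, exc)
            continue
        else:
            removed += 1
            log.info("deleted: %s", path)
        conn.execute(
            "UPDATE purge_list SET deleted = 1 WHERE rowid = ?", (rowid,)
        )
        conn.commit()  # commit per row so an interrupted run loses nothing
    return removed
```

Because only rows with `deleted = 0` are selected, the script can be stopped and rerun without wading through the files it has already handled. Committing after every row is the cautious option; committing every few thousand rows would likely be faster.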
 

Charles Kozierok

Well, I prefer KISS myself. If there's a way to do it with a bunch of delete commands spit out from a quick Awk script, that's what I'm doing. :)
 

beginner99

Creating batch files from a script, then running those batch files, seems more complex to me than doing it all in a single script...