01-23-2013, 08:51 AM
|
#1
|
|
Lifer
Join Date: Mar 2000
Location: MKE, WI
Posts: 19,209
|
creating a script to delete TONS of files
We have millions and millions of scanned images on our system that are saved in:
E:\IMAGES\date\whatever\something\filename###.tif
We need to purge some of these files.
I can identify the exact path of the files that need to be purged (they have entries in a SQL database I can query against with some filters), but I'm going to end up with 100s of thousands, if not a million+ files to delete.
Does anyone have a "best practice" on how to create a script to delete these?
I could just create a .bat file that says:
Del E:\IMAGES\date\whatever\something\filename001.tif
Del E:\IMAGES\date\whatever\something\filename002.tif
...
Del E:\IMAGES\date\whatever\something\filename5438975439857.tif
but that just seems ridiculous.
Thoughts? Suggestions?
__________________
"Ah... In a time of such ugliness, the only true protest is to be beautiful." - Refused
|
|
|
01-23-2013, 09:01 AM
|
#2
|
|
Elite Member
Join Date: Sep 2001
Posts: 30,636
|
I don't see why that's ridiculous if it's something you can automate. If you have a task that queries the database and builds the batch file for you why does it matter if it's a straight list of filenames or a fancy glob or regular expression?
If they're all within one directory it would probably be quicker that way too as any globbing against a directory with millions of files will be very slow.
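A minimal sketch of the "build the batch file from the query results" idea, in Python. It assumes the query has already produced a plain list of full paths; the function name and the sample paths are made up for illustration. Paths are quoted so spaces survive, and `%` is doubled because cmd treats a single `%` inside a batch file as the start of a variable.

```python
def build_del_commands(paths):
    """Return one cmd.exe `del` line per path."""
    commands = []
    for path in paths:
        # Inside a .bat file a literal percent sign must be written as %%.
        escaped = path.replace("%", "%%")
        # Quote the path so that spaces do not split it into two arguments.
        commands.append('del "{}"'.format(escaped))
    return commands


if __name__ == "__main__":
    # Hypothetical rows standing in for the real query results.
    sample_paths = [
        r"E:\IMAGES\date\whatever\something\filename001.tif",
        r"E:\IMAGES\date\whatever\something\filename002.tif",
    ]
    for line in build_del_commands(sample_paths):
        print(line)
```

Redirecting that output to a file gives the straight list of `del` lines discussed here, with no globbing involved.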
|
|
|
01-23-2013, 10:28 AM
|
#3
|
|
Member
Join Date: Jun 2012
Posts: 142
|
Use PowerShell.
|
|
|
01-23-2013, 10:37 AM
|
#4
|
|
Lifer
Join Date: Mar 2000
Location: MKE, WI
Posts: 19,209
|
Quote:
Originally Posted by Nothinman
I don't see why that's ridiculous if it's something you can automate. If you have a task that queries the database and builds the batch file for you why does it matter if it's a straight list of filenames or a fancy glob or regular expression?
If they're all within one directory it would probably be quicker that way too as any globbing against a directory with millions of files will be very slow.
|
I guess it seemed "ridiculous" to have a .bat file with potentially a million lines in it.
This would be a one-time-shot.
|
|
|
01-23-2013, 11:32 AM
|
#5
|
|
Elite Member
Join Date: Sep 2001
Posts: 30,636
|
Quote:
|
Originally Posted by debian0001
Use PowerShell.
|
And that will speed up the process or make it in any way better how?
Quote:
Originally Posted by Homerboy
I guess it seemed "ridiculous" to have a .bat file with potentially a million lines in it.
This would be a one-time-shot.
|
Then I would say it's even less ridiculous since you don't plan on doing it regularly. Without doing any trials, I would think the individual deletion of the files by full path name would move much quicker than trying to use a wild card and causing the system to have to read in all of the filenames within the directory before even starting the delete.
One thing I'm not sure about is how cmd will handle such a large batch file; I've never seen one anywhere near that size.
|
|
|
01-23-2013, 12:05 PM
|
#6
|
|
Member
Join Date: Sep 2005
Posts: 196
|
If your only goal is to delete the files, then pulling the data from the database one-by-one and creating individual delete commands will probably work.
If you also need to understand the difference between what you're deleting vs. what you're not deleting, what exists on the filesystem vs. what is in the database, etc then you will need to do some preprocessing to understand that.
I would recommend that you do not create one large script with a million entries. If you hit an error, or it takes too long and you have to stop and restart, you would have to start from the beginning and work through all the earlier deletes again (which now fail because those files are already gone). It might make sense to create a series of scripts with 100 or 200 or insert_whatever_number_makes_sense entries each, so that each script file deletes one batch of files.
This way, if your SQL query takes a while to generate the scripts, you can parallelize script generation and deletion by running some of the scripts while others are still being generated. Plus, if each script writes its own unique log, you don't have to go through a million lines to figure out what happened. Another opportunity to parallelize: if you have a fairly capable storage system, you could even run some deletes in parallel, although I don't recommend that on a single spinning disk.
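The batching, per-batch logging and parallel deletes described above could be sketched like this in Python. This is only an illustration under assumptions: the function names, the log file naming and the default of 4 workers are invented here, and the deletes are done directly with `os.remove` rather than through generated scripts.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def delete_batch(batch_id, paths, log_dir):
    """Delete one batch of files and record each outcome in that batch's own log."""
    log_path = os.path.join(log_dir, "batch_{:05d}.log".format(batch_id))
    failures = 0
    with open(log_path, "w", encoding="utf-8") as log:
        for path in paths:
            try:
                os.remove(path)
                log.write("OK\t{}\n".format(path))
            except FileNotFoundError:
                # Already gone (for example after an interrupted earlier run).
                log.write("MISSING\t{}\n".format(path))
            except OSError as exc:
                failures += 1
                log.write("FAIL\t{}\t{}\n".format(path, exc))
    return failures


def delete_in_parallel(batches, log_dir, workers=4):
    """Run several batches at once; returns the total number of failed deletes."""
    os.makedirs(log_dir, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(delete_batch, i, batch, log_dir)
            for i, batch in enumerate(batches)
        ]
        return sum(f.result() for f in futures)
```

Setting `workers=1` gives the sequential behaviour that is probably the safer choice on a single spinning disk. Because a missing file is logged rather than treated as a failure, rerunning a batch after an interruption is harmless.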
|
|
|
01-23-2013, 12:46 PM
|
#7
|
|
Diamond Member
Join Date: Feb 2006
Posts: 5,544
|
__________________
"All men are not created equal, and if you believe they are, there's something seriously wrong with you. Some men are destined for greatness. Most aren't. End of story." - Jose Canseco
|
|
|
01-23-2013, 01:14 PM
|
#8
|
|
Lifer
Join Date: Mar 2000
Location: MKE, WI
Posts: 19,209
|
Quote:
Originally Posted by drebo
|
Interesting.
Though I'm missing how I'd use that with a huge list of files and their paths.
|
|
|
01-23-2013, 01:24 PM
|
#9
|
|
Elite Member
Join Date: Sep 2001
Posts: 30,636
|
Quote:
Originally Posted by Homerboy
Interesting.
Though I'm missing how I'd use that with a huge list of files and their paths.
|
And that scans the directory to match the names so it would most likely be many orders of magnitude slower than a straight 'del blah' over and over within a batch file.
I agree with what MerlinRML said about breaking up the files into like 1000 files so that it's easier to debug and you can run several of them in parallel but I think any more than that would be over-engineering for something you're going to do once.
|
|
|
01-23-2013, 01:34 PM
|
#10
|
|
Lifer
Join Date: Mar 2000
Location: MKE, WI
Posts: 19,209
|
Quote:
Originally Posted by Nothinman
And that scans the directory to match the names so it would most likely be many orders of magnitude slower than a straight 'del blah' over and over within a batch file.
I agree with what MerlinRML said about breaking up the files into like 1000 files so that it's easier to debug and you can run several of them in parallel but I think any more than that would be over-engineering for something you're going to do once.
|
Like I said, "forfiles" is an interesting command, but I don't see how it'd be usable let alone truly functional in this situation.
I do like the idea of breaking up the .bat files as well. Though I'm guessing I will have a million+ or so to delete so probably something like 10,000 per .bat file and have the last line of the .bat file call the next .bat file so if it doesn't error, it just keeps rolling.
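A rough sketch of that chaining idea in Python, assuming the `del` lines have already been grouped into chunks. The function name and the `purge_0001.bat` naming scheme are hypothetical. Each file except the last ends with the name of the next file; `%~dp0` is cmd's expansion for the folder of the batch file that is currently running, and naming another .bat without `call` hands control over to it instead of nesting.

```python
import os


def write_chained_batches(chunks, out_dir, prefix="purge"):
    """Write one .bat per chunk; each file's last line starts the next file."""
    os.makedirs(out_dir, exist_ok=True)
    names = ["{}_{:04d}.bat".format(prefix, i + 1) for i in range(len(chunks))]
    for i, (name, commands) in enumerate(zip(names, chunks)):
        lines = ["@echo off"] + list(commands)
        if i + 1 < len(names):
            # No `call`: control transfers to the next file and does not return.
            lines.append('"%~dp0{}"'.format(names[i + 1]))
        # newline="\r\n" writes Windows line endings on any platform.
        with open(os.path.join(out_dir, name), "w", newline="\r\n") as fh:
            fh.write("\n".join(lines) + "\n")
    return names
```

Starting the first file then runs the whole series, and restarting from any later file resumes from that point.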
|
|
|
01-23-2013, 01:39 PM
|
#11
|
|
Discussion Club Moderator Elite Member
Join Date: May 2012
Posts: 6,419
|
Directory Toolkit will probably do something like this too, but might choke on that number.
I could help generate some batch files for you if you need.
__________________
Webmaster, The PC Guide -- Relaunching in 2014 with all-new material!
Author, The TCP/IP Guide (getting a bit old but still lots of good free info)
"The apparent accuracy of a Wikipedia article is inversely proportional to
the depth of the reader's knowledge of the topic." -- Kozierok's First Law
|
|
|
01-23-2013, 01:42 PM
|
#12
|
|
Lifer
Join Date: Mar 2000
Location: MKE, WI
Posts: 19,209
|
Quote:
Originally Posted by CharlesKozierok
Directory Toolkit will probably do something like this too, but might choke on that number.
I could help generate some batch files for you if you need.
|
Thanks for the offer.
I think I can get SQL to churn out the command lines 1 by 1 exactly as I need. (literally have the resulting value in each row be "del E:\IMAGES\whatever\something\filename###.tif")
I'd just have to then break that 1,000,000 or so results into manageable .bat files of 10,000 or so rows. Not sure of an automatic way to do that.
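One possible automatic way, sketched in Python: save the query output as a text file with one command per line, then cut it into numbered .bat files. The function names, the output file pattern and the assumption that all paths are plain ASCII are mine, not anything from the actual system.

```python
from itertools import islice


def chunked(lines, size=10000):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def split_command_file(source_path, out_pattern="purge_{:04d}.bat", size=10000):
    """Split a one-command-per-line text file into numbered .bat files."""
    written = []
    with open(source_path, encoding="utf-8") as src:
        # Skip blank lines so a trailing newline does not become an empty command.
        commands = (line.rstrip("\r\n") for line in src if line.strip())
        for number, chunk in enumerate(chunked(commands, size), start=1):
            out_path = out_pattern.format(number)
            with open(out_path, "w", encoding="utf-8", newline="\r\n") as out:
                out.write("\n".join(chunk) + "\n")
            written.append(out_path)
    return written
```

The source file is read lazily, so a million rows never have to sit in memory at once; only one chunk of `size` lines is held at a time.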
|
|
|
01-23-2013, 01:43 PM
|
#13
|
|
Discussion Club Moderator Elite Member
Join Date: May 2012
Posts: 6,419
|
I can help with that too, probably.
|
|
|
01-23-2013, 01:45 PM
|
#14
|
|
Lifer
Join Date: Mar 2000
Location: MKE, WI
Posts: 19,209
|
Hmm interesting.
I just realized that I may have to use a wildcard of some nature.
Say files are stored in:
E:\IMAGES\IMAGES\CM\200606\
There are then multiple file names that are AAATW###.tif
My SQL query doesn't show what the ### is. But I WOULD want to delete EVERYTHING that was E:\IMAGES\IMAGES\CM\200606\AAATW###.tif
So I'd have to do:
del E:\IMAGES\IMAGES\CM\200606\AAATW*.*
So I would have to use a wildcard.
Not too horrific though as the # of files with that deep of a sub-folder isn't outrageous. Maybe several thousand max.
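For comparison, here is a hedged Python sketch of the same per-folder cleanup with a tighter match than `AAATW*.*`, which would also catch names such as AAATWX01.tif or AAATW003.bak if any exist. It assumes the ### part is digits only; the function names and the dry-run default are illustrative choices.

```python
import os
import re


def matching_files(directory, prefix):
    """Return names in `directory` that look like <prefix><digits>.tif."""
    pattern = re.compile(re.escape(prefix) + r"\d+\.tif", re.IGNORECASE)
    with os.scandir(directory) as entries:
        return sorted(
            e.name for e in entries
            if e.is_file() and pattern.fullmatch(e.name)
        )


def delete_prefixed(directory, prefix, dry_run=True):
    """Delete the matches; with dry_run=True only report what would go."""
    names = matching_files(directory, prefix)
    if not dry_run:
        for name in names:
            os.remove(os.path.join(directory, name))
    return names
```

Calling `delete_prefixed(r"E:\IMAGES\IMAGES\CM\200606", "AAATW")` first, with the default dry run, lists the candidates without touching anything.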
|
|
|
01-23-2013, 01:49 PM
|
#15
|
|
Discussion Club Moderator Elite Member
Join Date: May 2012
Posts: 6,419
|
You could do that with a single .bat most likely.
|
|
|
01-23-2013, 01:52 PM
|
#16
|
|
Lifer
Join Date: Mar 2000
Location: MKE, WI
Posts: 19,209
|
Well partially good news. Initial SQL query returns "only" 854K rows.
Granted, because of the wildcard mentioned above, each of those rows could be a single .tif that needs to be deleted or a 50-page scanned file stored as 50 individual .tifs.
|
|
|
01-23-2013, 02:00 PM
|
#17
|
|
Lifer
Join Date: Feb 2002
Posts: 13,241
|
There is probably some way to build an engine in Java that reads each file's address and just clears out the address space one at a time. Some programs can just delete all data at a specific address. It would be even neater if you could rename all the files with a prefix and then del everything with that prefix.
Don't forget to back up, and limit the number of files you delete at a time to test it first.
|
|
|
01-23-2013, 02:07 PM
|
#18
|
|
Moderator Programming
Join Date: Sep 2005
Posts: 8,154
|
Quote:
Originally Posted by piasabird
There is probably some way to build an engine in Java that reads each file's address and just clears out the address space one at a time. Some programs can just delete all data at a specific address. It would be even neater if you could rename all the files with a prefix and then del everything with that prefix.
Don't forget to back up, and limit the number of files you delete at a time to test it first.
|
I'm a little speechless, I have to say.
|
|
|
01-23-2013, 02:08 PM
|
#19
|
|
Discussion Club Moderator Elite Member
Join Date: May 2012
Posts: 6,419
|
Homerboy, if you want, send me the file (with the rows) so I can take a look at it.
Email is my first name squished next to the first three letters of my last name, at the email service Google runs.
|
|
|
01-23-2013, 02:39 PM
|
#20
|
|
Lifer
Join Date: Mar 2000
Location: MKE, WI
Posts: 19,209
|
Quote:
Originally Posted by CharlesKozierok
Homerboy, if you want, send me the file (with the rows) so I can take a look at it.
Email is my first name squished next to the first three letters of my last name, at the email service Google runs.
|
I don't have my definitive list yet.
I'm waiting on some middle management type people to make their final decisions on what exactly is to get purged.... so it should be 3-4 months!
|
|
|
01-23-2013, 06:33 PM
|
#21
|
|
Diamond Member
Join Date: Mar 2005
Posts: 7,377
|
If I'm writing a script, I personally wouldn't want to use that to write an intermediate script. I'd use something like Python to query the database and delete the files while writing an output log.
But I'm not a windows guy, I'm a *nix guy.
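A minimal sketch of that approach. `sqlite3` is used purely as a stand-in for whatever driver the real database needs, and the `purge_list` table, `path` column and logger name mentioned below are hypothetical.

```python
import logging
import os
import sqlite3


def purge_from_query(conn, query, log):
    """Delete every file whose path the query returns, logging each result."""
    deleted = failed = 0
    # Each row is expected to hold exactly one column: the full file path.
    for (path,) in conn.execute(query):
        try:
            os.remove(path)
            deleted += 1
            log.info("deleted %s", path)
        except OSError as exc:
            # Covers missing files, locked files and permission problems.
            failed += 1
            log.error("could not delete %s: %s", path, exc)
    return deleted, failed
```

A run might configure `logging.basicConfig(filename="purge.log", level=logging.INFO)`, open the connection, and call `purge_from_query(conn, "SELECT path FROM purge_list", logging.getLogger("purge"))`, which returns the counts of deleted and failed files.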
|
|
|
01-23-2013, 07:15 PM
|
#22
|
|
Golden Member
Join Date: Mar 2008
Posts: 1,389
|
How many files are you keeping? Maybe it'd be simpler to copy out all the files you're keeping then del the entire remaining directory?
|
|
|
01-24-2013, 05:34 AM
|
#23
|
|
Golden Member
Join Date: Jun 2009
Posts: 1,557
|
Quote:
Originally Posted by sourceninja
If I'm writing a script, I personally wouldn't want to use that to write an intermediate script. I'd use something like Python to query the database and delete the files while writing an output log.
But I'm not a windows guy, I'm a *nix guy.
|
Agree 100%.
I would run the SQL statement, then iterate over the results and delete file by file directly from the same script. Proper exception handling will also lead to far fewer problems if something goes wrong.
In addition to a log, I would even set a flag in the database for each file once it has actually been deleted, so that the row is not selected again on the next attempt if something goes wrong. Or, if the row is no longer needed, delete it.
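Roughly what the flag idea could look like in Python, again with `sqlite3` standing in for the real database. The `purge_list` table with `id`, `path` and `deleted` columns is an assumed schema, ids are assumed to be positive integers, and the `LIMIT` syntax would need adapting for other SQL dialects.

```python
import os
import sqlite3


def purge_and_flag(conn, batch_size=1000):
    """Delete unflagged files in batches, flagging each row once its file is gone."""
    flagged = 0
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, path FROM purge_list "
            "WHERE deleted = 0 AND id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return flagged
        done = []
        for row_id, path in rows:
            try:
                os.remove(path)
            except FileNotFoundError:
                pass  # already gone, so it is safe to flag the row
            except OSError:
                continue  # locked or denied: leave unflagged for a later run
            done.append((row_id,))
        conn.executemany("UPDATE purge_list SET deleted = 1 WHERE id = ?", done)
        conn.commit()  # commit per batch so an interruption loses little progress
        flagged += len(done)
        # Move past this batch so rows that failed are not reselected in this run.
        last_id = rows[-1][0]
```

Rows whose files could not be removed stay at `deleted = 0`, so simply running the function again retries only those. Treating an already-missing file as done is a design choice made here, not something the thread specifies.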
|
|
|
01-24-2013, 05:44 AM
|
#24
|
|
Discussion Club Moderator Elite Member
Join Date: May 2012
Posts: 6,419
|
Well, I prefer KISS myself. If there's a way to do it with a bunch of delete commands spit out from a quick Awk script, that's what I'm doing.
|
|
|
01-24-2013, 06:38 AM
|
#25
|
|
Golden Member
Join Date: Jun 2009
Posts: 1,557
|
Quote:
Originally Posted by CharlesKozierok
Well, I prefer KISS myself. If there's a way to do it with a bunch of delete commands spit out from a quick Awk script, that's what I'm doing. 
|
Creating batch files from script, then running those batch files seems more complex to me than doing it all in 1 single script...
|
|
|
All times are GMT -5.