Old 01-23-2013, 08:51 AM   #1
Homerboy
Lifer
 
Join Date: Mar 2000
Location: MKE, WI
Posts: 19,399
creating a script to delete TONS of files

We have millions and millions of scanned images on our system that are saved in:

EIMAGES\date\whatever\something\filename###.tif

We need to purge some of these files.

I can identify the exact path of the files that need to be purged (they have entries in a SQL database I can query against with some filters), but I'm going to end up with 100s of thousands, if not a million+ files to delete.

Does anyone have a "best practice" on how to create a script to delete these?

I could just create a .bat file that says:

Del EIMAGES\date\whatever\something\filename001.tif
Del EIMAGES\date\whatever\something\filename002.tif
...
Del EIMAGES\date\whatever\something\filename5438975439857.tif

but that just seems ridiculous.

Thoughts? Suggestions?
__________________
"Ah... In a time of such ugliness, the only true protest is to be beautiful." - Refused
Old 01-23-2013, 09:01 AM   #2
Nothinman
Elite Member
 
Join Date: Sep 2001
Posts: 30,672

I don't see why that's ridiculous if it's something you can automate. If you have a task that queries the database and builds the batch file for you why does it matter if it's a straight list of filenames or a fancy glob or regular expression?

If they're all within one directory it would probably be quicker that way too as any globbing against a directory with millions of files will be very slow.
__________________
http://www.debian.org
Old 01-23-2013, 10:28 AM   #3
debian0001
Senior Member
 
Join Date: Jun 2012
Posts: 290

Use PowerShell.
Old 01-23-2013, 10:37 AM   #4
Homerboy
Lifer
 
Join Date: Mar 2000
Location: MKE, WI
Posts: 19,399

Quote:
Originally Posted by Nothinman View Post
I don't see why that's ridiculous if it's something you can automate. If you have a task that queries the database and builds the batch file for you why does it matter if it's a straight list of filenames or a fancy glob or regular expression?

If they're all within one directory it would probably be quicker that way too as any globbing against a directory with millions of files will be very slow.
I guess it seemed "ridiculous" to have a .bat file with potentially a million lines in it.

This would be a one-time-shot.
Old 01-23-2013, 11:32 AM   #5
Nothinman
Elite Member
 
Join Date: Sep 2001
Posts: 30,672

Quote:
Originally Posted by debian0001
Use PowerShell.
And that will speed up the process or make it in any way better how?

Quote:
Originally Posted by Homerboy View Post
I guess it seemed "ridiculous" to have a .bat file with potentially a million lines in it.

This would be a one-time-shot.
Then I would say it's even less ridiculous since you don't plan on doing it regularly. Without doing any trials, I would think deleting each file by its full path name would be much quicker than using a wildcard, which makes the system read in all of the filenames within the directory before it even starts deleting.

One thing I'm not sure about is how cmd will handle such a large batch file; I've never seen one anywhere near that size.
Old 01-23-2013, 12:05 PM   #6
MerlinRML
Senior Member
 
Join Date: Sep 2005
Posts: 200

If your only goal is to delete the files, then pulling the data from the database one-by-one and creating individual delete commands will probably work.

If you also need to understand the difference between what you're deleting vs. what you're not deleting, what exists on the filesystem vs. what is in the database, etc., then you will need to do some preprocessing to understand that.

I would recommend that you do not create one large script with a million entries. If you run into an error, or if it takes too long and you have to stop and restart, or for some other reason, you have to start from the beginning and get past all the original deletes (now erroring because the file is already deleted) again. It might make sense to create a series of scripts with 100 or 200 or insert_whatever_number_makes_sense. This way, you will end up with a number of different script files that each deletes a batch of so many files.

This way, if your SQL query takes a while to generate the scripts, you can parallelize your script file generation and your deletes by running some of the scripts while others are being generated. Plus, if you have each script generate its own unique log, you don't have to go through a million lines to figure out what happened. There's another opportunity to parallelize: if you have a fairly capable storage system, you could even run some deletes in parallel, although I don't recommend doing that if it's just a single spinning disk.
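
The preprocessing step mentioned above could look something like the following. This is only a minimal Python sketch, not a definitive tool: the function name `reconcile` and its arguments are hypothetical, and it assumes the list of paths from the database fits in memory and that walking the image tree once is affordable.

```python
import os


def reconcile(listed_paths, root):
    """Compare the paths listed for deletion with what is actually on disk.

    Returns three sorted lists:
      present  - listed and found under root (can be deleted)
      missing  - listed but not found under root (stale database rows)
      unlisted - found under root but not listed (files that will be kept)
    """
    def norm(path):
        # normcase lower-cases on Windows so comparisons ignore case there
        return os.path.normcase(os.path.abspath(path))

    listed = {norm(p) for p in listed_paths}
    on_disk = {
        norm(os.path.join(folder, name))
        for folder, _dirs, files in os.walk(root)
        for name in files
    }
    return (
        sorted(listed & on_disk),
        sorted(listed - on_disk),
        sorted(on_disk - listed),
    )
```

Running this before any delete gives a count for each of the three groups, which is a cheap sanity check on the SQL filter.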
Old 01-23-2013, 12:46 PM   #7
drebo
Diamond Member
 
Join Date: Feb 2006
Posts: 6,317

forfiles.

http://technet.microsoft.com/en-us/l...(v=ws.10).aspx
__________________
"All men are not created equal, and if you believe they are, there's something seriously wrong with you. Some men are destined for greatness. Most aren't. End of story." - Jose Canseco
Old 01-23-2013, 01:14 PM   #8
Homerboy
Lifer
 
Join Date: Mar 2000
Location: MKE, WI
Posts: 19,399

Quote:
Originally Posted by drebo View Post
Interesting.
Though I'm missing how I'd use that with a huge list of files and their paths.
Old 01-23-2013, 01:24 PM   #9
Nothinman
Elite Member
 
Join Date: Sep 2001
Posts: 30,672

Quote:
Originally Posted by Homerboy View Post
Interesting.
Though I'm missing how I'd use that with a huge list of files and their paths.
And that scans the directory to match the names so it would most likely be many orders of magnitude slower than a straight 'del blah' over and over within a batch file.

I agree with what MerlinRML said about breaking up the files into like 1000 files so that it's easier to debug and you can run several of them in parallel but I think any more than that would be over-engineering for something you're going to do once.
Old 01-23-2013, 01:34 PM   #10
Homerboy
Lifer
 
Join Date: Mar 2000
Location: MKE, WI
Posts: 19,399

Quote:
Originally Posted by Nothinman View Post
And that scans the directory to match the names so it would most likely be many orders of magnitude slower than a straight 'del blah' over and over within a batch file.

I agree with what MerlinRML said about breaking up the files into like 1000 files so that it's easier to debug and you can run several of them in parallel but I think any more than that would be over-engineering for something you're going to do once.
Like I said, "forfiles" is an interesting command, but I don't see how it'd be usable let alone truly functional in this situation.

I do like the idea of breaking up the .bat files as well. Though I'm guessing I will have a million+ or so to delete so probably something like 10,000 per .bat file and have the last line of the .bat file call the next .bat file so if it doesn't error, it just keeps rolling.
Old 01-23-2013, 01:39 PM   #11
Charles Kozierok
Elite Member
 
Join Date: May 2012
Posts: 6,762

Directory Toolkit will probably do something like this too, but might choke on that number.

I could help generate some batch files for you if you need.
__________________
"Of those who say nothing, few are silent." -- Thomas Neill
Old 01-23-2013, 01:42 PM   #12
Homerboy
Lifer
 
Join Date: Mar 2000
Location: MKE, WI
Posts: 19,399

Quote:
Originally Posted by CharlesKozierok View Post
Directory Toolkit will probably do something like this too, but might choke on that number.

I could help generate some batch files for you if you need.
Thanks for the offer.
I think I can get SQL to churn out the command lines 1 by 1 exactly as I need. (literally have the resulting value in each row be "del EIMAGES\whatever\something\filename###.tif")

I'd just have to then break that 1,000,000 or so results into manageable .bat files of 10,000 or so rows. Not sure of an automatic way to do that.
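
One automatic way to do the split is a short script. Here is a minimal Python sketch, assuming the SQL output has already been saved as one `del ...` command per row; the function name `write_batches` and the `purge_0001.bat` naming are made up for illustration. Each file's last line runs the next file by name. Without `call`, cmd hands control to that file and does not return, so the chain keeps rolling from one file to the next.

```python
from pathlib import Path


def write_batches(commands, out_dir, batch_size=10_000, chain=True):
    """Write commands into numbered .bat files of at most batch_size lines.

    With chain=True every file except the last ends with a line that runs
    the next file. Returns the list of paths written, in order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = [
        commands[i:i + batch_size]
        for i in range(0, len(commands), batch_size)
    ]
    paths = [out_dir / f"purge_{n:04d}.bat" for n in range(1, len(chunks) + 1)]
    for n, (path, chunk) in enumerate(zip(paths, chunks)):
        lines = list(chunk)
        if chain and n + 1 < len(paths):
            # Running a .bat by name (no "call") transfers control to it
            lines.append(f'"{paths[n + 1]}"')
        # newline="" keeps the explicit CRLF endings that cmd expects
        with open(path, "w", newline="") as handle:
            handle.write("\r\n".join(lines) + "\r\n")
    return paths
```

With `chain=False` the files are independent, which fits the earlier suggestion of running several of them side by side.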
Old 01-23-2013, 01:43 PM   #13
Charles Kozierok
Elite Member
 
Join Date: May 2012
Posts: 6,762

I can help with that too, probably.
Old 01-23-2013, 01:45 PM   #14
Homerboy
Lifer
 
Join Date: Mar 2000
Location: MKE, WI
Posts: 19,399

Hmm interesting.
I just realized that I may have to use a wildcard of some nature.

Say files are stored in:

E:\IMAGES\IMAGES\CM\200606\

There are then multiple file names that are AAATW###.tif

My SQL query doesn't show what the ### is. But I WOULD want to delete EVERYTHING that was E:\IMAGES\IMAGES\CM\200606\AAATW###.tif
So I'd have to do:

del E:\IMAGES\IMAGES\CM\200606\AAATW*.*

So I would have to use a wildcard.
Not too horrific though as the # of files with that deep of a sub-folder isn't outrageous. Maybe several thousand max.
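
For comparison, the same per-folder wildcard delete can be sketched in Python. This is a rough illustration under assumptions: `delete_by_prefix` is a hypothetical helper, it matches only one extension (narrower than `*.*`), and it defaults to a dry run so the matches can be inspected before anything is removed.

```python
import glob
from pathlib import Path


def delete_by_prefix(folder, prefix, suffix=".tif", dry_run=True):
    """Find files in folder named <prefix><anything><suffix> and delete them.

    Returns the sorted list of matching paths. With dry_run=True (the
    default) nothing is removed. Only the given folder is scanned, not
    its sub-folders.
    """
    # glob.escape guards against prefixes containing *, ? or [
    pattern = glob.escape(prefix) + "*" + glob.escape(suffix)
    matches = sorted(p for p in Path(folder).glob(pattern) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

For the example folder, `delete_by_prefix(r"E:\IMAGES\IMAGES\CM\200606", "AAATW")` would list the candidates, and passing `dry_run=False` would remove them.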
Old 01-23-2013, 01:49 PM   #15
Charles Kozierok
Elite Member
 
Join Date: May 2012
Posts: 6,762

You could do that with a single .bat most likely.
Old 01-23-2013, 01:52 PM   #16
Homerboy
Lifer
 
Join Date: Mar 2000
Location: MKE, WI
Posts: 19,399

Well partially good news. Initial SQL query returns "only" 854K rows.

Granted, each one of those rows, due to the wildcard listed above, could be a single .tif that needs to be deleted or some 50-page scanned file in 50 individual .tifs.
Old 01-23-2013, 02:00 PM   #17
piasabird
Lifer
 
Join Date: Feb 2002
Posts: 14,659

There is probably some way to build an engine in Java that reads the file's address and just clears out the address space one at a time. Some programs can just delete all data at a specific address. It would be even neater if you could rename all the files with a prefix and then del everything with that prefix.

Don't forget to back up, and limit the number of files you delete at a time to test it first.
__________________
Obama care gives more power to the IRS. When you house is repossessed
when you have no health care and you have to pay the tax
Remember that this is what you voted for!
Old 01-23-2013, 02:07 PM   #18
Markbnj
Moderator
Programming
 
Join Date: Sep 2005
Posts: 10,525

Quote:
Originally Posted by piasabird View Post
There is probably some way to build an engine in Java that reads the file's address and just clears out the address space one at a time. Some programs can just delete all data at a specific address. It would be even neater if you could rename all the files with a prefix and then del everything with that prefix.

Don't forget to back up, and limit the number of files you delete at a time to test it first.
I'm a little speechless, I have to say.
__________________
Everytime I try to tell you, the words just come out wrong

**
Some meaningless scribbling of no account

The 4th Realm

Arts and Letters Daily - Get some culture
Old 01-23-2013, 02:08 PM   #19
Charles Kozierok
Elite Member
 
Join Date: May 2012
Posts: 6,762

Homerboy, if you want, send me the file (with the rows) so I can take a look at it.

Email is my first name squished next to the first three letters of my last name, at the email service Google runs.
Old 01-23-2013, 02:39 PM   #20
Homerboy
Lifer
 
Join Date: Mar 2000
Location: MKE, WI
Posts: 19,399

Quote:
Originally Posted by CharlesKozierok View Post
Homerboy, if you want, send me the file (with the rows) so I can take a look at it.

Email is my first name squished next to the first three letters of my last name, at the email service Google runs.
I don't have my definitive list yet.
I'm waiting on some middle management type people to make their final decisions on what exactly is to get purged.... so it should be 3-4 months!
Old 01-23-2013, 06:33 PM   #21
sourceninja
Diamond Member
 
Join Date: Mar 2005
Posts: 7,756

If I'm writing a script, I personally wouldn't want to use that to write an intermediate script. I'd use something like Python to query the database and delete the files while writing an output log.

But I'm not a windows guy, I'm a *nix guy.
Old 01-23-2013, 07:15 PM   #22
degibson
Golden Member
 
Join Date: Mar 2008
Posts: 1,389

How many files are you keeping? Maybe it'd be simpler to copy out all the files you're keeping then del the entire remaining directory?
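
A rough Python sketch of that keep-and-wipe idea, under stated assumptions: `keep_only` is a hypothetical helper, the keep list holds paths relative to the image root, and files are moved rather than copied (a move within one volume is normally a cheap rename). If any file on the keep list is missing, the move raises an error and the old tree is left untouched.

```python
import shutil
from pathlib import Path


def keep_only(root, keep_relative_paths, staging):
    """Move the files to keep from root into staging, then remove root.

    Relative sub-folders are recreated under staging. Returns the number
    of files moved. Any failed move raises before root is removed.
    """
    root, staging = Path(root), Path(staging)
    moved = 0
    for rel in keep_relative_paths:
        source = root / rel
        target = staging / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(source), str(target))
        moved += 1
    # Everything left under root at this point is on the purge side
    shutil.rmtree(root)
    return moved
```

Whether this beats individual deletes depends on the ratio of kept to purged files, which is the question the post raises.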
Old 01-24-2013, 05:34 AM   #23
beginner99
Platinum Member
 
Join Date: Jun 2009
Posts: 2,072

Quote:
Originally Posted by sourceninja View Post
If I'm writing a script, I personally wouldn't want to use that to write an intermediate script. I'd use something like Python to query the database and delete the files while writing an output log.

But I'm not a windows guy, I'm a *nix guy.
Agree 100%.

I would run the SQL statement, then iterate over the results and delete file by file directly from the same script. Also, proper exception handling will lead to far fewer problems if something goes wrong.

In addition to a log, I would even set a flag in the database for each file deleted (after it actually was deleted), so that the row is not selected again on the next try if something goes wrong. Or, if the row is no longer needed, delete it.
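
The approach described above might be sketched like this in Python. Everything specific here is an assumption for illustration: `sqlite3` stands in for whatever database driver the real system uses, and the `images` table with `id`, `path` and `deleted` columns is hypothetical.

```python
import logging
import os
import sqlite3

log = logging.getLogger("purge")


def purge(conn):
    """Delete every file whose row has deleted = 0, flagging rows as it goes.

    A row is flagged only after its file is really gone, so a rerun after
    a crash skips finished work. Returns (flagged_count, failed_count).
    """
    rows = conn.execute(
        "SELECT id, path FROM images WHERE deleted = 0"
    ).fetchall()
    flagged = failed = 0
    for row_id, path in rows:
        try:
            os.remove(path)
        except FileNotFoundError:
            # Already gone (e.g. an earlier partial run): still flag the row
            log.warning("already gone: %s", path)
        except OSError as exc:
            log.error("could not delete %s: %s", path, exc)
            failed += 1
            continue
        conn.execute("UPDATE images SET deleted = 1 WHERE id = ?", (row_id,))
        conn.commit()  # per-row commit is slow but survives interruption
        flagged += 1
    return flagged, failed
```

Rows that fail stay unflagged, so the next run picks them up again without touching the ones already handled.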
Old 01-24-2013, 05:44 AM   #24
Charles Kozierok
Elite Member
 
Join Date: May 2012
Posts: 6,762
Default

Well, I prefer KISS myself. If there's a way to do it with a bunch of delete commands spit out from a quick Awk script, that's what I'm doing.
Old 01-24-2013, 06:38 AM   #25
beginner99
Platinum Member
 
Join Date: Jun 2009
Posts: 2,072

Quote:
Originally Posted by CharlesKozierok View Post
Well, I prefer KISS myself. If there's a way to do it with a bunch of delete commands spit out from a quick Awk script, that's what I'm doing.
Creating batch files from script, then running those batch files seems more complex to me than doing it all in 1 single script...