creating a script to delete TONS of files

Discussion in 'Programming' started by Homerboy, Jan 23, 2013.

  1. Homerboy

    Homerboy Lifer

    Joined:
    Mar 1, 2000
    Messages:
    23,384
    Likes Received:
    57
    We have millions and millions of scanned images on our system that are saved in:

    E:\IMAGES\date\whatever\something\filename###.tif

    We need to purge some of these files.

    I can identify the exact path of the files that need to be purged (they have entries in a SQL database I can query against with some filters), but I'm going to end up with 100s of thousands, if not a million+ files to delete.

    Does anyone have a "best practice" on how to create a script to delete these?

    I could just create a .bat file that says:

    Del E:\IMAGES\date\whatever\something\filename001.tif
    Del E:\IMAGES\date\whatever\something\filename002.tif
    ...
    Del E:\IMAGES\date\whatever\something\filename5438975439857.tif

    but that just seems ridiculous.

    Thoughts? Suggestions?
     
  2. Nothinman

    Nothinman Elite Member

    Joined:
    Sep 14, 2001
    Messages:
    30,672
    Likes Received:
    0
    I don't see why that's ridiculous if it's something you can automate. If you have a task that queries the database and builds the batch file for you why does it matter if it's a straight list of filenames or a fancy glob or regular expression?

    If they're all within one directory it would probably be quicker that way too as any globbing against a directory with millions of files will be very slow.
     
  3. debian0001

    debian0001 Senior member

    Joined:
    Jun 8, 2012
    Messages:
    429
    Likes Received:
    0
    Use PowerShell.
     
  4. Homerboy

    Homerboy Lifer

    I guess it seemed "ridiculous" to have a .bat file with potentially a million lines in it.

    This would be a one-time-shot.
     
  5. Nothinman

    Nothinman Elite Member

    And that will speed up the process or make it in any way better how?

    Then I would say it's even less ridiculous since you don't plan on doing it regularly. Without doing any trials, I would think deleting the files individually by full path name would be much quicker than using a wildcard, which forces the system to read in all of the filenames within the directory before it even starts the delete.

    One thing I'm not sure about is how cmd will handle such a large batch file; I've never seen one anywhere near that size.
     
  6. MerlinRML

    MerlinRML Senior member

    Joined:
    Sep 9, 2005
    Messages:
    207
    Likes Received:
    0
    If your only goal is to delete the files, then pulling the data from the database one-by-one and creating individual delete commands will probably work.

    If you also need to understand the difference between what you're deleting vs. what you're not deleting, what exists on the filesystem vs. what is in the database, etc., then you will need to do some preprocessing to understand that.

    I would recommend that you do not create one large script with a million entries. If you run into an error, or it takes too long and you have to stop and restart, or anything else goes wrong, you have to start from the beginning and get past all the original deletes (now erroring because the files are already deleted) again. It might make sense to create a series of scripts with 100 or 200 or insert_whatever_number_makes_sense entries each. This way, you end up with a number of script files that each delete a batch of so many files.

    This way, if your SQL query takes a while to generate the scripts, you can parallelize your script file generation and your deletes by running some of the scripts while others are being generated. Plus, if you have each script generate its own unique log, you don't have to go through a million lines to figure out what happened. Another opportunity to parallelize: if you have a fairly capable storage system, you could even run some deletes in parallel, although I don't recommend doing that if it's just a single spinning disk.
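
    The batching, per-batch logging, and parallel deletes described above could be sketched roughly as follows. This is only an illustration, and it swaps the generated script files for a single Python program: the function names, the log file naming, and the default batch size and worker count are all made up for the example, not something from this thread.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def delete_batch(batch_number, paths, log_dir):
    """Delete one batch of files, recording each outcome in that batch's own log."""
    # Hypothetical log naming: one small log per batch instead of one huge log.
    log_path = os.path.join(log_dir, f"batch_{batch_number:04d}.log")
    failures = 0
    with open(log_path, "w", encoding="utf-8") as log:
        for path in paths:
            try:
                os.remove(path)
                log.write(f"OK\t{path}\n")
            except OSError as exc:
                # A missing or locked file is logged and skipped, not fatal.
                failures += 1
                log.write(f"FAIL\t{path}\t{exc}\n")
    return failures


def delete_in_batches(paths, log_dir, batch_size=200, workers=4):
    """Split paths into batches, delete the batches concurrently, return total failures."""
    os.makedirs(log_dir, exist_ok=True)
    batches = [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            delete_batch,
            range(1, len(batches) + 1),
            batches,
            [log_dir] * len(batches),
        )
        return sum(results)
```

    Setting workers=1 keeps the per-batch logs but runs the batches one after another, which is probably the safer choice on a single spinning disk.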
     
  7. drebo

    drebo Diamond Member

    Joined:
    Feb 24, 2006
    Messages:
    7,043
    Likes Received:
    0
  8. Homerboy

    Homerboy Lifer

  9. Nothinman

    Nothinman Elite Member

    And that scans the directory to match the names so it would most likely be many orders of magnitude slower than a straight 'del blah' over and over within a batch file.

    I agree with what MerlinRML said about breaking up the files into like 1000 files so that it's easier to debug and you can run several of them in parallel but I think any more than that would be over-engineering for something you're going to do once.
     
  10. Homerboy

    Homerboy Lifer

    Like I said, "forfiles" is an interesting command, but I don't see how it'd be usable let alone truly functional in this situation.

    I do like the idea of breaking up the .bat files as well. I'm guessing I will have a million or more files to delete, so probably something like 10,000 per .bat file, with the last line of each .bat file calling the next one so that, if it doesn't error, it just keeps rolling.
     
  11. Charles Kozierok

    Charles Kozierok Elite Member

    Joined:
    May 14, 2012
    Messages:
    6,762
    Likes Received:
    0
    Directory Toolkit will probably do something like this too, but might choke on that number.

    I could help generate some batch files for you if you need.
     
  12. Homerboy

    Homerboy Lifer

    Thanks for the offer.
    I think I can get SQL to churn out the command lines 1 by 1 exactly as I need. (literally have the resulting value in each row be "del E:\IMAGES\whatever\something\filename###.tif")

    I'd just have to then break that 1,000,000 or so results into manageable .bat files of 10,000 or so rows. Not sure of an automatic way to do that.
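
    One possible way to automate the split, sketched in Python on the assumption that the SQL output has already been loaded as a list of "del ..." lines. The function name, the purge_0001.bat naming scheme, and the output folder are made up for illustration. Each generated file ends by starting the next one, as suggested earlier in the thread.

```python
from pathlib import Path


def write_chained_batches(commands, out_dir, chunk_size=10_000):
    """Split command lines into numbered .bat files, each ending by starting the next."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = [commands[i:i + chunk_size] for i in range(0, len(commands), chunk_size)]
    written = []
    for index, chunk in enumerate(chunks, start=1):
        lines = ["@echo off"] + list(chunk)
        if index < len(chunks):
            # %~dp0 expands to the folder of the running script. Starting the
            # next file without "call" hands control over instead of nesting.
            lines.append(f'"%~dp0purge_{index + 1:04d}.bat"')
        path = out_dir / f"purge_{index:04d}.bat"
        # newline="" keeps the explicit CRLF line endings exactly as written.
        with open(path, "w", newline="") as handle:
            handle.write("\r\n".join(lines) + "\r\n")
        written.append(path)
    return written
```

    Running purge_0001.bat would then walk through the whole chain, and restarting after a problem only means launching the file where it stopped.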
     
  13. Charles Kozierok

    Charles Kozierok Elite Member

    I can help with that too, probably.
     
  14. Homerboy

    Homerboy Lifer

    Hmm interesting.
    I just realized that I may have to use a wildcard of some nature.

    Say files are stored in:

    E:\IMAGES\IMAGES\CM\200606\

    There are then multiple file names of the form AAATW###.tif

    My SQL query doesn't show what the ### is. But I WOULD want to delete EVERYTHING that was E:\IMAGES\IMAGES\CM\200606\AAATW###.tif
    So I'd have to do:

    del E:\IMAGES\IMAGES\CM\200606\AAATW*.*

    So I would have to use a wildcard.
    Not too horrific though as the # of files with that deep of a sub-folder isn't outrageous. Maybe several thousand max.
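
    For comparison, the same prefix match could be done from a script instead of a del wildcard. The following is a rough Python sketch: the function name and the dry_run switch are invented for the example, and it assumes only .tif files should match, which is narrower than the `*.*` in the command above.

```python
import os


def delete_by_prefix(folder, prefix, extension=".tif", dry_run=True):
    """Find files in `folder` named prefix*extension and delete them unless dry_run.

    Returns the sorted list of matching paths either way.
    """
    # Compare case-insensitively, as Windows file names are normally treated.
    prefix = prefix.lower()
    extension = extension.lower()
    matched = []
    # Like a wildcard, this has to list the folder once to find the matches.
    with os.scandir(folder) as entries:
        for entry in entries:
            name = entry.name.lower()
            if entry.is_file() and name.startswith(prefix) and name.endswith(extension):
                matched.append(entry.path)
    if not dry_run:
        for path in matched:
            os.remove(path)
    return sorted(matched)
```

    Calling it with the default dry_run=True first gives a list to eyeball before anything is actually removed.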
     
  15. Charles Kozierok

    Charles Kozierok Elite Member

    You could do that with a single .bat most likely.
     
  16. Homerboy

    Homerboy Lifer

    Well partially good news. Initial SQL query returns "only" 854K rows.

    Granted, each of those rows, due to the wildcard listed above, could be a single .tif that needs to be deleted or a 50-page scanned document stored as 50 individual .tifs.
     
  17. piasabird

    piasabird Lifer

    Joined:
    Feb 6, 2002
    Messages:
    16,573
    Likes Received:
    1
    There is probably some way to build an engine in Java that reads the file's address and just clears out the address space one at a time. Some programs can just delete all data at a specific address. It would be even neater if you could rename all the files with a prefix and then del everything with that prefix.

    Don't forget to back up, and limit the number of files you delete at a time to test it first.
     
  18. Markbnj

    Markbnj Elite Member, Moderator Emeritus
    Moderator

    Joined:
    Sep 16, 2005
    Messages:
    15,682
    Likes Received:
    3
    I'm a little speechless, I have to say.
     
  19. Charles Kozierok

    Charles Kozierok Elite Member

    Homerboy, if you want, send me the file (with the rows) so I can take a look at it.

    Email is my first name squished next to the first three letters of my last name, at the email service Google runs.
     
  20. Homerboy

    Homerboy Lifer

    I don't have my definitive list yet.
    I'm waiting on some middle management type people to make their final decisions on what exactly is to get purged.... so it should be 3-4 months! :)
     
  21. sourceninja

    sourceninja Diamond Member

    Joined:
    Mar 8, 2005
    Messages:
    8,576
    Likes Received:
    2
    If I'm writing a script, I personally wouldn't want to use that to write an intermediate script. I'd use something like Python to query the database and delete the files while writing an output log.

    But I'm not a Windows guy, I'm a *nix guy.
     
  22. degibson

    degibson Golden Member

    Joined:
    Mar 21, 2008
    Messages:
    1,389
    Likes Received:
    0
    How many files are you keeping? Maybe it'd be simpler to copy out all the files you're keeping then del the entire remaining directory?
     
  23. beginner99

    beginner99 Platinum Member

    Joined:
    Jun 2, 2009
    Messages:
    2,810
    Likes Received:
    2
    Agree 100%.

    I would do the SQL statement, then iterate over the results and delete file by file directly from the same script. Also, if you do proper exception handling, that will lead to far fewer problems if something goes wrong.

    In addition to a log, I would even set a flag in the database for each file deleted (after it actually was deleted), so that the row is not selected again on the next try if something goes wrong. Or, if the row is no longer needed, delete it.
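
    A rough sketch of what that single script could look like in Python. Everything specific here is an assumption for illustration: sqlite3 stands in for whatever SQL database is really in use, and the images table with its id, path, and purged columns is hypothetical.

```python
import logging
import os
import sqlite3


def purge_from_database(conn, log):
    """Delete every file whose row is not yet flagged, flagging each row once its file is gone.

    Returns (deleted, failed). Rows for files that fail to delete stay
    unflagged, so a later run picks them up again.
    """
    # Hypothetical schema: images(id, path, purged).
    rows = conn.execute("SELECT id, path FROM images WHERE purged = 0").fetchall()
    deleted = failed = 0
    for row_id, path in rows:
        try:
            os.remove(path)
        except FileNotFoundError:
            # Already gone (for example from an earlier, interrupted run).
            log.warning("already missing: %s", path)
        except OSError as exc:
            log.error("could not delete %s: %s", path, exc)
            failed += 1
            continue
        else:
            log.info("deleted: %s", path)
            deleted += 1
        # Only reached when the file no longer exists on disk.
        conn.execute("UPDATE images SET purged = 1 WHERE id = ?", (row_id,))
        conn.commit()
    return deleted, failed
```

    Committing after every row is the simplest way to make a restart safe. With around a million rows it may be worth committing every few thousand rows instead, at the cost of re-checking a few already deleted files after a crash.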
     
  24. Charles Kozierok

    Charles Kozierok Elite Member

    Well, I prefer KISS myself. If there's a way to do it with a bunch of delete commands spit out from a quick Awk script, that's what I'm doing. :)
     
  25. beginner99

    beginner99 Platinum Member

    Creating batch files from script, then running those batch files seems more complex to me than doing it all in 1 single script...