Need File Archiving Suggestions

EXCellR8

Diamond Member
Sep 1, 2010
I back up our project files at work on a regular bi-weekly basis. Nothing fancy, just copying folders to other drives. However I also want to keep monthly increments as well if we ever need to fall back on older data.

My question is, what is my best option for archiving the monthly backups, preferably with some compression? ...as it is several hundred gigs' worth of stuff. I'd also like to be able to extract single files from an archive without unpacking everything, if possible.

Open to suggestions and/or feedback, thanks!
 

code65536

Golden Member
Mar 7, 2006
I'd also like to be able to extract single files from an archive without unpacking everything, if possible.

I don't have any specific recommendations, but how much other data needs to be decompressed in order to access a particular file depends on the size of data that share the dictionary.

With dictionary-based compression (which is pretty much everything: zip, rar, 7zip, gzip, et al.), there is a "dictionary" that starts out blank and is gradually filled with patterns that the compressor encounters. The more patterns in the dictionary that the compressor can draw from, the better the compression, which is why you get better compression when you increase the dictionary size in 7-Zip and WinRAR.

With Zip, the dictionary is reset with each file. The advantage of not resetting the dictionary is that the work done to compress file #1 can help compress file #2 (or #3 or any subsequent file with which dictionary data might be shared). If file #1 and #2 are very different, then there really isn't any advantage to not resetting. If #1 and #2 are very similar, however, then there would be huge savings: to the point where a huge file can be reduced to virtually nothing if it's virtually identical to the file that preceded it.
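The effect is easy to see with Python's zlib (a toy illustration of solid vs. non-solid, not a recommendation of any particular archiver):

```python
import os
import zlib

# Two hypothetical near-identical "files": random data, so neither is
# compressible on its own, but the second is a copy of the first.
file1 = os.urandom(10_000)
file2 = file1

# Non-solid: each file is compressed with its own fresh dictionary,
# so the total is roughly the sum of the two raw sizes.
non_solid = len(zlib.compress(file1)) + len(zlib.compress(file2))

# Solid: one continuous stream, so the dictionary built while compressing
# file1 lets file2 be encoded as back-references to data already seen.
solid = len(zlib.compress(file1 + file2))

print(non_solid, solid)  # solid comes out at roughly half the size here
```

The same back-reference mechanism is what makes solid RAR/7z archives so effective on folders full of similar files.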

The advantage of resetting the dictionary is that it makes random file extraction much easier. Say you want to extract file #2. If each file was compressed with its own individual dictionary, then you don't care about any other file, and you can just grab file #2. If it shared a dictionary with file #1, then the extractor must rebuild file #1's dictionary state before it can work on file #2. Also, any corruption in the archive would damage everything following that point of corruption, up to the end of that dictionary block. So with a resetting dictionary, the damage would be limited to just that file, but with a non-resetting dictionary, the corruption would damage that file and every file that comes after it.

Zip resets after each file.

WinRAR can either reset after each file (default) or never reset ("solid" compression).

7-Zip tries to strike a balance, offering always-resetting ("non-solid"), never-resetting ("solid"), and a compromise option where the dictionary is reset after X number of bytes, so if the 7z solid block size is set to 128MB, then you will have to decompress, at most, 128MB of preceding data before you can access a file.
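That reset-after-X-bytes idea can be sketched with Python's zlib, whose Z_FULL_FLUSH resets the dictionary at a chosen point so a decompressor can start cold from there (a sketch of the concept only, not 7-Zip's actual container format):

```python
import os
import zlib

block1, block2 = os.urandom(1_000), os.urandom(1_000)

# Raw deflate (wbits=-15) so each block stands alone, with no stream header.
c = zlib.compressobj(9, zlib.DEFLATED, -15)
part1 = c.compress(block1) + c.flush(zlib.Z_FULL_FLUSH)  # dictionary reset here
part2 = c.compress(block2) + c.flush()

# Because of the reset, block2 can be recovered without touching part1 at all:
d = zlib.decompressobj(-15)
recovered = d.decompress(part2)
assert recovered == block2
```

This is the same trick random-access gzip variants use: the smaller the interval between resets, the less preceding data you ever have to decompress, at the cost of some compression ratio.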

Compressed tarballs (tar.gz, tgz, tbz, etc.), or zip/rar/etc. applied to a disk image, are also effectively solid (non-resetting), since all the individual files are coalesced into a single stream before being compressed.
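The random-access difference shows up directly in Python's standard library: zipfile can seek straight to one member, while reaching a member inside a tar.gz means decompressing the stream up to it (a toy demonstration):

```python
import io
import tarfile
import zipfile

files = {f"file{i}.txt": (f"contents {i}\n" * 100).encode() for i in range(3)}

# Zip: per-file compression, so one member can be read directly.
zbuf = io.BytesIO()
with zipfile.ZipFile(zbuf, "w", zipfile.ZIP_DEFLATED) as zf:
    for name, data in files.items():
        zf.writestr(name, data)
with zipfile.ZipFile(zbuf) as zf:
    one = zf.read("file2.txt")  # seeks straight to that member

# tar.gz: one solid gzip stream; reaching file2 decompresses everything before it.
tbuf = io.BytesIO()
with tarfile.open(fileobj=tbuf, mode="w:gz") as tf:
    for name, data in files.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))
tbuf.seek(0)
with tarfile.open(fileobj=tbuf, mode="r:gz") as tf:
    two = tf.extractfile("file2.txt").read()

assert one == two == files["file2.txt"]
```

Both calls return the same bytes, but the tar.gz path had to churn through the whole preceding stream to get there.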
 

EXCellR8

Diamond Member
Sep 1, 2010
hmmm, this may be trickier than i thought then... maybe i'll just leave the data uncompressed and just bug the boss man for more drives. that way everything will just be plug and play quick. as the overall size of the data increases i'll have to come up with something else.

thanks for all that info btw... good to know
 

Chiefcrowe

Diamond Member
Sep 15, 2008
That is pretty interesting, did not know this. So the "word sizes" are essentially blocks of bytes, right?

I quickly looked at 7-Zip and didn't see the option for the reset settings.

 

ignatzatsonic

Senior member
Nov 20, 2006
If it's only "several hundred gigs" of data, I don't see any need to compress it or use any fancy schemes or complications. A one terabyte drive is 60 or 70 bucks.

Use an ordinary "file by file" backup program, rather than imaging. The first run might take hours, but thereafter it should be down to seconds or minutes.

I use FreeFileSync from sourceforge.net, but there's a bunch of apps that do the same thing. You include or exclude based on folder, file name, extension, etc.
 

code65536

Golden Member
Mar 7, 2006
hmmm, this may be trickier than i thought then

Not really. I was just addressing one specific question/requirement you had. The takeaway is that if you want to preserve easy random access, then 1) don't use compressed tarballs (or anything like that), and 2) if using RAR, make sure solid archiving is unchecked; if using 7-Zip, make sure you're non-solid or have a small solid block size. For other compression programs, options may vary, but once you know the gist of what you're looking for, it should be easy to identify.

I often use WinRAR with "Fastest" compression and solid turned off--it's pretty fast, preserves random access, and while the compression isn't great compared to best/solid, it does take care of the "low-hanging fruit" as far as compression is concerned and saves enough space that it's worthwhile.

That is pretty interesting, did not know this. So the "word sizes" are essentially blocks of bytes right?

I quickly looked at 7-Zip and didn't see the option for the reset settings.

No, word size is a different setting. 7-Zip offers three different settings.

Dictionary size: Bigger is better, but bigger is slower and increases the RAM requirements for both compression and decompression. (Zip's biggest weakness is a tiny dictionary--just 4KB for the legacy "implode" method, and only 32KB for the standard Deflate method--but keep in mind that Zip was developed back in the days of DOS, when memory was measured in hundreds of kilobytes. This is also why solid [no-reset] compression doesn't make much sense in Zip: if you're only discarding a few KB of dictionary per file, that's not going to make a huge difference with modern file sizes.)
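Python's zlib exposes the Deflate window (dictionary) size through its wbits parameter, which makes the trade-off easy to demonstrate (illustrative only; 7-Zip's LZMA dictionaries scale far larger than Deflate's):

```python
import os
import zlib

chunk = os.urandom(2_048)
data = chunk * 16  # the data repeats every 2 KB

def deflated_size(payload: bytes, wbits: int) -> int:
    """Compress with a given window size (2**wbits bytes) and return the size."""
    c = zlib.compressobj(9, zlib.DEFLATED, wbits)
    return len(c.compress(payload) + c.flush())

tiny = deflated_size(data, 9)    # 512-byte dictionary: the 2 KB repeats are out of reach
large = deflated_size(data, 15)  # 32 KB dictionary: every repeat is found

print(tiny, large)  # the bigger dictionary wins by a wide margin
```

With the small window the data looks incompressible; with the full window, everything after the first chunk collapses into back-references.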

The dictionary reset in 7-Zip is controlled by the "Solid Block size" option, with "non-solid" being always-reset-per-file, "solid" being never-reset, and the various sizes in between being reset-after-X-bytes.

I'm not 100% sure what Word size controls, but I think it controls how 7-Zip chunks the data for finding patterns.

Say you have the following data:
0123456789abcdef0123456789abcdef0123456789abcdef
With a word size of 8 bits, it'll look like this:
01 23 45 67 89 ab cd ef 01 23 45 67 89 ab cd ef 01 23 45 67 89 ab cd ef
And with a word size of 64 bits, it'll look like:
0123456789abcdef 0123456789abcdef 0123456789abcdef

In this example, the pattern repeats every 64 bits, so a 64-bit word size would be more efficient for finding patterns. Larger isn't necessarily better with word size--the optimal word size depends on the data being compressed. In practice, though, different word sizes rarely make a huge difference (in part because most files have repeating patterns of different lengths), so it's fine to just leave it at the default and not bother with it.
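The chunking in the example above can be mimicked in a few lines of Python (purely illustrative; real compressors match variable-length runs, not fixed-size chunks):

```python
s = "0123456789abcdef" * 3  # the sample data from the example

def words(data: str, nibbles: int) -> list[str]:
    """Split the string into fixed-size chunks (2 hex digits = 8 bits)."""
    return [data[i:i + nibbles] for i in range(0, len(data), nibbles)]

print(words(s, 2))   # 8-bit words: 24 chunks, repeating every 8 chunks
print(words(s, 16))  # 64-bit words: 3 chunks, all identical

# With the 64-bit word size the whole string collapses to one unique pattern:
assert len(set(words(s, 16))) == 1
```

At the 8-bit word size the repeat is spread across eight distinct chunks; at 64 bits it becomes a single recurring pattern, which is exactly why the larger word size suits this particular data.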