need fast, efficient way to consolidate a lot of little files

NTB

Diamond Member
Mar 26, 2001
I've already come up with a couple different ways to do this, and I don't doubt that you guys can come up with a couple more. Here's the situation:

It's something for work, and let's just say that I work for a rather large company ;). The data my team needs for one of our apps is not immediately available at our office; instead, it's sent up every morning as a bunch of pipe-delimited text files. To make things easier for a couple of mainframe programmers, I've been given the task of consolidating all of these little files into a single, larger, fixed-record-length file that a mainframe job can pick up, rather than having to hunt for all the individual files.

The original data files average a few KB apiece. That doesn't sound like much, but in a worst-case scenario I could be looking at processing 3,500+ of them in a single run. Under normal circumstances the number of files is anywhere from a few hundred up to about 1,000.
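
To give a sense of the conversion itself, each pipe-delimited record has to be padded out into fixed-width columns. Something along these lines is the general idea (the three fields and the 10/30/8 widths are made up for illustration, not our real layout):

# pad/truncate each pipe-delimited field into a fixed-width column
awk -F'|' '{ printf "%-10.10s%-30.30s%-8.8s\n", $1, $2, $3 }' onefile.txt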

I might be nit-picking considering the server this will be running on (multi-CPU HP-UX box), but it's still good-to-know information, especially if I ever have to work on a less powerful machine. So, anybody have any suggestions? Do I:

a) create the consolidated file and then format that?

b) read and format the little files and append to the consolidated file as I go?

c) something else?

Nathan
 

esun

Platinum Member
Nov 12, 2001
I'm guessing (b) would be faster, but I'm not entirely sure. It's almost always a good idea to process data in chunks as large as possible, but having to process everything twice may be worse overall (which is what I think would happen here, even if the first pass is just consolidating the small files).

Honestly, it should be pretty trivial to change from method (a) to method (b) and back, so it may be a good idea to implement one, time it, convert it to the other version, time that, and compare.
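
If you do go that route, the shell's time command is plenty for the comparison; something like this, where the two script names are just stand-ins for whichever way you end up wiring each version up:

time ./consolidate_then_format.sh   # version (a): build the big file first, then format it
time ./format_as_you_go.sh          # version (b): format each little file and append as you go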
 

sourceninja

Diamond Member
Mar 8, 2005
That is what I was thinking myself: cat everything into a single file, then format it as you would the individual files.
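
Roughly like this (yourformatter is just a placeholder for whatever program ends up doing the fixed-width conversion):

cat *.txt > combined.tmp                          # pass 1: glue the little files together
yourformatter < combined.tmp > consolidated.dat   # pass 2: reformat the one big file
rm combined.tmp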
 

QED

Diamond Member
Dec 16, 2005

If you write your formatting program to take input on standard input and send it to standard output, then just do:

cat *.txt | myformatter.pl > myoutput.file

This will be just about as fast as you can do it.

If your formatter is written in Perl and reads its input with the <> operator, it will read any files named on the command line automatically (falling back to standard input when there are none), so this might be an even faster option since it skips the extra cat process:

myformatter.pl *.txt > myoutput.file


The only problem is that on some systems expanding *.txt to 3,500+ filenames can blow past the shell's argument-length limit ("Argument list too long"), so you might have to use a "find" command instead. The find itself may enumerate the files faster than the "*.txt" glob, but having to run "cat" once per file (3,500 times) instead of just once makes this method MUCH slower, so use it only if necessary:

find . -name \*.txt -exec cat \{\} \; | myformatter.pl > myoutput.file
 

esun

Platinum Member
Nov 12, 2001
If you want to use QED's method using the find command, look up the command xargs. That will prevent you from having to do something horribly inefficient such as cat'ing each file individually.
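
Something along these lines, assuming none of the filenames contain whitespace:

find . -name '*.txt' | xargs cat | myformatter.pl > myoutput.file

xargs packs as many filenames as the argument-length limit allows into each cat invocation, so cat runs a handful of times instead of 3,500.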