MATLAB: Dealing with increasing delay when converting binary files to text files

magomago

Hi,

I've been converting masses of binary files to text files for a while. Using some code I wrote, I initially convert these binary files (they're just full of numbers I look at on a daily basis - the data is split up into smaller files as opposed to one monster file) at a rate of 3 files/min. However, as time goes on, the whole process slows down a LOT -> after 3 hours, I'm down to about 1 file every 10 minutes.

I thought I had a memory leak somewhere or wasn't closing the binary files properly as I loaded up new ones, but that doesn't seem to be the case. I threw in some extra commands to try to force-close any files that were open, but I still get this inexplicable slowdown in file conversion. I looked at memory usage and nothing seems to be building up - I just get regular spikes as each .dat file is loaded, processed, and closed.


Here is the basic code with very little modified (FYI, I'm not a programmer at all so excuse any ugliness in my code).

Btw, I'm aware that with STUFF0 I'm using a vertcat operation, which is slow, but at most that array is only a few thousand elements long, and growing from 3000x1 to 3001x1 via vertcat shouldn't be creating minutes of slowdown.

Even though I specifically close the file using its FID, I also purposely force an fclose('all') and then set FID=[] to see if blanking it out helps.

I'd appreciate any help, thanks!



STUFF1 = -56*ones(XXXX,1);
STUFF2 = -56*ones(XXXX,1);
for i = 1:length(Files)
    [FID,MESSAGE] = fopen(char(Files(i)));  % open the ith file
    status = fseek(FID,0,'eof');            % move the position indicator to the end of the file ('eof' origin)
    length_of_file = ftell(FID) - 1;        % -1 because the last byte index is the EOF, which simply marks the end of the file

    % check that the seek succeeded
    if status == 0
        disp('Binary File loaded Successfully')
    elseif status == -1
        disp('Failed Loading of Binary File')
        return
    end

    fseek(FID,0,'bof');                     % back to the beginning of the file

    fileversion = fread(FID,1,'uchar');     % pull out fileversion - should always be 1
    timestamp   = fread(FID,1,'uint64');    % timestamp in msec counted from 1970

    while ftell(FID) < length_of_file
        STUFF0 = [STUFF0; fread(FID,1,'uint32')];

        STUFF1(u:u+49) = fread(FID,50,'float32');

        ftell(FID);

        STUFF2(u:u+49) = fread(FID,50,'float32');
        TEMP = [TEMP; STUFF0(end) STUFF1(u:u+49)' STUFF2(u:u+49)'];
        u = u + 50;
    end
    % save([PATHNAME 'HIHI' num2str(i) '.txt'],'-ascii','-double','-tabs','TEMP')

    disptitle = ['File #' num2str(i) ' Complete'];
    disp(disptitle)
    fclose(FID);
    fclose('all');

    pause(5)
    FID = [];
end
 

Cogman

There are a couple of problems that I see here.

For starters, don't fclose('all'); it makes no sense to do this since you are already closing each file after you're done with it.

Next, this little ditty

STUFF0=[STUFF0; fread(FID,1,'uint32')];

is extremely slow. When pulling stuff from the HD (or any memory) you should pull it in sequential order. Pulling it out of order like this will make things go very slowly. STUFF0 should never change, so why not do it once before the loop? If you need to copy it, do something like

STUFF3 = fread(FID,1,'uint32')
before the loop and
STUFF0 = [STUFF0; STUFF3];

inside the loop. This will prevent excessive seeking around the hard disk.

Next, what are you using ftell inside the loop for? It isn't doing anything. In fact, since you are only using ftell to indicate if you have hit the end of file, I would suggest using feof.
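Something like this is what I have in mind - just a sketch using the names from your code, and keep in mind that in MATLAB feof only reports end-of-file after a read has actually run past it:
Code:
while ~feof(FID)
    nextval = fread(FID,1,'uint32');   % leading uint32 of the next record (hypothetical name)
    if feof(FID)                       % feof only trips after a read runs past EOF,
        break                          % so check it right after that first read
    end
    % ... append nextval to STUFF0 and read the two 50-element float32 blocks as before ...
end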

I don't know that this will fix your problem (it definitely sounds like you have a memory leak somewhere, which I don't see right off the bat), but it should make it run faster.

You should know that because MATLAB uses C for its file I/O, FID is just a handle - an identifier that links fread to the file - not an array of any kind. So all this stuff with trying to close all, setting it to a blank array (which might actually use more memory), etc. is misdirected. It isn't taking up any room.
 

magomago

Cogman, thanks for the reply.

I initially did not use fclose('all'), but I threw it in because I was running out of ideas about what was building up. I didn't see any real difference in the file processing time.

As for using vertcat (stuff0=[stuff0; bla bla]), I use it here just because I haven't seen any real-world speed impact. That array is only a few thousand elements long when all is said and done, and running a quick vertcat loop from 1:100,000 took basically no time at all.
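For what it's worth, here is roughly the kind of quick check I ran (just a sketch; the second half is the preallocated version for comparison):
Code:
N = 100000;

tic
a = [];
for k = 1:N
    a = [a; k];                  % grow by vertcat every pass
end
t_grow = toc;

tic
b = zeros(N,1);                  % preallocate once up front
for k = 1:N
    b(k) = k;                    % fill in place
end
t_prealloc = toc;

fprintf('vertcat: %.2f s   preallocated: %.2f s\n', t_grow, t_prealloc)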

I'm confused about the comment that STUFF0 never changes. It does - each new record has a different value, and I toss that new value into STUFF0.

I was using ftell because when I wrote this I had no prior experience with binary data at all, so I was taking it slow :x I'll comment that out, although I doubt it's leading to my problem. I'll look into feof =) Thanks there.


The reason I'm hesitant to say that seek times are impacting the work is that everything initially converts very, very quickly: 3 files/min. Sure, I might get to 4/min, which would be great to optimize for later, but at the moment I'd be happy just keeping a consistent 3 files/min. If I have bad technique in approaching these files (which I'm fine with admitting - I've only taken MATLAB as an engineering course that focused on using it for problem solving rather than on programming - and I'll correct these code problems), I would think it should impact everything equally. Those seek times should be gimping me just as much while the script is converting the first 10 files as when it is converting files 490-500.
But that isn't what I'm getting. Files 0-10 convert ridiculously fast (3/min). But when I look at files 250-300, I see that the Windows timestamps for each file are 10 minutes apart...and that is killing me. And when I look at what happens between files 11-249, I see that it takes progressively longer and longer to convert each file. If I restart the script at file 300, then again I'm running at about 3/min. I'll try to plot these file delays later to see if it's really this linear growth in delay.

On a side note, I have to access my data through a server over a network. However, the actual load on the network doesn't seem to matter - I get the same basic behavior whenever I run my data (whether it's during the day or in the dead of night). I could throw in some tic/tocs to see if the delay is coming from the actual loop or from loading the file....

When I think about it, I need to do exactly what I said earlier: see how long certain key sections of the code take to execute...if my delay is due to the network getting stuffy, then I should see the delay coming from the file loading; if my technique is causing something to build up, then I should see the delay coming from the loop itself.
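Something like this is what I have in mind - just a sketch splitting the timing between opening the file/reading the header and the parsing loop:
Code:
tic
[FID,MESSAGE] = fopen(char(Files(i)));
% ... fseek to the end, ftell, fseek back, read fileversion and timestamp ...
t_load = toc;      % time spent opening the file and reading the header

tic
% ... the while loop that parses the records ...
t_loop = toc;      % time spent inside the parsing loop

fprintf('File %d: load %.1f s, loop %.1f s\n', i, t_load, t_loop)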

As a final note, this script is running on MATLAB 2006a.
 

CycloWizard

I don't use binary files for anything in MATLAB since I work with relatively small datasets, but I can say that the vertcat can take a very long time. The "better" way to do this is to preallocate an array which will always be larger than your dataset (e.g. data=zeros(1e5,1)). Then, you simply count the length of actual entries in the array (or use find() to locate the end of the data, though this can require some knowledge about the type of data you're dealing with) and snip the size of the array at the end. In some cases, this has improved my execution time by orders of magnitude, even with relatively small data sets. How big are the sets you're looking at?
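Here's the pattern I mean, as a minimal sketch (rand() just stands in for whatever values you're actually reading, and the numbers are made up):
Code:
vals = rand(3000,1);             % stand-in for values arriving one at a time
data = zeros(1e5,1);             % preallocate well beyond the expected size
n = 0;                           % count of entries actually filled
for k = 1:numel(vals)
    n = n + 1;
    data(n) = vals(k);           % fill in place instead of growing with vertcat
end
data = data(1:n);                % snip the unused tail once at the end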
 

degibson

(quoting magomago's original post: the conversion rate drops from 3 files/min to about 1 file per 10 minutes after a few hours)

(Assume these are big files, and assume MATLAB is not great at optimizing file I/O.)
Track MATLAB's memory usage over time. When you start to saturate memory with a lot of OS-side or MATLAB-side buffered file I/O, you put increased pressure on the disk to actually sink the writes. Over time, performance will degrade quite naturally.
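A rough way to do that from inside the script (just a sketch - whos only counts workspace variables; if your MATLAB release has the memory command, that gives a fuller picture on Windows):
Code:
memlog = zeros(length(Files),1);          % one entry per converted file
for i = 1:length(Files)
    % ... convert file i as before ...
    w = whos;                             % every variable in the current workspace
    memlog(i) = sum([w.bytes]) / 2^20;    % MB currently held by those variables
    fprintf('File %d: %.0f MB in workspace\n', i, memlog(i))
end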
 

magomago

How big are the sets you're looking at?

100-1000 raw binary files @ ~1.5 megs each.

So I also threw in some tocs to find out where the time was being eaten up, and it's definitely in the execution of the loop. You literally see it climb like a mountain - it starts off roughly linear between files 0-15 (the foothills), and then from my 16th file up through the 30th (I was working with a smaller subset) I was seeing huge increases in the time to read it all - it just climbed up and up.
I'm now running through more data, but throwing in tocs within that while loop to see what is causing it. If it's really vertcat that's causing me woes, I'mma take it out back and shoot it haha. But I'd be confused why vertcat would be okay when parsing the first 15 files but then have issues on the later files (even though I close each file and clear my memory before opening the next one)...

wait....

do I even clear my STUFF0 matrix before I execute each time?

Or is it getting ever larger and larger?

*checks the code*



awwwwwwwwwwwwwwwwwwwwwwwwwwww :(


Now it makes sense.....I have enough memory that I can easily have an ever-increasing STUFF0 matrix (I initialize it once before the main loop that cycles through files, and NOT before the while loop that cycles through each smaller file. I only posted the portion of code that is clearly causing my grief, which is why you don't see STUFF0 initialized at all).
In the beginning, I'm seeing some small (albeit we're talking on the level of seconds) increase in the duration of file parsing as STUFF0 gets larger. But I must be hitting a point where I start running into the limits of free memory and start thrashing the HDD over and over as the data grows, or MATLAB just starts to suck at managing huge amounts of actively used memory, and I get huge increases in processing times....at least it makes sense to me.

I'm going to try the same set of data twice, once clearing STUFF0 right before the while loop and once as it is...and record the times it takes to execute the loop.

If things are as I think they are, I should see the same slowdown pattern repeat, not because vertcat sucks and is slow (which it is), but because the length of that matrix is getting fugly fugly ugly.
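For reference, this is roughly what the "clear it per file" version looks like (just a sketch - I'm assuming the parts I didn't post, where TEMP and u get set up, are reset the same way):
Code:
for i = 1:length(Files)
    STUFF0 = [];                          % start empty for every file
    TEMP   = [];                          % assumed per-file as well
    u      = 1;                           % restart the index into STUFF1/STUFF2
    [FID,MESSAGE] = fopen(char(Files(i)));
    % ... same header reads and while loop as before ...
    fclose(FID);
end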

edit:
I now remember WHY I chose to keep STUFF0 contiguous rather than reinitialize it, but looking back at it, it wasn't a good choice...

Now let's see what the numbers say!
 

CycloWizard

Yes, that would do it. :p Even if you do clear it each time, you will get much faster performance by preallocating, so it might be worth the couple of lines of code even if you have plenty of RAM. 1.5 MB files should be small enough to deal with using higher-level functions (dlmread/dlmwrite), though you would have to get the data to ASCII first; that's obviously not necessary and could be slower than the present approach. All you should need to do to read in a binary file is as follows:
Code:
fid = fopen('file.bin');
data = fread(fid);
fclose(fid);
This should put the contents of file.bin into the array data. You can then split up the columns and assign data types as necessary. I think this will probably be a lot faster than what you're doing now, as I can't imagine it taking more than 1-2 seconds unless you're working with a very slow hard drive. Check this out for more examples of how to do this (I don't think any of these functions has changed since 2006).
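If you go that route, here's a rough sketch of splitting one bulk read into the fields your loop currently pulls out one record at a time. This is just a guess at the layout from your posted code (1-byte version, 8-byte timestamp, then repeated records of one uint32 plus two 50-element float32 blocks), and it assumes your release has typecast:
Code:
fid = fopen('file.bin');
raw = fread(fid, inf, '*uint8');               % whole file as raw bytes
fclose(fid);

fileversion = raw(1);
timestamp   = typecast(raw(2:9), 'uint64');
body        = raw(10:end);

recbytes = 4 + 50*4 + 50*4;                    % bytes per record
nrec     = floor(numel(body)/recbytes);
body     = reshape(body(1:nrec*recbytes), recbytes, nrec);

stuff0_vals = typecast(reshape(body(1:4,:), [], 1), 'uint32');                        % one uint32 per record
stuff1_cols = reshape(typecast(reshape(body(5:204,:),   [], 1), 'single'), 50, nrec); % first 50-float block
stuff2_cols = reshape(typecast(reshape(body(205:404,:), [], 1), 'single'), 50, nrec); % second 50-float block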