Need help with parsing a very large file

Turkey

Senior member
I need to parse an extremely large file at work and store the data in some structured data objects. The program is being written in C++ for a Unix-based system. Previous iterations of this program have exhausted the program's 4GB virtual memory space storing the info in the data objects, so memory is at a premium here.

The question is: should I use demand-based parsing where the data objects ask for further parsing from the parser object when they need it, or a full parsing scheme where the parser object fills in the data structures and hands back a pointer at the end?

The file is basically newline delimited.
 
I think you should probably do the full parse, but don't try to keep all the objects in memory at once. Write the objects themselves to disk (I can't remember whether C++ has an equivalent of Java's Serializable, but I imagine something similar exists), and maintain a hash table so you can get to the handle and filename for each object quickly. After you've parsed the whole file and written everything to disk, you can load objects up on demand, cap the maximum amount of memory they're allowed to occupy, and evict them from memory based on expected demand.
 
Sounds like your parser needs to build a temporary random-access flat-file DB of the structured objects, perhaps returning an array of file offsets instead of memory pointers? The only limit then is disk space. Of course, you'll also need functions to read back an individual record given a file offset, and possibly to store a changed object back to the file.

If the DB file is temporary, you might get by with doing a binary write of your class/struct object (using sizeof(my-class-name)) instead of truly serializing it, but that assumes your class/struct is fixed-size and isn't doing any memory allocation of its own.

(text file) --> (records file) + (record pointers)
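The pipeline above could be sketched like this, assuming a hypothetical fixed-size `Rec` struct and input lines of the form `id name`:

```cpp
#include <cstdio>
#include <cstring>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical fixed-size record parsed from one text line ("id name").
struct Rec {
    int  id;
    char name[32];
};

// One full pass: parse every line of the text stream into a binary
// records file, returning each record's file offset instead of a pointer.
std::vector<long> build_records(std::istream& text, std::FILE* records) {
    std::vector<long> offsets;
    std::string line;
    while (std::getline(text, line)) {
        Rec r{};
        std::istringstream fields(line);
        std::string name;
        fields >> r.id >> name;
        std::strncpy(r.name, name.c_str(), sizeof(r.name) - 1);
        offsets.push_back(std::ftell(records));
        std::fwrite(&r, sizeof(Rec), 1, records);
    }
    return offsets;
}

// Read back an individual record given its file offset.
Rec read_record(std::FILE* records, long off) {
    Rec r{};
    std::fseek(records, off, SEEK_SET);
    std::fread(&r, sizeof(Rec), 1, records);
    return r;
}
```

Memory use is then just the offset array: 8 bytes or so per record rather than the full object.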

Another method would be to make 1 "spotting" pass through the text file then jump around and re-parse single structs as needed:

(text file) --> (text-file offset of each struct) + optional: (length of text to read for that struct)

This approach requires a lot of re-parsing of single items, but it doesn't use any extra disk space.
 