
Extract certain fields from PDF to a spreadsheet

PuppettMaster001

Golden Member
I have PDF forms, and I would like to keep track of certain fields from them in a spreadsheet. I don't know if this is possible, but is there a way to have the spreadsheet/database keep track of the fields from all of the PDFs in a certain directory? Sorry if I'm not being clear, long night last night.

Short:
Have PDF forms, want to track with spreadsheet.


Thanks

-Gino
 
Depends on what you mean by the fields.

If the PDF is just a document with a certain visual layout, like the invoice for utility service that you get every month, there may not be a specific "field" for the items of interest that is distinct from the text of the labels or anything else on the page.

If the PDF document was authored as a PDF Form that you can actually fill out interactively when you open it, then the field definitions and field values have more structure, and other programs can access them. I think Adobe Acrobat and some of Adobe's other software let you set up a server program to collect data from PDF forms that people fill out over the internet, and I assume that would also work with a local computer acting as both the server and the place where the form is viewed and filled. I believe that kind of collection most likely happens as someone fills out the form and submits it, rather than in batch mode on PDFs that are already stored, but there may be various ways to set it up. They likely have some database connectivity solution too, but I don't know much about their software other than that it is usually expensive and somewhat underwhelming. 🙂

If there are actual forms / fields defined in the document, you can try a tool called pdftk. It dumps the state of the form data in a given PDF file as a plain-text report in a fixed format. You could then run that report through a custom database/spreadsheet import filter to turn it into the data you need for that particular document. You'd also need a program that searches the directories of interest for PDF files and runs that dump / import process on each PDF it finds.
This is the data extraction command AFAIK (form.pdf and fields.txt are placeholder file names):
pdftk form.pdf dump_data_fields output fields.txt
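
For what it's worth, here is a rough Python sketch of that dump / import loop. It assumes pdftk is installed and on the PATH, and that its report is a series of "Key: value" lines with records separated by "---" lines. The function names and the CSV layout (one row per PDF, one column per field name) are my own choices for illustration, not anything pdftk prescribes.

```python
import csv
import subprocess
from pathlib import Path


def parse_field_dump(report: str) -> dict:
    """Turn a pdftk dump_data_fields report into {field name: field value}."""
    fields = {}
    name, value = None, ""
    # The extra "---" is a sentinel so the last record gets stored too.
    for line in report.splitlines() + ["---"]:
        if line.strip() == "---":
            if name is not None:
                fields[name] = value
            name, value = None, ""
        elif line.startswith("FieldName:"):
            name = line[len("FieldName:"):].strip()
        elif line.startswith("FieldValue:"):
            value = line[len("FieldValue:"):].strip()
    return fields


def dump_fields(pdf_path: Path) -> dict:
    """Run pdftk on one PDF (assumes pdftk is installed and on the PATH)."""
    result = subprocess.run(
        ["pdftk", str(pdf_path), "dump_data_fields"],
        capture_output=True, text=True, check=True,
    )
    return parse_field_dump(result.stdout)


def collect_to_csv(pdf_dir: str, csv_path: str) -> None:
    """Write one CSV row per PDF found under pdf_dir (searched recursively)."""
    rows = []
    for pdf in sorted(Path(pdf_dir).rglob("*.pdf")):
        row = {"file": str(pdf)}
        row.update(dump_fields(pdf))
        rows.append(row)
    # Use the union of all field names so differing forms fit one sheet.
    columns = ["file"] + sorted({k for r in rows for k in r} - {"file"})
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)
```

Calling collect_to_csv("forms", "fields.csv") (both names are placeholders) should give you a CSV that any spreadsheet program can open. Fields left empty in a PDF come through as empty cells.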

If the PDF files are basically just text documents with a certain *format*, and do not actually contain fillable or pre-filled field definitions and form values, you will have to create a program that reads each PDF file and parses its layout into the data you seek. It can then save that data as a CSV file, generate SQL statements, or use ODBC or COM/OLE or whatever to send the data to your spreadsheet. Again, you'd need a spider program that crawls the directories and processes each PDF file in turn, whether it runs periodically, on command, or when triggered by something else.
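
A minimal sketch of that read-and-parse step in Python, assuming poppler's pdftotext utility is installed and on the PATH, and assuming the items you care about appear on the page as "Label: value" on a single line. The labels in the usage note are made up; a real invoice layout will probably need its own patterns.

```python
import re
import subprocess


def pdf_to_text(pdf_path: str) -> str:
    """Extract text with poppler's pdftotext (assumed installed, on the PATH)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def extract_labeled_values(text: str, labels: list) -> dict:
    """Find 'Label: value' pairs, one per line, for each requested label."""
    found = {}
    for label in labels:
        pattern = rf"^[ \t]*{re.escape(label)}[ \t]*:[ \t]*(.+?)[ \t]*$"
        match = re.search(pattern, text, flags=re.MULTILINE | re.IGNORECASE)
        # A label that does not appear in the text yields an empty string.
        found[label] = match.group(1) if match else ""
    return found
```

For example, extract_labeled_values(pdf_to_text("invoice.pdf"), ["Account Number", "Amount Due"]) would return a dict that you can write out as one spreadsheet row per PDF.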

PDF libraries like poppler, tools like pdftk, and parser generators like ANTLR or whatever could be used to do the reading / decoding / parsing / report exporting.

None of this is particularly difficult if you have a bit of basic scripting / programming capacity in Visual Basic or Java or Perl or Ruby or whatever.

Unfortunately it is a lot harder than it OUGHT to be. What we should have is not a useless version of electronic paper but an electronic document that is semantically encoded and meaningfully manipulable, so that you CAN just import it into a database and it'd have its own schema and Dublin Core type filters already associated with its instance. Frankly, making things more DIFFICULT than ever before, in an era where IT / computers should be making our lives / information processing tasks EASIER than ever before, is inexcusable. I think PDF / XPS / .DOC / et al. should burn in hell, and people should use semantically meaningful information storage technologies like RDF / DC / XML / et al. instead.

