Extracting info from a messy OCR'd Word file?

etherealfocus

Senior member
I've got a 62-page Word document (scanned from paper as PDF, OCR'd, exported to Word) full of contact information for various clients. I'm trying to extract only names and email addresses for a mailing list. It's in a two-column format; in each entry the name is the first line and the email is the last line.

I found a bunch of threads on the subject of using VBScript to extract data, but I'm a complete non-programmer aside from a couple of comp sci classes a decade ago, and I have no idea how to deal with the two-column issue.

Does this seem like a problem I could reasonably solve with a bit of help or should I just bite the bullet and do it manually? We're gonna be getting a fairly steady stream of these things so I'd really like to automate if possible.

Would it help for me to post a screencap so you guys can see what I'm working with?
 
The name is the first blacked out line. The email is obviously the last.

My guess is that the solution will involve something like telling it to find all instances of a common identifying word like 'Email', going up x lines to get the name and over x characters to get the email. The end goal is to have something we can easily and reliably import into Constant Contact.

Currently we're putting the crunched info into an Excel sheet (columns: email, first name, last name) and another guy is handling the Constant Contact side. Seem reasonable?
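For illustration, here is a minimal Python sketch of that idea (Python is a swapped-in choice here, not the VBScript from the other threads). It assumes the text has already been reduced to a single column, that entries are separated by blank lines, and that a first name is everything before the first space. The function names and the sample data are made up for the example.

```python
import csv
import io
import re

# Loose pattern for an email address; OCR errors may still slip past it.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def split_records(text):
    """Group non-blank lines into records separated by blank lines (assumed layout)."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            records.append(current)
            current = []
    if current:
        records.append(current)
    return records


def extract_contacts(text):
    """Return (email, first name, last name) tuples, one per usable record."""
    rows = []
    for record in split_records(text):
        match = EMAIL_RE.search(record[-1])  # email is on the last line
        if not match:
            continue  # no address found; such records need manual review
        first, _, last = record[0].partition(" ")  # name is on the first line
        rows.append((match.group(0), first, last.strip()))
    return rows


def to_csv(rows):
    """Render rows as CSV text with the columns used in the Excel sheet."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["email", "first name", "last name"])
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical sample data, not taken from the real document.
SAMPLE = """Jane Doe
123 Main St
Email: jane.doe@example.com

John Q Public
Email: jpublic@example.org
"""

if __name__ == "__main__":
    print(to_csv(extract_contacts(SAMPLE)))
```

The CSV text can be saved to a file and opened in Excel. Records where no address is found are skipped silently in this sketch, so a real run would want to log them for proofing.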
 
The best way to go would be to use a good OCR program and then a script combined with some manual retouching.

You can probably get it done cheap on one of the freelancing sites like vWorker.com.
 
You will have to manually proof anyhow.

The better the OCR program, the fewer corrections will be needed.

Especially if the original input is a fixed typeset.
 
Already got good OCR from Acrobat X Pro. The PDF is fairly clean. Some errors but I'll fix em later. Half the goal of this is to have something I can easily tweak for similar future projects. Is this something I could practically figure out on my own with a little help, or no?
 
It will be a lot easier if the output can go to a text file instead of Word. Then you do not have to go through the headache of parsing out Word formatting characters.
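As a rough sketch of what plain text makes possible: if the text export keeps the two columns side by side on each line (an assumption; it depends on how the file is saved), the right-hand column can be cut off at a fixed character position and stacked under the left-hand one. The function name, the split position, and the sample lines below are all hypothetical.

```python
def unstack_columns(text, split_at):
    """Turn side-by-side columns into one column: all left halves, then all right halves.

    split_at is the character position where the right column starts
    (assumed constant for the page; it has to be found by inspection).
    """
    left, right = [], []
    for line in text.splitlines():
        left.append(line[:split_at].rstrip())
        right.append(line[split_at:].strip())
    # A blank line keeps the two columns apart in the combined output.
    return "\n".join(left + [""] + right)


# Hypothetical two-column page, with the right column starting at position 25.
SAMPLE = "\n".join([
    "Jane Doe".ljust(25) + "John Q Public",
    "Email: jane@example.com".ljust(25) + "Email: jp@example.org",
])

if __name__ == "__main__":
    print(unstack_columns(SAMPLE, 25))
```

If the OCR output does not line the columns up consistently, a fixed split position will cut some lines in the wrong place, so the result still needs a manual check.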
 