Extracting info from a messy OCR'd Word file?

etherealfocus · Aug 2, 2012

I've got a 62-page Word document (scanned from paper as PDF, OCR'd, exported to Word) full of contact information for various clients. I'm trying to extract only the names and email addresses for a mailing list. It's in a two-column format; within each entry, the name is the first line and the email is the last line.

I found a bunch of threads on using VBScript to extract data, but I'm a complete non-programmer aside from a couple of comp sci classes a decade ago, and I have no idea how to deal with the two-column issue.

Does this seem like a problem I could reasonably solve with a bit of help or should I just bite the bullet and do it manually? We're gonna be getting a fairly steady stream of these things so I'd really like to automate if possible.

Would it help for me to post a screencap so you guys can see what I'm working with?

Charles Kozierok · Aug 2, 2012

I've done this type of work before. Post the screenshot so we can see what it looks like.

etherealfocus · Aug 2, 2012

http://checker.uphero.com/sample.jpg

etherealfocus · Aug 2, 2012

The name is the first blacked out line. The email is obviously the last.

My guess is that the solution will involve something like telling it to find all instances of a common identifying word like 'Email', going up x lines to get the name and over x characters to get the email. The end goal is to have something we can easily and reliably import into Constant Contact.
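
For anyone following along, here is a rough Python sketch of that idea. It is only an illustration under some assumptions that are not confirmed by the sample: the document has been saved as plain text, each contact is a block of lines separated by a blank line, and the email address sits somewhere on the last line of its block. Instead of a fixed "x lines", it walks up from the email line until it hits a blank line. The names `extract_contacts` and `EMAIL_RE` and the sample data are made up for the example.

```python
import re

# Loose email pattern (an assumption; OCR errors may need something more forgiving).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def extract_contacts(text):
    """Return (name, email) pairs from blank-line-separated contact blocks."""
    lines = text.splitlines()
    contacts = []
    for i, line in enumerate(lines):
        match = EMAIL_RE.search(line)
        if not match:
            continue
        # Walk upward to the first line of this block, assumed to be the name.
        start = i
        while start > 0 and lines[start - 1].strip():
            start -= 1
        contacts.append((lines[start].strip(), match.group(0)))
    return contacts

# Hypothetical sample data, not taken from the real document.
sample = """Jane Doe
123 Main St
Email: jane.doe@example.com

John Smith
Phone: 555-0100
Email: jsmith@example.org
"""

print(extract_contacts(sample))
```

Matching on the address pattern rather than the word 'Email' means a misread label does not lose the entry, though a misread address still needs manual fixing.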

Currently we're putting the crunched info into an Excel sheet (columns: email, first name, last name) and another guy is handling the Constant Contact side. Seem reasonable?
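
As a sketch of that output step, something like the following could write the three columns as CSV, which Excel opens directly. It assumes the names and emails have already been pulled out as pairs, and it assumes the first word of a name is the first name and everything after it is the last name, which will be wrong for some people. `split_name` and `contacts_to_csv` are hypothetical helper names.

```python
import csv
import io

def split_name(full_name):
    """Split on the first space: first word is the first name, the rest is the last name."""
    first, _, last = full_name.strip().partition(" ")
    return first, last

def contacts_to_csv(contacts):
    """Build CSV text with the columns email, first name, last name."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["email", "first name", "last name"])
    for name, email in contacts:
        first, last = split_name(name)
        writer.writerow([email, first, last])
    return buffer.getvalue()

# Hypothetical contact, for illustration only.
print(contacts_to_csv([("Jane Doe", "jane.doe@example.com")]))
```

To save a real file, the same writer can be pointed at `open("contacts.csv", "w", newline="")` instead of the in-memory buffer.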

Charles Kozierok · Aug 2, 2012

The best way to go would be to use a good OCR program and then a script combined with some manual retouching.

You can probably get it done cheap on one of the freelancing sites like vWorker.com.

EagleKeeper · Aug 2, 2012

You will have to manually proof anyhow.

The better the OCR program, the fewer corrections will be needed.

Especially if the original input is a fixed typeset.

etherealfocus · Aug 2, 2012

Already got good OCR from Acrobat X Pro. The PDF is fairly clean. Some errors but I'll fix em later. Half the goal of this is to have something I can easily tweak for similar future projects. Is this something I could practically figure out on my own with a little help, or no?

EagleKeeper · Aug 6, 2012

It will be a lot easier if the output can go to a text file instead of Word. Then you do not have to go through the headache of parsing out Word formatting characters.
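
Once it is plain text, the two-column layout is one way to approach with a short script. The sketch below assumes the text export keeps both columns side by side on each line, separated by a run of three or more spaces; that depends entirely on how the file is exported and may not hold here. It splits each line into a left and right part and stacks the right column under the left one. `unstack_columns` and the `gap` parameter are names invented for this example.

```python
import re

def unstack_columns(text, gap=3):
    """Turn side-by-side columns into one column: all left parts, then all right parts."""
    # A run of `gap` or more spaces is treated as the gutter between columns.
    splitter = re.compile(r" {%d,}" % gap)
    left, right = [], []
    for line in text.splitlines():
        parts = splitter.split(line.rstrip(), maxsplit=1)
        left.append(parts[0].strip())
        right.append(parts[1].strip() if len(parts) > 1 else "")
    # A blank line keeps the two columns apart in the output.
    return "\n".join(left + [""] + right)

# Hypothetical two-column sample.
two_col = (
    "Jane Doe                   John Smith\n"
    "Email: jane@example.com    Email: john@example.org\n"
)
print(unstack_columns(two_col))
```

Lines where the left column is indented by three or more spaces, or where the columns are separated by tabs, would be split in the wrong place, so the result still needs a quick look before it goes any further.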
