Anyone know of an email extractor that works with DOC files?

Torghn

Platinum Member
I've got about 3000 doc files, and I need to extract an email address from each of them. I've found many programs that can read txt and rtf files, but none that can read doc files. Anyone know of a program that can do this?
 
Well, there are a number of ways to do it, but here's what I would do:

I would record a simple macro that saves the doc as a text file, then later process those text files. You could then issue a simple command like the following at the CLI:

for %i in (*.doc) do winword "%i" /mYOURMACRONAME

The 'm' switch to winword just tells it to run the macro. You would then have text files that you could process as normal. I would personally use Perl to extract the email addresses using something like the following:

cat *.txt | perl -nle "print $1 while /([\w.-]+\@[\w-]+(?:\.[\w-]+)*\.[a-z]{2,})/gi" > emails.txt

I didn't run this myself obviously, so don't berate me for my ad-hoc syntax. I think at least the first part should be useful to you.
 
wtf does that do? 😕
 
It concatenates all the txt files in the directory and pipes the result to perl, which uses the regular expression to extract the email address from each line (if present); the output is then redirected into the file emails.txt. Like I said, it was just an example of what you could do, and I acknowledge that not everyone would want to do it this way. The benefit is the time saved, but it comes at a learning cost for those not familiar with it.

[edit]Oh, and it does work.[/edit]
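
For anyone who would rather not use Perl, here is a rough Python sketch of the same idea: read every txt file in a directory, pull out anything that looks like an email address, and write the results to emails.txt. The names extract_emails and extract_from_dir are made up for this example, and the pattern is only an approximation of what an address looks like, not a full validator.

```python
import re
from pathlib import Path

# Illustrative pattern (an assumption, not a full RFC 5322 validator):
# local part, "@", one or more domain labels, then a TLD of 2+ letters.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)*\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return every email-like substring found in text, in order."""
    return EMAIL_RE.findall(text)


def extract_from_dir(directory, out_name="emails.txt"):
    """Scan each .txt file in directory and write one address per line."""
    directory = Path(directory)
    found = []
    for path in sorted(directory.glob("*.txt")):
        if path.name == out_name:
            continue  # do not re-read our own output file
        found.extend(extract_emails(path.read_text(errors="ignore")))
    (directory / out_name).write_text("".join(addr + "\n" for addr in found))
    return found
```

Calling extract_from_dir(".") in the folder holding the converted text files would be the rough equivalent of the cat/perl pipeline above.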
 
And it will capture the email address exactly, whether the recipient name before the @ is 5 characters long or 10? That's awesome. I'm just a VBA newb though 🙂
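
On the length question: a quantifier like + in a regular expression means "one or more", so nothing in the pattern fixes how long the name before the @ has to be. A small Python sketch with an illustrative pattern (not the exact Perl one from earlier in the thread) and two made-up sample addresses shows this:

```python
import re

# Illustrative pattern (an assumption): "+" matches one or more characters,
# so the part before the "@" can be any length.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)*\.[A-Za-z]{2,}")

# Hypothetical sample lines with a 5-character and a 10-character name.
samples = ["Write to alice@example.com", "Write to christophe@example.com"]

# Take the matched address from each line and keep the part before the "@".
names = [EMAIL_RE.search(s).group(0).split("@")[0] for s in samples]
lengths = [len(n) for n in names]

print(names, lengths)
```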
 
Originally posted by: Descartes
Well, there are a number of ways to do it, but here's what I would do:

I would record a simple macro that saves the doc as a text file, then later process those text files. You could then issue a simple command like the following at the CLI:

for %i in (*.doc) do "winword /mYOURMACRONAME"

The 'm' switch to winword just tells it to run the macro. You would then have text files that you could process as normal.

*snip*.

After this point, I think he's got it, actually. He knows of software that can extract emails from TXT files, right?
 