Anyone know of an email extractor that works with DOC files?

Torghn

I've got about 3,000 DOC files, and I need to extract an email address from each of them. I've found many programs that can read TXT and RTF files, but none that can read DOC files. Does anyone know of a program that can do this?
 

Descartes

Well, there are a number of ways to do it, but here's what I would do:

I would record a simple macro that saves the doc as a text file, then later process those text files. You could then issue a simple command like the following at the CLI:

for %i in (*.doc) do start /wait winword "%i" /mYOURMACRONAME

The 'm' switch to winword just tells it to run the macro, and "%i" hands it the current file. Have the macro quit Word when it finishes so the loop can move on to the next file. You would then have text files that you could process as normal. I would personally use Perl to extract the email addresses using something like the following:

cat *.txt | perl -ne "print qq($1\n) while /([\w.+-]+\@[\w-]+(?:\.[\w-]+)*\.[A-Za-z]{2,})/g" > emails.txt

I didn't run this myself obviously, so don't berate me for my ad-hoc syntax. I think at least the first part should be useful to you.
 

KLin

wtf does that do? :confused:
 

Descartes

It concatenates all the txt files in the directory and pipes that to perl, which uses the regular expression to extract the email address from each line (if present); the output is then redirected into the file emails.txt. Like I said, it was just an example of what you could do, and I acknowledge that not everyone would want to do it this way. It saves time, but it comes with a learning curve for anyone not familiar with Perl.

[edit]Oh, and it does work.[/edit]
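
For anyone who would rather not use Perl, the same concatenate, match, and write-out idea can be sketched in Python. This is only a rough sketch under a few assumptions: the pattern is a simplified one (not the exact regex from the thread), the names extract_emails and extract_from_dir are made up for illustration, and the text files are assumed to be readable as UTF-8 with undecodable bytes ignored.

```python
import re
from pathlib import Path

# Assumed, simplified pattern: local part, "@", dot-separated labels, alphabetic TLD.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)*\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return every address found in text, in first-seen order, without duplicates."""
    seen = []
    for address in EMAIL_RE.findall(text):
        if address not in seen:
            seen.append(address)
    return seen


def extract_from_dir(directory, out_name="emails.txt"):
    """Scan every .txt file in directory and write one address per line to out_name."""
    directory = Path(directory)
    found = []
    for path in sorted(directory.glob("*.txt")):
        if path.name == out_name:
            continue  # do not re-read our own output file
        text = path.read_text(encoding="utf-8", errors="ignore")
        for address in extract_emails(text):
            if address not in found:
                found.append(address)
    output = "\n".join(found) + ("\n" if found else "")
    (directory / out_name).write_text(output, encoding="utf-8")
    return found
```

Calling extract_from_dir(".") in the folder that holds the converted text files would write emails.txt next to them. Unlike the one-liner, this version also drops duplicate addresses.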
 

KLin

And it will capture the email address exactly, whether the recipient name before the @ is 5 characters long or 10? That's awesome. I'm just a VBA newb though :)
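
To illustrate why the length doesn't matter: in a regular expression, "+" means "one or more of the preceding", so the part before the @ can be any length. Here is a small Python sketch of that; the pattern is a simplified assumption rather than the exact regex from the thread, and first_email is a made-up helper name.

```python
import re

# Assumed, simplified pattern. The "+" after the first bracket means
# "one or more characters", so the local part may be any length.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)*\.[A-Za-z]{2,}")


def first_email(line):
    """Return the first address found in line, or None when there is none."""
    match = EMAIL_RE.search(line)
    return match.group(0) if match else None


print(first_email("Reply to abcde@example.com today"))       # abcde@example.com
print(first_email("Reply to abcdefghij@example.com today"))  # abcdefghij@example.com
print(first_email("No address on this line"))                # None
```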
 

Amorphus

Originally posted by: Descartes
Well, there are a number of ways to do it, but here's what I would do:

I would record a simple macro that saves the doc as a text file, then later process those text files. You could then issue a simple command like the following at the CLI:

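for %i in (*.doc) do start /wait winword "%i" /mYOURMACRONAME

The 'm' switch to winword just tells it to run the macro, and "%i" hands it the current file. Have the macro quit Word when it finishes so the loop can move on to the next file. You would then have text files that you could process as normal.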

*snip*.

After this point, I think he's got it, actually. He knows of software that can extract emails from TXT files, right?