Is it possible to scan a PDF to extract some fields?

mcveigh

Diamond Member
Dec 20, 2000
6,468
6
81
My work wants some of us to go through thousands of PDFs and pull a few fields from each one.
Each one has a name and a serial number on it, among other data.

Is it possible to somehow scan a folder of PDFs to get this info? This would save many hours of work!
 

sdifox

No Lifer
Sep 30, 2005
94,694
14,940
126
Look into ABBYY; you can get a desktop licence and just batch-process all of them.
 

Comdrpopnfresh

Golden Member
Jul 25, 2006
1,202
2
81
PDF is an evil format. See if there is a Foxit application that does what you need, and use the trial version. DO NOT download shoddy PDF tools from poorly documented sites that claim they can write an application for whatever you or your business needs. Those are very untrustworthy sources.
 

corkyg

Elite Member | Peripherals
Super Moderator
Mar 4, 2000
27,370
238
106
From your question, I gather you are talking about printed PDF docs. Are they available in digital format? If so, that is not a difficult task.
 

mcveigh

Sorry it took so long to respond; I've been putting this off for a while.
Yes, these are pure digital PDFs created by software for export.
 

corkyg

If they are PDF files, then no scanning is necessary. Just move or copy the files with your file manager. To then extract a specific piece of data, you read, copy, and paste. There's no easy batch extraction tool unless every file is in precisely the same format.
 

mcveigh

They are all in exactly the same format. On one line there is an ID number and then a serial number. I'd like to extract those 2 numbers into 2 columns on a spreadsheet.

I'm downloading ABBYY, but right now I'm doing it by hand.
 

pjmssn

Member
Aug 17, 2017
89
11
71
I don't know if you like writing code or if it is worth your time, but it should be pretty easy to automate the text extraction with a Python script: https://automatetheboringstuff.com/chapter13/
You can extract the text from the PDF files, search for the required information, and store it in a text or Excel file.
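
Something along these lines might work as a starting point. It is only a rough sketch with a few assumptions: it uses the pypdf library (pip install pypdf), and since the thread never shows the exact layout of the line, the "ID: ... Serial: ..." pattern is made up and would need to be changed to match the real files. The function names and column headers are also just placeholders.

```python
import csv
import re
from pathlib import Path

# ASSUMPTION: the line of interest looks like "ID: 12345  Serial: AB-9876".
# This layout is hypothetical; adjust the pattern to match the real PDFs.
LINE_PATTERN = re.compile(r"ID:\s*(\d+)\s+Serial:\s*(\S+)")


def find_fields(text):
    """Return (id_number, serial_number) found in the text, or None if absent."""
    match = LINE_PATTERN.search(text)
    return match.groups() if match else None


def pdf_text(path):
    """Return the text of every page in one PDF, joined with newlines."""
    # Imported here so the parsing code above can be tried without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(str(path))
    # extract_text() can come back empty for pages with no text layer.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def scan_folder(folder, out_csv):
    """Write one CSV row per PDF found directly inside the folder."""
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "id_number", "serial_number"])
        for path in sorted(Path(folder).glob("*.pdf")):
            fields = find_fields(pdf_text(path))
            # Leave both columns blank on a miss so those files are easy to spot.
            writer.writerow([path.name, *(fields or ("", ""))])
```

Calling scan_folder("pdfs", "fields.csv") would then produce a CSV that opens in Excel with the two numbers in separate columns. It is worth testing find_fields on the text of a single file first, because the extracted text does not always keep the same spacing you see on the page.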