Automatically sort jpg document scans into folders based on contents

SAdata

Well-Known Member
Joined
Sep 4, 2012
Messages
396
Does anyone know of a program that can automatically sort jpg scans of documents (one page each)? It would need to use OCR to find unique text on each page and then sort the file into a folder based on that. I have about 10,000 jpg scans which I need to sort into about 150 folders, and I would rather not go through them one by one lol. The scans are in no particular order, and the jpg file names and document properties are no use for sorting as they are totally random.
 

gfmalan

Expert Member
Joined
Nov 11, 2013
Messages
2,676
If you have Adobe Acrobat, you can convert the jpgs to PDF, and then you can search the contents of those PDFs.
 

[)roi(]

Executive Member
Joined
Apr 15, 2005
Messages
6,282
Have you tried Google Drive?

Alternatively, you could write the code to do this yourself; it's fairly simple if you're a developer.
For example, tie into something like Tesseract.
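
A rough sketch of what that could look like in Python, assuming the pytesseract wrapper and Pillow are installed alongside Tesseract itself. The folder names and phrases in RULES, and the helper names pick_folder, ocr_text and sort_scans, are made up for illustration; you would replace the rules with whatever text is unique to each of your document types.

```python
import shutil
from pathlib import Path

# Hypothetical rules: folder name -> phrases that identify that document type.
RULES = {
    "invoices": ["tax invoice", "invoice number"],
    "bank_statements": ["statement period", "closing balance"],
}


def pick_folder(text, rules=RULES, default="unsorted"):
    """Return the first folder whose phrase occurs in the OCR text."""
    # Lower-case and collapse whitespace so line breaks in the scan don't matter.
    lowered = " ".join(text.lower().split())
    for folder, phrases in rules.items():
        if any(phrase in lowered for phrase in phrases):
            return folder
    return default


def ocr_text(path):
    """Run Tesseract on one image (assumes pytesseract and Pillow are installed)."""
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(path))


def sort_scans(src, dest, read_text=ocr_text, rules=RULES):
    """Move every *.jpg in src into dest/<folder> based on its OCR text."""
    for jpg in sorted(Path(src).glob("*.jpg")):
        target = Path(dest) / pick_folder(read_text(jpg), rules)
        target.mkdir(parents=True, exist_ok=True)
        shutil.move(str(jpg), str(target / jpg.name))
```

Something like sort_scans("scans", "sorted") would then move each file into sorted/<folder>/. Files with no matching phrase land in "unsorted" for a manual look. OCR output is rarely perfect, so short, distinctive phrases will probably match more reliably than long ones.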
 

freddster

Expert Member
Joined
Dec 13, 2013
Messages
2,470
[)roi(];17489850 said:
Have you tried Google Drive?

Alternatively, you could write the code to do this yourself; it's fairly simple if you're a developer.
For example, tie into something like Tesseract.

My first reaction was: has he tried Google Search? OK, I tried searching myself and there's not much on it. A while ago I wondered how they go about recognising the number plates in speed camera photos. Probably some serious analysis crap going on with the photo in there.
 

[)roi(]

Executive Member
Joined
Apr 15, 2005
Messages
6,282
freddster said:
My first reaction was: has he tried Google Search? OK, I tried searching myself and there's not much on it. A while ago I wondered how they go about recognising the number plates in speed camera photos. Probably some serious analysis crap going on with the photo in there.

Not really; license plates are pretty easy, given that the font for plates is chosen to be legible.

The OCR aspect is also relatively simple (in developer terms): identifying the plate area basically entails scanning the image for rectangular shapes and then running OCR on those regions. Images that don't produce a match can then be put through automatic deblurring (e.g. to minimise motion blur or camera shake) before rerunning the OCR. Naturally, images that fail both passes would be set aside for human intervention.
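
As a minimal sketch of that flow in Python: the region finder, the OCR call, the deblurring step and the plate-format check are all passed in as functions, since nothing above ties them to a specific library. Every name here (first_plate, read_plate and the parameters) is hypothetical.

```python
def first_plate(image, find_regions, ocr, is_valid):
    """OCR each candidate rectangle and return the first valid-looking plate."""
    for region in find_regions(image):
        text = ocr(region).strip().upper()
        if is_valid(text):
            return text
    return None


def read_plate(image, find_regions, ocr, deblur, is_valid):
    """Try the raw image first, then one deblurred retry.

    Returns the plate text, or None when both passes fail, which the
    caller would treat as "send to a human".
    """
    plate = first_plate(image, find_regions, ocr, is_valid)
    if plate is None:
        # Second pass: same search, but on the deblurred image.
        plate = first_plate(deblur(image), find_regions, ocr, is_valid)
    return plate
```

The retry only runs when the first pass finds nothing, so clear images skip the more expensive deblurring step.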

The newer potential of GPU (graphics processing unit) based parallel processing (OpenCL or CUDA, e.g. through OpenCV) means that getting snapped by a speed camera and receiving a fine via SMS a few minutes later is quite feasible.
 

freddster

Expert Member
Joined
Dec 13, 2013
Messages
2,470
[)roi(];17490056 said:
Not really; license plates are pretty easy, given that the font for plates is chosen to be legible.

The OCR aspect is also relatively simple (in developer terms): identifying the plate area basically entails scanning the image for rectangular shapes and then running OCR on those regions. Images that don't produce a match can then be put through automatic deblurring (e.g. to minimise motion blur or camera shake) before rerunning the OCR.

Naturally, images that fail both passes would be set aside for human intervention.

Ah, standards make things easier. Valid points here. That's why they take a couple of photos of a transgressor: to make sure the software can get the plate right. And that's why you see people placing tape between parts of the letters or numbers, to confuse the system.
 

[)roi(]

Executive Member
Joined
Apr 15, 2005
Messages
6,282
freddster said:
Ah, standards make things easier. Valid points here. That's why they take a couple of photos of a transgressor: to make sure the software can get the plate right. And that's why you see people placing tape between parts of the letters or numbers, to confuse the system.

Exactly! As for taping over parts of the license plate, well, the fat metro police (hippos) occasionally need to work.
With parallel processing (GPU), it is also theoretically feasible for the system to notify police within minutes to look out for the problem license plate.
 