Marco.org

I’m a programmer, writer, podcaster, geek, and coffee enthusiast.

Mac software to add searchable text to scanned PDFs

I use a Fujitsu ScanSnap S510M (which has since been replaced in their lineup) and love it. I’ve scanned, shredded, and recycled more than 4,200 pages so far that could have been taking up space in my house, but now aren’t.

As part of my workflow, which isn’t very interesting, I’d like OCR software to recognize the text in scanned documents and embed it under the page images in their PDF files. With the text embedded, I can search the documents with Spotlight and attempt to organize them more easily.
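Once the text is embedded, the Spotlight half of that can be scripted. Here’s a minimal sketch in Python, assuming the stock `mdfind` command-line tool and Spotlight’s `kMDItemTextContent` attribute. The function names and the folder path are made up for illustration, and it only builds the command rather than running it.

```python
import shlex


def spotlight_pdf_query(phrase):
    """Build a Spotlight query matching PDFs whose indexed text contains phrase."""
    # Escape backslashes and double quotes so the phrase stays inside the quoted string.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    # The trailing "cd" makes the match case- and diacritic-insensitive.
    return (
        'kMDItemContentType == "com.adobe.pdf" && '
        f'kMDItemTextContent == "*{escaped}*"cd'
    )


def mdfind_command(phrase, folder):
    """Return the argument list for an mdfind search limited to folder."""
    return ["mdfind", "-onlyin", folder, spotlight_pdf_query(phrase)]


if __name__ == "__main__":
    # Hypothetical scan folder; substitute wherever your scans actually live.
    cmd = mdfind_command("eye test", "/Users/me/Scans")
    print(shlex.join(cmd))
```

On a Mac, passing that list to `subprocess.run` should print the matching file paths. A scan with no embedded text layer has nothing for Spotlight to index, so it won’t show up.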

The ScanSnap came with ABBYY FineReader, which does an acceptable job, but degrades the image quality noticeably when it saves the text-embedded PDF copy. It’s enough of a problem that I’m not comfortable deleting the original, and I’d rather not keep two copies of every file around, so I tried to find an alternative that could output better-quality PDFs with text.

NOTE: I know there are more OCR apps than this. I probably forgot yours. There’s only so much time in the day, so I picked the ones that people recommended most and that seemed like good fits for what I want.

To test these apps, I made them all process a scan of a common document: a New York driver’s license eye-test form. (It was the last thing I scanned. I’m still 20/20, but probably not for much longer.)

ABBYY FineReader

Bundled with many ScanSnaps, based on ABBYY’s own OCR engine.

Moderately degraded image quality.

Few OCR errors.

Easily automated. (That’s what it’s for.)

Prizmo

$49, based on OpenRTK. It’s intended for recognizing text from photos, not scanners, to do cool things like “scan” from your iPhone camera.

Destroyed image quality, reduced to low-resolution monochrome.

Very few OCR errors.

I can’t figure out if it can be automated, but it’s clearly not designed for this type of use, so I can’t blame them if it can’t be.

VelOCRaptor

$29, based on Google’s free OCRopus engine.

Severely degraded image quality.

Many OCR errors.

Can be automated easily.

PDF OCR X

$29, based on the free Tesseract engine.

Perfect image quality.

Few OCR errors.

Can be automated easily, but the results still forcibly open in Preview after conversion, which gets in the way for my intended use.

PDFpen

$59, based on Nuance’s commercial OmniPage engine. This app does a lot; OCR is just one feature.

Perfect image quality.

Very few OCR errors.

Can be automated with AppleScript, although the windows still get shoved in your face while it’s working.

Acrobat

It also came with the ScanSnap, but testing it would require me to… install Acrobat. On my Mac. Where things work.

No.

I hate having to write “conclusion” headers

Only PDF OCR X and PDFpen preserved perfect image quality, so they’re the only options for which I’d feel comfortable deleting the original PDFs and keeping only the embedded-text copies.
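That filter is simple enough to write down. A small sketch in Python, with the ratings transcribed from the write-ups above; the field names and the `safe_to_delete_original` helper are just illustrative labels, not anything these apps expose.

```python
# Ratings copied from the per-app notes above.
# price is None where the app came bundled with the scanner.
RESULTS = [
    {"app": "ABBYY FineReader", "price": None, "image": "moderately degraded", "errors": "few"},
    {"app": "Prizmo", "price": 49, "image": "destroyed", "errors": "very few"},
    {"app": "VelOCRaptor", "price": 29, "image": "severely degraded", "errors": "many"},
    {"app": "PDF OCR X", "price": 29, "image": "perfect", "errors": "few"},
    {"app": "PDFpen", "price": 59, "image": "perfect", "errors": "very few"},
]


def safe_to_delete_original(result):
    """Only a copy with untouched image quality can replace the original scan."""
    return result["image"] == "perfect"


keepers = [r["app"] for r in RESULTS if safe_to_delete_original(r)]
print(keepers)
```

Image quality is the only hard requirement here. OCR accuracy, automation, and price just break the tie between whatever survives.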

PDF OCR X looks… like someone wrapped a bare-bones interface around an open-source OCR library.

PDFpen is nicely designed and built by an extremely well-respected, well-established Mac developer, and it’s available in the App Store. This means that it’s likely to be maintained for a while, an OS update probably won’t kill it, I’ll never need to worry about serial numbers or licensing it between my desktop and laptop, and it will update automatically when I update other App Store apps.

So I’m going to try PDFpen for a while. I’ve been eyeing it for years because it does a lot of very useful things, but I’ve never quite been pushed to get it for a particular need. But I think this is it.