Systems for converting images into editable text: OCR

By Emilio Lara, Systems Technician

What does OCR mean? It stands for Optical Character Recognition, the process we apply to images or PDF files when we want to extract their text and convert it into editable text. There are numerous tools that help greatly with this task, but no software is capable of extracting the text, format and layout exactly as we would like. Therefore, in this post, I would like to talk about the different OCR tools that I normally use, so that you can approach each project in the most suitable way.

For starters, there is one question that you always have to ask yourself before starting a project of this kind: Is there an original document?

Let me explain: generally speaking, every image or PDF file has been obtained from an original document (Office, InDesign, Photoshop, QuarkXPress, Illustrator files, etc.). That file is hugely important, because having it can save you a lot of time and work. Also, if you have it, the end result will be much better, because you’ll keep the original layout and you’ll only be altering the text.

But if you had the original document, I don’t think you’d be reading this post. The problem comes when you only have the PDF file or image and you’re left to fend for yourself. Obviously, copying the text by hand is a bad idea (unless it’s a very short document), so we’re going to make use of OCR tools. Let’s begin:

– Adobe Acrobat Pro

The ultimate PDF reader has its own OCR system. To use it, simply go to File>Save as…>Microsoft Word and select your preferred version of Word.

Pros: It converts the PDF into a Word file while trying to reproduce the layout of the document. It is highly recommended for PDF files whose formatting needs to be kept and which were originally generated from a .doc file.

Cons: If the PDF isn’t of very good quality, it usually makes quite a mess of the page (page breaks, carriage returns, format changes, etc.).

– Omnipage

This was one of the first programs that I discovered and it works very well for batch processing. It has a wizard that allows you to automatically perform the OCR process on multiple files at once.

Pros: If you have several PDF files, it’s the best option. Having to open them one by one in Acrobat and then save them is a waste of time which you can avoid with Omnipage.

Cons: Sometimes it has compatibility issues with certain types of files and the quality of the OCR could also be improved.

– ABBYY FineReader

For me, this is the best OCR tool around today. As well as extracting the text with very few errors, it is capable of recreating the layout of the document without much loss of fidelity, even if the source document is of poor quality. It also has its own spell checker to correct the extracted text, as well as many other features.

Pros: It works very well with files of virtually any quality. It also allows you to export the result directly to Word.

Cons: I have yet to find anything major.

– Online OCRs

To tell you the truth, online OCRs don’t inspire a lot of confidence in me, but I have to admit that they’re handy in an emergency. There are hundreds of them and, if you don’t have any of the above options, they are a good choice.

Pros: With an Internet connection, you can use them anywhere in the world without having any software installed.

Cons: I wouldn’t use them with documents that contain sensitive information; you never know. Also, this option is only for emergencies, when you don’t have any of the above options.

Of course, besides those mentioned above, there are numerous options that can serve this purpose, but these are the ones that I normally use and they give me the best results.

As a final tip, I highly recommend installing the TransTools Utilities (if you don’t already have them). This is a set of macros created for Office which speed up many processes. It includes options that automatically tidy up the result obtained from OCR tools: for example, removing unnecessary carriage returns, double spaces and section breaks, or unifying the text format to prevent annoying tags from appearing when you’re working with a translation tool.
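To give an idea of what this kind of clean-up involves, here is a minimal Python sketch that works on plain text copied out of an OCR tool. It is not how TransTools works internally (those are Office macros); the function name clean_ocr_text and the rules it applies are my own assumptions, chosen only to illustrate removing stray line breaks and double spaces.

```python
import re


def clean_ocr_text(text: str) -> str:
    """Tidy plain text extracted by an OCR tool.

    Hypothetical helper for illustration only; it is not part of
    TransTools or of any OCR product mentioned in this post.
    """
    # Normalise Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Rejoin words split by a hyphen at the end of a line
    # ("docu-\nment" becomes "document"). Caveat: a genuinely hyphenated
    # word broken at the hyphen would lose its hyphen too.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    # Assume blank lines mark real paragraph boundaries.
    paragraphs = re.split(r"\n\s*\n", text)

    cleaned = []
    for paragraph in paragraphs:
        # A single line break inside a paragraph is treated as an
        # unnecessary carriage return and replaced by a space.
        paragraph = paragraph.replace("\n", " ")
        # Collapse runs of spaces or tabs into one space.
        paragraph = re.sub(r"[ \t]{2,}", " ", paragraph).strip()
        if paragraph:
            cleaned.append(paragraph)

    return "\n\n".join(cleaned)


if __name__ == "__main__":
    sample = "This is a docu-\nment with  double\nspaces.\r\n\r\nSecond   paragraph."
    print(clean_ocr_text(sample))
```

The sketch assumes that a blank line always separates two real paragraphs, which is not true for every OCR export, so check the result before sending the text to a translation tool.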

I hope that I’ve helped you a little with these tips, but if you have any recommendations of your own, make sure you leave a comment so that we can all carry on learning.

See you!