You've probably heard the term "OCR" used in connection with scanned documents or searchable PDF files. Here's a closer look at what it involves, and why it's important for eDiscovery.

What is OCR?

OCR stands for Optical Character Recognition. It is the process of converting images of text (obtained from a source like a scanner) into data that your computer can understand as text. This allows you to work with the text in a word processing tool like Microsoft Word, where otherwise you would only be able to place the original image.

How does OCR work?

The output of the OCR process contains more than just the textual information that would be found in a Microsoft Word or TXT file. It also contains positional information describing where each word or line of text sits in the originally scanned image of the file. The OCR software uses this positional information to place each word of the text on top of its counterpart in the image, using a special "hidden" (invisible) PDF text rendering mode.
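To make the idea of "text plus position" concrete, here is a minimal Python sketch. The `OcrWord` structure, its field names, and the `/F1` font name are assumptions made for illustration and are not taken from any particular OCR product. The sketch shows the text-only view next to the PDF content-stream operators that would draw each word invisibly at its recorded position, using text rendering mode 3 (`3 Tr`), which neither fills nor strokes the glyphs.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    """One recognized word plus where it sits on the page (hypothetical structure)."""
    text: str
    x: float       # left edge, in PDF points from the left of the page
    y: float       # baseline, in PDF points from the bottom of the page
    height: float  # approximate glyph height, used here as the font size


def plain_text(words):
    """The text-only view, similar to what a TXT file would hold."""
    return " ".join(w.text for w in words)


def escape_pdf_string(s):
    """Escape the characters that are special inside a PDF literal string."""
    return s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def hidden_text_ops(words, font="/F1"):
    """Build content-stream operators that place each word invisibly."""
    lines = ["BT", "3 Tr"]  # rendering mode 3: neither fill nor stroke
    for w in words:
        lines.append(f"{font} {w.height:g} Tf")      # font and size
        lines.append(f"1 0 0 1 {w.x:g} {w.y:g} Tm")  # move to the word's position
        lines.append(f"({escape_pdf_string(w.text)}) Tj")
    lines.append("ET")
    return "\n".join(lines)


words = [OcrWord("Hello", 72, 700, 12), OcrWord("world", 110, 700, 12)]
print(plain_text(words))
print(hidden_text_ops(words))
```

A real OCR engine would also record word widths, rotation, and confidence values, and would scale the hidden text to match the width of the word in the image. The sketch leaves those out to keep the core idea visible.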

Once this process is complete, you will be able to search for text in the output PDF using software like Adobe Reader. The software looks for a match of the search term in the hidden text layer and highlights it there. Because the invisible text is positioned accurately on top of the identical text in the visible image, it appears to highlight the correct visible text. The same applies to other text-related functions like selecting or copying a range of text: you are actually interacting with the hidden text layer.
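The search behavior described above can be sketched in a few lines of Python. This is only an illustration of the principle, not how Adobe Reader or any other viewer is actually implemented, and the `HiddenWord` structure and `find_highlights` function are hypothetical names. The search runs against the hidden words, and the rectangles it returns are what a viewer would paint over the page image.

```python
from dataclasses import dataclass


@dataclass
class HiddenWord:
    """A word in the hidden text layer with its bounding box (hypothetical)."""
    text: str
    x: float
    y: float
    width: float
    height: float


def find_highlights(words, term):
    """Return the boxes of hidden-layer words that contain the search term."""
    needle = term.casefold()
    return [
        (w.x, w.y, w.width, w.height)
        for w in words
        if needle in w.text.casefold()
    ]


page = [
    HiddenWord("OCR", 72, 700, 30, 12),
    HiddenWord("is", 106, 700, 10, 12),
    HiddenWord("important", 120, 700, 60, 12),
]
print(find_highlights(page, "important"))  # [(120, 700, 60, 12)]
```

Nothing in this lookup touches the image itself. If the stored boxes are wrong, or the stored text is wrong, the highlight will be wrong too.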

Because this is a complicated process, issues can arise at several points:

1) Ideally, the hidden text is properly aligned, sized, and oriented over the image. But sometimes the hidden words are offset, so when you search within the PDF document it ends up highlighting a different area than the actual word in the image.

In the example below, the blue boxes represent the hidden overlay text, which is misaligned with the word 'goldfynch'.

2) Inaccurate OCR can cause some instances of a word or phrase to be skipped and left unhighlighted when you search for them, even though they are visible in the image.

For example, suppose the word "important" appears multiple times on a page, but one instance was incorrectly OCR'd as "irnportant" (with a lowercase "rn" in place of the "m"). Searching for "important" will scroll to the page and highlight the correctly recognized instances, but it will skip the incorrectly recognized one, even though you can see that word in the scan of the page.
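The "irnportant" case can be reproduced with a short Python sketch that compares exact matching against approximate matching using the standard library's `difflib.SequenceMatcher`. Approximate matching is swapped in here purely to illustrate the problem; it is not something the viewers mentioned in this article are known to do, and both the `fuzzy_find` helper and the 0.8 similarity threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def fuzzy_find(words, term, threshold=0.8):
    """Return (index, word) pairs whose similarity to term meets the threshold."""
    term = term.casefold()
    hits = []
    for i, word in enumerate(words):
        score = SequenceMatcher(None, term, word.casefold()).ratio()
        if score >= threshold:
            hits.append((i, word))
    return hits


# Words as they might come out of OCR, with one misrecognized instance.
ocr_words = ["This", "is", "important", "and", "also", "irnportant"]

# Exact search only finds the correctly recognized instance.
exact = [i for i, w in enumerate(ocr_words) if w == "important"]
print(exact)

# Approximate search also picks up the misrecognized one.
print(fuzzy_find(ocr_words, "important"))
```

The trade-off is that a lower threshold catches more OCR errors but also returns more false matches, so any threshold would need tuning against real documents.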

3) Even if the OCR process executes correctly, other factors can affect how you interact with an OCR'd document. Sometimes the software you are using displays the layers incorrectly, rendering the scanned image on top of the hidden text layer, so that the highlights occur underneath the image and are not visible. If you perform a search and can scroll through the search results but cannot see the highlights, it is worth trying a different program, such as Adobe Acrobat, Preview on Mac, Google Chrome, Firefox, or Microsoft Edge.

Additionally, access to OCR isn't always easy, and there are often many steps just to have it run on your documents:

  • First, you need to get in touch with a vendor and receive a quote. OCR is usually charged by the page.
  • The process can take anywhere from a few days to a week.
  • Once the OCR is performed, you then need to upload the documents to a system that will allow you to search through them.

Is there anything that can make this easier?

