You've probably heard the term "OCR" (optical character recognition) used in connection with scanned documents or searchable PDF files.
How does OCR work?
The process involves taking an image of a document (usually produced by a scanner) and converting that image into text data that your computer can understand, so that you can interact with the text using the right software. It's not as simple as it may sound, however.
The output of a typical OCR process contains not just text, as a Microsoft Word or TXT file would, but also positional (coordinate) information for each word or line of text. Part of creating a searchable PDF from a scan involves taking this OCR output and rendering the words on top of the image in a special "hidden" PDF text rendering mode.
As a result, when you search for a word in the PDF, say in Adobe Reader, the viewer finds a match in the hidden text and highlights the hidden word, while you see the scanned image underneath. Similarly, when you select a range of text, for example to copy it, you are actually interacting with the hidden text layer.
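To make the idea of a hidden text layer more concrete, here is a minimal Python sketch. It assumes a hypothetical `OcrWord` record whose coordinates are already in PDF points measured from the bottom-left of the page; real OCR engines each have their own output format. The font name `/F1` is also an assumption and would have to be defined in the page's resources, and using the box height as the font size is only a rough approximation. The `3 Tr` operator selects PDF text rendering mode 3, which draws text invisibly.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    """Hypothetical OCR result: one word plus its box, in PDF points."""
    text: str
    x: float       # left edge of the word
    y: float       # baseline, measured from the bottom of the page
    width: float
    height: float


def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def hidden_text_ops(word: OcrWord, font: str = "/F1") -> str:
    """Build content-stream operators that draw `word` invisibly."""
    return (
        "BT "                                  # begin a text object
        f"{font} {word.height:g} Tf "          # font and (approximate) size
        "3 Tr "                                # rendering mode 3: invisible
        f"1 0 0 1 {word.x:g} {word.y:g} Tm "   # move to the word's position
        f"({escape_pdf_string(word.text)}) Tj "
        "ET"                                   # end the text object
    )


def find_highlights(words, query):
    """Return the boxes a viewer would highlight for an exact-match search."""
    q = query.lower()
    return [(w.x, w.y, w.width, w.height) for w in words if w.text.lower() == q]


page = [
    OcrWord("GoldFynch", 72.0, 700.0, 60.0, 12.0),
    OcrWord("search", 140.0, 700.0, 40.0, 12.0),
]
print(hidden_text_ops(page[0]))
print(find_highlights(page, "goldfynch"))
```

The search only ever looks at the hidden words, and the highlight is drawn at the coordinates stored with them, which is why both the recognized text and its position need to be accurate.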
Because the process is complicated, issues can arise at several different points:
1) Ideally, the hidden text word is properly aligned, sized, and oriented to match the underlying word in the image. But sometimes the hidden words are offset, so when you search within the PDF document it ends up highlighting a different area than the actual word in the image.
In the example below, the blue boxes represent the hidden overlay text, which is misaligned with the word 'goldfynch'.
2) Another issue is caused by inaccurate OCR: some instances of a word or phrase that are visible on the scan of the page may be skipped during a search and never highlighted.
For example, if the word "important" appears multiple times on a page, but one instance was incorrectly OCR'd as "irnportant" (with "rn" in place of the "m"), then searching for "important" will scroll to the page and highlight the properly recognized instances but skip the incorrectly recognized one, even though you can see it in the scan of the page.
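The effect can be reproduced in a few lines of Python. This is only an illustrative sketch: the word list is made up, and the approximate matching shown (using the standard library's `difflib.SequenceMatcher` with an arbitrarily chosen cutoff of 0.8) is one possible way a search tool could tolerate OCR errors, not a description of how any particular PDF viewer works.

```python
from difflib import SequenceMatcher

# Hypothetical hidden-text words for one page; the second was misread by OCR.
ocr_words = ["important", "irnportant", "unrelated"]


def exact_find(query, words):
    """Indices of words that match the query exactly (ignoring case)."""
    q = query.lower()
    return [i for i, word in enumerate(words) if word.lower() == q]


def fuzzy_find(query, words, cutoff=0.8):
    """Indices of words whose similarity to the query is at least `cutoff`."""
    q = query.lower()
    hits = []
    for i, word in enumerate(words):
        score = SequenceMatcher(None, q, word.lower()).ratio()
        if score >= cutoff:
            hits.append(i)
    return hits


print(exact_find("important", ocr_words))   # the misread word is skipped
print(fuzzy_find("important", ocr_words))   # the misread word is also found
```

A lower cutoff catches more OCR mistakes but also returns more false matches, so any such threshold is a trade-off.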
Even if the OCR process executes correctly, it is possible for other factors to affect how you interact with an OCR'd document as well:
3) Sometimes the software you are using may display the layers in the wrong order, rendering the scan image on top of the hidden text layer, so that the highlights appear underneath the image and are not visible. If you perform a search and can step through the search results but cannot see the highlights, it is worth trying a different application such as Adobe Acrobat, Preview on Mac, Google Chrome, Firefox, or Microsoft Edge.
Additionally, access to OCR isn't always easy, and there are often many steps just to have it run on your documents:
- First, you need to get in touch with a vendor and receive a quote; OCR is usually charged by the page
- The process can take anywhere from a few days to a week
- Once the OCR is performed, you then need to upload the documents to a system that will allow you to search through them
Is there a better option?
Yes! OCR is a central part of the eDiscovery platform GoldFynch, and you don't have to go through any extra steps to have it run on your documents. In fact, you often won't even notice that OCR is being carried out, because it is performed automatically on your documents as you upload them to the platform. GoldFynch has a built-in document viewer and a robust search engine that lets you search through your uploaded, fully searchable documents. This saves your law firm both time and money.
GoldFynch is great for many other reasons, too! It is simple to use, powerful, and cost-effective. To learn more, check out https://goldfynch.com