University Library, University of Illinois at Urbana-Champaign

Introduction to OCR and Searchable PDFs

Learn OCR best practices and how to begin an OCR project using ABBYY FineReader, Adobe Acrobat Pro, or Tesseract with this guide.

Why best practices are important

These steps will help you to create a clear plan before you begin the OCR process so that you can save time and produce the highest quality and most flexible final product.

Some General OCR Guidelines

Scanning Considerations:

  • For the best OCR accuracy, scan at a resolution of 300 dots per inch (dpi).

  • Brightness settings that are too high or too low can reduce the accuracy of the OCR. A brightness of 50% is recommended.

  • The straightness of the initial scan can affect OCR quality. Skewed pages can lead to inaccurate recognition.

  • Older and discolored documents must be scanned in RGB mode in order to capture all of the image data.
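
If you already have an image file and are unsure whether it meets the 300 dpi recommendation above, you can estimate its resolution from its pixel dimensions and the physical size of the original page. The following is a minimal sketch; the function names and the US Letter page size are illustrative assumptions, not part of any OCR program.

```python
def effective_dpi(pixels: int, inches: float) -> float:
    """Return the resolution implied by a pixel count and a physical length."""
    if inches <= 0:
        raise ValueError("inches must be positive")
    return pixels / inches


def meets_recommended_resolution(width_px: int, height_px: int,
                                 page_width_in: float, page_height_in: float,
                                 min_dpi: float = 300.0) -> bool:
    """Check both dimensions against a minimum dpi (300 by default)."""
    horizontal = effective_dpi(width_px, page_width_in)
    vertical = effective_dpi(height_px, page_height_in)
    return min(horizontal, vertical) >= min_dpi


# Hypothetical example: a US Letter page (8.5 x 11 in) scanned at two sizes.
print(meets_recommended_resolution(2550, 3300, 8.5, 11.0))  # 300 dpi in both directions
print(meets_recommended_resolution(1275, 1650, 8.5, 11.0))  # only 150 dpi
```

A scan that falls short of the threshold is usually better rescanned than enlarged, since enlarging adds pixels without adding detail.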

Textual Considerations:

  • Language: texts published before 1850 may not be fully compatible with OCR software. If the pages you are scanning are in a different language, many OCR systems allow you to select the language of the document.

  • Documents with low contrast can result in poor OCR.

  • Typescript results in poorer OCR than printed type; inconsistent use of font faces and sizes can lower OCR accuracy. 
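
As an illustration of language selection, Tesseract takes the document language through its `-l` option, and several languages can be joined with `+`. The sketch below only builds the command line; the helper name and file names are hypothetical, and it assumes the Tesseract program and the relevant language data are installed before the command is actually run.

```python
import shlex


def tesseract_command(image_path: str, output_base: str,
                      languages: tuple[str, ...] = ("eng",)) -> list[str]:
    """Build a Tesseract command line for the given language codes."""
    if not languages:
        raise ValueError("at least one language code is required")
    # Tesseract accepts several languages joined with '+', e.g. "deu+eng".
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


# Hypothetical file names: a page containing both German and English text.
cmd = tesseract_command("page_001.png", "page_001", languages=("deu", "eng"))
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once you have confirmed that the file paths and language codes match your own setup.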

Other Considerations:

  • The ability of OCR software to accurately analyze your document depends on the condition of the original and/or the quality of the digital file.

  • If you do not have a digital document, or if what you have is poor quality, you can scan the original document using your OCR program as your scanning software.

  • No special skills are required to use OCR software.

  • If your goal is 100% text accuracy, you will need to check and correct the text after the initial recognition process is complete. The system cannot do this check itself. The editing and correcting process may take a considerable amount of time for large amounts of text and/or poor quality original text.

For more information on how to obtain a quality image, please consult the LibGuide on How to Use Digital Tools for Archival Research.

Digital Content Creation's 5 Best Practices for OCR (in brief)

Consider the Uses of OCR Output

Thinking through your intentions for the final OCR'd text will help you to create a final text that is rich in all of the appropriate ways. You should consider the level of precision you wish to have in your final text. Should it be a facsimile-style representation of the full text? Or are certain standards required for the repository you may be sending the text to in the future? It is also important that all collaborators agree on the requirements and expectations for the final product.

Consider Functionality of OCR Software

Understanding how the OCR software will handle your text is also an important consideration. You should consider the structural elements of your text, such as the headings, images, tables, and captions, as well as the font and language. How will the OCR software deal with each of these elements? What can you do to help the process go as smoothly as possible? Pre-processing steps such as de-skewing and cropping can help.

Consider Output File Formats

In your planning for the uses of your output, you may have already considered the output file format of your final text document. Keep in mind that making your text widely searchable is one of the main uses of OCR. Consider the best format for your project based on who you want to have access to your text as well as how you could make it accessible to your intended audience.
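
As one concrete case, Tesseract selects its output format through a config name given at the end of the command: `pdf` for a searchable PDF, `txt` for plain text, and `hocr` for HTML that records the page layout. The sketch below maps an intended use to one of those config names; the dictionary keys, the helper name, and the file names are hypothetical labels chosen for this example.

```python
# Tesseract config names that select an output format.
# The dictionary keys are illustrative labels, not Tesseract terms.
OUTPUT_CONFIGS = {
    "searchable_pdf": "pdf",   # page image with a hidden, searchable text layer
    "plain_text": "txt",       # recognized text only
    "layout_html": "hocr",     # HTML with word positions on the page
}


def tesseract_output_command(image_path: str, output_base: str,
                             output_kind: str) -> list[str]:
    """Build a Tesseract command line for one kind of output file."""
    if output_kind not in OUTPUT_CONFIGS:
        raise ValueError(f"unknown output kind: {output_kind!r}")
    return ["tesseract", image_path, output_base, OUTPUT_CONFIGS[output_kind]]


# Hypothetical file names: produce a searchable PDF from one scanned page.
print(tesseract_output_command("page_001.png", "page_001", "searchable_pdf"))
```

Tesseract adds the file extension to the output base itself, so the example above would write its result to a file named `page_001.pdf`.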

Consider the 2 Factors Affecting Accuracy of OCR

Remember that software packages may boast between 97% and 99% accuracy; however, these rates are based on character errors, not word errors, so the share of words containing at least one error is typically higher.

  • Textual Considerations

    • Special fonts (such as typewriter faces), very small fonts (6 pt), and low-contrast text can all decrease the accuracy of the OCR software. In some cases, OCR software will not be helpful at all. For example, OCR software cannot recognize handwritten documents with any degree of accuracy.

  • Scanning Considerations

    • Getting a quality image is the first step in having the best and most accurate OCR experience. Consider such things as resolution, brightness, straightness, and discoloration before you scan your text.
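
To see why the difference between character accuracy and word accuracy matters, consider a rough estimate. The sketch below makes the simplifying assumption that every character is misread independently with the same probability, which real scans only approximate; the function names and the figures used are illustrative.

```python
def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Probability that every character in a word is recognized correctly,
    assuming independent character errors (a simplification)."""
    return char_accuracy ** word_length


def expected_char_errors(char_accuracy: float, total_chars: int) -> float:
    """Expected number of misrecognized characters in a text."""
    return (1.0 - char_accuracy) * total_chars


for acc in (0.97, 0.99):
    print(f"{acc:.0%} per character -> "
          f"{word_accuracy(acc, 5):.1%} per 5-letter word")

# A page of about 2,000 characters at 99% character accuracy.
print(round(expected_char_errors(0.99, 2000)))
```

Under this assumption, 97% character accuracy leaves only about 86% of five-letter words fully correct, and even 99% leaves about 95%, or roughly 20 errors to find on a 2,000-character page.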

Consider Possible Corrections

If your OCR'd text requires 100% accuracy, you can correct it using the editing tools within the OCR software. This process can be labor intensive and time consuming; however, it is the best way to produce an exact copy of a text.

More resources