Introduction to OCR and Searchable PDFs: An Introduction to OCR

This Guide

This guide introduces OCR: what it is, how it is used, which software options are available, and best practices for using it. It also gives in-depth instructions for ABBYY FineReader, Adobe Acrobat Pro, and Tesseract, three popular OCR software options. If you have questions after reading this guide, or would like some guidance on using OCR software, please contact the Scholarly Commons.

What is OCR?

Are you curious about optical character recognition (OCR) software? Interested in learning how OCR could enhance your research project, or how it can aid in textual comparisons? This guide aims to help you explore the features of different OCR software options.

Optical character recognition (OCR) is the electronic identification and digital encoding of typed or printed text by means of an optical scanner and specialized software. OCR software allows a computer to read static images of text and convert them into editable, searchable data. OCR typically involves three steps: opening or scanning a document in the OCR software, running recognition so that the software identifies the text in the document, and saving the OCR-produced document in a format of your choosing.
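As a rough illustration, the three steps above can be sketched in Python around Tesseract's command-line tool. This is a minimal sketch, not a recommended workflow: it assumes Tesseract is installed and on your PATH with English language data ("eng"), and the function names and the file names "scan.png" and "scan_ocr" are hypothetical placeholders.

```python
import shutil
import subprocess
from pathlib import Path

# Output formats this sketch handles, mapped to the file extension
# Tesseract appends to the output base name.
OUTPUT_SUFFIXES = {"pdf": ".pdf", "txt": ".txt", "hocr": ".hocr"}


def build_tesseract_command(image_path, output_base, output_format="pdf", language="eng"):
    """Build the argument list for one Tesseract run.

    Step 1 (open): image_path is the scanned page to read.
    Step 2 (recognize): "-l" selects the language data used for recognition.
    Step 3 (save): output_format selects the saved format, e.g. a searchable PDF.
    """
    if output_format not in OUTPUT_SUFFIXES:
        raise ValueError(f"unsupported output format: {output_format}")
    return ["tesseract", str(image_path), str(output_base), "-l", language, output_format]


def run_ocr(image_path, output_base, output_format="pdf", language="eng"):
    """Run Tesseract and return the path of the file it should have written."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("Tesseract does not appear to be installed or on PATH")
    command = build_tesseract_command(image_path, output_base, output_format, language)
    subprocess.run(command, check=True)  # raises CalledProcessError if Tesseract fails
    return Path(str(output_base) + OUTPUT_SUFFIXES[output_format])


if __name__ == "__main__":
    # Hypothetical file names; replace them with your own scan.
    print(build_tesseract_command("scan.png", "scan_ocr"))
```

Calling run_ocr("scan.png", "scan_ocr") on a real scan would ask Tesseract to write scan_ocr.pdf, a PDF with a searchable text layer. ABBYY FineReader and Adobe Acrobat Pro carry out the same three steps through their graphical interfaces.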

OCR can be used for a variety of applications. In academic settings, it is often useful for text and data mining projects, as well as textual comparisons. OCR is also an important tool for creating accessible documents, especially PDFs, for people who are blind or visually impaired.

Licensing

Except where otherwise indicated, original content in this guide is licensed under a Creative Commons Attribution (CC BY) 4.0 license. You are free to share, adopt, or adapt the materials. We encourage broad adoption of these materials for teaching and other professional development purposes, and invite you to customize them for your own needs.