LibGuides: Introduction to OCR and Searchable PDFs: Tesseract

What is Tesseract?

Tesseract is an optical character recognition (OCR) engine. It converts image documents into searchable PDFs or plain text that can be edited in a word processor such as Microsoft Word. It is free, open-source software that runs through a Command-Line Interface (CLI). Tesseract is considered one of the most accurate open-source OCR engines currently available, and its development has been sponsored by Google since 2006. That said, its capabilities can be more limited than those of commercial software like Adobe Acrobat Pro and ABBYY FineReader. However, because it is open source, anyone with programming knowledge can edit the code behind Tesseract and adapt it to their needs. It can be used on Mac, Windows, and Linux machines.

How Tesseract analyzes documents:

The user gives Tesseract the name of the input image file, the desired name for the output file, and the desired output format
Tesseract analyzes the image and creates a new, searchable document in the user's desired format
Unlike some other OCR software, Tesseract cannot scan a document directly; the document must already exist as an image file
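
As a rough sketch of how those three inputs look on the command line, the example below assumes a hypothetical image named scan.png in the current folder. The file names are placeholders to replace with your own. The general pattern is the input image, then the output name without an extension, then the format.

```shell
# Hypothetical names: swap in your own image file and output name.
input_image="scan.png"   # the existing image to read
output_base="scan"       # output name, without an extension
format="pdf"             # "pdf" produces a searchable PDF

# Only run if Tesseract is installed and the image exists.
if command -v tesseract >/dev/null 2>&1 && [ -f "$input_image" ]; then
    tesseract "$input_image" "$output_base" "$format"   # writes scan.pdf
else
    echo "Skipped: install Tesseract and place $input_image in this folder first."
fi
```

Leaving the format off the end of the tesseract command produces a plain text file (scan.txt in this sketch) instead of a PDF.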

Basic OCR Operations in Tesseract:

Converts image files (JPG, TIF, PNG, etc.) to searchable PDF or plain text
The new document appears in the directory you run the command from, usually the same one as the initial document
Runs through your Command-Line Interface
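
When a folder holds many page images, the same operation can be repeated in a loop. The sketch below is one possible approach, assuming the images are PNG files in the current folder; base_name is a hypothetical helper defined here, not part of Tesseract.

```shell
# Strip the file extension to get the output name, e.g. page1.png -> page1
base_name() {
    printf '%s\n' "${1%.*}"
}

# Convert every PNG in the current folder to a searchable PDF beside it.
for image in *.png; do
    [ -f "$image" ] || continue                      # no PNG files here: nothing to do
    command -v tesseract >/dev/null 2>&1 || break    # Tesseract is not installed
    tesseract "$image" "$(base_name "$image")" pdf   # page1.png -> page1.pdf
done
```

Changing *.png to *.tif or *.jpg would handle other image formats in the same way.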

Because the resulting files are editable and searchable, researchers can:

Copy, paste, and edit passages of text within the new document
Search the text in PDF readers or word processing programs
Ingest the text into analysis programs like ATLAS.ti or NVivo
Make information easier to find via the Internet by creating searchable documents

Command-Line Resources

The Command-Line Interface (CLI) is the user's window into the computer's operating system. The user types text-based commands to tell the computer what to do. Before the 1980s and the rise of the Graphical User Interface (GUI) that we are used to today, most computer users ran their systems through the CLI. Today, however, most computer users don't know a whole lot about the CLI.

If you're one of those people, don't panic! We've gathered some resources here that will help you learn how to work in the CLI using text-based commands. Learning the basics shouldn't take you more than an hour or two. Knowing how to use your CLI not only allows you to run Tesseract, but also helps you learn more about your system and teaches you tricks for operating it.

Learn the Command Line Tutorial from Codecademy
Primarily for Mac and Linux users
Introduction to the Mac OS X Command Line from Treehouse
For Mac users
Introduction to the Bash Command Line from The Programming Historian
For Windows and Mac users
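
To give a taste of what these tutorials cover, here is a short sketch of a few standard commands for moving around folders. It assumes a Bash-style terminal (the Mac or Linux Terminal, or Git Bash on Windows); the Windows Command Prompt uses different command names. The folder name ocr-practice is a made-up example.

```shell
pwd                     # print the folder you are currently in
ls                      # list the files in that folder
mkdir -p ocr-practice   # create a folder (hypothetical name) if it does not exist
cd ocr-practice         # move into the new folder
pwd                     # confirm where you are now
cd ..                   # move back up one level
```

Being comfortable with pwd, ls, and cd is usually enough to find your image files and run Tesseract on them.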

Important Links

Tesseract Wiki
The go-to hub for figuring out how you should download and use Tesseract
Tesseract GitHub Page
Includes the repositories used for Tesseract
An Overview of the Tesseract OCR Engine
A PDF file of a paper written by Google's Ray Smith describing Tesseract in detail