University Library, University of Illinois at Urbana-Champaign

Introduction to OCR and Searchable PDFs: Tesseract

Learn OCR best practices and how to begin an OCR project using ABBYY FineReader, Adobe Acrobat Pro, or Tesseract with this guide.

Command-Line Resources

The Command-Line Interface (CLI) is the user's window into the computer's operating system. The user types text-based commands to tell the computer what to do. Before the 1980s and the rise of the Graphical User Interface (GUI) that we are used to today, most computer users ran their systems through a CLI. Today, however, most computer users know little about the CLI.

If you're one of those people, don't panic! We've gathered some resources here that will help you learn to work in the CLI using text-based commands. Learning the basics shouldn't take you more than an hour or two. Knowing how to use your CLI not only lets you run Tesseract, but also helps you learn more about your system and teaches you tricks for operating it.

What is Tesseract?

Tesseract is an optical character recognition (OCR) system. It is used to convert images of documents into searchable PDFs or editable text files that can be opened in a word processor such as Microsoft Word. It is free, open-source software that is run through a Command-Line Interface (CLI). Tesseract is considered one of the most accurate open-source OCR engines currently available, and Google began sponsoring its development in 2006. That being said, its capabilities can be more limited than those of commercial software like Adobe Acrobat Pro and ABBYY FineReader. However, because it is open-source software, anyone with programming knowledge can edit the code behind Tesseract and adapt it to their needs. It can be used on Mac, Windows, and Linux machines.

How Tesseract analyzes documents:

  • The user gives Tesseract the file name of the image, the desired name for the new document, and the desired output format
  • Tesseract analyzes the image and creates a new, searchable document in the user's desired format
  • Unlike some other OCR software, Tesseract cannot scan a document directly; the document must first be scanned and saved as an image file
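
The three inputs listed above map onto a Tesseract command of the form "tesseract image output-name format". The following is a minimal sketch in Python, assuming a Tesseract installation that supports PDF output. The file names scan.png and scan_ocr and the helper names build_tesseract_command and run_ocr are made up for illustration and are not part of Tesseract.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, output_format="pdf"):
    """Return the argument list for: tesseract <image> <output_base> <format>."""
    return ["tesseract", str(image), str(output_base), output_format]


def run_ocr(image, output_base, output_format="pdf"):
    """Run Tesseract if it is installed and the image exists; return the command."""
    command = build_tesseract_command(image, output_base, output_format)
    if shutil.which("tesseract") is None:
        # Tesseract is not on the PATH, so only show what would have been run.
        print("Tesseract not found; would have run:", " ".join(command))
    elif not Path(image).is_file():
        print("Image not found:", image)
    else:
        # Equivalent to typing the command at the command line.
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    # Hypothetical file names: scan.png is the input image and
    # scan_ocr is the output name (Tesseract adds the .pdf extension).
    run_ocr("scan.png", "scan_ocr", "pdf")
```

Typing the same three pieces directly at the command line (tesseract scan.png scan_ocr pdf) should have the same effect; the script form is mainly useful when the step needs to be repeated.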

Basic OCR Operations in Tesseract:

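  • Convert image files (JPG, TIF, PNG, etc.) to searchable PDF or to plain text that can be opened in Microsoft Word
  • The new document is saved under the output name you give; if you run Tesseract from the folder that holds the image, it appears in the same directory as the initial document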
  • Run through your Command-Line Interface
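
When a whole folder of scans needs converting, the same operation can be repeated for each image. The sketch below is one possible approach in Python, not an official Tesseract feature: it pairs every image in a folder with an output name beside it, so each PDF lands in the same directory as its source image. The folder name scans and the helper names plan_batch and ocr_folder are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path

# Image types mentioned in this guide; extend as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".tif", ".tiff", ".png"}


def plan_batch(folder):
    """Pair each image in folder with an output name beside it (no extension)."""
    images = sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    return [(image, image.with_suffix("")) for image in images]


def ocr_folder(folder):
    """Run Tesseract on every image in folder, writing <name>.pdf next to it."""
    if shutil.which("tesseract") is None:
        print("Tesseract not found; nothing was converted.")
        return
    for image, output_base in plan_batch(folder):
        subprocess.run(["tesseract", str(image), str(output_base), "pdf"], check=True)


if __name__ == "__main__":
    # "scans" is a hypothetical folder name; nothing happens if it is missing.
    if Path("scans").is_dir():
        ocr_folder("scans")
```

Note that two images sharing a name but not an extension (for example page1.png and page1.tif) would be written to the same PDF name, so this sketch assumes image names in the folder are unique.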

Because the resulting files are editable and searchable, researchers will be able to:

  • Copy, paste, and edit passages of text within the new document
  • Search the text in PDF readers or word processing programs
  • Ingest the text into analysis programs like ATLAS.ti or NVivo
  • Make information easier to find via the Internet by creating searchable documents

Scholarly Commons

Contact:
306 Main Library
Drop-ins welcome
Monday-Friday 8:30am-6:00pm
Phone: 217-244-1331