
University Library, University of Illinois at Urbana-Champaign

Introduction to OCR and Searchable PDFs

Learn OCR best practices and how to begin an OCR project using ABBYY FineReader, Adobe Acrobat Pro, or Tesseract with this guide.

Basic Command

All Tesseract commands follow the same basic format:

tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

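You control what Tesseract does by changing the parts of this command. imagename is the input image file. outputbase is the name of the output file without its extension; Tesseract adds the extension for you. -l lang sets the language of the text. -psm pagesegmode sets the page segmentation mode (in Tesseract 4 and later this option is written --psm). configfile names one or more configuration files, which can also select the output format. The parts in square brackets are optional.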

Note: As a beginner, you probably won't be using pagesegmode or configfile just yet, so this LibGuide doesn't focus on those options. If you're interested in what they can do, check out the ControlParams page on the Tesseract Wiki. For a list of all options that can be used with Tesseract, see the Command Line Usage GitHub page.

File Input Formats

Tesseract will only take image files for input. These include:

  • TIFF (preferred)
  • JPG
  • PNG

File Output Formats

Tesseract has a limited number of file output formats. These include:

  • Plain text (.txt, UTF-8 encoded)
  • PDF (searchable)
  • HTML
  • hOCR

Examples

These examples show how to write a Tesseract command for particular inputs and outputs. Use them as templates when writing commands for your own work.


TIF -> TXT

This will be one of the most basic commands you can perform in Tesseract. Let's say you have an image file called words.tif and you would like to use Tesseract to create a txt file called words.txt. The command for that would look like this:

tesseract words.tif words

You don't need to add much to this command, because English is the default language and plain text is the default output. Tesseract adds the .txt extension to the output name itself, so this command creates words.txt.
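
The same pattern extends to a whole folder of scans. The following is a minimal sketch rather than a built-in Tesseract feature: the function name ocr_folder and the folder name in the usage note are hypothetical, and it assumes that tesseract is installed and on your PATH. It strips the .tif extension from each file name and uses what is left as the outputbase, so page1.tif becomes page1.txt.

```shell
# Hypothetical helper: OCR every .tif file in one folder to plain text.
# Assumes tesseract is on your PATH and the scans are in English (the default).
ocr_folder() {
  folder="$1"
  for image in "$folder"/*.tif; do
    # Skip the unexpanded pattern when the folder holds no .tif files.
    [ -e "$image" ] || continue
    # Drop the extension to get the outputbase: scans/page1.tif -> scans/page1
    outputbase="${image%.tif}"
    tesseract "$image" "$outputbase"
  done
}
```

Running ocr_folder scans (where scans is a folder of your own) would write each txt file next to the image it came from.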


PNG -> PDF

This one will be a little more complicated. Say you have a document in German called words.png and would like to create a searchable PDF from it. The command for that would look like this:

tesseract words.png out -l deu pdf

To build this command, add a minus sign followed by a lowercase letter L and then the language code (-l deu), which tells the program that the file is in German, and then pdf, which tells the program that the output should be a PDF instead of the default txt file. Write pdf in lowercase: it is the name of a configuration file, and file names are case-sensitive on many systems. The result is out.pdf. All PDFs created in Tesseract should be searchable.
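
If you make searchable PDFs often, the command can be wrapped in a small shell function. This is only a sketch under stated assumptions: the name ocr_to_pdf is hypothetical, the image file name is assumed to have an extension, and the language data you ask for must already be installed. It names the PDF after the image and falls back to English when no language code is given.

```shell
# Hypothetical helper: make a searchable PDF from one image.
# Usage: ocr_to_pdf IMAGE [LANG]   (LANG defaults to eng)
ocr_to_pdf() {
  image="$1"
  lang="${2:-eng}"
  # Drop the last extension to get the outputbase: words.png -> words
  outputbase="${image%.*}"
  # The trailing "pdf" selects PDF output instead of the default txt file.
  tesseract "$image" "$outputbase" -l "$lang" pdf
}
```

For example, ocr_to_pdf words.png deu runs the same command as above, except that the output is named words.pdf.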


Getting the hang of it?

Tesseract differs from the other OCR options in this LibGuide because you can configure and train it to do very specific things. It may be tricky at first, but once you start experimenting with it, Tesseract offers a lot of flexibility.

Why Isn't Tesseract Working?

Sometimes an error keeps Tesseract from producing any output for your file, or the file it produces looks a little strange. Here are some quick troubleshooting ideas for both situations.

  • Tesseract tells me that there's an error
    • Check to see if your command is correct; it's easy to make mistakes, so there's no harm in looking again
    • Make sure that the language data for your language code, and any other additional options you use, are downloaded and installed
    • Check that your input and output formats are supported by Tesseract
  • The output looks strange
    • Check to see if your command is correct; it's easy to make mistakes, so there's no harm in looking again
    • Look at the quality of the input image -- low quality images are harder for Tesseract to read
      • See Tesseract's ImproveQuality page for more information about improving the quality of the image
    • Understand that no OCR software is perfect -- if you need 100% accuracy, you will have to proofread the output yourself
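
The language check in the list above can be done from the command line. The sketch below is one possible way to do it, not an official procedure: the function name check_language is hypothetical, and it relies on the --list-langs option, which prints the language codes that your Tesseract installation can use.

```shell
# Hypothetical helper: confirm Tesseract is installed and a language is available.
# Usage: check_language LANG   (for example: check_language deu)
check_language() {
  lang="$1"
  if ! command -v tesseract >/dev/null 2>&1; then
    echo "tesseract was not found on your PATH"
    return 1
  fi
  # Look for a line of the --list-langs output that is exactly the language code.
  # 2>&1 is included because some versions print the list to standard error.
  if tesseract --list-langs 2>&1 | grep -qxF "$lang"; then
    echo "$lang is installed"
  else
    echo "$lang is missing; download its traineddata file"
    return 1
  fi
}
```

If the function reports that a language is missing, install the matching traineddata file (for German, deu.traineddata) and run the check again before retrying your command.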

Important Links