LibGuides: Introduction to OCR and Searchable PDFs: Using Tesseract

Basic Command

All Tesseract commands follow the same basic format:

tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

You control how Tesseract works by adjusting the parts of this command. For definitions of each part of the command, see the image below:

Note: As a beginner, you probably won't be using pagesegmode or configfile just yet, so we won't be focusing on those options in this LibGuide. If you're interested in what these can do, check out the ControlParams page on the Tesseract Wiki. For a list of all possible options that can be used with Tesseract, see the Command Line Usage GitHub page.
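As a rough illustration of how the parts of the basic format fit together, the shell sketch below assembles a command from its pieces and prints it instead of running it. The function name build_tesseract_cmd and the file names are made up for this example, and the sketch assumes file names without spaces.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): print the command for the
# given parts so it can be checked before it is run.
# Usage: build_tesseract_cmd imagename outputbase [lang] [configfile]
build_tesseract_cmd() {
    imagename=$1
    outputbase=$2
    lang=${3:-}
    configfile=${4:-}

    cmd="tesseract $imagename $outputbase"
    # -l is only added when a language code was supplied
    if [ -n "$lang" ]; then
        cmd="$cmd -l $lang"
    fi
    # a config file name, if any, goes last
    if [ -n "$configfile" ]; then
        cmd="$cmd $configfile"
    fi
    printf '%s\n' "$cmd"
}

build_tesseract_cmd scan.tif scan           # prints: tesseract scan.tif scan
build_tesseract_cmd scan.png scan deu pdf   # prints: tesseract scan.png scan -l deu pdf
```

Printing the command first is a cautious way to practice: you can read it over, then copy it into the terminal once it looks right.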

File Input Formats

Tesseract will only take image files for input. These include:

TIFF (preferred)
JPG
PNG
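Because Tesseract only takes image files, it can help to check a file's extension before handing it over. The sketch below is one way to do that in a shell script. The function name is_supported_input is hypothetical, and treating .tiff and .jpeg as variants of TIFF and JPG is an assumption of this example. It only looks at the file name, not at the file's contents.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the file name ends in one of the
# image extensions listed above (compared case-insensitively).
is_supported_input() {
    case "$1" in
        *.*) ;;             # has an extension, keep checking
        *) return 1 ;;      # no extension at all
    esac
    # Take the text after the last dot and lower-case it
    ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
    case "$ext" in
        tif|tiff|jpg|jpeg|png) return 0 ;;   # tiff and jpeg are assumed variants
        *) return 1 ;;
    esac
}

for name in words.tif scan.PNG notes.docx; do
    if is_supported_input "$name"; then
        echo "$name: supported image format"
    else
        echo "$name: not in the list above"
    fi
done
```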

File Output Formats

Tesseract has a limited number of file output formats. These include:

Plain text (UTF-8 encoded)
PDF (searchable)
HTML
hOCR

Examples

The examples below show how to write a Tesseract command for particular inputs and outputs. Use them as templates when drafting commands for your own work.

TIF -> TXT

This will be one of the most basic commands you can perform in Tesseract. Let's say you have an image file called words.tif and you would like to use Tesseract to create a txt file called words.txt. The command for that would look like this:

tesseract words.tif words

You don't need to add much to this command: Tesseract appends the .txt extension to the output name (words) itself, English is the default language, and txt is the default output format.
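The same kind of command can be repeated for a whole folder of scans. The sketch below is one possible way to do that, assuming a shell such as sh or bash, that the TIFF files sit in the current directory, and that tesseract is on your PATH. The helper name outputbase_for is made up for this example. It passes each file name without its extension as the output name, so that Tesseract, which adds .txt itself, writes page1.txt for page1.tif.

```shell
#!/bin/sh
# Hypothetical helper: drop the extension from an image name,
# e.g. page1.tif becomes page1.
outputbase_for() {
    printf '%s\n' "${1%.*}"
}

if command -v tesseract >/dev/null 2>&1; then
    for image in *.tif; do
        # If no .tif files exist, the pattern stays literal; skip it
        [ -e "$image" ] || continue
        # Default language (English) and default output (txt)
        tesseract "$image" "$(outputbase_for "$image")"
    done
else
    echo "tesseract was not found on PATH" >&2
fi
```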

PNG -> PDF

This one will be a little more complicated. Say you have a document in German called words.png and would like to create a searchable PDF from it. The command for that would look like this:

tesseract words.png out -l deu pdf

To perform this command, you have to include a minus sign followed by a lowercase letter L and then the language code [-l deu], which tells the program that the file is in German, and [pdf] to tell the program that the output should not be the default txt file, but a PDF (here, out.pdf). Write pdf in lowercase, since systems with case-sensitive file names may not recognize PDF. All PDFs created in Tesseract should be searchable.
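A language code only works if the matching language data is installed. The sketch below is one hedged way to check for German before running the command. It relies on tesseract --list-langs, which prints the installed language codes one per line. The helper name lang_in_list is made up for this example, and words.png is the sample file from the text.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the language code ($1) appears as a
# whole line in the text read from standard input.
lang_in_list() {
    grep -qx -- "$1"
}

if command -v tesseract >/dev/null 2>&1 && [ -f words.png ]; then
    # Some releases print the list on standard error, hence 2>&1
    if tesseract --list-langs 2>&1 | lang_in_list deu; then
        # pdf is the config name, written in lowercase
        tesseract words.png out -l deu pdf
    else
        echo "German (deu) language data does not appear to be installed" >&2
    fi
fi
```

If the check fails, the language data for deu needs to be downloaded and installed before the command can work.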

Getting the hang of it?

Tesseract is different from the other OCR options on this LibGuide because you can configure it and train it to do very specific things. It may be tricky starting out, but once you start experimenting with Tesseract, it offers a lot of flexibility.

Why Isn't Tesseract Working?

Sometimes an error keeps Tesseract from producing any output for your file, or the file Tesseract produces ends up looking a little strange. Here are some quick troubleshooting ideas for when this happens to you.

Tesseract tells me that there's an error
- Check to see if your command is correct; it's easy to make mistakes, so there's no harm in looking again
- Make sure that the language data for your language code, and any other additional options you use, are downloaded and installed
- Check that your input and output formats are supported by Tesseract

The output looks strange
- Check to see if your command is correct; it's easy to make mistakes, so there's no harm in looking again
- Look at the quality of the input image -- low quality images are harder for Tesseract to read
  - See Tesseract's ImproveQuality page for more information about improving the quality of the image
- Understand that no OCR software is perfect -- you will need to check over its work for 100% accuracy
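Some of the first checks can be partly automated. The sketch below is a minimal example, not an official tool: it confirms that the tesseract program can be found and that the input file exists, then prints the first line of tesseract --version, which is handy when asking for help. The function name preflight and the file name words.tif are assumptions for this example.

```shell
#!/bin/sh
# Hypothetical helper: run basic checks before calling Tesseract.
# Returns 0 if everything looks ready, 1 otherwise.
preflight() {
    image=$1
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract is not installed or not on PATH" >&2
        return 1
    fi
    if [ ! -f "$image" ]; then
        echo "input file not found: $image" >&2
        return 1
    fi
    # Some releases print the version on standard error, hence 2>&1
    tesseract --version 2>&1 | head -n 1
    return 0
}

if preflight words.tif; then
    echo "ready to run: tesseract words.tif words"
fi
```

A check like this cannot tell you whether the text will be read accurately. It only rules out two common reasons for a command failing outright.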

Important Links

Tesseract Wiki
The go-to hub for figuring out how you should download and use Tesseract
Tesseract GitHub Page
Includes the repositories used for Tesseract
An Overview of the Tesseract OCR Engine
A PDF file of a paper written by Google's Ray Smith describing Tesseract in detail