Sometimes Tesseract runs into an error and produces no output for your file, or the file it produces ends up looking a little strange. Here are some quick troubleshooting ideas for when this happens to you.
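A reasonable first check, offered here as a suggestion rather than a step from this guide, is to confirm that Tesseract is installed and to see which languages it has data for. The sketch below assumes a command-line shell such as Bash. The variable name tesseract_status is only for illustration.

```shell
# Check whether the tesseract program can be found on your PATH.
if command -v tesseract >/dev/null 2>&1; then
    tesseract_status="found"
    # Show the installed version.
    tesseract --version
    # List the language codes that have data installed (for example eng, deu).
    tesseract --list-langs || echo "Could not list languages; the language data may be missing."
else
    tesseract_status="missing"
    echo "tesseract was not found; check that it is installed and on your PATH."
fi
```

If the language you need is not in the list, a command that uses that language code is likely to fail until its language data is installed.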
All Tesseract commands follow the same basic format:
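As a sketch of that format, the line below follows the usage summary that Tesseract itself prints. Square brackets mark optional parts. The exact spelling of some options is an assumption that depends on your version: recent releases write the page segmentation option as --psm, while older 3.x releases write it as -psm.

```shell
# General shape of a Tesseract command (parts in square brackets are optional).
usage='tesseract imagename outputbase [-l lang] [--psm pagesegmode] [configfile...]'
echo "$usage"

# Print the built-in help to see the options your installed version accepts.
if command -v tesseract >/dev/null 2>&1; then
    tesseract --help || true   # keep going even if this version reports an error here
fi
```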
By adjusting this command, you tell Tesseract how you want it to work. For definitions of each part of the command, see the image below:
Note: As a beginner, you probably won't be using pagesegmode or configfile just yet, so we won't be focusing on those options in this LibGuide. If you're interested in what these can do, check out the ControlParams page on the Tesseract Wiki. For a list of all possible options that can be used with Tesseract, see the Command Line Usage GitHub page.
These are some examples of how to draft a Tesseract command that will work for particular inputs and outputs. They should show you how to draft commands for your own work when using Tesseract.
This will be one of the most basic commands you can perform in Tesseract. Let's say you have an image file called words.tif and you would like to use Tesseract to create a txt file called words.txt. The command for that would look like this:
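A minimal sketch of that command, assuming words.tif sits in the folder you are running the command from. The variable names are only for illustration; typed directly, the command is simply: tesseract words.tif words

```shell
input="words.tif"     # the image file to read
output_base="words"   # output name without an extension; Tesseract adds .txt

# Run Tesseract only if it is installed and the image exists.
if command -v tesseract >/dev/null 2>&1 && [ -f "$input" ]; then
    tesseract "$input" "$output_base"
else
    echo "Would run: tesseract $input $output_base"
fi
```

Note that the output name is given without .txt, because Tesseract adds the extension itself.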
You don't need to add much to this command, because English is the default language and a txt file is the default output.
This one will be a little more complicated. Say you have a document in German called words.png and would like to create a searchable PDF from it. The command for that would look like this:
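A minimal sketch of that command, assuming words.png is in the current folder and the German language data (deu) is installed. The variable names are only for illustration; typed directly, the command is: tesseract words.png words -l deu pdf

```shell
input="words.png"     # the German-language image to read
output_base="words"   # output name without an extension; Tesseract adds .pdf
lang="deu"            # language code for German

# Run Tesseract only if it is installed and the image exists.
if command -v tesseract >/dev/null 2>&1 && [ -f "$input" ]; then
    tesseract "$input" "$output_base" -l "$lang" pdf
else
    echo "Would run: tesseract $input $output_base -l $lang pdf"
fi
```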
To run this command, you have to include a hyphen followed by a lowercase letter L and then the language code [-l deu], which tells the program that the file is in German, and [pdf] (typed in lowercase) to tell the program that the output should be a PDF instead of the default txt file. All PDFs created in Tesseract should be searchable.
Tesseract is different from the other OCR options on this LibGuide because you can configure it and train it to do very specific things. It may be tricky starting out, but once you start playing around with Tesseract, it offers a lot of flexibility.
Tesseract will only take image files for input. These include:
Tesseract has a limited number of file output formats. These include: