Tesseract is an optical character recognition (OCR) engine. It converts images of documents into machine-readable text, which can be saved as searchable PDFs or as plain text that you can edit in a word processor. It is free, open-source software that you run through a command-line interface (CLI). Tesseract is considered one of the most accurate open-source OCR engines currently available, and Google began sponsoring its development in 2006. That said, its capabilities can be more limited than those of commercial software such as Adobe Acrobat Pro and ABBYY FineReader. However, because it is open source, anyone with programming knowledge can modify Tesseract's code or train it to better handle the materials they work with. It runs on Mac, Windows, and Linux machines.
How Tesseract analyzes documents:
Basic OCR Operations in Tesseract:
With the resulting files being editable and searchable, researchers will be able to:
The command-line interface (CLI) is the user's window into the computer's operating system. The user types text-based commands to tell the computer what to do. Before the 1980s and the rise of the graphical user interface (GUI) that we are used to today, most computer users ran their systems through a CLI. Today, however, most computer users know very little about it.
If you're one of those people, don't panic! We've gathered some resources here that will help you learn how to work in the CLI using text-based commands. Learning the basics shouldn't take you more than an hour or two. Knowing how to use your CLI will not only let you run Tesseract, but will also help you learn more about your system and teach you tricks for operating it.
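To give a sense of what a text-based command looks like, here is a minimal sketch in Python that assembles and runs a basic Tesseract command. It assumes Tesseract is installed and available on your system path. The file names `scan.png` and `scan_ocr` and the helper functions `build_tesseract_command` and `run_ocr` are hypothetical, chosen only for illustration. The command it builds follows Tesseract's general pattern of `tesseract <image> <output name> [options]`.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, language="eng", make_pdf=True):
    """Return the argument list for a basic Tesseract run.

    This mirrors typing: tesseract <image> <output_base> -l <language> [pdf]
    """
    command = ["tesseract", image_path, output_base, "-l", language]
    if make_pdf:
        # The "pdf" option asks for a searchable PDF instead of plain text.
        command.append("pdf")
    return command


def run_ocr(image_path, output_base, language="eng", make_pdf=True):
    """Run Tesseract if it is installed and return the expected output path.

    Returns None when the tesseract program cannot be found.
    """
    if shutil.which("tesseract") is None:
        return None
    command = build_tesseract_command(image_path, output_base, language, make_pdf)
    subprocess.run(command, check=True)  # raises an error if Tesseract fails
    return output_base + (".pdf" if make_pdf else ".txt")


if __name__ == "__main__":
    # Show the command that would be typed at the command line.
    print(" ".join(build_tesseract_command("scan.png", "scan_ocr")))
```

Typing the printed command directly into your CLI would have the same effect as calling `run_ocr("scan.png", "scan_ocr")`. The Python wrapper is only a convenience for repeating the same step over many files.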