LibGuides: Managing Digital Resources in Global and Area Studies: OCR Tools

What is OCR?

OCR stands for "optical character recognition" or "optical character reader". OCR software converts images of text (including typed, handwritten, and printed documents) into machine-encoded text; i.e., text that a computer can read, search, and edit. This is especially helpful for global and area studies scholars who need to scan and convert photographs or archival documents, handwritten documents, or even things like posters, television screenshots, and magazine or newspaper articles. Using OCR software, you can convert the text in a photo into a more workable format (such as a PDF or a Word document), and then edit, search, and read it more easily. OCR tools can streamline your research and reading when you work with other languages, less legible historical documents, and non-Roman scripts.

Learn more!

Introduction to OCR and Searchable PDFs
This University of Illinois LibGuide from the Scholarly Commons is a wonderful starting point for getting oriented to optical character recognition (OCR) software and three of the best current options: ABBYY FineReader, Adobe Acrobat Pro, and Tesseract.
Section on ABBYY FineReader
Section on Adobe Acrobat Pro
Section on Tesseract

Tools

ABBYY FineReader
This OCR program converts image documents such as photos, scans, PDFs, and screen captures into file formats that can be edited. You can convert your photos into Word, PowerPoint, or Excel documents, into searchable and editable PDFs, into HTML and plain text files, and more. Of special interest to area studies scholars, Version 15 supports text recognition in 192 languages, with built-in spellcheck for 48.
Adobe Acrobat Pro
This is a well-known software tool that can convert scans, PDFs, and images into editable and searchable documents. It does not have as many language options as ABBYY FineReader, however, which is something to consider if you routinely work with languages it does not cover.
ReadMe
With a free option, ReadMe can be a great choice for working with APIs and OCR technology. It is customizable and works well across languages.
Tesseract
Tesseract is a useful multilingual OCR tool that converts image documents into plain text or searchable PDFs, which you can then edit or search.
Unlike some other tools that are available, Tesseract is run from the command line, so it requires some basic coding knowledge.
This link connects to the GitHub Tesseract code page.
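To give a sense of the "basic coding knowledge" involved, here is a minimal sketch in Python that assembles and runs a Tesseract command. It assumes Tesseract is already installed and on your PATH, and that the language data for the codes you request (for example "rus" for Russian) has been installed alongside it. The function names and the file name poster.jpg are hypothetical, chosen only for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, languages=("eng",), searchable_pdf=False):
    """Return the argument list for a single Tesseract run.

    languages: Tesseract language codes, e.g. ("rus",) or ("rus", "eng").
    searchable_pdf: if True, request a searchable PDF instead of plain text.
    """
    # Tesseract's command line takes: tesseract <image> <output base> -l <lang[+lang...]>
    command = ["tesseract", str(image_path), str(output_base), "-l", "+".join(languages)]
    if searchable_pdf:
        # "pdf" is an output config shipped with Tesseract
        command.append("pdf")
    return command


def run_ocr(image_path, output_base, languages=("eng",), searchable_pdf=False):
    """Run Tesseract on one image and return the path of the file it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("Tesseract is not installed or is not on your PATH.")
    command = build_tesseract_command(image_path, output_base, languages, searchable_pdf)
    subprocess.run(command, check=True)  # raises CalledProcessError if Tesseract fails
    # Tesseract appends the extension to the output base itself
    suffix = ".pdf" if searchable_pdf else ".txt"
    return Path(str(output_base) + suffix)


# Show the command that would be run for a Russian/English poster scan
# (hypothetical file name; nothing is executed here).
print(build_tesseract_command("poster.jpg", "poster", ("rus", "eng"), searchable_pdf=True))
```

Calling run_ocr("poster.jpg", "poster", ("rus", "eng"), searchable_pdf=True) would then write poster.pdf next to your script, provided the image exists and both language packs are installed. Listing more than one language is useful for documents that mix scripts.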
Xodo: All Xodo Tools in One Platform
PDF tools to process digital documents in high quality across devices and platforms.

Language Tools

We have developed a language and area-focused guide to many AI and OCR tools under the "Language Resources" tab in this LibGuide. Please check it out for more resources and language-specific information!