University Library, University of Illinois at Urbana-Champaign

Introduction to OCR and Searchable PDFs

Learn OCR best practices and how to begin an OCR project using ABBYY FineReader, Adobe Acrobat Pro, or Tesseract with this guide.

Activity #1: PDF -> Excel in ABBYY FineReader

This activity will help you familiarize yourself with importing PDFs, provide an introduction to correcting areas, and teach you to export a document as a table.

Step 1: Import PDF Document

  1. Download Activity #1 - On-Campus Student Enrollment.pdf from the Documents box in this LibGuide.
  2. Open ABBYY FineReader PDF 15. It should be on your desktop, or you can find it in the list of programs under the Windows button in the lower left-hand corner.
  3. Once ABBYY FineReader is open, the next step is to locate the document that you would like to work with. Click Open in OCR Editor and navigate to Activity #1 - On-Campus Student Enrollment.pdf to import it.
  4. Once you’ve selected your document, the software should import it and begin analyzing it.

Step 2: Ensure ABBYY Is Recognizing Tables

Helpful guide to colored boxes and what they represent:

Green: Non-table Text

Red: Picture

Blue: Table

  1. Check the areas in the document to ensure that ABBYY FineReader imported the document properly.
    1. Did ABBYY recognize the information as a table?
      1. Hint: is the table within a blue box?
  2. Read through the pages of the document to make sure that everything is recognized correctly. Once everything looks good, you should be ready to save the document to Excel.

Step 3: Output in Excel

  1. Save the document as an Excel document
    1. Hint: Saving/converting a document as a different format can be found in the toolbar.
  2. Open the document in Excel and check your work.
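
One way to support the "check your work" step is a small script that flags cells that should be numbers but are not, since OCR often confuses characters such as O and 0 or l and 1. The sketch below is only an illustration: it assumes you have re-saved the exported spreadsheet as CSV, and the column names and sample rows are invented for this example rather than taken from the activity PDF.

```python
import csv
import io


def find_suspect_cells(csv_text, numeric_columns):
    """Return (row_number, column_name, value) for cells that should be numeric but are not."""
    reader = csv.DictReader(io.StringIO(csv_text))
    suspects = []
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column in numeric_columns:
            value = (row.get(column) or "").strip()
            # Allow thousands separators such as 1,234; empty cells are flagged too.
            if not value.replace(",", "").isdigit():
                suspects.append((row_number, column, value))
    return suspects


# Made-up sample rows with two deliberate OCR-style mistakes (O for 0, l for 1).
sample = """Term,Undergraduate,Graduate
Fall,"31,2O4",10450
Spring,30118,l0221
Summer,4120,2980
"""

for row_number, column, value in find_suspect_cells(sample, ["Undergraduate", "Graduate"]):
    print(f"Row {row_number}, {column}: {value!r} is not a number")
```

A check like this only narrows down where to look. The flagged cells still need to be compared against the original PDF by eye.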

Activity #2: Correcting OCR Result in ABBYY FineReader

The purpose of this activity is to gain practice with correcting the result of OCR performed by ABBYY FineReader.

Step 1: Opening a document

  1. Download Activity #2 – Franklin Roosevelt Letter.jpg from the Documents box in this LibGuide.
  2. When you first open ABBYY FineReader, click “Open in OCR Editor”. This will open a file dialogue. Navigate to the FDR letter you just downloaded. Click Open.
  3. ABBYY will then perform recognition on the image, which should take only a few minutes.
  4. The left side of the screen shows the original image with boxes drawn on top. The right side shows the text that ABBYY recognized. On the right side, text highlighted in blue is text that ABBYY is less confident it recognized correctly.
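
The highlighting of uncertain text can be thought of as a confidence threshold: each recognized word carries a score, and anything below a cutoff is marked for review. The sketch below illustrates that general idea only. It does not describe how ABBYY FineReader computes or stores confidence, and the function name, the 0.80 cutoff, and the sample scores are all assumptions made for the example.

```python
def words_to_review(words, threshold=0.80):
    """Return the words whose confidence falls below the threshold, in reading order."""
    return [word for word, confidence in words if confidence < threshold]


# Made-up (word, confidence) pairs; confidence is on a 0.0 to 1.0 scale.
recognized = [("Very", 0.98), ("sincerely", 0.91), ("yonrs", 0.42), ("Frankl1n", 0.55)]

print(words_to_review(recognized))
```

Raising the threshold marks more words for review, and lowering it marks fewer, which is the same trade-off you face when deciding how much of a document to proofread by hand.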

Step 2: Text Correction and Training

  1. To adjust the recognized text, click in the text boxes on the right side and type in the corrections.
  2. To perform training on the text of the image, click Tools > Options > OCR. Click the Use training to recognize new characters and ligatures radio button.
  3. Now, if you right-click any of the green boxes and click Recognize, ABBYY will ask you whether it is recognizing a particular character correctly.

    In the example screenshot, only part of the letter M is selected. Click the >> button several times to expand the box until it covers the entire letter. Once it does, ABBYY may correctly identify the character as an M. If it does not, type M into the text input. Then click Train.
  4. Continue through the letters in the image until you run out of letters or decide you have trained enough. To stop training, click Close and then Yes to save the changes to the trained OCR pattern.

Step 3: Correcting Boxes

On the left, you’ll notice that Franklin Roosevelt’s signature is in a red box. This means that it is being recognized as an image, not text. If you were to save the document to Microsoft Word, this element would remain an image.

Because this area contains text, we need to mark it as a text area so that ABBYY produces text for it.

  1. Click the red box and delete it. Next, click the text button in the toolbar at the top. Then draw two green rectangles, one around “Very sincerely yours,” and another around FDR’s signature. Finally, right-click each box and click Recognize.
  2. This will generate the text on the right, but you will likely have to type out the name for the signature yourself, because OCR generally recognizes handwriting poorly.

Step 4: Output as PDF

  1. Save the document as a PDF
    1. Hint: Saving/converting a document as a different format can be found in the toolbar.
  2. Open the document in a PDF viewer and check your work.

Activity #3: Play with Sample Documents

Now you have a chance to experiment with ABBYY and its OCR. Download Activity #3, which contains several images you can import into ABBYY FineReader. Import them, see how well ABBYY recognizes the text, make corrections, draw new recognition boxes, and try exporting to different formats.
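
If you want a number for how well the text was recognized, one common measure is the character error rate: the edit distance between the OCR output and a hand-corrected reference, divided by the length of the reference. The sketch below is a minimal illustration using only the Python standard library. The function names are chosen for this example, and the "OCR output" string is invented, not actual ABBYY output.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, reference_text):
    """Edit distance divided by the reference length; 0.0 means a perfect match."""
    if not reference_text:
        raise ValueError("reference_text must not be empty")
    return edit_distance(ocr_text, reference_text) / len(reference_text)


reference = "Very sincerely yours,"   # hand-corrected text
ocr_output = "Very sincere1y yonrs,"  # invented OCR result with two wrong characters

print(f"Character error rate: {character_error_rate(ocr_output, reference):.3f}")
```

Here the two strings differ by two substituted characters out of 21, so the rate is 2/21. Comparing this figure across the sample images gives a rough sense of which kinds of documents are harder to recognize.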