Introduction to OCR and Searchable PDFs

Learn OCR best practices and how to begin an OCR project using ABBYY FineReader, Adobe Acrobat Pro, or Tesseract with this guide.

Activity #1: JPEG -> PDF -> Microsoft Word in Adobe Acrobat Pro

This activity will help you familiarize yourself with the Adobe Acrobat Pro interface. The goal of this exercise is to convert a scanned image into a PDF file, run OCR on it, and then export the file as a Microsoft Word document.

Step 1: Import JPEG file

  1. Open Adobe Acrobat Pro (it should be on your desktop; if not, look for it in the list of programs under the Windows button in the lower left-hand corner).
  2. Once Adobe Acrobat Pro is open, the next step is to locate the document you would like to work with. For this activity, we will use the document titled 'Activity #1- Pride & Prejudice', found in the Activity Document section of this page. Download the document.
  3. Return to Acrobat, and from the File menu select Create > PDF from File. Import the document you just downloaded.
  4. The image will be immediately converted into a non-editable PDF.
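
Because this guide also covers Tesseract, the conversion above can be approximated from a script. The sketch below is only an illustration and is not part of the Acrobat workflow: it builds the `tesseract <image> <output base> -l eng pdf` command, which, unlike the Acrobat step above, writes a PDF that already carries a searchable text layer. The function name `build_tesseract_pdf_command` and the file name `pride_and_prejudice.jpg` are assumptions made for the example, and it presumes the Tesseract command-line tool and its English language data are installed.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_pdf_command(image_path, language="eng"):
    """Return the Tesseract command that turns one image into a searchable PDF."""
    image = Path(image_path)
    # Tesseract appends ".pdf" to the output base itself, so drop the extension.
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", language, "pdf"]


if __name__ == "__main__":
    # Hypothetical file name; substitute the scan you downloaded.
    command = build_tesseract_pdf_command("pride_and_prejudice.jpg")
    print(command)
    # Run the command only when Tesseract is installed and the image exists.
    if shutil.which("tesseract") and Path(command[1]).exists():
        subprocess.run(command, check=True)
```

Keeping the command construction separate from the call to `subprocess.run` makes it easy to inspect what will be executed before any file is written.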

Step 2: Using OCR

  1. You will want to make a few tools visible on your navigation panel at the left of the screen:
    1. Go to View > Show/Hide > Navigation Panes > Order
    2. Go to View > Show/Hide > Navigation Panes > Content
      1. These two tools will let you tag your PDF and set a reading order
  2. On the right-hand toolbar, click ‘Edit PDF’.
  3. Wait while Acrobat runs OCR on the page.
  4. You should now be able to highlight text, and use the edit tools. Play around with the 'Edit Text & Images' tools to familiarize yourself with them.
    1. If you want to create a reading order and tag structure on the page, be sure to click on at least one text box to ensure the text is registered
  5. Click the “Close” button in the upper right to turn off editing
    1. You should now be able to highlight the text, but there will be extra bits from the marginal notes on the page
  6. Click on the icon for Order in the left sidebar (Four boxes with a “Z” connecting them)
    1. Click the Options menu (rectangle with two dots and dashes) and select Show reading order panel
      1. The reading order panel will open
    2. Click the “Clear Page Structure…” button at the bottom of the panel
    3. Click and drag on the page to draw a box over the text
      1. The selected text will then be surrounded by purple boxes
    4. Select what type of text this is in the PDF (a book or chapter title may be Heading 1 while the main body text will be Text/Paragraph)
  7. You can check your work by using Read Out Loud (View > Read Out Loud > Activate Read Out Loud) or by exporting it as a Microsoft Word Document
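
To make the idea of tags and reading order more concrete, here is a small conceptual sketch. It does not use or imitate Acrobat's internals; the `Region` class, its fields, and the coordinates are all hypothetical. It assumes a simple single-column page, where reading order is just top-to-bottom and then left-to-right, and it shows why leaving the marginal notes untagged keeps them out of the result.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A tagged box drawn on the page (coordinates measured from the top-left)."""
    tag: str    # e.g. "Heading 1" or "Text/Paragraph"
    text: str
    top: float
    left: float


def reading_order(regions):
    """Sort tagged regions top-to-bottom, then left-to-right."""
    return sorted(regions, key=lambda r: (r.top, r.left))


# Illustrative regions only; marginal notes are never given a Region,
# so they do not appear in the reading order at all.
page = [
    Region("Text/Paragraph", "It is a truth universally acknowledged...", top=120, left=40),
    Region("Heading 1", "Chapter 1", top=60, left=40),
]

for region in reading_order(page):
    print(f"{region.tag}: {region.text}")
```

A multi-column or heavily illustrated page would need a more careful ordering rule than this sort, which is one reason the order is set by hand in the steps above.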

Step 3: Exporting as a Microsoft Word Document

  1. Once you have familiarized yourself with the Edit Text & Images and Reading Order tools, go to File > Export to.
  2. From the drop down list of options, choose Microsoft Word Document.
  3. Give your document a name and save.
  4. Give the program a few seconds to load, then go to your desktop and open up the document. Notice what did/did not translate correctly, and how time-intensive it would be to fix every oddity or mistake.
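
If you have a trusted transcription of the page, the "what did not translate correctly" check can be roughly quantified. The sketch below uses Python's standard `difflib` module; the function names `ocr_similarity` and `changed_words` and the sample strings are assumptions for illustration, and the similarity ratio is only a rough indicator, not a formal OCR accuracy metric.

```python
import difflib


def ocr_similarity(reference, ocr_text):
    """Return a similarity ratio between 0 and 1 for the two strings."""
    matcher = difflib.SequenceMatcher(None, reference, ocr_text, autojunk=False)
    return matcher.ratio()


def changed_words(reference, ocr_text):
    """List (expected, got) word groups where the OCR output differs."""
    ref_words, ocr_words = reference.split(), ocr_text.split()
    matcher = difflib.SequenceMatcher(None, ref_words, ocr_words, autojunk=False)
    return [
        (" ".join(ref_words[i1:i2]), " ".join(ocr_words[j1:j2]))
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


# Made-up example of a single misread word.
reference = "a truth universally acknowledged"
ocr_output = "a trvth universally acknowledged"
print(changed_words(reference, ocr_output))
```

Listing the differing words gives a sense of how much manual correction an exported document would need.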

Activity #2: Using OCR on Multiple Files at Once in Adobe Acrobat Pro

This activity will show you one of Adobe Acrobat Pro's most useful features: the ability to run OCR on multiple files at once.

Step 1: Downloading Files

  1. Adobe Acrobat Pro can use this feature on multiple file types, including PDFs, JPEGs, and PNGs. For this activity, download the 'Activity #2- Sense & Sensibility' zip file in the Documents section at the top of this page.
  2. Once you download the file, unzip it simply by opening it up. Inside should be three PNG files, named 1, 2, and 3.
  3. Create a new folder in your documents with the name Activity #2.
  4. Move the three PNG files to the Activity #2 folder.
    1. HINT: This is an important step because Adobe Acrobat Pro cannot use files located in zip files.
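
The unzip-and-move steps above can also be scripted. This is a minimal sketch using only Python's standard library; the function name `extract_pngs` and the paths in the usage note are hypothetical and should be replaced with wherever your download actually landed.

```python
import zipfile
from pathlib import Path


def extract_pngs(zip_path, target_dir):
    """Copy every PNG inside the zip archive into target_dir; return the new paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in sorted(archive.namelist()):
            if name.lower().endswith(".png"):
                # Keep only the file name so folders inside the zip are flattened.
                destination = target / Path(name).name
                destination.write_bytes(archive.read(name))
                extracted.append(destination)
    return extracted
```

For example, `extract_pngs("sense_and_sensibility.zip", Path.home() / "Documents" / "Activity #2")` would create the folder if needed and place the PNG files in it, assuming a zip file with that name exists in the current directory.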

Step 2: Importing Files into Adobe Acrobat Pro

  1. Open Adobe Acrobat Pro and go to the ‘Tools’ tab.
  2. Choose the ‘Create PDF’ option, then click ‘Combine Files into a Single PDF’.
  3. Add the files, then press 'Combine'. Once the files have been combined, select 'Edit Text'.
  4. Save the combined file as a PDF in the same folder.
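
For comparison, a similar batch job can be sketched with Tesseract, which this guide also covers. The example below is an illustration under stated assumptions rather than an equivalent of the Acrobat feature: it prepares one `tesseract <image> <output base> -l eng pdf` command per PNG, so it would produce one searchable PDF per page instead of a single combined file. The function name `build_batch_commands` and the folder name in the usage note are hypothetical, and actually running the commands requires Tesseract to be installed.

```python
from pathlib import Path


def build_batch_commands(folder, pattern="*.png", language="eng"):
    """Return one Tesseract command per matching image, in file-name order."""
    commands = []
    for image in sorted(Path(folder).glob(pattern)):
        # Tesseract adds the ".pdf" extension to the output base on its own.
        output_base = image.with_suffix("")
        commands.append(
            ["tesseract", str(image), str(output_base), "-l", language, "pdf"]
        )
    return commands
```

Each returned list can be passed to `subprocess.run(command, check=True)`; for instance, looping over `build_batch_commands("Activity #2")` would process the files 1, 2, and 3 in order.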