
University Library, University of Illinois at Urbana-Champaign

Finding Text Data Sets

About this Guide

This guide explains how to acquire textual data for computational text analysis. Each tab covers text data collections for a different genre of resource. Where the information is available, each entry lists the provider of the collection, the scope of the collection, the data formats available, and how to access the text data.

For more information on text mining methods, tools, and example projects, see the Text Mining Tools and Methods guide. Computational methods can be used to visualize word trends, sort phrases into topics, and see connections between people.

What does it mean if a source says I need to "access the API"?

An Application Programming Interface, or API, is an interface that allows applications to talk to one another. APIs can be used in a variety of ways, including downloading large amounts of data from a website without requiring user input at each step. In this way, a researcher can even download the entire contents of a digital library hands-free. Using an API does require some technical or programming knowledge. Some, but not all, resources in this guide require use of an API to access data.
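
To make the idea of downloading data through an API more concrete, here is a minimal Python sketch. It assumes a hypothetical API at https://api.example.org/v1/documents that returns JSON with a "results" list and accepts "page", "per_page", and "api_key" query parameters. None of these names come from a resource in this guide; a real provider's documentation will specify its own URL, parameters, and response format.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint (an assumption for illustration only).
# Replace it with the URL given in your provider's API documentation.
BASE_URL = "https://api.example.org/v1/documents"


def build_url(base_url, page, per_page=100, api_key=None):
    """Build the request URL for one page of results.

    The parameter names (page, per_page, api_key) are assumptions;
    real APIs often use different names.
    """
    params = {"page": page, "per_page": per_page}
    if api_key:
        params["api_key"] = api_key
    return base_url + "?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """Request a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def download_all(base_url, fetch=fetch_json, per_page=100,
                 api_key=None, max_pages=1000):
    """Yield every record, requesting page after page until one is empty.

    Assumes each response is a JSON object with a "results" list.
    max_pages is a safety limit so the loop always terminates.
    """
    page = 1
    while page <= max_pages:
        payload = fetch(build_url(base_url, page, per_page, api_key))
        results = payload.get("results", [])
        if not results:
            break
        yield from results
        page += 1


if __name__ == "__main__":
    # This only works against a real endpoint that follows the assumed format.
    for record in download_all(BASE_URL):
        print(record)
```

The loop is what makes the download "hands-free": the script keeps requesting the next page until the API returns no more results. Passing the fetch function as an argument is a design choice that lets you try the paging logic with sample data before contacting a real service.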

A good directory of publicly available web APIs you can use is available on GitHub from Todd Motto. It includes APIs that allow you to gather data (textual and other types of data) as well as APIs that allow you to do things (such as post to Twitter or other sites with bots). APIs in this directory are organized by topical areas, and it includes information about whether you need an API key and a link to the API documentation.

What if none of these sources have the text I need?

Worry not! If the text you want to analyze isn't available through any of these sources, Optical Character Recognition (OCR) software can be employed to turn a printed book into machine-readable plain text. See the library guide on OCR, or visit the Scholarly Commons to get started. The Scholarly Commons has scanners and software available to perform OCR, including ABBYY FineReader, one of the best available.

If you have a question about a vendor or resource not listed in this guide, or otherwise need assistance getting your text, please contact the Scholarly Communications and Publishing Department.

Creative Commons License

Except where otherwise indicated, original content in this guide is licensed under a Creative Commons Attribution (CC BY) 4.0 license. You are free to share, adopt, or adapt the materials. We encourage broad adoption of these materials for teaching and other professional development purposes, and invite you to customize them for your own needs.