University Library, University of Illinois at Urbana-Champaign

Finding Text Data Sets

Where to find data sources for computational text analysis

Books from Library Databases

The University of Illinois has contract agreements with several scholarly publishing vendors that permit text mining of their content. These vendors are listed below, along with instructions on how to get started. If you have a question about a vendor not listed below, or otherwise need assistance accessing the data, please contact Spencer Keralis, Digital Humanities Librarian.

ACM  

  • The Association for Computing Machinery’s Digital Library provides research articles, books, conference proceedings, and magazine articles on topics in computer and data science and technology. 

  • Data Mining Instructions: Data mining requests are approved on a case-by-case basis. Contact Spencer Keralis with the University Library to get started. 

Elsevier 

  • Elsevier provides research articles and books focused on fields of science and technology, including engineering, medicine, social science, and GIS. Notable databases include INSPEC, ScienceDirect, Scopus, and Engineering Village.

  • Data Mining Instructions: Access the Elsevier API, and review their data mining policy. Each researcher must create an Elsevier account and register for their own API key. Data delivered in XML format.
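As a rough illustration of how a personal API key is used, the sketch below builds (but does not send) a request for one article in XML. The endpoint path and the `X-ELS-APIKey` header name are assumptions based on Elsevier's developer documentation and should be checked against the current version; the DOI and key shown are placeholders, and `build_article_request` is a hypothetical helper name.

```python
from urllib.parse import quote

# Assumed endpoint for retrieving one article by DOI; confirm in Elsevier's API docs.
ARTICLE_BY_DOI_URL = "https://api.elsevier.com/content/article/doi/"


def build_article_request(doi, api_key):
    """Return the (url, headers) pair for fetching one article as XML.

    Nothing is sent over the network here; pass the result to the HTTP
    client of your choice once you have registered for your own key.
    """
    if not api_key:
        raise ValueError("each researcher needs their own Elsevier API key")
    url = ARTICLE_BY_DOI_URL + quote(doi, safe="/")
    headers = {
        "X-ELS-APIKey": api_key,  # assumed header name for the personal key
        "Accept": "text/xml",     # the guide notes that data is delivered as XML
    }
    return url, headers


if __name__ == "__main__":
    # Placeholder DOI and key, for illustration only.
    url, headers = build_article_request("10.1016/j.example.2020.01.001", "YOUR-API-KEY")
    print(url)
    print(headers["Accept"])
```

Keeping the key in a header rather than in the URL makes it less likely to end up in logs or shared notebooks.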

JSTOR 

  • JSTOR is a collection of research articles and books dating back to the earliest publications in humanities fields, especially language, literature, history, and philosophy. 

  • Data Mining Instructions: JSTOR Data for Research provides an API that can be used to retrieve metadata and reference information for up to 25,000 documents. Researchers who need to conduct full-text analysis or retrieve more than 25,000 documents should contact JSTOR directly to request the data set. Data delivered in PDF, HTML (if available), XML, or JSON format.

Wiley

  • Wiley's AGU Digital Library covers 100 years of earth and space science research and adds books to its collection as they become available.

  • Data Mining Instructions: Use the CrossRef data mining service, which covers thousands of publishers, including Wiley. Review Wiley's terms and conditions for text and data mining before you begin. Data delivered in JSON format.
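A minimal sketch of working with CrossRef metadata follows. It assumes the public REST endpoint `https://api.crossref.org/works/{DOI}` and a response in which full-text links appear under `message.link` with an `intended-application` field; both should be confirmed against CrossRef's documentation. The sample record is hand-written for illustration and is not real CrossRef data, and the function names are hypothetical.

```python
from urllib.parse import quote

# Assumed public endpoint for one work's metadata record.
CROSSREF_WORKS_URL = "https://api.crossref.org/works/"


def works_url(doi):
    """URL of the CrossRef metadata record for one DOI."""
    return CROSSREF_WORKS_URL + quote(doi, safe="/")


def text_mining_links(works_response):
    """Return the full-text URLs that a publisher has flagged for text mining."""
    message = works_response.get("message", {})
    return [
        link["URL"]
        for link in message.get("link", [])
        if link.get("intended-application") == "text-mining" and "URL" in link
    ]


if __name__ == "__main__":
    # Hand-written sample in the assumed response shape.
    sample = {
        "message": {
            "DOI": "10.1002/example.123",
            "link": [
                {"URL": "https://example.org/full.xml",
                 "content-type": "application/xml",
                 "intended-application": "text-mining"},
                {"URL": "https://example.org/landing",
                 "content-type": "text/html",
                 "intended-application": "similarity-checking"},
            ],
        }
    }
    print(works_url("10.1002/example.123"))
    print(text_mining_links(sample))
```

Whether a returned link can actually be downloaded still depends on the publisher's terms and on the institution's license.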

Books from Digital Libraries

Libraries and archives make some digitized content available online that can be used in text analysis. Due to copyright restrictions, most of the available texts were created before the early twentieth century.

HathiTrust

  • The HathiTrust Digital Library is a collection of books, digitized primary sources, images, and more. It focuses on long-term preservation and provides both public domain and in-copyright content from Google, the Internet Archive, and Microsoft. A related organization, the HathiTrust Research Center, provides research support and tools for a variety of research methods.
  • Data Mining Instructions: HathiTrust offers a few different tools to assist in research. The Bibliographic API can be used to retrieve small numbers of bibliographic records. The Data API can be used to retrieve content such as page scans and OCR text. In-copyright works are available only under special contract; otherwise, only public domain works can be retrieved with the Data API. Data delivered in JSON or XML format.
  • HTRC Analytics provides a few computational analysis tools and hosts the portal for the Data Capsule. The Data Capsule is a secure virtual environment for non-consumptive text analysis of HathiTrust Digital Library content, meaning that the analysis cannot be used to reproduce the text. At the end of the analysis, the researcher requests an extraction of the results, and HathiTrust strips out any features that would allow the text to be reproduced. The extracted features datasets are completed examples of this method and are freely available. To get started, create an HTRC Analytics account, then sign up for the Data Capsule.
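To make the Bibliographic API mentioned above more concrete, here is a small sketch. It assumes the URL pattern `https://catalog.hathitrust.org/api/volumes/{brief|full}/{id type}/{id}.json` and a response whose `items` entries carry `htid` and `usRightsString` fields; treat these names as assumptions to verify against HathiTrust's documentation. The sample response is invented for illustration, and the helper names are hypothetical.

```python
# Identifier types assumed to be accepted by the Bibliographic API.
ID_TYPES = {"oclc", "lccn", "issn", "isbn", "htid", "recordnumber"}


def bib_api_url(id_type, identifier, detail="brief"):
    """Build the URL for a single-record lookup (no request is sent)."""
    if id_type not in ID_TYPES:
        raise ValueError(f"unsupported identifier type: {id_type}")
    if detail not in ("brief", "full"):
        raise ValueError("detail must be 'brief' or 'full'")
    return (
        "https://catalog.hathitrust.org/api/volumes/"
        f"{detail}/{id_type}/{identifier}.json"
    )


def full_view_ids(response):
    """Pick out the volume IDs whose US rights are listed as full view."""
    return [
        item["htid"]
        for item in response.get("items", [])
        if item.get("usRightsString") == "Full view" and "htid" in item
    ]


if __name__ == "__main__":
    print(bib_api_url("oclc", "424023"))
    # Invented sample in the assumed response shape.
    sample = {
        "items": [
            {"htid": "abc.0001", "usRightsString": "Full view"},
            {"htid": "abc.0002", "usRightsString": "Limited (search-only)"},
        ]
    }
    print(full_view_ids(sample))
```

Filtering on the rights field first is a simple way to stay within the public domain works that the Data API can return without a special contract.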

Internet Archive

  • The Internet Archive provides digitized print resources and born-digital content, with a special focus on web pages and digitized books. It contains millions of web pages, books, audio and video recordings, images, and even software programs. It aims to provide a quality collection, like one found in a public library, to those who do not have access to such services.
  • Data Mining Instructions: IA suggests using wget to download files from their site in bulk. There is no download limit, but they recommend downloading only 10,000 items per query to prevent errors. Download wget if it is not already installed. Data delivered in XML or JSON format.
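The sketch below shows one way to prepare a list of download URLs that could then be handed to wget (for example with `wget -i`). It assumes the Internet Archive's `advancedsearch.php` endpoint with `q`, `fl[]`, `rows`, and `output` parameters, and a JSON response with identifiers under `response.docs`; these details should be checked against IA's documentation. The query and sample response are illustrative only, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://archive.org/advancedsearch.php"   # assumed search endpoint
DOWNLOAD_URL = "https://archive.org/download/"          # assumed item download prefix
MAX_ITEMS_PER_QUERY = 10_000  # the per-query ceiling recommended in the guide


def build_search_url(query, rows=MAX_ITEMS_PER_QUERY):
    """Build a search URL that asks only for item identifiers, as JSON."""
    if not 1 <= rows <= MAX_ITEMS_PER_QUERY:
        raise ValueError(f"rows must be between 1 and {MAX_ITEMS_PER_QUERY}")
    params = [
        ("q", query),
        ("fl[]", "identifier"),
        ("rows", rows),
        ("output", "json"),
    ]
    return SEARCH_URL + "?" + urlencode(params)


def download_urls(search_response):
    """Turn a search response into one download URL per item."""
    docs = search_response.get("response", {}).get("docs", [])
    return [DOWNLOAD_URL + doc["identifier"] for doc in docs if "identifier" in doc]


if __name__ == "__main__":
    print(build_search_url("mediatype:texts AND subject:botany", rows=50))
    # Invented sample in the assumed response shape.
    sample = {"response": {"docs": [{"identifier": "exampleitem01"}]}}
    print("\n".join(download_urls(sample)))
```

Rejecting requests for more than 10,000 rows keeps each query within the recommended size; larger collections can be split across several narrower queries.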

Text Creation Partnership for EEBO, ECCO, Evans

  • The Text Creation Partnership (TCP) is a coalition of professionals who manually create digital, fully searchable text from content published before 1800. The text is created from works available in the following collections: Eighteenth Century Collections Online (ECCO), Early English Books Online (EEBO), and Evans Early American Imprints. ECCO includes every significant title printed in the UK during the 18th century; EEBO includes books printed before 1700; Evans includes titles printed in what is now the United States between 1639 and 1800.
  • Data Mining Instructions:

Project Gutenberg

  • Project Gutenberg is a volunteer-driven, free digital library that offers over 56,000 free eBooks for public use. They offer works in many languages, but most books are in English. All their eBooks are in the public domain, meaning that the copyright has expired; the newest titles were originally published in 1923.
  • Data Mining Instructions: Project Gutenberg states that they will block any perceived use of automated tools to access their site, with some exceptions. Data delivered in RDF/XML or a compressed folder.

Digital Public Library of America

  • DPLA provides digital access to many collections of “America’s libraries, archives, museums, and other cultural heritage institutions.” Materials include books, photos, audio and video recordings, and other media.
  • Data Mining Instructions: Request an API key to gain access to the DPLA API. Data delivered in JSON-LD format.
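As a hedged sketch of what a DPLA query might look like once a key has been issued, the code below builds an item search URL and pulls titles out of a response. It assumes the `https://api.dp.la/v2/items` endpoint with `q`, `page_size`, and `api_key` parameters, and a response whose `docs` entries hold titles under `sourceResource.title`; check these against DPLA's API documentation. The sample response is invented, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

DPLA_ITEMS_URL = "https://api.dp.la/v2/items"  # assumed item search endpoint


def build_items_url(query, api_key, page_size=10):
    """Build an item search URL (no request is sent)."""
    if not api_key:
        raise ValueError("a DPLA API key is required")
    params = {"q": query, "page_size": page_size, "api_key": api_key}
    return DPLA_ITEMS_URL + "?" + urlencode(params)


def item_titles(response):
    """Collect titles from a response; a title may be a string or a list."""
    titles = []
    for doc in response.get("docs", []):
        title = doc.get("sourceResource", {}).get("title", [])
        if isinstance(title, str):
            title = [title]
        titles.extend(title)
    return titles


if __name__ == "__main__":
    print(build_items_url("text mining", "YOUR-API-KEY", page_size=5))
    # Invented sample in the assumed JSON-LD response shape.
    sample = {
        "docs": [
            {"sourceResource": {"title": "A single title"}},
            {"sourceResource": {"title": ["First title", "Alternate title"]}},
        ]
    }
    print(item_titles(sample))
```

Normalizing string and list titles into one flat list avoids type errors later when the titles are tokenized or counted.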

World Digital Library

  • The World Digital Library, sponsored in part by the Library of Congress, archives digitized historical materials, both texts and images, from across the globe.
  • Data Mining Instructions: Access the WDL API. Data delivered in XML format.

Biodiversity Heritage Library

  • The Biodiversity Heritage Library is an online collection of scientific texts focused on natural history, biology, botany, and other natural sciences. It contains both scholarly journal articles and books.
  • Data Mining Instructions: Request an API key to access the BHL API. Data delivered in JSON or XML format.

Women Writers Online

  • Women Writers Online is the digital library of the Women Writers Project at Northeastern University. The library contains texts of early women's writing in English, from 1526 to 1850.
  • Data Mining Instructions: Review the information on their text database, and email the team at wwp@neu.edu with a brief description of your research plans.