University Library, University of Illinois at Urbana-Champaign

Finding Text Data Sets

Note on Scope

The resources listed here contain only archival materials, such as primary source documents and other digitized collections. If you are looking for contemporary or historical published books, research articles, or newspapers, see the respective pages for those materials.

Digitized Archives from Library Databases

The University of Illinois has contractual agreements with several scholarly publishing vendors that allow text mining. These vendors are listed below, along with instructions on how to get started. If you have a question about a vendor not listed below, or otherwise need assistance accessing the data, please contact the Scholarly Communications and Publishing Department.

Adam Matthew Digital

  • Adam Matthew provides primary source and digitized archival collections related to a range of humanities fields, from varying eras and regions. Notable fields include international studies, history, gender studies, and popular culture. University of Illinois affiliates have access to these archival collections through the University Library.
  • Data Mining Instructions: Review their text and data mining policy, then contact the Scholarly Communications and Publishing Department. Data delivered in JSON format.

Gale 

  • Gale provides digitized primary source archives for text and data mining. Many collections are housed in the UK and are focused on European studies. Notable archives include State Papers Online containing English government documents from the early 16th Century to the 19th Century, The Making of Modern Law containing legal documents from various nations, and Slavery and Anti-Slavery with pamphlets and other primary documents from Europe, the Caribbean, and North America.
  • Data Mining Instructions: Hard drives containing the text files in XML are located in the Scholarly Commons. Contact Scholarly Commons to get started. Review Gale's data mining FAQ here. Original image files are also available in JPG or TIF format. See the spreadsheet below to view all hard drives and additional information.

Digitized Archives From Digital Libraries

Libraries and archives make available online some digitized content that can be used in text analysis. Due to copyright restrictions, the texts available are primarily texts created before the early twentieth century.

HathiTrust

  • The HathiTrust Digital Library is a collection of books, digitized primary sources, images, and more. It focuses on long-term preservation and provides both public domain and in-copyright content from Google, the Internet Archive, and Microsoft. A related organization, the HathiTrust Research Center, provides research support and tools for a variety of research methods.
  • Data Mining Instructions: HathiTrust offers a few different tools to assist in research. The Bibliographic API can be used to retrieve small numbers of bibliographic records. The Data API can be used to retrieve content such as page scans and OCR text. In-copyright works are available under special contract; otherwise, only public domain works can be retrieved with the Data API. Data delivered in JSON or XML format.
  • HTRC Analytics provides a few computational analysis tools and hosts the portal for accessing the Data Capsule. The Data Capsule is a secure virtual environment for non-consumptive text analysis of HathiTrust Digital Library content, meaning that the analysis does not allow the text itself to be reproduced. At the end of the analysis, the researcher requests an extraction of the resulting data, and HathiTrust strips out any features that would allow the text to be reproduced. The extracted features datasets are completed examples of this method and are freely available. To get started, create an HTRC Analytics account, then sign up for the Data Capsule.
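
As a rough illustration of the Bibliographic API mentioned above, the sketch below builds a request URL for a single identifier and collects volume identifiers from a parsed response. The URL pattern and the `items` and `htid` response fields are assumptions based on HathiTrust's public documentation and may change, and the function names are hypothetical. Check the current API documentation before relying on them.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL for the HathiTrust Bibliographic API.
HT_BIB_BASE = "https://catalog.hathitrust.org/api/volumes"


def bib_api_url(id_type, id_value, detail="brief"):
    """Build a request URL for one identifier, e.g. an OCLC number or ISBN."""
    if detail not in ("brief", "full"):
        raise ValueError("detail must be 'brief' or 'full'")
    return f"{HT_BIB_BASE}/{detail}/{id_type}/{quote(str(id_value), safe='')}.json"


def volume_ids(response):
    """Collect volume IDs from a parsed response.

    Assumes the response has an 'items' list whose entries carry an 'htid' key.
    Entries without that key are skipped.
    """
    return [item["htid"] for item in response.get("items", []) if "htid" in item]


def fetch_record(id_type, id_value, detail="brief"):
    """Fetch and parse one record. Requires network access."""
    with urlopen(bib_api_url(id_type, id_value, detail), timeout=30) as resp:
        return json.load(resp)
```

Keeping URL construction separate from the network call makes it possible to check the request format without sending any traffic to HathiTrust.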

Internet Archive

  • The Internet Archive provides digitized print resources and born-digital content, with a special focus on web pages and digitized books. It contains millions of web pages, books, audio and video recordings, images, and even software programs. It aims to provide a quality collection, like one found in a public library, to those who do not have access to such services.
  • Data Mining Instructions: IA suggests using wget to download files from their site in bulk. There is no download limit, but they recommend downloading only 10,000 items per query to prevent errors. Download wget. Data delivered in XML or JSON format.
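
One way to stay within the 10,000-item recommendation above is to split a long list of item identifiers into batches before handing each batch to wget. This is a minimal sketch, assuming the identifiers have already been collected and that items are reachable under `https://archive.org/download/<identifier>`. The function names and output file names are hypothetical.

```python
from urllib.parse import quote

IA_DOWNLOAD_BASE = "https://archive.org/download"  # assumed download URL prefix
MAX_ITEMS_PER_BATCH = 10_000  # the limit recommended in the instructions above


def batches(identifiers, size=MAX_ITEMS_PER_BATCH):
    """Split identifiers into consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    ids = list(identifiers)
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def download_url(identifier):
    """Build the (assumed) download URL for one item identifier."""
    return f"{IA_DOWNLOAD_BASE}/{quote(identifier, safe='')}"


def write_url_lists(identifiers, prefix="itemlist"):
    """Write one URL list file per batch and return the file names."""
    names = []
    for n, batch in enumerate(batches(identifiers), start=1):
        name = f"{prefix}_{n:03d}.txt"
        with open(name, "w", encoding="utf-8") as fh:
            fh.writelines(download_url(i) + "\n" for i in batch)
        names.append(name)
    return names
```

Each resulting file can then be passed to wget through its `-i` (input file) option, one batch at a time, together with whatever recursion options IA's own instructions call for.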

Digital Public Library of America

  • DPLA provides digital access to many collections of “America’s libraries, archives, museums, and other cultural heritage institutions.” Materials include books, photos, audio and video recordings, and other media.
  • Data Mining Instructions: Request an API key to gain access to the DPLA API. Data delivered in JSON-LD format.
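
Once an API key has been issued, a keyword search might look like the sketch below. The endpoint, the `q`, `page`, `page_size`, and `api_key` parameters, and the `docs` and `sourceResource` response fields are assumptions drawn from DPLA's public API documentation. The function names are hypothetical, and `"KEY"` stands in for a real key.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

DPLA_ITEMS_ENDPOINT = "https://api.dp.la/v2/items"  # assumed items endpoint


def dpla_search_url(query, api_key, page=1, page_size=10):
    """Build a keyword-search URL for the items endpoint."""
    params = {"q": query, "page": page, "page_size": page_size, "api_key": api_key}
    return f"{DPLA_ITEMS_ENDPOINT}?{urlencode(params)}"


def titles(response):
    """Collect item titles from a parsed JSON-LD response.

    Assumes each entry in 'docs' may carry sourceResource.title, which can be
    either a single string or a list of strings.
    """
    found = []
    for doc in response.get("docs", []):
        title = doc.get("sourceResource", {}).get("title")
        if isinstance(title, str):
            found.append(title)
        elif isinstance(title, list):
            found.extend(str(t) for t in title)
    return found


def search(query, api_key, **kwargs):
    """Run one search and return the parsed response. Requires network access."""
    with urlopen(dpla_search_url(query, api_key, **kwargs), timeout=30) as resp:
        return json.load(resp)
```

For example, `titles(search("civil war", "KEY"))` would return the titles from the first page of results, provided the key is valid and the response matches the assumed shape.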

World Digital Library 

  • The World Digital Library, sponsored in part by the Library of Congress, archives digitized images of historical materials, both texts and images, from across the globe. 
  • Data Mining Instructions: Access the WDL API. Data delivered in XML format.

Documenting the American South by UNC Libraries

  • DocSouth contains digitized primary materials that offer a uniquely Southern perspective on the American South.
  • Data Mining Instructions: See DocSouth's data page for information on bulk data download and analysis. Data delivered in plain-text and XML format.

Folger Digital Texts 

  • Folger Digital Texts offers the entirety of Shakespeare's plays in machine-readable format.
  • Data Mining Instructions: Access the Folger API.

Canadiana

  • Canadiana is an online archive of digitized collections from Canada’s libraries, museums, and archives. 
  • Data Mining Instructions: Access the Canadiana API. Note: This API is a work in progress. Data delivered in JSON format. 

Perseus Digital Library

  • Perseus Digital Library, out of Tufts University, has created a library of pre-modern texts in machine-readable format.
  • Data Mining Instructions: Download the full library or collections. Data delivered in TEI/XML format.
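
After downloading, a common first step is to pull the running text out of each TEI/XML file. The sketch below is one way to do that with Python's standard library. It assumes the file is well-formed XML with no undeclared entities and that the text sits inside a `body` element, with or without a TEI namespace. The function names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix that ElementTree adds to a tag name."""
    return tag.rsplit("}", 1)[-1]


def tei_body_text(xml_string):
    """Return the text of the first <body> element with whitespace collapsed.

    Inline markup is dropped and its text is joined without extra spaces, so
    adjacent elements with no whitespace between them will run together.
    Returns an empty string when no <body> element is found.
    """
    root = ET.fromstring(xml_string)
    for element in root.iter():
        if local_name(element.tag) == "body":
            return " ".join("".join(element.itertext()).split())
    return ""


sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    "<teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>"
    "<text><body><p>Arma <hi>virumque</hi>\n cano</p></body></text>"
    "</TEI>"
)
print(tei_body_text(sample))
```

Matching on the local tag name rather than a fixed namespace lets the same function handle files that declare the TEI namespace and those that do not.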