LibGuides: Finding Text Data Sets: Digitized Archives

Note on Scope

The resources listed here strictly contain archival materials like primary source documents and other digitized collections. If you are looking for contemporary or historical published books, research articles, or newspapers, see those respective pages.

Digitized Archives from Library Databases

The University of Illinois has contract agreements with several scholarly publishing vendors that permit text mining of their content. These vendors are listed below, along with instructions on how to get started. If you have a question about a vendor not listed below, or otherwise need assistance accessing the data, please contact Scholarly Communication and Publishing.

The following pages link directly to the scholarly publishing databases and do not automatically sign you in through the University of Illinois' institutional login. To log in, either use the database's "Institutional Login" page or find the database on the library's A-Z Databases page.

Adam Matthew Digital

Adam Matthew provides primary source and digitized archival collections related to a range of humanities fields, from varying eras and regions. Notable fields include international studies, history, gender studies, and popular culture. University of Illinois affiliates have access to these archival collections through the University Library.
Data Mining Instructions: Review their text and data mining policy, then contact Scholarly Communication and Publishing. Data delivered in JSON format.

Gale

Gale provides digitized primary source archives for text and data mining. Many collections are housed in the UK and are focused on European studies. Notable archives include State Papers Online, which contains English government documents from the early 16th century to the 19th century; The Making of Modern Law, which contains legal documents from various nations; and Slavery and Anti-Slavery, which contains pamphlets and other primary documents from Europe, the Caribbean, and North America.
Data Mining Instructions: The primary access point for data mining Gale resources is through Gale's Digital Scholar Lab or through U of I's purchased data collections. For access to purchased data collections, please send your request via this form.

Other Purchased Data Collections

The library offers access to purchased collections of historic newspapers, digitized primary sources, and data from a variety of publishers. Highlights include Eighteenth Century Collections Online, 19th Century U.K. Periodicals: Empire, and the African American Newspapers Collection.
Data Mining Instructions:
- Check out our menu of available collections in this spreadsheet.
- These ReadMes have more detailed descriptions of the collection contents and their file structure.
- Request access through this form.

Digitized Archives From Digital Libraries

Libraries and archives make some digitized content available online for use in text analysis. Due to copyright restrictions, the available texts were primarily created before the early twentieth century.

HathiTrust

The HathiTrust Digital Library is a collection of books, digitized primary sources, images, and more. It focuses on long-term preservation and provides both public domain and in-copyright content from Google, the Internet Archive, and Microsoft. A related organization, the HathiTrust Research Center, provides research support and tools for a variety of research methods.
Data Mining Instructions: HathiTrust offers a few different tools to assist in research. The Bibliographic API can be used to retrieve small numbers of bibliographic records. You can request access to public domain works and research datasets through the HathiTrust Research Datasets page. In-copyright works are available only under special contract; otherwise, only public domain works can be retrieved with the Data API. Data delivered in JSON or XML format.
HTRC Analytics provides a few computational analysis tools and contains the portal to the Data Capsule. The Data Capsule is a secure virtual environment for non-consumptive text analysis of HathiTrust Digital Library content, meaning that the original text cannot be reproduced from the results. When using the Data Capsule, the researcher requests an extraction of data at the end of the analysis, and HathiTrust strips the data of features that would allow the text to be reproduced. The Extracted Features datasets are completed examples of this method and are freely available. To get started, create an HTRC Analytics account, then sign up for the Data Capsule.
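As a rough illustration of the Bibliographic API mentioned above, the following Python sketch builds a request URL for a single record. The URL pattern and the identifier types reflect HathiTrust's public documentation as we understand it and are not taken from this guide; the function names are hypothetical, so check the current documentation before relying on them.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL and identifier types for the Bibliographic API.
HATHITRUST_BIB_API = "https://catalog.hathitrust.org/api/volumes"
ID_TYPES = {"oclc", "lccn", "issn", "isbn", "htid", "recordnumber"}


def hathitrust_bib_url(id_type, identifier, detail="brief"):
    """Build a Bibliographic API URL for one identifier (hypothetical helper)."""
    if detail not in ("brief", "full"):
        raise ValueError("detail must be 'brief' or 'full'")
    if id_type not in ID_TYPES:
        raise ValueError(f"unsupported identifier type: {id_type}")
    # Leave ':' and '/' unescaped because some HathiTrust IDs contain them.
    safe_id = quote(str(identifier).strip(), safe=":/")
    return f"{HATHITRUST_BIB_API}/{detail}/{id_type}/{safe_id}.json"


def fetch_bib_record(id_type, identifier, detail="brief"):
    """Download and decode one record as JSON (requires network access)."""
    with urlopen(hathitrust_bib_url(id_type, identifier, detail)) as response:
        return json.load(response)
```

For example, `hathitrust_bib_url("oclc", "424023")` produces a URL that can be opened in a browser to inspect the JSON before writing any further code.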

Internet Archive

The Internet Archive provides digitized print resources and born-digital content, with a special focus on web pages and digitized books. It contains millions of web pages, books, audio and video recordings, images, and even software programs. It aims to provide a quality collection, like one found in a public library, to those who do not have access to such services.
Data Mining Instructions: IA suggests using wget to download files from their site in bulk. There is no download limit, but they recommend downloading only 10,000 items per query to prevent errors. Download wget. Data delivered in XML or JSON format.
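The guide recommends wget; as a complementary sketch in Python, the code below builds a search URL that lists item identifiers and splits a long identifier list into batches of at most 10,000, matching the recommendation above. The `advancedsearch.php` endpoint, its parameters, and the `/download/` URL pattern are assumptions based on the Internet Archive's public site rather than this guide, and the function names are hypothetical.

```python
from urllib.parse import quote, urlencode

# Assumed endpoints on archive.org; verify against IA's own documentation.
IA_SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"
IA_DOWNLOAD_BASE = "https://archive.org/download"


def ia_search_url(query, rows=10_000, page=1):
    """Build a search URL that returns matching item identifiers as JSON."""
    params = [
        ("q", query),
        ("fl[]", "identifier"),  # return only the identifier field
        ("rows", rows),
        ("page", page),
        ("output", "json"),
    ]
    return f"{IA_SEARCH_ENDPOINT}?{urlencode(params)}"


def ia_download_url(identifier):
    """Build the download-directory URL for one item (for wget or a browser)."""
    return f"{IA_DOWNLOAD_BASE}/{quote(identifier)}"


def batches(items, size=10_000):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each batch of identifiers can be written to a text file, one download URL per line, and passed to wget.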

Digital Public Library of America

DPLA provides digital access to many collections of “America’s libraries, archives, museums, and other cultural heritage institutions.” Materials include books, photos, audio and video recordings, and other media.
Data Mining Instructions: Request an API key to gain access to the DPLA API. Data delivered in JSON-LD format.
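Once you have an API key, a request to the DPLA API might look like the following Python sketch. The endpoint, the parameter names (`q`, `page`, `page_size`, `api_key`), and the `docs` / `sourceResource` / `title` response fields are assumptions based on DPLA's public API documentation rather than this guide; the function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed items endpoint for version 2 of the DPLA API.
DPLA_ITEMS_ENDPOINT = "https://api.dp.la/v2/items"


def dpla_items_url(query, api_key, page=1, page_size=100):
    """Build a keyword-search URL for DPLA item records."""
    params = {
        "q": query,
        "page": page,
        "page_size": page_size,
        "api_key": api_key,
    }
    return f"{DPLA_ITEMS_ENDPOINT}?{urlencode(params)}"


def titles_from_response(payload):
    """Collect item titles from a decoded JSON response.

    Assumes each record stores its title under sourceResource.title,
    which may be either a single string or a list of strings.
    """
    titles = []
    for doc in payload.get("docs", []):
        title = doc.get("sourceResource", {}).get("title")
        if isinstance(title, list):
            titles.extend(title)
        elif title:
            titles.append(title)
    return titles
```

Keeping the URL construction separate from the download step makes it easy to check a query in a browser before requesting many pages of results.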

World Digital Library

The World Digital Library, sponsored in part by the Library of Congress, archives digitized images of historical materials, both texts and images, from across the globe.
Data Mining Instructions: Access the WDL API. Data delivered in XML format.

Documenting the American South by UNC Libraries

DocSouth contains digitized primary materials that offer a uniquely Southern perspective on the American South.
Data Mining Instructions: See DocSouth's data page for information on bulk data download and analysis. Data delivered in plain-text and XML format.

Folger Digital Texts

Folger Digital Texts offers all of Shakespeare's plays in a machine-readable format.
Data Mining Instructions: Access the Folger API.

Canadiana

Canadiana is an online archive of digitized collections from Canada’s libraries, museums, and archives.
Data Mining Instructions: Access the Canadiana API. Note: This API is a work in progress. Data delivered in JSON format.

Perseus Digital Library

Perseus Digital Library, based at Tufts University, has created a library of pre-modern texts in machine-readable format.
Data Mining Instructions: Download the full library or collections. Data delivered in TEI/XML format.
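Because TEI/XML wraps the text in markup, a common first step is to pull the plain text out of the document body. The Python sketch below does this with the standard library. The sample document is invented for illustration and is not a Perseus file; real files may use different elements or no XML namespace, so the code matches tags by their local name only.

```python
import xml.etree.ElementTree as ET

# Invented sample for illustration only; not taken from Perseus.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body><div>
    <l>First line of verse</l>
    <l>Second line of verse</l>
  </div></body></text>
</TEI>"""


def local_name(tag):
    """Strip an XML namespace such as '{http://...}body' down to 'body'."""
    return tag.rsplit("}", 1)[-1]


def tei_body_text(xml_string):
    """Return the text inside the first <body> element, whitespace-normalized."""
    root = ET.fromstring(xml_string)
    for element in root.iter():
        if local_name(element.tag) == "body":
            # Join all text fragments, then collapse runs of whitespace.
            return " ".join(" ".join(element.itertext()).split())
    return ""


print(tei_body_text(SAMPLE_TEI))
```

Text in the header, such as the title, is skipped because only the body element is read.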