Libraries and archives make some digitized content available online for use in text analysis. Because of copyright restrictions, most of the available texts were created before the early twentieth century.
HathiTrust
- The HathiTrust Digital Library is a collection of books, digitized primary sources, images, and more. It focuses on long-term preservation and provides both public domain and in-copyright content from Google, the Internet Archive, and Microsoft. A related organization, the HathiTrust Research Center, provides research support and tools for a variety of research methods.
- Data Mining Instructions: HathiTrust offers a few different tools to assist in research. The Bibliographic API can be used to retrieve small numbers of bibliographic records. The Data API can be used to retrieve content such as page scans and OCR text. In-copyright works are available under special contract; otherwise, only public domain works can be retrieved with the Data API. Data delivered in JSON or XML format.
- HTRC Analytics provides a few computational analysis tools and contains the portal to access the Data Capsule. The Data Capsule is a secure virtual environment that can be used for non-consumptive text analysis of HathiTrust Digital Library content, meaning that the text cannot be reproduced from the results. When using the Data Capsule, the researcher requests an extraction of data at the end of the analysis. HathiTrust will strip the data of features that would allow the text to be reproduced. The extracted features datasets are completed examples of this method, and are freely available. To get started, create an HTRC Analytics account, then sign up for the Data Capsule.
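The idea behind non-consumptive extraction can be illustrated with a small sketch: each page is reduced to token counts, which support many kinds of analysis but discard the word order needed to reconstruct the text. The function name `page_token_counts` and the sample pages are hypothetical, and this simplified structure is an assumption for illustration only, not the actual HTRC extracted features schema.

```python
import re
from collections import Counter


def page_token_counts(pages):
    """Return one {token: count} mapping per page, discarding word order."""
    features = []
    for page_text in pages:
        # Lowercase the page and treat runs of letters/apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", page_text.lower())
        features.append(dict(Counter(tokens)))
    return features


# Hypothetical two-page volume, used only for illustration.
sample_pages = [
    "It was the best of times, it was the worst of times.",
    "It was the age of wisdom.",
]

counts = page_token_counts(sample_pages)
for page_number, page_counts in enumerate(counts, start=1):
    print(page_number, page_counts)
```

Counts like these are enough for word-frequency comparisons or topic modeling, yet the original sentences cannot be rebuilt from them.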
Internet Archive
- The Internet Archive provides digitized print resources and born-digital content, with a special focus on web pages and digitized books. It contains millions of web pages, books, audio and video recordings, images, and even software programs. It aims to provide a quality collection, like one found in a public library, to those who do not have access to such services.
- Data Mining Instructions: IA suggests using wget to download files from their site in bulk. There is no download limit, but they recommend downloading only 10,000 items per query to prevent errors. Download and install wget before starting. Data delivered in XML or JSON format.
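A bulk download usually starts from a list of item identifiers. The sketch below builds a search URL for the Internet Archive's advanced search endpoint and turns identifiers into download URLs, capping each query at the 10,000 items recommended above. The function names, the example query, and the identifiers are hypothetical, and the parameter choices are assumptions that should be checked against IA's current documentation.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"
DOWNLOAD_BASE = "https://archive.org/download/"
MAX_ROWS = 10_000  # recommended per-query ceiling mentioned in the guide


def build_search_url(query, rows=100):
    """Build a search URL that asks only for item identifiers, as JSON."""
    params = [
        ("q", query),
        ("fl[]", "identifier"),
        ("rows", min(rows, MAX_ROWS)),
        ("output", "json"),
    ]
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def item_urls(identifiers):
    """Turn item identifiers into download URLs, one per item."""
    return [DOWNLOAD_BASE + identifier for identifier in identifiers]


# Hypothetical query and identifiers, for illustration only.
print(build_search_url("mediatype:texts AND subject:botany", rows=50_000))
print(item_urls(["example-item-one", "example-item-two"]))
```

The resulting URLs could be written one per line to a text file and passed to `wget -i`; IA's own bulk-download instructions describe the flags they recommend.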
Text Creation Partnership for EEBO, ECCO, Evans
- The Text Creation Partnership (TCP) is a coalition of professionals who manually create digital, fully searchable text from content published before 1800. The text is created from works available in the following collections: Eighteenth Century Collections Online (ECCO), Early English Books Online (EEBO), and Evans Early American Imprints. ECCO includes every significant title printed in the UK during the 18th century; EEBO includes books printed before 1700; Evans includes titles printed in what is now the United States between 1639 and 1800.
- Data Mining Instructions:
Project Gutenberg
- Project Gutenberg is a volunteer-driven, free digital library that offers over 56,000 free eBooks for public use. It offers works in many languages, but most books are in English. All of its eBooks are in the public domain, meaning that their copyright has expired and that the newest titles were originally published in 1923.
- Data Mining Instructions: Project Gutenberg states that they will block any perceived use of automated tools to access their site, with some exceptions. Data delivered in RDF/XML or a compressed folder.
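Because automated access to the site may be blocked, catalog records are best processed from a local copy. The sketch below reads an ebook identifier and title from an RDF/XML record using Python's standard library. The sample record is hand-written and heavily simplified, and the function name `parse_ebook_record` is hypothetical; real catalog records contain many more fields and deeper nesting.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}

# Hand-written, simplified record; an assumption for illustration only.
SAMPLE_RDF = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/">
  <pgterms:ebook rdf:about="ebooks/1342">
    <dcterms:title>Pride and Prejudice</dcterms:title>
  </pgterms:ebook>
</rdf:RDF>
"""


def parse_ebook_record(rdf_text):
    """Extract the ebook number and title from one RDF/XML record."""
    root = ET.fromstring(rdf_text.strip())
    ebook = root.find("pgterms:ebook", NAMESPACES)
    # Attributes use the fully qualified {namespace}name form.
    about = ebook.get("{%s}about" % NAMESPACES["rdf"])
    title = ebook.findtext("dcterms:title", namespaces=NAMESPACES)
    return {"id": about.rsplit("/", 1)[-1], "title": title}


record = parse_ebook_record(SAMPLE_RDF)
print(record)
```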
Digital Public Library of America
- DPLA provides digital access to many collections of “America’s libraries, archives, museums, and other cultural heritage institutions.” Materials include books, photos, audio and video recordings, and other media.
- Data Mining Instructions: Request an API key to gain access to the DPLA API. Data delivered in JSON-LD format.
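Once a key has been issued, a search request is an ordinary URL with the key attached. The sketch below builds such a URL for the items endpoint and pulls titles out of a parsed response. The function names, the placeholder key, and the sample response are hypothetical; the response shape (`docs` containing `sourceResource` records) is an assumption that should be confirmed against the DPLA API documentation.

```python
from urllib.parse import urlencode

DPLA_ITEMS_ENDPOINT = "https://api.dp.la/v2/items"


def build_dpla_items_url(search_terms, api_key, page_size=10):
    """Build a keyword search URL for the DPLA items endpoint."""
    params = {"q": search_terms, "page_size": page_size, "api_key": api_key}
    return DPLA_ITEMS_ENDPOINT + "?" + urlencode(params)


def extract_titles(response):
    """Collect titles from a parsed JSON-LD response (assumed shape)."""
    titles = []
    for doc in response.get("docs", []):
        title = doc.get("sourceResource", {}).get("title")
        # A title may be a single string or a list of strings.
        if isinstance(title, list):
            titles.extend(title)
        elif title is not None:
            titles.append(title)
    return titles


# Hand-written stand-in for a parsed response; not real DPLA data.
sample_response = {
    "docs": [
        {"sourceResource": {"title": "Example Atlas"}},
        {"sourceResource": {"title": ["Example Letters", "Volume 1"]}},
    ]
}

print(build_dpla_items_url("whaling logbooks", "YOUR_API_KEY"))
print(extract_titles(sample_response))
```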
World Digital Library
- The World Digital Library, sponsored in part by the Library of Congress, archives digitized images of historical materials, both texts and images, from across the globe.
- Data Mining Instructions: Access the WDL API. Data delivered in XML format.
Biodiversity Heritage Library
- The Biodiversity Heritage Library is an online collection of scientific texts focused on natural history, biology, botany, and other natural sciences. It contains both scholarly journal articles and books.
- Data Mining Instructions: Request an API key to access the BHL API. Data delivered in JSON or XML format.
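Since the BHL API can return either JSON or XML, it helps to make the format an explicit choice when building a request. The sketch below does this for a publication search. The function name is hypothetical, and the endpoint path and the parameter names (`op`, `searchterm`, `apikey`, `format`) are assumptions based on version 3 of the API; verify them against the current BHL documentation before use.

```python
from urllib.parse import urlencode

# Assumed version 3 endpoint; confirm against the BHL API documentation.
BHL_ENDPOINT = "https://www.biodiversitylibrary.org/api3"
SUPPORTED_FORMATS = {"json", "xml"}


def build_bhl_search_url(search_term, api_key, response_format="json"):
    """Build a publication search URL, validating the response format."""
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError("response_format must be 'json' or 'xml'")
    params = {
        "op": "PublicationSearch",
        "searchterm": search_term,
        "apikey": api_key,
        "format": response_format,
    }
    return BHL_ENDPOINT + "?" + urlencode(params)


# Placeholder key and search term, for illustration only.
print(build_bhl_search_url("orchids", "YOUR_API_KEY"))
print(build_bhl_search_url("orchids", "YOUR_API_KEY", response_format="xml"))
```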
Women Writers Online
- Women Writers Online is the digital library of the Women Writers Project out of Northeastern University. The library contains text of early women's writing in English, from 1526 to 1850.
- Data Mining Instructions: Review the information on their text database, and email the team at wwp@neu.edu with a brief description of your research plans.