University Library, University of Illinois at Urbana-Champaign

Finding Text Data Sets

Where to find data sources for computational text analysis

Newspapers from Library Databases

The University of Illinois has contract agreements with several scholarly publishing vendors that allow researchers to conduct text mining on vendor content. These vendors are listed below, along with instructions on how to get started. If you have a question about a vendor not listed below, or otherwise need assistance accessing the data, please contact Spencer Keralis, Digital Humanities Librarian.

ProQuest

  • ProQuest Historical Newspapers includes digitized newspapers from a variety of time periods and regions, with collections like the Chinese Newspapers Collection and The New York Times. See more collections in ProQuest Historical Newspapers.
  • Data Mining Instructions: The library will provide text data access to the researcher through a secure link. Contact Daniel Tracy with the University Library to get started. 

Gale 

  • Gale provides newspaper and magazine archives for text mining. Archives from the early 19th century to today are available. Notable collections include the Daily Mail Historical Archive, the Financial Times Historical Archive, and various other English and American newspapers and periodicals.
  • Data Mining Instructions: The researcher will receive a hard drive that contains the text files in XML format. Original image files are also available in JPG or TIF format. Contact Spencer Keralis with the University Library to get started. Review Gale's data mining FAQ here. See the spreadsheet below to view all hard drives and additional information.
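
As a rough sketch of a first step with XML text files like these, the Python below pulls the plain text out of one document using only the standard library. The element names in the sample (`article`, `title`, `text`) are hypothetical and chosen for illustration; the schema of the files on the drive may differ.

```python
import xml.etree.ElementTree as ET


def xml_to_text(xml_string):
    """Return all text in an XML document as one whitespace-trimmed string."""
    root = ET.fromstring(xml_string)
    # itertext() walks every element in document order and yields its text
    pieces = (piece.strip() for piece in root.itertext())
    return " ".join(piece for piece in pieces if piece)


# Hypothetical sample; real files will use the vendor's own element names.
sample = """<article>
  <title>Market Report</title>
  <text>Prices rose <b>sharply</b> on Tuesday.</text>
</article>"""

print(xml_to_text(sample))
```

For a file on disk, `ET.parse(path).getroot()` can stand in for `ET.fromstring`, and the same `itertext()` loop applies.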

Newspapers from Digital Libraries

Listed below are digital libraries that offer digitized versions of newspapers. Some are not in machine-readable format. For information on making scanned images machine readable, see the library guide on Optical Character Recognition.

Chronicling America

  • The Chronicling America archive catalogs all known newspapers published in the United States from 1690 until today and provides digital copies of many of them. Some publications are not available digitally; for those items, the database provides complete metadata.
  • Data Mining Instructions: Access the Chronicling America API. Data delivered in HTML or JSON format.
  • See other APIs and machine interfaces that can be used with Library of Congress content.
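
A minimal sketch of requesting JSON rather than HTML might look like the following. It assumes the Library of Congress convention of adding `fo=json` to a search URL; the endpoint path, the `q` and `sp` parameters, and the `results`, `date`, and `title` keys are all assumptions to confirm against the current API documentation. The sample response is invented so the example runs without network access.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint; confirm against the current Library of Congress API documentation.
BASE_URL = "https://www.loc.gov/collections/chronicling-america/"


def build_search_url(term, page=1):
    """Build a search URL that asks for JSON instead of HTML."""
    params = {"q": term, "fo": "json", "sp": page}
    return BASE_URL + "?" + urlencode(params)


def summarize_results(payload):
    """Return (date, title) pairs from a decoded JSON response.

    The "results", "date", and "title" keys are assumptions about the response shape.
    """
    return [(item.get("date"), item.get("title")) for item in payload.get("results", [])]


# Hypothetical response body, not real data from the archive.
sample_body = '{"results": [{"date": "1905-04-12", "title": "The Daily Example"}]}'

print(build_search_url("suffrage"))
print(summarize_results(json.loads(sample_body)))
```

In practice the URL would be fetched with a tool such as `urllib.request.urlopen`, and the decoded body passed to `summarize_results`.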

Europeana Digital Newspapers

  • Europeana provides digital images of artifacts currently held in museums, libraries, and archives across Europe. 
  • Data Mining Instructions: Request an API key to access the Europeana Newspapers API. Please note that some newspapers provide metadata only, and do not have full-text for text mining purposes.
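
Once a key has been issued, one common pattern is to keep it out of the script and read it from the environment. The sketch below illustrates that idea only: the endpoint and the `wskey`, `query`, and `rows` parameter names are assumptions based on Europeana's general search interface and may not match the Newspapers API exactly, and `EUROPEANA_API_KEY` is a hypothetical variable name.

```python
import os
from urllib.parse import urlencode

# Assumed endpoint and parameter names (wskey, query, rows); check the Europeana API docs.
SEARCH_URL = "https://api.europeana.eu/record/v2/search.json"


def build_europeana_url(query, rows=10, api_key=None):
    """Build a search URL, reading the key from the environment if none is passed."""
    # EUROPEANA_API_KEY is a hypothetical variable name chosen for this example.
    key = api_key or os.environ.get("EUROPEANA_API_KEY")
    if not key:
        raise ValueError("No API key found; request one from Europeana first.")
    return SEARCH_URL + "?" + urlencode({"wskey": key, "query": query, "rows": rows})


print(build_europeana_url("newspaper", rows=5, api_key="YOUR_KEY"))
```

Because some newspapers provide metadata only, it is worth inspecting a few returned records before assuming full text is present.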

Digital Public Library of America

  • DPLA provides digital access to many collections of “America’s libraries, archives, museums, and other cultural heritage institutions.” Materials include books, photos, audio and video recordings, and other media.
  • Data Mining Instructions: Request an API key to gain access to the DPLA API. Data delivered in JSON-LD format.
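
The JSON-LD returned by the API can be read with any JSON parser. The sketch below shows one way to pull identifiers and titles out of a response; it assumes a `docs` list whose entries carry a `sourceResource` object, and the sample body is invented for illustration rather than taken from DPLA.

```python
import json

# Hypothetical response body; the "docs" and "sourceResource" keys are assumptions.
sample_body = """{
  "count": 1,
  "docs": [
    {"id": "abc123", "sourceResource": {"title": "Example Gazette"}}
  ]
}"""


def extract_titles(payload):
    """Return a list of (id, title) pairs from a decoded DPLA-style response."""
    rows = []
    for doc in payload.get("docs", []):
        title = doc.get("sourceResource", {}).get("title")
        # Titles may arrive as a single string or as a list of strings.
        if isinstance(title, list):
            title = "; ".join(title)
        rows.append((doc.get("id"), title))
    return rows


print(extract_titles(json.loads(sample_body)))
```

Using `.get()` with defaults keeps the loop from failing on records that lack a title, which is common in aggregated metadata.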

Internet Archive

  • The Internet Archive provides digitized print resources and born-digital content, with a special focus on web pages and digitized books. It contains millions of web pages, books, audio and video recordings, images, and even software programs. It aims to provide a quality collection, like one that would be found in a public library, to those who do not have access to such services.
  • Data Mining Instructions: IA suggests using wget to download files from their site in bulk. There is no download limit, but they recommend downloading no more than 10,000 items per query to prevent errors. Download wget. Data delivered in XML or JSON format.
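
To stay within the recommended 10,000 items per query, a long list of item identifiers can be split into batches before downloading. The sketch below is one way to do that; the identifiers are placeholders, the file naming is hypothetical, and the `https://archive.org/download/<identifier>` URL pattern is an assumption to check against IA's own instructions.

```python
def to_download_url(identifier):
    """Build a download URL for one item (assumed URL pattern)."""
    return f"https://archive.org/download/{identifier}"


def batch_identifiers(identifiers, batch_size=10_000):
    """Split a list of item identifiers into batches of at most batch_size."""
    return [identifiers[i:i + batch_size] for i in range(0, len(identifiers), batch_size)]


def write_item_lists(identifiers, prefix="itemlist", batch_size=10_000):
    """Write one text file per batch, one URL per line, and return the file names."""
    names = []
    for n, batch in enumerate(batch_identifiers(identifiers, batch_size), start=1):
        name = f"{prefix}_{n}.txt"
        with open(name, "w", encoding="utf-8") as handle:
            handle.write("\n".join(to_download_url(i) for i in batch) + "\n")
        names.append(name)
    return names


# Placeholder identifiers standing in for the results of a real search.
ids = [f"example_item_{i}" for i in range(25_000)]
print([len(batch) for batch in batch_identifiers(ids)])
```

Each file written by `write_item_lists` can then be handed to wget through its `-i` (input file) option, one batch at a time.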

Newspapers Direct from Publisher

New York Times

  • The New York Times keeps archives of the newspaper’s past issues dating back to 1851.
  • Data Mining Instructions: NYT offers many APIs to retrieve content from their history of publications. Each API retrieves specific information, such as most popular stories, community comments, book reviews from the NYT bestseller lists, and specific articles. Request an API key to get started. Data delivered in JSON format.
  • Note: More recent articles are only accessible at cost. Those published prior to 1922 are free for all to download.
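
Since articles published prior to 1922 are free to download, a query can be limited to that range. The sketch below builds such a request for the Article Search API; the endpoint and the `q`, `begin_date`, `end_date`, and `api-key` parameter names are assumptions to verify in the NYT developer documentation, and `YOUR_KEY` is a placeholder.

```python
from datetime import date
from urllib.parse import urlencode

# Assumed endpoint and parameter names; verify in the NYT developer documentation.
ARTICLE_SEARCH_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_nyt_url(query, api_key, start=date(1851, 1, 1), end=date(1921, 12, 31)):
    """Build a search URL restricted to a date range (default: 1851 through 1921)."""
    params = {
        "q": query,
        "begin_date": start.strftime("%Y%m%d"),
        "end_date": end.strftime("%Y%m%d"),
        "api-key": api_key,
    }
    return ARTICLE_SEARCH_URL + "?" + urlencode(params)


print(build_nyt_url("railroad strike", api_key="YOUR_KEY"))
```

The default dates reflect the archive's 1851 start and the 1922 cutoff mentioned above; both can be overridden per call.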