University Library, University of Illinois at Urbana-Champaign

Finding Text Data Sets

Newspapers from Library Databases

The University of Illinois has contract agreements with several scholarly publishing vendors to conduct text mining. These vendors are listed below, along with instructions on how to get started. If you have a question about a vendor not listed below, or otherwise need assistance accessing the data, please contact the Scholarly Communications and Publishing Department.

ProQuest

Gale 

  • Gale provides newspaper and magazine archives for text mining. Archives from the early 19th century to today are available. Notable collections include the Daily Mail Historical Archive, the Financial Times Historical Archive, and various other English and American newspapers and periodicals.
  • Data Mining Instructions: The library will provide text data access to the researcher via hard drives located in the Scholarly Commons. Please contact the Scholarly Commons to arrange access. Original image files are also available in JPG or TIF format. See the spreadsheet below to view all hard drives and additional information.

Accessible Archives

  • Accessible Archives provides access to text data from the African American Newspaper Collections in XML files. Contact scpub@library.illinois.edu to arrange access to the data. 
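
Once XML files are in hand, a first step is often to pull the plain text out of the markup. The following is a minimal Python sketch that does this without knowing the vendor's schema. The sample document and its element names are hypothetical and are not taken from Accessible Archives; real files will use their own tags.

```python
import xml.etree.ElementTree as ET


def extract_text(xml_string):
    """Return all text in an XML document as one whitespace-normalized string."""
    root = ET.fromstring(xml_string)
    # itertext() walks every element in document order, so no tag names
    # need to be known in advance.
    pieces = (chunk.strip() for chunk in root.itertext())
    return " ".join(piece for piece in pieces if piece)


# Hypothetical sample; real vendor files will use their own element names.
SAMPLE = """<article>
  <title>Town Meeting</title>
  <body><p>The meeting was held on Tuesday.</p><p>All were welcome.</p></body>
</article>"""

print(extract_text(SAMPLE))
```

Because the function ignores tag names, it also flattens headings and body text together. If the files separate metadata from article text, selecting specific elements before extracting text would likely give cleaner results.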

Newspapers from Digital Libraries

Listed below are digital libraries that offer digitized versions of newspapers. Some are not in machine-readable format. For information on making scanned images machine readable, see the library guide on Optical Character Recognition.

Cline Center

  • The Cline Center for Advanced Social Research's Global News Index and Extracted Features Repository draws on over 153 million historical news reports from around the world.
  • Contact the Cline Center directly for policies and procedures for accessing the Index.

Chronicling America

  • The Chronicling America archive catalogs all known newspapers published in the United States from 1690 until today and provides digital copies of many of them. Some publications are not available digitally; for those items, the database provides complete metadata.
  • Data Mining Instructions: Access the Chronicling America API. Data delivered in HTML or JSON format.
  • See other APIs and machine interfaces that can be used with Library of Congress content.
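
As a rough illustration of requesting JSON rather than HTML, the Python sketch below builds a keyword search URL. The endpoint and the parameter names (`andtext`, `format`, `page`) are assumptions based on the legacy Chronicling America search interface and should be checked against the current API documentation before use.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed legacy search endpoint; confirm against the current API documentation.
BASE_URL = "https://chroniclingamerica.loc.gov/search/pages/results/"


def build_search_url(keywords, page=1):
    """Build a page-search URL that asks for JSON instead of HTML."""
    params = {"andtext": keywords, "format": "json", "page": page}
    return BASE_URL + "?" + urlencode(params)


def fetch_results(keywords, page=1):
    """Download one page of results and return the parsed JSON (needs network access)."""
    with urlopen(build_search_url(keywords, page)) as response:
        return json.load(response)


print(build_search_url("suffrage parade"))
```

Only `build_search_url` runs when the script is executed. `fetch_results` is left uncalled so that the example does not contact the server until the URL has been verified.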

Europeana Digital Newspapers

  • Europeana provides digital images of artifacts currently held in museums, libraries, and archives across Europe. 
  • Data Mining Instructions: Request an API key to access the Europeana Newspapers API. Please note that some newspapers provide metadata only, and do not have full-text for text mining purposes.

Digital Public Library of America

  • DPLA provides digital access to many collections of “America’s libraries, archives, museums, and other cultural heritage institutions.” Materials include books, photos, audio and video recordings, and other media.
  • Data Mining Instructions: Request an API key to gain access to the DPLA API. Data delivered in JSON-LD format.
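
The sketch below shows one way a JSON-LD response might be handled in Python after an API key has been issued. The items endpoint, the `q`, `page_size`, and `api_key` parameters, and the `docs` / `sourceResource` / `title` layout are assumptions about the DPLA API and should be confirmed in its documentation. The sample response is hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL for the DPLA items endpoint.
ITEMS_URL = "https://api.dp.la/v2/items"


def build_items_url(query, api_key, page_size=10):
    """Build a keyword search URL; parameter names are assumptions."""
    params = {"q": query, "page_size": page_size, "api_key": api_key}
    return ITEMS_URL + "?" + urlencode(params)


def extract_titles(payload):
    """Collect titles from a parsed response, assuming a docs/sourceResource layout."""
    titles = []
    for doc in payload.get("docs", []):
        title = doc.get("sourceResource", {}).get("title", [])
        # A title may arrive as a single string or as a list of strings.
        titles.extend([title] if isinstance(title, str) else title)
    return titles


# Hypothetical response fragment used only to exercise extract_titles().
SAMPLE_RESPONSE = {
    "count": 2,
    "docs": [
        {"sourceResource": {"title": "The Daily Gazette"}},
        {"sourceResource": {"title": ["Evening Star", "Evening Star (second edition)"]}},
    ],
}

print(build_items_url("newspaper", "YOUR_API_KEY"))
print(extract_titles(SAMPLE_RESPONSE))
```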

Internet Archive

  • The Internet Archive provides digitized print resources and born-digital content, with a special focus on web pages and digitized books. It contains millions of web pages, books, audio and video recordings, images, and even software programs. It aims to provide a quality collection, like one found in a public library, to those who do not have access to such services.
  • Data Mining Instructions: IA suggests using wget to download files from their site in bulk. There is no download limit, but they recommend downloading no more than 10,000 items per query to prevent errors. Download wget. Data delivered in XML or JSON format.
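
To stay within the recommended 10,000 items per query, a long list of item identifiers can be split into batches before downloading. The Python sketch below writes one URL list per batch, which a bulk downloader such as wget could then read from a file. The identifiers, the file names, and the `https://archive.org/download/` URL pattern are assumptions for illustration; follow IA's own instructions for the exact wget options.

```python
from pathlib import Path

BATCH_SIZE = 10_000  # Recommended maximum number of items per query.
# Assumed URL pattern for an item's download directory.
DOWNLOAD_BASE = "https://archive.org/download/"


def split_into_batches(identifiers, batch_size=BATCH_SIZE):
    """Split a list of identifiers into consecutive batches of at most batch_size."""
    return [
        identifiers[start:start + batch_size]
        for start in range(0, len(identifiers), batch_size)
    ]


def write_url_lists(identifiers, out_dir):
    """Write one text file of download URLs per batch and return the file paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    files = []
    for number, batch in enumerate(split_into_batches(identifiers), start=1):
        target = out_path / f"itemlist_{number:03d}.txt"
        target.write_text("\n".join(DOWNLOAD_BASE + item for item in batch) + "\n")
        files.append(target)
    return files


# Hypothetical identifiers standing in for the results of a real search.
example_ids = [f"example-item-{n}" for n in range(25_000)]
print([len(batch) for batch in split_into_batches(example_ids)])
```

Each resulting file holds at most 10,000 URLs, so batches can be downloaded and checked one at a time.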

Newspapers Direct from Publisher

New York Times

  • The New York Times keeps archives of the newspaper’s past issues dating back to 1851.
  • Data Mining Instructions: NYT offers many APIs to retrieve content from their history of publications. Each API retrieves specific information, such as most popular stories, community comments, book reviews from the NYT bestseller lists, and specific articles. Request an API key to get started. Data delivered in JSON format.
  • Note: More recent articles are only accessible at cost. Those published prior to 1922 are free for all to download.
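
As one example of the available APIs, the Python sketch below builds a request URL for a single month of the archive. The Archive API endpoint pattern and the `api-key` parameter name are assumptions that should be confirmed on the NYT developer site. The 1851 lower bound comes from the archive's start date noted above.

```python
from urllib.parse import urlencode

# Assumed endpoint pattern for the NYT Archive API.
ARCHIVE_URL = "https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"
FIRST_YEAR = 1851  # Earliest year of the NYT archive.


def build_archive_url(year, month, api_key):
    """Build a URL requesting every article record for one month."""
    if year < FIRST_YEAR:
        raise ValueError(f"The archive begins in {FIRST_YEAR}; got {year}.")
    if not 1 <= month <= 12:
        raise ValueError(f"Month must be between 1 and 12; got {month}.")
    query = urlencode({"api-key": api_key})
    return ARCHIVE_URL.format(year=year, month=month) + "?" + query


print(build_archive_url(1900, 1, "YOUR_API_KEY"))
```

Validating the year and month before sending a request avoids spending rate-limited API calls on dates the archive cannot serve.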