LibGuides: Finding Text Data Sets: Newspapers

Newspapers from Library Databases

The University of Illinois has contract agreements with several scholarly publishing vendors to conduct text mining. These vendors are listed below, along with instructions on how to get started. If you have a question about a vendor not listed below, or otherwise need assistance accessing the data, please contact the Scholarly Communications and Publishing Department.

ProQuest

ProQuest Historical Newspapers includes digitized newspapers from a variety of time periods and regions, with collections like the Chinese Newspapers Collection and The New York Times. See more collections in ProQuest Historical Newspapers.
Data Mining Instructions: Contact the Cline Center to arrange researcher access to the text data.

Gale

Gale provides newspaper and magazine archives for text mining, covering the early 19th century to today. Notable collections include the Daily Mail Historical Archive, the Financial Times Historical Archive, and various other English and American newspapers and periodicals.
Data Mining Instructions: For access to these collections, contact Scholarly Communications and Publishing. Review Gale's data mining FAQ. Original image files are also available in JPG or TIF format. See the spreadsheet below to view all hard drives and additional information.

Accessible Archives

Accessible Archives provides access to text data from the African American Newspaper Collections in XML files. Contact Scholarly Communications and Publishing to arrange access to the data.

Gale Hard Drive Directory
List of all the primary source and newspaper content stored on hard drives from Gale, including detailed descriptions, file types and counts, and collection highlights.

Newspapers from Digital Libraries

Listed below are digital libraries that offer digitized versions of newspapers. Some are not in machine-readable format. For information on making scanned images machine readable, see the library guide on Optical Character Recognition.

Cline Center

The Cline Center for Advanced Social Research's Global News Index and Extracted Features Repository draws on over 153 million historical news reports from around the world.
Contact the Cline Center directly for policies and procedures for accessing the Index.

Chronicling America

The Chronicling America archive contains digitized pages from historic American newspapers, along with a directory of known newspapers published in the United States from 1690 until today. Some publications are not available digitally; for those items, the directory provides descriptive metadata.
Data Mining Instructions: Access the Chronicling America API. Data delivered in HTML or JSON format.
See other APIs and machine interfaces that can be used with Library of Congress content.
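As a minimal sketch of what a JSON request to Chronicling America might look like, the Python below builds a full-text page search URL and reads the result list. The endpoint, the `andtext`, `rows`, and `page` parameters, and the `items` response key reflect the long-standing Chronicling America search interface and are assumptions here; the service has been changing, so check the current API documentation before relying on them. The function names are illustrative only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the long-standing Chronicling America page search.
# Confirm it against the current API documentation.
SEARCH_URL = "https://chroniclingamerica.loc.gov/search/pages/results/"


def build_search_url(terms, page=1, rows=20):
    """Build a full-text page search URL that asks for JSON output."""
    params = {
        "andtext": terms,  # words that must all appear in the page text
        "format": "json",  # request JSON instead of the default HTML
        "rows": rows,      # results per page
        "page": page,      # 1-based page of results
    }
    return SEARCH_URL + "?" + urlencode(params)


def fetch_pages(terms, page=1, rows=20):
    """Fetch one page of results and return the list of matching pages."""
    with urlopen(build_search_url(terms, page, rows), timeout=30) as resp:
        payload = json.load(resp)
    # "items" is assumed to hold the result list in the JSON response.
    return payload.get("items", [])
```

Calling `fetch_pages("women suffrage")` would perform the network request; `build_search_url` can be used on its own to inspect the URL first.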

Europeana Digital Newspapers

Europeana provides digital images of artifacts currently held in museums, libraries, and archives across Europe.
Data Mining Instructions: Request an API key to access the Europeana Newspapers API. Please note that some newspapers provide metadata only, and do not have full-text for text mining purposes.
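The following Python sketch shows one way a keyed Europeana search request could be assembled and its results summarized. It assumes Europeana's general Search API endpoint with the `wskey`, `query`, `rows`, and `start` parameters and an `items` list in the response; the newspapers-specific endpoint and filters may differ, so treat these names as assumptions and confirm them in Europeana's API documentation. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: Europeana's general Search API. The newspapers-specific
# endpoint or filters may differ; check Europeana's API documentation.
SEARCH_URL = "https://api.europeana.eu/record/v2/search.json"


def build_search_url(api_key, query, rows=12, start=1):
    """Build a search URL; api_key is the key requested from Europeana."""
    params = {"wskey": api_key, "query": query, "rows": rows, "start": start}
    return SEARCH_URL + "?" + urlencode(params)


def summarize_items(payload):
    """Return (record id, first title) pairs from a search response."""
    summary = []
    for item in payload.get("items", []):
        # Titles are assumed to arrive as a list of strings.
        titles = item.get("title") or ["(untitled)"]
        summary.append((item.get("id"), titles[0]))
    return summary


def search(api_key, query, rows=12):
    """Run one search request and summarize the returned records."""
    with urlopen(build_search_url(api_key, query, rows), timeout=30) as resp:
        return summarize_items(json.load(resp))
```

Because some newspapers provide metadata only, a record appearing in the results does not guarantee that full text is available for it.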

Digital Public Library of America

DPLA provides digital access to many collections of “America’s libraries, archives, museums, and other cultural heritage institutions.” Materials include books, photos, audio and video recordings, and other media.
Data Mining Instructions: Request an API key to gain access to the DPLA API. Data delivered in JSON-LD format.
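A request to the DPLA API can be sketched in Python as below. The sketch assumes the `items` endpoint at `https://api.dp.la/v2/items`, the `q`, `page`, `page_size`, and `api_key` parameters, and a `docs` list whose entries carry a `sourceResource.title` field; verify these against the DPLA API documentation. The function names are illustrative, not part of any DPLA library.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for item searches; confirm in the DPLA API documentation.
API_URL = "https://api.dp.la/v2/items"


def build_items_url(api_key, query, page=1, page_size=10):
    """Build a keyword search URL for DPLA items."""
    params = {"q": query, "page": page, "page_size": page_size, "api_key": api_key}
    return API_URL + "?" + urlencode(params)


def extract_titles(payload):
    """Pull one title per record out of a JSON-LD search response."""
    titles = []
    for doc in payload.get("docs", []):
        title = doc.get("sourceResource", {}).get("title", "(untitled)")
        # A title may be a single string or a list of strings.
        if isinstance(title, list):
            title = title[0] if title else "(untitled)"
        titles.append(title)
    return titles


def search_titles(api_key, query, page=1, page_size=10):
    """Run one search request and return the record titles."""
    url = build_items_url(api_key, query, page, page_size)
    with urlopen(url, timeout=30) as resp:
        return extract_titles(json.load(resp))
```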

Internet Archive

The Internet Archive provides digitized print resources and born-digital content, with a special focus on web pages and digitized books. It contains millions of web pages, books, audio and video recordings, images, and even software programs. It aims to provide a quality collection, like one found in a public library, to those who do not have access to such services.
Data Mining Instructions: IA suggests using wget to download files from their site in bulk. There is no download limit, but they recommend downloading no more than 10,000 items per query to prevent errors. Download wget. Data delivered in XML or JSON format.
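One possible way to prepare a bulk download is to query the Internet Archive's advanced search for item identifiers in pages of at most 10,000 and write the matching download URLs to a file that wget can read. The Python sketch below assumes the `advancedsearch.php` endpoint with the `q`, `fl[]`, `rows`, `page`, and `output` parameters and the `https://archive.org/download/` URL prefix; the function names, the example query, and the `itemlist.txt` filename are hypothetical.

```python
import math
from urllib.parse import urlencode

# Assumed endpoint and URL prefix; confirm against Internet Archive docs.
SEARCH_URL = "https://archive.org/advancedsearch.php"
DOWNLOAD_PREFIX = "https://archive.org/download/"
MAX_ROWS = 10_000  # recommended ceiling on items per query


def build_search_url(query, page=1, rows=MAX_ROWS):
    """Build a search URL that returns only item identifiers as JSON."""
    params = {
        "q": query,
        "fl[]": "identifier",         # return just the identifier field
        "rows": min(rows, MAX_ROWS),  # never ask for more than the ceiling
        "page": page,
        "output": "json",
    }
    return SEARCH_URL + "?" + urlencode(params)


def pages_needed(num_found, rows=MAX_ROWS):
    """Number of queries needed to cover num_found matching items."""
    return math.ceil(num_found / rows)


def write_item_list(identifiers, path="itemlist.txt"):
    """Write one download URL per line so wget can read the file."""
    with open(path, "w", encoding="utf-8") as handle:
        for identifier in identifiers:
            handle.write(DOWNLOAD_PREFIX + identifier + "\n")
```

The resulting file can then be passed to wget through its `-i` (input file) option.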

Viral Texts Project

The Viral Texts Project uses text mining methods to study how news stories, short fiction, poetry, and more went “viral” in nineteenth-century newspapers and magazines. Their data is open access on the Viral Texts GitHub for researchers to reuse for their own text mining projects.

Newspapers Direct from Publisher

New York Times

The New York Times keeps archives of the newspaper’s past issues dating back to 1851.
Data Mining Instructions: NYT offers many APIs to retrieve content from their history of publications. Each API retrieves specific information, such as most popular stories, community comments, book reviews from the NYT bestseller lists, and specific articles. Request an API key to get started. Data delivered in JSON format.
Note: More recent articles are only accessible at cost. Those published prior to 1922 are free for all to download.
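As a rough illustration of working with one of these APIs, the Python sketch below builds an Article Search request limited to a date range and extracts a few fields from the JSON response. It assumes the Article Search endpoint with the `q`, `begin_date`, `end_date`, `page`, and `api-key` parameters, and a `response.docs` list with `headline.main`, `pub_date`, and `web_url` fields; confirm these in the NYT developer documentation. The function names are illustrative only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed Article Search endpoint; confirm in the NYT developer docs.
API_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_search_url(api_key, query, begin_date=None, end_date=None, page=0):
    """Build a search URL; dates are strings in YYYYMMDD form."""
    params = {"q": query, "page": page, "api-key": api_key}
    if begin_date:
        params["begin_date"] = begin_date
    if end_date:
        params["end_date"] = end_date
    return API_URL + "?" + urlencode(params)


def extract_articles(payload):
    """Reduce a search response to headline, date, and URL per article."""
    docs = payload.get("response", {}).get("docs", [])
    return [
        {
            "headline": doc.get("headline", {}).get("main"),
            "date": doc.get("pub_date"),
            "url": doc.get("web_url"),
        }
        for doc in docs
    ]


def search(api_key, query, begin_date=None, end_date=None, page=0):
    """Run one search request and return the simplified article records."""
    url = build_search_url(api_key, query, begin_date, end_date, page)
    with urlopen(url, timeout=30) as resp:
        return extract_articles(json.load(resp))
```

Restricting `end_date` to before 1922 is one way to stay within the freely available articles described in the note above.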