University Library, University of Illinois at Urbana-Champaign

Finding Text Data Sets

Where to find data sources for computational text analysis

Research From Library Databases

The University of Illinois has agreements with several scholarly publishing vendors that allow researchers to conduct text mining on their content. These vendors are listed below, along with instructions on how to get started. If you have a question about a vendor not listed below, or otherwise need assistance accessing the data, please contact Spencer Keralis, Information Sciences and Digital Humanities Librarian.


  • The Association for Computing Machinery’s Digital Library provides research articles, books, conference proceedings, and magazine articles on topics in computer and data science and technology. 
  • Data Mining Instructions: Data mining requests are approved on a case-by-case basis. Contact Spencer Keralis with the University Library to get started. 


  • Elsevier provides research articles and books focused on fields of science and technology, including engineering, medicine, social science, and GIS. Notable databases include INSPEC, ScienceDirect, Scopus, and Engineering Village.
  • Data Mining Instructions: Access the Elsevier API, and review their data mining policy. Each researcher must create an Elsevier account and register for their own API key. Data delivered in XML format.
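
As a rough illustration of the per-researcher API key described above, the following Python sketch builds (but does not send) a request for one article's XML. The endpoint path, the `X-ELS-APIKey` header, and the placeholder DOI are assumptions for illustration; confirm them against Elsevier's own API documentation and data mining policy before use.

```python
import urllib.request
from urllib.parse import quote

# Assumed base URL of Elsevier's article retrieval endpoint; verify in their docs.
ELSEVIER_ARTICLE_URL = "https://api.elsevier.com/content/article/doi/"


def build_elsevier_request(doi: str, api_key: str) -> urllib.request.Request:
    """Build, without sending, a request for one article's full text as XML."""
    url = ELSEVIER_ARTICLE_URL + quote(doi, safe="/")
    headers = {
        "X-ELS-APIKey": api_key,  # each researcher registers their own key
        "Accept": "text/xml",     # the guide notes that data is delivered as XML
    }
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    # Hypothetical DOI and key; replace both before sending anything.
    request = build_elsevier_request("10.1016/j.example.2020.01.001", "YOUR_API_KEY")
    print(request.full_url)
    # To actually send it: urllib.request.urlopen(request).read()
```

Keeping the request construction separate from the network call makes it easy to check the URL and headers before spending any of the key's quota.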


  • JSTOR is a collection of research articles and books dating back to the earliest publications in humanities fields, especially language, literature, history, and philosophy. 
  • Data Mining Instructions: JSTOR Data for Research provides an API that can be used to retrieve metadata and reference information for up to 25,000 documents. Researchers who need to conduct full-text analysis or retrieve more than 25,000 documents should contact JSTOR directly to request the data set. Data delivered in XML or plain-text format.


  • Springer is the provider of BioMed Central, an open access database of academic articles related to science, technology, and medicine. Social scientists may be interested in researching topics like public health, substance abuse, and health care policy using this source.
  • Data Mining Instructions: Springer provides multiple APIs depending on the researcher's needs. Access the Springer APIs. Alternatively, you may download articles directly from the website. See their full policy and how-to guide. Data delivered in PDF, HTML (if available), XML, or JSON format.


  • Wiley provides a variety of databases, notably Anthrosource for current and archived issues of publications from the American Anthropological Association, and the International Studies Encyclopedia. They also provide AGU Digital Library for earth science, and Organic Reactions related to chemistry.
  • Data Mining Instructions: Use the CrossRef data mining service, which covers thousands of publishers, including Wiley. See Wiley's terms and conditions for further details. Data delivered in JSON format.
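
A minimal sketch of how the CrossRef JSON might be used: look up a work by DOI and keep only the full-text links that the publisher has flagged for text mining. The `link` array and its `intended-application` field reflect the CrossRef metadata format as commonly documented, but treat them, along with the placeholder DOI and URLs, as assumptions to verify. Downloading the linked files remains subject to the publisher's terms.

```python
import json
from typing import List, Tuple
from urllib.parse import quote

# Public CrossRef REST endpoint for a single work (assumed; check current docs).
CROSSREF_WORKS_URL = "https://api.crossref.org/works/"


def crossref_work_url(doi: str) -> str:
    """Return the metadata URL for one DOI."""
    return CROSSREF_WORKS_URL + quote(doi, safe="/")


def text_mining_links(work_json: str) -> List[Tuple[str, str]]:
    """Extract (content type, URL) pairs flagged for text mining."""
    message = json.loads(work_json).get("message", {})
    return [
        (link.get("content-type", "unspecified"), link["URL"])
        for link in message.get("link", [])
        if link.get("intended-application") == "text-mining"
    ]


if __name__ == "__main__":
    # Hand-written sample record, not real CrossRef output.
    sample = json.dumps({
        "message": {
            "link": [
                {"URL": "https://example.org/article.xml",
                 "content-type": "application/xml",
                 "intended-application": "text-mining"},
                {"URL": "https://example.org/article.html",
                 "content-type": "text/html",
                 "intended-application": "similarity-checking"},
            ]
        }
    })
    print(crossref_work_url("10.1002/example.12345"))
    for content_type, url in text_mining_links(sample):
        print(content_type, url)
```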

Research From Open Access Publishers and Indexes

Listed below are some open access publishers and indexes that make research publicly available. Most articles from the sources below are related to science and technology.


  • arXiv is an electronic archive for research articles in various STEM fields, including physics, mathematics, computer science, biology, finance, engineering, and economics.
  • Data Mining Instructions: Use the arXiv API to access arXiv data, search, and linking facilities. The API returns metadata only, not full text, and no key is required. Full-text articles are available in bulk from arXiv's Amazon S3 buckets, which are configured as requester-pays: the researcher pays Amazon's download charges rather than purchasing a license. Data delivered in Atom XML format.
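
To show what working with the Atom XML metadata can look like, here is a short Python sketch that builds a query URL and parses a feed into plain dictionaries. The endpoint and the `search_query`, `start`, and `max_results` parameters follow the arXiv API as generally documented; the helper names and the sample feed are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List
from urllib.parse import urlencode

ARXIV_API_URL = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_arxiv_query(search_query: str, start: int = 0, max_results: int = 10) -> str:
    """Return a query URL; no API key is needed."""
    params = {"search_query": search_query, "start": start, "max_results": max_results}
    return ARXIV_API_URL + "?" + urlencode(params)


def _text(entry: ET.Element, tag: str) -> str:
    """Read one child element and collapse runs of whitespace."""
    raw = entry.findtext("atom:" + tag, default="", namespaces=ATOM_NS)
    return " ".join(raw.split())


def parse_arxiv_feed(atom_xml) -> List[Dict[str, str]]:
    """Turn an Atom feed (str or bytes) into a list of metadata records."""
    root = ET.fromstring(atom_xml)
    return [
        {"id": _text(entry, "id"), "title": _text(entry, "title"), "summary": _text(entry, "summary")}
        for entry in root.findall("atom:entry", ATOM_NS)
    ]


if __name__ == "__main__":
    print(build_arxiv_query("all:text mining", max_results=5))
    # Hand-written sample feed, not real arXiv output.
    sample = """<feed xmlns="http://www.w3.org/2005/Atom">
      <entry>
        <id>http://arxiv.org/abs/0000.00000v1</id>
        <title>A Placeholder
          Title</title>
        <summary>  Placeholder abstract.  </summary>
      </entry>
    </feed>"""
    for record in parse_arxiv_feed(sample):
        print(record["id"], "|", record["title"])
```

Titles and abstracts in the feed are often wrapped across lines, which is why the sketch normalizes whitespace before returning them.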

Public Library of Science

  • Public Library of Science (PLoS) is a publisher of research articles in various fields of science, such as biology, medicine, computational biology, genetics, and disease.
  • Data Mining Instructions: PLoS offers two APIs for data retrieval. The Article-Level Metrics API retrieves data regarding an article’s usage statistics to demonstrate its reach. The Search API provides the ability to query PLoS content across their journals. Data delivered in XML or JSON format. An API key is required to access either API. 
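
The Search API can be sketched as follows, assuming it accepts Solr-style parameters (`q`, `fl`, `wt`, `rows`) and returns matches under `response.docs` in its JSON output. The endpoint URL, the `api_key` parameter name, and the field names are assumptions to check against the PLoS API documentation.

```python
import json
from typing import Dict, List, Optional, Sequence
from urllib.parse import urlencode

# Assumed Search API endpoint; verify before use.
PLOS_SEARCH_URL = "https://api.plos.org/search"


def build_plos_query(query: str,
                     fields: Sequence[str] = ("id", "title"),
                     rows: int = 10,
                     api_key: Optional[str] = None) -> str:
    """Return a search URL that requests JSON output."""
    params = {"q": query, "fl": ",".join(fields), "wt": "json", "rows": rows}
    if api_key:
        params["api_key"] = api_key  # assumed parameter name for the key
    return PLOS_SEARCH_URL + "?" + urlencode(params)


def parse_plos_docs(response_json: str) -> List[Dict]:
    """Return the list of matching documents from a JSON response."""
    return json.loads(response_json).get("response", {}).get("docs", [])


if __name__ == "__main__":
    print(build_plos_query('title:"text mining"', api_key="YOUR_API_KEY"))
    # Hand-written sample response, not real PLoS output.
    sample = '{"response": {"numFound": 1, "docs": [{"id": "10.1371/journal.example", "title": "Placeholder"}]}}'
    for doc in parse_plos_docs(sample):
        print(doc["id"], doc["title"])
```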


  • PubMed contains 27 million citations for research articles on biomedicine from various research journals and online books. The full text of an article may be available from its original publisher, but not through PubMed itself.
  • Data Mining Instructions: Text and data mining is limited due to copyright restrictions, but some collections are available for mining. The largest collection is the Open Access Subset. Data delivered in XML format. 
  • PubMed has also published a list of tools that can be used for data mining.


  • CORE is an initiative in the UK to harvest and maintain metadata and full-text content from Open Access journals and repositories across the world.
  • Data Mining Instructions: Access the CORE API. Data delivered in JSON format.
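
Because a CORE search can match far more records than one response holds, a harvesting script usually pages through results. The sketch below assumes a search endpoint at `/v3/search/works` that accepts `q`, `limit`, and `offset` and returns matches under a `results` key; all of these names are assumptions to confirm in the CORE API documentation. The `fetch` argument is a hypothetical callable supplied by the caller, which keeps authentication and rate limiting out of the paging logic.

```python
import json
from typing import Callable, Dict, Iterator
from urllib.parse import urlencode

# Assumed search endpoint; verify against the CORE API documentation.
CORE_SEARCH_URL = "https://api.core.ac.uk/v3/search/works"


def core_search_url(query: str, limit: int = 10, offset: int = 0) -> str:
    """Return the URL for one page of search results."""
    return CORE_SEARCH_URL + "?" + urlencode({"q": query, "limit": limit, "offset": offset})


def iter_core_results(query: str,
                      fetch: Callable[[str], str],
                      page_size: int = 10,
                      max_records: int = 100) -> Iterator[Dict]:
    """Yield records page by page until the results run out or the cap is hit."""
    offset = 0
    while offset < max_records:
        page = json.loads(fetch(core_search_url(query, page_size, offset)))
        results = page.get("results", [])
        if not results:
            break
        # Trim the last page so no more than max_records are yielded.
        for record in results[: max_records - offset]:
            yield record
        offset += len(results)


if __name__ == "__main__":
    # Stand-in for a real HTTP call: two canned pages, then an empty one.
    canned = iter([
        '{"results": [{"title": "First"}, {"title": "Second"}]}',
        '{"results": [{"title": "Third"}]}',
        '{"results": []}',
    ])
    for record in iter_core_results("text mining", lambda url: next(canned), page_size=2):
        print(record["title"])
```

In real use, `fetch` would issue the HTTP request with whatever API key CORE issues and return the response body as text.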

Research From Digital Libraries

Biodiversity Heritage Library

  • The Biodiversity Heritage Library is an online collection of scientific texts focused on natural history, biology, botany, and other natural sciences. It contains both scholarly journal articles and books.
  • Data Mining Instructions: Request an API key to access the BHL API. Data delivered in JSON or XML format.