University Library, University of Illinois at Urbana-Champaign

Finding Text Data Sets

Where to find data sources for computational text analysis

Social Media Data

Many social media sites, including Twitter, Facebook, and Reddit, have APIs that provide access to textual content. Using these APIs generally requires some technical know-how and an access key from the data provider.

Other social media analysis tools:

Crimson Hexagon

  • Crimson Hexagon is a powerful platform for social media analytics. Researchers affiliated with the University of Illinois at Urbana-Champaign can request access through Tech Services.

Documenting the Now 

  • Documenting the Now collects tweet data (tweet IDs) and publishes it as open-access data sets. The project also maintains a tool called Hydrator that turns tweet IDs into full tweets.
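Because these data sets are distributed as lists of tweet IDs, it can help to tidy an ID file before hydrating it. The following is a minimal Python sketch, not part of Documenting the Now or Hydrator: it assumes a plain-text file with one tweet ID per line, and the function names and the batch size of 100 are hypothetical choices made for illustration.

```python
def read_tweet_ids(lines):
    """Return unique, numeric tweet IDs in their original order.

    Assumes one ID per line; blank or non-numeric lines are skipped.
    """
    seen = set()
    ids = []
    for line in lines:
        candidate = line.strip()
        # Tweet IDs are long integers; keep them as strings to avoid
        # any loss of precision in other tools.
        if not (candidate.isascii() and candidate.isdigit()):
            continue
        if candidate not in seen:
            seen.add(candidate)
            ids.append(candidate)
    return ids


def batch(items, size=100):
    """Split a list into consecutive batches of at most `size` items.

    The default of 100 is an illustrative assumption, not a documented limit.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]
```

To use it on a downloaded file, open the file and pass the file object directly, for example read_tweet_ids(open("ids.txt")), where ids.txt is a placeholder file name.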

NodeXL 

  • NodeXL is a complex Excel template that can retrieve social media data and supports basic network analysis visualization. Retrieving Twitter data is free; for a fee, researchers can also retrieve Facebook, YouTube, or Flickr data.

NVivo

  • NVivo includes a tool called NCapture for gathering and analyzing Twitter data. This software program is available in the Scholarly Commons.

Social Media Macroscope

  • Developed at Illinois, the Social Media Macroscope has the goal of providing social media analytics tools and data to students and researchers. Check out tools like the Brand Analytics Environment to see how the public interacts with brands, or download a dataset.

TAGS (Twitter Archiving Google Sheet)

  • TAGS is a complex Google Sheets template for retrieving Twitter data. It also supports basic network analysis visualization.

Each tool above has its own limitations and strengths; visit the links for more information, or see the Commons Knowledge blog post on social media analysis tools.

Song Lyrics

Genius

  • Genius, formerly Rap Genius, is a reliable web source of song lyrics from all genres. They also publish news, interviews with artists, and other content related to popular music.
  • Data Mining Instructions: Access the Genius API. An API key and a Genius account are required. The Genius API does not provide downloading functionality, so you will need to use another method to download the data (try this Python package!).
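As a rough illustration of what API access looks like, the sketch below searches Genius for songs using only the Python standard library. It assumes the search endpoint at https://api.genius.com/search, a bearer-token Authorization header, and a response shaped as response → hits → result; check these against the current Genius API documentation before relying on them. The function names are hypothetical, and the sketch returns song metadata and page URLs only, not lyrics.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; confirm against the Genius API documentation.
GENIUS_SEARCH_URL = "https://api.genius.com/search"


def build_search_request(query, access_token):
    """Build (but do not send) an authenticated search request."""
    url = GENIUS_SEARCH_URL + "?" + urllib.parse.urlencode({"q": query})
    headers = {"Authorization": "Bearer " + access_token}
    return urllib.request.Request(url, headers=headers)


def extract_hits(payload):
    """Pull title, artist, and page URL out of a decoded search response.

    Assumes the layout payload["response"]["hits"][i]["result"];
    missing fields come back as None rather than raising an error.
    """
    songs = []
    for hit in payload.get("response", {}).get("hits", []):
        result = hit.get("result", {})
        artist = result.get("primary_artist") or {}
        songs.append({
            "title": result.get("title"),
            "artist": artist.get("name"),
            "url": result.get("url"),
        })
    return songs


def search_songs(query, access_token):
    """Send the search request and return the simplified list of hits."""
    request = build_search_request(query, access_token)
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_hits(json.load(response))
```

Calling search_songs("artist or song name", "YOUR_ACCESS_TOKEN") with a real token would return a list of dictionaries; the URLs in that list point to the song pages, which is where a separate tool would pick up to collect the lyrics themselves.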

Case Law Documents

Case.law

  • Case.law is a project aiming to make caselaw more publicly accessible. Over six million court documents have been digitized from the Harvard Law Library's collections, covering cases from 1658 to 2018.
  • Data Mining Instructions: Download text files in bulk, or use the API. Only cases from two jurisdictions (Arkansas and Illinois) are publicly available for bulk download and API retrieval; for the remaining jurisdictions, researchers need to create an account and request either an API key or unlimited bulk data access.