LibGuides: Introduction to Digital Humanities: Text and Data Mining

What is Text Mining?

Text mining centers on identifying patterns and trends in unstructured texts. This often involves using software to "read" text files and report data about them, such as word frequencies, common word patterns, and indicators of tone. It is sometimes referred to as a "distant reading" method, in which you take a step back to identify patterns in language across a large group of texts.

Many research questions and methods fall within the scope of text and data mining, including:

Identifying word frequencies
Concordance (finding the passages that mention specific key terms)
Keyness (identifying terms that appear unusually often in one text or set of texts when compared to another)
Topic modeling (grouping key terms together to identify common themes and topics)
Named entity recognition (identifying names of people, places, and things across texts)
Sentiment analysis (identifying positive or negative tone)
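
To make the first three methods concrete, here is a minimal Python sketch that uses only the standard library. The function names, the sample sentences, and the simple tokenizer are assumptions made for illustration; they are not taken from Voyant, AntConc, or any other tool in this guide. The keyness step is simplified to a comparison of relative frequencies, whereas dedicated tools typically use statistical measures such as log-likelihood.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Count how often each word appears in the text."""
    return Counter(tokenize(text))


def concordance(text, term, width=3):
    """List each occurrence of `term` with up to `width` words of context per side."""
    tokens = tokenize(text)
    hits = []
    for i, token in enumerate(tokens):
        if token == term.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append(f"{left} [{token}] {right}".strip())
    return hits


def relative_frequency(text, term):
    """Share of all tokens in the text that are `term` (0.0 for an empty text)."""
    tokens = tokenize(text)
    return tokens.count(term.lower()) / len(tokens) if tokens else 0.0


# Hypothetical sample texts, invented for this example.
sample = "The whale surfaced. The crew watched the whale dive again."
reference = "The crew repaired the sails while the ship drifted."

print(word_frequencies(sample).most_common(2))
print(concordance(sample, "whale", width=2))
# A crude keyness signal: how much more common is "whale" in one text than the other?
print(relative_frequency(sample, "whale"), relative_frequency(reference, "whale"))
```

Very common words such as "the" dominate raw frequency counts, which is why text mining tools usually let you apply a stopword list before counting.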

Tools and Software

Voyant
A text mining tool recommended for beginners. No coding experience is required to use Voyant, and it has several pre-uploaded corpora you can experiment with. It is free to access via the link, and you do not need to download any software to use it.
AntConc
A text mining program that does not require coding experience. AntConc offers a variety of tools that help identify word frequencies, common terms and phrases in a text, and more. It can be downloaded for free from the creator's website.
Mallet
A text mining program often used for topic modeling, or identifying themes across a set of texts. It can be downloaded for free from the developers' website.

For more advanced text mining techniques, such as sentiment analysis (identifying the tone of a text or texts) or named entity recognition (identifying the names of people, places, and things in a text or texts), researchers often have to code their own text mining environments. R and Python are two programming languages commonly used for text mining. Further resources on using these languages for text mining are linked below.
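
As a rough illustration of what coding your own analysis can look like, the Python sketch below scores tone by counting words from two small word lists. The word lists, function names, and scoring rule are all hypothetical and invented for this example. Research projects generally rely on established sentiment lexicons or trained models rather than a hand-made list like this one.

```python
import re

# Hypothetical toy lexicon, invented for illustration only.
POSITIVE = {"good", "great", "happy", "love", "wonderful"}
NEGATIVE = {"bad", "sad", "terrible", "hate", "awful"}


def sentiment_score(text):
    """Return (positive hits - negative hits) / total words, or 0.0 for empty text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    positive_hits = sum(1 for token in tokens if token in POSITIVE)
    negative_hits = sum(1 for token in tokens if token in NEGATIVE)
    return (positive_hits - negative_hits) / len(tokens)


def label(score):
    """Turn a numeric score into a coarse tone label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for sentence in ["What a wonderful, happy day", "I hate this awful weather"]:
    score = sentiment_score(sentence)
    print(f"{sentence!r}: {score:+.2f} ({label(score)})")
```

A word-counting approach like this cannot handle negation ("not good") or sarcasm, which is one reason sentiment analysis is treated as a more advanced technique.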

Resources

Text Mining Tools and Methods
This library guide collects a range of text mining resources, covering common methods, software options, and more.
The Data-Sitters Club
The Data-Sitters Club is a group of researchers who publish articles on computational text analysis and text mining, covering a range of software and techniques.
Programming Historian
The Programming Historian offers lessons covering a range of DH techniques. You can find tutorials specific to text mining by using the pre-set filter "distant reading". The site also has lessons on using Python and R.
AntConc Tutorials
The library has created a series of tutorials on how to use AntConc (version 4.3.1). These cover the basics of how to download AntConc, how to add texts or corpora to it, and how to start using AntConc's various features.

Example Text Mining Projects

DSC #4: AntConc Saves the Day
This article uses text mining to explore common terms and phrases in the "Baby-Sitters Club" book series, and it provides a good overview of common text mining issues. Note that the project uses an earlier version of AntConc, so it should not be used as a tutorial on how to use the software.
The Viral Texts Project
The Viral Texts Project uses a variety of digital humanities techniques to consider how stories went "viral" in nineteenth-century newspapers and periodicals. You can access their data, publications, and data visualizations from their website.