
University Library, University of Illinois at Urbana-Champaign

Text Mining Tools and Methods

This guide contains resources for conducting research with text mining.

Choosing a method

The text analysis method you choose will depend on your research question. When choosing a method to use, first consider what you expect to learn from your research and what form you would like your results to take. The methods described below can be combined in different ways during the course of a research project. For example, natural language processing algorithms might reveal the names of people in your text, to which you could apply network analysis to study how the actors are connected. 

Word Frequencies

Word frequencies are a basic building block of higher-level text analysis algorithms, although they can sometimes be revealing in themselves. You can work with raw word counts, or calculate each word's share of all the words in a text or set of texts and compare that share across texts or over time. Frequencies can also be counted for "n-grams," phrases made up of a certain number (n) of consecutive words.
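The counting described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a recommended tool: the sample sentence, the simple tokenizer, and the function names (tokenize, ngrams, relative_frequencies) are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def relative_frequencies(tokens):
    """Map each word to its share of all tokens in the text."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

# Made-up sample text, for illustration only.
text = "The cat sat. The dog sat. The cat ran."
tokens = tokenize(text)

raw_counts = Counter(tokens)                # raw word counts
bigram_counts = Counter(ngrams(tokens, 2))  # counts of 2-word phrases
shares = relative_frequencies(tokens)       # word counts as a share of the text

print(raw_counts.most_common(3))
print(bigram_counts[("the", "cat")])
print(round(shares["cat"], 3))
```

Relative shares, rather than raw counts, are what make comparisons across texts of different lengths meaningful.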

Related Tools in the Scholarly Commons:      

Word frequencies of "cat" vs. "dog" in texts, 1800-2000, generated using HathiTrust Bookworm

Related Tools Available Online:

Related Library Guides:

Example Project Using Word Frequencies:

Machine Learning

Text analysis often relies on machine learning, a branch of computer science that trains computers to recognize patterns. Two kinds of machine learning are commonly used in text analysis: supervised learning, where a human supplies labeled examples to train the pattern-detecting model, and unsupervised learning, where the computer finds patterns in text with little human intervention. An example of supervised learning is Naive Bayes classification. See Natural Language Processing and Topic Modeling for examples of unsupervised machine learning.
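To make supervised learning concrete, here is a minimal sketch of a Naive Bayes text classifier written with only the Python standard library. The four training sentences, the labels "travel" and "cooking", and the function names are invented for illustration; a real project would use far more labeled text and most likely an established library.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Count documents per label and words per label from (tokens, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary

def classify(tokens, model):
    """Return the label with the highest log-probability for the given tokens."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Start from the prior: how common this label is in the training data.
        score = math.log(label_counts[label] / total_docs)
        label_total = sum(word_counts[label].values())
        for token in tokens:
            # Add-one (Laplace) smoothing keeps unseen words from zeroing the score.
            numerator = word_counts[label][token] + 1
            score += math.log(numerator / (label_total + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled examples: this is the "human help" in supervised learning.
training = [
    ("the ship sailed across the sea".split(), "travel"),
    ("we sailed to a distant harbor".split(), "travel"),
    ("the recipe calls for butter and flour".split(), "cooking"),
    ("bake the bread with flour and yeast".split(), "cooking"),
]
model = train_naive_bayes(training)

print(classify("the ship left the harbor".split(), model))
print(classify("knead the flour and butter".split(), model))
```

The model never sees rules about travel or cooking; it only learns which words tend to appear under each human-assigned label.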

Example Project Using Classification (Supervised Machine Learning):

Topic Modeling

Topic modeling, a form of machine learning, is a way of identifying patterns and themes in a body of text. It relies on statistical algorithms such as Latent Dirichlet Allocation (LDA), which groups words into "topics" based on how frequently they co-occur in the same documents.
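For readers curious about what such an algorithm does internally, the following is a rough sketch of Latent Dirichlet Allocation fitted with collapsed Gibbs sampling, one common way to estimate the model. It uses only the Python standard library. The toy documents, the parameter values (alpha, beta, iterations, seed), and the function names are assumptions chosen for the example, and the topics it finds on such a tiny corpus should not be read as meaningful results.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return topic-word counts and the vocabulary."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]               # tokens per topic in each doc
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                              # tokens per topic overall
    assignments = []

    # Start by assigning every token to a random topic.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = word_id[word]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight each topic by how much this doc uses it and how much it uses this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word, vocab

def top_words(topic_word, vocab, n=3):
    """List the n most frequent words assigned to each topic."""
    return [
        [vocab[w] for w in sorted(range(len(vocab)), key=lambda w: -row[w])[:n]]
        for row in topic_word
    ]

# Made-up toy corpus, already tokenized.
docs = [
    "river water flood river dam".split(),
    "water dam river flood".split(),
    "forest tree fire forest".split(),
    "tree fire forest tree".split(),
]
topic_word, vocab = lda_gibbs(docs, num_topics=2)
for words in top_words(topic_word, vocab):
    print(words)
```

The number of topics is chosen by the researcher, not discovered by the algorithm, and interpreting what each word group means is still a human task.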

Related Tools in the Scholarly Commons: 

Topic model visualization from Digital Environmental Humanities. Credit: Visualization by Digital Environmental Humanities, available under CC BY-NC-SA 3.0

Related Tools Available Online:

Related Library Guides:

Example Project Using Topic Modeling:

Natural Language Processing

Natural language processing, which often draws on machine learning, is the use of computational methods to extract meaning from free text. Among other things, natural language processing algorithms can identify names of people and places, dates, sentiment, and parts of speech.
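As a very rough illustration of two of these tasks, the sketch below pulls four-digit years out of a sentence and scores its sentiment against a tiny word list. It is a rule-based toy, not how trained natural language processing tools work: the word lists, the sample sentence, and the function names are all invented for the example.

```python
import re

# Hypothetical mini-lexicons; real sentiment lexicons contain thousands of entries.
POSITIVE = {"good", "happy", "delighted", "excellent"}
NEGATIVE = {"bad", "sad", "miserable", "terrible"}

def extract_years(text):
    """Find four-digit numbers from 1000 to 2099 that stand alone as words."""
    return [int(year) for year in re.findall(r"\b(1[0-9]{3}|20[0-9]{2})\b", text)]

def sentiment_score(text):
    """Count positive words minus negative words; above zero leans positive."""
    words = re.findall(r"[a-z]+", text.lower())
    positive_hits = sum(word in POSITIVE for word in words)
    negative_hits = sum(word in NEGATIVE for word in words)
    return positive_hits - negative_hits

# Made-up sentence, for illustration only.
sentence = "In 1813 the reviews were excellent, but by 1815 the author was miserable and sad."
years = extract_years(sentence)
score = sentiment_score(sentence)

print(years)
print(score)
```

Simple rules like these miss negation, sarcasm, and context, which is one reason trained models are usually preferred for real research.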

Related Tools Available in the Scholarly Commons:

Related Tools Available Online:     

Related Library Guides:

Example Project Using Natural Language Processing:

Network and Citation Analysis

Network analysis is a method for finding connections between nodes, which can represent people, concepts, sources, and more. These networks are usually visualized as graphs that show how the nodes are interconnected.
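A small sketch may help show what the underlying data looks like before it reaches a visualization tool. Here, two people are connected whenever their names appear in the same document. The names, the documents, and the function names are made up for illustration, and the code uses only the Python standard library.

```python
from collections import defaultdict
from itertools import combinations

def build_network(groups):
    """Link every pair of names that appear together; the weight counts shared groups."""
    edges = defaultdict(int)
    for group in groups:
        # Sorting gives each undirected pair a single, consistent key.
        for a, b in combinations(sorted(set(group)), 2):
            edges[(a, b)] += 1
    return dict(edges)

def degree(edges):
    """Count how many distinct neighbors each node has."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {node: len(adjacent) for node, adjacent in neighbors.items()}

# Hypothetical data: the names found in each of four documents.
names_per_document = [
    ["Ada", "Ben"],
    ["Ada", "Cal"],
    ["Cal", "Dee"],
    ["Ben", "Dee", "Ada"],
]
edges = build_network(names_per_document)
degrees = degree(edges)

print(edges[("Ada", "Ben")])
print(degrees)
```

An edge list like this one is the typical input format for network visualization software.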

Citation analysis can be used to discover connections and relationships among documents based on how they cite one another; the results can then be visualized.

Related Tools Available Online:

  • Gephi (network analysis)
  • VOSviewer (citation and network analysis)

Example Project:

Visualizations

Generating visualizations is a way to "see" your data. Text mining visualizations can help researchers see relationships between concepts. Examples include word clouds, graphs, maps, and other graphics that provide a visual depiction of the data.

Related Tools in the Scholarly Commons:

Word cloud of Jane Austen's Pride and Prejudice created in Wordle

Related Tools Available Online:        

Related Library Guides: