
University Library, University of Illinois at Urbana-Champaign

Introduction to Topic Models: Home

This guide introduces the theory and practice of topic models.

What is a Topic Model?

  • Topic modeling is a text-mining technique that identifies underlying patterns in a set of texts.
  • Words that co-occur in a text are statistically likely to be related, so a topic model tracks these co-occurrences and groups words that tend to appear together into topics.
  • Topic modeling further assumes that an individual document can contain a mixture of these topics, and it maps the topics back onto the texts in your corpus.
  • Topic models identify patterns latently: the patterns come from the text itself rather than from categories the researcher defines in advance.

Constructing Topic Models
 
1. Preparing your Data:
 
The first step to working with topic models is to pre-process, or clean, your data. However, the decisions that you make at this stage will have an effect on the patterns recognizable to a topic model, so make sure to consider the consequences of your cleaning. 
  • Remove punctuation and convert the text to lowercase. A topic model does not “read” a text in a semantic sense; it treats each distinct string as a distinct word. As a result, a capitalized word is counted separately from its lowercase form, and punctuation marks can end up in your topics.
  • Remove stop-words (like I, and, the). These words are often superfluous to the patterns your research is trying to identify, and because they are so common, it can make sense to exclude them from the model.
  • Ensure your texts can be imported into your tool. Some programs will import an entire file directory, so make sure that only the files you want analyzed are in the location you point to.
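
The cleaning steps above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular tool: the function name `preprocess` and the tiny `STOP_WORDS` set are assumptions made for the example, and a real project would use a much longer stop-word list.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"i", "and", "the", "a", "of", "to", "in", "is"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, replace ASCII punctuation with spaces, and drop stop-words."""
    lowered = text.lower()
    # Map each punctuation character to a space so "model,and" does not fuse.
    # string.punctuation covers ASCII only; curly quotes would need extra handling.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    no_punct = lowered.translate(table)
    return [tok for tok in no_punct.split() if tok not in stop_words]

tokens = preprocess("The Model, and the Topics: I read them!")
print(tokens)  # ['model', 'topics', 'read', 'them']
```

Note one consequence of this particular choice: a contraction such as "don't" is split into "don" and "t", which is the kind of side effect worth checking before you run a model.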

2. Constructing the Model

There are a variety of tools that you can use to construct topic models, with MALLET being a fairly standard implementation. Regardless of your tool or programming language of choice, there are several parameters that you can change before running the model on your data. 

How many topics (k)?
  • There is usually no single correct number of topics for your data, but too few topics will be overly broad, and too many will overlap or look too similar to one another. Either way, you will need to tell the model how many topics to identify. Some larger analyses use over one hundred topics, whereas smaller ones may use 10-20.
Iterations?
  • A topic model passes over the entire data set many times, which improves its chances of identifying stable patterns. How many iterations should you run? More than 200 is a safe bet.
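
To show where the two parameters enter, here is a toy topic model written in plain Python. It is a sketch of collapsed Gibbs sampling for LDA, offered as an assumption about what a typical tool does internally; it is not MALLET's code, and the function name `fit_lda`, the smoothing values `alpha` and `beta`, and the four-document corpus are all invented for the example.

```python
import random

def fit_lda(docs, k, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler. docs is a list of token lists; k is the topic count."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)

    doc_topic = [[0] * k for _ in docs]       # tokens in document d assigned to topic t
    topic_word = [[0] * v for _ in range(k)]  # times word w is assigned to topic t
    topic_total = [0] * k                     # total tokens assigned to topic t
    assignments = []

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            t = rng.randrange(k)
            z_doc.append(t)
            doc_topic[d][t] += 1
            topic_word[t][word_id[w]] += 1
            topic_total[t] += 1
        assignments.append(z_doc)

    # Each iteration is one full pass over every token in the corpus.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi = word_id[w]
                t = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][t] -= 1
                topic_word[t][wi] -= 1
                topic_total[t] -= 1
                # Weight each topic by how much the document and the word favor it.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][wi] + beta) / (topic_total[j] + v * beta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][wi] += 1
                topic_total[t] += 1
    return doc_topic, topic_word, vocab

docs = [
    ["cat", "dog", "pet", "cat"],
    ["dog", "pet", "vet"],
    ["stock", "market", "trade"],
    ["market", "trade", "stock", "bank"],
]
doc_topic, topic_word, vocab = fit_lda(docs, k=2, iterations=200)
for t, counts in enumerate(topic_word):
    top = sorted(zip(counts, vocab), reverse=True)[:3]
    print("topic", t, [w for _, w in top])
```

Rerunning with a different `k`, `iterations`, or `seed` changes the topics, which is why the choice of parameters needs to be reported alongside the results.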

3. Evaluating your topics

After you run your model, you will have to read and evaluate your results. In short, how do you know that your topic model found a pattern that reflects your data? What insights does this pattern offer into your analysis of the corpus or texts? There are several ways to check:

  • Do your results make sense to you? If your topics are sets of words that seem to have no relationship to one another, you may have to refine your parameters.
  • Do your results make sense to other humans? You can ask other readers to confirm the validity of your topics.
  • Do they make sense to a computer? Several computational measures estimate a topic's coherence (that is, how much sense it would make to a human reader); the Palmetto project's analysis tool is one option.
  • How do different parameters change your results? To make an argument for your topics, compare the topics produced under several different parameter settings.
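
As an illustration of what a computational coherence check can look like, the sketch below scores a topic's top words by how often they appear in the same documents. It follows the document co-occurrence measure often called UMass coherence; whether this matches the measures a given tool reports is not stated in this guide, and the function name `umass_coherence` and the small corpus are assumptions for the example.

```python
import math

def umass_coherence(top_words, docs, eps=1.0):
    """Sum of log((co-document count + eps) / document count) over ordered word pairs.

    Higher scores mean the words tend to appear in the same documents.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain every one of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                continue  # word absent from the corpus; skip to avoid dividing by zero
            score += math.log((doc_freq(top_words[m], top_words[l]) + eps) / denom)
    return score

corpus = [
    ["cat", "dog", "pet", "cat"],
    ["dog", "pet", "vet"],
    ["stock", "market", "trade"],
    ["market", "trade", "stock", "bank"],
]
related = umass_coherence(["cat", "dog", "pet"], corpus)
unrelated = umass_coherence(["cat", "stock", "bank"], corpus)
print(related > unrelated)  # True
```

A score like this is most useful for comparison, for example between topics from the same model or between models run with different numbers of topics. It does not replace reading the topics yourself.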

Scholarly Commons

Contact:
306 Main Library
Drop-ins welcome
Monday-Friday 8:30am-6:00pm
Phone: 217-244-1331