Constructing Topic Models
1. Preparing your Data
The first step to working with topic models is to pre-process, or clean, your data. However, the decisions that you make at this stage will have an effect on the patterns recognizable to a topic model, so make sure to consider the consequences of your cleaning.
- Remove punctuation and convert text to lowercase. Topic modeling does not “read” a text in a semantic sense. Because of this, a capitalized word will be treated as a different word from its lowercase form, and punctuation marks can end up in your topics as if they were words.
- Remove stop-words (like I, and, the). These words are often superfluous to the patterns that your research is attempting to identify, and because they are so common, it can make sense to remove them before running the model.
- Ensure your texts can be imported into your tool. Some programs will import an entire file directory, so make sure that only the files you want analyzed are in the location you point to.
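The cleaning steps above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `preprocess` and the very short stop-word list are assumptions made for the example, and most tools ship with much longer stop-word lists of their own.

```python
import string

# Tiny illustrative stop-word list (an assumption for this example);
# real projects usually rely on a much longer list.
STOP_WORDS = {"i", "and", "the", "a", "an", "of", "to", "in", "is", "it"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, strip ASCII punctuation, and drop stop-words."""
    lowered = text.lower()
    # string.punctuation covers ASCII marks only; curly quotes and other
    # typographic characters would need to be handled separately.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in no_punct.split() if tok not in stop_words]

tokens = preprocess("The Whale, and the sea: I saw the WHALE!")
print(tokens)  # ['whale', 'sea', 'saw', 'whale']
```

Note that "Whale" and "WHALE" now count as the same word, which is exactly the kind of consequence worth considering before you clean.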
2. Constructing the Model
There are a variety of tools that you can use to construct topic models, with MALLET being a fairly standard implementation. Regardless of your tool or programming language of choice, there are several parameters that you can change before running the model on your data.
How many topics (k)?
There is usually no single correct number of topics for your data, but too few topics will produce topics that are overly broad, and too many will result in topics that overlap or are too similar. Regardless, you will need to tell the model how many topics to identify. Some larger analyses have over one hundred topics, while smaller ones may have 10-20.
How many iterations?
Topic modeling runs its analysis many times over the entire set of data to have a better chance of identifying patterns, so you will also need to decide how many times the model should run. Greater than 200 is a safe bet.
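To make the two parameters concrete, here is a toy topic model written in plain Python using collapsed Gibbs sampling for LDA. It is a sketch for illustration only, not MALLET's implementation and far too slow for a real corpus. The names `fit_lda` and `top_words`, the smoothing values `alpha` and `beta`, and the four tiny documents are all assumptions made for the example.

```python
import random

def fit_lda(docs, k, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs is a list of token lists, k is the number of topics, and
    iterations is the number of full passes over every token.
    Returns (topic_word_counts, vocab).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * k for _ in docs]        # tokens per topic in each doc
    topic_word = [[0] * V for _ in range(k)]   # count of each word per topic
    topic_total = [0] * k                      # tokens per topic overall
    assignments = []

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(k)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][n]
                # Take the token out of the counts, then resample its topic.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta) / (topic_total[t] + V * beta)
                    for t in range(k)
                ]
                z = rng.choices(range(k), weights=weights)[0]
                assignments[d][n] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1
    return topic_word, vocab

def top_words(topic_word, vocab, n=3):
    """Return the n most frequent words assigned to each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in topic_word
    ]

# Four tiny, already-cleaned documents (invented for this example).
docs = [
    ["whale", "sea", "ship"], ["sea", "ship", "whale"],
    ["love", "heart", "letter"], ["heart", "love", "letter"],
]
topic_word, vocab = fit_lda(docs, k=2, iterations=200)
for words in top_words(topic_word, vocab):
    print(words)
```

Changing `k` or `iterations` and rerunning is a quick way to see how sensitive the topics are to each setting. The fixed `seed` keeps runs repeatable, so any difference you see comes from the parameters rather than from chance.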
3. Evaluating your topics
After you run your model, you will have to read and evaluate your results. In short, how do you know that your topic model found a pattern that is reflective of your data? What insights does this pattern offer into your analysis of the corpus or texts? There are several ways to check:
- Do your results make sense to you? If your topics are sets of words that seem to have no relationship to one another, you may have to refine your parameters.
- Do your results make sense to other humans? You can use other readers to confirm the validity of your topics.
- Do they make sense to a computer? There are several computational measures of a topic's coherence (that is, estimates of how much sense it would make to a human), including those offered by the Palmetto project analysis tool.
- How do different parameters change your results? In order to make an argument for your topics, you should probably compare different sets of possible topics.
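As one example of a computational check, the sketch below scores a topic with a UMass-style coherence measure, which rewards topics whose top words tend to appear in the same documents. It does not call Palmetto, and it is only one of several coherence measures in use. The function name `umass_coherence`, the toy documents, and the two candidate topics are assumptions made for the example.

```python
import math
from itertools import combinations

def umass_coherence(topic_words, docs):
    """UMass-style coherence for one topic.

    topic_words should be ordered from most to least probable, and every
    word must occur in at least one document (otherwise the division
    below fails). Higher scores mean the words co-occur more often.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    # combinations() keeps the original order, so `earlier` always ranks
    # above `later` in the topic.
    for earlier, later in combinations(topic_words, 2):
        score += math.log((doc_freq(earlier, later) + 1) / doc_freq(earlier))
    return score

# Toy cleaned documents and two candidate topics (invented for this example).
docs = [
    ["whale", "sea", "ship"], ["sea", "ship", "whale"],
    ["love", "heart", "letter"], ["heart", "love", "letter"],
]
coherent = umass_coherence(["whale", "sea", "ship"], docs)
mixed = umass_coherence(["whale", "love", "ship"], docs)
print(coherent > mixed)  # True
```

Scoring the topics produced under different settings (for example, different values of k) gives you one number to set beside your own reading and other readers' judgments. It should not replace them.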