LibGuides: OpenRefine: Clustering

What is Clustering?

One of OpenRefine’s most powerful features is the “Clustering” function. Supported by several key collision and nearest neighbor algorithms, it can help you identify inconsistencies in your data, such as misspellings, non-standardized value formatting, and input errors.

Clustering applies what is called “fuzzy matching” to the values in a chosen column, using the algorithm of your choice to determine whether cell values “look similar” enough to be possible matches. The algorithms supported by OpenRefine are of two types:

Key collision
Nearest neighbor

For more information on the specific types of algorithms you can choose from, see the OpenRefine documentation on Clustering In Depth.
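
To make the two families more concrete, here is a minimal Python sketch. It is not OpenRefine's code. The `fingerprint` function is loosely modeled on the fingerprint key collision method (trim, lowercase, strip punctuation, fold accents, sort the unique words), and `levenshtein` is a plain edit-distance function of the kind a nearest neighbor method relies on. All function names and the sample values are illustrative assumptions.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a simplified fingerprint-style key for one cell value."""
    # Trim surrounding whitespace and lowercase.
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII form (ü -> u).
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, drop duplicate words, sort, and rejoin.
    return " ".join(sorted(set(text.split())))


def key_collision_clusters(values):
    """Group values whose keys collide; keep groups of two or more."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(members) for key, members in groups.items() if len(members) > 1}


def levenshtein(a: str, b: str) -> int:
    """Count the single-character edits needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


sample = ["New York", "new york ", "York, New", "Zürich", "Zurich", "Boston"]
clusters = key_collision_clusters(sample)
typo_distance = levenshtein("Boston", "Bostan")
```

Key collision is fast because each value is reduced to a key once and then grouped. A nearest neighbor method compares values pairwise using a distance such as the one above, so it can catch typos like “Bostan” that produce a different key, at a higher computational cost.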

How to Cluster

There are two ways to open the clustering window:
1. On the column of your choice, create a “Text facet.” At the top of the facet window, select the “Cluster” option.
2. Alternatively, click the arrow button on the header of the column you would like to cluster, select the “Edit cells” option, and choose “Cluster and edit.”
In the Clustering window, you will see several options:
1. At the top of the window, you can choose the type of algorithm to run.
2. In the center of the window is a list of the suggested clusters, showing the current values and the suggested new value for each.
3. On the right are several sliding scales which can be used to narrow the list of clusters: by number of choices in the cluster, number of total rows in the cluster, average length of the values in a cluster, or the variation in length of the values in a cluster.
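
As a rough illustration of what the four sliding scales measure, the sketch below computes comparable statistics for a single cluster in Python. The function name, the dictionary layout, and the sample counts are hypothetical. How OpenRefine calculates the length variation internally is not covered in this guide, so population variance is used here purely as a stand-in.

```python
from statistics import mean, pvariance


def cluster_stats(cluster):
    """Summarize one cluster.

    `cluster` maps each distinct value (a "choice") to the number of rows
    that contain it.
    """
    lengths = [len(value) for value in cluster]
    return {
        "choices": len(cluster),                 # distinct values in the cluster
        "rows": sum(cluster.values()),           # total rows covered by the cluster
        "average_length": mean(lengths),         # mean character length of the choices
        "length_variance": pvariance(lengths),   # assumption: population variance
    }


# Hypothetical cluster: three spellings and their row counts.
cluster = {"New York": 40, "new york": 3, "New  York": 1}
stats = cluster_stats(cluster)
```

Narrowing a slider amounts to hiding the clusters whose statistic falls outside the selected range, which is useful when a column produces more suggestions than you can review at once.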

Choose the algorithm which best suits your needs and then consider the suggested clusters.
1. For clusters you would like to keep, select the “Merge?” box and confirm that the text box in the “New Cell Value” column contains the value that should replace the clustered values.
2. Repeat for each suggested cluster.

When you are ready, select either “Merge Selected & Re-Cluster” or “Merge Selected & Close.”
1. “Merge Selected & Re-Cluster” edits the selected values and then automatically re-runs the clustering algorithm on the same column.
2. “Merge Selected & Close” edits the selected values and then closes out of the Clustering window.
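
Conceptually, merging is a bulk find-and-replace on the column. The Python sketch below shows the idea under simple assumptions: the column is a list of strings, and each selected cluster is a pair of its current values and the chosen new cell value. The names `merge_selected`, `column`, and `selected` are hypothetical and do not correspond to anything in OpenRefine itself.

```python
def merge_selected(column, selected):
    """Return a copy of `column` with every clustered value replaced.

    `selected` is a list of (cluster_values, new_cell_value) pairs, one for
    each cluster whose "Merge?" box was ticked.
    """
    replacement = {}
    for cluster_values, new_cell_value in selected:
        for old in cluster_values:
            replacement[old] = new_cell_value
    # Cells that are not part of a selected cluster are left untouched.
    return [replacement.get(cell, cell) for cell in column]


column = ["New York", "new york", "Boston", "York, New", "Bostan"]
selected = [(["New York", "new york", "York, New"], "New York")]
merged = merge_selected(column, selected)
```

In this example “Bostan” is still unmerged after the first pass. Running the clustering again on the edited column, as “Merge Selected & Re-Cluster” does, gives you another chance to catch such values, possibly with a different algorithm.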

Helpful Tips:

It can be helpful to have a subject specialist assist with this part of the data cleaning to catch suggested merges that would be incorrect. For example:
- A data set includes a “Location” column which has the values “Savoy Hotel” and “Hotel Savoy.” A clustering algorithm might suggest merging these two values, but a subject specialist would be able to identify that these values actually refer to two different establishments, Hotel Savoy in New York and Savoy Hotel in London.
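
The short Python sketch below shows why such a suggestion can arise. It uses a simplified word-sorting key as a stand-in for a key collision method, and it assumes a hypothetical “City” column that is not part of the example above, only to show how other information in the row can signal that a suggested merge deserves a closer look.

```python
def token_sort_key(value: str) -> str:
    """Simplified stand-in for a key collision key: lowercase, sort the words."""
    return " ".join(sorted(value.lower().split()))


# Hypothetical rows; the "City" column is an assumption for illustration.
rows = [
    {"Location": "Savoy Hotel", "City": "London"},
    {"Location": "Hotel Savoy", "City": "New York"},
]

keys = {token_sort_key(row["Location"]) for row in rows}
cities = {row["City"] for row in rows}

# Same key but different cities: the values look alike, yet the rows
# describe different places, so the merge should be reviewed by a person.
needs_review = len(keys) == 1 and len(cities) > 1
```

Both values reduce to the same key because word order is ignored, so the algorithm has no way of knowing they are different establishments. That judgment has to come from someone who knows the data.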

5/21/2018 - Brinna Michael