LibGuides: OpenRefine: Removing Duplicates

Removing Duplicates

Your data may include duplicate information which is unnecessary to keep as part of the data set.

NOTE: Make sure that you are in ROW mode for best results.

Run the sort function on the column which contains duplicates. For instructions on sorting, see the Sorting Data section.
After you have sorted the column, choose "Sort" and then "reorder rows permanently."
Go to the column with duplicates and click on the arrow button in the column header.
Choose “Edit cells” and then select “Blank down.”
1. “Blank down” will detect if two rows following each other have the same content. If they do, the second row will be “blanked out” and the cell values removed.

After you have used the “Blank down” function, use the “Facet by blank” to identify rows with blank cell values for that column.
From the facet window, select the “true” option.
Go to the column labeled “All” and click on the arrow button, then select “Edit rows” and choose “Remove all matching rows.”
All rows with the identified duplicates will be removed. To restore the full data view, simply reset the facets

Helpful Tips:

It is important to make sure that all cells in the column you are applying this transformation to have values. If there are cells that were originally blank before applying the “Blank down” function, they will be deleted along with the duplicate rows.
It is important to be aware that the entire row of values will be deleted along with the duplicate value in a given column. Be careful and check before deleting rows or applying the "Blank down" function to make sure that data meant to be kept is not accidentally deleted.

5/21/2018 - Brinna Michael