Data Cleaning for the Non-Data Scientist

How to approach data cleaning when it's not part of your regular workflow.

Wait!

Before you start cleaning, lock down your original data file and only make changes to a copy. This is important! This way, if you accidentally delete anything, you can always go back to the original. Gathering data takes a lot of time and effort. You don't want to have to redo it because you made a mistake during cleaning.

The Cleaning Process

No matter what kind of data you have, or what cleaning tool you use, these basic steps will help you organize your process:

  1. Don't change the original data!
    Keep your original data as is and make changes to a copy (e.g., DataFileName_clean.csv).
  2. Create a change log.
    Take notes on your changes, and save the change log along with your other data documentation like your ReadMe file and Data Dictionary or Codebook.
  3. Save a new version at important stages.
    Before you make a big or complex change, create a new version of your file in case you need to backtrack (e.g., DataFileName_clean_v02.csv). Make a note about why you created a new version.
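If you prefer to script these steps, they can be sketched in Python as shown here. This is only one possible approach, not a required workflow. The function names (start_cleaning, save_version, log_change) and the tab-separated change log layout are assumptions made for illustration; the file naming follows the examples in the steps above.

```python
import shutil
import stat
from datetime import date
from pathlib import Path


def start_cleaning(original: Path) -> Path:
    """Copy the original to <name>_clean.<ext>, then make the original read-only."""
    working_copy = original.with_name(f"{original.stem}_clean{original.suffix}")
    shutil.copy2(original, working_copy)
    # Make sure the copy stays editable even if the original was already locked.
    working_copy.chmod(stat.S_IREAD | stat.S_IWRITE)
    # Lock the original so it cannot be overwritten by accident.
    original.chmod(stat.S_IREAD)
    return working_copy


def save_version(working_copy: Path, version: int) -> Path:
    """Save a numbered snapshot, e.g. DataFileName_clean_v02.csv."""
    snapshot = working_copy.with_name(
        f"{working_copy.stem}_v{version:02d}{working_copy.suffix}"
    )
    shutil.copy2(working_copy, snapshot)
    return snapshot


def log_change(log_file: Path, note: str) -> None:
    """Append a dated note to the change log."""
    with log_file.open("a", encoding="utf-8") as log:
        log.write(f"{date.today().isoformat()}\t{note}\n")
```

For example, calling start_cleaning(Path("DataFileName.csv")) would create DataFileName_clean.csv next to the original, and log_change(Path("changelog.txt"), "Saved v02 before merging columns") would record why a new version was made.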

How to Spot Messy Data

Almost every dataset needs some kind of cleaning, but the problems often go unnoticed until an analysis goes wrong. You can save time and effort by doing some spot checks and exploratory visualizations to find mistakes before you start your analysis.

Scan your data for:

  • Blanks or unexpected line breaks
  • Cells with multiple values
  • Columns with more than one kind of data (e.g., numbers and text)
  • Text formatting that doesn't look right

If you spot any of the above, your data needs cleaning. See the sections on Spreadsheet Data and Text Data for more advice on what to look for.
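For CSV files, the checks listed above can be roughly automated. The sketch below uses only Python's standard library and is a starting point, not a complete validator. The function names, the choice of ";" and "|" as signs of multiple values, and the wording of the problem labels are all assumptions for illustration; adjust them to fit your own data.

```python
import csv
import io
from collections import defaultdict


def looks_numeric(value: str) -> bool:
    """True if the text can be read as a number (a deliberately simple test)."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def scan_rows(rows, multi_value_separators=(";", "|")):
    """Return (row, column, problem) tuples. Row 1 is the header row."""
    problems = []
    kinds = defaultdict(set)  # column name -> kinds of data seen in it
    for number, row in enumerate(rows, start=2):
        for column, value in row.items():
            if column is None:  # csv.DictReader stores surplus cells under None
                problems.append((number, "(extra)", "more cells than headers"))
                continue
            value = value or ""  # short rows come back as None
            if value.strip() == "":
                problems.append((number, column, "blank"))
                continue
            if "\n" in value or "\r" in value:
                problems.append((number, column, "line break inside cell"))
            if value != value.strip():
                problems.append((number, column, "leading or trailing spaces"))
            if any(sep in value for sep in multi_value_separators):
                problems.append((number, column, "possible multiple values"))
            kinds[column].add("number" if looks_numeric(value.strip()) else "text")
    for column, seen in kinds.items():
        if len(seen) > 1:
            problems.append((None, column, "mixes numbers and text"))
    return problems


def scan_csv_text(text: str):
    """Scan CSV content that has already been read into a string."""
    return scan_rows(list(csv.DictReader(io.StringIO(text, newline=""))))
```

Every flagged cell is a prompt to look more closely, not proof of an error. A semicolon inside a free-text comment, for instance, would be reported as a possible multiple value even though nothing is wrong.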

Exploratory Visualizations

Exploratory visualizations are a great way to get to know your data better. Experiment with different kinds of visualizations. Exploratory visualizations don't have to look nice, but they can help you identify groupings, patterns, outliers, and any surprising values that might indicate mistakes that should be cleaned up.
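Even a very rough chart can surface surprising values. As a minimal sketch, the Python function below counts how often each value appears in a column and draws a plain-text bar chart. The function name text_bar_chart and the sample answers are hypothetical, and a charting tool or plotting library would normally give you more options.

```python
from collections import Counter


def text_bar_chart(values, width=40):
    """Return one line per distinct value: label, a bar of # marks, and the count."""
    counts = Counter(values)
    if not counts:
        return []
    largest = max(counts.values())
    label_width = max(len(str(value)) for value in counts)
    lines = []
    for value, count in counts.most_common():
        # Scale each bar against the most frequent value; always draw at least one mark.
        bar = "#" * max(1, round(count / largest * width))
        lines.append(f"{str(value):<{label_width}} {bar} {count}")
    return lines


# Hypothetical survey answers containing a case variant and a typo.
answers = ["Yes", "yes", "Yes", "No", "Yes", "N0"]
for line in text_bar_chart(answers, width=10):
    print(line)
```

In the sample, "yes" and "N0" each show up as their own short bar, which is a hint that they are probably inconsistent spellings of "Yes" and "No" and should be checked.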

Handy tools for exploring your data include:

The image below shows an example of a text analysis dashboard created by pasting plain text into Voyant Tools.