
University Library, University of Illinois at Urbana-Champaign

Data Cleaning for the Non-Data Scientist

Considering how to clean data up when it's not part of your regular workflow.

About this Guide

This guide helps the non-data scientist anticipate and handle "dirty" data. Did you download your data from elsewhere? Is your data from a physical book or something similar? Read on for suggestions and tools to help you get your data in shape for analysis.

When we say data, we do not mean only scientific or numerical data! For humanists, it can seem nonsensical to think of your research material as "data." However, if you are using computational methods or digital tools on your primary and secondary sources, those sources are considered data!

Already know the concepts and ready to dive right in with cleaning your data? Check out the guide on OpenRefine for spreadsheet data. 

Note: This guide discusses concepts of data cleaning, best practices, and special considerations. 

What is data cleaning?

Whether or not you collected your data yourself, chances are that the data you want to analyze will need to be cleaned up. Data cleaning refers to the process of preparing data for analysis, and often includes steps like normalizing values, handling blank (null) values, re-organizing data, and otherwise refining data into exactly what you need.
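As a rough illustration of two of those steps, the sketch below normalizes inconsistent spellings of a value and turns blank cells into explicit nulls. The sample rows, the column names, the list of strings treated as blank, and the `clean_row` helper are all hypothetical, invented for this example; it is one possible approach, not a method prescribed by this guide.

```python
# Hypothetical sample rows, as they might arrive from a downloaded spreadsheet.
rows = [
    {"city": " Chicago ", "year": "1923"},
    {"city": "chicago", "year": ""},
    {"city": "CHICAGO", "year": "N/A"},
]

# Assumption: in this example, these strings all mean "no value".
BLANKS = {"", "n/a", "na", "null"}


def clean_row(row):
    """Return a copy of the row with normalized text and explicit nulls."""
    cleaned = {}
    for key, value in row.items():
        value = value.strip()  # drop stray spaces around the value
        if value.lower() in BLANKS:
            cleaned[key] = None  # a blank value becomes an explicit null
        else:
            cleaned[key] = value
    # Normalize city names to one consistent capitalization.
    if cleaned["city"] is not None:
        cleaned["city"] = cleaned["city"].title()
    # Convert the year from text to a number when it is present.
    if cleaned["year"] is not None:
        cleaned["year"] = int(cleaned["year"])
    return cleaned


cleaned_rows = [clean_row(row) for row in rows]
for row in cleaned_rows:
    print(row)
```

After cleaning, all three rows share the same spelling of the city, and the two rows without a usable year carry an explicit null rather than an empty string or "N/A".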

Note: “Cleaning” is the most widely accepted term for this process; other terms include “tidying,” which refers specifically to the process of reorganizing data, and “carpentry,” which is often associated with the Data Carpentry organization, which develops training on data cleaning and management skills.

Why clean data?

To ensure accurate analysis and to avoid misrepresentations of data. 

Let's look at an example. Look at the word cloud below. Word clouds show word frequencies by showing more common words larger than less common words. This word cloud represents a dataset of newspaper articles from 1923.

Image: word cloud. A full description follows in the next paragraph.

What issues do you notice? It looks normal at first glance, but closer inspection reveals that one of our most common words is "chroniclingamerica.loc.gov." We may also see that "https", the common prefix for URLs, appears 72 times. If our dataset is newspaper articles from 1923, why are URL elements appearing? It's because, in addition to our plain text newspaper articles, the dataset also contains a spreadsheet recording where each article was retrieved from. All the articles come from the Chronicling America project through the Library of Congress, hence the URL elements. This is an instance in which some exploratory data analysis can reveal issues with the data that require cleaning.
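The kind of check described above can be sketched in a few lines of Python. The sample text is a hypothetical stand-in for the newspaper dataset, which is not reproduced here, and the `count_words` helper and the simple URL pattern are assumptions made for illustration rather than the method used to build the word cloud.

```python
import re
from collections import Counter

# Hypothetical stand-in for article text mixed with source URLs.
text = (
    "The mayor spoke today. https://chroniclingamerica.loc.gov/a "
    "The council met. https://chroniclingamerica.loc.gov/b "
    "The mayor replied."
)


def count_words(s):
    """Count lowercase words made of letters and apostrophes."""
    return Counter(re.findall(r"[a-z']+", s.lower()))


# Exploratory step: count words in the data as it was received.
before = count_words(text)

# Cleaning step: drop anything that looks like a URL, then count again.
# The pattern is deliberately simple and is an assumption for this example.
text_without_urls = re.sub(r"https?://\S+", " ", text)
after = count_words(text_without_urls)

print("https before cleaning:", before["https"])
print("https after cleaning:", after["https"])
print("mayor before and after:", before["mayor"], after["mayor"])
```

Counting before and after makes it easy to confirm that the URL fragments are gone while the counts for words from the articles themselves are unchanged.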

How do you know if it needs cleaning?

Ask yourself, very generally: is the data correctly formatted, and does it provide what I need? More specifically:

  • Did you collect the data yourself or is it from somewhere else? If you’re re-using data, it’s likely that it’s not already formatted in the best way for your research and the tools you want to use.  
  • Do you know what all the columns or variables are?
  • Do you know what kinds of data you should include in your analysis, and how they are useful?
  • Do you know if there are any missing values or possible errors?
  • Have you looked for outliers? If outliers are present, you will need to decide how to handle them.
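
The last two questions in the list above can be explored with a short script. This is a minimal sketch under stated assumptions: the column of page counts is hypothetical, and the 1.5 × IQR cutoff is one common convention for flagging outliers, not a rule set by this guide. Flagging a value only marks it for review; deciding whether it is an error or a genuine observation is still up to you.

```python
import statistics

# Hypothetical column of page counts; "" and None stand in for blank cells.
raw_values = ["4", "8", "", "6", "8", "400", "4", None, "6"]


def is_missing(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or value.strip() == ""


# Record which positions have no value at all.
missing_positions = [i for i, v in enumerate(raw_values) if is_missing(v)]

# Convert the remaining values to numbers.
numbers = [int(v) for v in raw_values if not is_missing(v)]

# Quartiles of the present values; the IQR is the spread of the middle half.
q1, _median, q3 = statistics.quantiles(numbers, n=4)
iqr = q3 - q1

# Assumption: flag anything more than 1.5 * IQR outside the quartiles.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in numbers if x < low or x > high]

print("Missing at positions:", missing_positions)
print("Possible outliers:", outliers)
```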