University Library, University of Illinois at Urbana-Champaign

Data Cleaning for the Non-Data Scientist

Considering how to clean data up when it's not part of your regular workflow.


Gathering data can be a lengthy process, so protect that work. Before you start cleaning, make sure you have an unedited backup copy of your entire dataset stored somewhere separate from your working copy. This is important! If you accidentally delete or overwrite data, you will always have a fresh copy to return to.
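If you are comfortable running a short script, the backup step can be sketched as below. This is only an illustration, not a required method: it assumes your dataset is a single file, and the function names `sha256_of` and `make_backup` are made up for this example. It copies the file, checks that the copy matches the original, and marks the copy read-only.

```python
import hashlib
import shutil
import stat
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_backup(source: Path, backup_dir: Path) -> Path:
    """Copy `source` into `backup_dir`, verify the copy, and mark it read-only.

    Hypothetical helper: assumes the dataset is one file.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    if target.exists():
        # Never overwrite an existing backup.
        raise FileExistsError(f"Backup already exists: {target}")
    shutil.copy2(source, target)  # copy2 also tries to preserve timestamps
    if sha256_of(source) != sha256_of(target):
        raise OSError("Backup does not match the original file")
    target.chmod(stat.S_IREAD)  # discourage accidental edits to the backup
    return target
```

For example, `make_backup(Path("survey.csv"), Path("backups"))` would create `backups/survey.csv`, where both names are placeholders for your own file and folder. Ideally the backup folder lives on a different drive or storage service than your working copy.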

Even better, use a version control tool such as Git, with a hosting service like GitHub, to review and approve or reject changes to your data as you work. This protects your data from accidental changes and lets you undo mistakes. Version control is especially useful if you anticipate changing your data often, for example when doing a lot of cleaning or reorganizing, or if you have a very large dataset.

You should also set up a way to document the changes you make to the dataset. Best practices for preserving and sharing data include creating a README file that gives important information about the dataset. Being able to explain how and why you cleaned your data makes your research more transparent and credible, and it helps if you need to repeat an earlier step later on.
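One lightweight way to keep such a record is a plain-text change log stored next to your README. The sketch below is one possible approach, not a standard: the function name `log_change` and the entry format (date, what changed, why) are assumptions made for this example.

```python
from datetime import date
from pathlib import Path


def log_change(log_path: Path, what: str, why: str) -> str:
    """Append one dated entry describing a cleaning step and return it.

    Hypothetical helper: the "date | what | why" layout is just one option.
    """
    entry = f"{date.today().isoformat()} | {what} | {why}\n"
    # Mode "a" appends, so earlier entries are never overwritten.
    with log_path.open("a", encoding="utf-8") as log:
        log.write(entry)
    return entry
```

Calling `log_change(Path("CHANGELOG.txt"), "Removed duplicate rows", "Survey was exported twice")` after each cleaning step builds up a dated history you can cite later. The file name and messages here are placeholders.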

The library's Research Data Service consults with researchers on a variety of data management concerns, and can provide more information on data documentation, management, and preservation. 

Get Started

How you clean your data will depend on its format. Data is usually either structured or unstructured. Structured data is organized in a defined layout, most often a table of rows and columns such as a spreadsheet (also called tabular data). Unstructured data, such as plain text or images, has no defined structure.

Basic Steps

  1. Make a copy of your data.
  2. Choose a documentation method. 
  3. Determine your data type.
  4. Determine what you need to do.
  5. Choose a tool to use. 
  6. Clean!
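
As a small illustration of the last step, here is a sketch for tabular data saved as a CSV file. It assumes that the cleaning you need is trimming stray spaces from cells and removing rows that exactly repeat an earlier row; your own data may need different steps. The function names `clean_rows` and `clean_csv` are made up for this example, and the cleaned data goes to a new file so the original stays untouched.

```python
import csv
from pathlib import Path


def clean_rows(rows):
    """Trim whitespace in every cell and drop rows that repeat an earlier row."""
    seen = set()
    cleaned = []
    for row in rows:
        trimmed = tuple(cell.strip() for cell in row)
        if trimmed in seen:
            continue  # exact duplicate of a row already kept
        seen.add(trimmed)
        cleaned.append(list(trimmed))
    return cleaned


def clean_csv(source: Path, destination: Path) -> int:
    """Write a cleaned copy of `source` to `destination`; return rows removed."""
    with source.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))
    cleaned = clean_rows(rows)
    with destination.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(cleaned)
    return len(rows) - len(cleaned)
```

Rows are compared after trimming, so `" Ann "` and `"Ann"` count as the same value. The returned count of removed rows is worth recording in your documentation.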