Skip to Main Content
It looks like you're using Internet Explorer 11 or older. This website works best with modern browsers such as the latest versions of Chrome, Firefox, Safari, and Edge. If you continue with this browser, you may see unexpected results.

University Library, University of Illinois at Urbana-Champaign

Data Cleaning for the Non-Data Scientist

Considering how to clean data up when it's not part of your regular workflow.

Common Errors

Data in spreadsheets is often called "tabular" or "structured" data. Common errors include:

  • Data entry errors like typos, misspellings
  • Non-normalized values (i.e., some values are "USA" and some are "United States")
  • Handling of null values (i.e., N/A, 0, blanks)
    • A null just means "we don't know." However, this is indicated in many different ways depending on who entered the data. 
  • Differences in character encoding 
  • Structure of data does not match what you need 

Formats

Spreadsheet Formats

The most common and compatible file formats for tabular data is CSV (comma separated values) and TSV (tab separated values). Avoid using proprietary formats like those found with Microsoft Excel (XLSX and XLS), as those files can only be read by that program. A CSV or TSV file can be read in both software programs like Excel, and in any text editor. 

Non-spreadsheet Formats

Other data formats include HTML, XML, JSON, RData, and others. Data is these formats is still considered structured, as they have a predefined structure the data fits into. Commons errors will be similar to those listed above, though there may be special considerations with these formats. For example, JSON data often has multiple values per field, so transforming that to a CSV will require splitting them up. 

Not sure of your format?

Some basic examples are below, but if you're not sure of your data's format, we can help! Contact the Scholarly Commons (sc@library.illinois.edu) to get started. 

Here is an example of XML data, which looks similar to HTML.
 

<cat>
  <name>Zeus</name>
  <color>tabby</color>
  <age>6</age>
</cat>

XML will often have an associated "namespace" or "schema," which denotes the data's structure.

Here is an example of JSON data, which looks similar to a Python dictionary. 

"name":"Zeus""color":"tabby", "age":}

Tools

Software Programs

OpenRefine is an excellent tool to clean up structured data. It works with spreadsheet formats like CSV or TSV, as well as JSON, XML, Excel, and others. It offers powerful algorithms to detect errors, and has a simple interface. Check out our library guide on OpenRefine for help with using the program.

If you'd rather do your data cleaning by hand, you can edit your data directly and easily in a spreadsheet program like Microsoft Excel or OpenOffice Calc

Regular Expressions 

Regular expressions, also called regex, is a powerful tool for searching, and can be used to clean up data. Check out this lesson on the Programming Historian for more information. Regex can be used to find common errors and correct them with a programming language. 

Programming Languages

Programming languages like Python and R allow those familiar with coding to edit large amounts of data at once. Programming languages can be combined with regular expressions to perform large-scale operations, like finding all errors of a certain type and replacing them with the correct word. 

Text Editors

You could also use any basic text editor to clean up data, though it may be more time intensive. Notepad++ is a good option and is able to read many formats.