Skip to Main Content

University Library, University of Illinois at Urbana-Champaign

Data Cleaning for the Non-Data Scientist

Considering how to clean data up when it's not part of your regular workflow.

Spreadsheet Cleaning Tools

OpenRefine

If you have spreadsheets with a lot of data or a lot of inconsistencies, it is worth your time to learn how to use OpenRefine. It's a free and highly powerful tool that can identify and fix common errors (like near matches that are similar but not exactly the same). The OpenRefine LibGuide linked below is a great place to get started.

Excel

Although it's not as powerful as OpenRefine, you can use Excel for many spreadsheet cleaning tasks, but be careful. It is easy to make mistakes in Excel. You can help prevent mistakes by locking cells that shouldn't be changed and by using data validation for prescribed values.

Google Sheets

While Excel has more data cleaning functionality, Google Sheets makes it easy to split columns, remove duplicates, and trim whitespace (the unnecessary spaces in your spreadsheet that can throw off your analysis). Keep in mind that, because it is online, Google Sheets is not an appropriate tool for sensitive data. For anything sensitive, Excel or OpenRefine would be better because you can run them locally on your computer.

Text Cleaning Tools

Notepad++ and Regular Expressions

Regular expression syntax, or regex, works kind of like an extremely powerful version of find and replace. While find and replace can look for specific words, regex can search for patterns in text like capital letters, the first and last letters of words, numbers, punctuation, spaces, and line breaks. Using clever combinations, you can use regex to search for complex patterns in text like proper nouns, addresses, phone numbers, date ranges, the beginnings of sentences, etc. Notepad++ is a free Windows program for editing code that allows you to use regex to search plain text files. 

Other Resources

Want to learn more about data cleaning? Check out these resources:

Questions?

We can help. Contact the Research Data Service (researchdata@library.illinois.edu) to be put in touch with an expert.