Data Cleaning for the Non-Data Scientist

Considering how to clean up data when it's not part of your regular workflow.

Formats

Most analysis tools work best with simple tables saved as CSV (comma separated values) or TSV (tab separated values) files. CSV and TSV files are machine readable, meaning they can be imported into just about any data analysis or visualization program. They're versatile because they're relatively simple: so simple, in fact, that they can be opened in any text editor. They don't contain charts, graphs, fancy formatting, or multiple sheets.
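To see just how simple they are, here's a minimal Python sketch, using only the standard library, that reads a CSV (the filename survey.csv is a made-up example):

    import csv

    # Any CSV can be read with nothing but the standard library.
    # "survey.csv" is a hypothetical example file.
    with open("survey.csv", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)   # the first row becomes the column names
        for row in reader:
            print(row)               # each row is a plain dictionary of strings

    # TSV files work the same way; just set the delimiter to a tab:
    # reader = csv.DictReader(f, delimiter="\t")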

CSV and TSV files are also great for preserving spreadsheet data. Since they're compatible with so many different programs, there are lots of free tools you can use to open them. You also don't have to worry about losing access to your files when a software license expires. You have access to proprietary software while you're at the University, but what about after you leave? To be safe, save your spreadsheet data as a CSV or TSV.

Not sure which one to choose? CSVs are more common.

Data Types

Each column in your spreadsheet should contain one kind of data (e.g., numbers, text, dates). Make sure each column is formatted for the type of data stored there. Numbers stored as text can be particularly problematic when you go to analyze your data. The most common way people mix formats is by adding notes to a column where they don't belong. Move notes to a separate column so they don't interfere with your analysis.
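For example, here's a minimal sketch using the pandas library (an assumption; the guide doesn't require any particular tool, and the column values are invented) that moves a note out of a numeric column:

    import pandas as pd

    # A hypothetical "count" column where someone typed a note into a cell.
    df = pd.DataFrame({"count": ["12", "7", "9 (estimated)", "15"]})

    # Pull the note out into its own column...
    df["note"] = df["count"].str.extract(r"\((.+)\)", expand=False)

    # ...then keep only the leading number and convert it to a real numeric type.
    df["count"] = pd.to_numeric(df["count"].str.extract(r"^(\d+)", expand=False))

    print(df.dtypes)   # "count" is now a numeric column, not text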

Dates

For dates, the recommended format is YYYY-MM-DD because it sorts in chronological order. Make sure your dates are formatted consistently, and don't get discouraged if it takes time and several attempts to get them right. Date formats are one of the most common data cleaning challenges with tabular data.
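For instance, a short pandas sketch (again an assumption, with invented values) that rewrites US-style dates in ISO YYYY-MM-DD form:

    import pandas as pd

    # A hypothetical column of US-style dates stored as text.
    dates = pd.Series(["3/14/2021", "3/15/2021", "12/1/2021"])

    # Parse them, then write every value back out as YYYY-MM-DD,
    # which sorts chronologically even as plain text.
    parsed = pd.to_datetime(dates, format="%m/%d/%Y")
    print(parsed.dt.strftime("%Y-%m-%d").tolist())
    # ['2021-03-14', '2021-03-15', '2021-12-01']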

Does your data have dates before 1900? Watch out! Excel's date system begins in 1900, so its built-in formats and formulas don't work on earlier dates; Excel stores them as text. If your data includes dates before 1900, it may be better to use separate columns for year, month, and day, as in the sketch below.
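A minimal sketch of that approach, assuming the dates are already text in YYYY-MM-DD form (the values here are invented):

    # Splitting dates into separate integer columns sidesteps spreadsheet
    # date handling entirely, so pre-1900 dates stay usable for sorting.
    rows = ["1846-07-04", "1851-11-30", "1899-02-12"]   # hypothetical values

    for text in rows:
        year, month, day = (int(part) for part in text.split("-"))
        print(year, month, day)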

Common Spreadsheet Errors

No matter which tool you use to clean your spreadsheet data, correcting these small mistakes can have a big impact on your analysis.

Duplicates

Make sure unique identifiers are truly unique. Duplicates of unique IDs usually indicate an error. You'll need to investigate further to decide whether to merge records, remove duplicates, or reassign an ID.
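For example, with pandas (an assumption; most analysis tools have an equivalent) you can pull out every row that shares an ID, so you can inspect each case before deciding:

    import pandas as pd

    # Hypothetical records where "id" is supposed to be unique.
    df = pd.DataFrame({"id": [101, 102, 102, 103],
                       "name": ["Ada", "Ben", "Ben", "Cai"]})

    # keep=False marks every row involved in a duplicate, not just the repeats.
    dupes = df[df["id"].duplicated(keep=False)]
    print(dupes)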

Combined Values

Stick to the 1:1:1 rule: one variable per column, one instance per row, one value per cell. Are there cells in your spreadsheet with more than one value in them? Split those values into separate columns.
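For instance, a pandas sketch (an assumption, with invented values) that splits a combined city-and-state cell into two columns:

    import pandas as pd

    # Hypothetical "location" cells holding two values at once.
    df = pd.DataFrame({"location": ["Urbana, IL", "Chicago, IL", "Austin, TX"]})

    # One value per cell: split the combined field into two real columns.
    df[["city", "state"]] = df["location"].str.split(", ", expand=True)
    print(df)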

Units of Measurement

Each column should only contain one unit of measurement. That unit should be noted in the column header, not typed into the cell with the measurement.
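For example, a small pandas sketch (an assumption, with invented values) that moves the unit into the header:

    import pandas as pd

    # Hypothetical measurements with the unit typed into each cell.
    df = pd.DataFrame({"weight": ["12 kg", "7 kg", "15 kg"]})

    # Put the unit in the header and keep the cells purely numeric.
    df["weight_kg"] = pd.to_numeric(df["weight"].str.replace(" kg", "", regex=False))
    df = df.drop(columns="weight")
    print(df)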

Blanks and Nulls

Pick a way to represent nulls and NAs (you might need both) and be consistent. Is a blank cell in your spreadsheet a null value or missing data? It's hard to tell, which is why it's better to explicitly mark null values.

Instead of blanks, use a consistent marker such as:

  • NULL
  • NA
  • 0

Check whether your analysis program has a preferred format before you choose one (see the sketch below).
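For instance, if you load your file with pandas (an assumption; other programs have their own options), you can declare exactly which markers mean missing:

    import pandas as pd
    from io import StringIO

    # A hypothetical file where missing values were marked "NULL" or "NA".
    raw = StringIO("score,grade\n88,B\nNULL,NA\n93,A\n")

    # Tell the reader which markers to treat as missing values.
    df = pd.read_csv(raw, na_values=["NULL", "NA"])
    print(df.isna().sum())   # count of missing values per column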

Data Normalization

Data normalization also helps prepare your data for analysis by making sure values are recorded consistently and fall within the correct range, if one applies.

Prescribed Values

Qualitative text data might have prescribed values (e.g., True/False, multiple choice answers, or state names). Check your text columns to make sure these are recorded consistently. Also, eliminate extra spaces, since they can interfere with your analysis.
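For example, a pandas sketch (an assumption, with invented values) that trims spaces and normalizes casing so every value matches a prescribed form:

    import pandas as pd

    # Hypothetical answers that should all be either "True" or "False".
    df = pd.DataFrame({"answer": [" true", "TRUE", "False ", "false"]})

    # Trim stray spaces and normalize the casing so every value matches.
    df["answer"] = df["answer"].str.strip().str.capitalize()
    print(df["answer"].unique())   # ['True' 'False']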

Ranges and Outliers

Is there a set range of values for any of your variables? These could be definite ranges with no exceptions (e.g., a Likert scale question with choices 1-5), or probable ranges where there might be some exceptions (e.g., the ages of high school students). Explore the ranges in your data and check the highest values, lowest values, and averages. Do they match what you expected? If not, investigate to find out whether there are errors in your data or just some surprising but valid values.
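For instance, a pandas sketch (an assumption, with invented values) that checks a Likert-scale column against its definite 1-5 range:

    import pandas as pd

    # Hypothetical Likert-scale responses that must fall between 1 and 5.
    df = pd.DataFrame({"rating": [4, 5, 2, 7, 1, 3]})

    # Quick sanity checks: min, max, and mean, plus any out-of-range rows.
    print(df["rating"].describe())
    print(df[~df["rating"].between(1, 5)])   # the stray 7 shows up here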