This guide helps non-data scientists anticipate and handle "dirty" data. Did you download your data from elsewhere? Does your data come from a physical book or a similar source? Read on for suggestions and tools to help you get your data in shape for analysis.
When we say data, we do not mean only scientific or numerical data! For humanists, it can seem nonsensical to think of your research material as "data." However, if you are using computational methods or digital tools on your primary and secondary sources, those sources are considered data!
Already know the concepts and ready to dive right in with cleaning your data? Check out the guide on OpenRefine for spreadsheet data.
Note: This guide discusses concepts of data cleaning, best practices, and special considerations.
Unless you’ve collected your data yourself, and even if you have, chances are that the data you want to analyze will need to be cleaned up. Data cleaning refers to the process of preparing data for analysis, and often includes steps like normalizing values, handling blank (null) values, reorganizing data, and otherwise refining data into exactly what you need.
Note: “Cleaning” is the most widely accepted term for this process; other terms include “tidying,” which refers specifically to the process of reorganizing data, and “carpentry,” which is often associated with the Data Carpentry organization, which develops training on data cleaning and management skills.
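To make steps like normalizing values and handling blank values more concrete, here is a minimal sketch in Python. The records, field names, and the `clean_row` helper are all invented for illustration; your own data will call for different rules.

```python
# Hypothetical records, invented for illustration only.
raw_rows = [
    {"title": "  The Evening Star ", "state": "DC", "year": "1923"},
    {"title": "the evening star", "state": "D.C.", "year": ""},
    {"title": "New-York Tribune", "state": "ny", "year": "1923"},
]


def clean_row(row):
    """Return a cleaned copy of one record (hypothetical helper)."""
    # Normalize values: collapse stray whitespace and use consistent capitalization.
    title = " ".join(row["title"].split()).title()
    # Normalize values: drop periods and uppercase the abbreviation.
    state = row["state"].replace(".", "").upper()
    # Handle blank values: store an explicit None instead of an empty string.
    year = int(row["year"]) if row["year"].strip() else None
    return {"title": title, "state": state, "year": year}


cleaned_rows = [clean_row(row) for row in raw_rows]
for row in cleaned_rows:
    print(row)
```

After cleaning, the first two records share the same title and state spelling, so a tool counting articles per newspaper would treat them as one publication rather than two. The missing year stays visibly missing instead of being silently read as text.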
Cleaning data helps ensure accurate analysis and avoids misrepresenting the data.
Let's look at an example. Look at the word cloud below. Word clouds show word frequencies by displaying more common words in larger type than less common words. This word cloud represents a dataset of newspaper articles from 1923.
What issues do you notice? It looks normal at first glance, but closer inspection reveals that one of our most common words is "chroniclingamerica.loc.gov." We may also see that "https", the common prefix for URLs, appears 72 times. If our dataset is newspaper articles from 1923, why are URL elements appearing? It's because, in addition to our plain text newspaper articles, the dataset also includes a spreadsheet detailing where each article was retrieved from. All the articles come from the Chronicling America project through the Library of Congress, hence the URL elements. This is an instance in which some exploratory data analysis can reveal issues with the data that require cleaning.
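The kind of check described above can be sketched in a few lines of Python. This is only an illustration: the sample text, the `word_counts` helper, and the choice to strip URLs with a regular expression are assumptions, not the method used to build the word cloud. In practice, you might instead leave the source spreadsheet out of the text you analyze.

```python
import re
from collections import Counter

# Hypothetical sample text, invented for illustration. The last two entries
# stand in for rows from a spreadsheet of source links.
documents = [
    "The council met on Monday to discuss the new bridge.",
    "Bridge repairs begin Monday, the council said.",
    "https://chroniclingamerica.loc.gov/ source record one",
    "https://chroniclingamerica.loc.gov/ source record two",
]

# Matches anything that starts with http:// or https:// up to the next space.
URL_PATTERN = re.compile(r"https?://\S+")
# Matches words, keeping dotted names such as a web domain together.
WORD_PATTERN = re.compile(r"[a-z]+(?:\.[a-z]+)*")


def word_counts(texts, strip_urls=False):
    """Count how often each word appears (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        if strip_urls:
            text = URL_PATTERN.sub(" ", text)
        counts.update(WORD_PATTERN.findall(text.lower()))
    return counts


before = word_counts(documents)
after = word_counts(documents, strip_urls=True)

print("Before cleaning:", before.most_common(5))
print("After cleaning: ", after.most_common(5))
```

Comparing the two lists is a quick form of exploratory analysis: if terms such as "https" rank among the most frequent words before cleaning and disappear afterward, they came from the source links rather than from the articles themselves.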
Ask yourself, very generally, is the data correctly formatted and does it provide what I need? More specifically:
Except where otherwise indicated, original content in this guide is licensed under a Creative Commons Attribution (CC BY) 4.0 license. You are free to share, adopt, or adapt the materials. We encourage broad adoption of these materials for teaching and other professional development purposes, and invite you to customize them for your own needs.