Data Cleaning for the Non-Data Scientist

Considering how to clean data up when it's not part of your regular workflow.

Common Errors

Text is sometimes referred to as "unstructured" data because it has no predefined format or structure. Common issues in text data include:

  • Misspellings (introduced by OCR or manual data entry)
  • Punctuation/special characters
  • Inconsistencies in abbreviations or capitalization
  • Extra spaces

With text, "data cleaning" can mean using a plain-text editor to correct errors, or using a programming language to strip punctuation, normalize capitalization, and remove extra characters and whitespace.
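
For example, a minimal Python sketch of that kind of cleanup (the function name and sample string are ours, for illustration) could look like this:

import re
import string

def clean_text(text):
    # Normalize capitalization
    text = text.lower()
    # Strip punctuation and special characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("The  quick,  brown FOX!"))  # prints: the quick brown fox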

Formats

Is your text machine-readable?

If your text is a non-tagged PDF or an image of text (JPG, TIFF, PNG, etc.), your text data is not machine-readable! This just means you'll need to convert the data into text that a computer understands. We recommend using an OCR (Optical Character Recognition) program. Check out our guide on OCR for tutorials and tools. 
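
As a rough sketch, here is how that conversion might look in Python, assuming the Tesseract engine and the pytesseract package are installed (the file names are hypothetical):

from PIL import Image
import pytesseract

# Extract the text from a scanned page image
text = pytesseract.image_to_string(Image.open("scanned_page.png"))

# Save the result as plain text
with open("scanned_page.txt", "w", encoding="utf-8") as f:
    f.write(text)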

When converting your text to machine-readable text, use plain text (TXT) format. Other formats, like Microsoft Word (DOCX or DOC) or Rich Text Format (RTF), are difficult to transfer between programs. Plain text files can be read with any text editor and are compatible with more software programs.

Tools

Regular Expressions 

Regular expressions, also called regex, are a powerful tool for searching your text and can help you clean text data. Check out this lesson on the Programming Historian for more information. Regex can be used to find common errors and correct them with a programming language.
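
For instance, a short Python sketch (the sample text and pattern are ours) that flags a common OCR confusion, where the digit "1" is read in place of the letter "l":

import re

text = "The qu1ck brown fox jumps over the 1azy dog."

# Find words containing a digit -- a frequent sign of OCR misreadings
suspicious = re.findall(r"\b\w*\d\w*\b", text)
print(suspicious)  # prints: ['qu1ck', '1azy']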

Programming Languages

Programming languages like Python and R allow those familiar with coding to edit large amounts of data at once. Programming languages can be combined with regular expressions to perform large-scale operations, like finding all errors of a certain type and replacing them with the correct word. 
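
A sketch of that kind of large-scale find-and-replace in Python, again using regular expressions (the file names and the specific error pattern are hypothetical, and a heuristic like this can over-correct on real data):

import re

with open("ocr_output.txt", encoding="utf-8") as f:
    text = f.read()

# Replace a "1" at the start of, or inside, a lowercase word with "l",
# e.g. "1azy" becomes "lazy"
text = re.sub(r"\b1(?=[a-z])", "l", text)
text = re.sub(r"(?<=[a-z])1(?=[a-z])", "l", text)

with open("ocr_output_cleaned.txt", "w", encoding="utf-8") as f:
    f.write(text)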

Text Editors

You could also use any basic text editor to clean up your text, though it may be more time-intensive. Notepad++ is a good option and is able to read many file formats.

OCR Programs

Check out our guide on Optical Character Recognition for other tools like ABBYY FineReader and Adobe Acrobat.

Next Steps

Perhaps you're embarking on a text analysis project.

If so, you may need to perform other pre-processing steps like tokenization, stemming (e.g., making "library" and "libraries" the same token), expanding contractions (making "don't" into "do not"), or other steps to prepare your text for analysis.

Tokenization

Tokenization is the process of preparing free text for analysis by putting it into a structured format. Each unique word in the text corpus is a type, and each occurrence of that word is a token. Consider the sentence "The quick brown fox jumps over the lazy dog." That sentence, split into tokens and represented in XML, would look like this:

<sentence>
  <word>The</word>
  <word>quick</word>
  <word>brown</word>
  <word>fox</word>
  <word>jumps</word>
  <word>over</word>
  <word>the</word>
  <word>lazy</word>
  <word>dog</word>
</sentence>

Example from Wikipedia.
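
In Python, the same sentence can be split into tokens with NLTK's word_tokenize (a sketch; the tokenizer model needs a one-time download):

import nltk
nltk.download("punkt")  # one-time download of the tokenizer model

from nltk.tokenize import word_tokenize

tokens = word_tokenize("The quick brown fox jumps over the lazy dog.")
print(tokens)
# prints: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']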

Stemming

Stemming is the process of reducing a word to its base or root form. An example would be reducing the words "catty" and "cats" to the base form, "cat." This allows related words to be processed together.
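
A minimal sketch with NLTK's PorterStemmer (one of several stemming algorithms) shows that stems are not always dictionary words:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("cats"))       # prints: cat
print(stemmer.stem("libraries"))  # prints: librari -- a stem, not a real word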

Lemmatization

Lemmatization is similar to stemming in that it returns words to their bases or roots, but it is able to take context into account. An example would be returning the word "better" to its base form, "good."
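
A sketch using NLTK's WordNetLemmatizer, which needs the WordNet data downloaded and, for the "better" example, a part-of-speech hint that supplies the context:

import nltk
nltk.download("wordnet")  # one-time download of the WordNet data

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"))  # prints: good ("a" = adjective)
print(lemmatizer.lemmatize("libraries"))        # prints: library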

To see more pre-processing steps using the Python package NLTK, check out this tutorial.