Data Cleaning for the Non-Data Scientist

Considering how to clean data up when it's not part of your regular workflow.

Common Errors

Text is sometimes referred to as "unstructured" data because it has no predefined format or structure. Common issues in text data include:

  • Misspellings (introduced by OCR or manual data entry)
  • Punctuation/special characters
  • Inconsistencies in abbreviations or capitalization
  • Extra spaces

With text, "data cleaning" can mean using a plain-text editor to correct errors, or using a programming language to strip punctuation, normalize capitalization, and remove extra characters and whitespace.
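
For example, a minimal Python sketch of that kind of cleanup (the function name and sample string are ours, for illustration) could look like this:

import re
import string

def clean_text(text):
    # Normalize capitalization
    text = text.lower()
    # Strip punctuation and special characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("The  quick,  brown FOX!"))  # prints: the quick brown fox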

Formats

Is your text machine-readable?

If your text is a non-tagged PDF or an image of text (JPG, TIFF, PNG, etc.), your text data is not machine-readable! This just means you'll need to convert the data into text that a computer understands. We recommend using an OCR (Optical Character Recognition) program. Check out our guide on OCR for tutorials and tools. 
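
As a rough sketch, here is how that conversion might look in Python, assuming the Tesseract engine and the pytesseract package are installed (the file names are hypothetical):

from PIL import Image
import pytesseract

# Extract the text from a scanned page image
text = pytesseract.image_to_string(Image.open("scanned_page.png"))

# Save the result as plain text
with open("scanned_page.txt", "w", encoding="utf-8") as f:
    f.write(text)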

When converting your text to machine-readable text, use plain text (TXT) format. Other formats, like Microsoft Word (DOCX or DOC) or Rich Text Format (RTF), are difficult to transfer between programs. Plain text files can be read with any text editor and are compatible with more software programs.

Tools

Regular Expressions 

Regular expressions, also called regex, are a powerful tool for searching your text and can help you clean text data. Check out this lesson on the Programming Historian for more information. Regex can be used to find common errors and correct them with a programming language.
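
For instance, a short Python sketch (the sample text and pattern are ours) that flags a common OCR confusion, where the digit "1" is read in place of the letter "l":

import re

text = "The qu1ck brown fox jumps over the 1azy dog."

# Find words containing a digit -- a frequent sign of OCR misreadings
suspicious = re.findall(r"\b\w*\d\w*\b", text)
print(suspicious)  # prints: ['qu1ck', '1azy']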

Programming Languages

Programming languages like Python and R allow those familiar with coding to edit large amounts of data at once. Programming languages can be combined with regular expressions to perform large-scale operations, like finding all errors of a certain type and replacing them with the correct word. 
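
A sketch of that kind of large-scale find-and-replace in Python, again using regular expressions (the file names and the specific error pattern are hypothetical, and a heuristic like this can over-correct on real data):

import re

with open("ocr_output.txt", encoding="utf-8") as f:
    text = f.read()

# Replace a "1" at the start of, or inside, a lowercase word with "l",
# e.g. "1azy" becomes "lazy"
text = re.sub(r"\b1(?=[a-z])", "l", text)
text = re.sub(r"(?<=[a-z])1(?=[a-z])", "l", text)

with open("ocr_output_cleaned.txt", "w", encoding="utf-8") as f:
    f.write(text)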

Text Editors

You could also use any basic text editor to clean up your text, though it may be more time-intensive. Notepad++ is a good option and is able to read many file formats.

OCR Programs

Check out our guide on Optical Character Recognition for other tools like ABBYY FineReader and Adobe Acrobat.

Next Steps

Perhaps you're embarking on a text analysis project.

If so, you may need to perform other pre-processing steps like tokenization, stemming (e.g., making "library" and "libraries" the same token), expanding contractions (making "don't" into "do not"), or other steps to prepare your text for analysis.

Tokenization

Tokenization is the process of preparing free text for analysis by putting it into a structured format. Each unique word in the text corpus is a type, and each occurrence of that word is a token. Consider the sentence "The quick brown fox jumps over the lazy dog." That sentence, split into tokens and represented in XML, would look like this:

<sentence>
  <word>The</word>
  <word>quick</word>
  <word>brown</word>
  <word>fox</word>
  <word>jumps</word>
  <word>over</word>
  <word>the</word>
  <word>lazy</word>
  <word>dog</word>
</sentence>

Example from Wikipedia.
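
In Python, the same sentence can be split into tokens with NLTK's word_tokenize (a sketch; the tokenizer model needs a one-time download):

import nltk
nltk.download("punkt")  # one-time download of the tokenizer model

from nltk.tokenize import word_tokenize

tokens = word_tokenize("The quick brown fox jumps over the lazy dog.")
print(tokens)
# prints: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']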

Stemming

Stemming is the process of reducing a word to its base or root form. An example would be reducing the words "catty" and "cats" to the base form, "cat." This allows related words to be processed together.
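
A minimal sketch with NLTK's PorterStemmer (one of several stemming algorithms) shows that stems are not always dictionary words:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("cats"))       # prints: cat
print(stemmer.stem("libraries"))  # prints: librari -- a stem, not a real word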

Lemmatization

Lemmatization is similar to stemming in that it returns words to their bases or roots, but it is able to take context into account. An example would be returning the word "better" to its base form, "good."
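
A sketch using NLTK's WordNetLemmatizer, which needs the WordNet data downloaded and, for the "better" example, a part-of-speech hint that supplies the context:

import nltk
nltk.download("wordnet")  # one-time download of the WordNet data

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"))  # prints: good ("a" = adjective)
print(lemmatizer.lemmatize("libraries"))        # prints: library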

To see more pre-processing steps using the Python package NLTK, check out this tutorial.