Research & Publication in Medicine & Health

Using Existing Data Sets

Using poor-quality data can lead to erroneous results and even reputational damage. Before deciding to trust a dataset, assess its quality by considering both the source and the data itself.

Consider the Source

Where did you get the data files? Did they come from a source that can ensure the files haven't been changed or corrupted? Are you using the latest, most complete version of the data?
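If the data provider publishes a checksum alongside the files, one way to confirm a download hasn't been changed or corrupted is to compute the same checksum locally and compare the two. The sketch below assumes the provider publishes a SHA-256 value. The function names and the file name in the usage comment are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so large data files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, published_hex):
    """Compare the local digest with the value the provider published."""
    return sha256_of_file(path) == published_hex.strip().lower()


# Hypothetical usage (file name and checksum are placeholders):
# matches_published_checksum("survey_2023.csv", "<checksum from the provider>")
```

A mismatch means the file differs from the one the provider described. It doesn't say why, so the usual next step is to download the file again or contact the source.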

What do you know about the research methodology? Is there an associated article you can read to assess the research design? Do the data collection and processing methods meet the standards for high-quality research in your field?

Source: UIUC Research Service Data Nudge 2024-09

Data Quality

6 Dimensions of Data Quality

Once you're assured the data comes from a quality source, assess the data itself. Although assessment tools for specific types of data might include additional measures, most data scientists agree that these six dimensions are essential for assessing data quality.

Accuracy
data correctly represents the source material

Example: every record in the Customer Table must have Customer Name, Customer Birthdate, and Customer Address fields that match the Tax Form.
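A check like the Customer Table example can be sketched as a field-by-field comparison against the reference source. This is a minimal illustration, not a standard tool: the function name, the `id` key, and the sample records are all made up.

```python
def accuracy_rate(records, reference, key, fields):
    """Share of records whose checked fields all match the reference source."""
    reference_by_key = {row[key]: row for row in reference}
    if not records:
        return 1.0
    accurate = 0
    for row in records:
        truth = reference_by_key.get(row[key])
        # A record with no reference entry cannot be confirmed as accurate.
        if truth is not None and all(row[f] == truth[f] for f in fields):
            accurate += 1
    return accurate / len(records)


# Made-up sample data for illustration only.
customers = [
    {"id": 1, "name": "Ana Diaz", "birthdate": "1980-02-11"},
    {"id": 2, "name": "Bo Chen", "birthdate": "1975-07-30"},
]
tax_forms = [
    {"id": 1, "name": "Ana Diaz", "birthdate": "1980-02-11"},
    {"id": 2, "name": "Bo Chen", "birthdate": "1975-07-03"},  # birthdate differs
]

rate = accuracy_rate(customers, tax_forms, "id", ["name", "birthdate"])
print(rate)  # 0.5: one of the two records matches the reference
```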

Completeness
expected values are fully present and known nulls are clearly marked

Completeness measures the degree to which all expected records in a dataset are present. At a data element level, completeness is the degree to which all records have data populated when expected.
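At the data element level, completeness can be estimated as the share of records in which a field is populated. The sketch below treats `None` and blank strings as missing, which is an assumption; a real dataset may use other markers for known nulls, such as `NA` or `-99`. The function name and sample records are hypothetical.

```python
def completeness_rate(records, field):
    """Share of records in which `field` is populated (not None or blank)."""
    if not records:
        return 1.0
    populated = sum(
        1
        for row in records
        if row.get(field) is not None and str(row[field]).strip() != ""
    )
    return populated / len(records)


# Made-up sample data for illustration only.
patients = [
    {"id": 1, "weight_kg": 70.5},
    {"id": 2, "weight_kg": None},
    {"id": 3, "weight_kg": ""},
    {"id": 4, "weight_kg": 82.0},
]

print(completeness_rate(patients, "weight_kg"))  # 0.5
```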

Consistency
data is recorded uniformly both within variables and across the dataset

Consistency is a data quality dimension that measures the degree to which data is the same across all instances of the data. Consistency can be measured by setting a threshold for how much difference there can be between two datasets.
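The threshold idea can be sketched as follows: compare the same field in two datasets for the records they share, then check whether the share of differing values stays under a chosen limit. The 5% default, the function names, and the sample records are illustrative assumptions, not a recommended standard.

```python
def mismatch_rate(dataset_a, dataset_b, key, field):
    """Share of records present in both datasets whose `field` values differ."""
    b_by_key = {row[key]: row[field] for row in dataset_b}
    shared = [row for row in dataset_a if row[key] in b_by_key]
    if not shared:
        return 0.0
    differing = sum(1 for row in shared if row[field] != b_by_key[row[key]])
    return differing / len(shared)


def is_consistent(dataset_a, dataset_b, key, field, threshold=0.05):
    """True when the mismatch rate does not exceed the chosen threshold."""
    return mismatch_rate(dataset_a, dataset_b, key, field) <= threshold


# Made-up sample data for illustration only.
clinic = [
    {"id": 1, "blood_type": "A"},
    {"id": 2, "blood_type": "O"},
    {"id": 3, "blood_type": "B"},
    {"id": 4, "blood_type": "AB"},
]
registry = [
    {"id": 1, "blood_type": "A"},
    {"id": 2, "blood_type": "O"},
    {"id": 3, "blood_type": "O"},  # disagrees with the clinic record
    {"id": 4, "blood_type": "AB"},
]

print(mismatch_rate(clinic, registry, "id", "blood_type"))  # 0.25
print(is_consistent(clinic, registry, "id", "blood_type"))  # False
```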

Timeliness
data represents a time period appropriate for research purposes

Timeliness is the degree to which a dataset is available when expected and depends on service level agreements being set up between technical and business resources.
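For research use, one simple timeliness check is whether the records fall inside the time period your study needs. The sketch below assumes dates are stored as ISO 8601 strings (`YYYY-MM-DD`). The function name, the field name, and the study window are hypothetical.

```python
from datetime import date


def share_within_period(records, field, start, end):
    """Share of records whose ISO date in `field` falls within [start, end]."""
    if not records:
        return 1.0
    inside = sum(
        1 for row in records if start <= date.fromisoformat(row[field]) <= end
    )
    return inside / len(records)


# Made-up sample data for illustration only.
visits = [
    {"id": 1, "visit_date": "2023-03-14"},
    {"id": 2, "visit_date": "2023-11-02"},
    {"id": 3, "visit_date": "2019-06-21"},  # outside the study window
]

share = share_within_period(
    visits, "visit_date", date(2023, 1, 1), date(2023, 12, 31)
)
print(round(share, 2))  # 0.67
```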

Uniqueness
data records are not unnecessarily duplicated

Uniqueness measures the degree to which the records in a dataset are not duplicated.
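Duplicates can be found by counting how often each identifying combination of fields appears. Which fields identify a record is a judgment about your dataset; the patient ID and visit date used here are assumptions, as are the function names and sample records.

```python
from collections import Counter


def duplicate_keys(records, key_fields):
    """Return each identifying key that appears more than once, with its count."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in records)
    return {key: n for key, n in counts.items() if n > 1}


def uniqueness_rate(records, key_fields):
    """Share of records that remain after collapsing duplicates."""
    if not records:
        return 1.0
    distinct = {tuple(row[f] for f in key_fields) for row in records}
    return len(distinct) / len(records)


# Made-up sample data for illustration only.
visits = [
    {"patient_id": "P001", "visit_date": "2023-05-01"},
    {"patient_id": "P002", "visit_date": "2023-05-01"},
    {"patient_id": "P002", "visit_date": "2023-05-01"},  # repeated entry
    {"patient_id": "P003", "visit_date": "2023-05-02"},
]

print(duplicate_keys(visits, ["patient_id", "visit_date"]))
# {('P002', '2023-05-01'): 2}
print(uniqueness_rate(visits, ["patient_id", "visit_date"]))  # 0.75
```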

Validity
data values fit within defined ranges or categories

Validity measures the degree to which the values in a data element fall within the ranges or categories defined for that element.
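Validity checks can be sketched as one rule per field, where each rule says whether a value falls in the allowed range or category. The rules below (ages from 0 to 120, three sex codes) are illustrative assumptions; in practice the allowed values come from the dataset's codebook or data dictionary. The function name and sample records are hypothetical.

```python
def invalid_values(records, rules):
    """Return (row_index, field) pairs for values that fail their rule.

    `rules` maps a field name to a function that returns True for valid values.
    """
    failures = []
    for index, row in enumerate(records):
        for field, is_valid in rules.items():
            if not is_valid(row.get(field)):
                failures.append((index, field))
    return failures


# Illustrative rules; real ones should come from the codebook.
rules = {
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
    "sex": lambda v: v in {"F", "M", "X"},
}

# Made-up sample data for illustration only.
participants = [
    {"age": 34, "sex": "F"},
    {"age": 212, "sex": "M"},      # age out of range
    {"age": 51, "sex": "unknown"}, # category not in the allowed set
]

print(invalid_values(participants, rules))  # [(1, 'age'), (2, 'sex')]
```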

Source: UIUC Research Service Data Nudge 2024-09

Examples of each data quality dimension can be found in DataCamp's "Data Quality Dimensions Cheat Sheet".