Cleaning Data: Garbage In, Garbage Out
Many issues arise in ensuring the sensible analysis of data from the field, including:
- Distinguishing errors from artifacts
- Data compatibility/unification
- Imputation of missing values
- Estimating unobserved (zero) counts
- Outlier detection
Data Compatibility
Data needs to be carefully massaged to make “apples to apples” comparisons:
Unit conversions
NASA’s Mars Climate Orbiter was lost in 1999 because one engineering team worked in English (imperial) units while another worked in metric.
- Even sticking to the metric system has potential inconsistencies: cm, m, km?
- A bimodal distribution can indicate trouble, e.g., two unit systems mixed in a single field.
- Z-scores are dimensionless quantities, so they permit unit-free comparisons (see the sketch below).
Vigilance in data integration is essential.
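As a small illustration, here is a sketch (using NumPy, with made-up height values) of why z-scores sidestep unit mismatches: the same measurements expressed in centimeters and in inches map to identical dimensionless scores.

```python
import numpy as np

def z_scores(values):
    """Convert raw measurements into dimensionless z-scores."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# The same five (made-up) heights, once in centimeters and once in inches.
heights_cm = [170.0, 180.0, 165.0, 175.0, 190.0]
heights_in = [h / 2.54 for h in heights_cm]

print(z_scores(heights_cm))  # identical output for both unit systems
print(z_scores(heights_in))
```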
Number/character code representations
The Ariane 5 rocket exploded in 1996 due to a bad 64-bit float to 16-bit integer conversion.
- Measurements should be represented as real numbers.
- Counts should be integers.
- Fractional quantities should be a single decimal value, not a (q, r) pair like (pounds, oz) or (feet, inches), as sketched below.
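A minimal sketch of collapsing mixed-unit (q, r) pairs into single decimal quantities; the function names and target units (meters, kilograms) are illustrative choices, not prescribed above.

```python
def feet_inches_to_meters(feet, inches):
    """Collapse a (feet, inches) pair into a single decimal quantity in meters."""
    return (feet * 12 + inches) * 0.0254

def pounds_ounces_to_kg(pounds, ounces):
    """Collapse a (pounds, ounces) pair into a single decimal quantity in kilograms."""
    return (pounds * 16 + ounces) * 0.028349523125

print(feet_inches_to_meters(5, 11))  # ~1.8034
print(pounds_ounces_to_kg(8, 4))     # ~3.742
```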
Character Representations
A particularly nasty cleaning issue in textual data is unifying character code representations:
- ISO 8859-1 (Latin-1) is a single-byte encoding extending ASCII
- UTF-8 is a multibyte encoding for all Unicode characters
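A short sketch of the problem and the usual fix (decode with the source encoding, re-encode everything as UTF-8); the example string is invented for illustration.

```python
# The same accented name stored under two different encodings.
name = "José"
latin1_bytes = name.encode("iso-8859-1")  # b'Jos\xe9'      -- one byte per character
utf8_bytes = name.encode("utf-8")         # b'Jos\xc3\xa9'  -- multibyte

# Decoding with the wrong codec silently yields mojibake rather than an error here.
print(utf8_bytes.decode("iso-8859-1"))    # JosÃ©  (garbled)

# Unify on ingest: decode with the source encoding, re-encode everything as UTF-8.
unified = latin1_bytes.decode("iso-8859-1").encode("utf-8")
assert unified == utf8_bytes
```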
Name unification
- Use simple transformations to unify names, like lowercasing and removing middle names.
- Consider phonetic hashing methods like Soundex and Metaphone (sketched after this list).
- There is a tradeoff between false positives and false negatives.
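A rough sketch of both ideas: a crude normalizer (lowercase, strip punctuation, keep only first and last name) and a simplified Soundex hash. Production code would use a tested library implementation of Soundex/Metaphone; the helpers below are illustrative only.

```python
import re

def normalize_name(name):
    """Crude unification: lowercase, strip punctuation, keep first and last name only."""
    tokens = re.sub(r"[^\w\s]", "", name).lower().split()
    if len(tokens) > 2:
        tokens = [tokens[0], tokens[-1]]
    return " ".join(tokens)

def soundex(word):
    """Simplified Soundex phonetic hash (first letter plus three digits)."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    digits = [codes.get(c, "0") for c in word]  # vowels, h, w, y -> "0"
    result, prev = word[0].upper(), digits[0]
    for d in digits[1:]:
        if d != "0" and d != prev:
            result += d
        prev = d
    return (result + "000")[:4]

print(normalize_name("Robert  J. Smith"))  # robert smith
print(soundex("Smith"), soundex("Smyth"))  # S530 S530 -- a phonetic match
```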
Time/date unification
Aligning temporal events from different datasets/systems can be problematic.
- Use Coordinated Universal Time (UTC), a modern standard subsuming GMT.
- Financial time series are tricky because of weekends and holidays: how do you correlate stock prices and temperatures?
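A small sketch, using only the standard library, of normalizing timestamps from two systems to UTC so the records can be aligned; the timestamps and offsets are invented.

```python
from datetime import datetime, timezone, timedelta

# The same instant logged by two systems in different local time zones.
new_york = datetime(2023, 3, 1, 9, 30, tzinfo=timezone(timedelta(hours=-5)))
tokyo = datetime(2023, 3, 1, 23, 30, tzinfo=timezone(timedelta(hours=9)))

# Converting both to UTC makes the records directly comparable.
print(new_york.astimezone(timezone.utc))  # 2023-03-01 14:30:00+00:00
print(tokyo.astimezone(timezone.utc))     # 2023-03-01 14:30:00+00:00
assert new_york == tokyo
```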
Financial unification
- Currency conversion uses exchange rates.
- Correct stock prices for splits and dividends.
- Use returns/percentage change instead of absolute price changes.
- The time value of money needs correction for inflation for fair long-term comparisons.
Why do stock/oil prices correlate over 30 years?
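A minimal sketch of two of these corrections: percentage returns instead of raw price changes, and a CPI-style inflation adjustment. All numbers (prices, CPI levels) are made up for illustration.

```python
import numpy as np

def percent_returns(prices):
    """Convert a price series into period-over-period percentage returns."""
    prices = np.asarray(prices, dtype=float)
    return np.diff(prices) / prices[:-1]

def adjust_for_inflation(amount, cpi_then, cpi_now):
    """Express a historical amount in today's money using price-index levels."""
    return amount * cpi_now / cpi_then

prices = [100.0, 104.0, 101.0, 105.0]            # illustrative, not real data
print(percent_returns(prices))                   # [ 0.04   -0.0288  0.0396]
print(adjust_for_inflation(1000, 172.2, 296.8))  # ~1723.6 (hypothetical CPI levels)
```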
Dealing with Missing Data
An important aspect of data cleaning is properly representing missing data:
- What is the year of death of a living person?
- What about a field left blank or filled with an obviously outlandish value?
- The frequency of events too rare to see?
Setting such values to zero is generally wrong.
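A tiny illustration (with invented ages) of why coding missing values as zero is dangerous: the zeros silently drag summary statistics down, whereas an explicit NaN can be handled properly.

```python
import numpy as np

# Ages with two missing entries: zero vs. NaN behave very differently downstream.
ages_coded_zero = np.array([34.0, 0.0, 51.0, 0.0, 46.0])     # missing coded as 0 (wrong)
ages_coded_nan = np.array([34.0, np.nan, 51.0, np.nan, 46.0])

print(ages_coded_zero.mean())      # 26.2  -- biased sharply toward zero
print(np.nanmean(ages_coded_nan))  # 43.67 -- mean of the observed values only
```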
Imputing Missing Values
With enough training data, one might drop all records with missing values, but we may want to use the model on records with missing fields.
Often it is better to estimate or impute missing values instead of leaving them blank. A good guess for a living person's year of death is their birth year + 80.
Imputation Methods
Mean value imputation - filling in the field's mean value leaves the overall mean unchanged.
Random value imputation - repeatedly selecting random values permits statistical evaluation of the impact of imputation.
Imputation by interpolation - using linear regression to predict missing values works well if few fields are missing per record.
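A compact sketch of the three approaches on a single numeric column. It uses NaN to mark missing entries and simple 1-D interpolation to stand in for the last method; a fuller version would regress the missing field on the record's other fields.

```python
import numpy as np

def impute(values, method="mean", rng=None):
    """Fill NaN entries by mean, random, or interpolation-based imputation."""
    values = np.asarray(values, dtype=float).copy()
    missing = np.isnan(values)
    observed = values[~missing]
    if method == "mean":
        values[missing] = observed.mean()        # leaves the overall mean unchanged
    elif method == "random":
        rng = rng or np.random.default_rng()
        values[missing] = rng.choice(observed, size=missing.sum())
    elif method == "interpolate":
        idx = np.arange(len(values))
        values[missing] = np.interp(idx[missing], idx[~missing], observed)
    return values

x = [10.0, np.nan, 14.0, np.nan, 18.0]
print(impute(x, "mean"))         # [10. 14. 14. 14. 18.]
print(impute(x, "interpolate"))  # [10. 12. 14. 16. 18.]
```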
Outlier Detection
The largest reported dinosaur vertebra is 50% larger than all others: presumably a data error.
- Look critically at the maximum and minimum values for all variables.
- Normally distributed data should not have large outliers: values more than k sigma from the mean are extremely unlikely, even for modest k.
Figure out why you have an outlier and fix the underlying cause; don't just delete the point.
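A minimal k-sigma check of the kind described above, on an invented set of vertebra sizes; the threshold k is a tunable assumption.

```python
import numpy as np

def sigma_outliers(values, k=2.5):
    """Flag values more than k standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > k

vertebrae = np.array([95, 102, 99, 104, 98, 101, 100, 103, 97, 180.0])  # made-up sizes
print(sigma_outliers(vertebrae))  # only the suspicious 180 is flagged
```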
Detecting Outliers
- Visually, it is easy to detect outliers, but only in low dimensional spaces.
- It can be thought of as an unsupervised learning problem, like clustering.
- Points which are far from their cluster center are good candidates for outliers.
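A sketch of the cluster-distance idea using scikit-learn's KMeans on synthetic 2-D points: compute each point's distance to its own cluster center and flag the farthest points as outlier candidates.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two tight synthetic clusters plus one stray point far from both.
points = np.vstack([rng.normal(0, 0.5, (50, 2)),
                    rng.normal(5, 0.5, (50, 2)),
                    [[10.0, 10.0]]])

km = KMeans(n_clusters=2, random_state=0).fit(points)
centers = km.cluster_centers_[km.labels_]   # each point's own cluster center
dist = np.linalg.norm(points - centers, axis=1)

# The points farthest from their cluster centers are outlier candidates.
print(np.argsort(dist)[-3:])  # the stray point (index 100) ranks highest
```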
Delete Outliers Prior to Fitting?
- Deleting outliers prior to fitting can yield better models, e.g., if these points correspond to measurement errors.
- Deleting outliers prior to fitting can yield worse models, e.g., if you are simply deleting points which are not explained by your simple model.