Principles of Visualizing Data

Exploratory Data Analysis

Looking carefully at your data is important:

to identify mistakes in collection/processing
to find violations of statistical assumptions
to observe patterns in the data
to form hypotheses

Feeding unvisualized data to a machine learning algorithm is asking for trouble.

Why Data Visualization?

Exploratory data analysis: what does your data really look like?
Error detection: did you do something stupid?
Presenting what you have learned to others.

A large fraction of the graphs and charts I see are terrible: visualization is harder than it looks.

Anscombe’s Quartet

All four data sets have nearly identical means, variances, correlations, and regression lines, yet they look completely different when plotted.

=> Plot Anscombe’s Quartet
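The claim about matching summary statistics can be checked numerically before plotting. The sketch below uses the quartet's published values and only the Python standard library; the helper name `summarize` is an illustrative choice, not part of the original notes.

```python
from math import sqrt

# Anscombe's four data sets; sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return mean, sample variance, correlation, and least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "corr": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


for name, (xs, ys) in QUARTET.items():
    stats = summarize(xs, ys)
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The printed rows are nearly indistinguishable, which is the point: only a scatter plot of each set reveals how different they are.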

Appreciating Art: Which is Better?

Sensible appreciation of art requires developing a particular visual aesthetic.

Tufte’s Visualization Aesthetic

Distinguishing good/bad visualizations requires a design aesthetic, and a vocabulary to talk about data representations:

Maximize data-ink ratio
Data-Ink Ratio = $\frac{\text{Data ink}}{\text{Total ink used in graphic}}$
Minimize lie factor
Lie Factor = $\frac{\text{Size of effect in graphic}}{\text{Size of effect in data}}$

Scaling a two- or three-dimensional figure by a single parameter yields a lie, because area and volume grow disproportionately with length.
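A small worked example makes the lie factor concrete. The sketch below assumes that "size of effect" means relative change, (second - first) / first, and uses a hypothetical case where a value doubles and is drawn as a circle whose radius is proportional to the value. The function names are illustrative.

```python
from math import pi


def effect_size(first, second):
    """Relative change from first to second (1.0 means a 100% increase)."""
    return (second - first) / first


def lie_factor(data_first, data_second, graphic_first, graphic_second):
    """Size of effect shown in the graphic divided by size of effect in the data."""
    return effect_size(graphic_first, graphic_second) / effect_size(data_first, data_second)


def circle_area(radius):
    return pi * radius ** 2


# Hypothetical data: the value doubles from 10 to 20.
# The radius is scaled with the value, so the ink (area) quadruples.
factor = lie_factor(10, 20, circle_area(10), circle_area(20))
print(f"lie factor = {factor:.2f}")
```

Here the data grows by 100% while the drawn area grows by 300%, so the graphic overstates the effect by a factor of about 3. A lie factor of 1 means the graphic is faithful to the data.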

Graphical Integrity: Scale Distortion

Always start bar graphs at zero.
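To see why the baseline matters, compare the ratio of drawn bar heights with the ratio of the underlying values. The numbers below are hypothetical and the function name is illustrative.

```python
def bar_height_ratio(small, large, baseline=0.0):
    """Ratio of drawn bar heights when the value axis starts at `baseline`."""
    return (large - baseline) / (small - baseline)


# Hypothetical values that differ by 10%.
small, large = 100.0, 110.0

honest = bar_height_ratio(small, large, baseline=0.0)
truncated = bar_height_ratio(small, large, baseline=95.0)

print(f"axis from 0:  taller bar is {honest:.1f}x the shorter one")
print(f"axis from 95: taller bar is {truncated:.1f}x the shorter one")
```

With the axis starting at 95, a 10% difference is drawn as one bar three times the height of the other.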

Always properly label your axes.

Use continuous scales: linear or labelled.

Aspect Ratios and Lie Factors

The steepness of apparent cliffs is a function of aspect ratio. Aim for 45-degree lines or a golden-ratio aspect as most interpretable.
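One way to aim for 45-degree lines is to choose the plot's height-to-width ratio so that the median absolute slope of the line segments is drawn at 45 degrees. The sketch below is one possible heuristic, not necessarily the method the notes have in mind; it assumes the points are ordered by x and that at least one segment is not flat.

```python
from statistics import median


def bank_to_45(xs, ys):
    """Height/width ratio at which the median absolute segment slope appears as 45 degrees."""
    x_range = max(xs) - min(xs)
    y_range = max(ys) - min(ys)
    points = list(zip(xs, ys))
    # Slopes of consecutive segments in data units, skipping vertical segments.
    slopes = [
        abs((y1 - y0) / (x1 - x0))
        for (x0, y0), (x1, y1) in zip(points, points[1:])
        if x1 != x0
    ]
    # A slope s in data units is drawn with slope s * (h / w) * (x_range / y_range).
    # Setting the drawn median slope to 1 and solving for h / w gives:
    return y_range / (x_range * median(slopes))


# Hypothetical zig-zag series: unit steps up and down.
aspect = bank_to_45([0, 1, 2, 3], [0, 1, 0, 1])
print(f"height / width = {aspect:.3f}")
```

For the zig-zag series the result suggests a plot about three times as wide as it is tall, so each segment rises or falls at 45 degrees.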

Minimize chartjunk

Extraneous visual elements distract from the message the data is trying to tell:
- Extra dimensionality
- Uninformative coloring
- Excessive grids and figurative decoration

In an exciting graphic, the data tells the story, not the chartjunk.

Use proper scales and clear labeling.

Which Chart to Use

Tabular Data

Tables can have advantages over plots:

Representation of numerical precision
Understandable multivariate visualization: each column is a different dimension.
Representation of heterogeneous data
Compactness for small numbers of points

Always ask: can this table be improved?

Dimensions for Improvement

Order rows to invite comparisons.
Order rows to highlight importance or pairwise relationships.
Right justify uniform-precision numbers.
Use emphasis, font, or color to highlight important entries.
Avoid excessive-length column descriptors.
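Several of these guidelines can be applied mechanically when a table is rendered as text. The sketch below orders rows by one column, left-justifies text, and right-justifies numbers at a uniform precision. The function name, column headers, and data are all hypothetical.

```python
def format_table(headers, rows, sort_col, precision=1):
    """Return text lines: rows sorted descending by `sort_col`, numbers right-justified."""
    ordered = sorted(rows, key=lambda row: row[sort_col], reverse=True)
    # A column counts as numeric if its first entry is a number.
    numeric = [isinstance(value, (int, float)) for value in rows[0]]
    # Floats get a uniform number of decimals; everything else is shown as-is.
    cells = [
        [f"{value:.{precision}f}" if isinstance(value, float) else str(value) for value in row]
        for row in ordered
    ]
    widths = [
        max(len(header), max(len(row[i]) for row in cells))
        for i, header in enumerate(headers)
    ]

    def line(values):
        parts = [
            value.rjust(width) if is_num else value.ljust(width)
            for value, width, is_num in zip(values, widths, numeric)
        ]
        return "  ".join(parts).rstrip()

    return [line(headers)] + [line(row) for row in cells]


# Hypothetical data, ordered by the "Score" column to invite comparison.
table = format_table(
    ["Item", "Score", "Count"],
    [("Alpha", 3.4, 120), ("Beta", 12.5, 45), ("Gamma", 0.8, 300)],
    sort_col=1,
)
print("\n".join(table))
```

Right-justifying numbers with the same number of decimals lines up the decimal points, so magnitudes can be compared at a glance.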

Line Charts

Show data points, not just fits.
Line segments imply connections between points, so do not use them for categorical data.
Connecting points by lines is often chartjunk. Better is usually a trend line or fit with the data points.

Scatter Plots/Multivariate Data

Scatter plots show the values of each point, and are a great way to present 2D data sets.

Higher-dimensional data sets are often best projected to 2D, through self-organizing maps or principal component analysis, although they can also be represented through bubble plots.

Reduce Overplotting by Small Points

Heatmaps Reveal Finer Structure

Color points on the basis of frequency
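The frequencies that drive such a heatmap come from binning the points on a grid and counting how many land in each cell. A minimal sketch, assuming square cells and using hypothetical points and an illustrative function name:

```python
from collections import Counter


def bin_counts(points, cell_size):
    """Count how many (x, y) points fall in each square grid cell."""
    return Counter(
        (int(x // cell_size), int(y // cell_size))
        for x, y in points
    )


# Hypothetical points: three crowd into one cell, one sits alone.
points = [(0.1, 0.1), (0.2, 0.3), (0.9, 0.9), (1.5, 0.2)]
counts = bin_counts(points, cell_size=1.0)

for cell, count in sorted(counts.items()):
    print(cell, count)
```

Each cell's count is then mapped to a color, so dense regions stand out even where individual points would overlap.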

Bubble Charts for Extra Dimensions

Using color, shape, size, and shading of ‘dots’ enables dot plots to represent additional dimensions.
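When size encodes a value, a common choice is to make the dot's area, not its radius, proportional to the value, so that the bubble does not exaggerate differences. A sketch under that assumption, with an illustrative function name and hypothetical numbers:

```python
from math import sqrt


def bubble_radius(value, max_value, max_radius):
    """Radius that makes the bubble's area proportional to `value`."""
    return max_radius * sqrt(value / max_value)


# Hypothetical: a value one quarter of the maximum.
radius = bubble_radius(25, max_value=100, max_radius=10)
print(f"radius = {radius}")
```

The value is a quarter of the maximum, so the bubble gets half the maximum radius and therefore a quarter of the maximum area.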

Bar Plots vs. Pie Charts

Bar plots show the frequency or proportion of categorical variables. Pie charts use more space and are harder to read and compare.

Partitioning each bar into pieces yields the stacked bar chart.
Pie charts are arguably better for showing percentages of totality, and people do seem to like them, so they may be harmless in small amounts.
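The quantities behind either chart are the same: a count or proportion per category. The sketch below computes proportions from hypothetical labels and renders them as a plain-text bar plot; the function names are illustrative.

```python
from collections import Counter


def proportions(labels):
    """Map each category to its share of the total, most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


def text_bars(props, width=40):
    """Render one text bar per category; bar length is proportional to its share."""
    return [
        f"{label:<8}{'#' * round(share * width)} {share:.0%}"
        for label, share in props.items()
    ]


# Hypothetical categorical data.
labels = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
props = proportions(labels)
bars = text_bars(props)
print("\n".join(bars))
```

Because the bars share a common baseline, their lengths can be compared directly, which is harder to do with the angles of pie slices.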

Histograms

Histograms: Bin Size/ Count Matters

Frequency vs. Density Histograms

Box and Whisker Plots