Exploratory Data Analysis

Jon Reades

Epicyclic Feedback

Peng and Matsui, The Art of Data Science, p.8

Stage | Set Expectations | Collect Information | Revise Expectations
Question | Question is of interest to audience | Literature search/experts | Sharpen question
EDA | Data are appropriate for question | Make exploratory plots | Refine question or collect more data
Modelling | Primary model answers question | Fit secondary models / analysis | Revise model to include more predictors
Interpretation | Interpretation provides specific and meaningful answer | Interpret analyses with focus on effect and uncertainty | Revise EDA and/or models to provide more specific answers
Communication | Process & results are complete and meaningful | Seek feedback | Revise analyses or approach to presentation

Approaching EDA

There’s no hard and fast way of doing EDA, but as a general rule you’re looking to:

  • Clean
  • Canonicalise
  • Clean More
  • Visualise & Describe
  • Review
  • Clean Some More

The ‘joke’ is that 80% of Data Science is data cleaning.

Another Take

Here’s another view of how to do EDA:

  1. Preview data randomly and substantially
  2. Check totals such as number of entries and column types
  3. Check nulls such as at row and column levels
  4. Check duplicates: do IDs recur, did the servers fail
  5. Plot distribution of numeric data (univariate and pairwise joint distribution)
  6. Plot count distribution of categorical data
  7. Analyse time series of numeric data by daily, monthly and yearly frequencies
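
The checklist above can be sketched in pandas roughly as follows. This is only an illustration: the `demo` data frame and its values are made up, and the two plotting steps are replaced by numeric summaries so that the snippet runs without a display.

```python
import pandas as pd

# Made-up listings, for illustration only (not the Inside Airbnb data)
demo = pd.DataFrame({
    "id": [1, 2, 2, 3, 4, 5],
    "room_type": ["Entire home/apt", "Private room", "Private room",
                  "Private room", None, "Shared room"],
    "price": [120.0, 45.0, 45.0, None, 80.0, 30.0],
    "date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-01-20",
                            "2023-02-11", "2023-02-14", "2023-03-02"]),
})

# 1. Preview a random sample rather than only the first rows
print(demo.sample(3, random_state=42))

# 2. Totals: number of entries and column types
n_rows, n_cols = demo.shape
print(demo.dtypes)

# 3. Nulls at column level and at row level
nulls_by_column = demo.isna().sum()
rows_with_nulls = int(demo.isna().any(axis=1).sum())

# 4. Duplicates: do IDs recur?
repeated_ids = int(demo["id"].duplicated().sum())

# 5. Distribution of numeric data (summary in place of a plot)
print(demo["price"].describe())

# 6. Count distribution of categorical data
room_counts = demo["room_type"].value_counts()

# 7. Time series of numeric data at monthly frequency
monthly_mean_price = demo.set_index("date")["price"].resample("MS").mean()
print(monthly_mean_price)
```

On real data, steps 5 to 7 would normally end in a chart rather than a printed summary.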

Signal & Noise

What is it?

Start with a Chart

The problem of relying on statistics alone was amply illustrated by Anscombe’s Quartet (1973)…

  • We are not very good at looking at spreadsheets.
  • We are very good at spotting patterns visually.

Sometimes, we are too good; that’s where the stats come in. Think of it as the ‘tiger in the jungle’ problem.

Anscombe’s Quartet

X1 Y1 X2 Y2 X3 Y3 X4 Y4
10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58
8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76
13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71
9.0 8.81 9.0 8.77 9.0 7.11 8.0 8.84
11.0 8.33 11.0 9.26 11.0 7.81 8.0 8.47
14.0 9.96 14.0 8.10 14.0 8.84 8.0 7.04
6.0 7.24 6.0 6.13 6.0 6.08 8.0 5.25
4.0 4.26 4.0 3.10 4.0 5.39 19.0 12.50
12.0 10.84 12.0 9.13 12.0 8.15 8.0 5.56
7.0 4.82 7.0 7.26 7.0 6.42 8.0 7.91
5.0 5.68 5.0 4.74 5.0 5.73 8.0 6.89

Summary Statistics for the Quartet

Property Value
Mean of x 9.0
Variance of x 11.0
Mean of y 7.5
Variance of y 4.12
Correlation between x and y 0.816
Linear Model y = 3 + 0.5x

But What do They Look Like?

The Tiger that Isn’t

I would argue that the basic purpose of charts and of statistics as a whole is to help us untangle signal from noise. We are ‘programmed’ to see signals, so we need to set the standard for ‘it’s a tiger!’ quite high in research & in policy-making.

Think it Through

You can make a lot of progress in your research without any advanced statistics!

  • A ‘picture’ isn’t just worth 1,000 words, it could be a whole dissertation!
  • The right chart makes your case eloquently and succinctly.

Always ask yourself:

  • What am I trying to say?
  • How can I say it most effectively?
  • Is there anything I’m overlooking in the data?

A good chart is a good way to start!

What Makes a Good Plot?

A good chart or table:

  1. Serves a purpose — it is clear how it advances the argument in a way that could not be done in the text alone.
  2. Contains only what is relevant — zeroes in on what the reader needs and is not needlessly cluttered.
  3. Uses precision that is meaningful — doesn’t clutter the chart with needless numbers.

For Example…

How much precision is necessary in measuring degrees at the equator?

Decimal Places Degrees Distance
0 1 111km
1 0.1 11.1km
2 0.01 1.11km
3 0.001 111m
4 0.0001 11.1m
5 0.00001 1.11m
6 0.000001 11.1cm
7 0.0000001 1.11cm
8 0.00000001 1.11mm
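
The table can be reproduced with a few lines of arithmetic. This sketch assumes, as the table does, that one degree at the equator spans roughly 111 km; the function name is hypothetical.

```python
# Assumption: one degree at the equator is roughly 111 km (111,000 m)
METRES_PER_DEGREE = 111_000


def precision_in_metres(decimal_places: int) -> float:
    """Approximate ground distance represented by the last decimal place."""
    return METRES_PER_DEGREE * 10 ** -decimal_places


for places in range(9):
    print(f"{places} decimal places: about {precision_in_metres(places):,.5f} m")
```

So a coordinate reported to eight decimal places claims millimetre precision, which very few data sources can actually justify.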

Goals by World Cup Final

Average Goals by World Cup Final

How far from Equality?

The Purpose of a Chart

The purpose of a graph is to show that there are relationships within the data set that are not trivial/expected.

Choose the chart to highlight relationships, or the lack thereof:

  • Think of a chart or table as part of your ‘argument’ – if you can’t tell me how a figure advances your argument (or if your explanation is more concise than the figure) then you probably don’t need it.
  • Identify & prioritise the relationships in the data.
  • Choose a chart type/chart symbology that gives emphasis to the most important relationships.

If a picture is worth 1,000 words, make sure those words aren’t “blah, blah, blah…”

Beyond the Chart

Not Everyone Likes Tables

Getting information from a table is like extracting sunlight from a cucumber. (Arthur & Henry Farquhar, 1891)

Real Numbers

Consider the difference in emphasis between:

  • 11316149
  • 11,316,149
  • 11.3 million
  • \(11 \times 10^{6}\)
  • 22%
  • 22.2559%

Always keep in mind the purpose of the number.
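
In Python, most of these presentations are one format specifier away. A small sketch using f-strings; the variable names are hypothetical, and the scientific form shown is Python's default rendering rather than the exact one in the list above.

```python
value = 11316149
share = 0.222559

exact = f"{value:,}"                      # thousands separators
rounded = f"{value / 1e6:.1f} million"    # scaled and rounded
scientific = f"{value:.2e}"               # scientific notation
headline_share = f"{share:.0%}"           # whole-number percentage
precise_share = f"{share:.4%}"            # four decimal places

for text in (exact, rounded, scientific, headline_share, precise_share):
    print(text)
```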

There’s Still a Role for Tables

Why a table is sometimes better than a chart:

  • You need to present data values with greater detail
  • You need to enable readers to draw comparisons between data values
  • You need to present the same data in multiple ways (e.g. raw number and percentage)
  • You want to show many dimensions for a small number of observations

Undergraduate Tables (Failing Grade)

Undergraduate Tables (Passing Grade)

Postgraduate Tables (Failing Grade)

Postgraduate Tables (Passing Grade)

Design for Tables

Principles:

  • Reduce the number of lines to a minimum (and you should almost never need vertical lines).
  • Use ‘white-space’ to create visual space between groups of unrelated (or less related) elements.
  • Remove redundancy (if you find yourself typing ‘millions’ or ‘GBP’ or ‘Male’ repeatedly then you’ve got redundancy).
  • Ensure that meta-data is clearly separate from, but attached to, the table (i.e. source, title, etc.).

In Practice

Getting Started

You can follow along by loading the Inside Airbnb sample:

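import pandas as pd
import geopandas as gpd

url = 'https://bit.ly/3I0XDrq'
df = pd.read_csv(url)
df.set_index('id', inplace=True)
# Strip the currency symbol (and any thousands separators) before converting
df['price'] = (df.price.str.replace('$', '', regex=False)
                       .str.replace(',', '', regex=False)
                       .astype('float'))
# Build point geometries from the coordinate columns (WGS84)
gdf = gpd.GeoDataFrame(df,
            geometry=gpd.points_from_xy(
                        df['longitude'],
                        df['latitude'],
                        crs='epsg:4326'
            )
      )
# Save a copy as a GeoPackage
gdf.to_file('Airbnb_Sample.gpkg', driver='GPKG')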

What Can We Do? (Series)

This is by no means all that we can do…

Series-level Methods.
Command | Returns
print(f"Host count is {gdf.host_name.count()}") | Count of non-nulls
print(f"Mean is {gdf.price.mean():.0f}") | Mean
print(f"Max price is {gdf.price.max()}") | Highest value
print(f"Min price is {gdf.price.min()}") | Lowest value
print(f"Median price is {gdf.price.median()}") | Median
print(f"Standard dev is {gdf.price.std():.2f}") | Standard deviation
print(f"25th quantile is {gdf.price.quantile(q=0.25)}") | 25th quantile

What Can We Do? (Data Frame)

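Command | Returns
print(df.mean(numeric_only=True)) | Mean of each numeric column
print(df.count()) | Number of non-null values in each column
print(df.max()) | Highest value in each column
# ... | $\vdots$
print(df.corr(numeric_only=True)) | Correlation between numeric columns
print(df.describe()) | Summarise

From pandas 2.0 onwards, mean() and corr() raise an error on a data frame that contains text columns unless numeric_only=True is set.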

Measures

So pandas provides functions for commonly-used measures:

print(f"{df.price.mean():.2f}")
print(f"{df.price.median():.2f}")
print(f"{df.price.quantile(0.25):.2f}")

Output:

118.45
80.50
40.75

More Complex Measures

But pandas also makes it easy to derive new variables… Here’s the z-score:

\[ z = \frac{x - \bar{x}}{s}\]

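# Standardise price: subtract the mean, divide by the standard deviation
df['zscore'] = (df.price - df.price.mean())/df.price.std()
# Box plot of the new column only
df['zscore'].plot.box()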

And Even More Complex

And here’s the Interquartile Range Standardised score:

\[ x_{iqrs} = \frac{x - \widetilde{x}}{Q_{75} - Q_{25}} \]

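# Interquartile range: distance between the 75th and 25th percentiles
iqr = df.price.quantile(q=0.75) - df.price.quantile(q=0.25)
df['iqr_std'] = (df.price - df.price.median()) / iqr
# Box plot of the new column only
df['iqr_std'].plot.box()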
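
To see why you might prefer one standardisation over the other, here is a small sketch on made-up prices containing one extreme value. The numbers are invented for illustration and are not from the Airbnb sample.

```python
import pandas as pd

# Made-up prices with one extreme value, for illustration only
price = pd.Series([30, 40, 50, 60, 70, 80, 90, 100, 1000], dtype="float")

# z-score: centre on the mean, scale by the standard deviation
zscore = (price - price.mean()) / price.std()

# IQR-standardised score: centre on the median, scale by the interquartile range
iqr = price.quantile(0.75) - price.quantile(0.25)
iqr_std = (price - price.median()) / iqr

print(pd.DataFrame({"price": price, "zscore": zscore, "iqr_std": iqr_std}).round(2))
```

The extreme value inflates both the mean and the standard deviation, so its z-score stays below 3 even though it is ten times the next-highest price. The median and IQR are barely affected by it, so the IQR-standardised score flags it much more clearly.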

The Plot Thickens

We’ll get to more complex plotting over the course of the term, but here’s a good start for exploring the data! All plotting depends on matplotlib which is the ogre in the attic to R’s ggplot.

import matplotlib.pyplot as plt

Get used to this import as it will allow you to save and manipulate the figures created in Python. It is not the most intuitive approach (unless you’ve used MATLAB before) but it does work.

Confession Time: I do like ggplot, and sometimes I even finish off graphics for articles in R just so that I can use it. It is possible to generate great-looking figures in matplotlib, but it is often more work because it’s a lot less intuitive.

Boxplot

df.price.plot.box()
plt.savefig('pboxplot.png', dpi=150, transparent=True)

Frequency

df.room_type.value_counts().plot.bar()
plt.savefig('phistplot.png', dpi=150, transparent=True)

A Correlation Heatmap

We’ll get to these in more detail in a couple of weeks, but here’s some output…
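
In the meantime, here is one way such a heatmap could be produced with pandas and matplotlib alone. It is a sketch on randomly generated data, not the code behind the slide: the `demo` data frame, its column names and the styling choices are all assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data: price rises with the number of beds
rng = np.random.default_rng(42)
n = 200
beds = rng.integers(1, 6, size=n)
demo = pd.DataFrame({
    "beds": beds,
    "price": 40 * beds + rng.normal(0, 20, size=n),
    "reviews": rng.integers(0, 100, size=n),
})

# Pairwise Pearson correlations between the numeric columns
corr = demo.corr(numeric_only=True)

# Draw the matrix as an image, one coloured cell per pair of columns
fig, ax = plt.subplots(figsize=(4, 4))
im = ax.imshow(corr.to_numpy(), cmap="viridis", vmin=-1, vmax=1)
ticks = list(range(len(corr.columns)))
ax.set_xticks(ticks)
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(ticks)
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax, label="Pearson correlation")
fig.tight_layout()
```

Fixing the colour scale to the range -1 to 1 keeps the colours comparable between heatmaps; the figure can then be saved with `fig.savefig(...)` in the same way as the other plots.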

A ‘Map’

df.plot.scatter(x='longitude',y='latitude')
plt.savefig('pscatterplot.png', dpi=150, transparent=True)

A Fancy ‘Map’

df.plot.scatter(x='longitude',y='latitude',
                c='price',colormap='viridis',
                figsize=(10,5),title='London',
                grid=True,s=24,marker='x')
plt.savefig('pscatterplot.png', dpi=150, transparent=True)

An Actual ‘Map’

# scheme= classifies the values into bins and requires the mapclassify package
gdf.plot(column='price', cmap='viridis', 
         scheme='quantiles', markersize=8, legend=True)

Resources

There’s so much more to find, but: