Exploratory Data Analysis

Jon Reades

Epicyclic Feedback

Peng and Matsui, The Art of Data Science, p.8

	Set Expectations	Collect Information	Revise Expectations
Question	Question is of interest to audience	Literature search/experts	Sharpen question
EDA	Data are appropriate for question	Make exploratory plots	Refine question or collect more data
Modelling	Primary model answers question	Fit secondary models / analysis	Revise model to include more predictors
Interpretation	Interpretation provides specific and meaningful answer	Interpret analyses with focus on effect and uncertainty	Revise EDA and/or models to provide more specific answers
Communication	Process & results are complete and meaningful	Seek feedback	Revise analyses or approach to presentation

Approaching EDA

There’s no hard and fast way of doing EDA, but as a general rule you’re looking to:

Clean
Canonicalise
Clean More
Visualise & Describe
Review
Clean Some More
…

The ‘joke’ is that 80% of Data Science is data cleaning.

Another Take

Here’s another view of how to do EDA:

Preview data randomly and substantially
Check totals such as number of entries and column types
Check nulls such as at row and column levels
Check duplicates: do IDs recur? Did the servers fail?
Plot distribution of numeric data (univariate and pairwise joint distribution)
Plot count distribution of categorical data
Analyse time series of numeric data by daily, monthly and yearly frequencies
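
As a rough sketch, most of the checks in the list above map onto one-liners in pandas. The `listings` frame below, including its column names and values, is invented purely for illustration and is not the Inside Airbnb data.

```python
import pandas as pd

# Hypothetical toy data: column names and values are invented for illustration
listings = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "room_type": ["Entire home", "Private room", "Private room", None, "Entire home"],
    "price": [120.0, 45.0, 45.0, 80.0, None],
    "date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-01-20",
                            "2023-02-11", "2023-03-02"]),
})

# Preview a random sample rather than just the first few rows
print(listings.sample(3, random_state=42))

# Totals: number of entries and column types
n_rows, n_cols = listings.shape
print(listings.dtypes)

# Nulls at column level and at row level
nulls_by_column = listings.isnull().sum()
rows_with_nulls = listings.isnull().any(axis=1).sum()

# Duplicates: do IDs recur?
duplicate_ids = listings["id"].duplicated().sum()

# Count distribution of categorical data
room_counts = listings["room_type"].value_counts()

# Numeric data summarised at a monthly frequency
monthly_mean_price = (
    listings.groupby(listings["date"].dt.to_period("M"))["price"].mean()
)

print(nulls_by_column, rows_with_nulls, duplicate_ids, sep="\n")
print(room_counts)
print(monthly_mean_price)
```

The distribution plots in the list are left out here; with real data they would typically follow from `listings["price"].plot.hist()` or similar.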

Signal & Noise

What is it?

Start with a Chart

The problem of relying on statistics alone was amply illustrated by Anscombe’s Quartet (1973)…

We are not very good at looking at spreadsheets.
We are very good at spotting patterns visually.

Sometimes, we are too good; that’s where the stats come in. Think of it as the ‘tiger in the jungle’ problem.

Anscombe’s Quartet

X1	Y1	X2	Y2	X3	Y3	X4	Y4
10.0	8.04	10.0	9.14	10.0	7.46	8.0	6.58
8.0	6.95	8.0	8.14	8.0	6.77	8.0	5.76
13.0	7.58	13.0	8.74	13.0	12.74	8.0	7.71
9.0	8.81	9.0	8.77	9.0	7.11	8.0	8.84
11.0	8.33	11.0	9.26	11.0	7.81	8.0	8.47
14.0	9.96	14.0	8.10	14.0	8.84	8.0	7.04
6.0	7.24	6.0	6.13	6.0	6.08	8.0	5.25
4.0	4.26	4.0	3.10	4.0	5.39	19.0	12.50
12.0	10.84	12.0	9.13	12.0	8.15	8.0	5.56
7.0	4.82	7.0	7.26	7.0	6.42	8.0	7.91
5.0	5.68	5.0	4.74	5.0	5.73	8.0	6.89

Summary Statistics for the Quartet

Property	Value
Mean of `x`	9.0
Variance of `x`	11.0
Mean of `y`	7.5
Variance of `y`	4.125 (±0.003)
Correlation between `x` and `y`	0.816
Linear Model	`y = 3 + 0.5x`
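
These figures can be checked with nothing more than the standard library. The sketch below uses the values as published by Anscombe (1973), in which the fourth `x` series is 8 throughout apart from a single 19; the helper name `summarise` is just for illustration.

```python
from statistics import mean, variance

x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarise(x, y):
    """Return the summary statistics and least-squares line for one dataset."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "var_x": variance(x),          # sample variance (n - 1 denominator)
        "mean_y": my,
        "var_y": variance(y),
        "r": sxy / (sxx * syy) ** 0.5,  # Pearson correlation
        "slope": slope,
        "intercept": my - slope * mx,
    }

summaries = {name: summarise(x, y) for name, (x, y) in quartet.items()}
for name, s in summaries.items():
    print(f"{name:>3}: mean_y={s['mean_y']:.2f} var_y={s['var_y']:.3f} "
          f"r={s['r']:.3f} y = {s['intercept']:.2f} + {s['slope']:.3f}x")
```

All four datasets agree to two or three decimal places on every statistic, which is exactly why the numbers alone cannot tell them apart.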

But What do They Look Like?

The Tiger that Isn’t

I would argue that the basic purpose of charts and of statistics as a whole is to help us untangle signal from noise. We are ‘programmed’ to see signals, so we need to set the standard for ‘it’s a tiger!’ quite high in research & in policy-making.

Think it Through

You can make a lot of progress in your research without any advanced statistics!

A ‘picture’ isn’t just worth 1,000 words, it could be a whole dissertation!
The right chart makes your case eloquently and succinctly.

Always ask yourself:

What am I trying to say?
How can I say it most effectively?
Is there anything I’m overlooking in the data?

A good chart is a good way to start!

What Makes a Good Plot?

A good chart or table:

Serves a purpose — it is clear how it advances the argument in a way that could not be done in the text alone.
Contains only what is relevant — zeroes in on what the reader needs and is not needlessly cluttered.
Uses precision that is meaningful — doesn’t clutter the chart with needless numbers.

For Example…

How much precision is necessary in measuring degrees at the equator?

Decimal Places	Degrees	Distance
0	1	111km
1	0.1	11.1km
2	0.01	1.11km
3	0.001	111m
4	0.0001	11.1m
5	0.00001	1.11m
6	0.000001	11.1cm
7	0.0000001	1.11cm
8	0.00000001	1.11mm
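
The table is simple arithmetic: each extra decimal place divides the distance by ten. A minimal sketch, assuming the same rounded figure of 111km per degree at the equator (the function name is illustrative):

```python
KM_PER_DEGREE = 111.0  # approximate length of one degree at the equator, as in the table

def ground_distance_m(decimal_places: int) -> float:
    """Approximate distance in metres represented by the last decimal place."""
    return KM_PER_DEGREE * 1000 * 10 ** -decimal_places

for places in range(9):
    print(f"{places} decimal places ~ {ground_distance_m(places):,.5f} m")
```

So quoting coordinates for a building to eight decimal places implies millimetre accuracy that the data almost certainly does not have.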

Goals by World Cup Final

Average Goals by World Cup Final

How far from Equality?

The Purpose of a Chart

The purpose of a graph is to show that there are relationships within the data set that are not trivial/expected.

Choose the chart to highlight relationships, or the lack thereof:

Think of a chart or table as part of your ‘argument’ – if you can’t tell me how a figure advances your argument (or if your explanation is more concise than the figure) then you probably don’t need it.
Identify & prioritise the relationships in the data.
Choose a chart type/chart symbology that gives emphasis to the most important relationships.

If a picture is worth 1,000 words, make sure those words aren’t “blah, blah, blah…”

Beyond the Chart

Not Everyone Likes Tables

Getting information from a table is like extracting sunlight from a cucumber. (Arthur & Henry Farquhar, 1891)

Real Numbers

Consider the difference in emphasis between:

11316149
11,316,149
11.3 million
11 x 10\(^{6}\)
22%
22.2559%

Always keep in mind the purpose of the number.
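
In Python most of these presentations are a format specifier away. A small sketch using f-strings, with the variable names `value` and `share` chosen only for illustration:

```python
value = 11316149   # the raw count from the list above
share = 0.222559   # the proportion behind the two percentages

formats = {
    "raw": f"{value}",
    "thousands": f"{value:,}",
    "millions": f"{value / 1e6:.1f} million",
    "percent": f"{share:.0%}",
    "percent_4dp": f"{share:.4%}",
}

for label, text in formats.items():
    print(f"{label:>12}: {text}")
```

The underlying number never changes; only the emphasis and the implied precision do.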

There’s Still a Role for Tables

Why a table is sometimes better than a chart:

You need to present data values with greater detail
You need to enable readers to draw comparisons between data values
You need to present the same data in multiple ways (e.g. raw number and percentage)
You want to show many dimensions for a small number of observations

Undergraduate Tables (Failing Grade)

Undergraduate Tables (Passing Grade)

Postgraduate Tables (Failing Grade)

Postgraduate Tables (Passing Grade)

Design for Tables

Principles:

Reduce the number of lines to a minimum (and you should almost never need vertical lines).
Use ‘white-space’ to create visual space between groups of unrelated (or less related) elements.
Remove redundancy (if you find yourself typing ‘millions’ or ‘GBP’ or ‘Male’ repeatedly then you’ve got redundancy).
Ensure that meta-data is clearly separate from, but attached to, the table (i.e. source, title, etc.).

In Practice

Getting Started

You can follow along by loading the Inside Airbnb sample:

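import pandas as pd
import geopandas as gpd

# Load the Inside Airbnb sample and index it by listing id
url = 'https://bit.ly/3I0XDrq'
df = pd.read_csv(url)
df.set_index('id', inplace=True)

# Prices are stored as strings with a '$' prefix; also strip any
# thousands separators (e.g. '1,200.00') in case they are present
df['price'] = (df.price.str.replace('$', '', regex=False)
                       .str.replace(',', '', regex=False)
                       .astype('float'))

# Build point geometries from the longitude/latitude columns (WGS84)
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df['longitude'], df['latitude'], crs='epsg:4326')
)
gdf.to_file('Airbnb_Sample.gpkg', driver='GPKG')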

What Can We Do? (Series)

This is by no means all that we can do…

Series-level Methods.
Command	Returns
`print(f"Host count is {gdf.host_name.count()}")`	Count of non-nulls
`print(f"Mean is {gdf.price.mean():.0f}")`	Mean
`print(f"Max price is {gdf.price.max()}")`	Highest value
`print(f"Min price is {gdf.price.min()}")`	Lowest value
`print(f"Median price is {gdf.price.median()}")`	Median
`print(f"Standard dev is {gdf.price.std():.2f}")`	Standard deviation
`print(f"25th percentile is {gdf.price.quantile(q=0.25)}")`	25th percentile

What Can We Do? (Data Frame)

Command	Returns
`print(df.mean(numeric_only=True))`	Mean of each numeric column
`print(df.count())`	Number of non-null values in each column
`print(df.max())`	Highest value in each column
`# ...`	…
`print(df.corr(numeric_only=True))`	Correlation between numeric columns
`print(df.describe())`	Summary statistics for each numeric column

Because the data frame also holds text columns, recent versions of pandas need `numeric_only=True` for `mean()` and `corr()`; without it they raise an error rather than silently skipping the non-numeric columns.

Measures

So pandas provides functions for commonly-used measures:

print(f"{df.price.mean():.2f}")
print(f"{df.price.median():.2f}")
print(f"{df.price.quantile(0.25):.2f}")

Output:

118.45
80.50
40.75

More Complex Measures

But Pandas also makes it easy to derive new variables… Here’s the z-score:

\[ z = \frac{x - \bar{x}}{s}\]

df['zscore'] = (df.price - df.price.mean())/df.price.std()
df.plot.box(column='zscore')
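
To see what the transformation does, here is a minimal sketch on five invented prices (not taken from the Airbnb sample): after standardising, the mean is 0 and the standard deviation is 1, and the one expensive listing stands out as the largest score.

```python
import pandas as pd

# Invented prices for illustration; the last one is an obvious outlier
prices = pd.Series([40.0, 60.0, 80.0, 100.0, 720.0], name="price")

# Same formula as above: subtract the mean, divide by the sample standard deviation
zscore = (prices - prices.mean()) / prices.std()

print(zscore.round(3).tolist())
print(f"mean={zscore.mean():.3f}, std={zscore.std():.3f}")
```

Note that pandas uses the sample standard deviation (n - 1 in the denominator) by default, which matches the `s` in the formula.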

And Even More Complex

And here’s the Interquartile Range Standardised score:

\[ x_{iqrs} = \frac{x - \widetilde{x}}{Q_{75} - Q_{25}} \]

df['iqr_std'] = (df.price - df.price.median())/ \
      (df.price.quantile(q=0.75)-df.price.quantile(q=0.25))
df.plot.box(column='iqr_std')
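
One reason to prefer this over the z-score is robustness: the median and quartiles barely move when an outlier gets more extreme, whereas the mean and standard deviation do. A small sketch on invented prices, with hypothetical helper names:

```python
import pandas as pd

def iqr_standardise(s: pd.Series) -> pd.Series:
    """Centre on the median and scale by the interquartile range."""
    iqr = s.quantile(q=0.75) - s.quantile(q=0.25)
    return (s - s.median()) / iqr

def zscore(s: pd.Series) -> pd.Series:
    """Centre on the mean and scale by the sample standard deviation."""
    return (s - s.mean()) / s.std()

# Invented prices: identical except for how extreme the outlier is
prices = pd.Series([40.0, 60.0, 80.0, 100.0, 720.0])
extreme = pd.Series([40.0, 60.0, 80.0, 100.0, 7200.0])

iqr_a, iqr_b = iqr_standardise(prices), iqr_standardise(extreme)
print(iqr_a.tolist())  # [-1.0, -0.5, 0.0, 0.5, 16.0]

# The four 'ordinary' listings keep the same IQR-standardised scores...
print(iqr_a.iloc[:4].equals(iqr_b.iloc[:4]))                    # True
# ...but their z-scores shift because the mean and std both change
print(zscore(prices).iloc[:4].equals(zscore(extreme).iloc[:4]))  # False
```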

The Plot Thickens

We’ll get to more complex plotting over the course of the term, but here’s a good start for exploring the data! All plotting depends on matplotlib which is the ogre in the attic to R’s ggplot.

import matplotlib.pyplot as plt

Get used to this import as it will allow you to save and manipulate the figures created in Python. It is not the most intuitive approach (unless you’ve used MATLAB before) but it does work.

Confession Time: I do like `ggplot` and sometimes even finish off graphics for articles in R just so that I can use `ggplot`; however, it is possible to generate great-looking figures in `matplotlib` but it is often more work because it’s a lot less intuitive.

Boxplot

df.price.plot.box()
plt.savefig('pboxplot.png', dpi=150, transparent=True)

Frequency

df.room_type.value_counts().plot.bar()
plt.savefig('phistplot.png', dpi=150, transparent=True)

A Correlation Heatmap

We’ll get to these in more detail in a couple of weeks, but here’s some output…
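
In the meantime, one possible way to produce such a heatmap with only pandas and matplotlib is sketched below. The `toy` data frame, its columns, and the output filename are invented for illustration; with the Airbnb sample you would start from `df.corr(numeric_only=True)` instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: price is constructed to rise with the number of beds
rng = np.random.default_rng(42)
n = 200
beds = rng.integers(1, 6, size=n)
toy = pd.DataFrame({
    "beds": beds,
    "price": 40 * beds + rng.normal(0, 20, size=n),
    "reviews": rng.integers(0, 100, size=n),
    "room_type": rng.choice(["Entire home", "Private room"], size=n),
})

# Pairwise Pearson correlations, ignoring the text column
corr = toy.corr(numeric_only=True)

fig, ax = plt.subplots(figsize=(4, 4))
im = ax.imshow(corr.to_numpy(), cmap="viridis", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax, label="Pearson correlation")
fig.tight_layout()
fig.savefig("pcorrplot.png", dpi=150, transparent=True)
```

Fixing the colour scale to the range -1 to 1 keeps heatmaps comparable across datasets.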

A ‘Map’

df.plot.scatter(x='longitude',y='latitude')
plt.savefig('pscatterplot.png', dpi=150, transparent=True)

A Fancy ‘Map’

df.plot.scatter(x='longitude',y='latitude',
                c='price',colormap='viridis',
                figsize=(10,5),title='London',
                grid=True,s=24,marker='x')
plt.savefig('pscatterplot.png', dpi=150, transparent=True)

An Actual ‘Map’

# scheme='quantiles' needs the optional mapclassify package to be installed
gdf.plot(column='price', cmap='viridis',
         scheme='quantiles', markersize=8, legend=True)

Resources

There’s so much more to find, but:

Pandas Reference
A Guide to EDA in Python (Looks very promising)
EDA with Pandas on Kaggle
EDA Visualisation using Pandas
Python EDA Analysis Tutorial
Better EDA with Pandas Profiling [Requires module installation]
EDA: DataPrep.eda vs Pandas-Profiling [Requires module installation]
