Peng and Matsui, The Art of Data Science, p.8
 | Set Expectations | Collect Information | Revise Expectations |
---|---|---|---|
Question | Question is of interest to audience | Literature search / experts | Sharpen question |
EDA | Data are appropriate for question | Make exploratory plots | Refine question or collect more data |
Modelling | Primary model answers question | Fit secondary models / analyses | Revise model to include more predictors |
Interpretation | Interpretation provides specific and meaningful answer | Interpret analyses with focus on effect and uncertainty | Revise EDA and/or models to provide more specific answers |
Communication | Process & results are complete and meaningful | Seek feedback | Revise analyses or approach to presentation |
There’s no hard and fast way of doing EDA, but as a general rule you’re looking to:
The ‘joke’ is that 80% of Data Science is data cleaning.
Here’s another view of how to do EDA:
The problem of relying on statistics alone was amply illustrated by Anscombe’s Quartet (1973)…
Sometimes we are too good at spotting patterns; that’s where the statistics come in. Think of it as the ‘tiger in the jungle’ problem.
X1 | Y1 | X2 | Y2 | X3 | Y3 | X4 | Y4 |
---|---|---|---|---|---|---|---|
10.0 | 8.04 | 10.0 | 9.14 | 10.0 | 7.46 | 8.0 | 6.58 |
8.0 | 6.95 | 8.0 | 8.14 | 8.0 | 6.77 | 8.0 | 5.76 |
13.0 | 7.58 | 13.0 | 8.74 | 13.0 | 12.74 | 8.0 | 7.71 |
9.0 | 8.81 | 9.0 | 8.77 | 9.0 | 7.11 | 8.0 | 8.84 |
11.0 | 8.33 | 11.0 | 9.26 | 11.0 | 7.81 | 8.0 | 8.47 |
14.0 | 9.96 | 14.0 | 8.10 | 14.0 | 8.84 | 8.0 | 7.04 |
6.0 | 7.24 | 6.0 | 6.13 | 6.0 | 6.08 | 8.0 | 5.25 |
4.0 | 4.26 | 4.0 | 3.10 | 4.0 | 5.39 | 19.0 | 12.50 |
12.0 | 10.84 | 12.0 | 9.13 | 12.0 | 8.15 | 8.0 | 5.56 |
7.0 | 4.82 | 7.0 | 7.26 | 7.0 | 6.42 | 8.0 | 7.91 |
5.0 | 5.68 | 5.0 | 4.74 | 5.0 | 5.73 | 8.0 | 6.89 |
Yet all four data sets share essentially identical summary statistics:

Property | Value |
---|---|
Mean of x | 9.0 |
Variance of x | 11.0 |
Mean of y | 7.5 |
Variance of y | 4.12 |
Correlation between x and y | 0.816 |
Linear model | y = 3 + 0.5x |
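If you want to check, here’s a minimal sketch (my own, using pandas and numpy) that reproduces those statistics for the first data set; the other three work the same way.

import numpy as np
import pandas as pd

x  = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
q1 = pd.DataFrame({'x': x, 'y': y1})

print(f"mean(x)={q1.x.mean():.2f}, var(x)={q1.x.var():.2f}")   # 9.00 and 11.00
print(f"mean(y)={q1.y.mean():.2f}, var(y)={q1.y.var():.2f}")   # ≈ 7.5 and ≈ 4.1
print(f"corr(x,y)={q1.x.corr(q1.y):.3f}")                      # ≈ 0.816
b, a = np.polyfit(q1.x, q1.y, 1)                               # slope, intercept
print(f"y = {a:.2f} + {b:.2f}x")                               # ≈ y = 3.00 + 0.50x

Only plotting the four data sets reveals how different they really are.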
I would argue that the basic purpose of charts and of statistics as a whole is to help us untangle signal from noise. We are ‘programmed’ to see signals, so we need to set the standard for ‘it’s a tiger!’ quite high in research & in policy-making.
You can make a lot of progress in your research without any advanced statistics!
Always ask yourself:
A good chart is a good way to start!
A good chart or table:
How much precision is necessary in measuring degrees at the equator?
Decimal Places | Degrees | Distance |
---|---|---|
0 | 1 | 111km |
1 | 0.1 | 11.1km |
2 | 0.01 | 1.11km |
3 | 0.001 | 111m |
4 | 0.0001 | 11.1m |
5 | 0.00001 | 1.11m |
6 | 0.000001 | 11.1cm |
7 | 0.0000001 | 1.11cm |
8 | 0.00000001 | 1.11mm |
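If you want to check the arithmetic, here is a quick sketch (my own, using the conventional ≈111.32 km per degree of longitude at the equator):

# Each additional decimal place divides the ground distance by ten
for dp in range(9):
    degrees = 10 ** -dp                 # e.g. 0.001 at three decimal places
    metres = 111_320 * degrees          # ≈ 111.32 km per degree at the equator
    print(f"{dp} dp -> {degrees:g} degrees ≈ {metres:,.2f} m")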
The purpose of a graph is to show that there are relationships within the data set that are neither trivial nor expected.
Choose the chart to highlight relationships, or the lack thereof:
Getting information from a table is like extracting sunlight from a cucumber. (Arthur & Henry Farquhar, 1891)
Consider the difference in emphasis between:
Always keep in mind the purpose of the number.
Why a table is sometimes better than a chart:
Principles:
You can follow along by loading the Inside Airbnb sample:
import pandas as pd
import geopandas as gpd

# Load the Inside Airbnb sample and index it by listing id
url = 'https://bit.ly/3I0XDrq'
df = pd.read_csv(url)
df.set_index('id', inplace=True)

# Convert price from a string (e.g. '$85.00') to a float
df['price'] = df.price.str.replace('$', '', regex=False).astype('float')

# Build a GeoDataFrame from the longitude/latitude columns and save it
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(
        df['longitude'],
        df['latitude'],
        crs='epsg:4326'
    )
)
gdf.to_file('Airbnb_Sample.gpkg', driver='GPKG')
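To confirm the save worked, you can read the GeoPackage straight back (a quick sanity check of mine, not part of the original workflow):

gdf2 = gpd.read_file('Airbnb_Sample.gpkg')   # read the saved sample back in
print(gdf2.shape, gdf2.crs)                  # rows/columns and the CRS set above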
This is by no means all that we can do…
Command | Returns |
---|---|
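For example, a few of the usual first-look commands (a selection of mine, not an exhaustive list):

df.info()              # column names, dtypes and non-null counts
df.describe()          # summary statistics for the numeric columns
df.head()              # the first five rows
df.price.describe()    # the same summaries for a single series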
So pandas provides functions for commonly-used measures:
print(f"{df.price.mean():.2f}")
print(f"{df.price.median():.2f}")
print(f"{df.price.quantile(0.25):.2f}")
Output:
118.45
80.50
40.75
But Pandas also makes it easy to derive new variables… Here’s the z-score:
\[ z = \frac{x - \bar{x}}{s}\]
And here’s the Interquartile Range Standardised score:
\[ x_{iqrs} = \frac{x - \widetilde{x}}{Q_{75} - Q_{25}} \]
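In pandas both are one-liners; here is a minimal sketch using the price column loaded earlier (the new column names are just illustrative):

# z-score: centre on the mean, scale by the (sample) standard deviation
df['price_z'] = (df.price - df.price.mean()) / df.price.std()

# IQR-standardised score: centre on the median, scale by the inter-quartile range
iqr = df.price.quantile(0.75) - df.price.quantile(0.25)
df['price_iqrs'] = (df.price - df.price.median()) / iqr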
We’ll get to more complex plotting over the course of the term, but here’s a good start for exploring the data! All plotting depends on matplotlib, which is the ogre in the attic to R’s ggplot.
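In practice that means starting each session with the standard pyplot import (I’m assuming the conventional form here):

import matplotlib.pyplot as plt   # the MATLAB-style plotting interface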
Get used to this import as it will allow you to save and manipulate the figures created in Python. It is not the most intuitive approach (unless you’ve used MATLAB before) but it does work.
I personally prefer ggplot, and sometimes even finish off graphics for articles in R just so that I can use ggplot; however, it is possible to generate great-looking figures in matplotlib, but it is often more work because it’s a lot less intuitive.

We’ll get to these in more detail in a couple of weeks, but here’s some output…
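As a flavour of what that first look might involve, here is a minimal sketch (my own, using the df and gdf objects from earlier) that draws a histogram of prices and a simple dot map of the listings:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df.price.plot.hist(bins=50, ax=axes[0], title='Nightly price')
gdf.plot(ax=axes[1], markersize=1)               # each listing as a point
axes[1].set_title('Listing locations')
plt.savefig('airbnb_first_look.png', dpi=150)    # illustrative filename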
There’s so much more to find, but:
Exploratory Data Analysis • Jon Reades