Jon Reades - j.reades@ucl.ac.uk
1st October 2025
Peng and Matsui, The Art of Data Science, p.8
| | Set Expectations | Collect Information | Revise Expectations |
---|---|---|---|
Question | Question is of interest to audience | Literature search/experts | Sharpen question |
EDA | Data are appropriate for question | Make exploratory plots | Refine question or collect more data |
Modelling | Primary model answers question | Fit secondary models / analysis | Revise model to include more predictors |
Interpretation | Interpretation provides specific and meaningful answer | Interpret analyses with focus on effect and uncertainty | Revise EDA and/or models to provide more specific answers |
Communication | Process & results are complete and meaningful | Seek feedback | Revise analyses or approach to presentation |
There’s no hard and fast way of doing EDA, but as a general rule you’re looking to:
The ‘joke’ is that 80% of Data Science is data cleaning.
Here’s another view of how to do EDA:
You can follow along by loading the Inside Airbnb sample:
import pandas as pd
import geopandas as gpd
url='https://bit.ly/3I0XDrq'
df = pd.read_csv(url)
df.set_index('id', inplace=True)
df['price'] = df.price.str.replace('$','',regex=False).str.replace(',','',regex=False).astype('float')  # drop '$' and any thousands separators
gdf = gpd.GeoDataFrame(df,
    geometry=gpd.points_from_xy(
        df['longitude'],
        df['latitude'],
        crs='epsg:4326'
    )
)
gdf.to_file('Airbnb_Sample.gpkg', driver='GPKG')
This is by no means all that we can do…
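As a rough sketch of the kind of first-look commands that are useful at this stage, here are a few standard pandas calls run on a small made-up data frame. The values and the `room_type` column are assumptions for illustration only; swap in the `df` loaded from the sample to try them on real data.

```python
import pandas as pd

# Hypothetical stand-in for the Airbnb sample so that this runs offline
df = pd.DataFrame({
    'price': [40.0, 55.0, 80.0, 120.0, 300.0],
    'room_type': ['Private room', 'Private room', 'Entire home/apt',
                  'Entire home/apt', 'Entire home/apt'],
})

print(df.shape)                     # (number of rows, number of columns)
print(df.dtypes)                    # data type of each column
print(df.isna().sum())              # count of missing values per column
print(df.describe())                # count, mean, std, min, quartiles, max (numeric columns)
print(df.room_type.value_counts())  # frequency of each category
```

None of these change the data; they only report on it, so they are safe to run repeatedly while you get a feel for a new data set.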
So pandas provides functions for commonly-used measures:
print(f"{df.price.mean():.2f}")
print(f"{df.price.median():.2f}")
print(f"{df.price.quantile(0.25):.2f}")
Output:
118.45
80.50
40.75
But pandas also makes it easy to derive new variables… Here’s the z-score:
\[ z = \frac{x - \bar{x}}{s}\]
And here’s the Interquartile Range Standardised score, where \(\widetilde{x}\) is the median:
\[ x_{iqrs} = \frac{x - \widetilde{x}}{Q_{75} - Q_{25}} \]
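Both formulas translate almost directly into pandas. The following is a minimal sketch using a handful of made-up prices (an assumption, not the Airbnb sample); the variable names `z` and `iqrs` are just illustrative. Note that `Series.std()` uses the sample standard deviation by default, which matches the \(s\) in the z-score formula.

```python
import pandas as pd

# Made-up prices standing in for df.price
prices = pd.Series([40.0, 55.0, 80.0, 120.0, 300.0], name='price')

# z-score: centre on the mean, scale by the sample standard deviation
z = (prices - prices.mean()) / prices.std()

# IQR-standardised score: centre on the median, scale by the interquartile range
q25 = prices.quantile(0.25)
q75 = prices.quantile(0.75)
iqrs = (prices - prices.median()) / (q75 - q25)

print(pd.DataFrame({'price': prices, 'z': z, 'iqrs': iqrs}).round(2))
```

Because the median and IQR are less affected by extreme values than the mean and standard deviation, the second score is usually the more robust choice for skewed variables such as price.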
We’ll get to more complex plotting over the course of the term, but here’s a good start for exploring the data! All plotting depends on matplotlib, which is the ogre in the attic to R’s ggplot.
Get used to this import as it will allow you to save and manipulate the figures created in Python. It is not the most intuitive approach (unless you’ve used MATLAB before) but it does work.
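The import in question is presumably the conventional `import matplotlib.pyplot as plt`. Here is a minimal sketch of how it lets you hold on to a figure, tweak it, and save it. The prices are made up and the output file name `price_hist.png` is a hypothetical choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up prices standing in for df.price
prices = pd.Series([40.0, 55.0, 80.0, 120.0, 300.0], name='price')

# Create the figure and axes explicitly so that we can manipulate them afterwards
fig, ax = plt.subplots(figsize=(6, 4))

# pandas draws onto the axes we pass in
prices.plot.hist(ax=ax, bins=5)

# Adjust labels via the axes object, then save via the figure object
ax.set_xlabel('Price')
ax.set_title('Price distribution')
fig.savefig('price_hist.png', dpi=150)
plt.close(fig)  # release the figure once it has been written to disk
```

Keeping hold of `fig` and `ax` is what makes later changes (titles, limits, annotations) possible without redrawing the plot from scratch.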
I do like ggplot and sometimes even finish off graphics for articles in R just so that I can use ggplot; however, it is possible to generate great-looking figures in matplotlib, but it is often more work because it’s a lot less intuitive.
We’ll get to these in more detail in a couple of weeks, but here’s some output…
There’s so much more to find, but: