Exploratory Data Analysis

Jon Reades - j.reades@ucl.ac.uk

1st October 2025

Epicyclic Feedback

Peng and Matsui, The Art of Data Science, p.8

|  | Set Expectations | Collect Information | Revise Expectations |
|---|---|---|---|
| Question | Question is of interest to audience | Literature search/experts | Sharpen question |
| EDA | Data are appropriate for question | Make exploratory plots | Refine question or collect more data |
| Modelling | Primary model answers question | Fit secondary models / analysis | Revise model to include more predictors |
| Interpretation | Interpretation provides specific and meaningful answer | Interpret analyses with focus on effect and uncertainty | Revise EDA and/or models to provide more specific answers |
| Communication | Process & results are complete and meaningful | Seek feedback | Revise analyses or approach to presentation |

Approaching EDA

There’s no hard and fast way of doing EDA, but as a general rule you’re looking to:

  • Clean
  • Canonicalise
  • Clean More
  • Visualise & Describe
  • Review
  • Clean Some More

The ‘joke’ is that 80% of Data Science is data cleaning.

Another Take

Here’s another view of how to do EDA:

  1. Preview data randomly and substantially
  2. Check totals such as number of entries and column types
  3. Check nulls at both row and column levels
  4. Check duplicates: do IDs recur, did the servers fail?
  5. Plot distribution of numeric data (univariate and pairwise joint distribution)
  6. Plot count distribution of categorical data
  7. Analyse time series of numeric data by daily, monthly and yearly frequencies
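The first four checks in that list can be sketched in pandas along these lines. This is a minimal illustration on a small made-up data frame (the `toy` frame and its values are hypothetical, not the Airbnb sample), so treat it as a starting point rather than a recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one repeated id and two missing values
toy = pd.DataFrame({
    'id':        [1, 2, 2, 3, 4],
    'room_type': ['Entire home/apt', 'Private room', 'Private room', None, 'Shared room'],
    'price':     [120.0, 45.0, 45.0, 80.0, np.nan],
})

# 1. Preview a random sample rather than just the first few rows
preview = toy.sample(3, random_state=42)

# 2. Check totals: number of rows/columns and the column types
n_rows, n_cols = toy.shape
col_types = toy.dtypes

# 3. Check nulls by column and by row
nulls_by_col = toy.isnull().sum()
nulls_by_row = toy.isnull().sum(axis=1)

# 4. Check duplicates: do ids recur?
dup_ids = toy[toy.duplicated(subset='id', keep=False)]

print(preview)
print(f"{n_rows} rows x {n_cols} columns")
print(col_types)
print(nulls_by_col)
print(dup_ids)
```

Using `keep=False` flags every row that shares an id, not just the second occurrence, which makes it easier to compare the copies side by side.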

In Practice

Getting Started

You can follow along by loading the Inside Airbnb sample:

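import pandas as pd
import geopandas as gpd

# Load the sample and index it by listing id
url = 'https://bit.ly/3I0XDrq'
df = pd.read_csv(url)
df.set_index('id', inplace=True)

# Strip the currency symbol and any thousands separators before converting
df['price'] = (df.price.str.replace('$', '', regex=False)
                       .str.replace(',', '', regex=False)
                       .astype('float'))

# Build point geometries from the coordinates (WGS84) and save as a GeoPackage
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df['longitude'], df['latitude'], crs='epsg:4326')
)
gdf.to_file('Airbnb_Sample.gpkg', driver='GPKG')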

What Can We Do? (Series)

This is by no means all that we can do…

Series-level Methods.

| Command | Returns |
|---|---|
| `print(f"Host count is {gdf.host_name.count()}")` | Count of non-nulls |
| `print(f"Mean is {gdf.price.mean():.0f}")` | Mean |
| `print(f"Max price is {gdf.price.max()}")` | Highest value |
| `print(f"Min price is {gdf.price.min()}")` | Lowest value |
| `print(f"Median price is {gdf.price.median()}")` | Median |
| `print(f"Standard dev is {gdf.price.std():.2f}")` | Standard deviation |
| `print(f"25th quantile is {gdf.price.quantile(q=0.25)}")` | 25th quantile |
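Several of these Series-level measures can also be gathered in one call. Here is a minimal sketch using `Series.agg` on a hypothetical five-value `prices` series (made up for illustration, not taken from the Airbnb sample):

```python
import pandas as pd

# Hypothetical prices, including one obvious outlier
prices = pd.Series([40.0, 60.0, 80.0, 100.0, 300.0], name='price')

# Ask for several summary measures at once; the result is a Series
# indexed by the name of each measure
summary = prices.agg(['count', 'mean', 'median', 'std', 'min', 'max'])

# Quantiles take an argument, so they are easier to call separately
q25 = prices.quantile(q=0.25)

print(summary)
print(f"25th quantile is {q25}")
```

Note how far the mean sits above the median here: a single large value is enough to drag it upwards, which is one reason to look at both.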

What Can We Do? (Data Frame)

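| Command | Returns |
|---|---|
| `print(df.mean(numeric_only=True))` | Mean of each numeric column |
| `print(df.count())` | Number of non-null values in each column |
| `print(df.max(numeric_only=True))` | Highest value in each numeric column |
| `# ...` | $\vdots$ |
| `print(df.corr(numeric_only=True))` | Correlation between numeric columns |
| `print(df.describe())` | Summary statistics for each numeric column |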
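A small, self-contained illustration of frame-level summaries follows. The `toy` frame is made up for the example. `numeric_only=True` is used because recent pandas releases refuse to average text columns; on an older version you may not need it.

```python
import pandas as pd

# Hypothetical mixed-type frame: one text column, two numeric columns
toy = pd.DataFrame({
    'room_type': ['Entire home/apt', 'Private room', 'Private room',
                  'Shared room', 'Entire home/apt'],
    'price':     [40.0, 60.0, 80.0, 100.0, 300.0],
    'beds':      [1, 1, 2, 2, 4],
})

# Column-wise means, skipping the text column
means = toy.mean(numeric_only=True)

# Non-null counts work on every column, text included
counts = toy.count()

# describe() reports only the numeric columns of a mixed frame by default
summary = toy.describe()

print(means)
print(counts)
print(summary)
```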

Measures

So pandas provides functions for commonly-used measures:

print(f"{df.price.mean():.2f}")
print(f"{df.price.median():.2f}")
print(f"{df.price.quantile(0.25):.2f}")

Output:

118.45
80.50
40.75

More Complex Measures

But pandas also makes it easy to derive new variables… Here’s the z-score:

\[ z = \frac{x - \bar{x}}{s}\]

df['zscore'] = (df.price - df.price.mean())/df.price.std()
df.zscore.plot.box()
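To see what the z-score does, here is a minimal worked sketch on a hypothetical five-value `toy` frame (the values are invented for illustration). After standardising, the column should have a mean of roughly 0 and a standard deviation of roughly 1.

```python
import pandas as pd

# Hypothetical prices, including one obvious outlier
toy = pd.DataFrame({'price': [40.0, 60.0, 80.0, 100.0, 300.0]})

# Same formula as above; note that pandas' std() is the sample
# standard deviation (ddof=1) by default
toy['zscore'] = (toy.price - toy.price.mean()) / toy.price.std()

print(toy)
print(f"Mean of z-scores: {toy.zscore.mean():.2f}")
print(f"Std of z-scores:  {toy.zscore.std():.2f}")
```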

And Even More Complex

And here’s the Interquartile Range Standardised score:

\[ x_{iqrs} = \frac{x - \widetilde{x}}{Q_{75} - Q_{25}} \]

df['iqr_std'] = (df.price - df.price.median())/ \
      (df.price.quantile(q=0.75)-df.price.quantile(q=0.25))
df.iqr_std.plot.box()
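The same hypothetical toy prices make the IQR-standardised score easy to check by hand: the median is 80 and the interquartile range is 100 - 60 = 40, so the outlier at 300 scores (300 - 80) / 40 = 5.5. A minimal sketch, with invented values:

```python
import pandas as pd

# Hypothetical prices, including one obvious outlier
toy = pd.DataFrame({'price': [40.0, 60.0, 80.0, 100.0, 300.0]})

# Centre on the median and scale by the interquartile range
median = toy.price.median()
iqr = toy.price.quantile(q=0.75) - toy.price.quantile(q=0.25)
toy['iqr_std'] = (toy.price - median) / iqr

print(f"Median: {median}, IQR: {iqr}")
print(toy)
```

Because the median and the quartiles barely move when an extreme value is added, this score is less affected by outliers than the z-score, whose mean and standard deviation are both pulled by them.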

The Plot Thickens

We’ll get to more complex plotting over the course of the term, but here’s a good start for exploring the data! All pandas plotting depends on matplotlib, which is the ogre in the attic to R’s ggplot.

import matplotlib.pyplot as plt

Get used to this import as it will allow you to save and manipulate the figures created in Python. It is not the most intuitive approach (unless you’ve used MATLAB before) but it does work.
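As a rough sketch of what that manipulation looks like: create a figure and axes explicitly, hand the axes to pandas, then adjust and save through matplotlib. The `toy` data and the output filename are made up for illustration, and the `Agg` backend line is only there so that the example runs outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical prices for illustration
toy = pd.DataFrame({'price': [40.0, 60.0, 80.0, 100.0, 300.0]})

# Create the figure and axes ourselves so we keep a handle on both
fig, ax = plt.subplots(figsize=(6, 4))

# pandas draws onto the axes we pass in
toy.price.plot.hist(bins=5, ax=ax)

# Anything pandas did not set can be adjusted via matplotlib
ax.set_title('Toy price distribution')
ax.set_xlabel('Price')

fig.savefig('toy_hist.png', dpi=150)
plt.close(fig)
```

Keeping hold of `fig` and `ax` is what lets you tweak titles, labels, and limits after pandas has drawn the plot.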

Confession Time

I do like ggplot, and I sometimes even finish off graphics for articles in R just so that I can use it. It is possible to generate great-looking figures in matplotlib, but it is often more work because it’s a lot less intuitive.

Boxplot

df.price.plot.box()
plt.savefig('pboxplot.png', dpi=150, transparent=True)

Frequency

df.room_type.value_counts().plot.bar()
plt.savefig('phistplot.png', dpi=150, transparent=True)

A Correlation Heatmap

We’ll get to these in more detail in a couple of weeks, but here’s some output…
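In the meantime, one possible way to draw such a heatmap with plain matplotlib is sketched below. Everything in it is an assumption for illustration: the `toy` data are randomly generated (with `price` deliberately built to track `beds`), and `imshow` is just one of several ways to render a correlation matrix.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: price is constructed to correlate with beds
rng = np.random.default_rng(42)
n = 200
beds = rng.integers(1, 6, size=n)
toy = pd.DataFrame({
    'beds':    beds,
    'price':   40 * beds + rng.normal(0, 20, size=n),
    'reviews': rng.integers(0, 100, size=n),  # unrelated noise
})

# Pairwise Pearson correlations between the columns
corr = toy.corr()

# Draw the matrix as an image, fixing the colour scale to [-1, 1]
fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(corr.to_numpy(), cmap='viridis', vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha='right')
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(im, ax=ax, label='Pearson correlation')
fig.tight_layout()
fig.savefig('toy_corr.png', dpi=150)
plt.close(fig)
```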

A ‘Map’

df.plot.scatter(x='longitude',y='latitude')
plt.savefig('pscatterplot.png', dpi=150, transparent=True)

A Fancy ‘Map’

df.plot.scatter(x='longitude',y='latitude',
                c='price',colormap='viridis',
                figsize=(10,5),title='London',
                grid=True,s=24,marker='x')
plt.savefig('pscatterplot.png', dpi=150, transparent=True)

An Actual ‘Map’

# scheme='quantiles' requires the mapclassify package to be installed
gdf.plot(column='price', cmap='viridis',
         scheme='quantiles', markersize=8, legend=True)

Additional Resources

There’s so much more to find, but:

Thank You
