# Set download URL
ymd  = '2024-06-14'
host = 'https://orca.casa.ucl.ac.uk'
url  = f'{host}/~jreades/data/{ymd}-listings.geoparquet'
Practical 12: Grouping Data
Aggregation, Classification & Clustering
A common challenge in data analysis is how to group observations in a data set together in a way that allows for generalisation: this group of observations is similar to one another, that group is dissimilar to this group. Sometimes we have a label that we can use as part of the process (in which case we’re doing classification), and sometimes we don’t (in which case we’re doing clustering). But what defines similarity and difference? There is no one answer to that question and so there are many different ways to cluster or classify data, each of which has strengths and weaknesses that make them more, or less, appropriate in different contexts.
🔗 Connections: This practical pulls together many topics covered in other modules, and many of the ideas covered elsewhere in this module: clustering, reproducibility, dimensionality reduction… but, above all, this practical is about the importance of judgement. Do not take what we’ve done here as the ONE RIGHT WAY: a number of these results are questionable at best because we haven’t developed or defined an underlying hypothesis informed by a critical appraisal of the data. You should be much more selective in how you deploy the data and the algorithms, as last week’s session on dimensionality should have shown.
Preamble
import warnings # This suppresses some meaningless errors from Seaborn and Pandas
warnings.simplefilter(action='ignore', category=FutureWarning)
import numpy as np
import pandas as pd
import geopandas as gpd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib as mpl
import re
import os
from matplotlib.colors import ListedColormap
# All of these are potentially useful, though
# not all have been used in this practical --
# I'd suggest exploring the use of different
# Scalers/Transformers as well as clustering
# algorithms...
from sklearn.neighbors import NearestNeighbors
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, PowerTransformer
from sklearn.cluster import KMeans, DBSCAN, OPTICS
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import silhouette_samples, silhouette_score
import random
random.seed(42)    # For reproducibility
np.random.seed(42) # For reproducibility
# Make numeric display a bit neater
pd.set_option('display.float_format', lambda x: '{:,.2f}'.format(x))
Initialise the Scaler(s)
Remember that you can set up the `sklearn` transformers in advance, and then `fit` them before `transform`-ing them.
mms  = MinMaxScaler(feature_range=(-1,1))
stds = StandardScaler()
rbs  = RobustScaler()
pts  = PowerTransformer()
Set Up Plotting Functions
🔗 Connections: Here’s an example of how you can use a function to do something a little more complex than just locally save some data. This is still largely a kind of ‘stub’, but if you are going to be producing a lot of plots of London why not automate away some of the pain of producing a good-looking basemap each time: use a function to apply the formatting and then just return `f` and `ax` as if you’d done this all yourself.
def plt_ldn(w, b):
    """
    Creates a new figure of a standard size with the
    water (w) and boundary (b) layers set up for easy
    plotting. Right now this function assumes that you're
    looking at London, but you could parameterise it in
    other ways to allow it to work for other areas.

    w: a water layer for London
    b: a borough (or other) boundary layer for London
    """
    fig, ax = plt.subplots(1, figsize=(14, 12))
    w.plot(ax=ax, color='#79aef5', zorder=2)
    b.plot(ax=ax, edgecolor='#cc2d2d', facecolor='None', zorder=3)
    ax.set_xlim([502000,563000])
    ax.set_ylim([155000,201500])
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.spines['left'].set_visible(False)
    return fig, ax
########################
# These may no longer be relevant because of changes to geopandas API
def default_cmap(n, outliers=False):
    cmap   = mpl.cm.get_cmap('viridis_r', n)
    colors = cmap(np.linspace(0,1,n))
    if outliers:
        gray   = np.array([225/256, 225/256, 225/256, 1])
        colors = np.insert(colors, 0, gray, axis=0)
    return ListedColormap(colors)
# mappable = ax.collections[-1] if you add the geopandas
# plot last.
def add_colorbar(mappable, ax, cmap, norm, breaks, outliers=False):
    # Use the figure that the axes belong to rather than relying on a global 'fig'
    cb = ax.get_figure().colorbar(mappable, ax=ax, cmap=cmap, norm=norm,
                                  boundaries=breaks,
                                  extend=('min' if outliers else 'neither'),
                                  spacing='uniform',
                                  orientation='horizontal',
                                  fraction=0.05, shrink=0.5, pad=0.05)
    cb.set_label("Cluster Number")
Set up Caching Function
import os
from requests import get
from urllib.parse import urlparse
def cache_data(src:str, dest:str) -> str:
    """Downloads and caches a remote file locally.

    The function sits between the 'read' step of a pandas or geopandas
    data frame and downloading the file from a remote location. The idea
    is that it will save it locally so that you don't need to remember to
    do so yourself. Subsequent re-reads of the file will return instantly
    rather than downloading the entire file for a second or n-th time.

    Parameters
    ----------
    src : str
        The remote *source* for the file, any valid URL should work.
    dest : str
        The *destination* location to save the downloaded file.

    Returns
    -------
    str
        A string representing the local location of the file.
    """

    url = urlparse(src)               # We assume that this is some kind of valid URL
    fn  = os.path.split(url.path)[-1] # Extract the filename
    dfn = os.path.join(dest,fn)       # Destination filename

    # Check if dest+filename does *not* exist --
    # that would mean we have to download it!
    if not os.path.isfile(dfn) or os.path.getsize(dfn) < 1:
        print(f"{dfn} not found, downloading!")

        # Convert the path back into a list (without
        # the filename) -- we need to check that directories
        # exist first.
        path = os.path.split(dest)

        # Create any missing directories in dest(ination) path
        # -- os.path.join is the reverse of split (as you saw above)
        # but it doesn't work with lists... so I had to google how
        # to use the 'splat' operator! os.makedirs creates missing
        # directories in a path automatically.
        if len(path) >= 1 and path[0] != '':
            os.makedirs(os.path.join(*path), exist_ok=True)

        # Download and write the file
        with open(dfn, "wb") as file:
            response = get(src)
            file.write(response.content)

        print('Done downloading...')

    else:
        print(f"Found {dfn} locally!")

    return dfn
Load Data
London Data Layers
spath = 'https://github.com/jreades/fsds/blob/master/data/src/' # source path
ddir  = os.path.join('data','geo') # destination directory

water = gpd.read_file( cache_data(spath+'Water.gpkg?raw=true', ddir) )
boros = gpd.read_file( cache_data(spath+'Boroughs.gpkg?raw=true', ddir) )
green = gpd.read_file( cache_data(spath+'Greenspace.gpkg?raw=true', ddir) )

msoas = gpd.read_file( cache_data('http://orca.casa.ucl.ac.uk/~jreades/data/MSOA-2011.gpkg', ddir) )
msoas = msoas.to_crs(epsg=27700)
# I don't use this in this practical, but it's a
# really useful data set that gives you 'names'
# for MSOAs that broadly correspond to what most
# Londoners would think of as a 'neighbourhood'.
msoa_nms = gpd.read_file( cache_data('http://orca.casa.ucl.ac.uk/~jreades/data/MSOA-2011-Names.gpkg', ddir) )
msoa_nms = msoa_nms.to_crs(epsg=27700)
print("Done.")
Reduced Dimensionality MSOA Data
You should have this locally from last week, but just in case…
host = 'http://orca.casa.ucl.ac.uk'
path = '~jreades/data'
rddf = gpd.read_parquet( cache_data(f'{host}/{path}/Reduced_Dimension_Data.geoparquet', ddir) )
print(f"Data frame is {rddf.shape[0]:,} x {rddf.shape[1]}")
You should have: `Data frame is 983 x 93`.
And below you should see both the components and the dimensions from last week’s processing.
rddf.iloc[0:3, -7:]
I get the results below, but note that the Dimension values may be slightly different:
|           | Component 5 | Component 6 | Component 7 | Borough              | Dimension 1 | Dimension 2 | Subregion                 |
|-----------|-------------|-------------|-------------|----------------------|-------------|-------------|---------------------------|
| E02000001 | 1.44        | 3.95        | -1.52       | City of London       | 7.74        | 3.36        | Inner West                |
| E02000002 | -0.28       | 0.89        | 0.26        | Barking and Dagenham | 2.04        | 7.59        | Outer East and North East |
| E02000003 | -0.11       | 1.12        | 0.83        | Barking and Dagenham | 2.20        | 6.87        | Outer East and North East |
Listings Data
Let’s also get the listings data from a few weeks back:
listings = gpd.read_parquet( cache_data(url, ddir) )
listings = listings.to_crs(epsg=27700)
print(f"Data frame is {listings.shape[0]:,} x {listings.shape[1]}")
You should have: `Data frame is 85,134 x 31`.
And a quick plot of the price to check:
listings.plot(???, cmap='plasma', scheme='quantiles', k=10,
              markersize=.5, alpha=0.15, figsize=(10,7));
Aggregate Listings by MSOA
Join Listings to MSOA
First, let’s link all this using the MSOA Geography that we created last week and a mix or merge and sjoin!
**🔗 Connections**: Notice a few things going on here! We are calling `gpd.sjoin` because pandas (`pd`) doesn't know about spatial joins, only geopandas (`gpd`) does. More on this [next week](https://jreades.github.io/fsds/sessions/week10.html#lectures). Also see how we drop some columns _at the point where we do the join_ by taking advantage of the fact that most pandas/geopandas operations return a _copy_ of the (Geo)DataFrame. That allows us to get back from the spatial join a neat, tidy data frame ready for further analysis. If you're struggling to make sense of this, try removing the `drop` operations and see what your data frame looks like afterwards. This should all be old hat, but in case you need a refresher there's always [Week 5](https://jreades.github.io/fsds/sessions/week5.html) on pandas.
# Before the spatial join
listings.columns
msoa_listings = gpd.sjoin(???, msoas.drop(
                    columns=['MSOA11NM', 'LAD11CD', 'LAD11NM', 'RGN11CD', 'RGN11NM',
                             'USUALRES', 'HHOLDRES', 'COMESTRES', 'POPDEN', 'HHOLDS',
                             'AVHHOLDSZ']), predicate='???').drop(
                    columns=['latitude','longitude','index_right']
                )
# All we've added is the MSOA11CD
msoa_listings.columns
All being well you should now have:
Index(['listing_url', 'last_scraped', 'name', 'description', 'host_id',
'host_name', 'host_since', 'host_location', 'host_is_superhost',
'host_listings_count', 'host_total_listings_count',
'host_verifications', 'property_type', 'room_type', 'accommodates',
'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
'minimum_nights', 'maximum_nights', 'availability_365',
'number_of_reviews', 'first_review', 'last_review',
'review_scores_rating', 'reviews_per_month', 'geometry', 'MSOA11CD'],
dtype='object')
Price by MSOA
Let’s calculate the median price by MSOA… Notice that we have to specify the column we want after the `groupby` so that we don’t get the median of every column returned.
**🔗 Connections**: I find `groupby` to be a complex operation and often need a couple of goes before I get back what I want. The thing to take away is that: 1) anything in the `groupby` will become part of the `index` afterwards (so if you group on multiple things you get a multi-part index); 2) aggregating functions apply to _all_ columns unless you filter them in some way. Here we filter by selecting only the `price` column to aggregate. You can also filter for `numeric only`.
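To see those two behaviours on their own, here's a throwaway sketch on invented data (the column names `MSOA`, `room`, and `price` are made up for the illustration):

```python
# Toy data (invented) to show how groupby keys become the index
toy = pd.DataFrame({'MSOA':  ['A','A','B'],
                    'room':  ['Entire','Private','Entire'],
                    'price': [100, 50, 80]})
print(toy.groupby('MSOA')['price'].median())               # single-level index
print(toy.groupby(['MSOA','room'])['price'].median())      # two-level (multi-part) index
print(toy.groupby('MSOA')['price'].median().reset_index()) # index turned back into a column
```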
# *m*soa *l*istings *g*rouped by *p*rice
mlgp = msoa_listings.groupby('???')['price'].agg('???')
mlgp.head()
You should get something like:
MSOA11CD
E02000001 170.00
E02000002 97.00
E02000003 80.00
E02000004 54.00
E02000005 100.00
Name: price, dtype: float64
Room Type by MSOA
Now let’s calculate the count of room types by MSOA and compare the effects of `reset_index` on the outputs below. And notice too that we can assign the aggregated value to a column name!
# *m*soa *l*istings *g*rouped *c*ount
mlgc = msoa_listings.groupby(['???','???'], observed=False).listing_url.agg(Count='???')
mlgc.head()
You should get something resembling this:
| MSOA11CD  | room_type       | Count |
|-----------|-----------------|-------|
| E02000001 | Entire home/apt | 466   |
|           | Hotel room      | 0     |
|           | Private room    | 61    |
|           | Shared room     | 1     |
| E02000002 | Entire home/apt | 4     |
# *m*soa *l*istings *g*rouped *c*ount *r*eset index
mlgcr = msoa_listings.groupby(['???','???'], observed=False).listing_url.agg(Count='???').reset_index() # msoa listings grouped counts
mlgcr.head()
You should get something like:
|   | MSOA11CD  | room_type       | Count |
|---|-----------|-----------------|-------|
| 0 | E02000001 | Entire home/apt | 466   |
| 1 | E02000001 | Hotel room      | 0     |
| 2 | E02000001 | Private room    | 61    |
| 3 | E02000001 | Shared room     | 1     |
| 4 | E02000002 | Entire home/apt | 4     |
Price by Room Type
But perhaps median price/room type would make more sense? And do we want to retain values where there are no listings? For example, there are no hotel rooms listed for E02000001, how do we ensure that these NAs are dropped?
# *m*soa *l*istings *g*rouped *r*oom *p*rice
mlgrp = msoa_listings.???(???, observed=True
        )['price'].agg('???').reset_index()
mlgrp.head()
You should get something like:
|   | MSOA11CD  | room_type       | price  |
|---|-----------|-----------------|--------|
| 0 | E02000001 | Entire home/apt | 177.00 |
| 2 | E02000001 | Private room    | 100.00 |
| 3 | E02000001 | Shared room     | 120.00 |
| 4 | E02000002 | Entire home/apt | 117.00 |
| 6 | E02000002 | Private room    | 42.00  |
Explore Outlier Per-MSOA Prices
Are there MSOAs that look like they might contain erroneous data?
Plot MSOA Median Prices
mlgp.hist(bins=200);
Examine Listings from High-Priced MSOAs
Careful, this is showing the listings from MSOAs whose median price is above $300/night:
msoa_listings[
    msoa_listings.MSOA11CD.isin(mlgp[mlgp > 300].index)
].sort_values(by='price', ascending=False).head(7)[
    ['price','room_type','name','description']
]
Some of these look legit (4, 5, and… 8 bedroom ‘villas’?), though not every one…
And how about these?
msoa_listings[
    (msoa_listings.MSOA11CD.isin(mlgp[mlgp > 300].index)) & (msoa_listings.room_type!='Entire home/apt')
].sort_values(by='price', ascending=False).head(7)[
    ['price','room_type','property_type','name','description']
]
If we wanted to be rigorous then we’d have to investigate further: properties in Mayfair and Westminster are going to be expensive, but are these plausible nightly prices? In some cases, yes. In others…
msoa_listings[
    (msoa_listings.MSOA11CD.isin(mlgp[mlgp < 100].index)) & (msoa_listings.room_type!='Entire home/apt')
].sort_values(by='price', ascending=False).head(7)[
    ['price','room_type','name','description']
]
On the whole, let’s take a guess that there are a small number of implausibly high prices for individual units that aren’t in very expensive neighbourhoods and that these are either erroneous/deliberately incorrect, or represent a price that is not per-night.
**🔗 Connections**: What's the right answer here? There isn't one. You could probably spend _months_ figuring out what's a real, available-to-let listing and what isn't. I would argue that assuming all listings are 'legit' without doing some additional EDA and ESDA is negligent. You could also look at how some of the methods of standardisation/normalisation work and use those to identify improbable listings (but remember that a £10,000 listing in Mayfair _might_ be legit, while a $5,000 listing in Barking _probably_ isn't!). Or you could look at the inter-decile range (or just define your own range: 1%-99%?).
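As a sketch of that last idea (a percentile screen rather than the keyword filter used below; the 1%/99% thresholds are arbitrary and `price` is assumed to be numeric already):

```python
# Rough percentile-based screen (illustrative thresholds, not used later)
lo, hi = listings.price.quantile([0.01, 0.99])
plausible = listings[listings.price.between(lo, hi)]
print(f"{len(listings) - len(plausible):,} listings fall outside the 1%-99% price range.")
```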
Filter Unlikely Listings
See if you can filter out these less likely listings on the following criteria:
- Listings are priced above $300/night AND
- Room type is not 'Entire home/apt' AND
- Listings do not contain the words: suite, luxury, loft, stunning, prime, historic, or deluxe.
I found 901 rows to drop this way.
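The general pattern is three boolean masks combined with `&`, with the keyword test negated using `~`. Here's a toy sketch on invented data so you can see the mechanics before filling in the real version below:

```python
# Toy illustration (invented rows): combine masks with & and negate with ~
toy = pd.DataFrame({'price':     [500, 450, 40],
                    'room_type': ['Private room', 'Private room', 'Private room'],
                    'name':      ['Luxury suite', 'Tiny box room', 'Budget single']})
mask = (toy.price > 300) & \
       (toy.room_type != 'Entire home/apt') & \
       ~(toy.name.str.contains(r'(?:suite|luxury)', flags=re.IGNORECASE, regex=True, na=True))
print(toy[mask]) # only the implausible $450 'Tiny box room' survives all three tests
```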
target_regex = r'(?:suite|luxury|loft|stunning|prime|historic|deluxe|boutique)'
to_drop = msoa_listings[
    (???) &
    (???) &
    ~(???(target_regex, flags=re.IGNORECASE, regex=True, na=True))]
print(f"Have found {to_drop.shape[0]:,} rows to drop on the basis of unlikely per night prices.")
to_drop.sort_values(by='price', ascending=False)[['price','room_type','name','description']]
Plot Unlikely Listings
Here we use the `plt_ldn` function – notice how it’s designed to return `f,ax` in the same way that `plt.subplots` (which we’re already familiar with) does!
f,ax = plt_ldn(???, ???)
to_drop.plot(column='price', markersize=10, alpha=0.7, cmap='viridis', ax=ax);
… And Drop
Some might be legitimate, but I’m feeling broadly ok with the remainder.
cleaned = msoa_listings.drop(index=to_drop.???)
print(f"Cleaned data has {cleaned.shape[0]:,} rows.")
After this I had 84,308 rows.
I would normally, at this point, spend quite a bit of time validating this cleaning approach, but right now we’re going to take a rough-and-ready approach.
Questions
- What data type did Task 2.2 return?
- What is the function of `reset_index()` in Task 2.3 and when might you choose to reset (or not)?
Pivot Tables & ‘Wide Data’
The `groupby` operation is one way to organise and aggregate our data, but pivot tables are a second common way to achieve this. We typically use a pivot table to go from long to wide data frames – it’s often seen as one of Excel’s main benefits, but Pandas can do that too!
**🔗 Connections**: Notice that a pivot table is just a different kind of aggregation. Principally, it's about going from long to wide in your data frame. These issues should be familiar from R and its `mutate` and other `data.table` methods. We just write them differently in Python.
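A toy long-to-wide example (invented values) before we do it for real:

```python
# Toy 'long' data (invented) pivoted to 'wide': one row per area, one column per room type
long_df = pd.DataFrame({'area':  ['A','A','B','B'],
                        'room':  ['Entire','Private','Entire','Private'],
                        'count': [10, 4, 2, 7]})
print(long_df.pivot(index='area', columns='room', values='count'))
```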
Create Pivot Table
We can make use of the pivot table function to generate counts by MSOA in a ‘wide’ format.
pivot = cleaned.groupby(
            ['MSOA11CD','room_type'], observed=False
        ).listing_url.agg(Count='count').reset_index().pivot(
            index='???', columns=['???'], values=['???'])
pivot.head(3)
The formatting will look a tiny bit different, but you should get something like this:
|           | Count           |            |              |             |
|-----------|-----------------|------------|--------------|-------------|
| room_type | Entire home/apt | Hotel room | Private room | Shared room |
| MSOA11CD  |                 |            |              |             |
| E02000001 | 466             | 0          | 55           | 1           |
| E02000002 | 4               | 0          | 2            | 0           |
| E02000003 | 12              | 0          | 13           | 0           |
Check Counts
pivot.sum()
Just to reassure you that the pivot results ‘make sense’:
print(cleaned[cleaned.room_type=='Entire home/apt'].listing_url.count())
print(cleaned[cleaned.room_type=='Private room'].listing_url.count())
Tidy & Normalise
My instinct at this point is that, looking at the pivot table, we see quite different levels of Airbnb penetration and it is hard to know how to handle this difference: share would be unstable because of the low counts in some places and high counts in others; a derived variable that tells us something about density or mix could be interesting (e.g. HHI or LQ) but wouldn’t quite capture the pattern of mixing.
Tidy
Personally, based on the room type counts above I think we can drop Hotel Rooms and Shared Rooms from this since the other two categories are so dominant.
# Flatten the column index
pivot.columns = ['Entire home/apt','Hotel room','Private room','Shared room']
# Drop the columns
pivot.drop(???, inplace=True)
pivot.head()
You should have only the Entire home/apt and Private room columns now.
Normalise
pivot_norm = pd.DataFrame(index=pivot.index)
for c in pivot.columns.to_list():
    # Power Transform
    pivot_norm[c] = pts.???(pivot[c].to_numpy().reshape(???,???))

pivot_norm.head()
You should have something like:
|           | Entire home/apt | Private room |
|-----------|-----------------|--------------|
| MSOA11CD  |                 |              |
| E02000001 | 2.20            | 1.06         |
| E02000002 | -1.29           | -1.85        |
Plot
pnm = pd.merge(msoas.set_index('MSOA11CD'), pivot_norm, left_index=True, right_index=True)
pnm.plot(column='Entire home/apt', cmap='viridis', edgecolor='none', legend=True, figsize=(12,8));
PCA
You can merge the output of this next step back on to the `rddf` data frame as part of a clustering process, though we’d really want to do some more thinking about what this data means and what transformations we’d need to do in order to make them meaningful.
For instance, if we went back to last week’s code, we could have appended this InsideAirbnb data before doing the dimensionality reduction, or we could apply it now to create a new measure that could be used as a separate part of the clustering process together with the reduced dimensionality of the demographic data.
Perform Reduction
pcomp = PCA(n_components=???, random_state=42)
rd = pcomp.???(pivot_norm)
print(f"The explained variance of each component is: {', '.join([f'{x*100:.2f}%' for x in pcomp.explained_variance_ratio_])}")
Take the first component and convert to a series to enable the merge:
airbnb_pca = pd.DataFrame(
    {'Airbnb Component 1': mms.fit_transform(rd[:,1].reshape(-1,1)).reshape(1,-1)[0]},
    index=pivot.index)

airbnb_pca.head()
You should have something like:

|           | Airbnb Component 1 |
|-----------|--------------------|
| MSOA11CD  |                    |
| E02000001 | 0.47               |
| E02000002 | 0.19               |
pcanm = pd.merge(msoas.set_index('MSOA11CD'), airbnb_pca, left_index=True, right_index=True)
pcanm.plot(column='Airbnb Component 1', cmap='viridis', edgecolor='none', legend=True, figsize=(12,8));
Write to Data Frame
# Result Set from merge
rs = pd.merge(rddf, airbnb_pca, left_index=True, right_index=True)
Grab the PCA, UMAP, and Airbnb outputs for clustering and append rescaled price:
# Merge the reduced dimensionality data frame with the PCA-reduced Airbnb data
# to create the *cl*uster *d*ata *f*rame
cldf = pd.merge(rddf.loc[:,'Component 1':], airbnb_pca,
                left_index=True, right_index=True)

# Append median price from cleaned listings grouped by MSOA too!
s1 = cleaned.groupby(by='MSOA11CD').price.agg('median')
cldf['median_price'] = pd.Series(np.squeeze(mms.fit_transform(s1.values.reshape(-1,1))), index=s1.index)

# Append mean price from cleaned listings grouped by MSOA too!
s2 = cleaned.groupby(by='MSOA11CD').price.agg('mean')
cldf['mean_price'] = pd.Series(np.squeeze(mms.fit_transform(s2.values.reshape(-1,1))), index=s2.index)

cldf.drop(columns=['Subregion','Borough'], inplace=True)

cldf.head()
Questions
- Have a think about why you might want to keep the Airbnb data separate from the MSOA data when doing PCA (or any other kind of dimensionality reduction)!
- Why might it be interesting to add both mean and median MSOA prices to the clustering process? Here’s a hint (but it’s very subtle):
sns.jointplot(x=s1, y=s2, s=15, alpha=0.6)
First K-Means Clustering
Perform Clustering
c_nm   = 'KMeans' # Clustering name
k_pref = ???      # Number of clusters

kmeans = KMeans(n_clusters=k_pref, n_init=25, random_state=42).fit(cldf.drop(columns=['Dimension 1','Dimension 2'])) # The process
Here are the results:
print(kmeans.labels_) # The results
Save Clusters to Data Frame
Write Series and Assign
Now capture the labels (i.e. clusters) and write them to a data series that we store on the result set df (`rs`):
rs[c_nm] = pd.Series(kmeans.labels_, index=cldf.index)
Histogram of Cluster Members
How are the clusters distributed?
sns.histplot(data=???, x=c_nm, bins=k_pref);
Map Clusters
And here’s a map!
fig, ax = plt_ldn(water, boros)
fig.suptitle(f"{c_nm} Results (k={k_pref})", fontsize=20, y=0.92)
rs.plot(column=???, ax=ax, linewidth=0, zorder=0, categorical=???, legend=True);
Questions
- What critical assumption did we make when running this analysis?
- Why did I not use the UMAP dimensions here?
- Why do we have the `c_nm='KMeans'` when we know what kind of clustering we’re doing?
- Does this look like a good clustering?
Second K-Means Clustering
What’s the ‘Right’ Number of Clusters?
There’s more than one way to find the ‘right’ number of clusters. In Singleton’s Geocomputation chapter they use WCSS to pick the ‘optimal’ number of clusters. The idea is that you plot the average WCSS for each number of possible clusters in the range of interest (2…n) and then look for a ‘knee’ (i.e. kink) in the curve. The principle of this approach is that you look for the point where there is declining benefit from adding more clusters. The problem is that there is always some benefit to adding more clusters (the perfect clustering is k==n), so you don’t always see a knee.
Another way to try to make the process of selecting the number of clusters a little less arbitrary is called the silhouette plot and (like WCSS) it allows us to evaluate the ‘quality’ of the clustering outcome by examining the distance between each observation and the rest of the cluster. In this case it’s based on Partitioning Around the Medoid (PAM).
Either way, to evaluate this in a systematic way, we want to do multiple k-means clusterings for multiple values of k and then we can look at which gives the best results…
kcldf = cldf.drop(columns=['Dimension 1','Dimension 2'])
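If you'd also like to draw the WCSS ('elbow') curve described above, a minimal sketch might look like this: sklearn exposes the within-cluster sum of squares as `inertia_`, and the range of `k` here is an arbitrary choice.

```python
# Sketch of the WCSS/'elbow' approach using KMeans' inertia_ attribute
wcss = []
for k in range(2, 15):
    wcss.append(KMeans(n_clusters=k, n_init=25, random_state=42).fit(kcldf).inertia_)
plt.plot(range(2, 15), wcss)
plt.xlabel('k'); plt.ylabel('WCSS (inertia)');
# Look for the point where adding clusters stops paying off -- there may be no clear knee
```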
Repeated Clustering
Let’s try clustering across a wider range. Because we repeatedly re-run the clustering code (unlike with Hierarchical Clustering) this can take a few minutes: it took nearly 5 minutes on an M2 Mac.
%%time
# Adapted from: http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
x = []
y = []

# For resolutions of 'k' in the range 2..40
for k in range(2,41):

    #############
    # Do the clustering using the main columns
    kmeans = KMeans(n_clusters=k, n_init=25, random_state=42).fit(kcldf)

    # Calculate the overall silhouette score
    silhouette_avg = silhouette_score(kcldf, kmeans.labels_)

    y.append(k)
    x.append(silhouette_avg)
    print('.', end='')
Plot Silhouette Scores
print()
print(f"Largest silhouette score was {max(x):6.4f} for k={y[x.index(max(x))]}")
plt.plot(y, x)
plt.gca().xaxis.grid(True);
plt.gcf().suptitle("Average Silhouette Scores");
**⚠ Note**: Had we used the UMAP dimensions here you'd likely see more instability in the silhouette plot because the distribution is not remotely Gaussian, though a lot depends on the magnitude of the columns and the number of UMAP vs. PCA components.
We can use the largest average silhouette score to determine the ‘natural’ number of clusters in the data, but that’s only if we don’t have any kind of underlying theory, other empirical evidence, or even just a reason for choosing a different value… Again, we’re now getting into areas where your judgement and your ability to communicate your rationale to readers is the key thing.
Final Clustering
So although we should probably pick the largest silhouette score, that’s `k=3`, which kind of defeats the purpose of clustering in the first place. In the absence of a compelling reason to pick 2 or 3 clusters, let’s have a closer look at the next maximum silhouette score:
Perform Clustering
k_pref = ???

#############
# Do the clustering using the main columns
kmeans = KMeans(n_clusters=k_pref, n_init=25, random_state=42).fit(kcldf)

# Convert to a series
s = pd.Series(kmeans.labels_, index=kcldf.index, name=c_nm)

# We do this for plotting
rs[c_nm] = s

# Calculate the overall silhouette score
silhouette_avg = silhouette_score(kcldf, kmeans.labels_)

# Calculate the silhouette values
sample_silhouette_values = silhouette_samples(kcldf, kmeans.labels_)
Plot Diagnostics
#############
# Create a subplot with 1 row and 2 columns
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(9, 5)

# The 1st subplot is the silhouette plot
# The silhouette coefficient can range from -1, 1
ax1.set_xlim([-1.0, 1.0]) # Changed from -0.1, 1

# The (n_clusters+1)*10 is for inserting blank space between silhouette
# plots of individual clusters, to demarcate them clearly.
ax1.set_ylim([0, kcldf.shape[0] + (k_pref + 1) * 10])

y_lower = 10

# For each of the clusters...
for i in range(k_pref):
    # Aggregate the silhouette scores for samples belonging to
    # cluster i, and sort them
    ith_cluster_silhouette_values = \
        sample_silhouette_values[kmeans.labels_ == i]

    ith_cluster_silhouette_values.sort()

    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i

    # Set the color ramp
    color = plt.cm.Spectral(i/k_pref)

    ax1.fill_betweenx(np.arange(y_lower, y_upper),
                      0, ith_cluster_silhouette_values,
                      facecolor=color, edgecolor=color, alpha=0.7)

    # Label the silhouette plots with their cluster numbers at the middle
    ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

    # Compute the new y_lower for next plot
    y_lower = y_upper + 10 # 10 for the 0 samples

ax1.set_title("The silhouette plot for the clusters.")
ax1.set_xlabel("The silhouette coefficient values")
ax1.set_ylabel("Cluster label")

# The vertical line for average silhouette score of all the values
ax1.axvline(x=silhouette_avg, color="red", linestyle="--", linewidth=0.5)

# Clear the yaxis labels / ticks
ax1.set_yticks([])
ax1.set_xticks(np.arange(-1.0, 1.1, 0.2)) # Was: [-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1]

# 2nd Plot showing the actual clusters formed --
# we can only do this for the first two dimensions
# so we may not see fully what is causing the
# resulting assignment
colors = plt.cm.Spectral(kmeans.labels_.astype(float) / k_pref)
ax2.scatter(kcldf[kcldf.columns[0]], kcldf[kcldf.columns[1]],
            marker='.', s=30, lw=0, alpha=0.7, c=colors)

# Labeling the clusters
centers = kmeans.cluster_centers_

# Draw white circles at cluster centers
ax2.scatter(centers[:, 0], centers[:, 1],
            marker='o', c="white", alpha=1, s=200)

for i, c in enumerate(centers):
    ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1, s=50)

ax2.set_title("Visualization of the clustered data")
ax2.set_xlabel("Feature space for the 1st feature")
ax2.set_ylabel("Feature space for the 2nd feature")

plt.suptitle(("Silhouette results for KMeans clustering "
              "with %d clusters" % k_pref),
             fontsize=14, fontweight='bold')
plt.show()
**⚠ Stop**: Make sure that you understand how the silhouette plot and value work, and why your results _may_ diverge from mine.
Map Clusters
fig, ax = plt_ldn(water, boros)
fig.suptitle(f"{c_nm} Results (k={k_pref})", fontsize=20, y=0.92)
rs.plot(column=c_nm, ax=ax, linewidth=0, zorder=0, categorical=True, legend=True);
‘Representative’ Centroids
To get a sense of how these clusters differ we can try to extract ‘representative’ centroids (mid-points of the multi-dimensional cloud that constitutes a cluster). In the case of k-means this will work quite well since the clusters are explicitly built around mean centroids. There’s also a k-medoids clustering approach built around the median centroid.
These are columns that we want to suppress from our sample:
to_suppress=['OBJECTID', 'BNG_E', 'BNG_N', 'LONG', 'LAT',
             'Shape__Are', 'Shape__Len', 'geometry', 'Component 1',
             'Component 2', 'Component 3', 'Component 4', 'Component 5',
             'Component 6', 'Component 7', 'Dimension 1', 'Dimension 2',
             'Airbnb Component 1']
Take a sample of the full range of numeric columns:
cols = random.sample(rs.select_dtypes(exclude='object').drop(columns=to_suppress).columns.to_list(), 12)
print(cols)
Calculate the mean of these columns for each cluster:
# Empty data frame with the columns we'll need
centroids = pd.DataFrame(columns=cols)

# For each cluster...
for k in sorted(rs[c_nm].unique()):
    print(f"Processing cluster {k}")
    # Select rows where the cluster name matches the cluster number
    clust = rs[rs[c_nm]==k]
    # Append the means to the centroids data frame
    centroids.loc[k] = clust[cols].mean()

centroids
centroids_long = pd.DataFrame(columns=['Variable','Cluster','Std. Value'])
for i in range(0,len(centroids.index)):
    row = centroids.iloc[i,:]
    for r in row.index:
        d = pd.DataFrame({'Variable':r, 'Cluster':i, 'Std. Value':row[r]}, index=[1])
        centroids_long = pd.concat([centroids_long, d], ignore_index=True)
g = sns.FacetGrid(centroids_long, col="Variable", col_wrap=3, height=3, aspect=1.5, margin_titles=True, sharey=True)
g = g.map(plt.bar, "Cluster", "Std. Value")
**🔗 Connections**: The above centroid outputs are a way to think about how each cluster is 'loaded' on to the data. We can't show all of the variables in the data, so we've randomly selected a subset and can then look at how different clusters are more (or less) associated with the standardised value of a particular column/variable.
DBSCAN
For what it’s worth, I’ve had enormous trouble with DBSCAN and this kind of data. I don’t think it deals very well with much more than three dimensions, so the flexibility of not having to specify the number of clusters is balanced by a density-based approach that is severely hampered by high-dimensional distance-inflation.
# Drop the PCA dimensions
cldf2 = cldf.loc[:,'Dimension 1':].copy()
for c in [x for x in cldf.columns.to_list() if x.startswith('Dimension ')]:
    cldf2[c] = pd.Series(np.squeeze(mms.fit_transform(cldf2[c].to_numpy().reshape(-1,1))), index=cldf2.index)
cldf2.head()
Work out the Neighbour Distance
We normally look for some kind of ‘knee’ to set the distance.
nbrs = NearestNeighbors(n_neighbors=6).fit(cldf2)
distances, indices = nbrs.kneighbors(cldf2)

distances = np.sort(distances, axis=0)
distances = distances[:,1]
Derive Approximate Knee
from kneed import knee_locator
kn = knee_locator.KneeLocator(np.arange(distances.shape[0]), distances, S=12,
                              curve='convex', direction='increasing')
print(f"Knee detected at: {kn.knee}")
kn.plot_knee()
kn.plot_knee_normalized()
print(f"Best guess at epsilon for DBSCAN is {distances[kn.knee]:0.4f}")
Explore Epsilons
There are two values that need to be specified: `eps` and `min_samples`. Both seem to be set largely by trial and error, though we can use the above result as a target. It’s easiest to set `min_samples` first since that sets a floor for your cluster size, and then `eps` is basically a distance metric that governs how far away something can be from a cluster and still be considered part of that cluster.
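As a reminder of what those two parameters do, here's a toy run on six invented 2-D points:

```python
# Toy DBSCAN run (invented points): eps is the neighbourhood radius,
# min_samples the minimum neighbourhood size for a 'core' point
toy_pts = np.array([[0,0], [0,0.1], [0.1,0], [5,5], [5,5.1], [10,10]])
print(DBSCAN(eps=0.5, min_samples=2).fit(toy_pts).labels_)
# -> [ 0  0  0  1  1 -1]: two dense groups plus one 'noise' point labelled -1
```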
Iterate Over Range
**⚠ Warning**: Depending on the data volume, this next step may take quite a lot of time since we are iterating through many, many values of Epsilon to explore how the clustering result changes and how well this matches up with (or doesn't) the graph above.
%%time
c_nm = 'DBSCAN'

# Make numeric display a bit neater
pd.set_option('display.float_format', lambda x: '{:,.4f}'.format(x))

el = []

max_clusters  = 10
cluster_count = 1

iters = 0

for e in np.arange(0.025, 0.76, 0.01): # <- You might want to adjust these!

    if iters % 25==0: print(f"{iters} epsilons explored.")

    # Run the clustering
    dbs = DBSCAN(eps=e, min_samples=cldf2.shape[1]+1).fit(cldf2)

    # See how we did
    s = pd.Series(dbs.labels_, index=cldf2.index, name=c_nm)

    row  = [e]
    data = s.value_counts()

    for c in range(-1, max_clusters+1):
        try:
            if np.isnan(data[c]):
                row.append(None)
            else:
                row.append(data[c])
        except KeyError:
            row.append(None)

    el.append(row)
    iters += 1

edf = pd.DataFrame(el, columns=['Epsilon']+["Cluster " + str(x) for x in list(range(-1,max_clusters+1))])

# Make numeric display a bit neater
pd.set_option('display.float_format', lambda x: '{:,.2f}'.format(x))
print("Done.")
Examine Clusters
edf.head() # Notice the -1 cluster for small epsilons
epsilon_long = pd.DataFrame(columns=['Epsilon','Cluster','Count'])

for i in range(0,len(edf.index)):
    row = edf.iloc[i,:]
    for c in range(1,len(edf.columns.values)):
        if row[c] != None and not np.isnan(row[c]):
            d = pd.DataFrame({'Epsilon':row[0], 'Cluster':f"Cluster {c-2}", 'Count':row[c]}, index=[1])
            epsilon_long = pd.concat([epsilon_long, d], ignore_index=True)

epsilon_long['Count'] = epsilon_long.Count.astype(float)
Plot Cluster Sizes
One of the really big problems with DBSCAN and this kind of data is that you have no practical way of specifying epsilon (whereas if you were doing walkability analysis then you could cluster on walking distance!). So you can look at the data (as above) to get a reasonable value, but look what the output below shows about the stability of the clusters for different values of epsilon!
fig, ax = plt.subplots(figsize=(12,8))
sns.lineplot(data=epsilon_long, x='Epsilon', y='Count', hue='Cluster');
plt.vlines(x=distances[kn.knee], ymin=0, ymax=epsilon_long.Count.max(), color=(1, .7, .7, .8), linestyles='dashed')
plt.gcf().suptitle(f"Cluster sizes for various realisations of Epsilon");
plt.tight_layout()
Final Clustering
Perform Clustering
Use the value from kneed…
dbs = DBSCAN(eps=distances[kn.knee], min_samples=cldf2.shape[1]+1).fit(cldf2.values)
s = pd.Series(dbs.labels_, index=cldf2.index, name=c_nm)
rs[c_nm] = s
print(s.value_counts())
Map Clusters
fig, ax = plt_ldn(water, boros)
fig.suptitle(f"{c_nm} Results", fontsize=20, y=0.92)
rs.plot(column=c_nm, ax=ax, linewidth=0, zorder=0, legend=True, categorical=True);
‘Representative’ Centroids
to_suppress=['OBJECTID', 'BNG_E', 'BNG_N', 'LONG', 'LAT',
             'Shape__Are', 'Shape__Len', 'geometry', 'Component 1',
             'Component 2', 'Component 3', 'Component 4', 'Component 5',
             'Component 6', 'Component 7', 'Dimension 1', 'Dimension 2',
             'Airbnb Component 1']
Take a sample of the full range of numeric columns:
cols = random.sample(rs.select_dtypes(exclude='object').drop(columns=to_suppress).columns.to_list(), 12)
print(cols)
Calculate the mean of these columns for each cluster:
# Empty data frame with the columns we'll need
centroids = pd.DataFrame(columns=cols)

# For each cluster...
for k in sorted(rs[c_nm].unique()):
    print(f"Processing cluster {k}")
    # Select rows where the cluster name matches the cluster number
    clust = rs[rs[c_nm]==k]
    # Append the means to the centroids data frame
    centroids.loc[k] = clust[cols].mean()

# Drop the unclustered records (-1)
centroids.drop(labels=[-1], axis=0, inplace=True)
centroids
centroids_long = pd.DataFrame(columns=['Variable','Cluster','Std. Value'])
for i in range(0,len(centroids.index)):
    row = centroids.iloc[i,:]
    for r in row.index:
        d = pd.DataFrame({'Variable':r, 'Cluster':i, 'Std. Value':row[r]}, index=[1])
        centroids_long = pd.concat([centroids_long, d], ignore_index=True)
g = sns.FacetGrid(centroids_long, col="Variable", col_wrap=3, height=3, aspect=1.5, margin_titles=True, sharey=True)
g = g.map(plt.bar, "Cluster", "Std. Value")
Self-Organising Maps
SOMs offer a third type of clustering algorithm. They are a relatively ‘simple’ type of neural network in which the ‘map’ (of the SOM) adjusts to the data. We’re not going to do this in this practical, but the main thing is that, unlike the above approaches, SOMs build a 2D map of a higher-dimensional space and use this as a mechanism for subsequently clustering the raw data. In this sense there is a conceptual link between SOMs and PCA or t-SNE or UMAP. They are used quite a lot for text-clustering using keywords (where you have high-dimensionality).
There are a lot of SOM implementations in Python but the one I used to use, called SOMPY, appears to have been abandoned.
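If you do want to experiment, one lightweight option is MiniSom; the sketch below is a rough illustration only and assumes you have run `pip install minisom` (the 6×6 grid size and iteration count are arbitrary choices).

```python
# A minimal SOM sketch using MiniSom (assumes `pip install minisom`)
from minisom import MiniSom

som_data = kcldf.to_numpy()
som = MiniSom(6, 6, som_data.shape[1], sigma=1.0, learning_rate=0.5, random_seed=42)
som.random_weights_init(som_data)
som.train_random(som_data, 5000)

# Each observation maps to a (row, col) node on the 2D grid, which can then
# itself be clustered or used as a crude grouping of the MSOAs
nodes = [som.winner(row) for row in som_data]
print(nodes[:5])
```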
Classification
And now for something completely different! This section is completely optional, but I thought that you might find it helpful to have a look at how supervised learning (classification) differs from unsupervised learning (clustering). Here we’re going to perform a fairly straightforward classification: predicting the `room_type` for randomly-selected listings. Of course we know the true answer, but this is for demonstration purposes!
Additional Setup
Import Libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
from sklearn.inspection import permutation_importance
Set Up Data
I’m taking a fairly brutal approach here: anything that is not inherently numeric is gone (bye, bye, text), and I’m not bothering to convert implicitly numeric values either: dates could be converted to ‘months since last review’, for instance, while amenities could be One-Hot Encoded after some pruning of rare amenities. This leaves us with a much smaller number of columns to feed in to the classifier.
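For illustration only, those two conversions might look something like the sketch below; neither is actually used in this practical and the cut-off date is just the scrape date assumed at the top of the notebook.

```python
# Hypothetical feature-engineering sketch (not used later in this practical)
# 1. Turn the last_review date into 'months since last review'...
months_since = (pd.Timestamp('2024-06-14') - pd.to_datetime(cleaned.last_review)).dt.days / 30.44

# 2. ...and One-Hot encode a categorical column (amenities would need pruning first)
room_dummies = pd.get_dummies(cleaned.room_type, prefix='room')
print(room_dummies.head())
```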
print(f"Cleaned columns: {', '.join(cleaned.columns.to_list())}.")
classifier_in = cleaned.drop(columns=['listing_url','last_scraped','name','description',
                                      'host_name', 'host_location', 'property_type',
                                      'bathrooms_text', 'amenities', 'geometry', 'MSOA11CD',
                                      'host_since', 'first_review', 'last_review',
                                      'host_verifications', 'review_scores_rating',
                                      'reviews_per_month'])
Remove NAs
Not all classifiers have this issue, but some will struggle to make predictions (or not be able to do so at all) if there are NAs in the data set. The classifier we’re using can’t deal with NAs, so we have to strip these out, but before we do let’s check the effect:
classifier_in.isna().sum()
We can safely drop these now, and you should end up with about 54,000 rows to work with.
classifier_in = classifier_in.dropna(axis=0, how='any')
print(f"Now have {classifier_in.shape[0]:,} rows of data to work with (down from {cleaned.shape[0]:,}).")
print()
print(f"Classifier training columns: {', '.join(classifier_in.columns.to_list())}.")
classifier_in.head()
Remap Non-Numeric Columns
We do still have a couple of non-numeric columns to deal with: booleans and the thing we’re actually trying to predict (the room type)!
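If you haven't met `LabelEncoder` before, this toy sketch (invented labels) shows the round trip that we rely on later when labelling the confusion matrix:

```python
# Toy LabelEncoder round trip (invented labels, separate from the 'le' used below)
demo_le = LabelEncoder()
codes = demo_le.fit_transform(['Private room', 'Entire home/apt', 'Private room'])
print(codes)                            # integer codes, e.g. [1 0 1]
print(demo_le.inverse_transform(codes)) # back to the original strings
```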
classifier_in['host_is_superhost'] = classifier_in.host_is_superhost.replace({True:1, False:0}).astype('int')
le = LabelEncoder()
classifier_in['room_class'] = le.fit_transform(classifier_in.room_type)
A quick check: we should only have one type per class and vice versa.
classifier_in.groupby(by=['room_type','room_class']).host_id.agg('count').reset_index()
Random Forest Classification
We’re going to use a Random Forest Classifier but the nice thing about `sklearn` is that you can quite easily swap in other classifiers if you’d like to explore further. This is one big advantage of Python over R in my book: whereas R tends to get new algorithms first, they are often implemented independently by many people and you can end up with incompatible data structures that require a lot of faff to reorganise for a different algorithm. Python is a bit more ‘managed’ and the dominance of `numpy`, `sklearn`, and `pandas` means that people have an incentive to contribute to these libraries or, if something is genuinely novel, to create an implementation that works like it would if it were part of `sklearn`!
**🔗 Connections**: So here's an _actual_ Machine Learning implementation, but you'll have seen a lot of parts of the code before!
Train/Test Split
train, test = train_test_split(classifier_in, test_size=0.2, random_state=42)
print(f"Train contains {train.shape[0]:,} records.")
print(f"Test contains {test.shape[0]:,} records.")

y_train = train.room_class
X_train = train.drop(columns=['room_class','room_type'])

y_test = test.room_class
X_test = test.drop(columns=['room_class','room_type'])
Classifier Setup
rfc = RandomForestClassifier(
        max_depth=8,
        min_samples_split=7,
        n_jobs=4,
        random_state=42
      )
Fit and Predict
rfc.fit(X_train, y_train)
y_hat = rfc.predict(X_test)
Validate
Confusion Matrix
c_matrix = pd.DataFrame(confusion_matrix(y_test, y_hat))
c_matrix.index = le.inverse_transform(c_matrix.index)
c_matrix.columns = le.inverse_transform(c_matrix.columns)
c_matrix
Feature Importance
Compare the Random Forest’s built-in ‘feature importance’ with Permutation Feature Importance as documented here.
**🔗 Connections**: This next section is the reason you shouldn't blindly run ML algorithms on your data. It turns out that the Random Forest is seriously affected by the scale of different variables, and of binary variables in particular. You will tend to get erroneous feature importance values back from `sklearn`'s RF implementation and should _normally_ look at Permutation Feature Importance values instead. But this little demonstration also shows (above) a more subtle issue: imbalanced data. There are far fewer hotels than there are private rooms, and far fewer of _those_ than there are entire home/apt listings in the sample. So you'll see that the RF has trouble predicting the classes correctly: that's because with a data set like this it's hard to do _better_ than just predicting entire home/apt _Every Single Time_.
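You can check that imbalance for yourself in one line:

```python
# Share of each room type in the data fed to the classifier
print(classifier_in.room_type.value_counts(normalize=True))
```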
mdi_importances = pd.Series(
    rfc.feature_importances_, index=rfc.feature_names_in_
).sort_values(ascending=True)

ax = mdi_importances.plot.barh()
ax.set_title("Random Forest Feature Importances (MDI)")
ax.figure.tight_layout()
Permutation Feature Importance
result = permutation_importance(
    rfc, X_test, y_test, n_repeats=10, random_state=42, n_jobs=2
)

sorted_importances_idx = result.importances_mean.argsort()
importances = pd.DataFrame(
    result.importances[sorted_importances_idx].T,
    columns=X_test.columns[sorted_importances_idx],
)
ax = importances.plot.box(vert=False, whis=10)
ax.set_title("Permutation Importances (Test Set)")
ax.axvline(x=0, color="k", linestyle="--")
ax.set_xlabel("Decrease in accuracy score")
ax.figure.tight_layout()
Shapley Values
Shapley values are a big part of explainable AI and they work (very broadly) by permuting the data to explore how sensitive the predictions made by the model are to the results that you see. For these we need to install two libraries: `shap` (to do the heavy lifting) and `slicer` to deal with the data.
Install Libraries
We should now have this already available in Docker, but just in case…
try:
    import shap
except ModuleNotFoundError:
    ! pip install slicer shap
    import shap
Check for Data Types
You are looking for anything other than `int64` or `float64` for the most part. Boolean should be fine, but pandas’ internal, nullable integer type will give you a `ufunc` error.
X_test.info()
X_test['beds'] = X_test.beds.astype('int')
Plot Partial Dependence
shap.partial_dependence_plot("price", rfc.predict, X_test, ice=False,
                             model_expected_value=True, feature_expected_value=True
)
Calculate Shapley Values
This can take a long time: 4-5 hours (!!!) without developing a strategy for tackling it. See the long discussion here. I’ve taken the approach of subsetting the data substantially (the model is already trained so it won’t impact the model’s predictions) with a 20% fraction of the test data and an explainer sample of 5%. On my laptop the ‘Permutation explainer’ stage took about 14 minutes, but your results may obviously be rather different.
Xsample   = shap.utils.sample(X_test.sample(frac=0.2, random_state=41), 10)
explainer = shap.Explainer(rfc.predict, Xsample)
Now we calculate the shap values for the 5% sample from `X_test`.
**⚠ Warning**: This next block is the one that takes a long time to run. I got between 3mn and 4mn.
%%time
shap_values = explainer(X_test.sample(frac=.05, random_state=42))
Single Observation
Now you can take any random record (`sample_ind`) and produce a shap plot to show the role that each attribute played in its classification. Note that getting these plots to save required some searching on GitHub.
sample_ind=250
shap.plots.waterfall(shap_values[sample_ind], max_display=14, show=False);
plt.title(f"Shapley values for observation #{sample_ind} ({X_test.sample(frac=.05, random_state=42).iloc[sample_ind].name})")
plt.tight_layout()
#plt.savefig('practical-09-waterfall.png', dpi=150)
All Observations
shap.plots.beeswarm(shap_values, show=False)
plt.title(f"Shapley Swarm Plot for Sample")
plt.tight_layout()
plt.savefig('practical-09-swarm.png', dpi=150)
Wrap-Up
- Find the appropriate eps value: Nearest Neighbour Distance Functions or Interevent Distance Functions
- Clustering Points
- Regionalisation algorithms with Agglomerative Clustering
You’ve reached the end, you’re done…
Er, no. This is barely scratching the surface! I’d suggest that you go back through the above code and do three things:
1. Add a lot more comments to the code to ensure that you really have understood what is going on.
2. Try playing with some of the parameters (e.g. my thresholds for skew, or non-normality) and seeing how your results change.
3. Try outputting additional plots that will help you to understand the quality of your clustering results (e.g. what is the makeup of cluster 1? Or 6? What has it picked up? What names would I give these clusters?).
If all of that seems like a lot of work then why not learn a bit more about machine learning before calling it a day?