import os
import re
import numpy as np
import pandas as pd
import geopandas as gpd
import seaborn as sns
import matplotlib.cm as cm
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import PowerTransformer
import umap
from kneed import knee_locator
Practical 11: Dimensions in Data
Transformation & Dimensionality Reduction
In this session the focus is on MSOA-level Census data from 2011. We’re going to explore this as a possible complement to the InsideAirbnb data. Although it’s not ideal to pair 2011 data with data scraped from Airbnb this year, we:
- Have little choice as the 2021 data is only just starting to come through from the Census and the London Data Store hasn’t been updated (still!); and
- Could usefully do a bit of thinking about whether the situation in 2011 might in some way help us to ‘predict’ the situation now…
Ultimately, however, you don’t need to use this for your analysis; this practical is intended as a demonstration of how transformation and dimensionality reduction work in practice and the kinds of issues that come up.
There are a lot of links across sessions now, as well as some forward links to stuff we’ve not yet covered (see: pandas.merge
). We’ll pick these up as we move through the notebook.
Preamble
Let’s start with the usual bits of code to ensure plotting works, to import packages and load the data into memory.
Notice here that we’ve moved the function from last week’s notebook to a separate file cache.py
from which we import the cache_data
function. This should give you some ideas about how to work from script -> function -> library effectively.
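If you’re curious what such a function might look like, here is a minimal, hypothetical sketch of a download-and-cache helper. The name cache_data_sketch and its internals are assumptions for illustration only; the real cache.py provided with the module differs in its details.

# A minimal, hypothetical sketch of a download-and-cache helper -- the real cache.py differs.
from urllib.request import urlretrieve

def cache_data_sketch(src:str, dest_dir:str) -> str:
    """Download src into dest_dir unless a copy is already there; return the local path."""
    os.makedirs(dest_dir, exist_ok=True)     # Make sure the destination directory exists
    fn   = src.split('?')[0].split('/')[-1]  # Derive a filename from the URL
    path = os.path.join(dest_dir, fn)
    if os.path.exists(path):
        print(f"Found {path} locally!")      # Re-use the cached copy
    else:
        urlretrieve(src, path)               # Otherwise download and save it
    return path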
from cache import cache_data
Loading MSOA Census Data
By this point you should be fairly familiar with the UK’s census geographies as you’ll have encountered them in your GIS module. But in case you need a refresher, here’s what the Office for National Statistics says.
We’re going to mix in the London’s MSOA ‘Atlas’ from the London Data Store. I would strongly suggest that you have a look around the London Data Store as you develop your thinking for the group assessment – you will likely find useful additional data there!
Once you see how we deal with this MSOA Atlas data you will be in a position to work with any other similarly complex (in terms of the headings and indexes) data set. If you’re feeling particularly ambitious you can actually do this same work at the LSOA scale using the LSOA Atlas and LSOA boundaries… the process should be the same, though you will have smaller samples in each LSOA than you do in the MSOAs and calculations will take a bit longer to complete. You could also search on the ONS Census web site for data from 2021.
There is a CSV file for the MSOA Atlas that would be easier to work with; however, the Excel file is useful for demonstrating how to work with multi-level indexes (an extension of last week’s work). Notice that below we do two new things when reading the XLS file:
- We have to specify a sheet name because the file contains multiple sheets.
- We have to specify not just one header, we actually have to specify three of them which generates a multi-level index (row 0 is the top-level, row 1 is the second-level, etc.).
Load MSOA Excel File
You might like to load the cached copy of the file into Excel so that you can see how the next bit works. You can find the rest of the MSOA Atlas here.
src_url   = 'https://data.london.gov.uk/download/msoa-atlas/39fdd8eb-e977-4d32-85a4-f65b92f29dcb/msoa-data.xls'
dest_path = os.path.join('data','msoa')

excel_atlas = pd.read_excel(
    cache_data(src_url, dest_path),
    ???,            # Which sheet is the data in?
    header=[0,1,2]) # Where are the column names... there's three of them!

excel_atlas = pd.read_excel(
    cache_data(src_url, dest_path),
    sheet_name='iadatasheet1', # Which sheet is the data in?
    header=[0,1,2])            # Where are the column names... there's three of them!
Found data/msoa/msoa-data.xls locally!
Size is 2 MB (1,979,904 bytes)
Notice the format of the output and notice that all of the empty cells in the Excel sheet have come through as Unnamed: <col_no>_level_<level_no>
:
excel_atlas.head(1)
Unnamed: 0_level_0 | Unnamed: 1_level_0 | Age Structure (2011 Census) | Mid-year Estimate totals | ... | Road Casualties | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Unnamed: 0_level_1 | Unnamed: 1_level_1 | All Ages | 0-15 | 16-29 | 30-44 | 45-64 | 65+ | Working-age | All Ages | ... | 2010 | 2011 | 2012 | ||||||||
MSOA Code | MSOA Name | Unnamed: 2_level_2 | Unnamed: 3_level_2 | Unnamed: 4_level_2 | Unnamed: 5_level_2 | Unnamed: 6_level_2 | Unnamed: 7_level_2 | Unnamed: 8_level_2 | 2002 | ... | Slight | 2010 Total | Fatal | Serious | Slight | 2011 Total | Fatal | Serious | Slight | 2012 Total | |
0 | E02000001 | City of London 001 | 7375.0 | 620.0 | 1665.0 | 2045.0 | 2010.0 | 1035.0 | 5720.0 | 7280.0 | ... | 334.0 | 374.0 | 0.0 | 46.0 | 359.0 | 405.0 | 2.0 | 51.0 | 361.0 | 414.0 |
1 rows × 207 columns
print(f"Shape of the MSOA Atlas data frame is: {excel_atlas.shape[0]:,} x {excel_atlas.shape[1]:,}")
You should get: Shape of the MSOA Atlas data frame is: 984 x 207
, but how on earth are you going to access the data?
Accessing MultiIndexes
The difficulty is conceptual, not technical.
Until now we have understood the pandas index as a single column-like ‘thing’ in a data frame, but pandas also supports hierarchical and grouped indexes that allow us to interact with data in more complex ways… should we need it. Generally:
- MultiIndex == hierarchical index on columns
- DataFrameGroupBy == iterable pseudo-hierarchical index on rows
We’ll be looking at Grouping Data in much more detail next week, so the main thing to remember is that grouping is for rows, while multi-indexing is about columns.
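Here is a tiny, made-up illustration of that distinction; toy and grp are hypothetical frames, not the Census data:

# Toy data only, just to show the two objects side by side.
toy = pd.DataFrame({('Age','0-15'): [620, 1751], ('Age','16-29'): [1665, 1277]},
                   index=['E02000001','E02000002'])
print(type(toy.columns))   # a pandas MultiIndex on the *columns*
print(toy['Age'])          # a column-slice using the top level

grp = pd.DataFrame({'Borough': ['Camden','Camden','Brent'], 'n': [1,2,3]}).groupby('Borough')
print(type(grp))           # a DataFrameGroupBy: an iterable grouping of the *rows*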
Direct Access
Of course, one way to get at the data is to use .iloc[...]
since that refers to columns by position and ignores the complexity of the index. Try printing out the first five rows of the first column using iloc
:
excel_atlas.iloc[???]
You should get:
0 E02000001
1 E02000002
2 E02000003
3 E02000004
4 E02000005
Name: (Unnamed: 0_level_0, Unnamed: 0_level_1, MSOA Code), dtype: object
Named Access
But to do it by name is a little trickier:
excel_atlas.columns.tolist()[:5]
[('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'MSOA Code'),
('Unnamed: 1_level_0', 'Unnamed: 1_level_1', 'MSOA Name'),
('Age Structure (2011 Census)', 'All Ages', 'Unnamed: 2_level_2'),
('Age Structure (2011 Census)', '0-15', 'Unnamed: 3_level_2'),
('Age Structure (2011 Census)', '16-29', 'Unnamed: 4_level_2')]
Notice how asking for the first five columns has given us a list of… what exactly?
So to get the same output by column name what do you need to copy from above:
excel_atlas.loc[0:5, ???]

excel_atlas.loc[0:5, ('Unnamed: 0_level_0','Unnamed: 0_level_1','MSOA Code')]
0 E02000001
1 E02000002
2 E02000003
3 E02000004
4 E02000005
5 E02000007
Name: (Unnamed: 0_level_0, Unnamed: 0_level_1, MSOA Code), dtype: object
The answer is really awkward, so we’re going to look for a better way…
Grouped Access
Despite this, one way that MultiIndexes can be useful is for accessing column-slices from a ‘wide’ dataframe. We can, for instance, select all of the Age Structure columns in one go and it will be simpler than what we did above.
excel_atlas.loc[0:5, ('Age Structure (2011 Census)')]
All Ages | 0-15 | 16-29 | 30-44 | 45-64 | 65+ | Working-age | |
---|---|---|---|---|---|---|---|
Unnamed: 2_level_2 | Unnamed: 3_level_2 | Unnamed: 4_level_2 | Unnamed: 5_level_2 | Unnamed: 6_level_2 | Unnamed: 7_level_2 | Unnamed: 8_level_2 | |
0 | 7375.0 | 620.0 | 1665.0 | 2045.0 | 2010.0 | 1035.0 | 5720.0 |
1 | 6775.0 | 1751.0 | 1277.0 | 1388.0 | 1258.0 | 1101.0 | 3923.0 |
2 | 10045.0 | 2247.0 | 1959.0 | 2300.0 | 2259.0 | 1280.0 | 6518.0 |
3 | 6182.0 | 1196.0 | 1277.0 | 1154.0 | 1543.0 | 1012.0 | 3974.0 |
4 | 8562.0 | 2200.0 | 1592.0 | 1995.0 | 1829.0 | 946.0 | 5416.0 |
5 | 8791.0 | 2388.0 | 1765.0 | 1867.0 | 1736.0 | 1035.0 | 5368.0 |
Understanding Levels
This works because the MultiIndex tracks the columns using levels, with level 0 at the ‘top’ and level 2 (in our case) at the bottom. These are the unique values for the top level (‘row 0’):
excel_atlas.columns.levels[0]
Index(['Adults in Employment (2011 Census)', 'Age Structure (2011 Census)',
'Car or van availability (2011 Census)',
'Central Heating (2011 Census)', 'Country of Birth (2011)',
'Dwelling type (2011)', 'Economic Activity (2011 Census)',
'Ethnic Group (2011 Census)', 'Health (2011 Census)', 'House Prices',
'Household Composition (2011)', 'Household Income Estimates (2011/12)',
'Household Language (2011)', 'Households (2011)', 'Incidence of Cancer',
'Income Deprivation (2010)', 'Land Area', 'Life Expectancy',
'Lone Parents (2011 Census)', 'Low Birth Weight Births (2007-2011)',
'Mid-year Estimate totals', 'Mid-year Estimates 2012, by age',
'Obesity', 'Population Density', 'Qualifications (2011 Census)',
'Religion (2011)', 'Road Casualties', 'Tenure (2011)',
'Unnamed: 0_level_0', 'Unnamed: 1_level_0'],
dtype='object')
These are the values for those levels across the actual columns in the data frame; notice the repeated ‘Age Structure (2011 Census)’:
excel_atlas.columns.get_level_values(0)[:10]
Index(['Unnamed: 0_level_0', 'Unnamed: 1_level_0',
'Age Structure (2011 Census)', 'Age Structure (2011 Census)',
'Age Structure (2011 Census)', 'Age Structure (2011 Census)',
'Age Structure (2011 Census)', 'Age Structure (2011 Census)',
'Age Structure (2011 Census)', 'Mid-year Estimate totals'],
dtype='object')
And here are the values for the second level of the index (‘row 1’ in the Excel file):
excel_atlas.columns.get_level_values(1)[:10]
Index(['Unnamed: 0_level_1', 'Unnamed: 1_level_1', 'All Ages', '0-15', '16-29',
'30-44', '45-64', '65+', 'Working-age', 'All Ages'],
dtype='object')
By extension, if we drop a level 0 index then all of the columns that it supports at levels 1 and 2 are also dropped: so when we drop Mid-year Estimate totals
from level 0 then all 11 of the ‘Mid-year Estimate totals (2002…2012)’ columns are dropped in one go.
excel_atlas[['Mid-year Estimate totals']].head(3)
Mid-year Estimate totals | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
All Ages | |||||||||||
2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | |
0 | 7280.0 | 7115.0 | 7118.0 | 7131.0 | 7254.0 | 7607.0 | 7429.0 | 7472.0 | 7338.0 | 7412.0 | 7604.0 |
1 | 6333.0 | 6312.0 | 6329.0 | 6341.0 | 6330.0 | 6323.0 | 6369.0 | 6570.0 | 6636.0 | 6783.0 | 6853.0 |
2 | 9236.0 | 9252.0 | 9155.0 | 9072.0 | 9144.0 | 9227.0 | 9564.0 | 9914.0 | 10042.0 | 10088.0 | 10218.0 |
test = excel_atlas.drop(columns=['Mid-year Estimate totals'], axis=1, level=0)

print(f"Excel source had {excel_atlas.shape[1]} columns.")
print(f"Test now has {test.shape[1]} columns.")
Excel source had 207 columns.
Test now has 196 columns.
# Tidy up if the variable exists
if 'test' in locals():
    del(test)
Questions
- What data type is used for storing/accessing MultiIndexes?
- Why is this the appropriate data type?
- How (conceptually) are the header rows in Excel mapped on to levels in pandas?
Tidying Up
There’s a lot of dealing with column names in what follows.
Dropping Named Levels
There’s a lot of data in the data frame that we don’t need for our Airbnb work, so let’s go a bit further with the dropping of column-groups using the MultiIndex.
to_drop = ['Mid-year Estimate totals','Mid-year Estimates 2012, by age','Religion (2011)',
           'Land Area','Lone Parents (2011 Census)','Central Heating (2011 Census)','Health (2011 Census)',
           'Low Birth Weight Births (2007-2011)','Obesity','Incidence of Cancer','Life Expectancy',
           'Road Casualties']

tidy = excel_atlas.drop(to_drop, axis=1, level=0)
print(f"Shape of the MSOA Atlas data frame is now: {tidy.shape[0]} x {tidy.shape[1]}")
Shape of the MSOA Atlas data frame is now: 984 x 111
This should drop you down to 984 x 111. Notice below that the categories recorded on the multi-level index have not changed, but the level-0 values actually remaining in the data frame have!
print(f"There are {len(tidy.columns.levels[0].unique())} categories.") # The categories
print(f"But only {len(tidy.columns.get_level_values(0).unique())} values.") # The actual values
There are 30 categories.
But only 18 values.
Selecting Columns using a List Comprehension
Now we need to drop all of the percentages from the data set. These can be found at level 1, though they are specified in a number of different ways so you’ll need to come up with a way to find them in the level 1 values using a list comprehension…
I’d suggest looking for: “(%)”, “%”, and “Percentages”. You may need to check both start and end of the string. You could also use a regular expression here instead of multiple tests. Either way works, but have a think about the tradeoffs between intelligibility, speed, and what you understand…
Selection using multiple logical tests:
to_drop = [x for x in tidy.columns.get_level_values(1) if (???)]
print(to_drop)
Selection using a regular expression:
print([x for x in tidy.columns.get_level_values(1) if re.search(???, x)])
to_drop = [x for x in tidy.columns.get_level_values(1) if (
    x.endswith("(%)") or x.startswith("%") or x.endswith("Percentages") or x.endswith("%"))]
print(to_drop)
['Percentages', 'Percentages', 'Percentages', 'Percentages', 'Percentages', 'White (%)', 'Mixed/multiple ethnic groups (%)', 'Asian/Asian British (%)', 'Black/African/Caribbean/Black British (%)', 'Other ethnic group (%)', 'BAME (%)', 'United Kingdom (%)', 'Not United Kingdom (%)', '% of people aged 16 and over in household have English as a main language', '% of households where no people in household have English as a main language', 'Owned: Owned outright (%)', 'Owned: Owned with a mortgage or loan (%)', 'Social rented (%)', 'Private rented (%)', 'Household spaces with at least one usual resident (%)', 'Household spaces with no usual residents (%)', 'Whole house or bungalow: Detached (%)', 'Whole house or bungalow: Semi-detached (%)', 'Whole house or bungalow: Terraced (including end-terrace) (%)', 'Flat, maisonette or apartment (%)', 'Economically active %', 'Economically inactive %', '% of households with no adults in employment: With dependent children', '% living in income deprived households reliant on means tested benefit', '% of people aged over 60 who live in pension credit households', 'No cars or vans in household (%)', '1 car or van in household (%)', '2 cars or vans in household (%)', '3 cars or vans in household (%)', '4 or more cars or vans in household (%)']
print([x for x in tidy.columns.get_level_values(1) if re.search("(?:%|Percentages)", x)])
['Percentages', 'Percentages', 'Percentages', 'Percentages', 'Percentages', 'White (%)', 'Mixed/multiple ethnic groups (%)', 'Asian/Asian British (%)', 'Black/African/Caribbean/Black British (%)', 'Other ethnic group (%)', 'BAME (%)', 'United Kingdom (%)', 'Not United Kingdom (%)', '% of people aged 16 and over in household have English as a main language', '% of households where no people in household have English as a main language', 'Owned: Owned outright (%)', 'Owned: Owned with a mortgage or loan (%)', 'Social rented (%)', 'Private rented (%)', 'Household spaces with at least one usual resident (%)', 'Household spaces with no usual residents (%)', 'Whole house or bungalow: Detached (%)', 'Whole house or bungalow: Semi-detached (%)', 'Whole house or bungalow: Terraced (including end-terrace) (%)', 'Flat, maisonette or apartment (%)', 'Economically active %', 'Economically inactive %', '% of households with no adults in employment: With dependent children', '% living in income deprived households reliant on means tested benefit', '% of people aged over 60 who live in pension credit households', 'No cars or vans in household (%)', '1 car or van in household (%)', '2 cars or vans in household (%)', '3 cars or vans in household (%)', '4 or more cars or vans in household (%)']
With both you should get:
['Percentages',
'Percentages',
'Percentages',
'Percentages',
'Percentages',
'White (%)',
'Mixed/multiple ethnic groups (%)',
'Asian/Asian British (%)',
'Black/African/Caribbean/Black British (%)',
'Other ethnic group (%)',
'BAME (%)',
'United Kingdom (%)',
'Not United Kingdom (%)',
'% of people aged 16 and over in household have English as a main language',
'% of households where no people in household have English as a main language',
'Owned: Owned outright (%)',
'Owned: Owned with a mortgage or loan (%)',
'Social rented (%)',
'Private rented (%)',
'Household spaces with at least one usual resident (%)',
'Household spaces with no usual residents (%)',
'Whole house or bungalow: Detached (%)',
'Whole house or bungalow: Semi-detached (%)',
'Whole house or bungalow: Terraced (including end-terrace) (%)',
'Flat, maisonette or apartment (%)',
'Economically active %',
'Economically inactive %',
'% of households with no adults in employment: With dependent children',
'% living in income deprived households reliant on means tested benefit',
'% of people aged over 60 who live in pension credit households',
'No cars or vans in household (%)',
'1 car or van in household (%)',
'2 cars or vans in household (%)',
'3 cars or vans in household (%)',
'4 or more cars or vans in household (%)']
See how regular expressions keep coming baaaaaaaaack? That said, you can also often make use of simple string functions like startswith
and endswith
for this problem.
Drop by Level
You now need to drop these columns using the level
keyword as part of your drop command. You have plenty of examples of how to drop values in place, but I’d suggest first getting the command correct (maybe duplicate the cell below and change the code so that the result is saved to a dataframe called test
before overwriting tidy
?) and then saving the change.
tidy = tidy.drop(to_drop, axis=1, level=???)
print(f"Shape of the MSOA Atlas data frame is now: {tidy.shape[0]} x {tidy.shape[1]}")

tidy.drop(to_drop, axis=1, level=1, inplace=True)
print(f"Shape of the MSOA Atlas data frame is now: {tidy.shape[0]} x {tidy.shape[1]}")
Shape of the MSOA Atlas data frame is now: 984 x 76
The data frame should now be 984 x 76
. This is a bit more manageable though still a lot of data columns. Depending on what you decide to do for your final project you might want to revisit some of the columns that we dropped above…
Flattening the Index
Although this is a big improvement, you’ll have trouble saving or linking this data to other inputs. The problem is that Level 2 of the multi-index is mainly composed of ‘Unnamed’ values, so we need to merge it with Level 1 to simplify our data frame, and then merge that with Level 0…
tidy.columns.values[:3]
array([('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'MSOA Code'),
('Unnamed: 1_level_0', 'Unnamed: 1_level_1', 'MSOA Name'),
('Age Structure (2011 Census)', 'All Ages', 'Unnamed: 2_level_2')],
dtype=object)
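Before the full renaming loop below, it’s worth knowing the generic pattern for flattening a MultiIndex: join the non-‘Unnamed’ parts of each column tuple. This sketch keeps the original (long) labels; the loop that follows also shortens them.

# A generic one-liner (sketch): join the informative parts of each column tuple.
flat = ['-'.join([l for l in c if not l.startswith('Unnamed')]) for c in tidy.columns.values]
flat[:3] # e.g. ['MSOA Code', 'MSOA Name', 'Age Structure (2011 Census)-All Ages']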
Let’s use code to sort this out!
new_cols = []
for c in tidy.columns.values:
    #print(f"Column label: {c}")
    l1 = f"{c[0]}"
    l2 = f"{c[1]}"
    l3 = f"{c[2]}"

    # The new column label
    clabel = ''

    # Assemble new label from the levels
    if not l1.startswith("Unnamed"):
        l1 = l1.replace(" (2011 Census)",'').replace(" (2011)",'').replace("Household ",'').replace("House Prices",'').replace("Car or van availability",'Vehicles').replace(' (2011/12)','')
        l1 = l1.replace('Age Structure','Age').replace("Ethnic Group",'').replace('Dwelling type','').replace('Income Estimates','')
        clabel += l1
    if not l2.startswith("Unnamed"):
        l2 = l2.replace("Numbers",'').replace(" House Price (£)",'').replace("Highest level of qualification: ",'').replace("Annual Household Income (£)",'hh Income').replace('Whole house or bungalow: ','').replace(' qualifications','')
        l2 = l2.replace('At least one person aged 16 and over in household has English as a main language',"1+ English as a main language").replace("No people in household have English as a main language","None have English as main language")
        clabel += ('-' if clabel != '' else '') + l2
    if not l3.startswith("Unnamed"):
        clabel += ('-' if clabel != '' else '') + l3

    # Replace other commonly-occurring verbiage that inflates column name width
    clabel = clabel.replace('--','-').replace(" household",' hh').replace('Owned: ','')

    #clabel = clabel.replace(' (2011 Census)','').replace(' (2011)','').replace('Sales - 2011.1','Sales - 2012')
    #clabel = clabel.replace('Numbers - ','').replace(' (£)','').replace('Car or van availability','Vehicles')
    #clabel = clabel.replace('Household Income Estimates (2011/12) - ','').replace('Age Structure','Age')

    new_cols.append(clabel)

new_cols
['MSOA Code',
'MSOA Name',
'Age-All Ages',
'Age-0-15',
'Age-16-29',
'Age-30-44',
'Age-45-64',
'Age-65+',
'Age-Working-age',
'Households-All Households',
'Composition-Couple hh with dependent children',
'Composition-Couple hh without dependent children',
'Composition-Lone parent hh',
'Composition-One person hh',
'Composition-Other hh Types',
'White',
'Mixed/multiple ethnic groups',
'Asian/Asian British',
'Black/African/Caribbean/Black British',
'Other ethnic group',
'BAME',
'Country of Birth-United Kingdom',
'Country of Birth-Not United Kingdom',
'Language-1+ English as a main language',
'Language-None have English as main language',
'Tenure-Owned outright',
'Tenure-Owned with a mortgage or loan',
'Tenure-Social rented',
'Tenure-Private rented',
'Household spaces with at least one usual resident',
'Household spaces with no usual residents',
'Detached',
'Semi-detached',
'Terraced (including end-terrace)',
'Flat, maisonette or apartment',
'Population Density-Persons per hectare (2012)',
'Median-2005',
'Median-2006',
'Median-2007',
'Median-2008',
'Median-2009',
'Median-2010',
'Median-2011',
'Median-2012',
'Median-2013 (p)',
'Sales-2005',
'Sales-2006',
'Sales-2007',
'Sales-2008',
'Sales-2009',
'Sales-2010',
'Sales-2011',
'Sales-2011.1',
'Sales-2013(p)',
'Qualifications-No',
'Qualifications-Level 1',
'Qualifications-Level 2',
'Qualifications-Apprenticeship',
'Qualifications-Level 3',
'Qualifications-Level 4 and above',
'Qualifications-Other',
'Qualifications-Schoolchildren and full-time students: Age 18 and over',
'Economic Activity-Economically active: Total',
'Economic Activity-Economically active: Unemployed',
'Economic Activity-Economically inactive: Total',
'Economic Activity-Unemployment Rate',
'Adults in Employment-No adults in employment in hh: With dependent children',
'Total Mean hh Income',
'Total Median hh Income',
'Vehicles-No cars or vans in hh',
'Vehicles-1 car or van in hh',
'Vehicles-2 cars or vans in hh',
'Vehicles-3 cars or vans in hh',
'Vehicles-4 or more cars or vans in hh',
'Vehicles-Sum of all cars or vans in the area',
'Vehicles-Cars per hh']
Make sure you understand what is happening here before just moving on to the next thing. Try adding print()
statements if it will help it to make sense. This sort of code comes up a lot in the real world.
tidy.columns = new_cols # <- Blow away complex index, replace with simple
tidy.head()
MSOA Code | MSOA Name | Age-All Ages | Age-0-15 | Age-16-29 | Age-30-44 | Age-45-64 | Age-65+ | Age-Working-age | Households-All Households | ... | Adults in Employment-No adults in employment in hh: With dependent children | Total Mean hh Income | Total Median hh Income | Vehicles-No cars or vans in hh | Vehicles-1 car or van in hh | Vehicles-2 cars or vans in hh | Vehicles-3 cars or vans in hh | Vehicles-4 or more cars or vans in hh | Vehicles-Sum of all cars or vans in the area | Vehicles-Cars per hh | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | E02000001 | City of London 001 | 7375.0 | 620.0 | 1665.0 | 2045.0 | 2010.0 | 1035.0 | 5720.0 | 4385.0 | ... | 38.0 | 59728.481886 | 46788.295472 | 3043.0 | 1100.0 | 173.0 | 51.0 | 18.0 | 1692.0 | 0.385861 |
1 | E02000002 | Barking and Dagenham 001 | 6775.0 | 1751.0 | 1277.0 | 1388.0 | 1258.0 | 1101.0 | 3923.0 | 2713.0 | ... | 319.0 | 31788.185996 | 27058.703760 | 1020.0 | 1186.0 | 424.0 | 66.0 | 17.0 | 2305.0 | 0.849613 |
2 | E02000003 | Barking and Dagenham 002 | 10045.0 | 2247.0 | 1959.0 | 2300.0 | 2259.0 | 1280.0 | 6518.0 | 3834.0 | ... | 268.0 | 43356.931547 | 36834.528738 | 1196.0 | 1753.0 | 691.0 | 155.0 | 39.0 | 3766.0 | 0.982264 |
3 | E02000004 | Barking and Dagenham 003 | 6182.0 | 1196.0 | 1277.0 | 1154.0 | 1543.0 | 1012.0 | 3974.0 | 2318.0 | ... | 122.0 | 46701.436554 | 39668.206433 | 556.0 | 1085.0 | 515.0 | 128.0 | 34.0 | 2650.0 | 1.143227 |
4 | E02000005 | Barking and Dagenham 004 | 8562.0 | 2200.0 | 1592.0 | 1995.0 | 1829.0 | 946.0 | 5416.0 | 3183.0 | ... | 307.0 | 34293.820288 | 29155.683536 | 1080.0 | 1423.0 | 551.0 | 109.0 | 20.0 | 2937.0 | 0.922714 |
5 rows × 76 columns
You might want to have a look at what the code below drops first before just running it… remember that you can pull apart any complex code into pieces:
tidy['MSOA Code'].isna()
tidy[tidy['MSOA Code'].isna()].index
Index([983], dtype='int64')
tidy.drop(index=tidy[tidy['MSOA Code'].isna()].index, inplace=True)
Add Inner/Outer London Mapping
We touched on lambda
functions last week; it’s a ‘trivial’ function that we don’t even want to bother defining with def
. We also used the lambda function in the context of apply
so this is just another chance to remind yourself how this works. This is quite advanced Python, so don’t panic if you don’t get it right away and have to do some Googling…
We want to add the borough name and a ‘subregion’ name. We already have the borough name buried in a separate column, so step 1 is to extract that from the MSOA Name. Step 2 is to use the borough name as a lookup to the subregion name using a lambda function. The format for a lambda function is usually lambda x: <code that does something with x and returns a value>
. Hint: you’ve got a dictionary and you know how to use it!
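As a toy illustration (with made-up values) of the pattern you’ll need:

# A hypothetical lookup applied with a lambda -- same idea as the Subregion mapping below.
lookup = {'Camden': 'Inner West', 'Bexley': 'Outer East and North East'}
pd.Series(['Camden','Bexley','Camden']).apply(lambda x: lookup[x])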
Add Boroughs
We first need to extract the borough names from one of the existing fields in the data frame… a regex that does replacement would be fastest and easiest: focus on what you don’t need from the MSOA Name string and replacing that using a regex…
tidy['Borough'] = tidy['MSOA Name'].???
tidy.Borough.unique()

tidy['Borough'] = tidy['MSOA Name'].str.replace(r' \d+$','',regex=True)
tidy.Borough.unique()
array(['City of London', 'Barking and Dagenham', 'Barnet', 'Bexley',
'Brent', 'Bromley', 'Camden', 'Croydon', 'Ealing', 'Enfield',
'Greenwich', 'Hackney', 'Hammersmith and Fulham', 'Haringey',
'Harrow', 'Havering', 'Hillingdon', 'Hounslow', 'Islington',
'Kensington and Chelsea', 'Kingston upon Thames', 'Lambeth',
'Lewisham', 'Merton', 'Newham', 'Redbridge',
'Richmond upon Thames', 'Southwark', 'Sutton', 'Tower Hamlets',
'Waltham Forest', 'Wandsworth', 'Westminster'], dtype=object)
You should get:
array(['City of London', 'Barking and Dagenham', 'Barnet', 'Bexley',
'Brent', 'Bromley', 'Camden', 'Croydon', 'Ealing', 'Enfield',
'Greenwich', 'Hackney', 'Hammersmith and Fulham', 'Haringey',
'Harrow', 'Havering', 'Hillingdon', 'Hounslow', 'Islington',
'Kensington and Chelsea', 'Kingston upon Thames', 'Lambeth',
'Lewisham', 'Merton', 'Newham', 'Redbridge',
'Richmond upon Thames', 'Southwark', 'Sutton', 'Tower Hamlets',
'Waltham Forest', 'Wandsworth', 'Westminster'], dtype=object)
Map Boroughs to Subregions
And now you need to understand how to apply the mapping to the Borough field using a lambda
function. It’s fairly straightforward once you know the syntax: just a dictionary lookup. But as usual, you might want to first create a new cell and experiment with the output from the apply function before using it to write the Subregion
field of the data frame…
mapping = {}
for b in ['Enfield','Waltham Forest','Redbridge','Barking and Dagenham','Havering','Greenwich','Bexley']:
    mapping[b]='Outer East and North East'
for b in ['Haringey','Islington','Hackney','Tower Hamlets','Newham','Lambeth','Southwark','Lewisham']:
    mapping[b]='Inner East'
for b in ['Bromley','Croydon','Sutton','Merton','Kingston upon Thames']:
    mapping[b]='Outer South'
for b in ['Wandsworth','Kensington and Chelsea','Hammersmith and Fulham','Westminster','Camden','City of London']:
    mapping[b]='Inner West'
for b in ['Richmond upon Thames','Hounslow','Ealing','Hillingdon','Brent','Harrow','Barnet']:
    mapping[b]='Outer West and North West'
tidy['Subregion'] = tidy.Borough.apply(???)

tidy['Subregion'] = tidy.Borough.apply(lambda x: mapping[x])
And Save
There’s a little snippet of useful code to work out here: we need to check whether the clean directory exists inside the data directory; if it doesn’t, then the tidy.to_parquet() call will fail.
if not os.path.exists(os.path.join('data','clean')):
    os.makedirs(os.path.join('data','clean'))
tidy.to_parquet(os.path.join('data','clean','MSOA_Atlas.parquet'))
print("Done.")
Done.
Questions
- What are the advantages of apply and lambda functions over looping and named functions?
- When might you choose a named function over a lambda function?
Merge Data & Geography
We’ll cover joins (of which a merge
is just one type) in the final week’s lectures, but between what you’d done in GIS and what we have here there should be enough here for you to start being able to make sense of how they work so that you don’t have to wait until Week 10 to think about how this could help you with your Group Assessment.
First, we need to download the MSOA source file, which is a zipped archive of a Shapefile:
# Oh look, we can read a Shapefile without needing to unzip it!
msoas = gpd.read_file(
    cache_data('https://github.com/jreades/fsds/blob/master/data/src/Middle_Layer_Super_Output_Areas__December_2011__EW_BGC_V2-shp.zip?raw=true',
               os.path.join('data','geo')), driver='ESRI Shapefile')
Found data/geo/Middle_Layer_Super_Output_Areas__December_2011__EW_BGC_V2-shp.zip locally!
Size is 7 MB (7,381,177 bytes)
Identifying Matching Columns
Looking at the first few columns of each data frame, which one might allow us to link the two files together? You’ve done this in GIS. Remember: the column names don’t need to match for us to use them in a join, it’s the values that matter.
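A tiny, made-up example of the idea that only the values need to match, not the column names:

# Two toy frames whose key columns have different *names* but shared *values*.
left  = pd.DataFrame({'MSOA11CD': ['E02000001','E02000002'], 'area': [2.9, 3.6]})
right = pd.DataFrame({'MSOA Code': ['E02000001','E02000002'], 'pop': [7375, 6775]})
pd.merge(left, right, left_on='MSOA11CD', right_on='MSOA Code', how='inner')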
print(f"Column names: {', '.join(tidy.columns.tolist()[:5])}")
tidy.iloc[:3,:5]
Column names: MSOA Code, MSOA Name, Age-All Ages, Age-0-15, Age-16-29
MSOA Code | MSOA Name | Age-All Ages | Age-0-15 | Age-16-29 | |
---|---|---|---|---|---|
0 | E02000001 | City of London 001 | 7375.0 | 620.0 | 1665.0 |
1 | E02000002 | Barking and Dagenham 001 | 6775.0 | 1751.0 | 1277.0 |
2 | E02000003 | Barking and Dagenham 002 | 10045.0 | 2247.0 | 1959.0 |
Merge
One more thing: if you’ve got more than one choice I’d always go with a code over a name, because one is intended for matching and the other is not…
gdf = pd.merge(msoas, tidy, left_on=???, right_on=???, how='inner')
gdf = gdf.drop(columns=['MSOA11CD','MSOA11NM','OBJECTID'])

print(f"Final MSOA Atlas data frame has shape {gdf.shape[0]:,} x {gdf.shape[1]}")

gdf = pd.merge(msoas, tidy, left_on='MSOA11CD', right_on='MSOA Code', how='inner')
gdf = gdf.drop(columns=['MSOA11CD','MSOA11NM','OBJECTID'])

print(f"Final MSOA Atlas data frame has shape {gdf.shape[0]:,} x {gdf.shape[1]}")
Final MSOA Atlas data frame has shape 983 x 86
You should get Final MSOA Atlas data frame has shape 983 x 86.
Plot Choropleth
Let’s plot the 2011 median house price column (Median-2011) using the plasma
colour ramp… The rest is to show you how to customise a legend.
col = 'Median-2011'
fig = gdf.plot(column=???, cmap='???',
               scheme='FisherJenks', k=7, edgecolor='None',
               legend=True, legend_kwds={'frameon':False, 'fontsize':8},
               figsize=(8,7));
plt.title(col.replace('-',' '));

# Now to modify the legend: googling "geopandas format legend"
# brings me to: https://stackoverflow.com/a/56591102/4041902
leg = fig.get_legend()
leg._loc = 3

for lbl in leg.get_texts():
    label_text = lbl.get_text()
    [low, hi]  = label_text.split(', ')
    new_text   = f'£{float(low):,.0f} - £{float(hi):,.0f}'
    lbl.set_text(new_text)

plt.show();
Save
gdf.to_parquet(os.path.join('data','geo','MSOA_Atlas.geoparquet'))
Questions
- Try changing the colour scheme, classification scheme, and number of classes to see if you feel there’s a better option than the one shown above… Copy the cell (click anywhere outside the code and then hit C to copy), then click on this cell once and hit V to paste.
Splitting into Test & Train
**🔗 Connections**: Here you will be using a standard approach in Machine Learning known as _test/train split_, although we _won't_ be doing any actual ML (😅). Here we are just going to use it to explore some issues raised by normalisation and standardisation in [this week's lectures](https://jreades.github.io/fsds/sessions/week8.html#lectures).
A standard approach in Machine Learning, and something that is becoming more widely used elsewhere, is the splitting of a large data set into testing and training components. Typically, you would take 80-90% of your data to ‘train’ your algorithm and withhold 10-20% for validation (‘testing’). An even ‘stricter’ approach, in the sense of trying to ensure the robustness of your model against outlier effects, is cross-validation, such as k-folds cross-validation.
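We won’t use it in this practical, but for reference a k-folds split looks something like this minimal sketch on dummy data:

# A minimal k-folds sketch on dummy data; with k=5 each fold holds out 20% of the rows in turn.
from sklearn.model_selection import KFold

dummy = np.arange(20).reshape(10, 2)  # ten pretend observations
kf    = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(dummy)):
    print(f"Fold {fold}: {len(train_idx)} training rows, {len(test_idx)} test rows")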
Sci-Kit Learn is probably the most important reason Python has become the de facto language of data science. Test/train split is used to avoid over-fitting when we are trying to predict something; so here Sci-Kit Learn expects that you’ll have an X
which is your predictors (the inputs to your model) and a y
which is the thing you’re trying to predict (because: \(y = \beta X + \epsilon\)).
We’re not building a model here (that’s for Term 2!) so we’ll just ‘pretend’ that we’re trying to predict the price of a listing and will set that up as our y
data set so that we can see how the choice of normalisation/standardisation technique affects the robustness of the model against ‘new’ data. Notice too that you can pass a data frame directly to Sci-Kit Learn and it will split it for you.
Reload
On subsequent runs of this notebook you can just pick up here and skip the data preparation above!
# Notice this handy code: we check whether the data is already
# in memory (i.e. defined in locals()) before re-loading it from disk.
if 'gdf' not in locals():
    gdf = gpd.read_parquet(os.path.join('data','geo','MSOA_Atlas.geoparquet'))
print(gdf.shape)

categoricals = ['Borough','Subregion']
for c in categoricals:
    gdf[c] = gdf[c].astype('category')
Split
For our purposes this is a little bit overkill as you could also use pandas’ sample(frac=0.2)
and the indexes, but it’s useful to see how this works. You use test/train split to get four data sets out: the training data gives you two (predictors + target as separate data sets) and the testing data gives you two as well (predictors + target as separate data sets). These are sized according to the test_size specified in the train_test_split parameters.
from sklearn.model_selection import train_test_split
pdf = gdf['Median-2011'].copy() # pdf for Median *P*rice b/c we need *something*

# df == *data* frame (a.k.a. predictors or independent variables)
# pr == *predicted* value (a.k.a. dependent variable)
# Notice we don't want the median price included in our training data
df_train, df_test, pr_train, pr_test = train_test_split(
    gdf.drop(columns=['Median-2011']), pdf, test_size=0.2, random_state=44)
Below you should see that the data has been split roughly on the basis of the test_size
parameter.
print(f"Original data size: {gdf.shape[0]:,} x {gdf.shape[1]}")
print(f" Training data size: {df_train.shape[0]:,} x {df_train.shape[1]} ({(df_train.shape[0]/gdf.shape[0])*100:.0f}%)")
print(f" Testing data size: {df_test.shape[0]:,} x {df_test.shape[1]} ({(df_test.shape[0]/gdf.shape[0])*100:.0f}%)")
Also notice the indexes of each pair of data sets match:
print(", ".join([str(x) for x in df_train.index[:10]]))
print(", ".join([str(x) for x in pr_train.index[:10]]))
Plot Test/Train Data
boros = gpd.read_file(os.path.join('data','geo','Boroughs.gpkg'))
f,axes = plt.subplots(1,2, figsize=(12,5))
df_train.plot(ax=???)
df_test.plot(ax=???)
boros.plot(ax=???, facecolor='none', edgecolor='r', linewidth=.5, alpha=0.4)
boros.plot(ax=???, facecolor='none', edgecolor='r', linewidth=.5, alpha=0.4)
axes[0].set_title('Training Data')
axes[1].set_title('Testing Data');
axes[0].set_ylim(150000,210000)
axes[1].set_ylim(150000,210000)
axes[0].set_xlim(500000,565000)
axes[1].set_xlim(500000,565000)
axes[1].set_yticks([]);
Questions
- Why might it be useful to produce a map of a test/train split?
- Why might it matter more if you were dealing with user locations or behaviours?
Normalisation
The developers of SciKit-Learn define normalisation as “scaling individual samples to have unit norm.” There are a lot of subtleties to this when you start dealing with ‘sparse’ data, but for the most part it’s worthwhile to think of this as a rescaling of the raw data to have similar ranges in order to achieve some kind of comparison. This is such a common problem that sklearn offers a range of such (re)scalers, including MinMaxScaler.
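For reference, MinMaxScaler rescales each column to the range [0,1] using x' = (x - min) / (max - min). A toy demonstration (made-up values, not our data):

# Toy demonstration of min-max scaling: (x - min) / (max - min).
from sklearn.preprocessing import MinMaxScaler

toy = np.array([[10.], [15.], [30.]])
print(MinMaxScaler().fit_transform(toy).round(2))  # -> [[0.], [0.25], [1.]]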
Let’s see what effect this has on the data!
# Sets some handy 'keywords' to tweak the Seaborn plot
kwds = dict(s=7,alpha=0.95,edgecolor="none")

# Set the *hue order* so that all plots have the *same*
# colour on the Subregion
ho = ['Inner East','Inner West','Outer West and North West','Outer South','Outer East and North East']
Select Columns
One thing you’ll need to explain is why I keep writing df[cols+['Subregion']] and why I don’t just add it to the cols variable at the start. Don’t try to answer this now; get through the rest of Tasks 3 and 4 and see what you think.
cols = ['Tenure-Owned outright', 'Tenure-Owned with a mortgage or loan',
        'Tenure-Social rented', 'Tenure-Private rented']
Answer: one part of the answer is that it makes it easy to change the columns we select without having to remember to keep Subregion
, but the more important reason is that it allows us to re-use this ‘definition’ of cols
elsewhere throughout the rest of this practical without needing to remember to remove Subregion
.
tr_raw  = df_train[cols+['Subregion']].copy() # train raw
tst_raw = df_test[cols+['Subregion']].copy()  # test raw
Fit to Data
Before fitting, make sure you can explain what reshape is doing here.
Fit the training data:
from sklearn.preprocessing import MinMaxScaler
# Notice what this is doing! See if you can explain it clearly.
scalers = [???.fit(???[x].values.reshape(-1,1)) for x in cols]
Apply Transformations
Train:
tr_normed = tr_raw.copy()

for i, sc in enumerate(scalers):
    # Ditto this -- can you explain what this code is doing?
    tr_normed[cols[i]] = sc.transform(df_???[cols[i]].values.reshape(-1,1))
Test:
tst_normed = tst_raw.copy()

for i, sc in enumerate(scalers):
    tst_normed[cols[i]] = sc.transform(df_???[cols[i]].values.reshape(-1,1))
**🔗 Connections**: You don't _have_ to fully understand the next section, but if you are able to get your head around it then this will seriously help you to prepare for the use of more advanced techniques in modelling and programming.
Check out the properties of tst_normed
below. If you’ve understood what the MinMaxScaler
is doing then you should be able to spot something unexpected in the transformed test outputs. If you’ve really understood this, you’ll see why this result is problematic for models. Hint: one way to think of it is an issue of extrapolation.
for c in cols:
    print(f" Minimum: {tst_normed[c].min():.4f}")
    ???
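If it helps, here is the same issue on purely made-up numbers: a scaler fitted to a ‘training’ range can produce values outside [0,1] when applied to data beyond that range.

# Hypothetical numbers only: fit on a 10-30 range, then transform values outside it.
sc = MinMaxScaler().fit(np.array([[10.],[30.]]))
print(sc.transform(np.array([[5.],[40.]])))  # -> [[-0.25], [1.5]] -- beyond the fitted [0,1] range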
Plot Distributions
tr_raw.columns     = [re.sub('(-|/)',"\n",x) for x in tr_raw.columns.values]
tst_raw.columns    = [re.sub('(-|/)',"\n",x) for x in tst_raw.columns.values]
tr_normed.columns  = [re.sub('(-|/)',"\n",x) for x in tr_normed.columns.values]
tst_normed.columns = [re.sub('(-|/)',"\n",x) for x in tst_normed.columns.values]
sns.pairplot(data=tr_raw, hue='Subregion', diag_kind='kde', corner=True, plot_kws=kwds, hue_order=ho);

sns.pairplot(data=tr_normed, hue='Subregion', diag_kind='kde', corner=True, plot_kws=kwds, hue_order=ho);
Questions
- Why do I keep writing df[cols+['Subregion']]? Why don’t I just add Subregion to the cols variable at the start?
- What has changed between the two plots (of tr_raw and tr_normed)?
- What is the potential problem that applying the transformer fitted on the training data (tr_normed) to data from tst_normed might cause? Hint: this is why I asked you to investigate the data in the empty code cell above.
- Can you explain what this is doing: [MinMaxScaler().fit(df_train[x].values.reshape(-1,1)) for x in cols]?
- Can you explain what this is doing: sc.transform(df_test[cols[i]].values.reshape(-1,1))?
Standardisation
Standardisation is typically focussed on rescaling data to have a mean (or median) of 0 and a standard deviation (or IQR) of 1. These approaches are conceptually tied to the idea of symmetric, unimodal data such as that encountered in the standard normal distribution. Rather confusingly, many data scientists will use standardisation and normalisation largely interchangeably!
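Concretely, the two rescalings are z = (x - mean) / s.d. (StandardScaler) and (x - median) / IQR (RobustScaler). A toy comparison with one outlier (made-up values):

# Toy comparison: an outlier drags the mean/s.d. but barely moves the median/IQR.
x   = np.array([1., 2., 3., 4., 100.])
z   = (x - x.mean()) / x.std()                    # what StandardScaler does
iqr = np.percentile(x, 75) - np.percentile(x, 25)
rb  = (x - np.median(x)) / iqr                    # what RobustScaler (25/75) does
print(z.round(2), rb.round(2))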
col = 'Vehicles-No cars or vans in hh'
tr  = df_train[[col]].copy()
tst = df_test[[col]].copy()
Z-Score Standardisation
stsc = StandardScaler().fit(tr[col].values.reshape(-1,1))

tr[f"Z. {col}"]  = stsc.transform(???)
tst[f"Z. {col}"] = stsc.transform(???)
Inter-Quartile Standardisation
rs = ???(quantile_range=(25.0, 75.0)).fit(???)

tr[f"IQR. {col}"]  = rs.transform(???)
tst[f"IQR. {col}"] = rs.transform(???)
Plot Distributions
**🔗 Connections**: The point of these next plots is simply to show that *linear* transformations (which are 'reversible') are about changing things like the magnitude/scale of our data but don't fundamentally change relationships *between* observations.
=tr, x=f"{col}", y=f"Z. {col}", kind='kde'); # hex probably not the best choice sns.jointplot(data
=tr, x=f"{col}", y=f"IQR. {col}", kind='kde'); # hex probably not the best choice sns.jointplot(data
=tr, x=f"Z. {col}", y=f"IQR. {col}", kind='hex'); # hex probably not the best choice sns.jointplot(data
Perhaps a little more useful…
= sns.kdeplot(tr[f"Z. {col}"])
ax f"IQR. {col}"], color='r', ax=ax)
sns.kdeplot(tr[='upper right', labels=['Standard', 'Robust']) # title='Foo'
plt.legend(loc=False, style='plain')
ax.ticklabel_format(useOffset"Standardised Value for No cars or vans in hh") ax.set_xlabel(
Questions?
- Can you see the differences between these two rescalers?
- Can you explain why you might want to choose one over the other?
Non-Linear Transformations
**🔗 Connections**: Now *these* transformations are not directly reversible because they change the actual relationships between observations in the data. We use these when we need to achieve particular 'behaviours' from our data in order to improve our model's 'predictive' ability. Even a simple regression is doing a kind of prediction ($y = \beta X + \epsilon$). You covered these concepts in [this week's lectures](https://jreades.github.io/fsds/sessions/week8.html#lectures).
So transformations are useful when a data series has features that make comparisons or analysis difficult, or that affect our ability to intuit meaningful difference. By manipulating the data using one or more mathematical operations we can sometimes make it more tractable for subsequent analysis. In other words, it’s all about the context of our data.
From above, we know the Median Income data are not normally distributed, but can we work out what distribution best represents Median Income? This can be done by comparing the shape of the histogram to the shapes of theoretical distributions. For example:
- the log-normal distribution
- the exponential distribution
- the Poisson distribution (for non-continuous data)
From looking at those theoretical distributions, we might make an initial guess as to the type of distribution. There are actually many other distributions encountered in real-life data, but these ones are particularly common. A wider view of this would be that quantile and power transformations are ways of preserving the rank of values while losing many of the other features of the relationships that might be preserved by, for instance, the standard scaler.
In the case of Median Income, taking a log-transform of the data might make it appear more normal. Strictly speaking, you do not say that a transformation makes data more normal; you say either that ‘it allows us to treat the data as normally distributed’ or that ‘the transformed data follow a log-normal distribution’.
The Normal Distribution
Z-scores are often associated with the normal distribution because their interpretation implicitly assumes a normal distribution. Or to put it another way… You can always calculate z-scores for your data (it’s just a formula applied to data points), but their intuitive meaning is lost if your data don’t have something like a normal distribution (or follow the 68-95-99.7 rule).
But… what if our data are non-normal? Well, just because data are non-normal doesn’t mean z-scores can’t be calculated (we already did that above); we just have to be careful what we do with them… and sometimes we should just avoid them entirely.
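A quick sketch of that rule on synthetic data (10,000 draws from a standard normal):

# For normally-distributed data roughly 68%/95% of values fall within 1/2 standard deviations.
z = np.random.default_rng(42).normal(size=10_000)
print(f"{(np.abs(z) < 1).mean():.1%} within 1 s.d., {(np.abs(z) < 2).mean():.1%} within 2 s.d.")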
Creating a Normal Distribution
Below is a function to create that theoretical normal distribution. See if you can understand what’s going on and add comments to the code to explain what each line does.
def normal_from_dist(series):
    mu = ??? # mean of our data
    sd = ??? # standard deviation of our data
    n  = ??? # count how many observations are in our data
    s  = np.random.normal(???, ???, ???) # use the parameters of the data just calculated to generate n random numbers drawn from a normal distribution
    return s # return this set of random numbers
To make it easier to understand what the function above is doing, let’s use it! For each variable we want to examine we’ll plot a KDE of our data and then overplot the theoretical normal distribution (in red) on the same figure, doing this in a loop.
Visual Comparisons
Looking at the output, which of the variables has a roughly normal distribution? Another way to think about this question is, for which of the variables are the mean and standard deviation most appropriate as measures of centrality and spread?
Also, how would you determine the meaning of some of the departures from the normal distribution?
selection = [x for x in df_train.columns.values if x.startswith('Composition')]

for c in selection:
    ax = sns.kdeplot(df_train[c])
    sns.kdeplot(normal_from_dist(df_train[c]), color='r', fill=True, ax=ax)
    plt.legend(loc='upper right', labels=['Observed', 'Normal']) # title='Foo'
    ax.ticklabel_format(useOffset=False, style='plain')
    if ax.get_xlim()[1] > 999999:
        plt.xticks(rotation=45)
    plt.show()
Questions
- Which, if any, of the variables has a roughly normal distribution? Another way to think about this question is, for which of the variables are the mean and standard deviation most appropriate as measures of centrality and spread?
- How might you determine the significance of some of the departures from the normal distribution?
Logarithmic Transformations
To create a new series in the data frame containing the natural log of the original value it’s a similar process to what we’ve done before, but since pandas doesn’t provide a log-transform operator (i.e. you can’t call df['MedianIncome'].log()
) we need to use the numpy
package since pandas data series are just numpy arrays with some fancy window dressing that makes them even more useful.
Let’s perform the transform then compare to the un-transformed data. Comment the code below to ensure that you understand what it is doing.
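As a reminder of the numpy side of this, a sketch on made-up values (note that the log of zero or a negative number is undefined, in which case np.log1p can be a safer choice):

# numpy works directly on a pandas Series: natural log and log base 10.
toy = pd.Series([1_000., 10_000., 100_000.])
print(np.log(toy))
print(np.log10(toy))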
Apply and Plot
cols = ['Median-2012','Total Mean hh Income']

for m in cols:
    s  = df_train[m] # s == series
    ts = ???.???(s)  # ts == transformed series

    ax = sns.kdeplot(s)
    sns.kdeplot(normal_from_dist(s), color='r', fill=True, ax=ax)
    plt.legend(loc='upper right', labels=['Observed', 'Normal']) # title also an option
    plt.title("Original Data")

    ### USEFUL FORMATTING TRICKS ###
    # This turns off scientific notation in the ticklabels
    ax.ticklabel_format(useOffset=False, style='plain')
    # Notice this snippet of code
    ax.set_xlabel(ax.get_xlabel() + " (Raw Distribution)")
    # Notice this little code snippet too
    if ax.get_xlim()[1] > 999999:
        plt.xticks(rotation=45)
    plt.show()

    ax = sns.kdeplot(ts)
    sns.kdeplot(normal_from_dist(ts), color='r', fill=True, ax=ax)
    plt.legend(loc='upper right', labels=['Observed', 'Normal'])
    ax.ticklabel_format(useOffset=False, style='plain')
    ax.set_xlabel(ax.get_xlabel() + " (Logged Distribution)")
    if ax.get_xlim()[1] > 999999:
        plt.xticks(rotation=45)
    plt.title("Log-Transformed Data")
    plt.show()
Hopefully, you can see that the transformed data do indeed look ‘more normal’; the peaks of the red and blue lines are closer together, and the blue line at the lower extreme is also closer to the red line, but we can check this by seeing what has happened to the z-scores.
Power Transformations
cols = ['Median-2012','Total Mean hh Income']
pt   = ???(method='yeo-johnson')

for m in cols:
    s  = df_train[m] # s == series
    ts = pt.fit_transform(s.values.reshape(-1,1))
    print(f"Using lambda (transform 'exponent') of {pt.lambdas_[0]:0.5f}")

    ax = sns.kdeplot(ts.reshape(-1,))
    sns.kdeplot(normal_from_dist(???), color='r', fill=True, ax=ax)
    plt.legend(loc='upper right', labels=['Observed', 'Normal'])
    ax.ticklabel_format(useOffset=False, style='plain')
    ax.set_xlabel(m + " (Transformed Distribution)")
    if ax.get_xlim()[1] > 999999: # <-- What does this do?
        plt.xticks(rotation=45)
    plt.title("Power-Transformed Data")
    plt.show();
Principal Components Analysis
**🔗 Connections**: This is all about _dimensionality_ and the different ways that we can reduce dimensionality in our data in order to improve our models' robustness and gain a better understanding of whether (or not) there is structure in our data. We'll start with linear decomposition and then look at non-linear decomposition, which we also demonstrated with UMAP last week.
Now we’re going to ask the question: how can we represent our data using a smaller number of components that capture the variance in the original data? You should have covered PCA in Quantitative Methods.
Optional Reload
Use this if your data gets messy…
gdf = gpd.read_parquet(os.path.join('data','geo','MSOA_Atlas.geoparquet')).set_index('MSOA Code')
print(gdf.shape)

categoricals = ['Borough','Subregion']
for c in categoricals:
    gdf[c] = gdf[c].astype('category')
Removing Columns
To perform dimensionality reduction we can only have numeric data. In theory, categorical data can be converted to numeric and retained, but there are two issues:
- Nominal data has no innate order so we can’t convert > 2 categories to numbers and have to convert them to One-Hot Encoded values.
- A binary (i.e. One-Hot Encoded) variable will account for a lot of variance in the data because its only two values are 0 and 1!
So in practice, it’s probably a good idea to drop categorical data if you’re planning to use PCA.
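If you did want to keep a categorical variable, One-Hot Encoding with pd.get_dummies is the usual route (a sketch on toy values), though for the reasons above we won’t do that here:

# Each category becomes its own 0/1 column.
toy = pd.DataFrame({'Subregion': ['Inner East','Inner West','Inner East']})
print(pd.get_dummies(toy, columns=['Subregion']))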
Drop Totals Columns
pcadf = gdf.drop(columns=['Age-All Ages', 'Households-All Households',
                          'Vehicles-Sum of all cars or vans in the area'])
if 'MSOA Code' in pcadf.columns: # it will already be the index if you used the optional reload above
    pcadf = pcadf.set_index('MSOA Code')
Drop Non-Numeric Columns
pcadf.select_dtypes(['category','object']).columns

pcadf.drop(columns=pcadf.select_dtypes(['category','object']).columns.to_list(), inplace=True)
pcadf.drop(columns=['BNG_E','BNG_N','geometry', 'LONG', 'LAT','Shape__Are', 'Shape__Len'], inplace=True)

pcadf.columns
Rescale & Reduce
In order to ensure that our results aren’t dominated by a single scale (e.g. House Prices!) we need to rescale all of our data. You could easily try different scalers, as well as different parameters, to see what effect this has on your results.
Robustly Rescale
Set up the Robust Rescaler for inter-decile standardisation: 10th and 90th quantiles.
rs = ???

for c in pcadf.columns.values:
    pcadf[c] = rs.fit_transform(pcadf[c].values.reshape(-1, 1))
PCA Reduce
from sklearn.decomposition import PCA
pca = PCA(n_components=50, whiten=True)
pca.fit(pcadf)

explained_variance = pca.explained_variance_ratio_
singular_values    = pca.singular_values_
Examine Explained Variance
x = np.arange(1,???)

plt.plot(x, explained_variance)
plt.ylabel('Share of Variance Explained')
plt.show()
for i in range(0, 20):
print(f"Component {i:>2} accounts for {explained_variance[i]*100:>2.2f}% of variance")
You should get that Component 0 accounts for 31.35% of the variance and Component 19 accounts for 0.37%.
How Many Components?
There are a number of ways that we could set a threshold for dimensionality reduction:
- The most common is to look for the ‘knee’ in the Explained Variance plot above. That would put us at about 5 retained components.
- Another is to just keep all components contributing more than 1% of the variance. That would put us at about 10 components (see the sketch below).
- You can also (I discovered) shuffle the data and repeatedly perform PCA to build confidence intervals. I have not implemented this (yet).
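The first two rules of thumb are easy to sketch in code (the knee-detection version follows below); the 90% cumulative-variance cut-off is an additional, commonly-used variant I’ve added for comparison:

# Rule-of-thumb thresholds: components explaining >1% each, or enough for 90% cumulatively.
n_over_1pct = int((explained_variance > 0.01).sum())
n_for_90pct = int(np.argmax(np.cumsum(explained_variance) >= 0.90) + 1)
print(f"Components explaining >1% of variance: {n_over_1pct}")
print(f"Components needed for 90% cumulative variance: {n_for_90pct}")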
In order to do anything with these components we need to somehow reattach them to the MSOAs. That entails taking the transformed results (X_train) and writing them back as columns of a data frame indexed by MSOA, which is what the next few cells do.
kn = knee_locator.KneeLocator(x, explained_variance,
                              curve='convex', direction='decreasing',
                              interp_method='interp1d')
print(f"Knee detected at: {kn.knee}")
kn.plot_knee()
keep_n_components = 7

# If we weren't changing the number of components we
# could re-use the pca object created above.
pca = PCA(n_components=keep_n_components, whiten=True)

X_train = pca.fit_transform(???)

# Notice that we get the _same_ values out,
# so this is a *deterministic* process that
# is fully replicable (allowing for algorithmic
# and programming language differences).
print(f"Total explained variance: {pca.explained_variance_ratio_.sum()*100:2.2f}%")
for i in range(0, keep_n_components):
    print(f"  Component {i:>2} accounts for {pca.explained_variance_ratio_[i]*100:>5.2f}% of variance")
# Notice...
print(f"X-train shape: {len(X_train)}")
print(f"PCA df shape: {pcadf.shape[0]}")
# So each observation has a row in X_train and there is
# 1 column for each component. This defines the mapping
# of the original data space into the reduced one
print(f"Row 0 of X-train contains {len(X_train[0])} elements.")
Components to Columns
You could actually do this more quickly (but less clearly) using X_train.T
to transpose the matrix!
for i in range(0,keep_n_components):
    s = pd.Series(X_train[:,???], index=pcadf.???)
    pcadf[f"Component {i+1}"] = s
pcadf.sample(3).iloc[:,-10:-4]
(Re)Attaching GeoData
msoas = gpd.read_file(os.path.join('data','geo','Middle_Layer_Super_Output_Areas__December_2011__EW_BGC_V2-shp.zip'), driver='ESRI Shapefile')
msoas = msoas.set_index('MSOA11CD')
print(msoas.columns)
msoas.head(1)
pcadf.head(1)
gpcadf = pd.merge(msoas, pcadf, left_index=True, right_index=True, how='inner') # msoas already has MSOA11CD as its index
print(f"Geo-PCA df has shape {gpcadf.shape[0]} x {gpcadf.shape[1]}")
You should get Geo-PCA df has shape 983 x 89.
gpcadf['Borough'] = gpcadf.MSOA11NM.apply(???)
Map the First n Components
How would you automate this so that the loop creates one plot for each of the first 3 components? How do you interpret these?
for comp in [f"Component {x}" for x in range(1,3)]:
= gpcadf.plot(column=???, cmap='plasma',
ax ='FisherJenks', k=7, edgecolor='None', legend=True, figsize=(9,7));
scheme=ax, edgecolor='w', facecolor='none', linewidth=1, alpha=0.7)
boros.plot(axf'PCA {comp}') ax.set_title(
Your first component map should look something like this:
UMAP
UMAP is a non-linear dimensionality reduction technique. Technically, it’s called manifold learning (imagine being able to roll a piece of paper up in more than just the 3rd dimension…). As a way to see if there is structure in your data this is a much better technique than one you might encounter in many tutorials: t-SNE. It has to do with how the two techniques ‘learn’ the manifold to use with your data.
UMAP on Raw Data
from umap import UMAP
# You might want to experiment with all
# 3 of these values -- it may make sense
# to package a lot of this up into a function!
keep_dims = 2
rs        = 42

u = UMAP(
    n_neighbors=25,
    min_dist=0.01,
    n_components=keep_dims,
    random_state=rs)

X_embedded = u.fit_transform(???)
print(X_embedded.shape)
Write to Data Frame
Can probably also be solved using X_embedded.T
.
for ix in range(0,X_embedded.shape[1]):
    print(ix)
    s = pd.Series(X_embedded[:,???], index=pcadf.???)
    gpcadf[f"Dimension {ix+1}"] = s
Visualise!
rddf = gpcadf.copy() # Reduced Dimension Data Frame
Simple Scatter
f,ax = plt.subplots(1,1,figsize=(8,6))
sns.scatterplot(x=rddf[???], y=rddf[???], hue=rddf['Borough'], legend=False, ax=ax)
Seaborn Jointplot
That is suggestive of there being structure in the data, but with 983 data points and 33 colours it’s hard to make sense of what the structure might imply. Let’s try this again using the Subregion instead and taking advantage of the Seaborn visualisation library’s jointplot
(joint distribution plot):
rddf['Subregion'] = rddf.Borough.apply(lambda x: mapping[x])
# Sets some handy 'keywords' to tweak the Seaborn plot
kwds = dict(s=7,alpha=0.95,edgecolor="none")
# Set the *hue order* so that all plots have some colouring by Subregion
ho = ['Inner East','Inner West','Outer West and North West','Outer South','Outer East and North East']
g = sns.jointplot(data=rddf, x=???, y=???, height=8,
                  hue=???, hue_order=ho, joint_kws=kwds)
g.ax_joint.legend(loc='upper right', prop={'size': 8});
Your jointplot should look like this:
What do you make of this?
Maybe let’s give this one last go splitting the plot out by subregion so that we can see how these vary:
for r in rddf.Subregion.unique():
    g = sns.jointplot(data=rddf[rddf.Subregion==r], x='Dimension 1', y='Dimension 2',
                      hue='Borough', joint_kws=kwds)
    g.ax_joint.legend(loc='upper right', prop={'size': 8});
    g.ax_joint.set_ylim(0,15)
    g.ax_joint.set_xlim(0,15)
    plt.suptitle(r)
We can’t unfortunately do any clustering at this point to create groups from the data (that’s next week!), so for now note that there are several large-ish groups (in terms of membership) and a few small ones picked up by UMAP. Also note that there is strong evidence of some incipient structure: Inner East and West largely clump together, while Outer East and Outer South also seem to group together, with Outer West being more distinctive. If you look back at the PCA Components (especially #1) you might be able to speculate about some reasons for this! Please note: this is only speculation at this time!
Next week we’ll also add the listings data back in as part of the picture!
Map the n Dimensions
for comp in [f"Dimension {x}" for x in range(1,3)]:
= plt.subplots(1,1,figsize=(12,8))
f, ax ;
rddf.plot(???)='w', facecolor='none', linewidth=1, alpha=0.7, ax=ax)
boros.plot(edgecolorf'UMAP {comp}') ax.set_title(
Your first dimension map should look something like this:
And Save
rddf.to_parquet(os.path.join('data','clean','Reduced_Dimension_Data.geoparquet'))
Questions
- How would you compare/contrast PCA components with UMAP dimensions? Why do they not seem to show the same thing even though both seem to show something?
- What might you do with the output of either the PCA or UMAP processes?
Credits!
Contributors:
The following individuals have contributed to these teaching materials: Jon Reades (j.reades@ucl.ac.uk).
License
These teaching materials are licensed under a mix of The MIT License and the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license.
Potential Dependencies:
This notebook may depend on the following libraries: pandas, geopandas, sklearn, matplotlib, seaborn