{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Practical 11: Dimensions in Data\n", "\n", "Transformation & Dimensionality Reduction" ], "id": "c59bda42-e0d4-42e3-a86f-69338adc663f" }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/html" }, "source": [ "" ], "id": "7e4e3a0d-49d4-4038-8753-972316d480a8" }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this session the focus is on MSOA-level Census data from 2011. We’re\n", "going to explore this as a *possible* complement to the InsideAirbnb\n", "data. Although it’s not ideal to use 2011 data with scraped from Airbnb\n", "this year, we:\n", "\n", "1. Have little choice as the 2021 data is only just starting to come\n", " through from the Census and the London Data Store hasn’t been\n", " updated (still!); and\n", "2. Could usefully do a bit of thinking about whether the situation in\n", " 2011 might in some way help us to ‘predict’ the situation now…\n", "\n", "Ultimately, however, you don’t *need* to use this for your analysis,\n", "this practical is intended as a demonstration of how transformation and\n", "dimensionality reduction work in practice and the kinds of issues that\n", "come up.\n", "\n", "> **🔗 Connections**\n", ">\n", "> There are a *lot* of links across sessions now, as well as some\n", "> *forward links* to stuff we’ve not yet covered (see: `pandas.merge`).\n", "> We’ll pick these up as we move through the notebook.\n", "\n", "## 1. Preamble\n", "\n", "Let’s start with the usual bits of code to ensure plotting works, to\n", "import packages and load the data into memory." ], "id": "90f9f4be-7149-4d53-a5ce-6321aab91361" }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import re\n", "import numpy as np\n", "import pandas as pd\n", "import geopandas as gpd\n", "import seaborn as sns\n", "\n", "import matplotlib.cm as cm\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.preprocessing import RobustScaler\n", "from sklearn.preprocessing import PowerTransformer\n", "\n", "import umap\n", "from kneed import knee_locator" ], "id": "8ab76480" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice here that we’ve moved the function from last week’s notebook to a\n", "separate file `cache.py` from which we import the `cache_data` function.\n", "This should give you some ideas about how to work from script -\\>\n", "function -\\> library effectively." ], "id": "59040825-4e6f-4fee-a324-b53600eeb6c0" }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from cache import cache_data" ], "id": "bc6fa6a3" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Loading MSOA Census Data\n", "\n", "> **🔗 Connections**\n", ">\n", "> By this point you should be fairly familiar with the UK’s census\n", "> geographies as you’ll have encountered them in your GIS module. But in\n", "> case you need a refresher, here’s what the [Office for National\n", "> Statistics](https://www.ons.gov.uk/methodology/geography/ukgeographies/censusgeographies/census2021geographies)\n", "> says.\n", "\n", "> **Tip**\n", ">\n", "> We’re going to mix in the London’s MSOA ‘Atlas’ from the [London Data\n", "> Store](https://data.london.gov.uk/dataset). 
I would *strongly* suggest\n", "> that you have a look around the London Data Store as you develop your\n", "> thinking for the group assessment – you will likely find useful\n", "> additional data there!\n", "\n", "Once you see how we deal with this MSOA Atlas data you will be in a\n", "position to work with any other similarly complex (in terms of the\n", "headings and indexes) data set. If you’re feeling particularly ambitious\n", "you can actually do this *same* work at the LSOA scale using the [LSOA\n", "Atlas](https://data.london.gov.uk/dataset/lsoa-atlas) and LSOA\n", "boundaries… the process should be the same, though you will have smaller\n", "samples in each LSOA than you do in the MSOAs and calculations will take\n", "a bit longer to complete. You could also search on the ONS Census web\n", "site for data from 2021.\n", "\n", "There is a CSV file for the MSOA Atlas that would be easier to work\n", "with; however, the Excel file is useful for demonstrating how to work\n", "with multi-level indexes (an extension of last week’s work). Notice that\n", "below we do two new things when reading the XLS file:\n", "\n", "1. We have to specify a sheet name because the file contains multiple\n", " sheets.\n", "2. We have to specify not just *one* header, we actually have to\n", " specify three of them which generates a multi-level index (row 0 is\n", " the top-level, row 1 is the second-level, etc.).\n", "\n", "### 2.1 Load MSOA Excel File\n", "\n", "> **Difficulty level: Low.**\n", "\n", "You might like to load the cached copy of the file into Excel so that\n", "you can see how the next bit works. You can find the rest of the MSOA\n", "Atlas [here](https://data.london.gov.uk/dataset/msoa-atlas)." ], "id": "906e6a9a-65fd-4818-9999-d878f312f508" }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "src_url = 'https://data.london.gov.uk/download/msoa-atlas/39fdd8eb-e977-4d32-85a4-f65b92f29dcb/msoa-data.xls'\n", "dest_path = os.path.join('data','msoa')" ], "id": "8d11d8bb" }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Question\n", "\n", "``` python\n", "excel_atlas = pd.read_excel(\n", " cache_data(src_url, dest_path), \n", " ???, # Which sheet is the data in?\n", " header=[0,1,2]) # Where are the column names... there's three of them!\n", "```\n", "\n", "##### Answer\n", "\n", "``` python\n", "excel_atlas = pd.read_excel(\n", " cache_data(src_url, dest_path),\n", " sheet_name='iadatasheet1', # Which sheet is the data in?\n", " header=[0,1,2]) # Where are the column names... there's three of them!\n", "```\n", "\n", " Found data/msoa/msoa-data.xls locally!\n", " Size is 2 MB (1,979,904 bytes)\n", "\n", "Notice the format of the output and notice that all of the empty cells\n", "in the Excel sheet have come through as\n", "`Unnamed: _level_`:" ], "id": "d4333f8c-9920-4667-b8f0-d563766439b4" }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "output_type": "display_data", "metadata": {}, "data": { "text/html": [ "\n", "

1 rows × 207 columns

\n", "" ] } } ], "source": [ "excel_atlas.head(1)" ], "id": "17c06c0b" }, { "cell_type": "markdown", "metadata": {}, "source": [ "``` python\n", "print(f\"Shape of the MSOA Atlas data frame is: {excel_atlas.shape[0]:,} x {excel_atlas.shape[1]:,}\")\n", "```\n", "\n", "You should get: Shape of the MSOA Atlas data frame is: 984 x\n", "207, but how on earth are you going to access the data?\n", "\n", "### 2.2 Accessing MultiIndexes\n", "\n", "> **Difficulty: Moderate.**\n", ">\n", "> The difficulty is conceptual, not technical.\n", "\n", "Until now we have understood the pandas index as a single column-like\n", "‘thing’ in a data frame, but pandas also supports hierarchical and\n", "grouped indexes that allow us to interact with data in more complex\n", "ways… should we need it. Generally:\n", "\n", "- MultiIndex == hierarchical index on *columns*\n", "- DataFrameGroupBy == iterable pseudo-hierarchical index on *rows*\n", "\n", "> **🔗 Connections**\n", ">\n", "> We’ll be looking at **Grouping Data** in much more detail in [next\n", "> week](https://jreades.github.io/fsds/sessions/week9.html#lectures), so\n", "> the main thing to remember is that grouping is for rows,\n", "> multi-indexing is about columns.\n", "\n", "#### 2.2.1 Direct Access\n", "\n", "Of course, one way to get at the data is to use `.iloc[...]` since that\n", "refers to columns by *position* and ignores the complexity of the index.\n", "Try printing out the the first five rows of the first column using\n", "`iloc`:\n", "\n", "``` python\n", "excel_atlas.iloc[???]\n", "```\n", "\n", "You should get:" ], "id": "6a7c20af-9de8-408b-9244-546c95b846d4" }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "output_type": "display_data", "metadata": {}, "data": { "text/plain": [ "0 E02000001\n", "1 E02000002\n", "2 E02000003\n", "3 E02000004\n", "4 E02000005\n", "Name: (Unnamed: 0_level_0, Unnamed: 0_level_1, MSOA Code), dtype: object" ] } } ], "source": [], "id": "d68ce8e8" }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.2.2 Named Access\n", "\n", "But to do it by name is a little trickier:" ], "id": "01d8e159-128f-4ec1-bb3b-399a90358374" }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "output_type": "display_data", "metadata": {}, "data": { "text/plain": [ "[('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'MSOA Code'),\n", " ('Unnamed: 1_level_0', 'Unnamed: 1_level_1', 'MSOA Name'),\n", " ('Age Structure (2011 Census)', 'All Ages', 'Unnamed: 2_level_2'),\n", " ('Age Structure (2011 Census)', '0-15', 'Unnamed: 3_level_2'),\n", " ('Age Structure (2011 Census)', '16-29', 'Unnamed: 4_level_2')]" ] } } ], "source": [ "excel_atlas.columns.tolist()[:5]" ], "id": "29735160" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how asking for the first five columns has given us a list of…\n", "what exactly?\n", "\n", "##### Question\n", "\n", "So to get the **same output** by column *name* what do you need to copy\n", "from above:\n", "\n", "``` python\n", "excel_atlas.loc[0:5, ???]\n", "```\n", "\n", "##### Answer\n", "\n", "``` python\n", "excel_atlas.loc[0:5, ('Unnamed: 0_level_0','Unnamed: 0_level_1','MSOA Code')]\n", "```\n", "\n", " 0 E02000001\n", " 1 E02000002\n", " 2 E02000003\n", " 3 E02000004\n", " 4 E02000005\n", " 5 E02000007\n", " Name: (Unnamed: 0_level_0, Unnamed: 0_level_1, MSOA Code), dtype: object\n", "\n", "The answer is *really* awkward, so we’re going to look for a better way…\n", "\n", "#### 2.2.3 Grouped Access\n", "\n", "Despite 
this, *one* way that MultiIndexes can be useful is for accessing\n", "column-slices from a ‘wide’ dataframe. We can, for instance, select all\n", "of the Age Structure columns in one go and it will be *simpler* than\n", "what we did above." ], "id": "892acece-4fe1-4c7d-bb36-3a3cc6494d4c" }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "output_type": "display_data", "metadata": {}, "data": { "text/html": [ "\n", "" ] } } ], "source": [ "excel_atlas.loc[0:5, ('Age Structure (2011 Census)')]" ], "id": "b3416a41" }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.2.4 Understanding Levels\n", "\n", "This works because the MultiIndex tracks the columns using *levels*,\n", "with level 0 at the ‘top’ and level 2 (in our case) at the bottom. These\n", "are the unique *values* for the top level (‘row 0’):" ], "id": "bc4c3ecc-1bd1-4c8b-8a6f-7e6c6b952708" }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "output_type": "display_data", "metadata": {}, "data": { "text/plain": [ "Index(['Adults in Employment (2011 Census)', 'Age Structure (2011 Census)',\n", " 'Car or van availability (2011 Census)',\n", " 'Central Heating (2011 Census)', 'Country of Birth (2011)',\n", " 'Dwelling type (2011)', 'Economic Activity (2011 Census)',\n", " 'Ethnic Group (2011 Census)', 'Health (2011 Census)', 'House Prices',\n", " 'Household Composition (2011)', 'Household Income Estimates (2011/12)',\n", " 'Household Language (2011)', 'Households (2011)', 'Incidence of Cancer',\n", " 'Income Deprivation (2010)', 'Land Area', 'Life Expectancy',\n", " 'Lone Parents (2011 Census)', 'Low Birth Weight Births (2007-2011)',\n", " 'Mid-year Estimate totals', 'Mid-year Estimates 2012, by age',\n", " 'Obesity', 'Population Density', 'Qualifications (2011 Census)',\n", " 'Religion (2011)', 'Road Casualties', 'Tenure (2011)',\n", " 'Unnamed: 0_level_0', 'Unnamed: 1_level_0'],\n", " dtype='object')" ] } } ], "source": [ "excel_atlas.columns.levels[0]" ], "id": "4db77dd8" }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are the *values* for those levels across the actual columns in the\n", "data frame, notice the repeated ‘Age Structure (2011 Census)’:" ], "id": "db9b8a0d-794e-4e6c-ae6b-99c6a626e837" }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "output_type": "display_data", "metadata": {}, "data": { "text/plain": [ "Index(['Unnamed: 0_level_0', 'Unnamed: 1_level_0',\n", " 'Age Structure (2011 Census)', 'Age Structure (2011 Census)',\n", " 'Age Structure (2011 Census)', 'Age Structure (2011 Census)',\n", " 'Age Structure (2011 Census)', 'Age Structure (2011 Census)',\n", " 'Age Structure (2011 Census)', 'Mid-year Estimate totals'],\n", " dtype='object')" ] } } ], "source": [ "excel_atlas.columns.get_level_values(0)[:10]" ], "id": "024b77af" }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here are the *values* for the second level of the index (‘row 1’ in\n", "the Excel file):" ], "id": "73e87326-b2cf-4717-a41b-584040d8b46c" }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "output_type": "display_data", "metadata": {}, "data": { "text/plain": [ "Index(['Unnamed: 0_level_1', 'Unnamed: 1_level_1', 'All Ages', '0-15', '16-29',\n", " '30-44', '45-64', '65+', 'Working-age', 'All Ages'],\n", " dtype='object')" ] } } ], "source": [ "excel_atlas.columns.get_level_values(1)[:10]" ], "id": "2b78bdd5" }, { "cell_type": "markdown", "metadata": {}, "source": [ "By extension, if we 
drop a level 0 index then *all* of the columns that\n", "it supports at levels 1 and 2 are *also* dropped: so when we drop\n", "`Mid-year Estimate totals` from level 0 then all 11 of the ‘Mid-year\n", "Estimate totals (2002…2012)’ columns are dropped in one go." ], "id": "861e2211-1fd7-4aaa-97a1-92e9f3dbc9e5" }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "output_type": "display_data", "metadata": {}, "data": { "text/html": [ "\n", "" ] } } ], "source": [ "excel_atlas[['Mid-year Estimate totals']].head(3)" ], "id": "656489af" }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Excel source had 207 columns.\n", "Test now has 196 columns." ] } ], "source": [ "test = excel_atlas.drop(columns=['Mid-year Estimate totals'], axis=1, level=0)\n", "\n", "print(f\"Excel source had {excel_atlas.shape[1]} columns.\")\n", "print(f\"Test now has {test.shape[1]} columns.\")" ], "id": "c0e267f6" }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Tidy up if the variable exists\n", "if 'test' in locals():\n", " del(test)" ], "id": "3ee0e61c" }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.2.5 Questions\n", "\n", "- What data type is used for storing/accessing MultiIndexes?\n", "- *Why* is this is the appropriate data type?\n", "- How (conceptually) are the header rows in Excel are mapped on to\n", " levels in pandas?\n", "\n", "### 2.3 Tidying Up\n", "\n", "> **Difficulty level: Low**\n", ">\n", "> Although there’s a lot of dealing with column names.\n", "\n", "#### 2.3.1 Dropping Named Levels\n", "\n", "There’s a *lot* of data in the data frame that we don’t need for our\n", "Airbnb work, so let’s go a bit further with the dropping of\n", "column-groups using the MultiIndex." ], "id": "3681c86f-f2cc-47b6-9e97-342f28470c64" }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Shape of the MSOA Atlas data frame is now: 984 x 111" ] } ], "source": [ "to_drop = ['Mid-year Estimate totals','Mid-year Estimates 2012, by age','Religion (2011)',\n", " 'Land Area','Lone Parents (2011 Census)','Central Heating (2011 Census)','Health (2011 Census)',\n", " 'Low Birth Weight Births (2007-2011)','Obesity','Incidence of Cancer','Life Expectancy',\n", " 'Road Casualties']\n", "tidy = excel_atlas.drop(to_drop, axis=1, level=0)\n", "print(f\"Shape of the MSOA Atlas data frame is now: {tidy.shape[0]} x {tidy.shape[1]}\")" ], "id": "a7545fee" }, { "cell_type": "markdown", "metadata": {}, "source": [ "This should drop you down to 984 x 111. Notice below that the\n", "multi-level *index* has not changed but the multi-level *values*\n", "remaining have!" ], "id": "c3220289-c5fe-47e8-a24b-92c694e846a3" }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "There are 30 categories.\n", "But only 18 values." ] } ], "source": [ "print(f\"There are {len(tidy.columns.levels[0].unique())} categories.\") # The categories\n", "print(f\"But only {len(tidy.columns.get_level_values(0).unique())} values.\") # The actual values" ], "id": "ef809d3e" }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.3.2 Selecting Columns using a List Comprehension\n", "\n", "Now we need to drop all of the percentages from the data set. 
These can\n", "be found at level 1, though they are specified in a number of different\n", "ways so you’ll need to come up with a way to find them in the level 1\n", "values using a list comprehension…\n", "\n", "I’d suggest looking for: “(%)”, “%”, and “Percentages”. You may need to\n", "check both start and end of the string. You could also use a regular\n", "expression here instead of multiple tests. Either way *works*, but have\n", "a think about the tradeoffs between intelligibility, speed, and what\n", "*you* understand…\n", "\n", "##### Question\n", "\n", "Selection using multiple logical tests:\n", "\n", "``` python\n", "to_drop = [x for x in tidy.columns.get_level_values(1) if (???)]\n", "print(to_drop)\n", "```\n", "\n", "Selection using a regular expression:\n", "\n", "``` python\n", "print([x for x in tidy.columns.get_level_values(1) if re.search(???, x)])\n", "```\n", "\n", "##### Answer\n", "\n", "``` python\n", "to_drop = [x for x in tidy.columns.get_level_values(1) if (\n", " x.endswith(\"(%)\") or x.startswith(\"%\") or x.endswith(\"Percentages\") or x.endswith(\"%\"))]\n", "print(to_drop)\n", "```\n", "\n", " ['Percentages', 'Percentages', 'Percentages', 'Percentages', 'Percentages', 'White (%)', 'Mixed/multiple ethnic groups (%)', 'Asian/Asian British (%)', 'Black/African/Caribbean/Black British (%)', 'Other ethnic group (%)', 'BAME (%)', 'United Kingdom (%)', 'Not United Kingdom (%)', '% of people aged 16 and over in household have English as a main language', '% of households where no people in household have English as a main language', 'Owned: Owned outright (%)', 'Owned: Owned with a mortgage or loan (%)', 'Social rented (%)', 'Private rented (%)', 'Household spaces with at least one usual resident (%)', 'Household spaces with no usual residents (%)', 'Whole house or bungalow: Detached (%)', 'Whole house or bungalow: Semi-detached (%)', 'Whole house or bungalow: Terraced (including end-terrace) (%)', 'Flat, maisonette or apartment (%)', 'Economically active %', 'Economically inactive %', '% of households with no adults in employment: With dependent children', '% living in income deprived households reliant on means tested benefit', '% of people aged over 60 who live in pension credit households', 'No cars or vans in household (%)', '1 car or van in household (%)', '2 cars or vans in household (%)', '3 cars or vans in household (%)', '4 or more cars or vans in household (%)']\n", "\n", "``` python\n", "print([x for x in tidy.columns.get_level_values(1) if re.search(\"(?:%|Percentages)\", x)])\n", "```\n", "\n", " ['Percentages', 'Percentages', 'Percentages', 'Percentages', 'Percentages', 'White (%)', 'Mixed/multiple ethnic groups (%)', 'Asian/Asian British (%)', 'Black/African/Caribbean/Black British (%)', 'Other ethnic group (%)', 'BAME (%)', 'United Kingdom (%)', 'Not United Kingdom (%)', '% of people aged 16 and over in household have English as a main language', '% of households where no people in household have English as a main language', 'Owned: Owned outright (%)', 'Owned: Owned with a mortgage or loan (%)', 'Social rented (%)', 'Private rented (%)', 'Household spaces with at least one usual resident (%)', 'Household spaces with no usual residents (%)', 'Whole house or bungalow: Detached (%)', 'Whole house or bungalow: Semi-detached (%)', 'Whole house or bungalow: Terraced (including end-terrace) (%)', 'Flat, maisonette or apartment (%)', 'Economically active %', 'Economically inactive %', '% of households with no adults in employment: With dependent 
children', '% living in income deprived households reliant on means tested benefit', '% of people aged over 60 who live in pension credit households', 'No cars or vans in household (%)', '1 car or van in household (%)', '2 cars or vans in household (%)', '3 cars or vans in household (%)', '4 or more cars or vans in household (%)']\n", "\n", "With both you should get:\n", "\n", " ['Percentages',\n", " 'Percentages',\n", " 'Percentages',\n", " 'Percentages',\n", " 'Percentages',\n", " 'White (%)',\n", " 'Mixed/multiple ethnic groups (%)',\n", " 'Asian/Asian British (%)',\n", " 'Black/African/Caribbean/Black British (%)',\n", " 'Other ethnic group (%)',\n", " 'BAME (%)',\n", " 'United Kingdom (%)',\n", " 'Not United Kingdom (%)',\n", " '% of people aged 16 and over in household have English as a main language',\n", " '% of households where no people in household have English as a main language',\n", " 'Owned: Owned outright (%)',\n", " 'Owned: Owned with a mortgage or loan (%)',\n", " 'Social rented (%)',\n", " 'Private rented (%)',\n", " 'Household spaces with at least one usual resident (%)',\n", " 'Household spaces with no usual residents (%)',\n", " 'Whole house or bungalow: Detached (%)',\n", " 'Whole house or bungalow: Semi-detached (%)',\n", " 'Whole house or bungalow: Terraced (including end-terrace) (%)',\n", " 'Flat, maisonette or apartment (%)',\n", " 'Economically active %',\n", " 'Economically inactive %',\n", " '% of households with no adults in employment: With dependent children',\n", " '% living in income deprived households reliant on means tested benefit',\n", " '% of people aged over 60 who live in pension credit households',\n", " 'No cars or vans in household (%)',\n", " '1 car or van in household (%)',\n", " '2 cars or vans in household (%)',\n", " '3 cars or vans in household (%)',\n", " '4 or more cars or vans in household (%)']\n", "\n", "> **🔗 Connections**\n", ">\n", "> See how regular expressions keep coming\n", "> [baaaaaaaaack](https://jreades.github.io/fsds/sessions/week7.html#lectures)?\n", "> That said, you can also often make use of simple string functions like\n", "> `startswith` and `endswith` for this problem.\n", "\n", "#### 2.3.3 Drop by Level\n", "\n", "You now need to drop these columns using the `level` keyword as part of\n", "your drop command. You have plenty of examples of how to drop values in\n", "place, but I’d suggest *first* getting the command correct (maybe\n", "duplicate the cell below and change the code so that the result is saved\n", "to a dataframe called `test` before overwriting `tidy`?) and *then*\n", "saving the change.\n", "\n", "##### Question\n", "\n", "``` python\n", "tidy = tidy.drop(to_drop, axis=1, level=???)\n", "print(f\"Shape of the MSOA Atlas data frame is now: {tidy.shape[0]} x {tidy.shape[1]}\")\n", "```\n", "\n", "##### Answer\n", "\n", "``` python\n", "tidy.drop(to_drop, axis=1, level=1, inplace=True)\n", "print(f\"Shape of the MSOA Atlas data frame is now: {tidy.shape[0]} x {tidy.shape[1]}\")\n", "```\n", "\n", " Shape of the MSOA Atlas data frame is now: 984 x 76\n", "\n", "The data frame should now be 984 x 76. This is a *bit* more\n", "manageable though still a *lot* of data columns. Depending on what you\n", "decide to do for your final project you might want to revisit some of\n", "the columns that we dropped above…\n", "\n", "#### 2.3.4 Flattening the Index\n", "\n", "Although this ia big improvement, you’ll have trouble saving or linking\n", "this data to other inputs. 
The problem is that Level 2 of the\n", "multi-index is mainly composed of ‘Unnamed’ values and so we need to\n", "merge it with Level 1 to simplify our data frame, and then merge *that*\n", "with level 0…" ], "id": "44f4e876-6680-401b-96a1-c295293425f2" }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "output_type": "display_data", "metadata": {}, "data": { "text/plain": [ "array([('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'MSOA Code'),\n", " ('Unnamed: 1_level_0', 'Unnamed: 1_level_1', 'MSOA Name'),\n", " ('Age Structure (2011 Census)', 'All Ages', 'Unnamed: 2_level_2')],\n", " dtype=object)" ] } } ], "source": [ "tidy.columns.values[:3]" ], "id": "f4a35ae9" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s use code to sort this out!" ], "id": "2de40188-69d5-43db-bcfa-41ec013bcbf2" }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "new_cols = []\n", "for c in tidy.columns.values:\n", " \n", " #print(f\"Column label: {c}\")\n", " l1 = f\"{c[0]}\"\n", " l2 = f\"{c[1]}\"\n", " l3 = f\"{c[2]}\"\n", " \n", " # The new column label\n", " clabel = ''\n", " \n", " # Assemble new label from the levels\n", " if not l1.startswith(\"Unnamed\"):\n", " l1 = l1.replace(\" (2011 Census)\",'').replace(\" (2011)\",'').replace(\"Household \",'').replace(\"House Prices\",'').replace(\"Car or van availability\",'Vehicles').replace(' (2011/12)','')\n", " l1 = l1.replace('Age Structure','Age').replace(\"Ethnic Group\",'').replace('Dwelling type','').replace('Income Estimates','')\n", " clabel += l1\n", " if not l2.startswith(\"Unnamed\"):\n", " l2 = l2.replace(\"Numbers\",'').replace(\" House Price (£)\",'').replace(\"Highest level of qualification: \",'').replace(\"Annual Household Income (£)\",'hh Income').replace('Whole house or bungalow: ','').replace(' qualifications','')\n", " l2 = l2.replace('At least one person aged 16 and over in household has English as a main language',\"1+ English as a main language\").replace(\"No people in household have English as a main language\",\"None have English as main language\")\n", " clabel += ('-' if clabel != '' else '') + l2\n", " if not l3.startswith(\"Unnamed\"):\n", " clabel += ('-' if clabel != '' else '') + l3\n", " \n", " # Replace other commonly-occuring verbiage that inflates column name width\n", " clabel = clabel.replace('--','-').replace(\" household\",' hh').replace('Owned: ','')\n", " \n", " #clabel = clabel.replace(' (2011 Census)','').replace(' (2011)','').replace('Sales - 2011.1','Sales - 2012')\n", " #clabel = clabel.replace('Numbers - ','').replace(' (£)','').replace('Car or van availability','Vehicles')\n", " #clabel = clabel.replace('Household Income Estimates (2011/12) - ','').replace('Age Structure','Age')\n", " \n", " new_cols.append(clabel)" ], "id": "718d525d" }, { "cell_type": "markdown", "metadata": {}, "source": [ "``` python\n", "new_cols\n", "```\n", "\n", " ['MSOA Code',\n", " 'MSOA Name',\n", " 'Age-All Ages',\n", " 'Age-0-15',\n", " 'Age-16-29',\n", " 'Age-30-44',\n", " 'Age-45-64',\n", " 'Age-65+',\n", " 'Age-Working-age',\n", " 'Households-All Households',\n", " 'Composition-Couple hh with dependent children',\n", " 'Composition-Couple hh without dependent children',\n", " 'Composition-Lone parent hh',\n", " 'Composition-One person hh',\n", " 'Composition-Other hh Types',\n", " 'White',\n", " 'Mixed/multiple ethnic groups',\n", " 'Asian/Asian British',\n", " 'Black/African/Caribbean/Black British',\n", " 'Other ethnic group',\n", " 'BAME',\n", 
" 'Country of Birth-United Kingdom',\n", " 'Country of Birth-Not United Kingdom',\n", " 'Language-1+ English as a main language',\n", " 'Language-None have English as main language',\n", " 'Tenure-Owned outright',\n", " 'Tenure-Owned with a mortgage or loan',\n", " 'Tenure-Social rented',\n", " 'Tenure-Private rented',\n", " 'Household spaces with at least one usual resident',\n", " 'Household spaces with no usual residents',\n", " 'Detached',\n", " 'Semi-detached',\n", " 'Terraced (including end-terrace)',\n", " 'Flat, maisonette or apartment',\n", " 'Population Density-Persons per hectare (2012)',\n", " 'Median-2005',\n", " 'Median-2006',\n", " 'Median-2007',\n", " 'Median-2008',\n", " 'Median-2009',\n", " 'Median-2010',\n", " 'Median-2011',\n", " 'Median-2012',\n", " 'Median-2013 (p)',\n", " 'Sales-2005',\n", " 'Sales-2006',\n", " 'Sales-2007',\n", " 'Sales-2008',\n", " 'Sales-2009',\n", " 'Sales-2010',\n", " 'Sales-2011',\n", " 'Sales-2011.1',\n", " 'Sales-2013(p)',\n", " 'Qualifications-No',\n", " 'Qualifications-Level 1',\n", " 'Qualifications-Level 2',\n", " 'Qualifications-Apprenticeship',\n", " 'Qualifications-Level 3',\n", " 'Qualifications-Level 4 and above',\n", " 'Qualifications-Other',\n", " 'Qualifications-Schoolchildren and full-time students: Age 18 and over',\n", " 'Economic Activity-Economically active: Total',\n", " 'Economic Activity-Economically active: Unemployed',\n", " 'Economic Activity-Economically inactive: Total',\n", " 'Economic Activity-Unemployment Rate',\n", " 'Adults in Employment-No adults in employment in hh: With dependent children',\n", " 'Total Mean hh Income',\n", " 'Total Median hh Income',\n", " 'Vehicles-No cars or vans in hh',\n", " 'Vehicles-1 car or van in hh',\n", " 'Vehicles-2 cars or vans in hh',\n", " 'Vehicles-3 cars or vans in hh',\n", " 'Vehicles-4 or more cars or vans in hh',\n", " 'Vehicles-Sum of all cars or vans in the area',\n", " 'Vehicles-Cars per hh']\n", "\n", "> **Stop**\n", ">\n", "> Make sure you understand what is happening here before just moving on\n", "> to the next thing. Try adding `print()` statements if it will help it\n", "> to make sense. This sort of code comes up a *lot* in the real world." ], "id": "b94e53e6-6efd-465b-aecb-e63b87a061d9" }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "output_type": "display_data", "metadata": {}, "data": { "text/html": [ "\n", "

5 rows × 76 columns

\n", "" ] } } ], "source": [ "tidy.columns = new_cols # <- Blow away complex index, replace with simple\n", "tidy.head()" ], "id": "a1b9abf3" }, { "cell_type": "markdown", "metadata": {}, "source": [ "You might want to have a look at *what* the code below drops first\n", "before just running it… remember that you can pull apart any complex\n", "code into pieces:" ], "id": "285336d7-fd6e-4583-a4ce-a793d806aaa7" }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "output_type": "display_data", "metadata": {}, "data": { "text/plain": [ "Index([983], dtype='int64')" ] } } ], "source": [ "tidy['MSOA Code'].isna()\n", "tidy[tidy['MSOA Code'].isna()].index" ], "id": "20cfd6dc" }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "tidy.drop(index=tidy[tidy['MSOA Code'].isna()].index, inplace=True)" ], "id": "c4cb41be" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4 Add Inner/Outer London Mapping\n", "\n", "> **Difficulty: Moderate, since I’m not giving you many clues.**\n", "\n", "> **🔗 Connections**\n", ">\n", "> We touched on `lambda` functions last week; it’s a ‘trivial’ function\n", "> that we don’t even want to bother defining with `def`. We also used\n", "> the lambda function in the context of `apply` so this is just another\n", "> chance to remind yourself how this works. This is quite advanced\n", "> Python, so don’t panic if you don’t get it right away and have to do\n", "> some Googling…\n", "\n", "We want to add the borough name and a ‘subregion’ name. We already have\n", "the borough name buried in a *separate* column, so step 1 is to extract\n", "that from the MSOA Name. Step 2 is to use the borough name as a lookup\n", "to the subregion name using a **lambda** function. The format for a\n", "lambda function is usually\n", "`lambda x: `. 
Hint:\n", "you’ve got a dictionary and you know how to use it!\n", "\n", "#### 2.4.1 Add Boroughs\n", "\n", "We first need to extract the borough names from one of the existing\n", "fields in the data frame… a *regex* that does *replacement* would be\n", "fastest and easiest: focus on what you *don’t* need from the MSOA Name\n", "**string** and **replacing** that using a **regex**…\n", "\n", "##### Question\n", "\n", "``` python\n", "tidy['Borough'] = tidy['MSOA Name'].???\n", "tidy.Borough.unique()\n", "```\n", "\n", "##### Answer\n", "\n", "``` python\n", "tidy['Borough'] = tidy['MSOA Name'].str.replace(r' \\d+$','',regex=True)\n", "tidy.Borough.unique()\n", "```\n", "\n", " array(['City of London', 'Barking and Dagenham', 'Barnet', 'Bexley',\n", " 'Brent', 'Bromley', 'Camden', 'Croydon', 'Ealing', 'Enfield',\n", " 'Greenwich', 'Hackney', 'Hammersmith and Fulham', 'Haringey',\n", " 'Harrow', 'Havering', 'Hillingdon', 'Hounslow', 'Islington',\n", " 'Kensington and Chelsea', 'Kingston upon Thames', 'Lambeth',\n", " 'Lewisham', 'Merton', 'Newham', 'Redbridge',\n", " 'Richmond upon Thames', 'Southwark', 'Sutton', 'Tower Hamlets',\n", " 'Waltham Forest', 'Wandsworth', 'Westminster'], dtype=object)\n", "\n", "You should get:" ], "id": "f8d41238-0d65-4f71-a2ef-bcf9233f4ca1" }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "output_type": "display_data", "metadata": {}, "data": { "text/plain": [ "array(['City of London', 'Barking and Dagenham', 'Barnet', 'Bexley',\n", " 'Brent', 'Bromley', 'Camden', 'Croydon', 'Ealing', 'Enfield',\n", " 'Greenwich', 'Hackney', 'Hammersmith and Fulham', 'Haringey',\n", " 'Harrow', 'Havering', 'Hillingdon', 'Hounslow', 'Islington',\n", " 'Kensington and Chelsea', 'Kingston upon Thames', 'Lambeth',\n", " 'Lewisham', 'Merton', 'Newham', 'Redbridge',\n", " 'Richmond upon Thames', 'Southwark', 'Sutton', 'Tower Hamlets',\n", " 'Waltham Forest', 'Wandsworth', 'Westminster'], dtype=object)" ] } } ], "source": [], "id": "163fe03c" }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.4.2 Map Boroughs to Subregions\n", "\n", "And now you need to understand how to *apply* the mapping ot the Borough\n", "field using a `lambda` function. It’s fairly straightforward once you\n", "know the syntax: just a dictionary lookup. 
But as usual, you might want\n", "to first create a new cell and experiment with the output from the apply\n", "function before using it to write the `Subregion` field of the data\n", "frame…" ], "id": "c3bf191d-7e5c-4de0-94cd-7693ef8b595d" }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "mapping = {}\n", "for b in ['Enfield','Waltham Forest','Redbridge','Barking and Dagenham','Havering','Greenwich','Bexley']:\n", " mapping[b]='Outer East and North East'\n", "for b in ['Haringey','Islington','Hackney','Tower Hamlets','Newham','Lambeth','Southwark','Lewisham']:\n", " mapping[b]='Inner East'\n", "for b in ['Bromley','Croydon','Sutton','Merton','Kingston upon Thames']:\n", " mapping[b]='Outer South'\n", "for b in ['Wandsworth','Kensington and Chelsea','Hammersmith and Fulham','Westminster','Camden','City of London']:\n", " mapping[b]='Inner West'\n", "for b in ['Richmond upon Thames','Hounslow','Ealing','Hillingdon','Brent','Harrow','Barnet']:\n", " mapping[b]='Outer West and North West'" ], "id": "b8fce8ad" }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Question\n", "\n", "``` python\n", "tidy['Subregion'] = tidy.Borough.apply(???)\n", "```\n", "\n", "##### Answer\n", "\n", "``` python\n", "tidy['Subregion'] = tidy.Borough.apply(lambda x: mapping[x])\n", "```\n", "\n", "#### 2.4.3 And Save\n", "\n", "There’s a little snipped of useful code to work out here: we need to\n", "check if the `clean` directory exists in the `data` directory; if we\n", "don’t then the `tidy.to_parquet()` call will fail." ], "id": "c2991bb0-f85b-4af4-9558-9ac5a512cc71" }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Done." ] } ], "source": [ "if not os.path.exists(os.path.join('data','clean')):\n", " os.makedirs(os.path.join('data','clean'))\n", "tidy.to_parquet(os.path.join('data','clean','MSOA_Atlas.parquet'))\n", "print(\"Done.\")" ], "id": "4b09b0d8" }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.4.4 Questions\n", "\n", "- What are the advantages to `apply` and `lambda` functions over\n", " looping and named functions?\n", "- When might you choose a named function over a lambda function?\n", "\n", "### 2.5 Merge Data & Geography\n", "\n", "> **Difficulty: Low, except for plotting.**\n", "\n", "> **🔗 Connections**\n", ">\n", "> We’ll cover joins (of which a `merge` is just one type) in the [final\n", "> week’s\n", "> lectures](https://jreades.github.io/fsds/sessions/week10.html#lectures),\n", "> but between what you’d done in GIS and what we have here there should\n", "> be enough here for you to start being able to make sense of how they\n", "> work so that you don’t have to wait until Week 10 to think about how\n", "> this could help you with your Group Assessment.\n", "\n", "First, we need to download the MSOA source file, which is a zipped\n", "archive of a Shapefile:" ], "id": "46c68da7-8593-4444-ab22-d4f4976b4daa" }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Found data/geo/Middle_Layer_Super_Output_Areas__December_2011__EW_BGC_V2-shp.zip locally!\n", " Size is 7 MB (7,381,177 bytes)" ] } ], "source": [ "# Oh look, we can read a Shapefile without needing to unzip it!\n", "msoas = gpd.read_file(\n", " cache_data('https://github.com/jreades/fsds/blob/master/data/src/Middle_Layer_Super_Output_Areas__December_2011__EW_BGC_V2-shp.zip?raw=true', \n", " 
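# NB: cache_data() downloads the file on the first run and then returns the local path; geopandas can read the zipped Shapefile directly\n", "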
os.path.join('data','geo')), driver='ESRI Shapefile')" ], "id": "5df78ed6" }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.5.1 Identifying Matching Columns\n", "\n", "Looking at the first few columns of each data frame, which one might\n", "allow us to link the two files together? You’ve done this in GIS.\n", "*Remember*: the column *names* don’t need to match for us to use them in\n", "a join, it’s the *values* that matter." ], "id": "df67f7f8-063d-47a7-95f8-b6602cf2bc66" }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Column names: MSOA Code, MSOA Name, Age-All Ages, Age-0-15, Age-16-29" ] }, { "output_type": "display_data", "metadata": {}, "data": { "text/html": [ "\n", "" ] } } ], "source": [ "print(f\"Column names: {', '.join(tidy.columns.tolist()[:5])}\")\n", "tidy.iloc[:3,:5]" ], "id": "9d67d9ae" }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.5.2 Merge\n", "\n", "One more thing: if you’ve got more than one choice I’d *always* go with\n", "a code over a name because one is intended for matching and other is\n", "not…\n", "\n", "##### Question\n", "\n", "``` python\n", "gdf = pd.merge(msoas, tidy, left_on=???, right_on=???, how='inner')\n", "gdf = gdf.drop(columns=['MSOA11CD','MSOA11NM','OBJECTID'])\n", "\n", "print(f\"Final MSOA Atlas data frame has shape {gdf.shape[0]:,} x {gdf.shape[1]}\")\n", "```\n", "\n", "##### Answer\n", "\n", "``` python\n", "gdf = pd.merge(msoas, tidy, left_on='MSOA11CD', right_on='MSOA Code', how='inner')\n", "gdf = gdf.drop(columns=['MSOA11CD','MSOA11NM','OBJECTID'])\n", "\n", "print(f\"Final MSOA Atlas data frame has shape {gdf.shape[0]:,} x {gdf.shape[1]}\")\n", "```\n", "\n", " Final MSOA Atlas data frame has shape 983 x 86\n", "\n", "You should get Final data frame has shape 983 x 86.\n", "\n", "#### 2.5.3 Plot Choropleth\n", "\n", "Let’s plot the median income in 2011 column using the `plasma` colour\n", "ramp… The rest is to show you how to customise a legend.\n", "\n", "``` python\n", "col = 'Median-2011'\n", "fig = gdf.plot(column=???, cmap='???', \n", " scheme='FisherJenks', k=7, edgecolor='None', \n", " legend=True, legend_kwds={'frameon':False, 'fontsize':8},\n", " figsize=(8,7));\n", "plt.title(col.replace('-',' '));\n", "\n", "# Now to modify the legend: googling \"geopandas format legend\"\n", "# brings me to: https://stackoverflow.com/a/56591102/4041902\n", "leg = fig.get_legend()\n", "leg._loc = 3\n", "\n", "for lbl in leg.get_texts():\n", " label_text = lbl.get_text()\n", " [low, hi] = label_text.split(', ')\n", " new_text = f'£{float(low):,.0f} - £{float(hi):,.0f}'\n", " lbl.set_text(new_text)\n", "\n", "plt.show();\n", "```\n", "\n", "#### 2.5.4 Save\n", "\n", "``` python\n", "gdf.to_geoparquet(os.path.join('data','geo','MSOA_Atlas.geoparquet'))\n", "```\n", "\n", "#### 2.5.5 Questions\n", "\n", "- Try changing the colour scheme, classification scheme, and number of\n", " classes to see if you feel there’s a *better* opeion than the one\n", " shown above… Copy the cell (click on anywhere outside the code and\n", " then hit `C` to copy. Then click on this cell *once*, and hit `V` to\n", " paste.\n", "\n", "## 3. Splitting into Test & Train\n", "\n", "> **Note**\n", ">\n", "> **🔗 Connections**: Here you will be using a standard approach in Machine Learningknown as _test/train split_, although we _won't_ be doing any actual ML (😅). 
Here we are just going to use it to explore some issues raised by normalisation and standardisation in [this week's lectures](https://jreades.github.io/fsds/sessions/week8.html#lectures).\n", "\n", "A standard approach to Machine Learning, and something that is becoming\n", "more widely used elsewhere, is the splitting of a large data into set\n", "into testing and training components. Typically, you would take 80-90%\n", "of your data to ‘train’ your algorithm and withold between 10-20% for\n", "validation (‘testing’). An even ‘stricter’ approach, in the sense of\n", "trying to ensure the robustness of your model against outlier effects,\n", "is [cross\n", "validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html)\n", "such as [k-folds\n", "cross-validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html).\n", "\n", "Sci-Kit Learn is probably *the* most important reason Python has become\n", "the *de fact* language of data science. Test/train-split is used when to\n", "avoid over-fitting when we are trying to **predict** something; so here\n", "Sci-Kit Learn *expects* that you’ll have an `X` which is your\n", "**predictors** (the inputs to your model) and a `y` which is the thing\n", "you’re **trying to predict** (because: $y = \\beta X + \\epsilon$).\n", "\n", "We’re not building a model here (that’s for Term 2!) so we’ll just\n", "‘pretend’ that we’re trying to predict the price of a listing and will\n", "set that up as our `y` data set *so that* we can see how the choice of\n", "normalisation/standardisation technique affects the robustness of the\n", "model against ‘new’ data. Notice too that you can pass a data frame\n", "directly to Sci-Kit Learn and it will split it for you.\n", "\n", "### 3.1 Reload\n", "\n", "> **Tip**\n", ">\n", "> In future 'runs' of this notebook you can now just pick up here and skip all of Task 1.\n", "\n", "On subsequent runs of this notebook you might just want to start here!\n", "\n", "``` python\n", "# Notice this handy code: we check if the data is already\n", "# in memory. And notice this is just a list comprehension\n", "# to see what is locally loaded.\n", "if 'gdf' not in locals():\n", " gdf = gpd.read_parquet(os.path.join('data','geo','MSOA_Atlas.geoparquet'))\n", "print(gdf.shape)\n", "```\n", "\n", "``` python\n", "categoricals = ['Borough','Subregion']\n", "for c in categoricals:\n", " gdf[c] = gdf[c].astype('category')\n", "```\n", "\n", "### 3.2 Split\n", "\n", "> **Difficulty: Low!**\n", "\n", "For our purposes this is a little bit overkill as you could also use\n", "pandas’ `sample(frac=0.2)` and the indexes, but it’s useful to see how\n", "this works. You use test/train split to get **four** data sets out: the\n", "training data gives you two (predictors + target as separate data sets)\n", "and the testing data gives you two as well (predictors + target as\n", "separate data sets). These are sized accoridng to the `test_size`\n", "specfied in the `test_train_split` parameters.\n", "\n", "``` python\n", "from sklearn.model_selection import train_test_split \n", "\n", "pdf = gdf['Median-2011'].copy() # pdf for Median *P*rice b/c we need *something*\n", "\n", "# df == *data* frame (a.k.a. predictors or independent variables)\n", "# pr == *predicted* value (a.k.a. 
dependent variable)\n", "# Notice we don't want the median price included in our training data\n", "df_train, df_test, pr_train, pr_test = train_test_split(\n", " gdf.drop(columns=['Median-2011']), pdf, test_size=0.2, random_state=44)\n", "```\n", "\n", "Below you should see that the data has been split roughly on the basis\n", "of the `test_size` parameter.\n", "\n", "``` python\n", "print(f\"Original data size: {gdf.shape[0]:,} x {gdf.shape[1]}\")\n", "print(f\" Training data size: {df_train.shape[0]:,} x {df_train.shape[1]} ({(df_train.shape[0]/gdf.shape[0])*100:.0f}%)\")\n", "print(f\" Testing data size: {df_test.shape[0]:,} x {df_test.shape[1]} ({(df_test.shape[0]/gdf.shape[0])*100:.0f}%)\")\n", "```\n", "\n", "Also notice the indexes of each pair of data sets match:\n", "\n", "``` python\n", "print(\", \".join([str(x) for x in df_train.index[:10]]))\n", "print(\", \".join([str(x) for x in pr_train.index[:10]]))\n", "```\n", "\n", "### 3.3 Plot Test/Train Data\n", "\n", "> **Difficulty: Low, but important!**\n", "\n", "``` python\n", "boros = gpd.read_file(os.path.join('data','geo','Boroughs.gpkg'))\n", "```\n", "\n", "``` python\n", "f,axes = plt.subplots(1,2, figsize=(12,5))\n", "df_train.plot(ax=???)\n", "df_test.plot(ax=???)\n", "boros.plot(ax=???, facecolor='none', edgecolor='r', linewidth=.5, alpha=0.4)\n", "boros.plot(ax=???, facecolor='none', edgecolor='r', linewidth=.5, alpha=0.4)\n", "axes[0].set_title('Training Data')\n", "axes[1].set_title('Testing Data');\n", "axes[0].set_ylim(150000,210000)\n", "axes[1].set_ylim(150000,210000)\n", "axes[0].set_xlim(500000,565000)\n", "axes[1].set_xlim(500000,565000)\n", "axes[1].set_yticks([]);\n", "```\n", "\n", "#### 3.3.1 Questions\n", "\n", "- Why might it be useful to produce a *map* of a test/train split?\n", "- Why might it matter *more* if you were dealing with user locations\n", " or behaviours?\n", "\n", "## 4. Normalisation\n", "\n", "The developers of [SciKit-Learn](https://scikit-learn.org/) define\n", "[normalisation](https://scikit-learn.org/stable/modules/preprocessing.html#normalization)\n", "as “scaling individual samples to have **unit norm**.” There are a *lot*\n", "of subtleties to this when you start dealing with ‘sparse’ data, but for\n", "the most part it’s worthwhile to think of this as a rescaling of the raw\n", "data to have similar ranges in order achieve some kind of comparison.\n", "This is such a common problem that sklearn offers a range of such\n", "(re)scalers including: `MinMaxScaler`.\n", "\n", "Let’s see what effect this has on the data!\n", "\n", "``` python\n", "# Sets some handy 'keywords' to tweak the Seaborn plot\n", "kwds = dict(s=7,alpha=0.95,edgecolor=\"none\")\n", "\n", "# Set the *hue order* so that all plots have the *same*\n", "# colour on the Subregion\n", "ho = ['Inner East','Inner West','Outer West and North West','Outer South','Outer East and North East']\n", "```\n", "\n", "### 4.1 Select Columns\n", "\n", "> **Difficulty: Low.**\n", "\n", "One thing you’ll need to explain is why I keep writing\n", "`df[cols+['Subregion']` and why I don’t just add it to the `cols`\n", "variable at the start? 
Don’t try to answer this now, get through the\n", "rest of Tasks 3 and 4 and see what you think.\n", "\n", "``` python\n", "cols = ['Tenure-Owned outright', 'Tenure-Owned with a mortgage or loan',\n", " 'Tenure-Social rented', 'Tenure-Private rented']\n", "```\n", "\n", "*Answer*: one part of the answer is that it makes it easy to change the\n", "columns we select without having to remember to keep `Subregion`, but\n", "the more important reason is that it allows us to re-use *this*\n", "‘definition’ of `cols` elsewhere throughout the rest of this practical\n", "without needing to remember to *remove* `Subregion`.\n", "\n", "``` python\n", "tr_raw = df_train[cols+['Subregion']].copy() # train raw\n", "tst_raw = df_test[cols+['Subregion']].copy() # test raw\n", "```\n", "\n", "### 4.2 Fit to Data\n", "\n", "> **Difficulty: Moderate if you want to understand what `reshape` is\n", "> doing.**\n", "\n", "Fit the training data:\n", "\n", "``` python\n", "from sklearn.preprocessing import MinMaxScaler\n", "\n", "# Notice what this is doing! See if you can explain it clearly.\n", "scalers = [???.fit(???[x].values.reshape(-1,1)) for x in cols]\n", "```\n", "\n", "### 4.3 Apply Transformations\n", "\n", "> **Difficulty: Moderate.**\n", "\n", "Train:\n", "\n", "``` python\n", "tr_normed = tr_raw.copy()\n", "\n", "for i, sc in enumerate(scalers):\n", " # Ditto this -- can you explain what this code is doing\n", " tr_normed[cols[i]] = sc.transform(df_???[cols[i]].values.reshape(-1,1))\n", "```\n", "\n", "Test:\n", "\n", "``` python\n", "tst_normed = tst_raw.copy()\n", "\n", "for i, sc in enumerate(scalers):\n", " tst_normed[cols[i]] = sc.transform(df_???[cols[i]].values.reshape(-1,1))\n", "```\n", "\n", "> **Note**\n", ">\n", "> **🔗 Connections**: You don't _have_ to fully understand the next section, but if you are able to get your head around it then this will seriously help you to prepare for the use of more advanced techniques in modelling and programming.\n", "\n", "Check out the properties of `tst_normed` below. If you’ve understood\n", "what the `MinMaxScaler` is doing then you should be able to spot\n", "something unexpected in the transformed *test* outputs. If you’ve\n", "*really* understood this, you’ll see why this result is problematic for\n", "*models*. *Hint*: one way to think of it is an issue of\n", "**extrapolation**.\n", "\n", "``` python\n", "for c in cols:\n", " print(f\" Minimum: {tst_normed[c].min():.4f}\")\n", " ???\n", "```\n", "\n", "### 4.4 Plot Distributions\n", "\n", "> **Difficulty: Moderate.**\n", "\n", "``` python\n", "tr_raw.columns = [re.sub('(-|/)',\"\\n\",x) for x in tr_raw.columns.values]\n", "tst_raw.columns = [re.sub('(-|/)',\"\\n\",x) for x in tst_raw.columns.values]\n", "tr_normed.columns = [re.sub('(-|/)',\"\\n\",x) for x in tr_normed.columns.values]\n", "tst_normed.columns = [re.sub('(-|/)',\"\\n\",x) for x in tst_normed.columns.values]\n", "```\n", "\n", "``` python\n", "sns.pairplot(data=tr_raw, hue='Subregion', diag_kind='kde', corner=True, plot_kws=kwds, hue_order=ho);\n", "```\n", "\n", "``` python\n", "sns.pairplot(data=tr_normed, hue='Subregion', diag_kind='kde', corner=True, plot_kws=kwds, hue_order=ho);\n", "```\n", "\n", "#### 4.4.1 Questions\n", "\n", "- Why do I keep writing `df[cols+['Subregion']`? 
Why I don’t just add\n", " Subregions to the `cols` variable at the start?\n", "- What has changed between the two plots (of `tr_raw` and\n", " `tr_normed`)?\n", "- What is the *potential* problem that the use of the transformer\n", " fitted on `tr_normed` to data from `tst_normed` might cause? *Hint*:\n", " this is why I asked you to investigate the data in the empty code\n", " cell above.\n", "- Can you explain what this is doing:\n", " `[MinMaxScaler().fit(df_train[x].values.reshape(-1,1)) for x in cols]`?\n", "- Can you explain what *this* is doing:\n", " `sc.transform(df_test[cols[i]].values.reshape(-1,1))`?\n", "\n", "## 5. Standardisation\n", "\n", "Standardisation is typically focussed on rescaling data to have a mean\n", "(or median) of 0 and standard deviation or IQR of 1. That these\n", "approaches are conceptually tied to the idea of symmetric, unimodal data\n", "such as that encountered in the standard normal distribution. Rather\n", "confusingly, many data scientists will use standardisation and\n", "normalisation largely interchangeably!\n", "\n", "``` python\n", "col = 'Vehicles-No cars or vans in hh'\n", "tr = df_train[[col]].copy()\n", "tst = df_test[[col]].copy()\n", "```\n", "\n", "### 5.1 Z-Score Standardisation\n", "\n", "> **Difficulty: Low.**\n", "\n", "``` python\n", "stsc = StandardScaler().fit(tr[col].values.reshape(-1,1))\n", "\n", "tr[f\"Z. {col}\"] = stsc.transform(???)\n", "tst[f\"Z. {col}\"] = stsc.transform(???)\n", "```\n", "\n", "### 5.2 Inter-Quartile Standardisation\n", "\n", "> **Difficulty: Low.**\n", "\n", "``` python\n", "rs = ???(quantile_range=(25.0, 75.0)).fit(???)\n", "\n", "tr[f\"IQR. {col}\"] = rs.transform(???)\n", "tst[f\"IQR. {col}\"] = rs.transform(???)\n", "```\n", "\n", "### 5.3 Plot Distributions\n", "\n", "> **Note**\n", ">\n", "> **🔗 Connections**: The point of these next plots is simply to show that *linear* transformations (which are 'reversible') is about changing things like the magnitude/scale of our data but doesn't fundamentally change relationships *between* observations.\n", "\n", "> **Difficulty: Low.**\n", "\n", "``` python\n", "sns.jointplot(data=tr, x=f\"{col}\", y=f\"Z. {col}\", kind='kde'); # hex probably not the best choice\n", "```\n", "\n", "``` python\n", "sns.jointplot(data=tr, x=f\"{col}\", y=f\"IQR. {col}\", kind='kde'); # hex probably not the best choice\n", "```\n", "\n", "``` python\n", "sns.jointplot(data=tr, x=f\"Z. {col}\", y=f\"IQR. {col}\", kind='hex'); # hex probably not the best choice\n", "```\n", "\n", "Perhaps a little more useful…\n", "\n", "``` python\n", "ax = sns.kdeplot(tr[f\"Z. {col}\"])\n", "sns.kdeplot(tr[f\"IQR. {col}\"], color='r', ax=ax)\n", "plt.legend(loc='upper right', labels=['Standard', 'Robust']) # title='Foo'\n", "ax.ticklabel_format(useOffset=False, style='plain')\n", "ax.set_xlabel(\"Standardised Value for No cars or vans in hh\")\n", "```\n", "\n", "#### 5.3.1 Questions?\n", "\n", "- Can you see the differences between these two rescalers?\n", "- Can you explain why you might want to choose one over the other?\n", "\n", "## 6. Non-Linear Transformations\n", "\n", "> **Note**\n", ">\n", "> **🔗 Connections**: Now *these* transformations are not directly revsersible because they change the actual relationships between observations in the data. We use these when we need to achieve particular 'behaviours' from our data in order to improve our model's 'predictive' ability. Even a simple regression is doing a kind of prediction ($y = \\beta X + \\epsilon$). 
You covered these concepts in [this week's lectures](https://jreades.github.io/fsds/sessions/week8.html#lectures).\n", "\n", "So transformations are useful when a data series has features that make\n", "comparisons or analysis difficult, or that affect our ability to intuit\n", "meaningful difference. By manipulating the data using one or more\n", "mathematical operations we can sometimes make it more *tractable* for\n", "subsequent analysis. In other words, it’s all about the *context* of our\n", "data.\n", "\n", "
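\n", "As a toy illustration of why this helps (made-up numbers, nothing to do with the MSOA data): a log transform pulls a heavily right-skewed range on to a far more comparable scale.\n", "\n", "``` python\n", "vals = np.array([10_000, 50_000, 1_000_000])  # heavily skewed: the largest value dominates\n", "np.log(vals)  # roughly 9.2, 10.8, and 13.8, which are much easier to compare\n", "```\n", "\n", "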
\n", "\n", "
How tall is tall?
\n", "
\n", "\n", "From above, we know the *Median Income* data are *not* normally\n", "distributed, but can we work out what distribution best represents\n", "*Median Income*? This can be done by comparing the shape of the\n", "histogram to the shapes of theoretical distributitions. For example:\n", "\n", "- the\n", " [log-normal](https://en.wikipedia.org/wiki/Log-normal_distribution)\n", " distribution\n", "- the\n", " [exponential](https://en.wikipedia.org/wiki/Exponential_distribution)\n", " distribution\n", "- the [Poisson](https://en.wikipedia.org/wiki/Poisson_distribution)\n", " distribution (for non-continuous data)\n", "\n", "From looking at those theoretical distributions, we might make an\n", "initial guess as to the type of distribution. There are actually *many*\n", "other distributions encountered in real life data, but these ones are\n", "particuarly common. A wider view of this would be that [quantile and\n", "power\n", "transformations](https://scikit-learn.org/stable/modules/preprocessing.html#non-linear-transformation)\n", "are ways of preserving the rank of values but lose many of the other\n", "features of the relationships that might be preserved by, for instance,\n", "the standard scaler.\n", "\n", "In the case of Median Income, taking a log-transform of the data might\n", "make it *appear* more normal: you do **not** say that a transformation\n", "*makes* data more normal, you say either that ‘it allows us to treat the\n", "data as normally distributed’ or that ‘the transformed data follows a\n", "log-normal distribution’.\n", "\n", "### 6.1 The Normal Distribution\n", "\n", "> **Difficulty: Moderate.**\n", "\n", "Z-scores are often associated with the normal distribution because their\n", "interpretation implicitly assumes a normal distribution. Or to put it\n", "another way… You can always calculate z-scores for your data (it’s just\n", "a formula applied to data points), but their *intuitive meaning* is lost\n", "if your data don’t have something like a normal distribution (or follow\n", "the [68-95-99.7 rule](https://en.wikipedia.org/wiki/68–95–99.7_rule)).\n", "\n", "But… what if our data are non-normal? Well, Just because data are\n", "non-normal doesn’t mean z-scores can’t be calculated (we already did\n", "that above); we just have to be careful what we do with them… and\n", "sometimes we should just avoid them entirely.\n", "\n", "#### 6.1.1 Creating a Normal Distribution\n", "\n", "Below is a function to create that theoretical normal distribution. See\n", "if you can understand what’s going and add comments to the code to\n", "explain what each line does.\n", "\n", "``` python\n", "def normal_from_dist(series): \n", " mu = ??? # mean of our data\n", " sd = ??? # standard deviation of our data\n", " n = ??? # count how many observations are in our data\n", " s = np.random.normal(???, ???, ???) #use the parameters of the data just calculated to generate n random numbers, drawn from a normal distributions \n", " return s #return this set of random numbers\n", "```\n", "\n", "To make it easier to understand what the function above is doing, let’s\n", "use it! We’ll use the function to plot both a distribution plot with\n", "both histogram and KDE for our data, and then add a *second* overplot\n", "distplot to the same fig showing the theoretical normal distribution (in\n", "red). 
We’ll do this in a loop for each of the three variables we want to\n", "examine.\n", "\n", "#### 6.1.2 Visual Comparisons\n", "\n", "Looking at the output, which of the variables has a roughly normal\n", "distribution? Another way to think about this question is, for which of\n", "the variables are the mean and standard deviation *most* appropriate as\n", "measures of centrality and spread?\n", "\n", "*Also*, how would you determine the *meaning* of some of the departures\n", "from the normal distribution?\n", "\n", "``` python\n", "selection = [x for x in df_train.columns.values if x.startswith('Composition')]\n", "\n", "for c in selection:\n", " ax = sns.kdeplot(df_train[c])\n", " sns.kdeplot(normal_from_dist(df_train[c]), color='r', fill=True, ax=ax)\n", " plt.legend(loc='upper right', labels=['Observed', 'Normal']) # title='Foo'\n", " ax.ticklabel_format(useOffset=False, style='plain')\n", " if ax.get_xlim()[1] > 999999:\n", " plt.xticks(rotation=45)\n", " plt.show()\n", "```\n", "\n", "#### 6.1.3 Questions\n", "\n", "- Which, if any, of the variables has a roughly normal distribution?\n", " Another way to think about this question is, for which of the\n", " variables are the mean and standard deviation *most* appropriate as\n", " measures of centrality and spread?\n", "- How might you determine the *significance* of some of the departures\n", " from the normal distribution?\n", "\n", "### 6.2 Logarithmic Transformations\n", "\n", "> **Difficulty: Moderate.**\n", "\n", "To create a new series in the data frame containing the natural log of\n", "the original value it’s a similar process to what we’ve done before, but\n", "since pandas doesn’t provide a log-transform operator (i.e. you can’t\n", "call `df['MedianIncome'].log()` ) we need to use the `numpy` package\n", "since pandas data series are just numpy arrays with some fancy window\n", "dressing that makes them even *more* useful.\n", "\n", "Let’s perform the transform then compare to the un-transformed data.\n", "Comment the code below to ensure that you understand what it is doing.\n", "\n", "#### 6.2.1 Apply and Plot\n", "\n", "``` python\n", "cols = ['Median-2012','Total Mean hh Income']\n", "\n", "for m in cols:\n", " s = df_train[m] # s == series\n", " ts = ???.???(s) # ts == transformed series\n", " \n", " ax = sns.kdeplot(s)\n", " sns.kdeplot(normal_from_dist(s), color='r', fill=True, ax=ax)\n", " plt.legend(loc='upper right', labels=['Observed', 'Normal']) # title also an option\n", " plt.title(\"Original Data\")\n", " \n", " ### USEFUL FORMATTING TRICKS ###\n", " # This turns off scientific notation in the ticklabels\n", " ax.ticklabel_format(useOffset=False, style='plain')\n", " # Notice this snippet of code\n", " ax.set_xlabel(ax.get_xlabel() + \" (Raw Distribution)\")\n", " # Notice this little code snippet too\n", " if ax.get_xlim()[1] > 999999:\n", " plt.xticks(rotation=45)\n", " \n", " plt.show()\n", " \n", " ax = sns.kdeplot(ts)\n", " sns.kdeplot(normal_from_dist(ts), color='r', fill=True, ax=ax)\n", " plt.legend(loc='upper right', labels=['Observed', 'Normal'])\n", " ax.ticklabel_format(useOffset=False, style='plain')\n", " ax.set_xlabel(ax.get_xlabel() + \" (Logged Distribution)\")\n", " if ax.get_xlim()[1] > 999999:\n", " plt.xticks(rotation=45)\n", " plt.title(\"Log-Transformed Data\")\n", " plt.show()\n", "```\n", "\n", "Hopefully, you can see that the transformed data do indeed look ‘more\n", "normal’; the peak of the red and blue lines are closer together and the\n", "blue line at the lower extreme is also 
closer to the red line, but we\n", "can check this by seeing what has happened to the z-scores.\n", "\n", "### 6.3 Power Transformations\n", "\n", "> **Difficulty: Moderate.**\n", "\n", "``` python\n", "cols = ['Median-2012','Total Mean hh Income']\n", "pt = ???(method='yeo-johnson')\n", "\n", "for m in cols:\n", " s = df_train[m] # s == series\n", " ts = pt.fit_transform(s.values.reshape(-1,1))\n", " print(f\"Using lambda (transform 'exponent') of {pt.lambdas_[0]:0.5f}\")\n", " \n", " ax = sns.kdeplot(ts.reshape(-1,))\n", " \n", " sns.kdeplot(normal_from_dist(???), color='r', fill=True, ax=ax)\n", " plt.legend(loc='upper right', labels=['Observed', 'Normal'])\n", " ax.ticklabel_format(useOffset=False, style='plain')\n", " ax.set_xlabel(m + \" (Transformed Distribution)\")\n", " if ax.get_xlim()[1] > 999999: # <-- What does this do?\n", " plt.xticks(rotation=45)\n", " plt.title(\"Power-Transformed Data\")\n", " plt.show();\n", "```\n", "\n", "## 7. Principal Components Analysis\n", "\n", "> **Note**\n", ">\n", "> **🔗 Connections**: This is all about _dimensionality_ and the different ways that we can reduce dimensionality in our data in order to improve our models' robustness and gain a better understanding of whether (or not) there is structure in our data. We'll start with linear decomposition and then look at non-linear decomposition, which we also demonstrated with UMAP last week.\n", "\n", "Now we’re going to ask the question: how can we represent our data using\n", "a smaller number of components that capture the variance in the original\n", "data. You should have covered PCA in Quantitative Methods.\n", "\n", "#### 7.0.1 Optional Reload\n", "\n", "Use this is your data gets messy…\n", "\n", "``` python\n", "gdf = gpd.read_parquet(os.path.join('data','geo','MSOA_Atlas.geoparquet')).set_index('MSOA Code')\n", "print(gdf.shape)\n", "```\n", "\n", "``` python\n", "categoricals = ['Borough','Subregion']\n", "for c in categoricals:\n", " gdf[c] = gdf[c].astype('category')\n", "```\n", "\n", "### 7.1 Calculating Shares\n", "\n", "> **Difficulty: Hard.**\n", "\n", "Sadly, there’s no transformer to work this out for you automatically,\n", "but let’s start by converting the raw population and household figures\n", "to shares so that our later dimensionality reduction steps aren’t\n", "impacted by the size of the MSOA.\n", "\n", "``` python\n", "gdf[['Age-All Ages','Households-All Households']].head(5)\n", "```\n", "\n", "#### 7.1.1 Specify Totals Columns\n", "\n", "``` python\n", "total_pop = gdf['Age-All Ages']\n", "total_hh = gdf['Households-All Households']\n", "total_vec = gdf['Vehicles-Sum of all cars or vans in the area']\n", "```\n", "\n", "#### 7.1.2 Specify Columns for Pop or HH Normalisation\n", "\n", "``` python\n", "pop_cols = ['Age-', 'Composition-', 'Qualifications-', 'Economic Activity-', 'White', 'Mixed/multiple',\n", " 'Asian/Asian British', 'Black/African', 'BAME', 'Other ethnic',\n", " 'Country of Birth-']\n", "hh_cols = [???, ???, ???, 'Detached', 'Semi-detached', 'Terraced', 'Flat, ']\n", "```\n", "\n", "``` python\n", "popre = re.compile(r'^(?:' + \"|\".join(pop_cols) + r')')\n", "hhre = re.compile(r'^(?:' + \"|\".join(???) 
+ r')')\n", "```\n", "\n", "#### 7.1.3 Apply to Columns\n", "\n", "``` python\n", "tr_gdf = gdf.copy()\n", "tr_gdf['Mean hh size'] = tr_gdf['Age-All Ages']/tr_gdf['Households-All Households']\n", "\n", "for c in gdf.columns:\n", " print(c)\n", " if popre.match(c):\n", " print(\" Normalising by total population.\")\n", " tr_gdf[c] = gdf[c]/???\n", " elif ???.match(???):\n", " print(\" Normalising by total households.\")\n", " tr_gdf[c] = gdf[c]/???\n", " elif c.startswith('Vehicles-') and not c.startswith('Vehicles-Cars per hh'):\n", " print(\" Normalising by total vehicles.\")\n", " tr_gdf[c] = gdf[c]/???\n", " else:\n", " print(\" Passing through.\")\n", "```\n", "\n", "### 7.2 Removing Columns\n", "\n", "> **Difficulty: Moderate.**\n", "\n", "To perform dimensionality we can only have numeric data. In theory,\n", "categorical data can be converted to numeric and retained, but there are\n", "two issues:\n", "\n", "1. Nominal data has no *innate* order so we *can’t* convert \\> 2\n", " categories to numbers and have to convert them to One-Hot Encoded\n", " values.\n", "2. A binary (i.e. One-Hot Encoded) variable will account for a *lot* of\n", " variance in the data because it’s only two values they are 0 and 1!\n", "\n", "So in practice, it’s probably a good idea to drop categorical data if\n", "you’re planning to use PCA.\n", "\n", "#### 7.2.1 Drop Totals Columns\n", "\n", "``` python\n", "pcadf = tr_gdf.drop(columns=['Age-All Ages', 'Households-All Households',\n", " 'Vehicles-Sum of all cars or vans in the area'])\n", "pcadf = pcadf.set_index('MSOA Code')\n", "```\n", "\n", "#### 7.2.2 Drop Non-Numeric Columns\n", "\n", "``` python\n", "pcadf.select_dtypes(['category','object']).columns\n", "```\n", "\n", "``` python\n", "pcadf.drop(columns=pcadf.select_dtypes(['category','object']).columns.to_list(), inplace=True)\n", "pcadf.drop(columns=['BNG_E','BNG_N','geometry', 'LONG', 'LAT','Shape__Are', 'Shape__Len'], inplace=True)\n", "```\n", "\n", "``` python\n", "pcadf.columns\n", "```\n", "\n", "### 7.3 Rescale & Reduce\n", "\n", "> **Difficulty: Moderate.**\n", "\n", "In order to ensure that our results aren’t dominated by a single scale\n", "(e.g. House Prices!) we need to rescale all of our data. 
You could\n", "easily try different scalers as well as a different parameters to see\n", "what effect this has on your results.\n", "\n", "#### 7.3.1 Robustly Rescale\n", "\n", "Set up the Robust Rescaler for inter-decile standardisation: 10th and\n", "90th quantiles.\n", "\n", "``` python\n", "rs = ???\n", "\n", "for c in pcadf.columns.values:\n", " pcadf[c] = rs.fit_transform(pcadf[c].values.reshape(-1, 1))\n", "```\n", "\n", "#### 7.3.2 PCA Reduce\n", "\n", "``` python\n", "from sklearn.decomposition import PCA \n", "\n", "pca = PCA(n_components=50, whiten=True) \n", "\n", "pca.fit(pcadf)\n", "\n", "explained_variance = pca.explained_variance_ratio_\n", "singular_values = pca.singular_values_\n", "```\n", "\n", "#### 7.3.3 Examine Explained Variance\n", "\n", "``` python\n", "x = np.arange(1,???)\n", "plt.plot(x, explained_variance)\n", "plt.ylabel('Share of Variance Explained')\n", "plt.show()\n", "```\n", "\n", "``` python\n", "for i in range(0, 20):\n", " print(f\"Component {i:>2} accounts for {explained_variance[i]*100:>2.2f}% of variance\")\n", "```\n", "\n", "You should get that Component 0 accounts for 31.35% of the variance and\n", "Component 19 accounts for 0.37%.\n", "\n", "###: How Many Components?\n", "\n", "There are a number of ways that we could set a threshold for\n", "dimensionality reduction: - The most common is to look for the ‘knee’ in\n", "the Explained Variance plot above. That would put us at about 5 retained\n", "components. - Another is to just keep all components contributing more\n", "than 1% of the variance. That would put us at about 10 components. - You\n", "can also ([I\n", "discovered](https://medium.com/@nikolay.oskolkov/hi-jon-reades-my-sincere-apologies-for-this-very-late-reply-444f57054d14))\n", "look to shuffle the data and repeatedly perform PCA to build confidence\n", "intervals. I have not implemented this (yet).\n", "\n", "In order to *do* anything with these components we need to somehow\n", "reattach them to the MSOAs. So that entails taking the transformed\n", "results (`X_train` and `X_test`)\n", "\n", "``` python\n", "kn = knee_locator.KneeLocator(x, explained_variance, \n", " curve='convex', direction='decreasing', \n", " interp_method='interp1d')\n", "print(f\"Knee detected at: {kn.knee}\")\n", "kn.plot_knee()\n", "```\n", "\n", "``` python\n", "keep_n_components = 7\n", "\n", "# If we weren't changing the number of components we\n", "# could re-use the pca object created above. \n", "pca = PCA(n_components=keep_n_components, whiten=True)\n", "\n", "X_train = pca.fit_transform(???)\n", "\n", "# Notice that we get the _same_ values out,\n", "# so this is a *deterministic* process that\n", "# is fully replicable (allowing for algorithmic\n", "# and programming language differences).\n", "print(f\"Total explained variance: {pca.explained_variance_ratio_.sum()*100:2.2f}%\")\n", "for i in range(0, keep_n_components):\n", " print(f\" Component {i:>2} accounts for {pca.explained_variance_ratio_[i]*100:>5.2f}% of variance\")\n", "\n", "# Notice...\n", "print(f\"X-train shape: {len(X_train)}\")\n", "print(f\"PCA df shape: {pcadf.shape[0]}\")\n", "# So each observation has a row in X_train and there is \n", "# 1 column for each component. 
This defines the mapping\n", "# of the original data space into the reduced one\n", "print(f\"Row 0 of X-train contains {len(X_train[0])} elements.\") \n", "```\n", "\n", "### 7.4 Components to Columns\n", "\n", "> **Difficulty: Moderate.**\n", "\n", "You could actually do this more quickly (but less clearly) using\n", "`X_train.T` to transpose the matrix!\n", "\n", "``` python\n", "for i in range(0,keep_n_components):\n", " s = pd.Series(X_train[:,???], index=pcadf.???)\n", " pcadf[f\"Component {i+1}\"] = s\n", "```\n", "\n", "``` python\n", "pcadf.sample(3).iloc[:,-10:-4]\n", "```\n", "\n", "### 7.5 (Re)Attaching GeoData\n", "\n", "> **Difficulty: Moderate.**\n", "\n", "``` python\n", "msoas = gpd.read_file(os.path.join('data','geo','Middle_Layer_Super_Output_Areas__December_2011__EW_BGC_V2-shp.zip'), driver='ESRI Shapefile')\n", "msoas = msoas.set_index('MSOA11CD')\n", "print(msoas.columns)\n", "```\n", "\n", "``` python\n", "msoas.head(1)\n", "```\n", "\n", "``` python\n", "pcadf.head(1)\n", "```\n", "\n", "``` python\n", "gpcadf = pd.merge(msoas.set_index(['MSOA11CD'], drop=True), pcadf, left_index=True, right_index=True, how='inner')\n", "print(f\"Geo-PCA df has shape {gpcadf.shape[0]} x {gpcadf.shape[1]}\")\n", "```\n", "\n", "You should get `PCA df has shape 983 x 89`.\n", "\n", "``` python\n", "gpcadf['Borough'] = gpcadf.MSOA11NM.apply(???)\n", "```\n", "\n", "### 7.6 Map the First *n* Components\n", "\n", "> **Difficulty: Moderate.**\n", "\n", "How would you automate this so that the loop creates one plot for each\n", "of the first 3 components? How do you interpret these?\n", "\n", "``` python\n", "for comp in [f\"Component {x}\" for x in range(1,3)]:\n", " ax = gpcadf.plot(column=???, cmap='plasma', \n", " scheme='FisherJenks', k=7, edgecolor='None', legend=True, figsize=(9,7));\n", " boros.plot(ax=ax, edgecolor='w', facecolor='none', linewidth=1, alpha=0.7)\n", " ax.set_title(f'PCA {comp}')\n", "```\n", "\n", "Your first component map should look something like this:\n", "\n", "
\n", "\n", "
PCA Component 1
\n", "
\n", "\n", "## 8. UMAP\n", "\n", "UMAP is a non-linear dimensionality reduction technique. Technically,\n", "it’s called *manifold* learning: imagine being able to roll a piece of\n", "paper up in more than just the 3rd dimension…). As a way to see if there\n", "is structure in your data this is a *much* better technique than one you\n", "might encounter in many tutorials: *t-SNE*. It has to do with how the\n", "two techniques ‘learn’ the manifold to use with your data.\n", "\n", "### 8.1 UMAP on Raw Data\n", "\n", "> **Difficulty: Hard.**\n", "\n", "``` python\n", "from umap import UMAP\n", "\n", "# You might want to experiment with all\n", "# 3 of these values -- it may make sense \n", "# to package a lot of this up into a function!\n", "keep_dims=2\n", "rs=42\n", "\n", "u = UMAP(\n", " n_neighbors=25,\n", " min_dist=0.01,\n", " n_components=keep_dims,\n", " random_state=rs)\n", "\n", "X_embedded = u.fit_transform(???)\n", "print(X_embedded.shape)\n", "```\n", "\n", "### 8.2 Write to Data Frame\n", "\n", "> **Difficulty: Low.**\n", "\n", "Can probably also be solved using `X_embedded.T`.\n", "\n", "``` python\n", "for ix in range(0,X_embedded.shape[1]):\n", " print(ix)\n", " s = pd.Series(X_embedded[:,???], index=pcadf.???)\n", " gpcadf[f\"Dimension {ix+1}\"] = s\n", "```\n", "\n", "### 8.3 Visualise!\n", "\n", "> **Difficulty: Low.**\n", "\n", "``` python\n", "rddf = gpcadf.copy() # Reduced Dimension Data Frame\n", "```\n", "\n", "#### 8.3.1 Simple Scatter\n", "\n", "``` python\n", "f,ax = plt.subplots(1,1,figsize=(8,6))\n", "sns.scatterplot(x=rddf[???], y=rddf[???], hue=rddf['Borough'], legend=False, ax=ax)\n", "```\n", "\n", "#### 8.3.2 Seaborn Jointplot\n", "\n", "That is *suggestive* of there being struccture in the data, but with 983\n", "data points and 33 colours it’s hard to make sense of what the structure\n", "*might* imply. Let’s try this again using the Subregion instead and\n", "taking advantage of the Seaborn visualisation library’s `jointplot`\n", "(joint distribution plot):\n", "\n", "``` python\n", "rddf['Subregion'] = rddf.Borough.apply(lambda x: mapping[x])\n", "```\n", "\n", "``` python\n", "# Sets some handy 'keywords' to tweak the Seaborn plot\n", "kwds = dict(s=7,alpha=0.95,edgecolor=\"none\")\n", "# Set the *hue order* so that all plots have some colouring by Subregion\n", "ho = ['Inner East','Inner West','Outer West and North West','Outer South','Outer East and North East']\n", "```\n", "\n", "``` python\n", "g = sns.jointplot(data=rddf, x=???, y=???, height=8, \n", " hue=???, hue_order=ho, joint_kws=kwds)\n", "g.ax_joint.legend(loc='upper right', prop={'size': 8});\n", "```\n", "\n", "Your jointplot should look like this:\n", "\n", "
\n", "\n", "
UMAP Jointplot
\n", "
\n", "\n", "What do you make of this?\n", "\n", "Maybe let’s give this one last go splitting the plot out by subregion so\n", "that we can see how these vary:\n", "\n", "``` python\n", "for r in rddf.Subregion.unique():\n", " g = sns.jointplot(data=rddf[rddf.Subregion==r], x='Dimension 1', y='Dimension 2', \n", " hue='Borough', joint_kws=kwds)\n", " g.ax_joint.legend(loc='upper right', prop={'size': 8});\n", " g.ax_joint.set_ylim(0,15)\n", " g.ax_joint.set_xlim(0,15)\n", " plt.suptitle(r)\n", "```\n", "\n", "We can’t unfortunately do any clustering at this point to create groups\n", "from the data (that’s next week!) so for now note that there are several\n", "large-ish groups (in terms of membership) and few small ones picked up\n", "by t-SNE. Alos note that there is strong evidence of some incipient\n", "structure: Inner East and West largely clump together, while Outher East\n", "and Outer South also seem to group together, with Outer West being more\n", "distinctive. If you look back at the PCA Components (especially #1) you\n", "might be able to speculate about some reasons for this! Please note:\n", "this is *only* speculation at this time!\n", "\n", "Next week we’ll also add the listings data back in as part of the\n", "picture!\n", "\n", "### 8.4 Map the *n* Dimensions\n", "\n", "> **Difficulty: Low.**\n", "\n", "``` python\n", "for comp in [f\"Dimension {x}\" for x in range(1,3)]:\n", " f, ax = plt.subplots(1,1,figsize=(12,8))\n", " rddf.plot(???);\n", " boros.plot(edgecolor='w', facecolor='none', linewidth=1, alpha=0.7, ax=ax)\n", " ax.set_title(f'UMAP {comp}')\n", "```\n", "\n", "Your first dimension map should look something like this:\n", "\n", "
\n", "\n", "
UMAP Dimension 1
\n", "
\n", "\n", "### 8.5 And Save\n", "\n", "``` python\n", "rddf.to_parquet(os.path.join('data','clean','Reduced_Dimension_Data.geoparquet'))\n", "```\n", "\n", "#### 8.5.1 Questions\n", "\n", "- How would you compare/contrast PCA components with UMAP dimensions?\n", " Why do they not seem to show the same thing even though *both* seem\n", " to show *something*?\n", "- What might you do with the output of either the PCA or UMAP\n", " processes?\n", "\n", "### 8.6 Credits!\n", "\n", "##### 8.6.0.1 Contributors:\n", "\n", "The following individuals have contributed to these teaching materials:\n", "Jon Reades (j.reades@ucl.ac.uk).\n", "\n", "##### 8.6.0.2 License\n", "\n", "These teaching materials are licensed under a mix of [The MIT\n", "License](https://opensource.org/licenses/mit-license.php) and the\n", "[Creative Commons Attribution-NonCommercial-ShareAlike 4.0\n", "license](https://creativecommons.org/licenses/by-nc-sa/4.0/).\n", "\n", "##### 8.6.0.3 Potential Dependencies:\n", "\n", "This notebook may depend on the following libraries: pandas, geopandas,\n", "sklearn, matplotlib, seaborn" ], "id": "ab725679-553f-4099-9321-ea455329e716" } ], "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "python3", "display_name": "Python 3 (ipykernel)", "language": "python", "path": "/Users/jreades/anaconda3/envs/sds/share/jupyter/kernels/python3" }, "language_info": { "name": "python", "codemirror_mode": { "name": "ipython", "version": "3" }, "file_extension": ".py", "mimetype": "text/x-python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.2" } } }