import os
import numpy as np
import pandas as pd
Practical 6: Numeric Data
Easing into EDA with Pandas
This session is a tour-de-pandas
; since this is Python’s equivalent of the tidyverse
meets data.tables
it is fundamental to the data science ecosystem and is probably one of the most-widely used libraries in the language as a whole. I get more than 286,000 questions tagged with pandas on StackOverflow.
This week we are also going to start looking at the InsideAirbnb data which forms the core of the work that we do over the rest of the term. The focus of this notebook is simple numeric data: no mapping or text data… yet… and direct manipulation of data types, derivation of summary statistics, and simple plotting.
We hope that you will be able to draw on the past few practical sessions to develop a more intuitive understanding of how to interact with pandas since it supports both a ‘dictionary-of-lists’ style of interaction and a methods-based style of interaction with the ‘Data Frame’.
Conceptually, this practical links together all of the preceding ones; you will find data structures, classes and methods, reading CSV files from a remote location, numpy
, and more than you ever wanted to know about data types in Python. Making these connections will make the remainder of term much, much easier, so it might be worth revising this practical over Reading Week so make sure it all makes sense!
The Importance of EDA
After a few weeks getting to grips with Python, we’re now going to start working with some real data. One of the first things that we do when working with any new data set is to familiarise ourselves with it. There are a huge number of ways to do this, but there are no shortcuts to:
- Reading about the data (how it was collected, what the sample size was, etc.)
- Reviewing any accompanying metadata (data about the data, column specs, etc.)
- Looking at the data itself at the row- and column-levels
- Producing descriptive statistics
- Visualising the data using plots
You should use all of these together to really understand where the data came from, how it was handled, and whether there are gaps or other problems. If you’re wondering which comes first, the concept of start with a chart is always good… though we’ve obviously not quite gotten there yet! This week we want you to get a handle on pandas itself, so although we will do some plotting of charts, we’ll focus on 3-4 with a tiny bit of 5. There will be much more on plotting charts next week, and you should be looking into 1 and 2 yourself based on what’s been written both on the Inside Airbnb web site and in the suggested readings.
So although they don’t need to be done now, you probably want to add both those links to your reading list!
Preamble
It’s always sensible to import packages these at the top of the notebook:
- Because it lets everyone know what they need to have installed to run your code.
- It’s easy to run this and then skip further down the notebook if you have already done some of the work and saved an intermediate output.
Beyond what we provide below there are numerous useful introductions; one of our favourites is from Greg Reda, and there are some good videos on our YouTube channel. And of course, there’s TONS of stuff on StackOverflow. If you want an actual physical book, you might try McKinney (2017).
However, one thing you will really want to bookmark is the official documentation since you will undoubtedly need to refer to it fairly regularly. Note: this link is to the most recent release. Over time there will be updates published and you may find that you no longer have the most up-to-date version. If you find that you are now using an older version of pandas and the methods have changed then you’ll need to track down the specific version of the documentation that you need from the home page.
You can always check what version you have installed like this:
print(pd.__version__)
2.2.3
The <package_name>.__version__
approach isn’t guaranteed to work with every package, but it will work with most of them. Remember that variables and methods starting and ending with ‘__
’ are private and any interaction with them should be approached very, very carefully.
Reading and Writing Data
Pandas can do a lot, and you might be feeling a little intimidated by this, but here’s the thing: we were already writing something like pandas from scratch! That’s because pandas takes a column-view of data in the same way that our Dictionary-of-Lists did, it’s just that it’s got a lot more features than our ‘simple’ tool did. That’s why the documentation is so much more forbidding and why pandas is so much more powerful.
But at its heart, a pandas Data Frame
(df
for short) is a collection of Data Series
objects (i.e. columns) with an index. Each Series is like one of our column-lists from the last notebook. And the df
is like the dictionary that held the data together. So you’ve seen this before and you already know what’s going on… or at least you now have an analogy that you can use to make sense of pandas:
= {
myDataFrame '<column_name_1>': <Series_1>,
'<column_name_2>': <Series_2>,
'<column_name_3>': <Series_3>
}
And pandas gives us two ways to access that data:
- Using a method syntax:
myDataFrame.column_name_1
- Using a dictionary syntax:
myDataFrame['column_name_1']
Depending on which syntax you prefer, you can use these interchangeably. The only times you have to choose one over the other are:
- Assignment (e.g.
myDataFrame['column_name_1'] = ...
); - Columns with spaces in their names (e.g.
myDataFrame['Column Name 1')
).
Reading Remote Data
You will need to do several things here to read the remote, compressed CSV file specified by url
into a data frame called df
. Setting low_memory=False
ensures that pandas will try to load the entire data set before guessing the data format! Obviously, with very large files this is probably a bad idea and it’s possible to force a particular column type while readng in the data as well. For larger data sets there are platforms like Dask (see, eg, this), and beyond that are other options.
# Set download URL
= '20240614'
ymd = 'London'
city = 'https://orca.casa.ucl.ac.uk'
host = f'{host}/~jreades/data/{ymd}-{city}-listings.csv.gz' url
Question
# your code here
= pd.read_csv(??, compression='gzip', low_memory=False)
df print(f"Data frame is {df.shape[0]:,} x {df.shape[1]}")
You should get a data frame containing 75 columns and 93,486 rows of data.
Inspecting the Data Frame
Let’s get a general sense of the data by printing out information about the data frame. There are several ways to do this (and we’ll see another futher on):
df.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)
– descriptive stats for all numeric columnsdf.info(verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=None)
– summarises all columns, but without distribution informationdf.memory_usage(index=True, deep=True)
– memory usage details about each column (can be quite slow as it’s doing a lot of digging)
Question
What is another term for the 0.5 percentile?
Describing
Describing a data frame provides general information about numeric columns, such as the median, IQR, or number of discrete values.
So to show the 5th and 95th percentiles you need to pass an argument to describe
to override the default report from pandas:
Question
=[??]) df.describe(percentiles
Info
The info
method provides a more system-oriented view of the data frame, helping you to understand what each column is composed of, how many NAs there might be, and some high-level (but often incomplete) data on performance.
=True) df.info(verbose
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93486 entries, 0 to 93485
Data columns (total 75 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 93481 non-null float64
1 listing_url 93485 non-null object
2 scrape_id 93485 non-null object
3 last_scraped 93485 non-null object
4 source 93486 non-null object
5 name 93486 non-null object
6 description 90297 non-null object
7 neighborhood_overview 46673 non-null object
8 picture_url 93477 non-null object
9 host_id 93486 non-null float64
10 host_url 93486 non-null object
11 host_name 93480 non-null object
12 host_since 93480 non-null object
13 host_location 72291 non-null object
14 host_about 47802 non-null object
15 host_response_time 62226 non-null object
16 host_response_rate 62231 non-null object
17 host_acceptance_rate 67455 non-null object
18 host_is_superhost 93022 non-null object
19 host_thumbnail_url 93475 non-null object
20 host_picture_url 93475 non-null object
21 host_neighbourhood 46906 non-null object
22 host_listings_count 93475 non-null object
23 host_total_listings_count 93475 non-null float64
24 host_verifications 93474 non-null object
25 host_has_profile_pic 93475 non-null object
26 host_identity_verified 93474 non-null object
27 neighbourhood 46675 non-null object
28 neighbourhood_cleansed 93481 non-null object
29 neighbourhood_group_cleansed 5 non-null float64
30 latitude 93481 non-null float64
31 longitude 93481 non-null float64
32 property_type 93481 non-null object
33 room_type 93481 non-null object
34 accommodates 93481 non-null float64
35 bathrooms 61355 non-null float64
36 bathrooms_text 93331 non-null object
37 bedrooms 81800 non-null float64
38 beds 61289 non-null float64
39 amenities 93481 non-null object
40 price 61423 non-null object
41 minimum_nights 93481 non-null object
42 maximum_nights 93481 non-null float64
43 minimum_minimum_nights 93480 non-null float64
44 maximum_minimum_nights 93480 non-null float64
45 minimum_maximum_nights 93479 non-null object
46 maximum_maximum_nights 93479 non-null object
47 minimum_nights_avg_ntm 93479 non-null float64
48 maximum_nights_avg_ntm 93479 non-null float64
49 calendar_updated 4 non-null float64
50 has_availability 89394 non-null object
51 availability_30 93480 non-null float64
52 availability_60 93480 non-null float64
53 availability_90 93480 non-null float64
54 availability_365 93476 non-null float64
55 calendar_last_scraped 93481 non-null object
56 number_of_reviews 93481 non-null float64
57 number_of_reviews_ltm 93481 non-null float64
58 number_of_reviews_l30d 93481 non-null float64
59 first_review 68740 non-null object
60 last_review 68739 non-null object
61 review_scores_rating 68735 non-null float64
62 review_scores_accuracy 68657 non-null float64
63 review_scores_cleanliness 68664 non-null float64
64 review_scores_checkin 68627 non-null float64
65 review_scores_communication 68652 non-null float64
66 review_scores_location 68626 non-null float64
67 review_scores_value 68627 non-null float64
68 license 0 non-null float64
69 instant_bookable 93476 non-null object
70 calculated_host_listings_count 93476 non-null float64
71 calculated_host_listings_count_entire_homes 93476 non-null float64
72 calculated_host_listings_count_private_rooms 93476 non-null float64
73 calculated_host_listings_count_shared_rooms 93476 non-null float64
74 reviews_per_month 68735 non-null float64
dtypes: float64(36), object(39)
memory usage: 53.5+ MB
You should get that the data frame has a mix of float64
, int
, and object
(text) columns and that some columns contain many nulls. You will also get an estimate of memory usage that may differ substantially from the more complete picture provided below, which suggests a ‘true’ value of 398MB.
Memory Usage
If you really need to get into the ‘weeds’ and profile your data frame because you are crashing Python and seeing messages about ‘core dumped’, or seeing appallingly poor performance, then memory_usage
is the way to go:
=True, deep=True) df.memory_usage(index
Index 132
id 747888
listing_url 9290686
scrape_id 6637426
last_scraped 6263498
...
calculated_host_listings_count 747888
calculated_host_listings_count_entire_homes 747888
calculated_host_listings_count_private_rooms 747888
calculated_host_listings_count_shared_rooms 747888
reviews_per_month 747888
Length: 76, dtype: int64
You should see that the data frame uses 417,473,797 bytes of memory, but the really important thing to note here is the difference between string
and other types of data: keeping data as raw strings (instead of converting to categories, for instance) uses up a lot more memory and this can have a huge impact on the performance of your code.
Printing the Columns
Finally, I find it very useful to be able to quickly print out a list of the columns without all of the details shown above. You just need to print the columns as a list:
print(df.columns.to_list())
['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name', 'description', 'neighborhood_overview', 'picture_url', 'host_id', 'host_url', 'host_name', 'host_since', 'host_location', 'host_about', 'host_response_time', 'host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood', 'host_listings_count', 'host_total_listings_count', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'neighbourhood', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude', 'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability', 'availability_30', 'availability_60', 'availability_90', 'availability_365', 'calendar_last_scraped', 'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d', 'first_review', 'last_review', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'license', 'instant_bookable', 'calculated_host_listings_count', 'calculated_host_listings_count_entire_homes', 'calculated_host_listings_count_private_rooms', 'calculated_host_listings_count_shared_rooms', 'reviews_per_month']
You should get a list showing every single column. If you get Index(['id', 'listing_url',...], dtype='object')
then you have printed the column index object and you to need to tell the object to convert its output to a list (hint: Google).
Saving the File Locally
Now save the file somewhere local so that you don’t have to keep downloading 40MB of compressed data every time you want to start the practical. We’ll be using this data for the rest of term, so you might as well save yourself some time and bandwidth! We’ll talk more about data processing pipelines over the course of the term, but I’d suggest putting this data set into a data/raw
folder because then you can have directories like data/clean
and data/analytical
as you move through the process of cleaning and prepping your data for analysis.
= os.path.join('data','raw') # A default location to save raw data
path = url.split('/')[-1] # What does this do?
fn print(f"Writing to: {fn}")
Writing to: 20240614-London-listings.csv.gz
if not os.path.exists(path): # And what does *this* do?
print(f"Creating {path} under {os.getcwd()}")
os.makedirs(path)
if not os.path.exists(os.path.join(path,fn)):
=False)
df.to_csv(os.path.join(path,fn), indexprint("Done.")
Managing Your Data
When starting out it’s common to think in terms of there being one input (the raw data) and one output (the results) to an analysis. In practice, you will have many intermediate outputs used as ‘milestones’ in the overall analysis:
- You might have a ‘canonical’ data file that has dealt with formatting issues and converted the columns to appropriate data types.
- You might have a ‘clean’ data file that has dealt with observations that seem to be incomplete or otherwise improperly formatted.
- You might generate subsets by region or area.
- You might produce an ‘analytical’ or ‘final’ data set appropriate to a specific analysis.
Most importantly, if you are going to run the same analysis multiple times using data from different time periods (e.g. Land Registry’s Price Paid Data is updated every month) then you will multiple versions of each of each of the above.
But, in addition, you might also be working with such a large data set that processing the entire thing every time you want to do some development work is impractical: do you want to load 1 billion rows only to find out that you needed 1,000 of them or that one of your columns is incorrectly formatted?
So although you could do the next few steps as part of loading the raw data, I always prefer to keep the original data set handy since I almost always discover that there are fields I didn’t realise I needed when I started my work.
So my approach to coding is usually:
- Download the raw file and save it locally in a
data/raw
directory. - Load the first
nrows
of data so that I can quickly:- Check that the specification matches the data and select columns/rows accordingly.
- Identify obviously invalid rows/columns and investigate further.
- Check the code to fix data types and (where relevant) values works.
- Write this new, smaller file (\(m` << m\) and \(n` << n\)) out to a
data/clean
ordata/canonical
directory (depending on whether formatting the columns is so complex or takes so long on a large data set that it needs to be separated out from actual cleaning). - Test out some initial ideas for further analysis.
- Re-run the code (remove the
nrows
limit) using the full data set.
Although the code here is simple, the logic is not.
File Names
You should always be looking for ways to avoid hard-coding values that might change over time, especially those linked to the date of the data file.
In this case you might try to work out how to make it easy to update the code to download the latest file. For instance, if the file looks like 2022-09-10-listings.csv.gz
then I might well specify the url
as {date}-listings.csv.gz
or {year}-{month}-{day}-listings.csv.gz
and set up the variables that I need beforehand or in a separate file.
Using parameters makes it easier to write robust code that doesn’t have unwanted side-effects. Here’s a common one: you write code to download and process a file named 20221111-data.csv.gz
. After doing all the steps in Tasks 2 and 3 below you save it to clean-data.csv.gz
.
Question
What happens when your boss asks you to process 20221211-data.csv.gz
?
File Loading
Now let’s write something that will allow us to more quickly write our code and validate the results in exploratory phase. For simplicity I’ve called this ‘testing’, but you could also think of it as ‘dev’ mode. What we want is to be able to easily swap between testing and operational contexts using a ‘switch’ (typically, a Boolean value) and limit the data load in testing mode.
To achieve this you could set pandas to:
- Load only the first 10,000 rows using
nrows
if we are testing - Use the columns specified in
cols
- Allow pandas to load the entire data set before deciding on the column type by setting
low_memory
appropriately.
Row Subsetting
Let’s tackle the rows problem first:
Question
= True
testing
if testing:
= pd.read_csv(os.path.join(path,fn),
df =??, ??)
low_memoryelse:
= pd.read_csv(os.path.join(path,fn),
df =??)
low_memory
print(f"Data frame is {df.shape[0]:,} x {df.shape[1]}")
So notice how this code deliberately works the same for either testing or operational execution – we just flip between the option by changing the testing
variable from True
to False
!
To make this more robust and useful we could use this testing
variable throughout our code if we wanted to change other behaviours based on development/deployment context. The state of the switch could then be set globally using an external configuration file (usually just called a ‘conf file’). The easiest way to do this is to have a conf.py
which contains your global parameters and then every script or notebook file reads in the configuration and sets these variables.
Something like:
= False testing
And:
from conf import *
Column Subsetting
Now let’s tackle the column problem… In order to avoid having to load lots of data that we aren’t sure we need yet, we can restrict the columns that we load. We got cols
below by copying the output of (df.columns.to_list()
and then removing the fields that we thought we weren’t interested in.
= ['id', 'listing_url', 'last_scraped', 'name', 'description', 'host_id',
cols 'host_name', 'host_since', 'host_location', 'host_about', 'host_is_superhost',
'host_listings_count', 'host_total_listings_count', 'host_verifications',
'latitude', 'longitude', 'property_type', 'room_type', 'accommodates',
'bathrooms', 'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
'minimum_nights', 'maximum_nights', 'availability_365', 'number_of_reviews',
'first_review', 'last_review', 'review_scores_rating', 'license',
'reviews_per_month']
print(f"Cols contains {len(cols)} columns.")
Cols contains 34 columns.
So let’s extend our previous answer
Question
= True
testing
if testing:
= pd.read_csv(os.path.join(path,fn),
df =False, nrows=10000, ??)
low_memoryelse:
= pd.read_csv(os.path.join(path,fn),
df =False, ??)
low_memory
print(f"Data frame is {df.shape[0]:,} x {df.shape[1]}")
Releasing Memory
A particular risk when working with Jupyter notebooks is that you either: a) have run code in an order other than the order shown in the notebook; or b) have made edits to code but not re-run the changed code. So you’re still working from code that is no longer visible!
When that happens you can get very confusing issues because what you see doesn’t square with what the computer has executed. To resolve this without having to re-run the entire notebook (though that can also be a good choice!) you might want to ‘delete’ the current object and re-load or re-run the relevant data or code.
del(df)
So we use del(df)
to ensure that we aren’t accidentally using the ‘old’ data frame. But another good reason to delete data you’re no longer using is to free up memory.
Exploring Your Data
Let’s start over from the saved data:
= pd.read_csv(os.path.join(path,fn),
df =False, usecols=cols) low_memory
Selecting Rows
You will want to refer to the Randomness lecture to understand how we can select the same random sample each time and to the session on Logic lecture to cover NaN
s and NA
s.
I often like to start my EDA by simply printing out randomly-selected rows to get a feel for what’s in the data. Does what I see square with what I read in the documentation? What does the name
look like? What do I see in last_scraped
and is it a sensible? What’s the id
field for?
3) df.sample(
id | listing_url | last_scraped | name | description | host_id | host_name | host_since | host_location | host_about | ... | price | minimum_nights | maximum_nights | availability_365 | number_of_reviews | first_review | last_review | review_scores_rating | license | reviews_per_month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
26908 | 2.742332e+07 | https://www.airbnb.com/rooms/27423325 | 2023-09-07 | Serviced apartment in London · 2 bedrooms · 2 ... | <b>The space</b><br />The 2 Two Bedroom apart... | 194048242.0 | Ashburn | 2018-06-07 | NaN | NaN | ... | $511.00 | 1 | 1125.0 | 201.0 | 1.0 | 2022-08-27 | 2022-08-27 | 5.00 | NaN | 0.08 |
76365 | 9.073725e+17 | https://www.airbnb.com/rooms/907372454601544521 | 2023-09-06 | Condo in Greater London · ★4.45 · 1 bedroom · ... | This stylish place to stay is perfect for grou... | 81827675.0 | Ayoob | 2016-07-05 | London, United Kingdom | I live and work in central London and enjoy h... | ... | $149.00 | 1 | 365.0 | 293.0 | 11.0 | 2023-06-08 | 2023-08-29 | 4.45 | NaN | 3.63 |
29748 | 3.079404e+07 | https://www.airbnb.com/rooms/30794039 | 2023-09-07 | Rental unit in London · 1 bedroom · 1 bed · 1 ... | NaN | 118972937.0 | Jane | 2017-03-03 | Abbots Leigh, United Kingdom | Brown hair\nFemale \nRun my own company | ... | $249.00 | 1 | 1125.0 | 0.0 | 2.0 | 2018-12-18 | 2023-07-12 | 5.00 | NaN | 0.03 |
3 rows × 34 columns
See if you can work out from the documentation (Google search time!) how to get the same ‘random’ sample every time you re-run this code block:
Question
3, ??) df.sample(
Selecting Columns
If you look very closely, you’ll see that pandas isn’t showing you the full range of columns since there are 34! If you’d like to only look at specific columns then you can specify them after the sample method call using what looks like a nested list: [[<column names as strings>]]
.
I’d like you to sample 3 random rows, selecting the ‘latitude’, ‘longitude’, ‘license’, ‘property_type’, ‘room_type’ and ‘price’ columns only.
Question
df.sample(??)[??]
Your answer should look like this:
latitude | longitude | license | property_type | room_type | price | |
---|---|---|---|---|---|---|
30036 | 51.459590 | -0.125550 | NaN | Entire rental unit | Entire home/apt | $85.00 |
18694 | 51.595440 | -0.251620 | NaN | Private room in rental unit | Private room | $42.00 |
6829 | 51.530279 | -0.197675 | NaN | Entire rental unit | Entire home/apt | $282.00 |
latitude | longitude | license | property_type | room_type | price | |
---|---|---|---|---|---|---|
30036 | 51.459590 | -0.125550 | NaN | Entire rental unit | Entire home/apt | $85.00 |
18694 | 51.595440 | -0.251620 | NaN | Private room in rental unit | Private room | $42.00 |
6829 | 51.530279 | -0.197675 | NaN | Entire rental unit | Entire home/apt | $282.00 |
Dealing with NaNs and Nulls
There is a lot going on here and you should be paying close attention.
If you really dig into the data you will see that a number of data types that aren’t ‘appropriate’ for their contents: the id columns are floats; the dates aren’t dates; there’s a boolean that’s not a boolean… It would be nice to fix these!
# Add some columns here...
I had intended to ask you to fix these by combining code from previous weeks with information provided in the lecture, but it turns out that the InsideAirbnb data set is dirty. There are a lot of NaN
values and some of these are deeply problematic for some of the column types in pandas. There are also a number of challenges with other columns so, instead, I’ve opted to show you how I would clean this data as a first pass to get it into a format where it’s tractable for further cleaning.
Identifying Problem Rows
The reason I’m not asking you to do this part yourselves is that it took me nearly an hour just to work out why I couldn’t convert some of the columns to the right data types; then I started finding rows like these:
df[df.price.isna()]
id | listing_url | last_scraped | name | description | host_id | host_name | host_since | host_location | host_about | ... | price | minimum_nights | maximum_nights | availability_365 | number_of_reviews | first_review | last_review | review_scores_rating | license | reviews_per_month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8668 | 1.012818e+07 | https://www.airbnb.com/rooms/10128178 | 2023-09-06 | Loft in London · ★4.15 · 2 bedrooms · 2 beds ·... | This is a lived in apartment with two medium s... | 233649.0 | Michelle | 2010-09-13 | London, United Kingdom | Michelle McLaughlin - A professional who work... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
11814 | 1.359431e+07 | https://www.airbnb.com/rooms/13594306 | 2023-09-06 | Rental unit in London · ★5.0 · 2 bedrooms · 2 ... | Beautiful quiet flat with high ceilings and lo... | 18380563.0 | Cécile And Maarten | 2014-07-19 | London, United Kingdom | Hello! We live in London with our two daughter... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
21912 | 2.206322e+07 | https://www.airbnb.com/rooms/22063217 | 2023-09-07 | Rental unit in Greater London · 1 bedroom · 1 ... | <b>The space</b><br />Parejas | 1824036.0 | Arya | 2012-02-28 | London, United Kingdom | Citizen of the world. | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
48138 | 5.342282e+07 | https://www.airbnb.com/rooms/53422816 | 2023-09-06 | Condo in London · ★4.93 · 1 bedroom · 1 bed · ... | We've newly refurbished our 1-bed apartment fo... | 30626999.0 | Sitara | 2015-04-05 | London, United Kingdom | Love travelling and meeting people! | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
48765 | 5.389986e+07 | https://www.airbnb.com/rooms/53899858 | 2023-09-06 | Loft in Greater London · ★4.0 · 1 bedroom · 1 ... | Enjoy a stylish experience at this centrally-l... | 233649.0 | Michelle | 2010-09-13 | London, United Kingdom | Michelle McLaughlin - A professional who work... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
68884 | 8.446589e+17 | https://www.airbnb.com/rooms/844658929307006668 | 2023-09-06 | Rental unit in Greater London · 1 bedroom · 1 ... | Stylish apartment Located in sought after neig... | 10573878.0 | Liz | 2013-12-11 | London, United Kingdom | I love hosting and meeting worldly guests trav... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
74896 | 8.971145e+17 | https://www.airbnb.com/rooms/897114471991989638 | 2023-09-06 | Rental unit in Greater London · 1 bedroom · 1 ... | Tucked away right off the high street, near 50... | 11520835.0 | Barry | 2014-01-21 | Levittown, NY | Hey there! born and raised in New York. Airbnb... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7 rows × 34 columns
df[df.room_type.isna()]
id | listing_url | last_scraped | name | description | host_id | host_name | host_since | host_location | host_about | ... | price | minimum_nights | maximum_nights | availability_365 | number_of_reviews | first_review | last_review | review_scores_rating | license | reviews_per_month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8668 | 1.012818e+07 | https://www.airbnb.com/rooms/10128178 | 2023-09-06 | Loft in London · ★4.15 · 2 bedrooms · 2 beds ·... | This is a lived in apartment with two medium s... | 233649.0 | Michelle | 2010-09-13 | London, United Kingdom | Michelle McLaughlin - A professional who work... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
11814 | 1.359431e+07 | https://www.airbnb.com/rooms/13594306 | 2023-09-06 | Rental unit in London · ★5.0 · 2 bedrooms · 2 ... | Beautiful quiet flat with high ceilings and lo... | 18380563.0 | Cécile And Maarten | 2014-07-19 | London, United Kingdom | Hello! We live in London with our two daughter... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
21912 | 2.206322e+07 | https://www.airbnb.com/rooms/22063217 | 2023-09-07 | Rental unit in Greater London · 1 bedroom · 1 ... | <b>The space</b><br />Parejas | 1824036.0 | Arya | 2012-02-28 | London, United Kingdom | Citizen of the world. | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
48138 | 5.342282e+07 | https://www.airbnb.com/rooms/53422816 | 2023-09-06 | Condo in London · ★4.93 · 1 bedroom · 1 bed · ... | We've newly refurbished our 1-bed apartment fo... | 30626999.0 | Sitara | 2015-04-05 | London, United Kingdom | Love travelling and meeting people! | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
48765 | 5.389986e+07 | https://www.airbnb.com/rooms/53899858 | 2023-09-06 | Loft in Greater London · ★4.0 · 1 bedroom · 1 ... | Enjoy a stylish experience at this centrally-l... | 233649.0 | Michelle | 2010-09-13 | London, United Kingdom | Michelle McLaughlin - A professional who work... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
68884 | 8.446589e+17 | https://www.airbnb.com/rooms/844658929307006668 | 2023-09-06 | Rental unit in Greater London · 1 bedroom · 1 ... | Stylish apartment Located in sought after neig... | 10573878.0 | Liz | 2013-12-11 | London, United Kingdom | I love hosting and meeting worldly guests trav... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
74896 | 8.971145e+17 | https://www.airbnb.com/rooms/897114471991989638 | 2023-09-06 | Rental unit in Greater London · 1 bedroom · 1 ... | Tucked away right off the high street, near 50... | 11520835.0 | Barry | 2014-01-21 | Levittown, NY | Hey there! born and raised in New York. Airbnb... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7 rows × 34 columns
~(df.price.str.startswith('$', na=False))] df[
id | listing_url | last_scraped | name | description | host_id | host_name | host_since | host_location | host_about | ... | price | minimum_nights | maximum_nights | availability_365 | number_of_reviews | first_review | last_review | review_scores_rating | license | reviews_per_month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8668 | 1.012818e+07 | https://www.airbnb.com/rooms/10128178 | 2023-09-06 | Loft in London · ★4.15 · 2 bedrooms · 2 beds ·... | This is a lived in apartment with two medium s... | 233649.0 | Michelle | 2010-09-13 | London, United Kingdom | Michelle McLaughlin - A professional who work... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8669 | NaN | within an hour | 71% | https://a0.muscache.com/im/pictures/user/652bf... | https://a0.muscache.com/im/pictures/user/652bf... | 3.0 | t | t | London, England, United Kingdom | Hackney | ... | 170 | 2023-09-06 | 14.0 | NaN | 2.0 | 0 | 0.19 | NaN | NaN | NaN |
11814 | 1.359431e+07 | https://www.airbnb.com/rooms/13594306 | 2023-09-06 | Rental unit in London · ★5.0 · 2 bedrooms · 2 ... | Beautiful quiet flat with high ceilings and lo... | 18380563.0 | Cécile And Maarten | 2014-07-19 | London, United Kingdom | Hello! We live in London with our two daughter... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
11815 | NaN | NaN | NaN | https://a0.muscache.com/im/pictures/user/User-... | https://a0.muscache.com/im/pictures/user/User-... | 1.0 | t | t | London, United Kingdom | Brent | ... | 0 | 2023-09-06 | 6.0 | NaN | 1.0 | 0 | 0.07 | NaN | NaN | NaN |
21912 | 2.206322e+07 | https://www.airbnb.com/rooms/22063217 | 2023-09-07 | Rental unit in Greater London · 1 bedroom · 1 ... | <b>The space</b><br />Parejas | 1824036.0 | Arya | 2012-02-28 | London, United Kingdom | Citizen of the world. | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
21913 | NaN | NaN | NaN | https://a0.muscache.com/defaults/user_pic-50x5... | https://a0.muscache.com/defaults/user_pic-225x... | 1.0 | f | t | NaN | Tower Hamlets | ... | 0 | 2023-09-07 | 2.0 | NaN | 1.0 | 0 | 0.03 | NaN | NaN | NaN |
48138 | 5.342282e+07 | https://www.airbnb.com/rooms/53422816 | 2023-09-06 | Condo in London · ★4.93 · 1 bedroom · 1 bed · ... | We've newly refurbished our 1-bed apartment fo... | 30626999.0 | Sitara | 2015-04-05 | London, United Kingdom | Love travelling and meeting people! | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
48139 | NaN | NaN | 83% | https://a0.muscache.com/im/pictures/user/3a51c... | https://a0.muscache.com/im/pictures/user/3a51c... | 2.0 | t | t | London, United Kingdom | Tower Hamlets | ... | 179 | 2023-09-06 | 15.0 | NaN | 1.0 | 0 | 0.75 | NaN | NaN | NaN |
48765 | 5.389986e+07 | https://www.airbnb.com/rooms/53899858 | 2023-09-06 | Loft in Greater London · ★4.0 · 1 bedroom · 1 ... | Enjoy a stylish experience at this centrally-l... | 233649.0 | Michelle | 2010-09-13 | London, United Kingdom | Michelle McLaughlin - A professional who work... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
48766 | NaN | within an hour | 71% | https://a0.muscache.com/im/pictures/user/652bf... | https://a0.muscache.com/im/pictures/user/652bf... | 3.0 | t | t | NaN | Hackney | ... | 78 | 2023-09-06 | 5.0 | NaN | 2.0 | 0 | 0.28 | NaN | NaN | NaN |
68884 | 8.446589e+17 | https://www.airbnb.com/rooms/844658929307006668 | 2023-09-06 | Rental unit in Greater London · 1 bedroom · 1 ... | Stylish apartment Located in sought after neig... | 10573878.0 | Liz | 2013-12-11 | London, United Kingdom | I love hosting and meeting worldly guests trav... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
68885 | NaN | a few days or more | 6% | https://a0.muscache.com/im/pictures/user/7d57d... | https://a0.muscache.com/im/pictures/user/7d57d... | 3.0 | t | t | NaN | Kensington and Chelsea | ... | 359 | 2023-09-06 | 1.0 | NaN | 1.0 | 0 | 0.17 | NaN | NaN | NaN |
74896 | 8.971145e+17 | https://www.airbnb.com/rooms/897114471991989638 | 2023-09-06 | Rental unit in Greater London · 1 bedroom · 1 ... | Tucked away right off the high street, near 50... | 11520835.0 | Barry | 2014-01-21 | Levittown, NY | Hey there! born and raised in New York. Airbnb... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
74897 | NaN | within an hour | 80% | https://a0.muscache.com/im/users/11520835/prof... | https://a0.muscache.com/im/users/11520835/prof... | 3.0 | t | t | Greater London, England, United Kingdom | Ealing | ... | 157 | 2023-09-06 | 0.0 | NaN | 1.0 | 0 | NaN | NaN | NaN | NaN |
14 rows × 34 columns
If I had to guess, I’d say that it’s some kind of partial extract/write process because there are elements in some of the problem row(s) that look right but they are in the wrong columns. So we can probably drop some of these rows, but one thing to do is look at the frequency of NaNs across the data frame first. So we need to look for NaNs and Nulls, but it’s quite obvious that a NaN
in the listing id is a basic problem and we should drop these.
id.isna()][['id','listing_url','name','description','host_id','host_name','price']] df[df.
id | listing_url | name | description | host_id | host_name | price | |
---|---|---|---|---|---|---|---|
8669 | NaN | within an hour | https://a0.muscache.com/im/pictures/user/652bf... | https://a0.muscache.com/im/pictures/user/652bf... | 3.0 | t | 170 |
11815 | NaN | NaN | https://a0.muscache.com/im/pictures/user/User-... | https://a0.muscache.com/im/pictures/user/User-... | 1.0 | t | 0 |
21913 | NaN | NaN | https://a0.muscache.com/defaults/user_pic-50x5... | https://a0.muscache.com/defaults/user_pic-225x... | 1.0 | f | 0 |
48139 | NaN | NaN | https://a0.muscache.com/im/pictures/user/3a51c... | https://a0.muscache.com/im/pictures/user/3a51c... | 2.0 | t | 179 |
48766 | NaN | within an hour | https://a0.muscache.com/im/pictures/user/652bf... | https://a0.muscache.com/im/pictures/user/652bf... | 3.0 | t | 78 |
68885 | NaN | a few days or more | https://a0.muscache.com/im/pictures/user/7d57d... | https://a0.muscache.com/im/pictures/user/7d57d... | 3.0 | t | 359 |
74897 | NaN | within an hour | https://a0.muscache.com/im/users/11520835/prof... | https://a0.muscache.com/im/users/11520835/prof... | 3.0 | t | 157 |
As always, if you don’t know that’s going on, break it down:
- You have seen how column works (
[[<column names>]]
), so that’s just selecting the columns that we want to show; - You know how row selection works (
df[<selection criteria>]
), so that isn’t anything really new either; - So the only really new part is
df.id.isna()
:df.id
is theid
column (we could have written thisdf['id']
if we wanted) andisna()
is a test for whether or not a value is NaN.
So this shows that only one row in the 10,000 row sub-sample has a NaN for its id.
If you’re not sure what the next line does, try breaking it down by running the inner bits before you run the drop
command; and also try looking online for examples of how to use df.drop
(e.g. just up above):
id.isna()].index.array, axis=0, inplace=True) df.drop(df[df.
With that really troublesome data out of the way, you can now turn to counting NaNs or Nulls in the remaining data with a view to identifying other rows that can probably be dropped.
Counting Nulls by Column
As a starting point I would look to drop the columns that contain only NaNs. Remember that we’ve dropped a row from the data frame so our maximum is now \(n-1\))! Notice how this next command works:
# returns a data frame with all values set to True/False according to Null status
df.isnull() # counts these values by column (we'll see another option in a moment)
sum(axis=0)
df.isnull.# Sort results in descending order
sum(axis=0).sort_values(ascending=False) df.isnull.
sum(axis=0).sort_values(ascending=False)[:12] df.isnull().
bathrooms 87946
license 87945
host_about 42040
bedrooms 32781
first_review 22164
reviews_per_month 22164
last_review 22164
review_scores_rating 22164
host_location 18778
description 1267
beds 1141
host_is_superhost 939
dtype: int64
The most obvious ones here are: bathrooms, license, and host_about.
=['bathrooms','license','host_about'], inplace=True) df.drop(columns
Because we have dropped everything inplace
the code simply runs and doesn’t return anything.
Counting Nulls by Row
We now know that there are still quite a few problems, but we do still need a way to identify the rows that are causing most of the problems.
Notice here that the change from axis=0
to axis=1
changes the ‘direction’ of the sum
from columns to rows. And we are getting back a data series because the summing operation reduces it to just one column.
sum(axis=1).sort_values(ascending=False).head(20) df.isnull().
48765 22
8668 22
21912 22
68884 22
48138 22
11814 22
74896 22
7003 11
6042 11
5353 11
4274 11
6694 11
2412 9
39141 8
39082 8
40686 8
27778 8
39023 8
1134 8
611 8
dtype: int64
So that is Series showing how many NaN values there are by index value. You should see two columns of numbers: the first is the row id, the second is the number of Nulls in that row.
If we save the results to a variable called probs
(i.e. problems) then we can decide what to do next.
There’s a chance that Python will complain why you try to run the third line of code. This is particularly likely if you are using Anaconda Python directly (i.e. not Docker). In that case you need to add the code listed at the start of Task 5.
= df.isnull().sum(axis=1)
probs print(type(probs)) # Note that this has returned a series!
=30) # Oooooooh, check out what we can do with a series! probs.plot.hist(bins
<class 'pandas.core.series.Series'>
Looking at this histogram, these look like two groups in the data so I would start there. I would take values greater than 3–5 as being ones that are most likely be problematic. We can use the index from probs
to select out the rows we want to inspect from the main data frame.
Here’s another bit of code that bears unpacking:
= 5
cutoff > cutoff].index, inplace=True) df.drop(probs[probs
probs > 5
: this selects only those rows in the ‘probs’ series whose value is greater than 5probs[...].index
returns the index values from the Series, which we will then pass to thedrop
command.df.drop(..., inplace=True)
will then drop the rows selected byprobs[probs>5].index
.
print(f"Have reduced data frame to: {df.shape[0]:,} rows and {df.shape[1]:,} columns")
Have reduced data frame to: 85,127 rows and 31 columns
Using Indexes
To recap, when we use the [[...]]
syntax we’re taking a short-cut through the data by column (keeping all rows). The full syntax is df[<row_selection>,<col_selection>]
. Only when we don’t specify both does it then default to df[<col_selection>]
.
To make the most of pandas you will need to get to grips with the logic than underpins this syntax. This is embedded in the idea of there being row and column indexes. These are like the columns A
..ZZ
and the rows 1..n in Excel. As you’ll have seen in the video, these aren’t considered data, they are ways to access the data. Unlike Excel, while every data frame must have an index, in pandas you can ‘promote’ or ‘demote’ any column to be used as an index.
The default row index is just the row number — this will be created for you if you don’t specify something else when you create the data frame. The default column index is created from a file’s column names (works for many types of data) but you can change these at any time.
Label and Numeric Indexing
Perhaps this will (eventually) help to make it more clear:
df.loc[4552, 4554, 4556, 4557],
['latitude','longitude','property_type','room_type','price']
[ ]
latitude | longitude | property_type | room_type | price | |
---|---|---|---|---|---|
4552 | 51.531070 | -0.186060 | Private room in rental unit | Private room | $95.00 |
4554 | 51.476640 | -0.215200 | Entire rental unit | Entire home/apt | $250.00 |
4556 | 51.400162 | -0.076788 | Private room in home | Private room | $56.00 |
4557 | 51.408920 | -0.180910 | Entire rental unit | Entire home/apt | $150.00 |
And compare that with:
df.iloc[4552:4557,
14:19
]
longitude | property_type | room_type | accommodates | bathrooms_text | |
---|---|---|---|---|---|
4655 | -0.15610 | Private room in home | Private room | 1.0 | 1 shared bath |
4656 | -0.11470 | Private room in rental unit | Private room | 1.0 | 1 bath |
4657 | -0.06745 | Private room in rental unit | Private room | 1.0 | 1 shared bath |
4658 | -0.24050 | Entire rental unit | Entire home/apt | 4.0 | 1 bath |
4659 | -0.18325 | Private room in home | Private room | 2.0 | 1 bath |
This code seems similar, but what are iloc
and loc
? The way I remember it is that iloc
means integer location (as you would with list indexing), while loc
means label location (as you would with dictionary keys or labels). I guess that should therefore be lloc
, but you get the idea).
Numeric Indexes
In this case, the index (the numbers down the left-hand side in bold) is numeric, so we can treat it as a label (which allows us to use df.loc
) or a list-type index (which allows us to use df.iloc
). So with loc
we refer to the columns by label, whereas with iloc
we refer to them by location; as well, loc
allows us to access rows and columns non-sequentially/randomly by label, while iloc
allows us to access them as a numeric range.
Non-numeric Indexes
Notice how this works differently if we specify a non-numeric index:
'listing_url')[
df.set_index('latitude','longitude','property_type','room_type','price']
[3) ].sample(
latitude | longitude | property_type | room_type | price | |
---|---|---|---|---|---|
listing_url | |||||
https://www.airbnb.com/rooms/49205557 | 51.511670 | -0.120240 | Entire condo | Entire home/apt | $337.00 |
https://www.airbnb.com/rooms/884709941812627302 | 51.595673 | -0.245288 | Private room in rental unit | Private room | $54.00 |
https://www.airbnb.com/rooms/36399168 | 51.544380 | -0.142460 | Entire rental unit | Entire home/apt | $96.00 |
Notice change in indexing because ‘listing_url’ is no longer a column, it’s the index now!
'listing_url').iloc[0:3,13:18] df.set_index(
longitude | property_type | room_type | accommodates | bathrooms_text | |
---|---|---|---|---|---|
listing_url | |||||
https://www.airbnb.com/rooms/92644 | -0.18739 | Private room in rental unit | Private room | 2.0 | 1.5 shared baths |
https://www.airbnb.com/rooms/93015 | -0.21707 | Entire rental unit | Entire home/apt | 5.0 | 1 bath |
https://www.airbnb.com/rooms/13913 | -0.11270 | Private room in rental unit | Private room | 1.0 | 1 shared bath |
It’s vital that you understand how this code works. By which I mean why it does something at all, not exactly how to use loc
and iloc
(though that is also useful).
df.set_index(...)
changes the index from the default row number to another field in the data frame. This operation returns a new data frame with listing_url
as its index. Because set index returned a data frame, we can simply add another method call (iloc
or loc
) on to the end of that line and it returns a new data frame in turn!
The fact that each operation returns a new data frame (or data series) is why you can even do this:
'listing_url').iloc[0:3].latitude.mean() df.set_index(
np.float64(51.50351666666666)
Fixing Data Types
If you want to challenge yourself, then I’d suggest trying to work out how to adapt what we saw in previous weeks using the data type dictionary to map column names to column types; however, a more straightforward way to do this is to create different for loops for each:
Profiling (Optional)
The Pandas Profiling tool (rebranded a year or so back as ydata-profiling) offers an alternative way of understanding what’s going on in your data. The output looks rather nice and you might be tempted to ask why we didn’t use this straight away on the full data set – well, if you really want to know, see what happens when you profile all 70,000-odd rows and 70-odd columns in the raw data frame… in effect: while it’s ‘nice to have’, the likelihood of crashing your computer increases significantly and it’s a bit of a tangent, so that’s why it’s no longer included in the Docker image.
If you do want to explore this then you’ll need to install the library, and this is a good chance to look at a quite sophisiticated way to install software on another machine:
from ydata_profiling import ProfileReport
Specify the Profiling Columns
Looking back over earlier code see if you can work out how to profile latitude
, longitude
,and review_scores_rating
together.
Question
= ProfileReport(??, title="Pandas Profiling Report") profile
Profiling Targets
You can write the profile either directly into the Jupyter notebook (this file) or into a separate HTML (i.e. Web) page.
profile.to_notebook_iframe()# You can also write this profile to a web page:
# profile.to_file("your_report.html")
Managing Memory
As to why you’d want to fix your data types, there are two reasons: 1) to ensure that you can make the most of your data; 2) to ensure that it takes up as little space as possible in memory. Some simple examples:
- A column containing only
'True'
(4 bytes) and'False'
(5 bytes) will take up much more space than a column containing onlyTrue
andFalse
(1 bit each). - A column containing only
'Red'
,'Green'
, and'Blue'
(3, 5, and 4 bytes each respectively) will take up much more space that a column where we use the numbers1, 2, 3
to represent these values and have a map that tells us1==Red
,2==Blue
, and3==Green
.
Let’s test this idea out:
= df.room_type.memory_usage(deep=True) # Room Type Memory
rtm = df.room_type.astype('category').memory_usage(deep=True) # Categorical Type Memory
ctm
print(f"The raw memory usage of `room_type` is {rtm/1024:,.0f} Kb.")
print(f"The categorical memory usage of `room_type` is {ctm/1024:,.0f} Kb.")
print(f"That's {(ctm/rtm)*100:.0f}% of the original!")
The raw memory usage of `room_type` is 8,623 Kb.
The categorical memory usage of `room_type` is 2,813 Kb.
That's 33% of the original!
= df.host_is_superhost.memory_usage(deep=True) # Super Host Memory
shm = df.host_is_superhost.replace({'f':False, 't':True}).astype('bool').memory_usage(deep=True) # Boolean Host Memory
bhm
print(f"The raw memory usage of `host_is_superhost` is {shm/1024:,.0f} Kb.")
print(f"The boolean memory usage of `host_is_superhost` is {bhm/1024:,.0f} Kb.")
print(f"That's {(bhm/shm)*100:.0f}% of the original!")
The raw memory usage of `host_is_superhost` is 7,528 Kb.
The boolean memory usage of `host_is_superhost` is 2,812 Kb.
That's 37% of the original!
Boolean Values
Let’s start with columns that are likely to be boolean:
= ['host_is_superhost']
bools 5, random_state=43)[bools] df.sample(
host_is_superhost | |
---|---|
55197 | f |
24505 | f |
44546 | f |
16312 | t |
66564 | f |
Here we have to map
‘t’ to True and ‘f’ to False before converting the column to a boolean type. If you simply tried to replace them with the strings ‘True’ and ‘False’, then the conversion would run into the same problem as Week 3: any string that is not None
will convert a True boolean.
# This approach requires us to map 't'
# and 'f' to True and False
for b in bools:
print(f"Converting {b}")
= df[b].replace({'f':False, 't':True}).astype('bool') df[b]
Converting host_is_superhost
5, random_state=43)[bools] df.sample(
host_is_superhost | |
---|---|
55197 | False |
24505 | False |
44546 | False |
16312 | True |
66564 | False |
Dates
I’ve found dates to be particularly challenging, though pandas has tried to make this process less painful than it was a few years ago. What can be particularly frustrating is if one row has a non-sensical date value (e.g. a t
, as happened in 2019/20) then the entire type conversion will fail. When that happens, pandas is not great about communicating where the problem occurred and I had to work it out by trying to convert parts of each series (using .iloc
) to the datetime type until I had a block that failed. I then knew that I could narrow this down further using integer location indexing.
= ['last_scraped','host_since','first_review','last_review']
dates
print(f"Currently {dates[1]} is of type '{df[dates[1]].dtype}'", "\n")
5, random_state=43)[dates] df.sample(
Currently host_since is of type 'object'
last_scraped | host_since | first_review | last_review | |
---|---|---|---|---|
55197 | 2023-09-07 | 2015-02-11 | NaN | NaN |
24505 | 2023-09-06 | 2012-12-18 | 2019-01-05 | 2020-01-05 |
44546 | 2023-09-06 | 2016-05-26 | 2021-08-15 | 2023-01-10 |
16312 | 2023-09-07 | 2017-04-02 | 2017-04-04 | 2023-08-26 |
66564 | 2023-09-06 | 2017-05-07 | NaN | NaN |
for d in dates:
print("Converting " + d)
= pd.to_datetime(df[d]) df[d]
Converting last_scraped
Converting host_since
Converting first_review
Converting last_review
5, random_state=43)[dates] df.sample(
last_scraped | host_since | first_review | last_review | |
---|---|---|---|---|
55197 | 2023-09-07 | 2015-02-11 | NaT | NaT |
24505 | 2023-09-06 | 2012-12-18 | 2019-01-05 | 2020-01-05 |
44546 | 2023-09-06 | 2016-05-26 | 2021-08-15 | 2023-01-10 |
16312 | 2023-09-07 | 2017-04-02 | 2017-04-04 | 2023-08-26 |
66564 | 2023-09-06 | 2017-05-07 | NaT | NaT |
Of course, it’s not actually clear there what has changed! But if you dig a little more deeply:
print(f"Now {dates[1]} is of type '{df[dates[1]].dtype}'", "\n")
5, random_state=45)[dates[1]].dt.strftime('%A %B %d, %Y')
df.sample(# Try some other formats!
Now host_since is of type 'datetime64[ns]'
45006 Sunday December 15, 2013
11882 Monday April 13, 2015
74675 Wednesday July 16, 2014
80039 Friday May 06, 2022
84399 Friday April 22, 2022
Name: host_since, dtype: object
In that line of code we:
- Took a random sample (setting the state to 45),
- Took the second column from the dates list (
dates[1]
), - Used the date ‘accessor method’ (
.dt
), - And called
string format time
with the format%A %B %d, %Y
(Full Day of Week, Month Name, Date, 4-digit Year)
Categories
We know that these are likely to be categories because there’d be no other way to allow users to effectively search Airbnb.
= ['property_type','room_type']
cats
print(f"Currently {cats[1]} is of type '{df[cats[1]].dtype}'", "\n")
5, random_state=42)[cats] df.sample(
Currently room_type is of type 'object'
property_type | room_type | |
---|---|---|
18050 | Private room in home | Private room |
86868 | Private room in home | Private room |
54798 | Private room in bed and breakfast | Private room |
74233 | Entire rental unit | Entire home/apt |
87242 | Entire rental unit | Entire home/apt |
This next piece of code is quite useful for grouping and counting operations: we are counting the occurences of each unique value in part particular column or combination of columns:
0]].value_counts() df[cats[
property_type
Entire rental unit 33450
Private room in rental unit 13278
Private room in home 9795
Entire condo 8656
Entire home 7530
...
Yurt 1
Private room in island 1
Shared room in villa 1
Shared room in serviced apartment 1
Treehouse 1
Name: count, Length: 99, dtype: int64
1]].value_counts() df[cats[
room_type
Entire home/apt 54203
Private room 30366
Shared room 340
Hotel room 218
Name: count, dtype: int64
One column has many different values (including Campers/RVs and Yurts!), the other has just four. If I were looking to conduct research I’d probably start with the room_type
column since I may not care about hotels and therefore never even need to decide whether I care about boutique ones!
for c in cats:
print(f"Converting {c}")
= df[c].astype('category') df[c]
Converting property_type
Converting room_type
print(f"Now {cats[1]} is of type '{df[cats[1]].dtype}'", "\n")
print(df[cats[1]].cat.categories.values)
Now room_type is of type 'category'
['Entire home/apt' 'Hotel room' 'Private room' 'Shared room']
5, random_state=42)[cats] df.sample(
property_type | room_type | |
---|---|---|
18050 | Private room in home | Private room |
86868 | Private room in home | Private room |
54798 | Private room in bed and breakfast | Private room |
74233 | Entire rental unit | Entire home/apt |
87242 | Entire rental unit | Entire home/apt |
Dealing with Strings
We’ll have to put some more work into deal with the description and other more free-from text fields later in the term, but for now let’s just deal with a straightforward one: price!
= ['price']
money 5, random_state=42)[money] df.sample(
price | |
---|---|
18050 | $51.00 |
86868 | $56.00 |
54798 | $45.00 |
74233 | $104.00 |
87242 | $126.00 |
You will get an error when you run the next code block, that’s because I want you to do a little thinking about how to extend the code to fix the data. You’ve already got the code you need to fix it, you just need to do a bit of thinking about ‘method chaining’!
for m in money:
print(f"Converting {m}")
try:
= df[m].str.replace('$','', regex=False).astype('float')
df[m] except ValueError as e:
print(f" xxxx Unable to convert {m} to float xxxx")
print(e)
Converting price
xxxx Unable to convert price to float xxxx
could not convert string to float: '1,000.00'
Look closely at the error and then think about what you need to add to the code below:
For now don’t worry about what regex=False
means. It will all make sense when we get to dealing with text.
Question
for m in money:
print(f"Converting {m}")
= df[m].str.replace('$','', regex=False).str.replace(??).astype('float') df[m]
5, random_state=42)[money] df.sample(
price | |
---|---|
18050 | 51.0 |
86868 | 56.0 |
54798 | 45.0 |
74233 | 104.0 |
87242 | 126.0 |
='price', ascending=False).head(3)[['id','name','price','minimum_nights']] df.sort_values(by
id | name | price | minimum_nights | |
---|---|---|---|---|
36168 | 3.845268e+07 | Guesthouse in Dagenham · ★5.0 · 1 bedroom · 1 ... | 80100.0 | 2 |
11249 | 1.325477e+07 | Rental unit in London · ★4.85 · 1 bedroom · 1 ... | 53588.0 | 3 |
53708 | 6.454631e+17 | Serviced apartment in Greater London · 4 bedro... | 36000.0 | 28 |
Dealing with Integers
This is the issue that made me abandon the idea of making you clean the data yourselves. Although floats have no issues with np.nan
in the Series, by default there are no numpy integer arrays that can cope with NaNs. This was such a major issue for Pandas that they’ve actually created their own data type that does support NaN values in integer columns. There are a lot of integer columns, but only one of them seems to be a problem.
= ['id','host_id','host_listings_count','host_total_listings_count','accommodates',
ints 'beds','minimum_nights','maximum_nights','availability_365']
for i in ints:
print(f"Converting {i}")
try:
= df[i].astype('float').astype('int')
df[i] except ValueError as e:
print(" - !!!Converting to unsigned 16-bit integer!!!")
= df[i].astype('float').astype(pd.UInt16Dtype()) df[i]
Converting id
Converting host_id
Converting host_listings_count
Converting host_total_listings_count
Converting accommodates
Converting beds
- !!!Converting to unsigned 16-bit integer!!!
Converting minimum_nights
Converting maximum_nights
Converting availability_365
So we convert the column but using a try / except
approach that allows to trap ValueError
exceptions triggered by the presence of NaNs in the column. The following code tells us that there are just eight of these in the 10k sample, but they’re enough to cause the code to fail if you don’t trap them. The alternatives would be to: a) drop those rows; or b) leave the data as floats. For some reason the latter offends my sense of order, and the former feels like avoiding the problem rather than dealing with it.
df.beds.isna().value_counts()
beds
False 84326
True 801
Name: count, dtype: int64
Validation
Ordinarily, at this point I would then output information to confirm that all of the opeations I think I’ve undertaken were correctly applied.
df.info()
Saving
Also at this point I would save a copy of the cleaned data, though I would only consider this data partially cleaned since we’ve not made it any further than just ensuring that each column is in an appropriate format and that some particularly problematic rows have been dropped!
= os.path.join('data','clean')
path
if not os.path.exists(path):
print(f"Creating {path} under {os.getcwd()}")
os.makedirs(path)
=False)
df.to_csv(os.path.join(path,fn), indexprint("Done.")
Done.
Feather is an alternative format (gradually being replaced by parquet, which is more widely supported) for data interchange between R and Python: it’s fast, it preserves data types, it’s compressed, and it will avoid the kinds of the problems that come up when you move to/from CSV as a default.
Selection using Criteria
So far we’ve been taking primarily a row and column view of the data, now we want to think about selecting ranges from within the data set…
Selecting using Data Types
If we wanted to filter in/out certain columns pandas can do that! Let’s try for floats and ints (hint: these are 64-bit data types).
Question
=[??]) df.select_dtypes(include
Selecting using Conditions
What if we wanted to find whole homes listings for more than $100/night?
To do this we use a combination of the selection approaches above in combination with conditionals, but first we need to see what sort of properties there are in the data set! groupby
is a really useful function that we’ll come back to later in the term, but for now notice that it helps us to group the analysis by room_type
so that subsequently asking for the property_type
value counts allows the same property_type
to appear in more than once place if it’s associated with more than one room_type
.
'room_type', observed=True)['property_type'].value_counts() df.groupby(
room_type property_type
Entire home/apt Entire rental unit 33450
Entire condo 8656
Entire home 7530
Entire serviced apartment 2021
Entire townhouse 1153
...
Shared room Tent 0
Tiny home 0
Tower 0
Treehouse 0
Yurt 0
Name: count, Length: 396, dtype: int64
Now try to select only the Entire home/apt
room type:
Question
=='??']['property_type'].value_counts().head(10) df[df.??
Your output should be:
property_type
Entire rental unit 33450
Entire condo 8656
Entire home 7530
Entire serviced apartment 2021
Entire townhouse 1153
Entire loft 389
Entire guesthouse 205
Entire guest suite 178
Entire vacation home 102
Boat 68
Name: count, dtype: int64
Arbitrary Selection Criteria
OK, now let’s look for the Entire home/apt listings that are more expected than average… to do that let’s get a sense of where the mean and median value fall:
Question
print(f"The mean price is ${df.price.??():0.2f}")
print(f"The median price is ${df.price.??():0.2f}")
You should see that the mean is higher than the median price but both are very roughly plausible values. Given your understanding of distributions from, say, Quantitative Methods, what can you say about the pricing distribution of Airbnb units?
You might want to have a look at the documentation: it’s rather a long list, but most of your descriptive stats are on that page in the Cumulative / Descriptive Stats section, and there’s also lots of information about methods for strings and categorical data.
Filtering: it’s ‘logical’
So we want to take Entire home/apt
and filter the data set together with the price per night from the price
column. For that, let’s use the mean price/night of $183.63 (note: this is totally arbitrary)?
Question
So here we want to filter on two values in the data set using &
:
= df[(??) & (df.price>df.price.??)]
pricey print(f"Selected {pricey.shape[0]:,} rows")
In the code above we see two things:
- The use of the bitwise
&
(it’s not the same asand
). - The fact that you need parentheses around the selection in order to make the the
&
work.
Selection with an Aggregate
Let’s find the cheapest and most expensive listings using min
and max
methods:
Question
Least expensive:
==df.price.??()][['price','id','listing_url','room_type','description']] df[df.price
Most expensive:
==df.price.??()][['price','id','listing_url','room_type','description']] df[df.price
You should see one or more units priced at exceedingly high levels… and here’s a way to see a few more of these budget-busting options.
='price', ascending=False).head(3)[
df.sort_values(by'price','listing_url','room_type','description']
[ ]
price | listing_url | room_type | description | |
---|---|---|---|---|
36168 | 80100.0 | https://www.airbnb.com/rooms/38452677 | Entire home/apt | Bouquet design open plan house. <br />Beautifu... |
11249 | 53588.0 | https://www.airbnb.com/rooms/13254774 | Private room | PLEASE NOTE THIS IS NO LONGER AVAILABLE! Cosy ... |
53708 | 36000.0 | https://www.airbnb.com/rooms/645463113262447532 | Entire home/apt | Enjoy your stay in London in a modern, archite... |
Question
What do you make of this result?
Selection with a Range
Perhaps we aren’t just looking for extremes… how about all of the properties falling within the middle of the distribution? We can ask for any abitrary quantile we like, so let’s go with the 25th and 75th percentile to get the middle 50% of the data. Google how to get percentiles from pandas.
Question
= df[
dfr > df.price.quantile(??)) &
(df.price < df.price.quantile(??)) ]
(df.price
print(f"Lower Quartile: {df.price.quantile(??):>6.2f}")
print(f"Upper Quartile: {df.price.quantile(??):>6.2f}")
print()
print(f"Range selected contains {dfr.shape[0]:,} rows.")
print(f"Minimum price: {dfr.price.??():>6.2f}")
print(f"Maximum price: {dfr.price.??():>6.2f}")
That example contains a few things to which you need to pay attention:
- Again you can see that, with mutiple selections, we had to put parentheses around each one – this forces Python to…
- Process the
&
(bit-wise AND) that asks pandas to “Find all the rows where condition 1 and condition 2 are bothTrue
”. So it calculates theTrue
/False
for the left side and theTrue
/False
for the right side of the&
, and then combines them.
Enhancing our Understanding
Deriving a New Variable
Let’s try calculating several derived measures of distribution for the price… these deliberately demonstrate different ways of handling this process (and notice also the little call to apply
that can perform additional tasks).
The z-Score
The z-score is given by \(z = (x - \bar{x})/\sigma\).
Question
'z'] = (df.?? - df.??.??()) / df.??.??()
df[apply(lambda x: f"{x:5.5f}") df.z.describe().
Inter-Quartile Standardisation
The IQR-standardised score is given by \(i = (x - Q_{1})/(Q_{3} - Q_{1})\)
Question
'iqs'] = (df.price - ??)/(??-??)
df[apply(lambda x: f"{x:5.5f}") df.iqs.describe().
Log-Normalisation
The natural log of the price is gven by \(ln(x)\)
Question
'lnprice'] = np.log(??)
df[apply(lambda x: f"{x:5.5f}") df.lnprice.describe().
Quick (and Dirty) Plotting
One of the first things we should do when exploring a new dataset is plot (aka graph) the data. We’ve left plotting until a little later in this practical so that we could see some other basic attributes of how pandas stores data. We’ll look at plotting and exploratory data analyses in much more detail next week, including using packages other than pandas.
For now, let’s look at the basic plotting functionality pandas provides - in conjunctions with the online documentation for both DataFrames and Series. There are also examples of all the different types of plots pandas can produce.
MacOS users who are not using Docker will need to do certain things in a specific order at the start of any notebook in order to show maps or graphs. Please make a copy of the following code for any notebook that you create and make it the first code that you run in the notebook…
# Needed on a Mac
import matplotlib as mpl
'TkAgg')
mpl.use(%matplotlib inline
import matplotlib.pyplot as plt
Histograms
First, let’s see some of the ways we could visualise the distribution of the Series
in the dataset:
# histogram df.price.plot.hist()
If the code worked properly you should have just created a standard histogram plot (if you can’t see one, ask for help). However, a basic problem here may be the range of the data: if your maximum price is much more than £5,000 then you’ll find the majority of your data plotted in one bar, which isn’t very helpful.
You can filter the data and pass in some simple options to improve the plotting:
# Notice the ';' here to suppress `<AxesSubplot...>`
# That information doesn't *always* appear, but whenever
# you have unwanted textual output above your plot just
# add a ';' on the end of the line of code!
< 1000].price.plot.hist(bins=50); df[df.price
KDE Plots
Similarly, we can produce a Kernel Density Estimate plot. This time, instead of dropping data just before calling plot
we’re going to modify the limits of the x-axis using xlim
:
Question
Look for information about using xlim
:
=(??)); #kernel density estimate plot df.price.plot.kde(xlim
Kind of handy, no? These aren’t the best looking plots, but they are all being generated on-the-fly for you by pandas with no more than a cheery DataFrame.Series.plot.<plot type>
! Since those plots are all just method calls, many of them take optional parameters to change the colour, the notation (scientific or not), and other options. For example, many of the documentation pages linked to above are rather brief, but include a link to the general options that can be applied to all Series.plot
calls.
This is why we like pandas: it allows us to be constructively lazy. We don’t need to know how a draw a KDE plot (though it always helps if you don’t see what you expected), we just need to know that pandas provides a method that will do it for you. And that is why it’s always worth having a look at the documentation.
A Slight Case of Over-Plotting
Generally, Jupyter is clever enough to overlay plots one on top of the other if you call them all in the same cell. We’ll see ways to gain more control later, but this is still a good start! Note that here we also need to get rid of the -inf
values from rows that had a price of £0.
The more we use pandas to sort and filter data the more you will start to see a SettingWithCopyWarning
. This happens because of an interaction between how Pandas works and how Python works: when you are working with a very large data set you don’t want to make a ‘deep copy’ of the data structure every time you make a change to the data. Instead, you get a ‘view’ into the data using a reference, which is a just a lightweight shortcut. So what happens when you try to modify that lightweight copy? Well, if you want to drop rows or columns then you either want to make a copy()
at that point, or you will have to accept the warning and the computational risks that go with it.
# Calling copy() ensures the index is updated
# and note that all subsequent plots will have
# these £0 rows removed!
= df[df.price > 0].copy()
df =[-2, 10])
df.z.plot.kde(xlim=[-2, 10])
df.iqs.plot.kde(xlim; df.lnprice.plot.kde()
Boxplots
A standard boxplot:
=(4, 8)); df.lnprice.plot.box(figsize
Scatterplots
We can also plot two variables in a scatter plot by applying a plot method to the DataFrame
(not an individual Series
):
='longitude', y='latitude', c='price', s=2, cmap='viridis', figsize=(15,10)) df.plot.scatter(x
Note how the code above has the form DataFrame.plot.<plot type>
, not DataFrame.Series.plot.<plot type>
as in the prior plots. Think about why this then means we need the x
and y
arguments.
Looking at the plot produced, it’s hard to see where the high values are, so we might want to think about ways that we could make it easier to spot the big numbers… We could, for instance, also vary the size of the point in a plot by some variable, but why does the following not really work?
='longitude', y='latitude', c='price', s=(df.price/df.price.min()), cmap='viridis', figsize=(15,10)) df.plot.scatter(x
And we can plot subsets of our data without creating a new object. See if you can work out what the following code is doing that is different from the last plot:
> df.price.quantile(0.90)].plot.scatter(x='longitude', y='latitude', c='price', cmap='viridis', s=8) df[df.price
Hex Bin Plots
And pandas allows us to create ‘less standard’ plots, like a hex bin plot:
='longitude', y='latitude', gridsize=50, figsize=(10,7)) df.plot.hexbin(x
That’s just a taste of what the basic plotting functionality of pandas can do. Feel free to explore more yourself and we’ll also see the seaborn package later.
Credits!
License
These teaching materials are licensed under a mix of The MIT License and the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license.
Acknowledgements:
Supported by the Royal Geographical Society (with the Institute of British Geographers) with a Ray Y Gildea Jr Award.
Potential Dependencies:
This notebook may depend on the following libraries: pandas, matplotlib