Numeric Data

Overview

This week we introduce the pandas library for data analysis and management, focusing on numeric data and its distribution(s). This marks a major shift from working with concepts (lists, dictionaries, functions, etc.) largely in isolation to encountering all of them together ‘in the wild’ as part of a full data science workflow. So we are moving from the acquisition of concepts to their integration, in the same way that, over the course of these three sessions, we will move from data acquisition to data integration.

Learning Objectives
  1. An appreciation of how and why this module differs from (QM) CASA0007.
  2. The beginnings of a more integrative understanding of foundational computer science concepts and the practice(s) of data science.
  3. A basic understanding of data acquisition and manipulation in Python.

Lectures

Come to class prepared to present/discuss:

Session                 | Video    | Presentation
Logic                   | Video    | Slides
Randomness              | Video    | Slides
Data                    | Video    | Slides
Pandas                  | Video    | Slides
More on the Assessments | In class | Slides

Other Prep

  • Come to class prepared to present/discuss:
    • D’Ignazio and Klein (2020), chap. 4, What Gets Counted Counts
      <URL>
    • Wachsmuth and Weisler (2018) <URL>
    • Harris (2018) <URL>

Connections

This week's readings include two more pieces on the impact of Airbnb on cities (Wachsmuth and Weisler 2018; Harris 2018), which you are likely to find useful for developing your thinking for the Group Work, and one by D’Ignazio and Klein (2020) that highlights the importance of thinking about what a data set captures… and what it excludes. You should almost never claim that your (social) data represents the ‘universe’ of behaviours or is somehow ‘complete’.

Practical

In this practical we will begin working with the InsideAirbnb data, which you will have briefly examined in CASA0005. This week we focus on the first ‘class’ of data in the data set: simple numeric columns. We will see how to use pandas for (simple) visualisation and (the beginnings of) analysis. We hope you will see how pandas combines and builds on techniques that we’ve already seen: while pandas is incredibly sophisticated, the underlying concepts have been covered in the preceding three weeks! At this point we will also begin to use pandas functionality to subset and explore the data.
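
As a rough illustration of what exploring a numeric column's distribution can look like, here is a minimal sketch. It uses a tiny invented table rather than the real InsideAirbnb file, and the column names `price` and `minimum_nights` are assumptions chosen for illustration; the practical notebook may load and name things differently.

```python
import pandas as pd

# Hypothetical stand-in for a few numeric columns. These values are
# invented for illustration and are not taken from InsideAirbnb.
listings = pd.DataFrame({
    "price": [45.0, 120.0, 80.0, 300.0, 65.0],
    "minimum_nights": [1, 2, 3, 1, 30],
})

# Summary statistics (count, mean, std, min, quartiles, max)
# for every numeric column at once.
summary = listings.describe()
print(summary)

# Individual statistics for a single column.
median_price = listings["price"].median()
upper_quartile = listings["price"].quantile(0.75)
print(median_price, upper_quartile)
```

Comparing the median with the mean in the `describe()` output is one quick way to spot a skewed distribution, such as the single expensive listing in this invented sample.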

Connections

The practical focusses on:

  • Seeing how pandas is ‘just’ a sophisticated extension of what we’ve already done.
  • Familiarising yourself with pandas functionality.
  • Performing basic data cleaning and exploration tasks (including visualisation).
  • Selecting and aggregating data in pandas.
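
To make the first and last points above concrete, the sketch below does the same calculation twice: once with a plain dictionary of lists, and once with a pandas DataFrame built from that same dictionary. The data, the column names (`room_type`, `price`) and the variable names are all invented for illustration and are not taken from the practical itself.

```python
import pandas as pd

# Invented data held in a structure from earlier weeks: a dictionary of lists.
data = {
    "room_type": ["Entire home/apt", "Private room", "Private room", "Entire home/apt"],
    "price": [150.0, 40.0, 60.0, 250.0],
}

# Plain Python: select the prices of private rooms, then average them.
private_prices = [
    price
    for room, price in zip(data["room_type"], data["price"])
    if room == "Private room"
]
mean_plain = sum(private_prices) / len(private_prices)

# pandas: the same dictionary becomes a DataFrame, and the same
# selection is expressed as a boolean mask on the rows.
df = pd.DataFrame(data)
mean_pandas = df.loc[df["room_type"] == "Private room", "price"].mean()

# Aggregating: mean price for every room type in one step.
mean_by_type = df.groupby("room_type")["price"].mean()

print(mean_plain, mean_pandas)
print(mean_by_type)
```

Both approaches give the same answer; the pandas version simply scales more comfortably to many columns and many groups.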

To access the practical:

  1. Preview on GitHub
  2. Download the Notebook

References

D’Ignazio, Catherine, and Lauren F. Klein. 2020. Data Feminism. MIT Press. https://bookbook.pubpub.org/data-feminism.
Harris, J. 2018. “Profiteers Make a Killing on Airbnb - and Erode Communities.” The Guardian. https://www.theguardian.com/commentisfree/2018/feb/12/profiteers-killing-airbnb-erode-communities.
Wachsmuth, D., and A. Weisler. 2018. “Airbnb and the Rent Gap: Gentrification Through the Sharing Economy.” Environment and Planning A: Economy and Space 50 (6):1147–70. https://doi.org/10.1177/0308518X18778038.