Numeric Data
Overview
This week we introduce the pandas library for data analysis and management through a focus on numeric data and its distribution(s). This marks a major shift from working with concepts (lists, dictionaries, functions, etc.) largely in isolation to encountering all of them together ‘in the wild’ as part of a full data science workflow. So we are moving from the acquisition of concepts to their integration, in the same way that we will, over the course of these three sessions, be moving from data acquisition to data integration.
- An appreciation of how and why this module differs from (QM) CASA0007.
- The beginnings of a more integrative understanding of foundational computer science concepts and the practice(s) of data science.
- A basic understanding of data acquisition and manipulation in Python.
Preparatory Lectures
Come to class prepared to present/discuss:
Session | Video | Presentation |
---|---|---|
Logic | Video | Slides |
Randomness | Video | Slides |
Data | Video | Slides |
Pandas | Video | Slides |
Other Preparation
Readings
Come to class prepared to discuss the following readings:
Citation | Article | ChatGPT Summary |
---|---|---|
Anderson (2008) | URL | N/A |
D’Ignazio and Klein (2020a) Ch.4 | URL | N/A |
Cox and Slee (2016) | URL | N/A |
Study Guide
Looking at Anderson (2008):
- How does the idea of “more is different” relate to the concept of the Petabyte Age?
- What does Anderson mean by “Correlation is enough”?
- What examples does Anderson provide to show that the scientific method is becoming obsolete?
Reading D’Ignazio and Klein (2020a, Ch.4):
- How do their ideas about the power dynamics of data collection challenge Anderson’s claim that “the numbers speak for themselves”?
- What are the potential consequences of relying solely on correlation, as Anderson suggests, without considering the social and political contexts of data?
- How might the concept of the “privilege hazard” inform an analysis of Anderson’s argument?
Considering Cox and Slee (2016):
- How does the case study presented in Cox and Slee illustrate the issues explored in the other readings?
- According to Cox and Slee, how did the cleaning of the New York City listings reflect the company’s power and privilege?
- How does this cleaning potentially harm less-privileged stakeholders in the company’s platform?
Read D’Ignazio and Klein (2020b) for the importance of thinking about what a data set captures… and what it excludes. Cox and Slee (2016) is also a first introduction to the underlying data that we’ll be working with over the rest of term. D’Ignazio and Klein (2020b) should also get you thinking about how ‘cleaning’ is not just about hygiene but has real implications for what you can and will find in your data: data is often a lot less ‘tidy’ than tidyr might lead you to think! You should almost never claim that your (social) data represents the ‘universe’ of behaviours or is somehow ‘complete’.
Practical
In this practical we will begin working with the InsideAirbnb data, which you will have briefly examined in CASA0005. This week we focus on the first ‘class’ of data in the data set: simple numeric columns. We will see how to use Pandas for (simple) visualisation and (the beginnings of) analysis. It is hoped that you will see how Pandas combines and builds on techniques that we’ve already seen: while Pandas is incredibly sophisticated, the underlying concepts have been covered in the preceding three weeks! At this point we will also begin to make use of Pandas functionality to subset and explore the data.
The practical focusses on:
- Seeing how Pandas is ‘just’ a sophisticated extension of what we’ve already done.
- Familiarising yourself with Pandas functionality.
- Performing basic data cleaning and exploration tasks (including visualisation).
- Selecting and aggregating data in pandas.
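As a rough illustration of these four points, the sketch below builds a small DataFrame from a dictionary of lists, cleans a numeric column, then subsets and aggregates it. The column names (`room_type`, `price`, `beds`) and all of the values are made up for illustration; they are assumptions, not the actual InsideAirbnb schema or the code used in the practical.

```python
import pandas as pd

# Hypothetical toy data: a dictionary of lists, as seen in earlier weeks.
# Column names and values are illustrative only, not real InsideAirbnb data.
listings = {
    "room_type": ["Entire home/apt", "Private room", "Private room", "Entire home/apt"],
    "price": ["$120.00", "$45.00", None, "$1,050.00"],
    "beds": [2, 1, 1, 4],
}

# A DataFrame can be thought of, loosely, as a dictionary of equal-length columns.
df = pd.DataFrame(listings)

# Basic cleaning: strip currency symbols and thousands separators,
# then convert the text to numbers (missing values become NaN).
df["price"] = pd.to_numeric(df["price"].str.replace(r"[$,]", "", regex=True))

# Subsetting: keep rows whose price is known and below 500.
affordable = df[df["price"] < 500]

# Aggregating: median price per room type (NaN values are skipped by default).
median_price = df.groupby("room_type")["price"].median()

# Simple exploration of the cleaned numeric column.
print(df["price"].describe())
print(affordable)
print(median_price)
```

Note that the row with a missing price is dropped by the subset because a comparison against NaN is never true; this is one small way in which ‘cleaning’ choices shape what you later find in the data.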
To access the practical: