Dimensions in Data
Supplemental
This session is ‘supplemental’, meaning that it is here to help you integrate ideas seen across Term 1 (and which will be encountered again in Term 2) in a way that supports long-term learning. It is not essential to passing the course and there are no ‘bonus points’ for using methods found in this session.
Overview
This is the most profoundly abstract aspect of data analysis: how to conceive of your data as a multi-dimensional space that can be reshaped and transformed to support your analytical objectives. This foregrounds the importance of judgement since, as the economist Ronald Coase is reputed to have said:
“If you torture the data long enough, it will confess.”
By which you should understand that transformation is a form of ‘torture’[1]: it can force the data to reveal relationships that were previously hidden from the data scientist. However, taken too far, the data will confess to whatever you want, which isn’t the purpose of critical, reproducible, sound data science!
- A deeper understanding of the issues surrounding clustering that were covered in Week 6 of CASA0005 (GIS) and CASA0007 (QM).
- An understanding of how data transformation works and the reasons for choosing one transform over another.
- An appreciation of the pros and cons of at least two dimensionality reduction techniques.
Preparatory Lectures
Come to class prepared to present/discuss:
Session | Video | Presentation |
---|---|---|
The Data Space | Video | Slides |
Transformation | Video | Slides |
Dimensionality | Video | Slides |
Other Preparation
- The following readings may be useful for reflecting on the topics covered in this session:
These readings provide very practical insights into the ways that data transformation can ‘torture the data until it confesses’ as well as how we can use data transformation to generate new ways of seeing our data and, consequently, new ways of understanding it. You should come away from these readings with a clearer understanding of why there’s rarely a ‘right’ or ‘wrong’ approach to a real-world data set, only ‘better’ and ‘worse’ ones. These readings are predominantly non-academic so they should (I hope) be fairly accessible and quick to read despite the potential dryness of the topics.
Practical
This practical will show you how data transformation is an essential, but often overlooked, aspect of data analysis: depending on the choices we make here, we can reduce (or increase) the dimensionality of the data and make it more (or less) tractable for subsequent analysis. This approach to the pipeline relies on you being able to see your data as existing in an abstract ‘space’ that can be manipulated in order to foreground, compress, or even mask attributes.
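To make the idea of ‘manipulating the space’ concrete, here is a minimal sketch using synthetic data (the exponential relationship, the noise level, and all variable names are assumptions made up for illustration; none of it comes from the practical itself). It shows how a log transform can turn a curved relationship into a nearly linear one, which a Pearson correlation then picks up far more readily.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (an assumption for illustration only):
# y grows exponentially with x, with multiplicative noise.
x = rng.uniform(0, 10, size=500)
y = np.exp(x + rng.normal(0, 0.5, size=500))

# Pearson correlation measures *linear* association, so it
# understates the strength of the raw, curved relationship.
r_raw = np.corrcoef(x, y)[0, 1]

# Taking logs straightens the relationship: log(y) = x + noise.
r_log = np.corrcoef(x, np.log(y))[0, 1]

print(f"Correlation on raw values:   {r_raw:.2f}")
print(f"Correlation after log of y:  {r_log:.2f}")
```

The transform has not added any information; it has only changed the shape of the space so that a linear tool can see the pattern. Whether that counts as revealing structure or as ‘torture’ is exactly the judgement call this session is about.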
The practical focusses on:
- Working with a more complex data structure to create new ‘grouped’ variables (as the simplest form of transformation)
- Using `sklearn` to `fit` and `transform` data in a flexible manner.
- Doing two types of dimensionality reduction to demonstrate how different linear and non-linear dimensionality reduction are.
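The three steps above might look something like the sketch below. Treat it as an illustration under stated assumptions rather than as the practical's own code: the category labels and their mapping are hypothetical, the swiss-roll data set stands in for real data, and PCA and t-SNE are simply one linear and one non-linear technique available in `sklearn` (the practical may well use others).

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# 1. A 'grouped' variable: map detailed categories onto broader groups.
#    The categories and the mapping are hypothetical.
detail = pd.Series(["a1", "a2", "b1", "b2", "a1", "b2"], name="detail")
groups = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
grouped = detail.map(groups)

# 2. fit / transform: learn parameters once, then apply them.
#    Synthetic 3-D data stands in for a real data set.
X, _ = make_swiss_roll(n_samples=300, noise=0.05, random_state=42)
scaler = StandardScaler().fit(X)   # learns each column's mean and std
X_std = scaler.transform(X)        # applies them; reusable on new data

# 3a. Linear reduction: PCA projects onto orthogonal directions
#     of greatest variance.
pca = PCA(n_components=2).fit(X_std)
X_pca = pca.transform(X_std)

# 3b. Non-linear reduction: t-SNE tries to preserve local neighbourhoods.
#     TSNE offers fit_transform() but no separate transform(), so it
#     cannot place previously unseen points into an existing embedding.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_std)

print(grouped.value_counts().to_dict())
print(X_pca.shape, X_tsne.shape)
print(pca.explained_variance_ratio_.round(2))
```

Note the trade-off this exposes: PCA is fast, repeatable, and its components can be interpreted and reapplied to new data, but it can only rotate and project the space. t-SNE can ‘unroll’ curved structure, but its axes have no direct meaning and its output depends on settings such as `perplexity`.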
To access the practical:
References
Footnotes
To be clear, this is a metaphor only!