Setting Up

Overview

In the first week we will be focussing on the supporting infrastructure for ‘doing data science’. That is to say, we’ll be dealing with the installation and configuration of tools such as GitHub and Docker, which support replicable, shareable, and documentable data science. As a (free) bonus, these tools also protect you against catastrophic (or merely irritating) data loss caused by over-zealous editing of code or content. You should see this as preparing the foundation not only for your remaining CASA modules (especially those in Term 2) but also for your post-MSc career.

Learning objectives
  1. A basic understanding of the data science ‘pipeline’.
  2. An understanding of how data scientists use a wide range of ‘tools’ to do data science.
  3. A completed installation/configuration of these tools.

You should also see this session as connecting to Quantitative Methods Week 1 content on ‘setting quantitative research questions’ since the main assessment will require you to develop a data-led policy briefing. In other words, you’ll need to map current policy on to one or more research questions that can be quantitatively examined using the tools and techniques acquired over the course of the term! While you don’t need to start work on this yet, you should keep it in the back of your mind for when you come across readings/results that you’d like to explore in more detail.

Preparation

Although none of these activities are compulsory in advance of the first session, getting your computer set up to code does take time and most of these preparatory activities are fairly straightforward… with a few exceptions noted below. If you are able to get these tools installed in advance then you can focus on the taught content in the first two practicals rather than also wrestling with an installation process. This will also give us more time to help you if you discover that you’re one of the unlucky few for whom getting set up is a lot more work!

Tip

Complete as many of these activities as you can:

  1. Go through the computer health check.
  2. Finish installing the programming environment.

The last of these is the stage where you’re most likely to encounter problems that will need our assistance, so finding out before Week 1 that you need our help means that you can ask for it much sooner in the practical!

Class

In this week’s workshop we will review the module aims, learning outcomes, and expectations with a general introduction to the course.

Session Presentation
Getting Oriented Slides
Tools of the Trade Slides
Writing Code Slides
Group Working Slides

Practical

This week’s practical is focussed on getting you set up with the tools and accounts that you’ll need across many of the CASA modules in Terms 1 and 2, and familiarising you with ‘how people do data science’. Outside of academia, it’s rare to find a data scientist who works entirely on their own: most code is collaborative, as is most analysis! But collaborating effectively requires tools that: get out of the way of doing ‘stuff’; support teams in negotiating conflicts in code; make it easy to share results; and make it easy to ensure that everyone is ‘on the same page’.

Connections

The practical focusses on:

  • Getting you up and running with the coding and collaboration tools.
  • Providing you with hands-on experience of using these tools.
  • Configuring your programming environment for the rest of the programme.

Note

To save a copy of the notebook to your own GitHub repo: follow the GitHub link, click on Raw and then Save File As... to save it to your own computer. Make sure to change the extension from .ipynb.txt (which will probably be the default) to .ipynb before adding the file to your GitHub repository.
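If your browser does save the notebook with a .ipynb.txt extension, the rename can also be scripted rather than done by hand. The following is a minimal Python sketch and not part of the practical itself: the function name fix_notebook_extension and the file name in the usage comment are hypothetical, and it relies only on the standard-library pathlib module.

```python
from pathlib import Path


def fix_notebook_extension(path: Path) -> Path:
    """Rename 'name.ipynb.txt' to 'name.ipynb'; leave any other file untouched.

    Returns the path of the file after the (possible) rename.
    """
    if path.name.endswith(".ipynb.txt"):
        # with_suffix("") drops only the final suffix, i.e. the trailing '.txt'.
        target = path.with_suffix("")
        if target.exists():
            # Refuse to overwrite a notebook that is already there.
            raise FileExistsError(f"{target} already exists")
        path.rename(target)
        return target
    return path


# Usage (hypothetical file name, adjust to wherever your download landed):
# fix_notebook_extension(Path("Practical-01.ipynb.txt"))
```

Checking for an existing target first is a deliberate choice: it avoids silently replacing a notebook that you may already have edited.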

To access the practical:

  1. Preview on GitHub
  2. Download the Notebook