Tools of the Trade

Jon Reades - j.reades@ucl.ac.uk

1st October 2025

Literate Programming

Ideally, we want to ‘do’ data science in ways that are ‘literate’.

The best programs are written so that computing machines can perform them quickly and so that human beings can understand them clearly. A programmer is ideally an essayist who works with traditional aesthetic and literary forms as well as mathematical concepts, to communicate the way that an algorithm works and to convince a reader that the results will be correct. ~ Knuth (1996)

Key Tenets

What we want:

  1. Weaving: the code and its documentation are together.
  2. Tangling: the code can be run directly.

In an ideal world, these are the same file…
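To make 'tangling' concrete, here is a minimal sketch that pulls the Python code out of a document mixing prose and code, then runs it. The `tangle` function, the sample document, and the fence pattern are illustrative assumptions. This is not how Knuth's tools or Quarto work internally.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example nests safely

# Matches a fenced block opened with either "python" or "{python}".
CODE_BLOCK = re.compile(
    "^" + FENCE + r"\{?python\}?[ \t]*\n(.*?)^" + FENCE + r"[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def tangle(document: str) -> str:
    """Return only the Python code found in a mixed prose-and-code document."""
    return "".join(match.group(1) for match in CODE_BLOCK.finditer(document))


# A hypothetical 'woven' document: narrative and code live side by side.
document = "\n".join([
    "# Area of a circle",
    "",
    "We approximate pi and compute the area for a radius of 2.",
    "",
    FENCE + "{python}",
    "radius = 2",
    "area = 3.14159 * radius ** 2",
    FENCE,
    "",
    "The result is stored in `area`.",
])

code = tangle(document)
namespace = {}
exec(code, namespace)  # run the tangled code directly
print(namespace["area"])
```

The same file serves the human reader (the prose) and the machine (the extracted code), which is the point of the two tenets above.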

But why would we want this?

And how do we do this?

Hint: it’s more than just one thing…

  1. JupyterLab: how we do ‘data science’.
  2. Virtualisation: separate your computer from your coding environment.
  3. Version Control: manage your code, your data, and even your reports.
  4. Markup: focus on the structure while you write!
  5. Render: creating documents and web pages from code and markup.

Jupyter(Lab) & Notebooks

Browser + JupyterLab + Markup == Tangled, Woven code in (m)any languages
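A notebook is one way of keeping prose and code in a single file. The sketch below hand-builds a tiny notebook as a Python dictionary, assuming the version 4 notebook layout (a list of `cells`, each with a `cell_type` and `source`). Real notebooks are written by JupyterLab and carry more metadata, so treat this as an illustration only.

```python
import json

# A hypothetical two-cell notebook: one Markdown cell, one code cell.
notebook = {
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["## Population density\n", "People per square kilometre."],
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["population = 9000\n", "density = population / 15"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

# Notebooks are stored as JSON text, which is why they can be shared and versioned.
serialised = json.dumps(notebook, indent=1)
cell_types = [cell["cell_type"] for cell in json.loads(serialised)["cells"]]
print(cell_types)
```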

Why Use JupyterLab?

Coding in JupyterLab has a number of advantages over ‘point-and-click’:

  1. Coding requires our instructions to be unambiguous and logical.
  2. Computers are infinitely patient so we can re-run as many times as necessary to get it ‘right’.
  3. There is nothing to install (runs in your web browser).
  4. You can run code from anywhere (runs in your web browser).

The Bigger Picture

If we can’t explain it simply enough that a computer can do it, perhaps we don’t actually understand it?

  • Together with the other tools in this talk, you can largely stop worrying about where code is running.
  • It’s easy to forget how you obtained a particular result when you are clicking around inside software like ArcGIS; it is much harder to lose track when every step is written down as code.
  • In analysing the problem so that we can submit it to the computer we often develop a better understanding of the problem ourselves!
  • Why spend your time doing the boring stuff???

JupyterLab + Python

Virtualisation

… a technology that enables the creation of virtual environments from a single physical machine, allowing for more efficient use of resources by distributing them across computing environments.

Source: Susnjara and Smalley (2025)

Two Basic ‘Flavours’

Both do the same thing: separate the platform from the hardware, but they do this in different ways for different reasons.

  • A ‘full’ Virtual Machine (VM) includes the Operating System and behaves like a separate computer even though it may share hardware with other VMs.
  • A ‘container’ is a ‘lightweight’ VM running only the application and its dependencies; everything else is managed by the host Operating System so the resulting ‘image’ is small and easy to distribute.

Short version: if you have to install an Operating System you are using a full VM; otherwise you are probably using a containerisation tool.

Many things, including storage, networks, CPUs, GPUs, etc. can be virtualised.

Why Use Containers?

We gain quite a few benefits:

  1. Easier installation and ‘everyone’ has the same versions of the code.
  2. Each container is isolated and read-only.
  3. Easy to tidy up when you’re done.
  4. Easy to scale up and scale down, or to link them together via ‘microservices’.
  5. Used in the ‘real world’ by many companies (JP Morgan Chase, GSK, PayPal, Twitter, Spotify, Uber…).
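One way to check the 'same versions' claim is to ask the running environment what it contains. The sketch below uses only the standard library; the `describe_environment` helper and the package names passed to it are hypothetical choices for illustration.

```python
import platform
from importlib import metadata


def describe_environment(packages):
    """Report the Python version, CPU type, and versions of the named packages."""
    report = {
        "python": platform.python_version(),
        "machine": platform.machine(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


# The second name is deliberately made up to show the 'not installed' case.
report = describe_environment(["pip", "definitely-not-a-real-package-xyz"])
for key, value in report.items():
    print(f"{key}: {value}")
```

If two people run this inside the same container image, the reports should agree; on two hand-configured laptops they often will not.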

The Bigger Picture

Rather than having one environment for all of our projects, we have one environment for each project.

  • ‘Computing contexts’ are disposable, while data and code are persistent when I need them.
  • I don’t care where my code and data are, so long as they’re accessible when I need them.
  • I don’t care if containers are created or destroyed, so long as they’re available when I need them.
  • I rebuild or update the computing context when I am ready to do so.

Podman

Podman is an open source container and image management engine. Podman makes it easy to find, run, build, and share containers.

Using Podman

Podman makes configuring a development environment (fairly) simple. If a Podman image works for us then we know it works for you.

Use either:

  1. jreades/sds:2025-amd (Windows and Older Macs)
  2. jreades/sds:2025-arm (Newer Macs)
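If you are unsure which image applies to your machine, Python can report the processor type. This is a rough sketch: the `suggest_image` helper is hypothetical, and it assumes that ARM machines report `arm64` or `aarch64` while everything else should use the `amd` image.

```python
import platform


def suggest_image(machine: str) -> str:
    """Map a processor name to one of the two image tags listed above."""
    arm_names = {"arm64", "aarch64"}  # assumed names for ARM processors
    suffix = "arm" if machine.lower() in arm_names else "amd"
    return f"jreades/sds:2025-{suffix}"


# platform.machine() reports the processor type of the current computer.
print(platform.machine(), "->", suggest_image(platform.machine()))
```

Treat the suggestion as a starting point and fall back on the list above if the reported name is unfamiliar.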

Version Control

… is the practice of tracking and managing changes to software code.

Source: Atlassian

Why use Version Control?

… If a mistake is made, developers can turn back the clock and compare earlier versions of the code to help fix the mistake while minimizing disruption to all team members.

Source: Atlassian

In addition:

  • We can share code with others (directly) as source code or (indirectly) as the product of compiling that source code.
  • We can rewind, fast forward, and combine changes by different people working on different features.
  • We gain detailed, incremental backups that help us to track down the changes that introduced a bug when something goes wrong.

The Bigger Picture

In open source projects there may be no one view of what the ‘right’ solution/version of a project is, so differences need to be negotiated.

  • Every computer with version control might have the ‘right’ version of the code for a given user, so there is no ‘master’ view of a project.
  • We need to be able to choose whether to merge other people’s changes with our changes, rather than having everything forced on us.
  • We still want to be able to share our version of the code / outputs of the code with other people, and a web site is a good way to do that.

Git

Version control allows us to:

  1. Track changes to files with a high level of detail using commit.
  2. push these changes out to others.
  3. pull down changes made by others.
  4. merge and resolve conflicting changes.
  5. Create a tag when a ‘milestone’ is reached.
  6. Create a branch to add a feature.
  7. Retrieve specific versions or branches with a checkout.
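Several of the operations above can be tried safely in a throwaway repository. The sketch below drives Git from Python and assumes that `git` is installed and on your PATH. The file name, tag, branch, and identity are hypothetical. Push and pull are left out because they need a remote server.

```python
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run one git command inside cwd and return its output as text."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


with tempfile.TemporaryDirectory() as repo:
    git("init", cwd=repo)
    # Set a local identity so the sketch does not depend on global settings.
    git("config", "user.name", "Example Student", cwd=repo)
    git("config", "user.email", "student@example.org", cwd=repo)

    notes = Path(repo) / "notes.md"
    notes.write_text("# Week 1\n")
    git("add", "notes.md", cwd=repo)
    git("commit", "-m", "Add week 1 notes", cwd=repo)  # track a change
    git("tag", "week-1", cwd=repo)                     # mark a milestone

    git("checkout", "-b", "feature", cwd=repo)         # branch to add a feature
    notes.write_text("# Week 1\n\nSome new ideas.\n")
    git("commit", "-am", "Draft new ideas", cwd=repo)

    git("checkout", "week-1", cwd=repo)                # retrieve the tagged version
    restored = notes.read_text()
    history = git("log", "--oneline", "--all", cwd=repo).splitlines()

print(restored)
print(len(history), "commits recorded")
```

Checking out the tag restores the file exactly as it was at the milestone, while the later commit remains safe on the `feature` branch.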

GitHub

Git is distributed, meaning that every computer is a potential server and a potential authority. Result: commits on a plane!

But how do people find and access your code if your ‘server’ is a home machine that goes to sleep at night? Result: GitHub.

GitHub is ‘just’ a very large Git server with a lot of nice web-friendly features tacked on: create a web site, issue/bug tracking, promote your project…

Git+GitHub is for… anything!

Oh My Git!

Source: OhMyGit

Markup

A markup language is a text-encoding system which specifies the structure and formatting of a document…

Source: Wikipedia

Why use Markup?

  • Quickly sketch out the structure of a document.
  • Focus on the substance, not the style.
  • Works well with version control (line-by-line changes + GitHub.io web site).
  • Combine code, documentation, and narrative easily.
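The version control point is easy to demonstrate: because markup is plain text, a change shows up as added or removed lines. Here is a small sketch using Python's standard `difflib` module; the two drafts and their file names are made up for illustration.

```python
import difflib

# Two hypothetical drafts of the same Markdown file.
before = [
    "## Why use Markup?",
    "",
    "Focus on the substance, not the style.",
]
after = [
    "## Why use Markup?",
    "",
    "Focus on the substance, not the style.",
    "Works well with version control.",
]

# unified_diff reports changes line by line, much as Git does.
diff = list(
    difflib.unified_diff(
        before, after, fromfile="draft_v1.md", tofile="draft_v2.md", lineterm=""
    )
)
print("\n".join(diff))
```

A binary word-processor file offers no such line-by-line view, which is one reason plain-text markup and Git pair so well.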

The Bigger Picture

I spend a lot less time ‘faffing’ when I write in Markdown than I used to. Spend more time on what you want to say and worry about the how later.

### A Subtitle

Some text in **bold** and *italics* with a [link](https://jreades.github.io/).

> A blockquote

A Subtitle

Some text in bold and italics with a link.

A blockquote

Markdown Examples

See CommonMark and the Markdown Guide for more:

Format: Plain text...
Output: Plain text

Format: ## A Large Heading
Output: A Large Heading

Format: ### A Medium Heading
Output: A Medium Heading

Format:
- A list
- More list
Output:
  • A list
  • More list

Format:
1. An ordered list
2. More ordered list
Output:
  1. An ordered list
  2. More ordered list

Format: [A link](http://casa.ucl.ac.uk)
Output: A link

Format: ![Alt Text](casa_logo.jpg)
Output: Alt Text

Format: `x=y+1`
Output: x=y+1

Format:
```{python}
# A block of Python code
x = y+1
```
Output:
# A block of Python code
x = y+1

Embedded Maths

$$
f(a) = \frac{1}{2\pi i} 
    \oint_{\gamma} \frac{f(z)}{z-a} dz
$$

\[ f(a) = \frac{1}{2\pi i} \oint_{\gamma} \frac{f(z)}{z-a} dz \]

Custom Markup

<div style="border: dotted 2px red; margin-top: 25px; background-color: rgb(230,230,230)">
  This content has HTML formatting attached.
</div>

This content has HTML formatting attached.

Render

‘Rendering’ is the process of taking all of the code and markup and outputting it to a particular format (web page, web site, PDF, etc.). So it’s the last piece of this pipeline for working with data, code, and text.

Why Render?

  • Outputs can be: web pages, Jupyter notebooks, Word documents, PDFs, presentations…
  • It can be really useful to have a single input and multiple outputs because requirements and needs always change.
  • It teaches you to focus on the process, not the minutiae.
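The 'single input, multiple outputs' idea can be sketched with a toy renderer. The two functions below are hypothetical and handle only headings, bold, and italics. Real tools such as Quarto and Pandoc do far more, and this is not how they are implemented.

```python
import html
import re


def render_html(markdown_text: str) -> str:
    """Render a tiny Markdown subset (headings, bold, italics) as HTML."""
    rendered = []
    for line in markdown_text.splitlines():
        if not line.strip():
            continue  # blank lines only separate blocks
        text = html.escape(line, quote=False)
        heading = re.match(r"(#{1,6}) (.*)", text)
        if heading:
            tag, body = f"h{len(heading.group(1))}", heading.group(2)
        else:
            tag, body = "p", text
        body = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", body)
        body = re.sub(r"\*(.+?)\*", r"<em>\1</em>", body)
        rendered.append(f"<{tag}>{body}</{tag}>")
    return "\n".join(rendered)


def render_text(markdown_text: str) -> str:
    """Render the same subset as plain text by stripping the markup."""
    text = re.sub(r"^#{1,6} ", "", markdown_text, flags=re.MULTILINE)
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)
    return re.sub(r"\*(.+?)\*", r"\1", text)


# One input...
source = "### A Subtitle\n\nSome text in **bold** and *italics*."

# ...two outputs.
as_html = render_html(source)
as_text = render_text(source)
print(as_html)
print(as_text)
```

Adding a third output format means writing one more function, not rewriting the source document.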

The Bigger Picture

Everything this week was created using these basic tools and techniques. It has transformed the way I teach, do research, and write! It embodies the potential of ‘literate programming’ (Knuth 1984).

Recap

  • Podman is how you will run a container (a lightweight virtual machine) with all the necessary tools pre-installed.
  • JupyterLab is how you will ‘talk’ to the container and tell it to run code.
  • Markdown is how you will write simply-formatted content (in JupyterLab and Quarto).
  • Git/GitHub is how you will manage code and content so that you always have a backup plan.
  • Rendering in Quarto is how you will output nicely formatted code and content.

Don’t just take our word for it…

[Charts not reproduced: Programming Languages (Used); Databases (Used); Frameworks & Libraries (Used); Virtualisation & Other Tools (Used); Programming Languages (Desired/Admired); Databases (Desired/Admired).]

Additional Resources

And once you’re ready to get ‘serious’, check out this tutorial on Sustainable Authorship in Plain Text using Pandoc and Markdown from The Programming Historian! That’s what actually underpins Quarto, but you can do so much more…

Let’s Get Started!

References

Abhinav. 2025. “Docker’s Gone — Here’s Why It’s Time to Move on | by Abhinav | Medium.” Online; Medium.
Knuth, D. E. 1984. “Literate Programming.” The Computer Journal 27 (2). Oxford University Press:97–111.
———. 1996. Selected Papers on Computer Science. Cambridge University Press.
Susnjara, S., and I. Smalley. 2025. “What Is Virtualization?” IBM. https://www.ibm.com/think/topics/virtualization.