Scaling Data

Jon Reades - j.reades@ucl.ac.uk

1st October 2025

The Challenge

As the data get bigger,
pandas gets slower.

The Options

  1. Make pandas faster.
  2. Make the storage better.
  3. Make using more cores easier.

Make Pandas Faster: Code

Introducing:
lazy evaluation and filtergraphs.
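As a minimal sketch of the idea, using Polars' lazy API against a hypothetical 'titanic.pq' file (the file is only created later in these slides): nothing runs until collect() is called, so the engine can optimise the whole query plan, pushing filters and column selection down into the file read.

import polars as pl

# Build the query lazily: no data is read yet, only a plan is recorded.
lazy = (
    pl.scan_parquet('titanic.pq')      # hypothetical file, created later
      .filter(pl.col('age') > 18)      # filter can be pushed down to the scan
      .select(['class', 'fare'])       # only these columns are ever read
)

print(lazy.explain())  # inspect the optimised query plan
df = lazy.collect()    # only now does any work happen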

Make Pandas Faster: Storage

Introducing:
Arrow and Parquet.
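A rough sketch of what column-oriented storage buys you, assuming the pyarrow engine and the same hypothetical 'titanic.pq' file: you read only the columns you need, and row filters can be pushed down to Arrow so whole row groups are skipped before anything reaches pandas.

import pandas as pd

# Read two columns only, and let Arrow apply the row filter while reading.
df = pd.read_parquet(
    'titanic.pq',                   # hypothetical file, created later
    columns=['age', 'fare'],
    filters=[('age', '>', 18)],     # handled by pyarrow, not pandas
    engine='pyarrow',
)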

Make Pandas Faster: Scaling

Introducing:
Parallelisation.
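As a rough illustration of the idea (not any particular library's API): split the work across files and let each core handle one. The Parquet part files here are hypothetical; the standard-library ProcessPoolExecutor does the fan-out.

from concurrent.futures import ProcessPoolExecutor
import pandas as pd

def mean_fare(path: str) -> float:
    # Each worker process reads and summarises one file independently.
    return pd.read_parquet(path, columns=['fare'])['fare'].mean()

paths = ['part-0.pq', 'part-1.pq', 'part-2.pq']  # hypothetical file names

if __name__ == '__main__':
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(mean_fare, paths)))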

The Tools

Introducing:
Polars, Dask, DuckDB, and Postgres.

What About Polars?

Pros:

  • Looks like pandas code, but doesn’t run like it.
  • Multi-threaded execution.
  • Arrow behind the scenes.
  • Streaming data processing possible.

Cons:

  • Easy to miss out on the benefits if you fall back on eager, pandas-style calls (see the sketch below).
  • Designed for single machines.
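To illustrate that first con, a small sketch: read_parquet is eager and behaves much like pandas, while scan_parquet keeps the query lazy so Polars can optimise it before reading anything.

import polars as pl

# Eager: the whole file is read before anything is filtered.
eager = pl.read_parquet('titanic.pq').filter(pl.col('age') > 18)

# Lazy: the filter is part of the plan and can be pushed into the scan.
lazy = pl.scan_parquet('titanic.pq').filter(pl.col('age') > 18).collect()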

What About the Duck?

Pros:

  • Serverless SQL queries against Parquet files.
  • Queries can be returned as Pandas data frames (see the sketch below).
  • Select and filter before loading using Arrow.
  • Plugins for geospatial and remote parquet files.

Cons:

  • Need to learn SQL.
  • Still constrained by available memory.
  • Really still optimised for a single machine.
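A minimal sketch of the pattern described above, querying the 'titanic.pq' file created later in these slides and pulling the result back as a pandas data frame with .df().

import duckdb as duck

# Query the Parquet file in place and materialise the result in pandas.
fares = duck.sql("""
    SELECT "class", AVG(fare) AS mean_fare
    FROM read_parquet('titanic.pq')
    WHERE age > 18
    GROUP BY "class"
""").df()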

What About Dask?

Pros:

  • Distributed and parallel data processing.
  • Queries returned as Pandas data frames (see the sketch below).
  • Lazy evaluation of code.
  • Built-in scaling.

Cons:

  • Lots of overhead for small queries.
  • Moving a lot of data around.
  • Not all problems readily parallelisable.
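A small sketch of the Dask pattern, again against the 'titanic.pq' file used elsewhere in these slides; as with Polars, nothing is computed until compute() is called.

import dask.dataframe as dd

ddf = dd.read_parquet('titanic.pq')              # lazy: builds a task graph
mean_fare = ddf.groupby('class')['fare'].mean()  # still lazy
print(mean_fare.compute())                       # runs the graph in parallel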

Postgres?

Pros:

  • Fully-fledged database.
  • Industry-leading handling of geospatial data.

Cons:

  • Need to learn SQL.
  • Need to learn how to design databases to maximise gains.
  • Moving lots of data in/out of the database is slow.
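A sketch of the round trip behind that last point: the connection string, table, and columns here are hypothetical, but the pattern (SQL in, pandas data frame out via SQLAlchemy) is the usual one.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection details and schema.
engine = create_engine('postgresql://user:password@localhost:5432/mydb')

df = pd.read_sql(
    'SELECT la, SUM(households) AS households FROM census GROUP BY la',
    engine,
)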

A Simple Comparison

import pandas as pd
import polars as pl
import duckdb as duck
import time
import seaborn as sns

titanic = sns.load_dataset('titanic')
titanic.to_parquet('titanic.pq')
titanic.head()
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True

Measuring Performance

We can compare performance using profiling:

import timeit

def load_pandas():
    pd.read_parquet('titanic.pq', columns=['age']).age.mean()

def load_polars():
    pl.read_parquet('titanic.pq', columns=['age'])['age'].mean()

def load_duck():
    duck.sql("SELECT MEAN(age) FROM read_parquet('titanic.pq')")

num_reps = 1000

The Results

Pandas

pd_execution_time = timeit.timeit(load_pandas, number=num_reps)
print(f"Pandas execution time: {pd_execution_time:.6f} seconds")
Pandas execution time: 0.784750 seconds

Polars

pl_execution_time = timeit.timeit(load_polars, number=num_reps)
print(f"Polars execution time: {pl_execution_time:.6f} seconds")
Polars execution time: 0.161258 seconds

79.45% faster than pandas.

DuckDB

db_execution_time = timeit.timeit(load_duck, number=num_reps)
print(f"DuckDB execution time: {db_execution_time:.6f} seconds")
DuckDB execution time: 0.100967 seconds

87.13% faster than pandas and 37.39% faster than polars.

Summarising Data

Useful, But Limited?

Method            Achieves
count()           Total number of items
first(), last()   First and last item
mean(), median()  Mean and median
min(), max()      Minimum and maximum
std(), var()      Standard deviation and variance
mad()             Mean absolute deviation (removed in pandas 2.0)
prod()            Product of all items
sum()             Sum of all items
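Several of these can be combined in a single call on the Titanic data loaded earlier; a quick sketch:

titanic[['age', 'fare']].agg(['count', 'mean', 'median', 'std'])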

Grouping Operations

In Pandas these follow a split / apply / combine approach:

In Practice

grouped_df = df.groupby(<fields>).<function>()

For instance, if we had a Local Authority (LA) field:

grouped_df = df.groupby('LA').sum()

Using apply the function could be anything:

def norm_by_data(x): # x is a DataFrame holding one group's rows
    x['d1'] /= x['d2'].sum() 
    return x

df.groupby('LA').apply(norm_by_data)

Grouping by Arbitrary Mappings

mapping = {'HAK':'Inner', 'TH':'Outer', 'W':'Inner'}
df.set_index('LA', inplace=True)
df.groupby(mapping).sum()

Pivot Tables

A ‘special case’ of Group By features:

  • Commonly-used in business to summarise data for reporting.
  • Grouping (summarisation) happens along both axes (Group By operates only on one).
  • pandas.cut(<series>, <bins>) can be a useful feature here since it chops a continuous feature into bins suitable for grouping.

In Practice

age = pd.cut(titanic['age'], [0, 18, 80])
titanic.pivot_table('survived', ['sex', age], 'class')

Counts

Pivots & Groups

Extracting

Thank You
