Scaling Data

Jon Reades - j.reades@ucl.ac.uk

1st October 2025

The Challenge

As the data get bigger, pandas gets slower.

The Options

  1. Make pandas faster.
  2. Make the storage better.
  3. Make using more cores easier.

Make Pandas Faster: Code

Introducing:
lazy evaluation and filtergraphs.
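For example, a minimal sketch of lazy evaluation in Polars (assuming the titanic.pq file created later in these slides): scan_parquet builds a query plan rather than reading anything, the optimiser rewrites the plan, and no data is touched until collect().

import polars as pl

lazy = (
    pl.scan_parquet('titanic.pq')        # builds a plan; reads nothing yet
      .filter(pl.col('age') > 18)        # pushed down into the scan
      .group_by('class')
      .agg(pl.col('fare').mean())
)
print(lazy.explain())                    # inspect the optimised plan
result = lazy.collect()                  # execution happens here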

Make Pandas Faster: Storage

Introducing:
Arrow and Parquet.
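A minimal sketch of why this helps (again assuming the titanic.pq file created later in these slides): Parquet is columnar, so pyarrow can read a single column without touching the rest of the file, and the resulting Arrow table converts to pandas cheaply.

import pyarrow.parquet as pq

table = pq.read_table('titanic.pq', columns=['age'])  # reads only one column
df = table.to_pandas()                                # zero-copy where possible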

Make Pandas Faster: Scaling

Introducing:
Parallelisation.
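The underlying idea, as a hedged sketch using only the standard library (and the titanic.pq file created later in these slides): split the data into chunks and summarise each chunk on a separate core. The tools below do this, and much more, for you.

from concurrent.futures import ProcessPoolExecutor
import pandas as pd

def summarise(chunk: pd.DataFrame) -> float:
    return chunk['age'].mean()                   # independent work per chunk

if __name__ == '__main__':
    df = pd.read_parquet('titanic.pq')
    chunks = [df.iloc[i::4] for i in range(4)]   # split rows across 4 chunks
    with ProcessPoolExecutor(max_workers=4) as pool:
        means = list(pool.map(summarise, chunks))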

The Tools

Introducing:
Polars, Dask, DuckDB, and Postgres.

What About Polars?

Pros:

  • Looks like pandas code, doesn’t run like it.
  • Multi-threaded execution.
  • Arrow behind the scenes.
  • Streaming data processing possible.

Cons:

  • Easy to miss out on the benefits (see the sketch below).
  • Designed for single machines.
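To make that first con concrete (a sketch, assuming the titanic.pq file created later in these slides): the eager API looks and behaves much like pandas, but only the lazy API lets the optimiser push work into the scan.

import polars as pl

# Eager: familiar, but reads the whole file before filtering.
adults = pl.read_parquet('titanic.pq').filter(pl.col('age') > 18)

# Lazy: same logic, but the filter is pushed down into the scan.
adults = pl.scan_parquet('titanic.pq').filter(pl.col('age') > 18).collect()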

What About the Duck?

Pros:

  • Serverless SQL queries against Parquet files (see the sketch below).
  • Queries can be returned as Pandas data frames.
  • Select and filter before loading using Arrow.
  • Plugins for geospatial and remote parquet files.

Cons:

  • Need to learn SQL.
  • Still constrained by available memory.
  • Really still optimised for a single machine.
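A minimal sketch of the first two pros (assuming the titanic.pq file created later in these slides): the filter and aggregation run inside DuckDB, so only the small result lands in pandas.

import duckdb

result = duckdb.sql("""
    SELECT class, AVG(fare) AS mean_fare
    FROM read_parquet('titanic.pq')
    WHERE age > 18
    GROUP BY class
""").df()   # .df() hands back a pandas data frame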

What About Dask?

Pros:

  • Distributed and parallel data processing (see the sketch below).
  • Queries returned as Pandas data frames.
  • Lazy evaluation of code.
  • Built-in scaling.

Cons:

  • Lots of overhead for small queries.
  • Moving a lot of data around.
  • Not all problems readily parallelisable.
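A minimal sketch (assuming the titanic.pq file created later in these slides): read_parquet builds a lazy, partitioned data frame, and nothing runs until compute(), which executes the task graph in parallel and returns ordinary pandas objects.

import dask.dataframe as dd

ddf = dd.read_parquet('titanic.pq')       # lazy: builds a task graph
mean_age = ddf['age'].mean().compute()    # parallel execution happens here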

Postgres?

Pros:

  • Fully-fledged database (see the sketch below).
  • Industry-leading handling of geospatial data.

Cons:

  • Need to learn SQL.
  • Need to learn how to design databases to maximise gains.
  • Moving lots of data in/out of the database is slow.
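A hedged sketch of the usual pattern (the connection string and the passengers table are hypothetical): push the aggregation to the database and pull back only the summary rows, avoiding the slow bulk transfer.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection details; assumes a running Postgres server.
engine = create_engine('postgresql://user:password@localhost:5432/titanic')
summary = pd.read_sql(
    'SELECT class, AVG(age) AS mean_age FROM passengers GROUP BY class',
    engine)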

Is it Worth It?

A Simple Comparison

import pandas as pd
import polars as pl
import duckdb as duck
import time
import seaborn as sns
df = sns.load_dataset('titanic')
df.to_parquet('titanic.pq')
df.head()
   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
4         0       3    male  35.0      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True

Measuring Performance

We can compare performance using profiling:

import timeit

def load_pandas():
    pd.read_parquet('titanic.pq', columns=['age']).age.mean()

def load_polars():
    pl.read_parquet('titanic.pq', columns=['age'])['age'].mean()

def load_duck():
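    # NB: duck.sql() builds a relation lazily; execution is deferred
    # until results are fetched (e.g. with .df() or .fetchall()).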
    duck.sql("SELECT MEAN(age) FROM read_parquet('titanic.pq')")

num_reps = 1000

The Results

Pandas

pd_execution_time = timeit.timeit(load_pandas, number=num_reps)
print(f"Pandas execution time: {pd_execution_time:.6f} seconds")
Pandas execution time: 1.176813 seconds

Polars

pl_execution_time = timeit.timeit(load_polars, number=num_reps)
print(f"Polars execution time: {pl_execution_time:.6f} seconds")
Polars execution time: 1.050749 seconds

10.71% faster than pandas.

DuckDB

db_execution_time = timeit.timeit(load_duck, number=num_reps)
print(f"DuckDB execution time: {db_execution_time:.6f} seconds")
DuckDB execution time: 0.271305 seconds

76.95% faster than pandas and 74.18% faster than polars.

Summarising Data

Useful, But Limited?

Method            Achieves
count()           Total number of items
first(), last()   First and last item
mean(), median()  Mean and median
min(), max()      Minimum and maximum
std(), var()      Standard deviation and variance
mad()             Mean absolute deviation (removed in pandas 2.0)
prod()            Product of all items
sum()             Sum of all items
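
For instance, applied to the titanic data frame loaded above (a quick sketch):

df['age'].agg(['count', 'mean', 'std'])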

Grouping Operations

In Pandas these follow a split / apply / combine approach:

1. Raw data

   class sex   age
0  Third   m  22.0
1  First   f  38.0
2  Third   f  26.0
3  First   f  35.0
4  Third   m  35.0
5  Third   m   NaN
6  First   m  54.0
7  Third   m   2.0

2a. Grouped First Class

   class sex   age
1  First   f  38.0
3  First   f  35.0
6  First   m  54.0

2b. Grouped Third Class

   class sex   age
0  Third   m  22.0
2  Third   f  26.0
4  Third   m  35.0
5  Third   m   NaN
7  Third   m   2.0

3a. Summary First Class

 class    age
 First  42.33

3b. Summary Third Class

 class    age
 Third  21.25

4. Result

 class    age
 First  42.33
 Third  21.25

In Practice

grouped_df = df.groupby(<fields>).agg(['<function_1>', ..., '<function_n>'])

For instance, to get multiple summary values:

# This will throw an error when there are non-numeric columns
# grouped_df = df.groupby('class').agg(['sum','mean'])
# This will only aggregate one column
grouped_df = df.groupby('class').age.agg(['sum','mean'])
grouped_df.head()
            sum       mean
class
First   7111.42  38.233441
Second  5168.83  29.877630
Third   8924.92  25.140620
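
If you do want several columns at once, one approach (a sketch) is to select the numeric columns of interest before aggregating:

grouped_df = df.groupby('class')[['age', 'fare']].agg(['sum', 'mean'])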

Grouping by Arbitrary Mappings

In cases where we want to group by a derived value that doesn’t exist in the data, we can apply a mapping as part of the process:

mapping = {'First':'Fancy', 'Second':'Not-Fancy', 'Third':'Not-Fancy'}
grouped_df = df.set_index('class').groupby(mapping).age.agg(['sum','mean'])
grouped_df.head()
                sum       mean
class
Fancy       7111.42  38.233441
Not-Fancy  14093.75  26.692708

Apply

You can take this one step further by ‘applying’ an arbitrary function group-wise (or row- or column-wise on an ungrouped data frame).

def norm(x): # x is one group (a sub-DataFrame) from the grouped df
    return x['age'] * x['fare']

df.groupby('class').apply(norm)
class     
First  1      2708.7654
       3      1858.5000
       6      2800.5750
       11     1539.9000
       23      994.0000
                ...    
Third  882     231.3674
       884     176.2500
       885    1135.8750
       888          NaN
       890     248.0000
Length: 891, dtype: float64

Pivot Tables

A ‘special case’ of Group By features:

  • Commonly used in business to summarise data for reporting.
  • Grouping (summarisation) happens along both axes (Group By operates only on one).
  • pandas.cut(<series>, <bins>) can be a useful feature here since it chops a continuous feature into bins suitable for grouping.

In Pandas

age = pd.cut(df['age'], [0, 18, 80])
pd.set_option('display.float_format', '{:,.3f}'.format)
df.pivot_table('survived', ['sex', age], 'class', observed=False)
class              First  Second  Third
sex    age
female (0, 18]     0.909   1.000  0.512
       (18, 80]    0.973   0.900  0.424
male   (0, 18]     0.800   0.600  0.216
       (18, 80]    0.375   0.071  0.134

In SQL

sql = """PIVOT (
    SELECT sex, survived, class, CASE WHEN age < 18 then '(0, 18]' else '(18, 80]' end as age 
    FROM read_parquet('titanic.pq'))
  ON class USING SUM(survived) AS survived GROUP BY sex, age"""
  
pd.set_option('display.float_format', '{:,.0f}'.format)
df = duck.sql(sql).to_df().set_index('sex')
df.columns = ['Age','First','Second','Third']
df
             Age  First  Second  Third
sex
female  (18, 80]     84      58     53
male     (0, 18]      4       9     10
male    (18, 80]     41       8     37
female   (0, 18]      7      12     19

Counts

Other ways to summarise:

duck.sql("""SELECT class, sex, case WHEN survived=1 THEN 'Y' ELSE 'N' end AS survived, COUNT(*) AS n
  FROM read_parquet('titanic.pq') WHERE sex='female' GROUP BY sex, class, survived
  ORDER BY sex ASC, class ASC, survived DESC""").to_df().set_index(['class','sex'])
              survived   n
class  sex
First  female        Y  91
       female        N   3
Second female        Y  70
       female        N   6
Third  female        Y  72
       female        N  72
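
For comparison, a hedged pandas equivalent using crosstab (re-reading the file, since df was re-used for the SQL pivot above):

tdf = pd.read_parquet('titanic.pq')
pd.crosstab([tdf['class'], tdf['sex']], tdf['survived'])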

Additional Resources

Thank You
