Scaling Data

Jon Reades - j.reades@ucl.ac.uk

1st October 2025

The Challenge

As the data get bigger, pandas gets slower.

The Options

  1. Make pandas faster.
  2. Make the storage better.
  3. Make using more cores easier.

Make Pandas Faster: Code

Introducing:
lazy evaluation and filtergraphs.
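For example, a minimal sketch of lazy evaluation in Polars (assuming the titanic.pq file created later in these slides): scan_parquet builds a query plan rather than reading anything, the optimiser rewrites the plan, and no data is touched until collect().

import polars as pl

lazy = (
    pl.scan_parquet('titanic.pq')        # builds a plan; reads nothing yet
      .filter(pl.col('age') > 18)        # pushed down into the scan
      .group_by('class')
      .agg(pl.col('fare').mean())
)
print(lazy.explain())                    # inspect the optimised plan
result = lazy.collect()                  # execution happens here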

Make Pandas Faster: Storage

Introducing:
Arrow and Parquet.
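A minimal sketch of why this helps (again assuming the titanic.pq file created later in these slides): Parquet is columnar, so pyarrow can read a single column without touching the rest of the file, and the resulting Arrow table converts to pandas cheaply.

import pyarrow.parquet as pq

table = pq.read_table('titanic.pq', columns=['age'])  # reads only one column
df = table.to_pandas()                                # zero-copy where possible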

Make Pandas Faster: Scaling

Introducing:
Parallelisation.
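The underlying idea, as a hedged sketch using only the standard library (and the titanic.pq file created later in these slides): split the data into chunks and summarise each chunk on a separate core. The tools below do this, and much more, for you.

from concurrent.futures import ProcessPoolExecutor
import pandas as pd

def summarise(chunk: pd.DataFrame) -> float:
    return chunk['age'].mean()                   # independent work per chunk

if __name__ == '__main__':
    df = pd.read_parquet('titanic.pq')
    chunks = [df.iloc[i::4] for i in range(4)]   # split rows across 4 chunks
    with ProcessPoolExecutor(max_workers=4) as pool:
        means = list(pool.map(summarise, chunks))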

The Tools

Introducing:
Polars, Dask, DuckDB, and Postgres.

What About Polars?

Pros:

  • Looks like pandas code, doesn’t run like it.
  • Multi-threaded execution.
  • Arrow behind the scenes.
  • Streaming data processing possible.

Cons:

  • Easy to miss out on the benefits (see the sketch below).
  • Designed for single machines.
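To make that first con concrete (a sketch, assuming the titanic.pq file created later in these slides): the eager API looks and behaves much like pandas, but only the lazy API lets the optimiser push work into the scan.

import polars as pl

# Eager: familiar, but reads the whole file before filtering.
adults = pl.read_parquet('titanic.pq').filter(pl.col('age') > 18)

# Lazy: same logic, but the filter is pushed down into the scan.
adults = pl.scan_parquet('titanic.pq').filter(pl.col('age') > 18).collect()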

What About the Duck?

Pros:

  • Serverless SQL queries against Parquet files (see the sketch below).
  • Queries can be returned as Pandas data frames.
  • Select and filter before loading using Arrow.
  • Plugins for geospatial and remote parquet files.

Cons:

  • Need to learn SQL.
  • Still constrained by available memory.
  • Really still optimised for a single machine.
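A minimal sketch of the first two pros (assuming the titanic.pq file created later in these slides): the filter and aggregation run inside DuckDB, so only the small result lands in pandas.

import duckdb

result = duckdb.sql("""
    SELECT class, AVG(fare) AS mean_fare
    FROM read_parquet('titanic.pq')
    WHERE age > 18
    GROUP BY class
""").df()   # .df() hands back a pandas data frame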

What About Dask?

Pros:

  • Distributed and parallel data processing (see the sketch below).
  • Queries returned as Pandas data frames.
  • Lazy evaluation of code.
  • Built-in scaling.

Cons:

  • Lots of overhead for small queries.
  • Moving a lot of data around.
  • Not all problems readily parallelisable.
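A minimal sketch (assuming the titanic.pq file created later in these slides): read_parquet builds a lazy, partitioned data frame, and nothing runs until compute(), which executes the task graph in parallel and returns ordinary pandas objects.

import dask.dataframe as dd

ddf = dd.read_parquet('titanic.pq')       # lazy: builds a task graph
mean_age = ddf['age'].mean().compute()    # parallel execution happens here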

Postgres?

Pros:

  • Fully-fledged database (see the sketch below).
  • Industry-leading handling of geospatial data.

Cons:

  • Need to learn SQL.
  • Need to learn how to design databases to maximise gains.
  • Moving lots of data in/out of the database is slow.
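A hedged sketch of the usual pattern (the connection string and the passengers table are hypothetical): push the aggregation to the database and pull back only the summary rows, avoiding the slow bulk transfer.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection details; assumes a running Postgres server.
engine = create_engine('postgresql://user:password@localhost:5432/titanic')
summary = pd.read_sql(
    'SELECT class, AVG(age) AS mean_age FROM passengers GROUP BY class',
    engine)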

Is it Worth It?

A Simple Comparison

import pandas as pd
import polars as pl
import duckdb as duck
import time
import seaborn as sns
df = sns.load_dataset('titanic')
df.to_parquet('titanic.pq')
df.head()
   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
4         0       3    male  35.0      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True

Measuring Performance

We can compare performance using profiling:

import timeit

def load_pandas():
    pd.read_parquet('titanic.pq', columns=['age']).age.mean()

def load_polars():
    pl.read_parquet('titanic.pq', columns=['age'])['age'].mean()

def load_duck():
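    # NB: duck.sql() builds a relation lazily; execution is deferred
    # until results are fetched (e.g. with .df() or .fetchall()).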
    duck.sql("SELECT MEAN(age) FROM read_parquet('titanic.pq')")

num_reps = 1000

The Results

Pandas

pd_execution_time = timeit.timeit(load_pandas, number=num_reps)
print(f"Pandas execution time: {pd_execution_time:.6f} seconds")
Pandas execution time: 1.176813 seconds

Polars

pl_execution_time = timeit.timeit(load_polars, number=num_reps)
print(f"Polars execution time: {pl_execution_time:.6f} seconds")
Polars execution time: 1.050749 seconds

10.71% faster than pandas.

DuckDB

db_execution_time = timeit.timeit(load_duck, number=num_reps)
print(f"DuckDB execution time: {db_execution_time:.6f} seconds")
DuckDB execution time: 0.271305 seconds

76.95% faster than pandas and 74.18% faster than polars.

Summarising Data

Useful, But Limited?

Method            Achieves
count()           Total number of items
first(), last()   First and last item
mean(), median()  Mean and median
min(), max()      Minimum and maximum
std(), var()      Standard deviation and variance
mad()             Mean absolute deviation (removed in pandas 2.0)
prod()            Product of all items
sum()             Sum of all items
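
For instance, applied to the titanic data frame loaded above (a quick sketch):

df['age'].agg(['count', 'mean', 'std'])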

Grouping Operations

In Pandas these follow a split / apply / combine approach:

1. Raw data

   class sex   age
0  Third   m  22.0
1  First   f  38.0
2  Third   f  26.0
3  First   f  35.0
4  Third   m  35.0
5  Third   m   NaN
6  First   m  54.0
7  Third   m   2.0

2a. Grouped First Class

   class sex   age
1  First   f  38.0
3  First   f  35.0
6  First   m  54.0

2b. Grouped Third Class

   class sex   age
0  Third   m  22.0
2  Third   f  26.0
4  Third   m  35.0
5  Third   m   NaN
7  Third   m   2.0

3a. Summary First Class

 class    age
 First  42.33

3b. Summary Third Class

 class    age
 Third  21.25

4. Result

 class    age
 First  42.33
 Third  21.25

In Practice

grouped_df = df.groupby(<fields>).agg(['<function_1>', ..., '<function_n>'])

For instance, to get multiple summary values:

# This will throw an error when there are non-numeric columns
# grouped_df = df.groupby('class').agg(['sum','mean'])
# This will only aggregate one column
grouped_df = df.groupby('class').age.agg(['sum','mean'])
grouped_df.head()
            sum       mean
class
First   7111.42  38.233441
Second  5168.83  29.877630
Third   8924.92  25.140620
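
If you do want several columns at once, one approach (a sketch) is to select the numeric columns of interest before aggregating:

grouped_df = df.groupby('class')[['age', 'fare']].agg(['sum', 'mean'])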

Grouping by Arbitrary Mappings

In cases where we want to group by a derived value that doesn’t exist in the data, we can apply a mapping as part of the process:

mapping = {'First':'Fancy', 'Second':'Not-Fancy', 'Third':'Not-Fancy'}
grouped_df = df.set_index('class').groupby(mapping).age.agg(['sum','mean'])
grouped_df.head()
                sum       mean
class
Fancy       7111.42  38.233441
Not-Fancy  14093.75  26.692708

Apply

You can take this one step further by ‘applying’ an arbitrary function group-wise (or row- or column-wise on an ungrouped data frame).

def norm(x): # x is one group (a sub-DataFrame) from the grouped df
    return x['age'] * x['fare']

df.groupby('class').apply(norm)
class     
First  1      2708.7654
       3      1858.5000
       6      2800.5750
       11     1539.9000
       23      994.0000
                ...    
Third  882     231.3674
       884     176.2500
       885    1135.8750
       888          NaN
       890     248.0000
Length: 891, dtype: float64

Pivot Tables

A ‘special case’ of Group By features:

  • Commonly used in business to summarise data for reporting.
  • Grouping (summarisation) happens along both axes (Group By operates only on one).
  • pandas.cut(<series>, <bins>) can be a useful feature here since it chops a continuous feature into bins suitable for grouping.

In Pandas

age = pd.cut(df['age'], [0, 18, 80])
pd.set_option('display.float_format', '{:,.3f}'.format)
df.pivot_table('survived', ['sex', age], 'class', observed=False)
class              First  Second  Third
sex    age
female (0, 18]     0.909   1.000  0.512
       (18, 80]    0.973   0.900  0.424
male   (0, 18]     0.800   0.600  0.216
       (18, 80]    0.375   0.071  0.134

In SQL

sql = """PIVOT (
    SELECT sex, survived, class, CASE WHEN age < 18 then '(0, 18]' else '(18, 80]' end as age 
    FROM read_parquet('titanic.pq'))
  ON class USING SUM(survived) AS survived GROUP BY sex, age"""
  
pd.set_option('display.float_format', '{:,.0f}'.format)
df = duck.sql(sql).to_df().set_index('sex')
df.columns = ['Age','First','Second','Third']
df
             Age  First  Second  Third
sex
female  (18, 80]     84      58     53
male     (0, 18]      4       9     10
male    (18, 80]     41       8     37
female   (0, 18]      7      12     19

Counts

Other ways to summarise:

duck.sql("""SELECT class, sex, case WHEN survived=1 THEN 'Y' ELSE 'N' end AS survived, COUNT(*) AS n
  FROM read_parquet('titanic.pq') WHERE sex='female' GROUP BY sex, class, survived
  ORDER BY sex ASC, class ASC, survived DESC""").to_df().set_index(['class','sex'])
              survived   n
class  sex
First  female        Y  91
       female        N   3
Second female        Y  70
       female        N   6
Third  female        Y  72
       female        N  72
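
For comparison, a hedged pandas equivalent using crosstab (re-reading the file, since df was re-used for the SQL pivot above):

tdf = pd.read_parquet('titanic.pq')
pd.crosstab([tdf['class'], tdf['sex']], tdf['survived'])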

Additional Resources

Thank You
