Jon Reades - j.reades@ucl.ac.uk
1st October 2025
Pros:
Cons:
| survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
We can compare performance using profiling1:
Pandas execution time: 1.171555 seconds
Polars execution time: 1.030761 seconds
12.02% faster than pandas.
DuckDB execution time: 0.409600 seconds
65.04% faster than pandas and 60.26% faster than polars.
| Method | Achieves |
|---|---|
count() |
Total number of items |
first(), last() |
First and last item |
mean(), median() |
Mean and median |
min(), max() |
Minimum and maximum |
std(), var() |
Standard deviation and variance |
mad() |
Mean absolute deviation |
prod() |
Product of all items |
sum() |
Sum of all items |
In Pandas these follow a split / apply / combine approach:
| class | sex | age | |
|---|---|---|---|
| 0 | Third | m | 22.0 |
| 1 | First | f | 38.0 |
| 2 | Third | f | 26.0 |
| 3 | First | f | 35.0 |
| 4 | Third | m | 35.0 |
| 5 | Third | m | NaN |
| 6 | First | m | 54.0 |
| 7 | Third | m | 2.0 |
| class | sex | age | |
|---|---|---|---|
| 1 | First | f | 38.0 |
| 3 | First | f | 35.0 |
| 6 | First | m | 54.0 |
| class | sex | age | |
|---|---|---|---|
| 0 | Third | m | 22.0 |
| 2 | Third | f | 26.0 |
| 4 | Third | m | 35.0 |
| 5 | Third | m | NaN |
| 7 | Third | m | 2.0 |
| class | age | |
|---|---|---|
| First | 42.33 |
| class | age | |
|---|---|---|
| Third | 21.25 |
| class | age | |
|---|---|---|
| First | 42.33 | |
| Third | 21.25 |
For instance, to get multiple summary values:
| sum | mean | |
|---|---|---|
| class | ||
| First | 7111.42 | 38.233441 |
| Second | 5168.83 | 29.877630 |
| Third | 8924.92 | 25.140620 |
In cases where we want to group by a derived value that doesn’t exist in the data we can apply a mapping as part of the process:
You can take this one step further by ‘applying’ an arbitrary function in a row- or column-wise fashion.
A ‘special case’ of Group By features:
pandas.cut(<series>, <bins>) can be a useful feature here since it chops a continuous feature into bins suitable for grouping.| class | First | Second | Third | |
|---|---|---|---|---|
| sex | age | |||
| female | (0, 18] | 0.909 | 1.000 | 0.512 |
| (18, 80] | 0.973 | 0.900 | 0.424 | |
| male | (0, 18] | 0.800 | 0.600 | 0.216 |
| (18, 80] | 0.375 | 0.071 | 0.134 |
sql = """PIVOT (
SELECT sex, survived, class, CASE WHEN age < 18 then '(0, 18]' else '(18, 80]' end as age
FROM read_parquet('titanic.pq'))
ON class USING SUM(survived) AS survived GROUP BY sex, age"""
pd.set_option('display.float_format', '{:,.0f}'.format)
df = duck.sql(sql).to_df().set_index('sex')
df.columns = ['Age','First','Second','Third']
df| Age | First | Second | Third | |
|---|---|---|---|---|
| sex | ||||
| female | (18, 80] | 84 | 58 | 53 |
| male | (0, 18] | 4 | 9 | 10 |
| male | (18, 80] | 41 | 8 | 37 |
| female | (0, 18] | 7 | 12 | 19 |
Other ways to summarise:
| survived | n | ||
|---|---|---|---|
| class | sex | ||
| First | female | Y | 91 |
| female | N | 3 | |
| Second | female | Y | 70 |
| female | N | 6 | |
| Third | female | Y | 72 |
| female | N | 72 |