Jon Reades - j.reades@ucl.ac.uk
1st October 2025
Pandas is probably (together with scipy, numpy, and sklearn) the main reason that Python has become popular for data science. According to ‘Learn Data Sci’ it accounts for 1% of all Stack Overflow question views!
You will want to bookmark these:
Here’s code to read a (remote) CSV file:
import pandas as pd # import package
# Bitly: https://raw.githubusercontent.com/jreades/fsds/master/data/2019-sample-crime.csv
url='https://bit.ly/39SJpfp'
df = pd.read_csv(url) # load a (remote) CSV
print(type(df)) # not simple data type
print(df.columns.to_list()) # column names
print(df.columns.values) # Also works but deprecated
Output:
`df.info()` is more about data types and memory usage; `df.describe()` is for summarising information about the distribution of values in every series.
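For example, using the crime data loaded above:

df.info()     # column dtypes, non-null counts, and memory usage
df.describe() # count, mean, std, min, quartiles, and max for numeric columns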
This should be looking eerily familiar:
print(type(df['Latitude'])) # type for column
print(type(df['Latitude'].array)) # type for values
print(df['Latitude'].array[:5]) # first five values
print(f"1: {df['Latitude'].mean()}") # summarise a series/column
print(f"2: {df.Latitude.mean()}") # if no spaces in name
Produces:
Pandas is also ‘Jupyter-aware’, meaning that output can be displayed directly in Jupyter in ‘fancy’ ways:
Pandas operates on two principles:

1. Operations return a copy of the data unless you specify `inplace=True` (where supported), so changes must be assigned back in order to ‘stick’.
2. Operations on a Data Frame return a Data Frame, and operations on a Series return a Series.

More subtly, operations on a Series or Data Frame return a *shallow* copy, which is like a ‘view’ in a database: trying to modify such a view triggers a `SettingWithCopyWarning`. If you need a full copy then use the `copy()` method (e.g. `df.copy()` or `df.Latitude.copy()`).
# Returns a series but not a column
df.Latitude - 1
# Saves returned series as a new column
df['lat'] = df.Latitude - 1
# Returns a new data frame w/o 'lat'
df.drop(columns=['lat'])
# Modifies df directly
df.drop(columns=['lat'], inplace=True)
# Try to modify a view of df (triggers warning)
df[df['Primary Type']=='BURGLARY'].Latitude = 41.7
Operations on a Data Frame return a Data Frame, and operations on a Series return a Series, allowing us to ‘chain’ steps together:
# Returns a selection (Boolean series)
df['Primary Type']=='ASSAULT'
# All rows where Primary Type is ASSAULT
df[ df['Primary Type']=='ASSAULT' ]
# Calculations on a slice (returns mean centroid!)
df[df['Primary Type']=='ASSAULT'][['Longitude','Latitude']].mean()
# Two conditions with a bit-wise AND
df[
(df['Primary Type']=='ASSAULT') &
(df['Description']=='AGGRAVATED: HANDGUN')
]
# Two conditions with a bit-wise OR
df[
(df['Primary Type']=='ASSAULT') |
(df['Primary Type']=='THEFT')
]
A Data Series can only be of one type:
| Pandas Dtype | Python Type | Usage |
|---|---|---|
| `object` | `str` or mixed | Text or mixed columns (including arrays) |
| `int64` | `int` | Integer columns |
| `float64` | `float` | Floating point columns |
| `bool` | `bool` | True/False columns |
| `datetime64` | N/A (`datetime`) | Date and time columns |
| `timedelta[ns]` | N/A (`datetime`) | Datetime difference columns |
| `category` | N/A (`set`) | Categorical columns |
print(df['Primary Type'].unique()) # Find unique values
print(df['Primary Type'].dtype.name) # Confirm is 'object'
df['Primary Type'] = df['Primary Type'].astype('category')
print(df['Primary Type'].dtype.name) # Confirm is 'category'
print(df['Primary Type'].describe()) # Category column info
Outputs:
What do we do here?
print(df.Date.dtype.name)
# object
df.Date.to_list()[:3]
# ['04/20/2019 11:00:00 PM', '12/02/2019 10:35:00 AM', '10/06/2019 04:50:00 PM']
This shows that `Date` is currently a string of dates and times. Pandas handles dates and times using a `datetime64` type that also works as an index (more on these later):
# NB: %I (12-hour clock) is needed for %p (AM/PM) to take effect; %H would silently ignore it
df['dt'] = pd.to_datetime(df.Date.array,
                          format="%m/%d/%Y %I:%M:%S %p")
print(df.dt.dtype.name)
# datetime64[ns]
df.dt.to_list()[:3]
# [Timestamp('2019-04-20 23:00:00'), Timestamp('2019-12-02 10:35:00'), Timestamp('2019-10-06 16:50:00')]
These follow the formatting conventions of `strftime` (string format time) for conversion. Examples of `strftime` conventions include:
| Format | Applies To |
|---|---|
| `%d` | 2-digit day |
| `%m` | 2-digit month |
| `%y` | 2-digit year |
| `%Y` | 4-digit year |
| `%H` | 2-digit hour (24-hour clock) |
| `%I` | 2-digit hour (12-hour clock) |
| `%p` | AM/PM |
So that is why:
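Reading the format string against one of the raw values makes the mapping explicit (the annotated pairing below is my sketch, not from the original slides):

# '04/20/2019 11:00:00 PM'
#  %m/%d/%Y %I:%M:%S %p
pd.to_datetime('04/20/2019 11:00:00 PM', format="%m/%d/%Y %I:%M:%S %p")
# Timestamp('2019-04-20 23:00:00')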
Note the other things happening here:

- `pd.to_datetime(...)` is not a method; it’s a function from the `pandas` package.
- `df.Date.array` (and `df.Date.to_numpy()` and `df.Date.to_list()`) gives access to the data directly, whereas `df.Date` gives access to the Series.

From time to time, real-world software projects will change the way things work. Pandas is just such a project!
Warning

We recommend using `Series.array` or `Series.to_numpy()`, depending on whether you need a reference to the underlying data or a NumPy array. See the API Documentation.

So while `Series.values` still works, and will continue to work for some time, you are being advised to start using `Series.array` or `Series.to_numpy()` instead. Meaning: we should consider using `df.Date.array`.
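A quick sketch of the difference between the three accessors (the exact class reported for `.array` varies with the pandas version):

print(type(df.Date.values))     # numpy.ndarray (the legacy accessor)
print(type(df.Date.array))      # a pandas ExtensionArray wrapping the data
print(type(df.Date.to_numpy())) # numpy.ndarray (possibly a copy)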
This is one way; there are many options and subtleties…
# Fix categories by mapping each one to its title-case form
mapping = {}
# df['Primary Type'].unique().to_list() also works
for x in df['Primary Type'].cat.categories.to_list():
    mapping[x] = x.title()
# And update
df['Primary Type'] = df['Primary Type'].cat.rename_categories(mapping)
How would you work out what this code does?
To deal with pricing information treated as a string:
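For instance, a minimal sketch assuming a hypothetical `Price` column holding strings like '$1,250.00' (no such column exists in the crime data):

df['Price'] = (
    df['Price']
      .str.replace('$', '', regex=False) # strip the currency symbol
      .str.replace(',', '', regex=False) # strip the thousands separator
      .astype(float)                     # convert to float64
)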
Many more examples accessible via Google!
There are multiple ways to drop ‘stuff’:
df2 = df.copy()
print(f"The data frame has {df2.shape[0]:,} rows and {df2.shape[1]:,} cols.")
df2.drop(index=range(5,10), inplace=True) # Row 'numbers' or index values
print(f"The data frame has {df2.shape[0]:,} rows and {df2.shape[1]:,} cols.")
df2.drop(columns=['Year'], inplace=True) # Column name(s)
print(f"The data frame has {df2.shape[0]:,} rows and {df2.shape[1]:,} cols.")
There is also `df.dropna()`, which can apply to rows or columns with `NULL` or `np.nan` values.
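For example:

df.dropna()                    # drop rows containing any NaN value
df.dropna(axis=1)              # drop columns containing any NaN value
df.dropna(subset=['Latitude']) # only consider NaNs in the Latitude column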
I often prefer `df = df[df.index > 15]` (negative selection) to `df.drop(index=range(0,14))` (positive selection).
| Index | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| | ID | Case Number | Date | Primary Type |
| 0 | 11667185 | JC237601 | 04/20/2020 11:00:00PM | BURGLARY |
| 1 | 11998178 | JC532226 | 12/02/2020 10:35:00AM | DECEPTIVE PRACTICE |
| 2 | 11852571 | JC462365 | 10/06/2020 04:50:00PM | BATTERY |
We can interact with rows and columns by position or name:
df.iloc[0:2,0:2] # List selection! (':' means 'all')
df.loc[0:2,['ID','Case Number']] # Dict selection
These actually return different results because of the index:

- `df.loc` returns the rows *labelled* 0, 1, and 2 (`[0..2]`), whereas
- `df.iloc` returns the rows in *positions* 0..2 (`[0..2)`)!

So, by default, pandas creates a row index whose values are 0..n and a column index whose values are the column names. You will see this if you print out the head:
The left-most column (without a name) is the index.
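The output below assumes that `ID` was then promoted to the index; a minimal sketch of that step (the original slide’s code is not shown here):

df.set_index('ID', inplace=True) # make the ID column the row index
print(df.head(3))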
Now we see:
         Case Number                    Date  ...  Longitude                  dt
ID                                            ...
11667185    JC237601  04/20/2019 11:00:00 PM  ... -87.603468 2019-04-20 23:00:00
11909178    JC532226  12/02/2019 10:35:00 AM  ... -87.643230 2019-12-02 10:35:00
11852571    JC462365  10/06/2019 04:50:00 PM  ... -87.758473 2019-10-06 16:50:00
So `ID` is now the index and is not accessible as a column: `df.ID` will now throw an error because it’s no longer part of the column index.
Notice the change to the data frame:
| | 0 | 1 | 2 |
|---|---|---|---|
| ID | Case Number | Date | Primary Type |
| 11667185 | JC237601 | 04/20/2020 11:00:00PM | BURGLARY |
| 11998178 | JC532226 | 12/02/2020 10:35:00AM | DECEPTIVE PRACTICE |
| 11852571 | JC462365 | 10/06/2020 04:50:00PM | BATTERY |
And now:
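For example, row selection is now by `ID` label; a quick sketch using an ID from the sample output above:

df.loc[11667185, 'Case Number'] # returns 'JC237601'
df.loc[11667185]                # returns the whole row as a Series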
Mnemonic: we used `iloc` to select rows/cols based on *integer location*, and we use `loc` to select rows/cols based on *name* (label) location.

P.S. You can reset the data frame’s index using `df.reset_index(inplace=True)`.
Pandas can write to a wide range of file types, here are some of the more popular ones:
| Command | Saved As… |
|---|---|
| `df.to_csv(<path>)` | CSV file. But note the options to change `sep` (default is `','`) and to suppress index output (`index=False`). |
| `df.to_excel(<path>)` | XLSX file. But note the options to specify a `sheet_name`, `na_rep`, and so on, as well as to suppress the index (`index=False`). |
| `df.to_feather(<path>)` | Feather file, directly usable by R. Requires `pyarrow` to be installed to access the options. |
| `df.to_parquet(<path>)` | Parquet file, directly usable by many languages. Requires `pyarrow` to be installed to access the options. |
| `df.to_latex(<path>)` | LaTeX-formatted table written to a file. Display requires `booktabs`. Could do copy+paste with `print(df.to_latex())`. |
| `df.to_markdown(<path>)` | Markdown-formatted table written to a file. Requires `tabulate`. Could do copy+paste with `print(df.to_markdown())`. |
In most cases compression is detected automatically (e.g. `df.to_csv('file.csv.gz')`), but you can also specify it (e.g. `df.to_csv('file.csv.gz', compression='gzip')`).
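For instance, a minimal sketch of a compressed round trip (the file name is illustrative):

df.to_csv('crime.csv.gz', index=False) # compression inferred from the extension
df2 = pd.read_csv('crime.csv.gz')      # decompression likewise inferred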