Pandas is probably (together with scipy, numpy, and sklearn) the main reason that Python has become popular for data science. According to ‘Learn Data Sci’ it accounts for 1% of all Stack Overflow question views!
You will want to bookmark these:
Here’s code to read a (remote) CSV file:
import pandas as pd # import package
# Bitly: https://raw.githubusercontent.com/jreades/fsds/master/data/2019-sample-crime.csv
url='https://bit.ly/39SJpfp'
df = pd.read_csv(url) # load a (remote) CSV
print(type(df)) # not simple data type
print(df.columns.to_list()) # column names
print(df.columns.values) # Also works, but no longer recommended
Output:
`df.info()` is more about data types and memory usage, while `df.describe()` summarises the distribution of values in each series.
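A minimal sketch of the difference, using a small made-up frame rather than the crime data:

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': ['a', 'b', 'c']})
df.info()             # dtypes, non-null counts, and memory usage
print(df.describe())  # count, mean, std, min, quartiles, max (numeric columns)
```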
This should be looking eerily familiar:
print(type(df['Latitude'])) # type for column
print(type(df['Latitude'].array)) # type for values
print(df['Latitude'].array[:5]) # first five values
print(f"1: {df['Latitude'].mean()}") # summarise a series/column
print(f"2: {df.Latitude.mean()}") # if no spaces in name
Produces:
Pandas is also ‘Jupyter-aware’, meaning that output can be displayed directly in Jupyter in ‘fancy’ ways:
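For example, a DataFrame left as the last expression in a cell is rendered as an HTML table rather than plain text (a sketch with a made-up frame standing in for the crime data):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2], 'Primary Type': ['THEFT', 'BATTERY']})
df.head()  # as the last line of a Jupyter cell this renders as a styled table
# Under the hood, Jupyter asks pandas for an HTML representation:
print('<table' in df.head().to_html())  # True
```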
Pandas operates on two principles:
More subtly, operations on a Series or Data Frame return a shallow copy, which is like a ‘view’ in a database. To change the underlying data directly you can often pass `inplace=True` (where supported); trying to modify a view instead triggers a `SettingWithCopyWarning`. If you need a full copy then use the `copy()` method (e.g. `df.copy()` or `df['Latitude'].copy()`).
# Returns a series but not a column
df.Latitude - 1
# Saves returned series as a new column
df['lat'] = df.Latitude - 1
# Returns a new data frame w/o 'lat'
df.drop(columns=['lat'])
# Modifies df directly
df.drop(columns=['lat'], inplace=True)
# Try to modify a view of df (triggers warning)
df[df['Primary Type']=='BURGLARY'].Latitude = 41.7
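If you do want to modify such a selection safely, take a full copy first (a sketch with a toy frame standing in for the crime data):

```python
import pandas as pd

df = pd.DataFrame({'Primary Type': ['BURGLARY', 'THEFT'],
                   'Latitude': [41.8, 41.9]})

# copy() de-links the subset from df, so assignment is safe
burglaries = df[df['Primary Type'] == 'BURGLARY'].copy()
burglaries['Latitude'] = 41.7      # modifies the copy only
print(df['Latitude'].to_list())    # df is unchanged: [41.8, 41.9]
```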
Operations on a Data Frame return a DataFrame and operations on a Series return a Series, allowing us to ‘chain’ steps together:
# Returns a selection (Boolean series)
df['Primary Type']=='ASSAULT'
# All rows where Primary Type is ASSAULT
df[ df['Primary Type']=='ASSAULT' ]
# Calculations on a slice (returns mean centroid!)
df[df['Primary Type']=='ASSAULT'][['Longitude','Latitude']].mean()
# Two conditions with a bit-wise AND
df[
(df['Primary Type']=='ASSAULT') &
(df['Description']=='AGGRAVATED: HANDGUN')
]
# Two conditions with a bit-wise OR
df[
(df['Primary Type']=='ASSAULT') |
(df['Primary Type']=='THEFT')
]
A Data Series can only be of one type:
Pandas Dtype | Python Type | Usage
---|---|---
`object` | `str` or mixed | Text or mixed columns (including arrays)
`int64` | `int` | Integer columns
`float64` | `float` | Floating point columns
`bool` | `bool` | True/False columns
`datetime64` | N/A (`datetime`) | Date and time columns
`timedelta[ns]` | N/A (`datetime`) | Datetime difference columns
`category` | N/A (`set`) | Categorical columns
print(df['Primary Type'].unique()) # Find unique values
print(df['Primary Type'].dtype.name) # Confirm is 'object'
df['Primary Type'] = df['Primary Type'].astype('category')
print(df['Primary Type'].dtype.name) # Confirm is 'category'
print(df['Primary Type'].describe()) # Category column info
Outputs:
What do we do here?
print(df.Date.dtype.name)
# object
df.Date.to_list()[:3]
# ['04/20/2019 11:00:00 PM', '12/02/2019 10:35:00 AM', '10/06/2019 04:50:00 PM']
This shows that `Date` is currently a string of dates and times. Pandas handles dates and times using a `datetime` type that also works as an index (more on these later):
df['dt'] = pd.to_datetime(df.Date.array,
            format="%m/%d/%Y %I:%M:%S %p") # %I (12-hour clock) lets %p take effect
print(df.dt.dtype.name)
# datetime64[ns]
df.dt.to_list()[:3]
# [Timestamp('2019-04-20 23:00:00'), Timestamp('2019-12-02 10:35:00'), Timestamp('2019-10-06 16:50:00')]
These follow the formatting conventions of `strftime` (string format time) for conversion. Examples of `strftime` conventions include:

Format | Applies To
---|---
%d | 2-digit day
%m | 2-digit month
%y | 2-digit year
%Y | 4-digit year
%H | 2-digit hour (24-hour clock)
%I | 2-digit hour (12-hour clock)
%p | AM/PM
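For instance, parsing one of the date strings above with these codes (note that the 12-hour `%I` is what allows `%p` to take effect):

```python
import pandas as pd

ts = pd.to_datetime('04/20/2019 11:00:00 PM', format='%m/%d/%Y %I:%M:%S %p')
print(ts)  # 2019-04-20 23:00:00
```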
Note the other things happening here:

- `pd.to_datetime(...)` is not a method, it’s a function from the `pandas` package.
- `df.Date.array` (and `df.Date.to_numpy()` and `df.Date.to_list()`) gives access to the data directly, whereas `df.Date` gives access to the Series.

From time to time, real-world software projects will change the way things work. Pandas is just such a project!

Warning

We recommend using `Series.array` or `Series.to_numpy()`, depending on whether you need a reference to the underlying data or a NumPy array. See the API Documentation.

So while `Series.values` still works, and will continue to work for some time, you are being advised to start using `Series.array` or `Series.to_numpy()` instead. Meaning, we should consider using `df.Date.array`.
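The difference in a sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, 2.5])
print(type(s.array))       # a pandas ExtensionArray wrapping the stored data
print(type(s.to_numpy()))  # a plain numpy.ndarray (possibly a copy)
print(isinstance(s.values, np.ndarray))  # the older .values still works: True
```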
This is one way; there are many options and subtleties…
# Fix categories
mapping = {}
# df['Primary Type'].unique().to_list() also works
for x in df['Primary Type'].cat.categories.to_list():
mapping[x]=x.title()
# And update
df['Primary Type'] = df['Primary Type'].cat.rename_categories(mapping)
How would you work out what this code does?
To deal with pricing information treated as a string:
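No example survives in this extract, so here is a hypothetical sketch (the `price` column name and its values are made up): strings like `'£1,250'` can be stripped of non-numeric characters and then cast:

```python
import pandas as pd

df = pd.DataFrame({'price': ['£1,250', '£950', '£2,100']})  # hypothetical data
df['price'] = (df['price']
               .str.replace('[£,]', '', regex=True)  # drop currency sign and commas
               .astype('float'))
print(df['price'].to_list())  # [1250.0, 950.0, 2100.0]
```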
Many more examples accessible via Google!
There are multiple ways to drop ‘stuff’:
df2 = df.copy()
print(f"The data frame has {df2.shape[0]:,} rows and {df2.shape[1]:,} cols.")
df2.drop(index=range(5,10), inplace=True) # Row 'numbers' or index values
print(f"The data frame has {df2.shape[0]:,} rows and {df2.shape[1]:,} cols.")
df2.drop(columns=['Year'], inplace=True) # Column name(s)
print(f"The data frame has {df2.shape[0]:,} rows and {df2.shape[1]:,} cols.")
There is also `df.dropna()` which can apply to rows or columns with `NULL` or `np.nan` values.
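A sketch of `dropna`’s row-wise and column-wise behaviour (toy data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'Latitude': [41.8, np.nan, 41.9]})
print(df.dropna().shape[0])                    # 2 -- drops the row holding NaN
print(df.dropna(subset=['Latitude']).shape[0]) # 2 -- only checks Latitude
print(df.dropna(axis=1).shape[1])              # 1 -- drops the NaN-bearing column
```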
I often prefer `df = df[df.index > 15]` (negative selection) to `df.drop(index=range(0,14))` (positive selection).
Index | 0 | 1 | 2 | 3
---|---|---|---|---
 | ID | Case Number | Date | Primary Type
0 | 11667185 | JC237601 | 04/20/2019 11:00:00 PM | BURGLARY
1 | 11998178 | JC532226 | 12/02/2019 10:35:00 AM | DECEPTIVE PRACTICE
2 | 11852571 | JC462365 | 10/06/2019 04:50:00 PM | BATTERY
We can interact with rows and columns by position or name:
df.iloc[0:2,0:2] # List selection! (':' means 'all')
df.loc[0:2,['ID','Case Number']] # Dict selection
These actually return different results because of the index: `df.loc` returns the rows labelled 0, 1, and 2 (`[0..2]`), whereas `df.iloc` returns the range 0..2 (`[0..2)`)!

So by default, pandas creates a row index whose values are 0..n and a column index whose values are the column names. You will see this if you print out the head:
The left-most column (without a name) is the index.
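The switch of index isn’t shown in this extract; it would have been done with something like `df.set_index('ID')` (a sketch with toy data):

```python
import pandas as pd

df = pd.DataFrame({'ID': [11667185, 11998178],
                   'Case Number': ['JC237601', 'JC532226']})
df.set_index('ID', inplace=True)  # 'ID' moves from the columns to the row index
print(df.index.name)              # ID
print('ID' in df.columns)         # False
```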
Now we see:
Case Number Date ... Longitude dt
ID ...
11667185 JC237601 04/20/2019 11:00:00 PM ... -87.603468 2019-04-20 23:00:00
11998178 JC532226 12/02/2019 10:35:00 AM ... -87.643230 2019-12-02 10:35:00
11852571 JC462365 10/06/2019 04:50:00 PM ... -87.758473 2019-10-06 16:50:00
So `ID` is now the index and is not accessible as a column: `df.ID` will now throw an error because it’s no longer part of the Column Index.
Notice the change to the data frame:
 | 0 | 1 | 2
---|---|---|---
ID | Case Number | Date | Primary Type
11667185 | JC237601 | 04/20/2019 11:00:00 PM | BURGLARY
11998178 | JC532226 | 12/02/2019 10:35:00 AM | DECEPTIVE PRACTICE
11852571 | JC462365 | 10/06/2019 04:50:00 PM | BATTERY
And now:
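A sketch of selecting by the new labels (toy data, assuming the index has been set to `ID`):

```python
import pandas as pd

df = (pd.DataFrame({'ID': [11667185, 11998178],
                    'Primary Type': ['BURGLARY', 'DECEPTIVE PRACTICE']})
        .set_index('ID'))
print(df.loc[11667185, 'Primary Type'])  # by label: BURGLARY
print(df.iloc[0]['Primary Type'])        # by position: the same row
```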
Mnemonic: we used `iloc` to select rows/cols based on integer location and we use `loc` to select rows/cols based on name location.
P.S. You can reset the data frame using `df.reset_index(inplace=True)`.
Pandas can write to a wide range of file types, here are some of the more popular ones:
Command | Saved As…
---|---
`df.to_csv(<path>)` | CSV file. But note the options to change `sep` (default is `','`) and to suppress index output (`index=False`).
`df.to_excel(<path>)` | XLSX file. But note the options to specify a `sheet_name`, `na_rep`, and so on, as well as to suppress the index (`index=False`).
`df.to_feather(<path>)` | Directly usable by R. Requires `pyarrow` to be installed to access the options.
`df.to_parquet(<path>)` | Directly usable by many languages. Requires `pyarrow` to be installed to access the options.
`df.to_latex(<path>)` | Write a LaTeX-formatted table to a file. Display requires `booktabs`. Could do copy+paste with `print(df.to_latex())`.
`df.to_markdown(<path>)` | Write a Markdown-formatted table to a file. Requires `tabulate`. Could do copy+paste with `print(df.to_markdown())`.
In most cases compression is detected automatically (e.g. `df.to_csv('file.csv.gz')`) but you can also specify it (e.g. `df.to_csv('file.csv.gz', compression='gzip')`).
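A round-trip sketch, writing to a temporary directory:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
path = os.path.join(tempfile.mkdtemp(), 'file.csv.gz')
df.to_csv(path, index=False)   # gzip inferred from the .gz extension
df2 = pd.read_csv(path)        # decompression inferred on read, too
print(df2.equals(df))          # True
```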
Pandas • Jon Reades