File Formats

Jon Reades

From Files to Data

In order to read a file you need to know a few things:

  • What distinguishes one record from another?
  • What distinguishes one field from another?
  • What ensures that a field or record is valid?
  • Does the data set have row or column names? (a.k.a. headers & metadata)
  • Is the metadata in a separate file or embedded in the file?

Structure of a Tabular Data File

Row and column names make it a lot easier to find and refer to data (e.g. the ‘East of England row’ or the ‘Total column’) but they are not data and don’t belong in the data set itself.

Usually, one record (a.k.a. observation) finishes and the next one starts with a ‘newline’ (\n) or a ‘carriage return’ (\r) or both (\r\n), but it could be anything (e.g. EOR).

Usually, one field (a.k.a. attribute or value) finishes and the next one starts with a comma (,), which gives rise to CSV (Comma-Separated Values), but it could be a tab (\t) or anything else too (; or | or EOF).
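Python’s built-in csv module lets you say explicitly what the field separator is and how fields are quoted. A minimal sketch, where the file name and the semicolon delimiter are hypothetical:

import csv

# Read a semicolon-delimited file in which fields may themselves
# contain the delimiter, provided they are enclosed by "-quotes.
with open('data.txt', newline='') as f:
    reader = csv.reader(f, delimiter=';', quotechar='"')
    for record in reader:
        print(record)  # each record is a list of field values (as str)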

Don’t Reinvent the Wheel

Reading data is a very common challenge so… there is probably a package or class for that! You don’t need to tell Python how to read Excel files, SQL files, web pages… find a package that does it for you!

Most Common Formats

| Extension | Field Separator | Record Separator | Python Package |
|---|---|---|---|
| .csv | , but the separator can appear inside fields enclosed by " | \n but could be \r or \r\n | csv |
| .tsv or .tab | \t and unlikely to appear in fields | \n but could be \r or \r\n | csv (!) |
| .xls or .xlsx | Binary, you need a library to read | Binary, you need a library to read | xlrd/xlsxwriter |
| .sav or .sas | Binary, you need a library to read | Binary, you need a library to read | pyreadstat |
| .json, .geojson | Complex (,, [], {}), but plain text | Complex (,, [], {}), but plain text | json, geojson |
| .feather | Binary, you need a library to read | Binary, you need a library to read | pyarrow, geofeather |
| .parquet | Binary, you need a library to read | Binary, you need a library to read | pyarrow |
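In practice pandas wraps most of these readers behind one consistent interface, handing each format off to the specialist library. A sketch, with hypothetical file names:

import pandas as pd

df = pd.read_csv('data.csv')              # delimited plain text
df = pd.read_csv('data.tsv', sep='\t')    # TSV is just CSV with tabs
df = pd.read_excel('data.xlsx')           # needs openpyxl (or xlrd for .xls)
df = pd.read_json('data.json')            # plain text, but nested
df = pd.read_parquet('data.parquet')      # needs pyarrow installed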

‘Mapping’ Data Types

You will often see the term ‘mapping’ used in connection with data that is not spatial, so what does it mean? A map is the term used in some programming languages for a dict! So it’s about key : value pairs again.

Here’s a mapping:

| Input (e.g. Excel) | Output (e.g. Python) |
|---|---|
| NULL, N/A, “” | None or np.nan |
| 0..n | int |
| 0.00…n | float |
| True/False, Y/N, 1/0 | bool |
| R, G, B (etc.) | int or str (technically a set, but hard to use with data sets) |
| ‘Jon Reades’, ‘Huanfa Chen’, etc. | str |
| ‘3-FEB-2020’, ‘10/25/20’, etc. | datetime module (date, datetime, or time) |
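You can enforce a mapping like this at read time in pandas. A sketch, assuming a hypothetical file and hypothetical column names:

import pandas as pd

df = pd.read_csv(
    'data.csv',
    na_values=['NULL', 'N/A'],          # map these to NaN ('' already is)
    dtype={'count': 'Int64',            # nullable integer
           'price': 'float64',
           'flag': 'boolean',           # nullable True/False
           'colour': 'category'},       # R/G/B-style codes
    parse_dates=['recorded'],           # map strings to datetime64
)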

Testing a Mapping

Working out an appropriate mapping (representation of the data) is hugely time-consuming.

It’s commonly held that 80% of data science is data cleaning.

The Unix utilities (grep, awk, tail, head) can be very useful for quickly exploring a file, developing a basic understanding of the data, and catching obvious errors.

You should never assume that the data matches the spec.
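A few lines of Python make that check concrete. A sketch; the file name, column names, and valid ranges here are all hypothetical and would come from your spec:

import pandas as pd

df = pd.read_csv('data.csv')
print(df.head())                # eyeball a few records
print(df.dtypes)                # did each column get the expected type?
print(df.isna().sum())          # how many missing values per column?
assert df['price'].between(0, 1e6).all(), "price outside spec range"
assert df['colour'].isin(['R', 'G', 'B']).all(), "unexpected colour code"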

Why This Isn’t Easy


Things That Can Go Wrong…

A selection of real issues I’ve seen in my life:

  1. Truncation: server ran out of disk space or memory, or a file transfer was interrupted.
  2. Translation: headers don’t line up with data.
  3. Swapping: column order differs from spec.
  4. Incompleteness: range of real values differs from spec.
  5. Corruption: field delimiters included in field values.
  6. Errors: data entry errors resulted in incorrect values or the spec is downright wrong.
  7. Irrelevance: fields that simply aren’t relevant to your analysis.

These will generally require you to engage with individual columns and rows, usually via sampling.
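For instance, a sampling pass like the one below can surface several of these issues at once (the file and column names are hypothetical):

import pandas as pd

df = pd.read_csv('data.csv')
print(df.sample(10, random_state=42))  # inspect ten random records
print(df.columns.tolist())             # does the column order match the spec?
print(df['status'].value_counts())     # any values the spec doesn't mention?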

Why We Need an Arrow, a Duck, and a Parquet

Arrow and Parquet

  • Arrow is an in-memory columnar format for data. Data is stored in a structured way in RAM, making it blazingly fast for operations.
  • Parquet is a highly-compressed columnar file format for data. Data is stored in a structured way on your hard drive.
  • Feather is a raw storage format for Arrow.

TL;DR: for most applications Parquet will give you nice, small files on disk and the benefits of columnar storage; for computationally intensive applications where disk space and interoperability with other systems aren’t an issue, Feather might work better.
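Both formats are one method call away in pandas. A sketch with hypothetical file names (both calls need pyarrow installed):

import pandas as pd

df = pd.read_csv('data.csv')
df.to_parquet('data.parquet')   # compressed, columnar, small on disk
df.to_feather('data.feather')   # raw Arrow on disk, very fast I/O

# The columnar payoff: read back only the columns you need
subset = pd.read_parquet('data.parquet', columns=['uid', 'price'])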

What About the Duck?

  • Serverless SQL queries against Parquet files
  • Queries returned as Pandas data frames
  • Select and filter before loading
  • Fast conversion between CSV and Parquet via Arrow (see the sketch below)
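That last point can be done in a single SQL statement. A sketch, with hypothetical file names:

import duckdb

# Convert CSV to Parquet without ever materialising a DataFrame
duckdb.execute("""
  COPY (SELECT * FROM read_csv_auto('data.csv'))
  TO 'data.parquet' (FORMAT PARQUET)
""")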

For (Later) Reference

import pandas
import duckdb

# Notice the engine and dtype_backend options
df = pandas.read_csv(fname, engine='pyarrow',
                     dtype_backend='pyarrow')

# And for parquet files
df = pandas.read_parquet(fname, columns=[...])

# And with DuckDB we can actually join two
# files before they even get to Python!
q = '''
  SELECT * FROM
    read_parquet('epc-ppd-2022-*.parquet') as ppd,
    read_parquet('epc-ldd-2022-*.parquet') as ldd
  WHERE ppd.uid=ldd.uid
'''
df = duckdb.query(q).df()

P.S. There’s also a command-line tool for DuckDB so you don’t even need Python.
