Data Structures

Jon Reades

It’s a very deep rabbit hole…

cities = {
  'London': [[51.5072, -0.1275], +0],
  'New York': [[40.7127, -74.0059], -5],
  'Tokyo': [[35.6833, 139.6833], +9]
}

So:

print(cities['London'][0]) # Prints [51.5072, -0.1275]
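To reach a single value you chain the lookups: the key first, then the list positions. A minimal sketch of that idea, reusing the structure above and assuming that longitudes west of Greenwich are written as negative numbers and that offsets are whole hours from UTC:

```python
# Same nested layout as above: name -> [[lat, lon], UTC offset]
# (signed longitudes and whole-hour offsets are assumptions of this sketch)
cities = {
    'London': [[51.5072, -0.1275], 0],
    'New York': [[40.7127, -74.0059], -5],
    'Tokyo': [[35.6833, 139.6833], 9],
}

# Chained indexing: key, then position in the outer list, then in the inner list
london_lat = cities['London'][0][0]

# Unpacking does the same job with names instead of magic numbers
loc, tz = cities['London']
lat, lon = loc

print(london_lat)
print(lat, lon, tz)
```

Notice that the reader has to remember that position 0 is the location and position 1 is the time zone; nothing in the structure says so.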

But Compare…

Consider how these two data structures differ:

cities = [
  {'name': 'London', 'loc': [51.5072, -0.1275], 'tz': +0},
  {'name': 'New York', 'loc': [40.7127, -74.0059], 'tz': -5},
  {'name': 'Tokyo', 'loc': [35.6833, 139.6833], 'tz': +9}
]

Or:

cities = {
  'London': {'loc': [51.5072, -0.1275], 'tz': +0},
  'New York': {'loc': [40.7127, -74.0059], 'tz': -5},
  'Tokyo': {'loc': [35.6833, 139.6833], 'tz': +9}
}
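One way to feel the difference is to ask both structures the same question. The sketch below looks up Tokyo's time zone in each; the names cities_list and cities_dict are hypothetical, chosen only so that both versions can live side by side, and the values assume signed longitudes and UTC offsets:

```python
# List of dictionaries: each record carries its own name
cities_list = [
    {'name': 'London', 'loc': [51.5072, -0.1275], 'tz': 0},
    {'name': 'New York', 'loc': [40.7127, -74.0059], 'tz': -5},
    {'name': 'Tokyo', 'loc': [35.6833, 139.6833], 'tz': 9},
]

# Dictionary of dictionaries: the name becomes the key
cities_dict = {c['name']: {'loc': c['loc'], 'tz': c['tz']} for c in cities_list}

# With the list we have to search for the matching record...
tz_from_list = next(c['tz'] for c in cities_list if c['name'] == 'Tokyo')

# ...with the dictionary we go straight to it by key
tz_from_dict = cities_dict['Tokyo']['tz']

print(tz_from_list, tz_from_dict)
```

The list keeps the cities in a fixed order and allows duplicate names; the dictionary gives direct access by name but requires each name to be unique.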

Implications

So we can mix and match dictionaries and lists in whatever way we need to store… ‘data’. The question is then: what’s the right way to store our data?

One more thing…

But Compare…

How do these data structures differ?

Option 1

ds1 = [
  ['lat','lon','name','tz'],
  [51.51,-0.13,'London',+0],
  [40.71,-74.01,'New York',-5],
  [35.69,139.68,'Tokyo',+9]
]

Option 2

ds2 = {
  'lat': [51.51,40.71,35.69],
  'lon': [-0.13,-74.01,139.68],
  'tz':  [+0,-5,+9],
  'name':['London','New York','Tokyo']
}
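The two options hold the same information: Option 1 is organised by row, Option 2 by column. A short sketch of how one could be turned into the other with zip, assuming the first row of ds1 is always the header and using signed longitudes and UTC offsets:

```python
# Row-oriented: a header row followed by one list per city
ds1 = [
    ['lat', 'lon', 'name', 'tz'],
    [51.51, -0.13, 'London', 0],
    [40.71, -74.01, 'New York', -5],
    [35.69, 139.68, 'Tokyo', 9],
]

header, *rows = ds1      # split the column names from the data rows
columns = zip(*rows)     # transpose: rows become columns

# Column-oriented: one list per column, keyed by column name
ds2 = {name: list(values) for name, values in zip(header, columns)}

print(ds2['name'])
print(ds2['tz'])
```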

Thinking it Through

Why does this work for both computers and people?

ds2 = {
  'lat': [51.51,40.71,35.69],
  'lon': [-0.13,-74.01,139.68],
  'tz':  [+0,-5,+9],
  'name':['London','New York','Tokyo']
}

Examples

ds2 = {
  'lat': [51.51,40.71,35.69],
  'lon': [-0.13,-74.01,139.68],
  'tz':  [+0,-5,+9],
  'name':['London','New York','Tokyo']
}

print(ds2['name'][0]) # London
print(ds2['lat'][0])  # 51.51
print(ds2['tz'][0])   # 0

So index 0 always returns information about London, and index 2 always returns information about Tokyo. And once you know that London is at index 0, it’s just as easy to ask for its latitude (ds2['lat'][0]) or time zone (ds2['tz'][0])!
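Because the same index means the same city in every column, a whole 'row' can be reassembled on demand. A minimal sketch, where get_row is a hypothetical helper (not part of Python or of any library) and the values assume signed longitudes and UTC offsets:

```python
ds2 = {
    'lat': [51.51, 40.71, 35.69],
    'lon': [-0.13, -74.01, 139.68],
    'tz': [0, -5, 9],
    'name': ['London', 'New York', 'Tokyo'],
}

def get_row(data, idx):
    """Collect the idx-th value from every column into one record."""
    return {col: values[idx] for col, values in data.items()}

print(get_row(ds2, 0))  # everything we know about London
print(get_row(ds2, 2))  # everything we know about Tokyo
```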

How is that easier???

Remember that we can use any immutable (strictly speaking, hashable) ‘thing’ as a key. This means…

ds2 = {
  'lat': [51.51,40.71,35.69],
  'lon': [-0.13,-74.01,139.68],
  'tz':  [+0,-5,+9],
  'name':['London','New York','Tokyo']
}

city_nm = 'Tokyo'
city_idx = ds2['name'].index(city_nm)

print(f"The time zone of {city_nm} is {ds2['tz'][city_idx]}")

We can re-write this into a single line as:

city_nm = 'New York'
print(f"The time zone of {city_nm} is {ds2['tz'][ ds2['name'].index(city_nm)]}")

This achieves several useful things:

It is fast: there is no explicit loop in our code. The .index() call still scans the list until it finds a match, but that scan is built into Python and is quicker than writing our own loop over a list-of-lists.
Each list is meant to hold values of a single type, so we can easily add checks to make sure that the data is valid.
We can also easily calculate an average/max/min/median and so on (as we’ll see later) without even having to look at any other columns!
We can add more columns without slowing the lookup down, because each column is reached directly by its key. Adding more rows gives .index() a longer list to scan, but in practice that doesn’t make it much slower either!
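The validity check and the summary statistics mentioned above can be sketched with nothing more than built-ins and the standard-library statistics module. The range check on latitude is an illustrative assumption about what 'valid' might mean here:

```python
from statistics import mean, median

ds2 = {
    'lat': [51.51, 40.71, 35.69],
    'lon': [-0.13, -74.01, 139.68],
    'tz': [0, -5, 9],
    'name': ['London', 'New York', 'Tokyo'],
}

lats = ds2['lat']

# Validity check: every latitude is a number between -90 and 90
lat_ok = all(isinstance(v, (int, float)) and -90 <= v <= 90 for v in lats)
print(lat_ok)

# Summary statistics need only the one column
print(min(lats), max(lats), median(lats))
print(round(mean(lats), 2))
```

None of this touches the 'lon', 'tz' or 'name' columns, which is the point of storing the data by column.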

Also, notice that we didn’t try to write the one-line version in one go. First, we worked it out as a set of steps: how do we figure out what ‘row’ (position in the list) Tokyo is in? Now that we’ve got that, how do we retrieve the time zone value for Tokyo? Once we know that code works, we do variable substitution, as we would if we were doing maths: we replace city_idx in the time zone lookup with ds2['name'].index(city_nm).
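Once the two steps work, a natural next move is to wrap them in a function. The sketch below is one hedged way to do that: time_zone_of is a hypothetical name, and returning None for an unknown city is a design choice of this example rather than anything the slides prescribe. It relies on the fact that list.index() raises ValueError when the value is not in the list:

```python
ds2 = {
    'lat': [51.51, 40.71, 35.69],
    'lon': [-0.13, -74.01, 139.68],
    'tz': [0, -5, 9],
    'name': ['London', 'New York', 'Tokyo'],
}

def time_zone_of(data, city_nm):
    """Return the UTC offset for city_nm, or None if the city is not listed."""
    try:
        city_idx = data['name'].index(city_nm)  # step 1: find the 'row'
    except ValueError:                          # raised when the name is absent
        return None
    return data['tz'][city_idx]                 # step 2: read the 'tz' column

for city_nm in ['Tokyo', 'New York', 'Paris']:
    print(f"The time zone of {city_nm} is {time_zone_of(ds2, city_nm)}")
```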

This is critical!

Once you get your head around this, then 🤯🤯🤯 because pandas and everything we do next will make a lot more sense.

Resources

8 Data Structures Every Data Scientist Should Know (by a CASA alum)