Linking Spatial Data

Jon Reades

Non-Spatial Linkages can be Spatial

Data Set 1

SensorID Latitude Longitude
1 51.5070 -0.1347
2 51.5071 -0.0042
3 51.5074 -0.1223
4 51.5073 -0.1122
5 51.5072 0.1589
df = pd.DataFrame({
  'SensorID': [1,2,3,4,5],
  'Latitude': [51.5070, 51.5071, 51.5074, 51.5073, 51.5073],
  'Longitude': [-0.1347, -0.0042, -0.1223, -0.1122, 0.1589]
})

Data Set 2

SensorID Parameter Value
1 Temperature 5ºC
1 Humidity 15%
3 Temperature 7ºC
4 Temperature 7ºC
6 Humidity 18%
df1 = pd.DataFrame({
  'SensorID': [1,2,3,4,5],
  'Parameter': ['Temperature','Humidity','Temperature','Temperature','Humidity'], 
  'Value': ['5ºC', '15%', '7ºC', '7ºC', '18%']
})

Sjoin vs. Join

Sjoin adds an operator (['intersects','contains','within']) and example code can be found on GitHub.

gdf = gpd.GeoDataFrame(df, 
            geometry=gpd.points_from_xy(df.longitude, df.latitude,
            crs='epsg:4326')).to_crs('epsg:27700')
hackney = boros[boros.NAME=='Hackney']
rs = gpd.sjoin(gdf, hackney, op='within')

Combining Operators & How

Changing how to left, right, or inner changes the join’s behaviour:

rs = gpd.sjoin(gdf, hackney, how='left', op='within')
rs.NAME.fillna('None', inplace=True)
ax = boros[boros.NAME=='Hackney'].plot(edgecolor='k', facecolor='none', figsize=(8,8))
rs.plot(ax=ax, column='NAME', legend=True)

Merging Data

These merge operators apply where a is the left set of features (in a GeoSeries or GeoDataFrame) and b is the right set:

  • Contains: Returns True if no points of b lie outside of a and at least one point of b lies inside a.
  • Within: Returns True if a’s boundary and interior intersect only with the interior of b (not its boundary or exterior).
  • Intersects: Returns True if the boundary or interior of a intersects in any way with those of b.

All binary predicates are supported by features of GeoPandas, though only these three options are available in sjoin directly.

Additional Spatial Operations

These operators apply to the GeoSeries where a is a GeoSeries and b is one or more spatial features:

  • Contains / Covers: Returns a Series of dtype('bool') with value True for each geometry in a that contains b. These are different.
  • Crosses: An object is said to cross other if its interior intersects the interior of the other but does not contain it, and the dimension of the intersection is less than the dimension of the one or the other.
  • Touches: Returns a Series indicating which elements of a touch a point on b.
  • Distance: Returns a Series containing the distance from all a to some b.
  • Disjoint: Returns a Series indicating which elements of a do not intersect with any b.
  • Geom Equals / Geom Almost Equals: strict and loose tests of equality between a and b in terms of their geometry.
  • Buffer, Simplify, Centroid, Representative Point: common transformations.
  • Rotate, Scale, Affine Transform, Skew, Translate: less common transformations.
  • Unary Union: aggregation of all the geometries in a

RTM

In particular, “contains” (and its converse “within”) has an aspect of its definition which may produce unexpected behaviour. This quirk can be expressed as “Polygons do not contain their boundary”. More precisely, the definition of contains is: Geometry A contains Geometry B iff no points of B lie in the exterior of A, and at least one point of the interior of B lies in the interior of A That last clause causes the trap – because of it, a LineString which is completely contained in the boundary of a Polygon is not considered to be contained in that Polygon! This behaviour could easily trip up someone who is simply trying to find all LineStrings which have no points outside a given Polygon. In fact, this is probably the most common usage of contains. For this reason it’s useful to define another predicate called covers, which has the intuitively expected semantics: Geometry A covers Geometry B iff no points of B lie in the exterior of A

Set Operations with Overlay

It is also possible to apply GIS-type ‘overlay’ operations:

These operations return indexes for gdf1 and gdf2 (either could be a NaN) together with a geometry and (usually?) columns from both data frames:

rs_union = geopandas.overlay(gdf1, gdf2, how='union')

The set of operations includes: union, intersection, difference, symmetric_difference, and identity.

Making a Plan…

Think it Through

As your data grows in volume, the consequences of choosing the ‘wrong’ approach become more severe. Making a plan of attack becomes essential and it boils down to the following:

  1. Spatial joins are hard
  2. Non-spatial joins are easy
  3. Key-/Index-based joins are easiest
  4. Addition conditions to joins makes them harder.

So, when you have multiple joins…

  1. Do the easy ones first (they will run quickly on large data sets).
  2. Do the hard ones last (they will benefit most from the filtering process).

What’s the ‘Right’ Order?

Q. Find me a nearby family-style Italian restaurant…

A. Here’s how I’d do this:

  • City = New York (probably a key)
    • Cuisine = Italian (probably a key)
      • Style = Family (probably an enumeration/free text)
        • Location = Within Distance of X from Request (probably a buffered spatial query)

Mis-matched Scales

Keep in mind:

With complex geometries and mis-matched scales, converting the smaller geometry to centroids or representative points can speed things up a lot (within, contains, etc. become much simpler).

And that:

With large data sets, rasterising the smaller and more ‘abundant’ geometry can speed things up a lot.

> Long term: if you continue to work with large spatial data sets you’ll need to look into web services and spatial databases.

Web Services

Acronym Means Does
WMS Web Map Service Transfers map images within an area specified by bounding box; vector formats possible, but rarely used.
WFS Web Feature Service Allows interaction with features; so not about rendering maps directly, more about manipulation.
WCS Web Coverage Service Transfers data about objects covering a geographical area.
OWS Open Web Services Seems to be used by QGIS to serve data from a PC or server.

Spatial Databases

There are many flavours:

  • Oracle: good enterprise support; reasonably feature-rich, but £££ for commercial use.
  • MySQL: free, unless you want dedicated support; was feature-poor (though this looks to have changed with MySQL8); heavyweight.
  • PostgreSQL: free, unless you want dedicated support; standards-setting/compliant; heavyweight (requires PostGIS).
  • Microsoft Access: um, no.
  • SpatiaLite: free; standards-setting/compliant; lightweight
  • GeoParquet+DuckDB: not as full-featured as Postgres, but evolving quickly and much simpler to configure.

Generally:

  • Ad-hoc, modestly-sized, highly portable == SpatiaLite
  • Permanent, large, weakly portable == Postgres+PostGIS