Data Workshop

School Name	School Address	AM Bell Time	PM Bell Time
PS X	601 N Mesa	7:30 AM	2:30 PM

ID	Grade	Address	Stop
123456789	9	711 Hills	St Vrain & E Father Rahm

About me: Like Rohan said, I'm currently a master of information student at U of T. My background is in geography and history. I drifted into data stuff via GIS -- people would hire me to make maps.

This talk draws from work doing transit analysis for some schools in the US. The specific questions and the work involved varied from school to school, but questions generally fell into two bins. The first involved details of individual commutes: if students took public transit, how long would the trip take, how many times would they have to transfer, how far would they have to walk. The second category focused on general transit accessibility: questions of what parts of the city are within reasonable commutes via transit. What constituted 'reasonable' could vary.

To answer these questions, we used the Google Directions API, OpenTripPlanner, Python, and QGIS, but all of this is doable in R, and frankly I think some aspects might be easier in R.

Along the way, I'll touch upon a general Python workflow for getting data from APIs, with a detour into UNIX time and timezones. We'll build our own dataset from the results, which will require learning a little about spatial data representation and common file formats. And finally, we'll look at a simple spatial operation.

To quickly describe the inputs, we got data describing schools -- including school name, address, and bell times -- and datasets about students, with varying states of quality, so a lot of data cleaning and joining occurred even before getting to the API calling stage. Just as a note, none of the data shown here is actual student data, and most of the locations aren't even schools, either.

One common task I won't be going into is geocoding, aka converting addresses into coordinates. It isn't necessary for working with the Directions API, and you can actually extract coordinates from Directions API results. But if it is necessary to geocode separately, Google's geocoder is quite error-tolerant. That error tolerance means we can pass in an intersection, or just a ZIP code, or fudge student addresses by incrementing the house numbers by a random integer, and still get usable results. Some other options include Nominatim, which is built off of OpenStreetMap, and private vendors like Mapbox or HERE.

As I mentioned, we used Google Directions to generate individual trips. It's not necessarily the best option, but it was chosen for two related reasons. One, people tend to take Google results for the truth, and two, the results are easy to verify. If anyone thinks a result looks funny, they can go and confirm it. To make API calls, we used the requests library. It's a versatile Python library for HTTP requests, so we actually used it to interface with OpenTripPlanner as well. There is a Python Google Maps client, but it doesn't return the full API response, and we wanted some of the pieces it leaves out. If you've ever gotten directions from Google, you can guess the parameters the API needs. Besides the key, there are the origin and destination locations, which can be addresses. To answer our questions, we also had to specify the travel mode -- transit here -- and either the arrival time for AM trips or the departure time for PM trips.
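Here is a minimal sketch of what those parameters can look like with requests. The helper names are hypothetical, the addresses are the dummy values from the sample tables, and "YOUR_KEY" is a placeholder; the endpoint and parameter names are the ones in Google's Directions API documentation.

```python
DIRECTIONS_URL = "https://maps.googleapis.com/maps/api/directions/json"


def build_params(origin, destination, key, arrival_time=None, departure_time=None):
    """Assemble query parameters for one transit trip (hypothetical helper).

    Times are UNIX timestamps; pass an arrival time (AM trips) or a
    departure time (PM trips), not both.
    """
    if arrival_time is not None and departure_time is not None:
        raise ValueError("give an arrival time or a departure time, not both")
    params = {
        "origin": origin,
        "destination": destination,
        "mode": "transit",
        "key": key,
    }
    if arrival_time is not None:
        params["arrival_time"] = int(arrival_time)
    elif departure_time is not None:
        params["departure_time"] = int(departure_time)
    return params


def get_directions(params, timeout=30):
    """Make the live call and return the full JSON response as a dict."""
    import requests  # third-party; imported here so build_params works without it

    response = requests.get(DIRECTIONS_URL, params=params, timeout=timeout)
    response.raise_for_status()
    return response.json()


# Dummy AM trip: student address to school, arriving by the bell.
am_params = build_params("711 Hills", "601 N Mesa", "YOUR_KEY", arrival_time=1704292200)
```

Keeping parameter building separate from the request makes it easy to inspect what will be sent before spending any API credits.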

But there's a catch. The times need to be in UNIX time, which is the number of seconds since midnight on January 1, 1970, Coordinated Universal Time. The bell times we have are all in text. We don't have dates, and the specific date chosen doesn't matter too much. Since we're interested in commutes, and transit schedules differ on weekends and holidays, we need to use non-holiday weekdays. The dates can't be too far into the past or future, either. And, just for added fun, the schools are in different timezones.

We didn't want to hard code UNIX dates into the script, so from the department of over-engineering came a couple of helper functions to turn bell times into that time next Wednesday. Wednesday was chosen just because there weren't any holidays coming up on Wednesdays. Shout-out to the dateutil library for doing the heavy lifting here.
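A rough sketch of that kind of helper is below. It is not the original code: the talk used dateutil, while this version swaps in the standard library's datetime and zoneinfo, and the function names and the idea of storing a timezone name per school are assumptions. It also does not check for holidays.

```python
from datetime import date, datetime, timedelta
from zoneinfo import ZoneInfo

WEDNESDAY = 2  # date.weekday() counts Monday as 0


def next_wednesday(today=None):
    """Return the first Wednesday strictly after `today`."""
    today = today or date.today()
    days_ahead = (WEDNESDAY - today.weekday() - 1) % 7 + 1
    return today + timedelta(days=days_ahead)


def bell_time_to_unix(bell_time, tz_name, today=None):
    """Turn a bell time like '7:30 AM' into UNIX seconds for next Wednesday.

    `tz_name` is an IANA timezone name such as 'America/Denver'.
    """
    clock = datetime.strptime(bell_time, "%I:%M %p").time()
    local = datetime.combine(next_wednesday(today), clock, tzinfo=ZoneInfo(tz_name))
    return int(local.timestamp())


# Fixed reference date so the example is reproducible: 2024-01-01 was a Monday.
am_unix = bell_time_to_unix("7:30 AM", "America/Denver", today=date(2024, 1, 1))
print(am_unix)  # 1704292200, i.e. 2024-01-03 14:30 UTC
```

Attaching the school's own timezone before converting is what keeps a 7:30 AM bell in one city from turning into a different instant than a 7:30 AM bell in another.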

Finally, we're set to make API calls. I'm not going to dwell too long here -- it's a fairly standard workflow of iterating through a pandas DataFrame containing the data to make calls, formatting the parameters, making the request, parsing the results into json objects and writing them to file.
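For reference, the loop can be sketched roughly like this. The column names and the `collect_trips` helper are made up for illustration, the rows are plain dicts (for example from `DataFrame.to_dict("records")`), and writing one JSON object per line is just one reasonable file layout, not necessarily what we used. The function that actually hits the API is passed in, so the loop can be tried with a stub first.

```python
import json
import time


def collect_trips(rows, fetch, out_path, pause=0.0):
    """Call fetch(params) once per row and write each raw response to file.

    rows  -- iterable of dicts with hypothetical keys: id, address,
             school_address, arrival_unix
    fetch -- callable that takes a params dict and returns the parsed JSON
    """
    statuses = []
    with open(out_path, "w", encoding="utf-8") as out:
        for row in rows:
            params = {
                "origin": row["address"],
                "destination": row["school_address"],
                "mode": "transit",
                "arrival_time": row["arrival_unix"],
            }
            result = fetch(params)
            # default=str guards against non-JSON types such as numpy integers
            record = {"id": row["id"], "response": result}
            out.write(json.dumps(record, default=str) + "\n")
            statuses.append((row["id"], result.get("status")))
            time.sleep(pause)  # optional politeness delay between calls
    return statuses
```

Returning the status per row gives a quick way to see which trips need a second look before any parsing happens.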

Before looking at the results, I want to mention two caveats. First, I got better results passing in addresses than coordinates, even when those coordinates came from Google. We geocoded the schools for something else and figured we may as well use the coordinates because they're more precise, but the results ended up being worse. On the right is an extreme example replicated in the web interface. Both results are for the same journey; the only difference is that up top two addresses were supplied, while at the bottom I entered the coordinates I got from Google's geocoder for the same address.

The second caveat is to check all the trip attributes, because the API really wants to return a result, even if it's not appropriate for your purpose. So here, we wanted to model a morning commute, arriving at the destination by 7:30 AM. Google did return a result... you just have to overnight it at your destination.
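One way to catch trips like this automatically is a small sanity check on each leg. This is only a sketch: the function name and the thresholds are arbitrary assumptions, and it relies on the `arrival_time` and `departure_time` objects that transit legs carry in the Directions response.

```python
def flag_am_trip(leg, bell_unix, max_early_s=45 * 60, max_duration_s=2 * 3600):
    """Return reasons an AM transit leg looks wrong; an empty list means plausible.

    Thresholds are illustrative defaults, not values from the project.
    """
    if "arrival_time" not in leg or "departure_time" not in leg:
        return ["leg has no transit times"]
    arrival = leg["arrival_time"]["value"]      # UNIX seconds
    departure = leg["departure_time"]["value"]  # UNIX seconds
    problems = []
    if arrival > bell_unix:
        problems.append("arrives after the bell")
    if bell_unix - arrival > max_early_s:
        problems.append("arrives too long before the bell")
    if arrival - departure > max_duration_s:
        problems.append("trip is implausibly long")
    return problems


bell = 1704292200  # 7:30 AM local time in the dummy example
# A trip that gets in at 11:00 PM the night before is flagged.
overnight = {"arrival_time": {"value": bell - 30600}, "departure_time": {"value": bell - 33300}}
print(flag_am_trip(overnight, bell))
```

Anything flagged can then be checked by hand in the web interface, which is the whole reason for using Google in the first place.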

So, when results come back successfully, they look something like this. You have the status at the bottom -- here it's OK, but if one of the points couldn't be found or there aren't any results, the status will reflect that. Up top are the geocoded waypoints. If you give addresses, this section will just have Google's proprietary place IDs and some place categories. If you supply coordinates, they'll be reverse geocoded here. The key part is the routes section. By default, there will only be one route in the result. Within the route, the key pieces are an overview polyline representing the shape of the journey, and the legs attribute, which contains overall trip distance, duration, start and end locations, and steps, which break the trip down further. Our trips only have a start and endpoint, with no intermediate stops, so there is only one leg. In the leg, the steps attribute contains detailed information like what routes are taken, what turns to make, and which parts of the journey are walking, transit, etc.
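To make that structure concrete, here is a sketch of walking a parsed response and pulling out the pieces that answer the commute questions. The function name and output keys are made up, and counting transfers as transit steps minus one is my own simplification.

```python
def summarize_response(response):
    """Pull trip-level attributes out of a parsed Directions response (a dict)."""
    status = response.get("status")
    if status != "OK":
        return {"status": status}  # e.g. NOT_FOUND or ZERO_RESULTS

    route = response["routes"][0]  # one route by default
    leg = route["legs"][0]         # one leg when there are no intermediate stops
    steps = leg["steps"]

    transit_steps = [s for s in steps if s["travel_mode"] == "TRANSIT"]
    walk_m = sum(s["distance"]["value"] for s in steps if s["travel_mode"] == "WALKING")

    return {
        "status": status,
        "duration_min": round(leg["duration"]["value"] / 60, 1),  # value is in seconds
        "distance_m": leg["distance"]["value"],                   # value is in meters
        "transfers": max(len(transit_steps) - 1, 0),
        "walk_m": walk_m,
        "polyline": route["overview_polyline"]["points"],
    }
```

The returned dict is flat on purpose, so it can go straight into a table row or the properties of a map feature.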

All this is great, but we can't map this as-is. In order to get it to a mappable state, we need to talk about spatial data representation.

The most common spatial data model is the simple features model. Simple features is an ISO/OGC standard for vector data -- geographic data that can be represented as discrete objects. Spatially, data can be modeled as points -- for example, cities -- lines -- e.g. streets or rivers -- or polygons, like park space or countries. The multi- variants are for entities like, say, Japan, that are best represented with more than one shape. Finally, a geometry collection is a catch-all for heterogeneous shapes. No matter what though, geometries are represented with coordinate pairs, so that polyline string isn't going to cut it. Most geographic information systems and mapping software use this model, or a variant. Some big exceptions are raster data, so things like temperatures or elevations that are better modeled as continuous fields, and OpenStreetMap, which uses its own data model that better serves its purposes. Moving between OSM and simple features is a common enough task that there are several tools to help with the conversion. It's just good to know that that is a step.

There are file formats that are simple features compliant. We decided to convert the API results into GeoJSON. GeoJSON, like JSON, is relatively easy to create programmatically. Up here is a basic skeleton -- we specify that the file is a feature collection, rather than a single feature. Then features contains a list of the features. Each feature consists of a geometry attribute, where we'll need to put in the coordinate pairs, and properties, where we'll put trip attributes like duration.
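In Python, that skeleton is just nested dicts and lists. The sketch below builds a collection with a single made-up LineString; the coordinates and the `duration_min` property are illustrative. Note that GeoJSON positions are ordered longitude first, then latitude.

```python
import json


def make_feature(coordinates, properties):
    """Wrap a list of [lon, lat] pairs and a dict of attributes as a GeoJSON Feature."""
    return {
        "type": "Feature",
        "geometry": {"type": "LineString", "coordinates": coordinates},
        "properties": properties,
    }


collection = {"type": "FeatureCollection", "features": []}
collection["features"].append(
    make_feature([[-106.49, 31.76], [-106.48, 31.77]], {"duration_min": 25.0})
)

print(json.dumps(collection, indent=2))
```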

First, we need to get the polylines into coordinate pairs. Google documented the encoding algorithm, but instead of reinventing the wheel, we used the polyline library to write a decoding function.
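For anyone curious what the decoding involves, here is a hand-rolled sketch of Google's documented algorithm. To be clear, this is not what we ran (we used the polyline library); it is a stand-in with no dependencies, and it returns [lon, lat] pairs so the output can drop straight into GeoJSON.

```python
def decode_polyline(encoded, precision=5):
    """Decode an encoded polyline string into a list of [lon, lat] pairs."""
    factor = 10 ** precision
    coordinates = []
    index = lat = lon = 0
    while index < len(encoded):
        deltas = []
        for _ in range(2):  # each point stores a latitude delta, then a longitude delta
            shift = result = 0
            while True:
                chunk = ord(encoded[index]) - 63
                index += 1
                result |= (chunk & 0x1F) << shift  # low 5 bits carry data
                shift += 5
                if chunk < 0x20:                   # no continuation bit: value is complete
                    break
            # the lowest bit is the sign flag
            deltas.append(~(result >> 1) if result & 1 else result >> 1)
        lat += deltas[0]
        lon += deltas[1]
        coordinates.append([lon / factor, lat / factor])
    return coordinates


# The example string from Google's encoding documentation.
print(decode_polyline("_p~iF~ps|U_ulLnnqC_mqNvxq`@"))
# [[-120.2, 38.5], [-120.95, 40.7], [-126.453, 43.252]]
```

The encoded string stores latitude before longitude, so the swap to longitude-first happens at the very end, when each pair is appended.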

Then, another function to extract the attributes,

and finally pop geometries and properties into a feature and add it to the features list. From there we wrote them to file.
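Those last two steps can be sketched as a single function. The name `write_feature_collection` and the shape of its input (a list of coordinate-list and property-dict pairs) are assumptions for illustration, not the original code.

```python
import json


def write_feature_collection(trips, path):
    """Write trips to a GeoJSON file and return the collection.

    trips -- iterable of (coordinates, properties) pairs, where coordinates
             is a list of [lon, lat] pairs and properties is a dict
    """
    features = []
    for coordinates, properties in trips:
        features.append({
            "type": "Feature",
            "geometry": {"type": "LineString", "coordinates": coordinates},
            "properties": properties,
        })
    collection = {"type": "FeatureCollection", "features": features}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(collection, f)
    return collection
```

A file written this way should open directly in QGIS or any web mapping library that reads GeoJSON.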

So, for a much simpler dummy dataset, the GeoJSON files look like this. From there, they can be put on a map and styled by whatever attribute is of interest. Here, trips that are more than half an hour long are orange, while shorter ones are purple. The trip to the southwest corner of the map is the polyline in the example result earlier.

So that gets us mappable data for individual trips. But what if the question is more general? What if, instead of focusing on individual trips, we wanted to show what parts of the city are within reasonable commuting times? In other words, we need isochrones.

Google doesn't have an API for that. Really, for transit isochrones, low-cost API options are slim. So we went with OpenTripPlanner, which is open-source, Java-based multimodal routing software. It works with OpenStreetMap data and GTFS, or General Transit Feed Specification, data. It's meant to run server-side, so there are API endpoints for routing and isochrone creation, which means we can interact with it using requests. For the R users in the audience, there are excellent resources.
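As a rough idea of what an isochrone request looks like, here is a sketch that only builds the query. It assumes an OpenTripPlanner 1.x server running locally on port 8080 with a router called "default", and the parameter names are the ones I recall from the OTP 1.x isochrone endpoint, so check them against the version you run. The helper name, coordinates, date, and walking cap are illustrative.

```python
from urllib.parse import urlencode

# Assumed local OTP 1.x instance and router id.
OTP_ISOCHRONE_URL = "http://localhost:8080/otp/routers/default/isochrone"


def isochrone_query(lat, lon, date, time, cutoffs_min=range(30, 91, 15), max_walk_m=None):
    """Build the query pairs for one isochrone request.

    cutoffSec is repeated once per cutoff, so the pairs are a list of
    tuples rather than a dict.
    """
    pairs = [
        ("fromPlace", f"{lat},{lon}"),
        ("mode", "WALK,TRANSIT"),
        ("date", date),
        ("time", time),
    ]
    if max_walk_m is not None:
        pairs.append(("maxWalkDistance", max_walk_m))
    pairs.extend(("cutoffSec", minutes * 60) for minutes in cutoffs_min)
    return pairs


pairs = isochrone_query(31.76, -106.49, "01-03-2024", "7:30am", max_walk_m=800)
print(OTP_ISOCHRONE_URL + "?" + urlencode(pairs))
```

The same list of pairs can be passed to `requests.get(OTP_ISOCHRONE_URL, params=pairs)`, since requests accepts a list of tuples for repeated keys.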

Building the router and calling the API could be its own talk, so in the interest of time I'll cut to what the isochrones look like. These were morning and afternoon isochrones for 15-minute intervals from 30 to 90 minutes. There were some additional restrictions, like capping the amount of walking, but you can see that this city's transit system really wasn't designed for early school starts.

There is a hiccup here. When I handed the files off to a colleague to work with in QGIS, this happened. The error is a bit small, but it says that there is an invalid geometry. It turns out that the isochrones look good, but they don't always comply with the simple features specification. They sometimes do things polygons aren't supposed to do, like intersect themselves. When shapes don't follow the simple features spec, you can still map them, but you can get errors like this when you try to perform spatial operations.

The problem is fixable within QGIS, but you don't want to do this for hundreds of files. Fortunately, there's a hacky-but-simple fix in Python. We'll need the geopandas library to do this. Geopandas essentially extends pandas and its data frame model to provide support for simple features -- points, lines, and polygons. It's useful for reading and writing spatial data files, managing coordinate reference systems, and performing spatial operations. Just a quick aside about spatial reference systems -- they're important, but because all the data here is meant to work on web maps, they have the same reference system.

If you load spatial data to a geopandas dataframe you get something like this. You'll have whatever attributes from the data, with the shape represented as a geometry attribute like we've got all the way to the right here.

One operation that does work is buffering. What buffers normally do is create an area x distance away from a shape as-the-crow-flies. So, for example, everything in a 1-km radius of a point, or the area within 100 miles of a border. Here, we'll just buffer by 0 to fix the geometry errors.
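Here is a small sketch of the zero-distance buffer on made-up shapes: a self-intersecting "bowtie" stands in for a bad isochrone, next to an ordinary square. With real files, the GeoDataFrame would come from `geopandas.read_file` instead. The `minutes` column is hypothetical.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# A bowtie crosses itself at (1, 1), which the simple features spec forbids.
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])
square = Polygon([(3, 0), (4, 0), (4, 1), (3, 1)])

isochrones = gpd.GeoDataFrame(
    {"minutes": [30, 45]}, geometry=[bowtie, square], crs="EPSG:4326"
)
print(isochrones.is_valid.tolist())  # [False, True]

# Buffering by zero rebuilds each polygon without growing or shrinking it.
fixed = isochrones.copy()
fixed["geometry"] = isochrones.geometry.buffer(0)
print(fixed.is_valid.all())  # True
```

One thing to watch for is that a zero buffer can drop part of a badly tangled shape, so comparing areas before and after is a sensible spot check. Geopandas may also warn about buffering in a geographic reference system; with a distance of zero the warning is harmless.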

It's kind of hard to show the lack of an error message, but now that the polygons are all sorted out it's possible to perform spatial operations on them. Because the OTP isochrones give you more control over constraints than Google Directions, one question that pops up is which students or stops fall within reasonable commuting range. This is something that can be answered using a spatial join. Just like a regular join merges datasets based on the relationships between attribute values, a spatial join merges datasets based on spatial relationships between records. So here are some randomly generated points over the data...

...and here's what the result of a spatial join looks like. Points are color-coded by the isochrone they fall in. So you don't have the particulars of individual journeys, but you do get an understanding of commutes and mobility for no API credits.
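A spatial join like that can be sketched in a few lines of geopandas. The shapes, column names, and stop IDs below are made up, and the isochrones are assumed to be nested (the 60-minute shape contains the 30-minute one), so a point can match more than one band and the smallest is kept. The `predicate` keyword is what recent geopandas releases use; older ones called it `op`.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

isochrones = gpd.GeoDataFrame(
    {"minutes": [30, 60]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(0, 0), (3, 0), (3, 3), (0, 3)]),
    ],
    crs="EPSG:4326",
)
stops = gpd.GeoDataFrame(
    {"stop_id": ["a", "b", "c"]},
    geometry=[Point(0.5, 0.5), Point(2, 2), Point(5, 5)],
    crs="EPSG:4326",
)

# A left join keeps every stop, including those outside all isochrones.
joined = gpd.sjoin(stops, isochrones, how="left", predicate="within")

# Nested bands mean duplicates, so keep the shortest commute band per stop.
best = joined.groupby("stop_id")["minutes"].min()
print(best)
```

Stops that fall outside every isochrone come back with a missing value for `minutes`, which makes them easy to pick out.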

Geographic Data in Python

Agenda

About Me

Project Description

Toolset

Skills/Concepts

Inputs

School data

Student data

A Note on Geocoding

Working with the Google Directions API

Do the Time Warp

Making the Call

Caveat 1

Equivalent addresses and coordinates may yield different results

Caveat 2

Results may meet the letter but not the spirit of the request

Google Directions Results

Google Directions Results Components

Great! But we can't map this.

Simple Features

GeoJSON

Decoding Polylines

Extracting Attributes

Putting It All Together

Results!

That works for individual trips. What if I want something more general?

OpenTripPlanner

One Little Problem...

Geopandas to the Rescue

Geopandas DataFrames

Shape Shifting

Fixed Isochrones + Points

Spatial Join Results

Spatial Joining in Geopandas

Thank you!

Useful links and references

Google Documentation

OpenTripPlanner

Useful links and references

Python Libraries

Spatial Data

Useful links and references

GIS Software

Misc