Monday, July 27, 2015

Week 7: data grunginess despite general progress

The DSSG 2015 Paratransit group is making progress! We're plugging away, making tons of Python scripts and IPython Notebooks for testing our various functions. The workflow is as follows:



The main issue for the next week or so will be to assemble all of the individual layers into a cohesive, as-bug-free-as-possible main.py script that takes a handful of command-line arguments for parameters such as broken bus run ID, break-down time, and AWS access keys for acquiring the real-time bus scheduling data.
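As a rough idea of what that command-line interface could look like, here is a minimal sketch using Python's standard argparse module. The flag names, the choice of seconds since midnight for the break-down time, and the split of the AWS credentials into two arguments are all assumptions for illustration, not the interface we have settled on.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical main.py; every flag name here is a placeholder."""
    parser = argparse.ArgumentParser(
        description="Reassign riders from a broken-down paratransit bus run."
    )
    parser.add_argument("--broken-run-id", required=True,
                        help="ID of the bus run that broke down")
    parser.add_argument("--breakdown-time", type=int, required=True,
                        help="break-down time, assumed here to be seconds since midnight")
    parser.add_argument("--aws-access-key", required=True,
                        help="AWS access key for the real-time schedule data")
    parser.add_argument("--aws-secret-key", required=True,
                        help="AWS secret key for the real-time schedule data")
    return parser


# Parse an explicit argument list so the sketch can be run without a shell.
example_args = build_parser().parse_args([
    "--broken-run-id", "1234",
    "--breakdown-time", "43200",
    "--aws-access-key", "EXAMPLE_KEY",
    "--aws-secret-key", "EXAMPLE_SECRET",
])
print(example_args.broken_run_id, example_args.breakdown_time)
```

In the real script the list passed to parse_args would be dropped so that the arguments come from the command line.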

What I've realized through all of this: I'm glad we have a contact at King County Metro (KCM) to explain the column headers in the rider data to us. The data is really all over the place, from missing latitude/longitude input to an "estimated time of arrival" column listing the time of day as "115000" seconds, i.e. hour ~32 of a day. So far, we've just been eliminating these rides from the roster, as they're signs of outdated data-capturing methods that KCM is in the process of phasing out. Currently, a big issue is working with the text files that we receive from KCM, which are updated every 15 minutes: the columns are separated by a variable number of spaces, and several columns that were previously combined into one are now space-delimited. We'll need to first organize these columns into a more useful .csv-type format and then QC the real-time data as we have been doing.
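The two steps described above (reshaping the space-delimited text into CSV, then dropping bad rides) could be sketched roughly as follows. The function names, the column order in the sample line, and the exact QC rule are assumptions for illustration; in particular, splitting on whitespace would break any field that itself contains spaces, such as a street address, so the real files may need a more careful parser.

```python
import csv
import io
import re

SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def split_on_whitespace(line):
    """Split one raw text line on runs of spaces or tabs."""
    return re.split(r"\s+", line.strip())


def to_csv(lines):
    """Convert whitespace-delimited lines to CSV text, skipping blank lines."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in lines:
        if line.strip():
            writer.writerow(split_on_whitespace(line))
    return buffer.getvalue()


def is_valid_ride(lat, lon, eta_seconds):
    """Hypothetical QC rule: both coordinates present and ETA within one day."""
    if lat is None or lon is None:
        return False
    return 0 <= eta_seconds < SECONDS_PER_DAY


# Made-up sample line: ride ID, latitude, longitude, ETA in seconds.
sample_csv = to_csv(["A1   47.6  -122.3     43200", ""])
print(sample_csv)
print(is_valid_ride(47.6, -122.3, 115000))  # 115000 s is past the end of the day
```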

To-do: Suppose we insert passenger p onto bus X between the two feasible points (from a time-windows perspective) on bus X's original schedule that require the minimum additional travel time. Bus X will subsequently miss n-t additional time windows. Here, n is the number of missed time windows after p has been serviced by bus X, and t is the number of time windows that the bus is already scheduled to miss according to the quasi-real-time bus schedule stream. We know how much additional time inserting p onto X will take according to the OSRM routing API, and therefore we can find the additional cost on a per-service-hour basis. That's great, but we still need to balance this pure cost against n-t and the margin by which the n-t time windows are missed. For example, suppose inserting passenger p onto X will result in missing only one time window by 1500 seconds (25 minutes), but that inserting passenger p onto Y will result in 4 missed time windows that are only missed by about 200 seconds each; which one is better? That's up to us!
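To make the trade-off concrete, here is one possible way to score the two options, purely as a sketch: a penalty that adds a fixed charge per newly missed window to a (possibly non-linear) function of how late each one is. The function name, the per-window weight, and the exponent are all hypothetical knobs, not a rule we have chosen; the point is only that the answer to "X or Y?" flips depending on how they are set.

```python
def lateness_penalty(lateness_seconds, per_window_weight=0.0, exponent=1.0):
    """Score the newly missed windows of one insertion; lower is better.

    lateness_seconds: how late the bus is for each of the n-t newly missed windows.
    per_window_weight: hypothetical fixed charge for missing a window at all.
    exponent: values above 1 punish one long delay more than several short ones.
    """
    return sum(per_window_weight + late ** exponent for late in lateness_seconds)


bus_x = [1500]                # one window missed by 25 minutes
bus_y = [200, 200, 200, 200]  # four windows missed by about 200 seconds each

# Total lateness only: Y looks better (1500 vs. 800).
print(lateness_penalty(bus_x), lateness_penalty(bus_y))

# Charging 300 "seconds" per missed window: X looks better (1800 vs. 2000).
print(lateness_penalty(bus_x, per_window_weight=300.0),
      lateness_penalty(bus_y, per_window_weight=300.0))

# Squaring the lateness: Y looks better again, since one long delay dominates.
print(lateness_penalty(bus_x, exponent=2.0), lateness_penalty(bus_y, exponent=2.0))
```

Whatever form the penalty takes, it would still need to be put on the same scale as the per-service-hour cost before the two can be traded off.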

