The main task for the next week or so will be to assemble all of the individual layers into a cohesive, as-bug-free-as-possible main.py script that takes a handful of command-line arguments for parameters such as the broken bus's run ID, the breakdown time, and the AWS access keys for acquiring the real-time bus scheduling data.
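As a rough idea of what that command-line interface might look like, here is a minimal argparse sketch. The flag names, and the choice to express the breakdown time as seconds since midnight, are assumptions made for illustration; the real main.py may well end up with different ones.

```python
import argparse


def build_parser():
    """Build a parser for the inputs main.py is expected to need.

    All flag names below are hypothetical placeholders.
    """
    parser = argparse.ArgumentParser(
        description="Reassign riders from a broken-down bus."
    )
    parser.add_argument("--broken-run-id", required=True,
                        help="run ID of the bus that broke down")
    # Assumption: breakdown time is given as seconds since midnight.
    parser.add_argument("--breakdown-time", type=int, required=True,
                        help="time of the breakdown, in seconds since midnight")
    parser.add_argument("--aws-access-key-id", required=True,
                        help="AWS access key ID for the schedule data")
    parser.add_argument("--aws-secret-access-key", required=True,
                        help="AWS secret access key for the schedule data")
    return parser


def parse_args(argv=None):
    """Parse argv (defaults to sys.argv[1:]) into a namespace."""
    return build_parser().parse_args(argv)
```

Calling `parse_args()` at the top of main.py would then give the rest of the layers a single namespace to read from, e.g. `args.broken_run_id` and `args.breakdown_time`.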
What I've realized through all of this: I'm glad we have a contact at King County Metro (KCM) to explain the column headers in the rider data to us. The data is really all over the place, from missing latitude/longitude values to an "estimated time of arrival" column listing the time of day as "115000" seconds, i.e. hour ~32 of a day. So far, we've just been dropping these rides from the roster, as they're signs of outdated data-capture methods that KCM is in the process of phasing out. Currently, a big issue we're having is working with the text files we receive from KCM, which are updated every 15 minutes: the columns are separated by a variable number of spaces, and several columns that were previously combined into one are now space-delimited as well. We'll need to first organize these columns into a more useful .csv-type format and then QC the real-time data as we have been doing.
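Here is a minimal sketch of that clean-up step, under a few assumptions: that splitting on runs of whitespace is a workable first pass, that any tokens belonging to a single field can be rejoined by position, and that a ride is dropped when its arrival time is not a valid second of the day or its coordinates are missing. The function names, the `merge_spans` argument, and the column positions are all hypothetical, since the actual KCM layout isn't shown here.

```python
import csv
import io

SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def split_row(line, merge_spans=()):
    """Split a line on runs of whitespace, then rejoin selected tokens.

    merge_spans holds (start, stop) token index pairs, stop exclusive,
    marking tokens that belong to one column (hypothetical layout).
    """
    tokens = line.split()
    spans = dict(merge_spans)  # start index -> stop index
    fields, i = [], 0
    while i < len(tokens):
        stop = spans.get(i, i + 1)
        fields.append(" ".join(tokens[i:stop]))
        i = stop
    return fields


def passes_qc(row, eta_col, lat_col, lon_col):
    """Keep a row only if its ETA is a valid second of the day
    and its latitude/longitude parse as numbers."""
    try:
        eta = int(row[eta_col])
        float(row[lat_col])
        float(row[lon_col])
    except (IndexError, ValueError):
        return False
    return 0 <= eta < SECONDS_PER_DAY


def to_csv(lines, merge_spans, eta_col, lat_col, lon_col):
    """Convert space-delimited lines to CSV text, dropping rows that fail QC."""
    out = io.StringIO()
    writer = csv.writer(out)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row = split_row(line, merge_spans)
        if passes_qc(row, eta_col, lat_col, lon_col):
            writer.writerow(row)
    return out.getvalue()
```

With this check, a ride whose arrival time is recorded as 115000 seconds is rejected because it exceeds the 86400 seconds in a day, which matches how we've been handling those rows by hand.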
To-do: Given the insertion of passenger p onto bus X requiring the minimum additional travel time between any two feasible points (from a time-windows perspective) on bus X's original schedule, bus X will subsequently miss n-t additional time windows. Here, n is the number of missed time windows after p has been serviced by bus X, and t is the number of time windows that the bus is already scheduled to miss according to the quasi-real-time bus schedule stream. We know how much additional time inserting p onto X will take according to the OSRM routing API, and therefore we can find the additional cost on a per-service-hour basis. That's great, but we still need to balance this pure cost against n-t, and against the scale by which the n-t time windows are missed. For example, suppose inserting passenger p onto X will result in missing only one time window by 1500 seconds (25 minutes), but that inserting passenger p onto Y will result in 4 missed time windows that are each missed by only about 200 seconds; which one is better? That's up to us!
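One possible way to make that choice explicit is a weighted sum of the added service cost, the number of newly missed windows (n-t), and the total number of seconds by which they are missed. The sketch below is only an illustration of that idea, not a decision we've made: the `Insertion` class, the `penalty` function, and both weights are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Insertion:
    bus: str
    added_cost: float            # extra service cost of the detour
    new_lateness_s: List[float]  # seconds late for each of the n - t newly missed windows


def penalty(ins, per_window, per_second):
    """Weighted sum: added cost, plus a charge per newly missed window,
    plus a charge per second of lateness. Both weights are assumptions."""
    return (ins.added_cost
            + per_window * len(ins.new_lateness_s)
            + per_second * sum(ins.new_lateness_s))


def best_insertion(candidates, per_window, per_second):
    """Return the candidate with the smallest penalty."""
    return min(candidates, key=lambda c: penalty(c, per_window, per_second))


# The example from the text, with the added cost set equal for both buses
# so that only the missed windows differ.
on_x = Insertion("X", added_cost=0.0, new_lateness_s=[1500.0])
on_y = Insertion("Y", added_cost=0.0, new_lateness_s=[200.0] * 4)
```

Weighting only total lateness (`per_window=0`) favors Y, since 800 seconds in total is less than 1500, while a large enough per-window charge favors X, which misses a single window. Picking those weights is exactly the judgment call we still have to make.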