Monday, July 27, 2015

Week 7: data grunginess despite general progress

DSSG 2015 Paratransit group is making progress! We're plugging away, making tons of Python scripts and IPython Notebooks for testing our various functions. The workflow is as follows (click for better resolution):



The main task for the next week or so will be assembling all of the individual layers into a cohesive, as-bug-free-as-possible main.py script that takes a handful of command-line arguments for parameters such as the broken bus run ID, the break-down time, and the AWS access keys for acquiring the real-time bus scheduling data.
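The command-line interface for that main.py might look something like the sketch below. The flag names here are my own illustrative guesses, not the script's actual arguments:

```python
import argparse

def build_parser():
    # Flag names are hypothetical; the real main.py's arguments may differ.
    parser = argparse.ArgumentParser(
        description="Re-route riders from a broken-down Access bus run.")
    parser.add_argument("--run-id", required=True,
                        help="ID of the broken bus run")
    parser.add_argument("--breakdown-time", required=True,
                        help="time of the break-down, e.g. HH:MM:SS")
    parser.add_argument("--aws-access-key", required=True,
                        help="AWS access key for the real-time schedule feed")
    parser.add_argument("--aws-secret-key", required=True,
                        help="AWS secret key for the real-time schedule feed")
    return parser

# Example invocation with dummy values.
args = build_parser().parse_args([
    "--run-id", "5012", "--breakdown-time", "08:45:00",
    "--aws-access-key", "EXAMPLE", "--aws-secret-key", "EXAMPLE",
])
```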

What I've realized through all of this: I'm glad we have a contact at King County Metro to explain the column headers in the rider data to us. The data is really all over the place, from missing latitude/longitude entries to an "estimated time of arrival" column listing the time of day as "115000" seconds, i.e. hour ~32 of the day. Previously, we've just been eliminating these rides from the roster, as they're artifacts of outdated data-capture methods that KCM is in the process of phasing out.

Currently, a big issue we're having is working with the 15-minute-interval text files we're receiving from KCM: the columns are separated by variable numbers of spaces rather than a consistent delimiter, and several columns that were previously combined into one are now space-delimited. We'll need to first organize these columns into a more useful .csv-type format and then QC the real-time data as we have been doing.
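That fixed-width-to-CSV conversion could be sketched roughly as below. The field names and character positions are entirely made up for illustration; the real layout would come from KCM's documentation:

```python
import csv
import io

# Hypothetical (name, start, end) character positions -- the real KCM
# feed's field widths and names differ.
FIELDS = [("run_id", 0, 8), ("stop_seq", 8, 12), ("eta", 12, 20),
          ("lat", 20, 31), ("lon", 31, 42)]

def fixed_width_to_csv(lines, out):
    """Slice each fixed-width line into fields and write CSV rows."""
    writer = csv.writer(out)
    writer.writerow([name for name, _, _ in FIELDS])
    for line in lines:
        writer.writerow([line[start:end].strip() for _, start, end in FIELDS])

# One toy line, padded out to the widths declared above.
line = ("50123".ljust(8) + "001".ljust(4) + "115000".ljust(8)
        + "47.606209".ljust(11) + "-122.332071")
buf = io.StringIO()
fixed_width_to_csv([line], buf)
```

Once the data is in CSV form, the usual QC checks (missing lat/lon, impossible ETAs like the 115000-second example above) can run on clean columns.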

To-do: Given the insertion of passenger p onto bus X requiring the minimum additional travel time between any two feasible points (from a time-windows perspective) on bus X's original schedule, bus X will subsequently miss n - t additional time windows. Here, n is the number of missed time windows after p has been serviced by bus X, and t is the number of time windows that the bus is already scheduled to miss according to the quasi-real-time bus schedule stream. We know how much additional time inserting p onto X will take according to the OSRM routing API, and therefore we can find the additional cost on a per-service-hour basis. That's great, but we still need to balance this pure cost against n - t, and against the margins by which those n - t time windows are missed. For example, suppose inserting passenger p onto X will result in missing only one time window, by 1500 seconds (25 minutes), while inserting passenger p onto Y will result in 4 missed time windows that are each missed by only about 200 seconds; which one is better? That's up to us!
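One way to frame that trade-off is a weighted penalty over (extra windows missed, total lateness, added cost). The weights below are arbitrary placeholders, since picking them is exactly the open question:

```python
def rank_insertions(candidates, alpha=1.0, beta=1 / 600):
    """Rank candidate insertions of passenger p, cheapest penalty first.

    Each candidate dict holds:
      extra_windows_missed -- n - t, windows missed only because of the insertion
      total_seconds_late   -- summed lateness across those windows
      added_cost           -- extra per-service-hour cost from OSRM travel time

    alpha and beta are made-up weights; tuning them is the real problem.
    """
    def penalty(c):
        return (alpha * c["extra_windows_missed"]
                + beta * c["total_seconds_late"]
                + c["added_cost"])
    return sorted(candidates, key=penalty)

# The example from the post: X misses one window by 1500 s,
# Y misses four windows by ~200 s each.
X = {"extra_windows_missed": 1, "total_seconds_late": 1500, "added_cost": 0.0}
Y = {"extra_windows_missed": 4, "total_seconds_late": 800, "added_cost": 0.0}
best = rank_insertions([X, Y])[0]
```

With these particular weights bus X wins; nudging alpha or beta flips the answer, which is the whole point of the "that's up to us" remark.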


Sunday, July 12, 2015

The algorithm: brute force, intern style

Coming into week 5, I'm pretty psyched.

Tomorrow we have a meeting with Dr. Randy LeVeque, UW Applied Math legend/Professor, and Wednesday he's leading the "Reproducibility Workshop" where we're spending a lot of time on file organization via Git - maybe I'll finally start climbing up this Git learning curve.

Last Friday, July 10th, we had a meeting with Matthew, our primary contact at King County Metro Access operations. Fortunately, our interpretation of the optimization problem was correct. The situation is still this: given an unscheduled request (or set of unscheduled requests), find the most appropriate run onto which we can squeeze the new passenger. Then compare the additional cost/damage to the current schedule against the cost of sending a DEB (dispatched extra boarding?) bus out, or of fulfilling the ride with a taxi (given the passenger can even ride in a taxi).

We've done a decent amount of literature review (shout-out to Dr. Carl Henrik Hall, whose 2013 paper "Improving Paratransit Scheduling Using Ruin and Recreate Methods" is the most relevant to our problem thus far). The best method for inserting unscheduled requests is simply called "best insertion," and it's described in Hall's aforementioned paper:


                                             Lol. Deceptive use of the word "simply," eh?


It's just a brute-force algorithm, where you test Y/N: "can this new passenger fit on bus K between nodes i and j, given everyone's time window constraints and the Access bus's capacity constraints (12 ambulatory spots or 3 wheelchairs)?"
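That Y/N test might look like the toy sketch below, using 1-D locations and a stand-in travel-time function instead of real geography, and omitting the capacity check for brevity:

```python
def fits_between(schedule, i, new_stop, travel_time):
    """Can new_stop be inserted between schedule[i] and schedule[i+1]
    without breaking any later stop's time window?"""
    prev, nxt = schedule[i], schedule[i + 1]
    # Extra time from detouring prev -> new_stop -> nxt vs. prev -> nxt.
    detour = (travel_time(prev["loc"], new_stop["loc"])
              + travel_time(new_stop["loc"], nxt["loc"])
              - travel_time(prev["loc"], nxt["loc"]))
    arrive_new = prev["depart"] + travel_time(prev["loc"], new_stop["loc"])
    if arrive_new > new_stop["window_end"]:
        return False
    # Every downstream stop gets pushed back by the detour time.
    return all(stop["depart"] + detour <= stop["window_end"]
               for stop in schedule[i + 1:])

# Toy 1-D example: locations are points on a line, travel time = distance.
travel_time = lambda a, b: abs(a - b)
schedule = [{"loc": 0, "depart": 0, "window_end": 10},
            {"loc": 10, "depart": 10, "window_end": 20},
            {"loc": 20, "depart": 20, "window_end": 40}]
```

Best insertion then just runs this test over every (bus, gap) pair and keeps the feasible one with the smallest detour.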

On a regular weekday, the peak number of Access bus routes in operation is ~250. If a bus breaks down at a peak ridership time, are we going to test the insertion of b stranded passengers onto 250 routes? Well, hopefully not. My job for tomorrow is to write a reasonable distance cut-off function that will tell us which buses are simply too far away to be able to service the stranded passengers by the end of each passenger's acceptable pick-up or drop-off window.
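A cheap version of that cut-off is a straight-line (haversine) screen: if a bus couldn't reach the stranded passenger before the window closes even traveling as the crow flies at some generous speed, skip it. The 50 km/h ceiling and the field names are my assumptions, not anything from KCM:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))

def nearby_runs(stranded, runs, max_speed_kmh=50):
    """Drop runs that couldn't reach the stranded passenger before the
    end of their pick-up window, even at straight-line max speed."""
    keep = []
    for run in runs:
        dist = haversine_km(stranded["lat"], stranded["lon"],
                            run["lat"], run["lon"])
        eta_min = 60 * dist / max_speed_kmh
        if run["now_min"] + eta_min <= stranded["window_end_min"]:
            keep.append(run)
    return keep

# Toy screen: one bus downtown Seattle, one near Tacoma (~40 km away),
# and a 30-minute pick-up window.
stranded = {"lat": 47.6062, "lon": -122.3321, "window_end_min": 30}
runs = [{"lat": 47.61, "lon": -122.33, "now_min": 0},
        {"lat": 47.2529, "lon": -122.4443, "now_min": 0}]
candidates = nearby_runs(stranded, runs)
```

Because straight-line distance underestimates road distance, this screen never discards a truly feasible bus; it only prunes hopeless ones before the expensive routing queries.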

Overall, we're planning on using the OSRM (Open Source Routing Machine) API to obtain the estimated travel time to leave normally-scheduled node i, arrive at unscheduled request node h, and make it back to normally-scheduled node j by the end of j's time window. If that works out, then it's feasible to pick up the unscheduled request at node h. We'll get a list of these feasible pick-up routes. The next question obviously becomes whether we can also feasibly drop off the unscheduled request at their drop-off location without breaking subsequent time windows. One of my top priorities is to minimize the number of API requests our solution will have to make when querying travel times between points. I think I've got a good methodology...
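As a sketch, the i → h → j check could look like the code below. I'm assuming OSRM's route service URL format and the public demo server here, which may not match the deployment we'd actually use:

```python
import json
from urllib.request import urlopen

OSRM = "http://router.project-osrm.org"  # public demo server; an assumption

def travel_time_s(a, b):
    """Driving time in seconds between (lat, lon) points a and b,
    via OSRM's route service (note: OSRM expects lon,lat order)."""
    url = "{}/route/v1/driving/{},{};{},{}".format(
        OSRM, a[1], a[0], b[1], b[0])
    with urlopen(url) as resp:
        return json.load(resp)["routes"][0]["duration"]

def detour_feasible(depart_i, i, h, j, window_end_j, travel=travel_time_s):
    """Can the bus leave node i at depart_i, serve unscheduled request h,
    and still reach node j by the end of j's time window?"""
    return depart_i + travel(i, h) + travel(h, j) <= window_end_j
```

On minimizing API calls: OSRM also exposes a table service that returns a whole duration matrix in one request, so batching all candidate node pairs into one matrix query (and caching it) beats firing off per-pair route requests.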

Iterating over all of the unscheduled requests, we'll get a list of wholly feasible bus routes that can accommodate each unscheduled request, and then we'll have to rank each feasible re-route. The ranking might get tricky, because technically every feasible re-route will extend its original bus route's schedule by at most twice the time-window size, assuming that no passenger's time window constraints are broken. But I anticipate that we might have to change our methodology to let feasible re-routes break time window constraints, in which case the ranking criterion would obviously rank feasible re-routes according to which misses the fewest passengers' time windows, and by how much those windows are missed.

Then, there's the whole question of how to put this application into a usable package... Django, anyone?


Wednesday, June 24, 2015

Do Ya Reading

You're not asking a question that's never been asked before.


Chances are, you will rarely ever ask a question that's never been studied before. Why? Well, it's 2015 and the world is much more complex than you ever imagined. People in the early 20th century thought we'd progress so fast that people simply wouldn't need to work by the year 2000.

Alas, I ask you this question: given a traveling salesman who has to visit n customers and return home by the end of the day, what is the shortest possible route for the salesman that still visits each customer exactly once? What if each customer needs to be visited within a certain 30-minute window? What if there are multiple salesmen? Finally - how do we reschedule the salesmen if one of their cars breaks down? People in 1930 couldn't solve this; they didn't have computers. Thus, the world is much more complicated than assembly lines and the Teapot Dome.
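For a handful of customers, that first question yields to sheer enumeration; this is a toy sketch (with a made-up distance matrix), and its factorial blow-up is exactly why the harder variants need real algorithms:

```python
from itertools import permutations

def brute_force_tsp(dist, n):
    """Shortest tour 0 -> ... -> 0 visiting nodes 1..n-1 exactly once.
    dist[a][b] is the travel cost; only workable for tiny n (O(n!))."""
    best_tour, best_cost = None, float("inf")
    for order in permutations(range(1, n)):
        tour = (0,) + order + (0,)
        cost = sum(dist[tour[k]][tour[k + 1]] for k in range(len(tour) - 1))
        if cost < best_cost:
            best_tour, best_cost = tour, cost
    return best_tour, best_cost

# Four "customers" with a symmetric, made-up cost matrix.
dist = [[0, 2, 9, 10],
        [2, 0, 6, 4],
        [9, 6, 0, 8],
        [10, 4, 8, 0]]
tour, cost = brute_force_tsp(dist, 4)
```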

This is basically the Data Science for Social Good Paratransit team's problem. It's a much enhanced Traveling Salesman Problem with Time Windows (TSPTW). Our problem is the following: King County Metro Access Operations already has a solution to the Vehicle Routing Problem with Time Windows for the next day's ADA paratransit rides, and then a bus breaks down or a driver doesn't show up. Access Operations can dispatch an entirely new Access bus, send taxis, or potentially reroute already-existing Access bus runs to cover the broken bus's run. Which is the cheapest option?

Fortunately, there's an extensive body of literature on the Vehicle Routing Problem and Route Disruption Recovery. We're not asking an entirely new question; we're framing it in a new light: we need real-time vehicle rerouting with time windows that minimizes disruption to pre-optimized routes (already studied), with hard constraints so that there are no stop cancellations (not studied), which is particular to the paratransit context.

In the end, I guess I've misspoken. You'll never be asking an entirely new question, but you may be tweaking a previously well-understood question to the point where the prior methods for solving are now useless. Let's hope that's not the case here!

Tuesday, June 23, 2015

Big Data Analytics and Buzzwords: the process, coming from an intern.

Where Buzzwords hit the road.


It's the beginning of week two as an inaugural "Data Science for the Social Good" intern, and it's been eye-opening. First off, this phrase that's been floating around the news-Facebook-Twitter-HuffPo hyperether, "data science," has many more components than meet the eye. It's not just the simple implementation of some sexy new machine learning algorithm "to do some deep learning," but instead a steady combination of organization, Git commits, and data QC'ing throughout!

The ML comes at the end, folks, as a rewarding confirmation of all your hacking and high-level data manipulation.

Let's relate it to my team's project regarding King County Metro Access - i.e. ADA paratransit serving the Seattle metro region. We're attempting to find the density of costs per passenger boarding, separate the outliers, and draw conclusions about the Access routes that have these expensive constituent rides. We have tons of data! And a lot of it looks like this:




Sweet, good thing I'm a data scientist and I have the statistics background to read this. Just kidding. Right now I'm still depending on my communication skills to ask people what this means, and code in R accordingly. What does it mean for an Access bus to complete a trip? What if there's missing latitude and longitude data? What if the bus leaves the garage but it's never indicated that it returns? WHEN WILL I BE ABLE TO RUN AN SVM MODEL WITH AN RBF KERNEL? I WANT DEEP LEARNING! MODELS MODELS MODELS!


Nope! End of the day: you have to QC your data before you can do anything else. That's why I made a .json file that contains the separated Access van runs and flags each one according to the quality of its data. Step 1: almost complete.
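The flagging step could be sketched like this; the field names and the two QC rules are invented stand-ins for the real checks (missing lat/lon, no recorded garage return):

```python
import json

def flag_run(run):
    """Return a list of data-quality flags for one Access van run.
    Field names and rules are hypothetical examples."""
    flags = []
    if any(stop.get("lat") is None or stop.get("lon") is None
           for stop in run["stops"]):
        flags.append("missing_lat_lon")
    if run.get("garage_return") is None:
        flags.append("no_garage_return")
    return flags

# Two toy runs: one clean, one with both problems.
runs = {
    "5012": {"stops": [{"lat": 47.61, "lon": -122.33}],
             "garage_return": "18:05:00"},
    "5013": {"stops": [{"lat": None, "lon": None}],
             "garage_return": None},
}
flagged = {run_id: {"flags": flag_run(run)} for run_id, run in runs.items()}
output = json.dumps(flagged, indent=2)  # what would land in the .json file
```

Downstream steps can then filter on the flags instead of re-deriving the QC logic every time.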