Wednesday, June 24, 2015

Do Ya Reading

You're not asking a question that's never been asked before.


Chances are, you will rarely ask a question that's never been studied before. Why? Well, it's 2015 and the world is much more complex than you ever imagined. People in the early 20th century thought we'd progress so fast that nobody would need to work by the year 2000.

Alas, I ask you this question: given a traveling salesman who has to visit n customers and return home by the end of the day, what is the shortest possible route that still visits each customer exactly once? What if each customer needs to be visited within a particular 30-minute window? What if there are multiple salesmen? Finally, how do we reschedule the salesmen if one of their cars breaks down? People in 1930 couldn't solve this; they didn't have computers. Thus, the world is much more complicated than assembly lines and the Teapot Dome.
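To make the basic question concrete, here's a minimal brute-force sketch in Python. It is not our team's method, just the plain version of the problem: no time windows, a single salesman, straight-line distances. The function name and the coordinates are made up for illustration.

```python
from itertools import permutations
from math import dist


def shortest_tour(home, customers):
    """Return (length, route) for the shortest round trip from home
    that visits every customer exactly once."""
    best_length, best_route = float("inf"), None
    # Try every possible visiting order: n! of them for n customers.
    for order in permutations(customers):
        stops = [home, *order, home]
        length = sum(dist(a, b) for a, b in zip(stops, stops[1:]))
        if length < best_length:
            best_length, best_route = length, stops
    return best_length, best_route


# Hypothetical data: home at the origin, customers on the other
# three corners of a unit square.
length, route = shortest_tour((0, 0), [(0, 1), (1, 1), (1, 0)])
print(length)  # 4.0
```

Checking every order is fine for three customers, but 10 customers already means 3,628,800 orderings, and that's before adding time windows or a second salesman.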

This is basically the Data Science for Social Good Paratransit team's problem. It's a much-enhanced Traveling Salesman Problem with Time Windows (TSPTW). Our problem is the following: King County Metro Access Operations already has a solution to the Vehicle Routing Problem with Time Windows for the next day's ADA paratransit rides. Then a bus breaks down or a driver doesn't show up. Access Operations can dispatch an entirely new Access bus, send taxis, or potentially reroute existing Access bus runs to cover the broken bus's run. Which is the cheapest option?

Fortunately, there's an extensive body of literature on the Vehicle Routing Problem and Route Disruption Recovery. We're not asking an entirely new question. We are framing it in a new light: we need real-time vehicle rerouting with time windows that minimizes disruption to pre-optimized routes (already studied), under the hard constraint that no stops are cancelled (not studied). That constraint is particular to the paratransit context.
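Here's a rough Python sketch of what that hard constraint looks like for a single run. It assumes a simplified model that I'm inventing for illustration: each stop has an (earliest, latest) arrival window in minutes, a bus that arrives early just waits, and boarding time is ignored. The function name and the numbers are hypothetical, not Access data.

```python
def first_violation(windows, travel_minutes, start_time=0):
    """Return the index of the first stop that misses its window,
    or None if every stop on the run can still be served.

    windows: list of (earliest, latest) arrival times, in visiting order.
    travel_minutes: travel_minutes[i] is the drive time to stop i
        from the previous location.
    """
    clock = start_time
    for i, ((earliest, latest), drive) in enumerate(zip(windows, travel_minutes)):
        clock += drive
        if clock > latest:
            return i  # too late: this stop would have to be cancelled
        clock = max(clock, earliest)  # arriving early means waiting
    return None


# An existing run that works as planned.
print(first_violation([(0, 30), (30, 60), (60, 90)], [20, 25, 20]))  # None

# The same run after squeezing in a stranded rider with window (40, 70).
print(first_violation([(0, 30), (40, 70), (30, 60), (60, 90)], [20, 15, 30, 20]))  # 2
```

In the second call the detour makes the bus late for a rider who was already scheduled, so that particular reroute is off the table no matter how cheap it is.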

In the end, I guess I've misspoken. You'll never be asking an entirely new question, but you may be tweaking a previously well-understood question to the point where the prior methods for solving are now useless. Let's hope that's not the case here!

Tuesday, June 23, 2015

Big Data Analytics and Buzzwords: the process, coming from an intern.

Where Buzzwords hit the road.


It's the beginning of week two as an inaugural "Data Science for Social Good" intern, and it's been eye-opening. First off, this phrase that's been floating around the news-Facebook-Twitter-HuffPo hyperether, "data science," has many more components than meets the eye. It's not just the simple implementation of some sexy new machine learning algorithm "to do some deep learning," but is instead a steady combination of organization, Git commits, and data QC'ing throughout!

The ML comes at the end, folks, as a rewarding confirmation of all your hacking and high-level data manipulation.

Let's relate it to my team's project regarding King County Metro Access - i.e. ADA paratransit serving the Seattle metro region. We're attempting to find the density of costs per passenger boarding, separate the outliers, and draw conclusions about the Access routes that have these expensive constituent rides. We have tons of data! And a lot of it looks like this:




Sweet, good thing I'm a data scientist and I have the statistics background to read this. Just kidding. Right now I'm still depending on my communication skills to ask people what this means, and code in R accordingly. What does it mean for an Access bus to complete a trip? What if there's missing latitude and longitude data? What if the bus leaves the garage but it's never indicated that it returns? WHEN WILL I BE ABLE TO RUN AN SVM MODEL WITH AN RBF KERNEL? I WANT DEEP LEARNING! MODELS MODELS MODELS!


Nope! End of the day: you have to QC your data before you can do anything else. That's why I made a .json file that contains separated Access van runs and flags them according to the quality of each run's data. Step 1: almost complete.
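For a flavor of what that flagging can look like, here's a small sketch. It's in Python rather than the R I'm actually using, and every field name in it (`run_id`, `events`, `kind`, `lat`, `lon`) and both flag names are invented for illustration. The real Access data has its own schema. The two checks mirror two of the questions above: missing coordinates, and a bus that leaves the garage with no recorded return.

```python
import json


def flag_run(run):
    """Return a list of data-quality flags for one run (empty means clean)."""
    flags = []
    events = run["events"]
    # Any event without a usable latitude/longitude pair.
    if any(e.get("lat") is None or e.get("lon") is None for e in events):
        flags.append("missing_coordinates")
    # The bus leaves the garage but is never recorded coming back.
    kinds = [e["kind"] for e in events]
    if "leave_garage" in kinds and "return_garage" not in kinds:
        flags.append("no_garage_return")
    return flags


# Hypothetical runs, not real Access records.
runs = [
    {"run_id": "A", "events": [
        {"kind": "leave_garage", "lat": 47.6, "lon": -122.3},
        {"kind": "return_garage", "lat": 47.6, "lon": -122.3},
    ]},
    {"run_id": "B", "events": [
        {"kind": "leave_garage", "lat": None, "lon": -122.3},
    ]},
]

flagged = [{"run_id": r["run_id"], "flags": flag_run(r)} for r in runs]
print(json.dumps(flagged, indent=2))
```

Run A comes back with no flags and run B with both, so anything downstream can decide which runs are clean enough to use.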