Tuesday, June 23, 2015

Big Data Analytics and Buzzwords: the process, coming from an intern.

Where Buzzwords hit the road.


It's the beginning of week two as an inaugural "Data Science for the Social Good" intern, and it's been eye-opening. First off, this phrase that's been floating around the news-Facebook-Twitter-HuffPo hyperether, "data science," has many more components than meet the eye. It's not just the simple implementation of some sexy new machine learning algorithm "to do some deep learning," but is instead a steady combination of organization, Git commits, and data QC'ing throughout!

The ML comes at the end, folks, as a rewarding confirmation of all your hacking and high-level data manipulation.

Let's relate it to my team's project regarding King County Metro Access - i.e. ADA paratransit serving the Seattle metro region. We're attempting to find the density of costs per passenger boarding, separate the outliers, and draw conclusions about the Access routes that have these expensive constituent rides. We have tons of data! And a lot of it looks like this:




Sweet, good thing I'm a data scientist and I have the statistics background to read this. Just kidding. Right now I'm still depending on my communication skills to ask people what this means, and code in R accordingly. What does it mean for an Access bus to complete a trip? What if there's missing latitude and longitude data? What if the bus leaves the garage but it's never indicated that it returns? WHEN WILL I BE ABLE TO RUN AN SVM MODEL WITH AN RBF KERNEL? I WANT DEEP LEARNING! MODELS MODELS MODELS!


Nope! End of the day: you have to QC your data before you can do anything else. That's why I made a .json file that contains the separated Access van runs and flags each one according to the quality of its data. Step 1: almost complete.
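For the curious, here's roughly what that kind of flagging can look like. This is a minimal sketch in Python rather than the R I'm actually using, and everything in it is an assumption made up for illustration: the field names (run_id, events, type, lat, lon), the event types (leave_garage, return_garage), the flag names, and the sample runs themselves. The real Access data doesn't necessarily look like this.

```python
import json


def flag_run(run):
    """Return a list of quality flags for one van run (an empty list means no issues found)."""
    flags = []
    events = run.get("events", [])

    # Hypothetical check 1: any event missing a latitude or longitude.
    if any(e.get("lat") is None or e.get("lon") is None for e in events):
        flags.append("missing_coordinates")

    # Hypothetical check 2: the van leaves the garage but no return is recorded.
    types = [e.get("type") for e in events]
    if "leave_garage" in types and "return_garage" not in types:
        flags.append("no_garage_return")

    return flags


def qc_runs(runs):
    """Attach quality flags to each run, keeping the run's events alongside them."""
    return [
        {"run_id": r["run_id"], "flags": flag_run(r), "events": r.get("events", [])}
        for r in runs
    ]


# Invented sample data, purely for illustration.
sample_runs = [
    {
        "run_id": "A1",
        "events": [
            {"type": "leave_garage", "lat": 47.60, "lon": -122.33},
            {"type": "pickup", "lat": 47.61, "lon": -122.34},
            {"type": "return_garage", "lat": 47.60, "lon": -122.33},
        ],
    },
    {
        "run_id": "B2",
        "events": [
            {"type": "leave_garage", "lat": 47.60, "lon": -122.33},
            {"type": "pickup", "lat": None, "lon": None},
        ],
    },
]

flagged = qc_runs(sample_runs)

# Serialize the flagged runs as JSON text, ready to be written to a .json file.
flagged_json = json.dumps(flagged, indent=2)
print(flagged_json)
```

Keeping the flags next to each run, instead of throwing the bad runs away, means you can decide later which problems actually matter for the cost analysis.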


