Giter Club home page Giter Club logo

datapalooza-2015's Introduction

Caltrain Rider: A Complete Data Product, IBM Datapalooza 2015

The following material was presented at the IBM Datapalooza, Nov. 9, 2015.

Session overview: In a finished data product, there are a lot of decisions, both technical and strategic. We'll use the Caltrain Rider app to tell the story of deciding architecture, experimentation with data science, deploying as a mobile app, and how to create a positive data feedback loop. Why did we create this app? We love data. We love trains. Our office is next to a Caltrain Station. Building an intuitive view of the Caltrain systems using information from some of our own sensors (video, audio) combined with information from publicly available sensors (Twitter, Caltrain API) is a fun way to hone our data science skills, while helping to fulfill one of our core values: contributing back to our community.

The Silicon Valley Data Science Caltrain Rider app provides an intuitive way for Caltrain commuters to understand their personal train schedule. In the future, the application will provide predictions of schedule delays and relevant notifications for riders' routes. The purpose of this repo is present several of the key data science components that have been completed for the app. Since the data engineering platform and application development environment are already in place, we anticipate that a working prediction model will soon be pushed to the Caltrain Rider app.

Project Outline

  • iOS Application
  • Published schedule refresh (completed and deployed)
  • Realtime schedule pipeline (completed)
  • Twitter pipeline (completed)
  • Sentiment Analysis (completed)
  • Analysis and prediction of delays (in progress)
  • Web Service refactor
  • Integration testing

Sentiment Analysis of Caltrain Tweets

Sentiment Analysis of Caltrain Tweets.ipynb

Twitter is a popular place for people to vent their frustrations or update their fellow passengers on the current state of public transportation, and as such it is a valuable source of real-time data on Caltrain service. As a first step to using Twitter activity as a data source for train prediction, we start with a simple question: How do Twitter users currently feel about Caltrain?

This type of problem is well-studied in natural language processing and machine learning, and can be thought of as a classification task: given the content of a document (in our case, a tweet), classify its sentiment as either positive or negative. In the notebook [Sentiment Analysis Of Caltrain Tweets](Sentiment\ Analysis\ of\ Caltrain\ Tweets.ipynb), we walk through an example of this procedure using scikit-learn in Python.

Trainspotting

trainspotting-demo.ipynb

This demo will walk you through the creation of an algorithm that uses image processing techniques and heuristics unique to our problem of detecting when a train passes by our office and in which direction it traveling. The following steps outline the content of the notebook:

  • Load video
  • Convert to grayscale
  • Smooth the frames
  • Compute running average
  • Detect changes
  • Adjust the difference threshold
  • Compare right/left sides of video
  • Detect train
  • Detect direction of train

Technical note: The trainspotting-demo depends on motion_detector.py and the video file 'test.avi'. Also required is the OpenCV library, which we recommend obtaining with conda install opencv (after which you may need to run conda update numpy).

Detecting Catastrophic Delays from Twitter

caltrain_tweets.ipynb

Caltrain occasionally experiences system-wide perturbations that make train departures unpredictable. When this happens, Caltrain shuts down their real-time API. This occurs during:

  • pedestrian strikes -vehicle strikes
  • stalled trains

The frequency and duration of such events are unpredictable. In some cases, trains experience long delays due to externalities, such as:

  • major sporting events
  • 'heavy ridership'
  • busy holiday weekends
  • construction

While these are often known in advance, their exact effects on train delays are difficult to predict. The purpose of this analysis is to detect such catastrophic delays in order shift the operating regime of our prediction model. As trains begin to operate normally along the corridor, our model must recover reliably.

We thus examine catastrophic events in historical data for two reasons:

  • Exclude them from predictible delays for the purposes of training our model,
  • Identify them in real-time so that the prediction model can be shifted to a different regime

For the initial stages, we manually inspected Caltrain's twitter feed. Tweets are loaded from the data file caltrain_tweets.json. During a 10-day period in October, we identified four major incidents, three of which resulted in service interruptions and major delays.

datapalooza-2015's People

Contributors

beneverson-svds avatar chloe-svds avatar eric-svds avatar stephen-svds avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.