capstone_project's Introduction

Hi there 👋

I'm a data scientist focusing on time-series analysis and forecasting in R. I've worked as a data science manager at an automotive engineering company, where I developed anomaly detection algorithms for engine health monitoring on multivariate IoT sensor data. I've also been a 6S Black Belt, Product Validation Engineer, and FEA Analyst (computational stress, vibration, and fatigue modeling).

Topics that excite me:

  • anomaly detection
  • time series forecasting
  • reproducible work & production readiness
  • R package development
  • visualization
  • visual astronomy
  • mechatronics (arduino, rpi)

Main projects

Tools for time series analysis and forecasting, and datasets:

portfolio anomaly_db docker postgres

Talks

portfolio

Misc

rsangole

capstone_project's People

Contributors

andrew3cooper, kapelinskim6, rsangole, stephenhage, tedinciong

capstone_project's Issues

Modelling Task List - Rahul

Maintaining a task list for myself here.

Data Processing

  • define groups of variables to use as predictors
  • identify near-zero-variance and zero-variance predictors
  • train-val-test splits
  • convert daily data into monthly averages
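
The daily-to-monthly conversion in the last item could be sketched as follows. This is a minimal illustration in plain Python, not the team's pipeline: the function name `monthly_averages`, the sample temperatures, and the choice to skip missing readings (rather than impute them) are all assumptions.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_averages(daily):
    """Average (date, value) readings into one value per (year, month).

    Missing readings (None) are skipped, which is an assumption; imputing
    them first may be preferable for sparse stations.
    """
    buckets = defaultdict(list)
    for day, value in daily:
        if value is not None:
            buckets[(day.year, day.month)].append(value)
    return {month: mean(values) for month, values in sorted(buckets.items())}


# Hypothetical daily average temperatures, for illustration only.
daily_tavg = [
    (date(2013, 6, 1), 70.0),
    (date(2013, 6, 2), 74.0),
    (date(2013, 7, 1), 80.0),
    (date(2013, 7, 2), None),   # missing reading
    (date(2013, 7, 3), 84.0),
]
monthly = monthly_averages(daily_tavg)
```

Keying on (year, month) rather than month alone keeps the same calendar month from different seasons separate.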

EDA and Hypothesis Testing

  • what lag terms do we need? perform time series analysis with acf/pacf plots
  • for weather data - can we stick to just using ohare data?
  • for just ohare data, how much weather information do we need? opportunity to reduce vars using PCA?
  • what does clustering show us? anything useful?
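
For the lag-term question above, the quantity an ACF plot displays can be computed directly. The sketch below uses the standard sample autocorrelation estimator; the function name `sample_acf` and the toy series are hypothetical, and a real analysis would more likely rely on an existing time-series package.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelation of `series` for lags 0..max_lag.

    Uses the usual biased estimator: the lag-k autocovariance divided by
    the lag-0 autocovariance, both centered on the overall mean.
    """
    n = len(series)
    mu = sum(series) / n
    denom = sum((x - mu) ** 2 for x in series)
    return [
        sum((series[t] - mu) * (series[t + k] - mu) for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]


# Toy series that alternates every step, so odd lags are negative and
# even lags are positive.
toy = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
acf_values = sample_acf(toy, max_lag=2)
```

Lags whose autocorrelation stays clearly away from zero are the natural candidates for lag features.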

Feature Engineering

  • weight of evidence variables for large-level factor variables
  • introduce appropriate lag terms

Reading

  • finish reading the antonUBC kaggle code for inspiration

Modeling

  • set up the code structure to use mlr correctly (especially to use the model comparison codes)

Feature Reduction Activities

  • lasso regression based feature selection
  • random forest's var imp plot based feature selection
  • information value based feature reduction
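
The information-value item, and the weight-of-evidence encoding listed under Feature Engineering, can be illustrated with a small sketch. Conventions differ between references: here WOE is taken as the log of a level's share of events over its share of non-events, and a smoothing constant of 0.5 is added to every count to avoid dividing by zero. Both choices, the function name `woe_iv`, and the sample data are assumptions.

```python
import math
from collections import Counter


def woe_iv(levels, target, smoothing=0.5):
    """Weight of evidence per factor level and the overall information value.

    levels: categorical value for each row
    target: 1 for an event (e.g. WNV present), 0 otherwise
    smoothing: pseudo-count added to each cell (an assumption, not a standard)
    """
    events = Counter(l for l, y in zip(levels, target) if y == 1)
    non_events = Counter(l for l, y in zip(levels, target) if y == 0)
    all_levels = sorted(set(levels))
    total_events = sum(events.values()) + smoothing * len(all_levels)
    total_non_events = sum(non_events.values()) + smoothing * len(all_levels)

    woe, iv = {}, 0.0
    for level in all_levels:
        event_share = (events[level] + smoothing) / total_events
        non_event_share = (non_events[level] + smoothing) / total_non_events
        woe[level] = math.log(event_share / non_event_share)
        iv += (event_share - non_event_share) * woe[level]
    return woe, iv


# Hypothetical two-level factor: level "A" is mostly events, "B" mostly not.
levels = ["A"] * 4 + ["B"] * 4
target = [1, 1, 1, 0, 0, 0, 0, 1]
woe, iv = woe_iv(levels, target)
```

Replacing a many-level factor with its WOE value yields a single numeric column, and predictors with very low IV are candidates for removal.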

Reporting

Develop data dictionary

We need to put together a data dictionary for all the data we will be consuming. This can be an Excel sheet or an R data frame.

  • data dictionary - what data, column definitions, usage comments etc
  • entity relationship diagram

Building data EDA and reshaping

I did a lot of work geocoding and reshaping building violations & abandoned/vacant property data, but I'm not sure yet if it'll be useful as is. Vacancy data in particular is very sparse if you look at small geographic areas (block group rather than community area, for example). Is there anything more we could do? Should we standardize data somehow, e.g. divide number of violations/vacancies by population in that geographic area?

Weather data EDA

There is a lot of weather data available and a lot of missing data. So far, I just pulled in a few variables (precipitation and temperature) from two stations (O'Hare and Midway airport weather stations). We could use additional EDA to see which other weather variables from other stations are populated enough to use and look like they may be useful. Alternatively, we could run a distance matrix or something to find the closest weather station to each trap (but first limit to stations with nearly-complete daily data!).
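
The nearest-station idea could be sketched as below: filter to stations with nearly complete daily data first, then pick the closest one by great-circle (haversine) distance. The station coordinates are approximate, and the `completeness` values, the 0.95 threshold, the "Sparse station" entry, and the trap location are all made up for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(trap, stations, min_completeness=0.95):
    """Closest station whose share of populated days meets the threshold."""
    candidates = [s for s in stations if s["completeness"] >= min_completeness]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda s: haversine_km(trap["lat"], trap["lon"], s["lat"], s["lon"]),
    )


# Approximate airport coordinates; completeness values are hypothetical.
stations = [
    {"name": "O'Hare", "lat": 41.98, "lon": -87.90, "completeness": 0.99},
    {"name": "Midway", "lat": 41.79, "lon": -87.75, "completeness": 0.98},
    {"name": "Sparse station", "lat": 41.80, "lon": -87.74, "completeness": 0.40},
]
trap = {"id": "T009", "lat": 41.80, "lon": -87.74}  # hypothetical location
best = nearest_station(trap, stations)
```

The sparse station sits right on top of the trap but is excluded by the completeness filter, so the match falls back to Midway.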

Calculate distance from trap and add to schools, hospitals, and SC

At a minimum, each of these locations (school, hospital, SC) should provide information to the user on the nearest trap. For example:

Nearest Trap: T009
Location: (Address or neighborhood)
Last Time to Test Positive for WNV: Date
Projected Date to Test Positive for WNV: Date
Likelihood to Test Positive: (From Model)
Severity of Testing Positive: (Based on distance from Trap)

Create code for scoring models

We have train/validation/test splits designated in the WNV trap test result dataset.

We have two proposed target variables (separate models for number of mosquitos [regression problem] and presence of WNV [binary classification problem]).

We'll be running a lot of models in the near future.

We need to settle on metrics for evaluating models and on code for implementing them. Personally, I'd prefer to write the model scoring functions in R since I'm doing little with Python, but that's just me.
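
As a starting point for that discussion, here is a sketch of three common candidates: RMSE for the mosquito-count regression, and log loss and ROC AUC for the WNV-presence classifier. These are suggestions, not metrics the team has agreed on. The code is in Python purely for illustration, and the definitions port directly to R.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error for the regression target."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood for binary labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Small hypothetical predictions to exercise the functions.
count_error = rmse([0, 0], [3, 4])
wnv_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
wnv_loss = log_loss([1, 0], [0.5, 0.5])
```

The pairwise AUC above is quadratic in the number of rows, which is fine for a sanity check but slow on large validation sets. Whichever metrics are chosen, they should be computed on the validation split only, with the test split held back for the final comparison.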

Exploratory Data Analysis - Summarizations, Graphs and Plots

Everyone:

You can contribute to the EDA phase of the project by performing various visualizations and summaries. These should be published in some form back to the team - either rmarkdown documents, jupyter notebooks, PDFs, or simply a word document.

Ideas for visualizations:

  • Time series plots for various continuous predictors
  • Time series plots for response variable
  • Boxplots for categorical variables
  • Scatter plots color coded by categoricals
  • Scatter plots color coded by response
  • Maps + overlaps of various predictors and response
  • Clustering and cluster plots
  • TSNE plots
  • Density plots grouped by categoricals and response

Linking two new issues Andrew has created: #20 and #21

create shiny server

Currently trying to install the VM shiny server, but unable to run fab vagrant setup_vagrant

Rendering issues in Shiny

I need to fine-tune the rendering specs and make them adapt to the medium. Maps and plots should render full-page on the website as well as in the mobile app. Currently, the layer options panel renders off-screen.

Obtain weather data -- since NOAA site is down, maybe try scraping weather data 2007-18

Since the data.NOAA.gov site is down due to the government shutdown, someone could try scraping the weather data instead. Try following the Python tutorial posted in Canvas.

We do have kaggle's NOAA extract for 2007-14. They included data from two unidentified weather stations. If we can get data from multiple weather stations closer to the WNV traps, that should improve our models.

Obtain demographic and socioeconomic data (Census American Community Survey)

The Census American Community Survey 5-year summary files are huge. I need to go back and figure out which variables to extract. Once I have that done, I can upload the subset here. I'm also trying to decide whether to use just one 5-year summary or a different summary depending on the WNV trap date.

Develop Dashboards

  • Dashboard for # of Mosquitoes
  • Dashboard for # of WNV Positive Traps
  • Enhance dashboards with school, hospital, senior center markers
  • Basic data summaries (bar charts, line graphs, etc.)
  • Dashboards down to long/lat level
  • Explore map overlay options
  • Prepare options for viz/maps to show forecasted data

Modeling Ideas

Team,

Please populate your modeling ideas here.

What types of models can be built? What ideas do you have in mind?

Additional lagged weather columns

Could we add the following columns to the data, when you get a chance?

  • 2wk_precip - For each day, the preceding 2-week average for precip
  • 4wk_precip
  • 90day_precip
  • 2wk_tavg
  • 4wk_tavg
  • 2wk_mintemp
  • 4wk_mintemp
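
One way to compute these columns is a trailing mean over the days before each date. The sketch below assumes one row per consecutive calendar day, and it reads "preceding" as excluding the current day. Both are assumptions, as are the function name `trailing_mean` and the sample precipitation values.

```python
from statistics import mean


def trailing_mean(values, window):
    """Mean of the `window` observations before each position.

    The current day is excluded. Positions without a full window of
    history get None, and None readings inside the window are ignored.
    """
    out = []
    for i in range(len(values)):
        if i < window:
            out.append(None)
            continue
        past = [v for v in values[i - window:i] if v is not None]
        out.append(mean(past) if past else None)
    return out


# Hypothetical daily precipitation, one value per consecutive day.
daily_precip = [float(d % 4) for d in range(100)]

windows = {"2wk_precip": 14, "4wk_precip": 28, "90day_precip": 90}
precip_columns = {name: trailing_mean(daily_precip, w) for name, w in windows.items()}
```

The same function can be applied to the average and minimum temperature series to produce the `2wk_tavg`, `4wk_tavg`, `2wk_mintemp`, and `4wk_mintemp` columns.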

Fix Ohare heat map

When I look at the count for positive WNV, ORD has 300+ cases of WNV. But when I plot positive WNV as a heat map, ORD does not show. Investigating...

Spray data EDA

Does someone want to run EDA on the two years of mosquito spray data we have from kaggle?

I'm curious what we'll see, and I'd like to confirm two things: that spraying is readily predicted by the recent history of positive WNV test results, and that mosquito counts are inversely related to the recent history of spraying. Both will require modeling, but it doesn't have to be fancy.
