Anomaly-Detector

A few anomaly detection models built to find the anomalies in the data.

D(St)reams of Anomalies

The real world does not slow down for bad data

  1. Set up a data science project structure in a new git repository in your GitHub account
  2. Download the benchmark data set from:
  3. Load one of the data sets into a pandas DataFrame
  4. Using exploratory data analysis, formulate one or two ideas on how feature engineering could add value to the data set
  5. Build one or more anomaly detection models to determine the anomalies using the other columns as features
  6. Document your process and results
  7. Commit your notebook, source code, visualizations and other supporting files to the git repository in GitHub

Data Description

This dataset (nyc_taxi) contains the number of NYC taxi passengers. Its five anomalies occur during the NYC marathon, Thanksgiving, Christmas, New Year's Day, and a snow storm. The raw data is from the NYC Taxi and Limousine Commission. The data file included here aggregates the total number of taxi passengers into 30-minute buckets.
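As a rough illustration of what "30-minute buckets" means, the sketch below aggregates a handful of made-up per-trip records with pandas. The column names `pickup_time`, `passengers`, `timestamp` and `value` are assumptions for the example, not taken from the data file.

```python
import pandas as pd

# Made-up per-trip records (not real data): pickup time and passenger count.
trips = pd.DataFrame(
    {
        "pickup_time": pd.to_datetime(
            [
                "2014-07-01 00:05", "2014-07-01 00:20", "2014-07-01 00:40",
                "2014-07-01 01:10", "2014-07-01 01:25",
            ]
        ),
        "passengers": [2, 1, 3, 1, 4],
    }
)

# Sum passengers per 30-minute bucket; each bucket is labelled by its start time.
buckets = (
    trips.set_index("pickup_time")["passengers"]
    .resample("30min")
    .sum()
    .rename("value")
    .reset_index()
    .rename(columns={"pickup_time": "timestamp"})
)

# Buckets: 00:00 -> 3, 00:30 -> 3, 01:00 -> 5
print(buckets)
```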

What do we want to do?

  • Main goal: detect anomalies in New York City taxi data.
  • First we go through the dataset to understand which features we can use to improve the results of our anomaly detector.
  • We created a new feature, changeValue, which stores the difference between the values of two consecutive samples.
  • We also created another feature, moving_average, which stores the moving average over 5 consecutive samples.
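The two features above can be sketched with pandas as follows. This is a minimal sketch, assuming the passenger counts live in a column named `value` (an assumption) and using made-up numbers; the notebook may compute them differently.

```python
import pandas as pd

# Made-up passenger counts standing in for the real series.
df = pd.DataFrame({"value": [10, 12, 11, 15, 14, 20, 18]})

# changeValue: difference between each sample and the previous one.
df["changeValue"] = df["value"].diff()

# moving_average: mean over a window of 5 consecutive samples.
df["moving_average"] = df["value"].rolling(window=5).mean()

# The first rows have no full window (or no previous sample), so drop them
# before handing the features to a model.
features = df.dropna().reset_index(drop=True)
print(features)
```

Dropping the incomplete rows is one option; filling them instead would keep the series the same length as the raw data.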

Results:

One Class SVM based Anomaly detector:

  • It flagged about 29 outliers, but there are only 5 actual anomalies, so it did not perform well in our case.
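One possible reason for over-flagging is the `nu` parameter of scikit-learn's `OneClassSVM`, which roughly controls the share of training points treated as outliers. The sketch below uses synthetic two-column data (standing in for value and changeValue) to show how the flagged count moves with `nu`. The data, the scaling step and the `nu` values are all assumptions, not the settings used in the notebook.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)

# Synthetic stand-in features (not the real data).
X = rng.normal(loc=[15000.0, 0.0], scale=[1000.0, 300.0], size=(600, 2))

flagged = {}
for nu in (0.01, 0.10):
    # Scale first: the RBF kernel is sensitive to feature magnitudes.
    model = make_pipeline(
        StandardScaler(),
        OneClassSVM(kernel="rbf", gamma="scale", nu=nu),
    )
    labels = model.fit_predict(X)  # -1 = outlier, 1 = inlier
    flagged[nu] = int((labels == -1).sum())

print(flagged)
```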

Isolation Forest based Anomaly detector:

  • It detected the NYC marathon, Thanksgiving, Christmas, and New Year's Day (four of the five known anomalies), so overall it performed well.
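To compare a detector's output with known events, it helps to collapse the flagged 30-minute buckets into calendar days. The following is a minimal sketch with scikit-learn's `IsolationForest` on a synthetic series with one injected spike. The series, the spike and the `contamination` value are illustrative assumptions.

```python
import datetime

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for the 30-minute passenger counts (not the real data).
index = pd.date_range("2014-11-01", periods=480, freq="30min")
df = pd.DataFrame({"value": rng.normal(15000, 1000, size=len(index))}, index=index)
df.iloc[200:204, 0] = [38000, 40000, 41000, 39000]  # injected spike on 2014-11-05

model = IsolationForest(contamination=0.01, random_state=0)
df["anomaly"] = model.fit_predict(df[["value"]]) == -1  # True = flagged

# Collapse flagged buckets to calendar days for comparison with known events.
flagged_days = sorted(set(df.index[df["anomaly"].to_numpy()].date))
print(flagged_days)
```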

Local Outlier Factor based Anomaly detector:

  • It also detected the NYC marathon, Thanksgiving, and New Year's Day (three of the five known anomalies), so this model performed well too.
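For reference, a Local Outlier Factor detector in scikit-learn can be sketched like this. The two synthetic columns stand in for value and changeValue, and `n_neighbors=20` and `contamination=0.01` are assumed settings rather than the ones from the notebook.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)

# Synthetic stand-in features (not the real data).
X = rng.normal(loc=[15000.0, 0.0], scale=[1000.0, 300.0], size=(400, 2))
X[123] = [30000.0, 9000.0]  # one injected outlier, purely illustrative

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)  # -1 = outlier, 1 = inlier

# A more negative factor means the point is more isolated from its neighbours.
most_isolated = int(np.argmin(lof.negative_outlier_factor_))
n_flagged = int((labels == -1).sum())
print(most_isolated, n_flagged)
```

Unlike the other two models, LOF in this default mode scores only the data it was fitted on, which suits a one-off analysis of a fixed series.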

Conclusion and Further improvements:

  • Our models performed quite well on this data, but we are not sure they would perform as well on other, similar data; it would only be fair to test them on other data sets and evaluate their performance.
  • In this project, Local Outlier Factor and Isolation Forest performed better than the OneClassSVM model.
  • In the future, we can tune parameters such as contamination, which we used in this project, along with others, to get the most out of these models.
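As a starting point for that tuning, the sketch below sweeps `contamination` for an `IsolationForest` and records how many points get flagged at each setting. The synthetic data and the candidate values are assumptions; on the real data one would compare the flagged dates against the five known events.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)

# Synthetic stand-in for the passenger counts (not the real data).
X = rng.normal(15000, 1000, size=(1000, 1))

counts = {}
for contamination in (0.005, 0.01, 0.02, 0.05):
    # Same random_state, so only the decision threshold changes between runs.
    model = IsolationForest(contamination=contamination, random_state=0)
    counts[contamination] = int((model.fit_predict(X) == -1).sum())

print(counts)
```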

Project Structure:

Readme.md

  • Project description

Data

  • Contains link to dataset

Notebooks

  • Jupyter Notebook for Exploratory data analysis, Visualization, Feature Engineering and Anomaly Detection.

Reports

  • Plot- change in values
  • Plot- raw data visualization
  • Plot- moving averages
  • Plot- OneClassSVM based Anomaly detector
  • Plot- Local Outlier Factor based Anomaly detector
  • Plot- Isolation Forest based Anomaly detector

Requirements.txt

  • Tools, frameworks and libraries required to reproduce the workflow

Contributors

  • sflazarus
