Anomaly_detection_using_ML

Introduction

Anomaly detection, also known as outlier detection, is a data mining process used to identify the types of anomalies in a dataset and to determine the details of their occurrences. Automatic anomaly detection is critical today because the sheer volume of data makes it impossible to tag outliers manually. It has a wide range of applications, such as fraud detection, system health monitoring, fault detection, and event detection in sensor networks.

Motivation

I was curious to see how different unsupervised ML algorithms (K-means clustering, Isolation Forest, One Class SVM, Gaussian Distribution) identify anomalies in a dataset.

Data

I used a dataset from Personalize Expedia Hotel Searches, which can be found here.

To keep the analysis simple, I selected:

  • train.csv;
  • one single hotel, which has the largest number of data points (prop_id = 124342);
  • visitor_location_country_id = 219;
  • srch_room_count = 1;
  • key features for the analysis: date_time, price_usd, srch_booking_window, srch_saturday_night_bool.
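The selection above could be expressed as a pandas filter along the following lines. This is a minimal sketch, not the code used in the project: the helper name `select_hotel_subset`, the assumption that the room-count column is named `srch_room_count` in the CSV, and the tiny made-up `demo` frame (standing in for train.csv) are all illustrative.

```python
import pandas as pd

# Key features named in the list above.
FEATURES = ["date_time", "price_usd", "srch_booking_window", "srch_saturday_night_bool"]


def select_hotel_subset(df, prop_id=124342, country_id=219, room_count=1,
                        room_col="srch_room_count"):
    """Keep one hotel, one visitor country, and single-room searches.

    `room_col` is an assumption about the CSV column name; adjust if it differs.
    """
    mask = (
        (df["prop_id"] == prop_id)
        & (df["visitor_location_country_id"] == country_id)
        & (df[room_col] == room_count)
    )
    subset = df.loc[mask, FEATURES].copy()
    subset["date_time"] = pd.to_datetime(subset["date_time"])
    return subset.sort_values("date_time").reset_index(drop=True)


# Made-up rows for illustration only; in practice df would come from pd.read_csv("train.csv").
demo = pd.DataFrame({
    "prop_id": [124342, 124342, 999, 124342],
    "visitor_location_country_id": [219, 219, 219, 100],
    "srch_room_count": [1, 1, 1, 1],
    "date_time": ["2013-03-02 10:00:00", "2013-01-15 08:30:00",
                  "2013-02-01 12:00:00", "2013-02-10 09:00:00"],
    "price_usd": [150.0, 120.0, 99.0, 200.0],
    "srch_booking_window": [3, 10, 5, 7],
    "srch_saturday_night_bool": [1, 0, 1, 0],
})

subset = select_hotel_subset(demo)
print(subset)
```

Sorting by `date_time` is a convenience for plotting the price as a time series; it is not required by any of the detectors.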

Getting started

You need an installation of Python, plus the following libraries:

  • numpy
  • pandas
  • matplotlib
  • seaborn
  • scikit-learn (imported as sklearn)

Summary and key findings

Price anomaly detection was performed with four methods: K-means clustering, Isolation Forest, One-Class SVM, and Gaussian distribution (EllipticEnvelope). The methods give noticeably different results:

  • K-means clustering flags anomalies in the mid and high price ranges;
  • Isolation Forest also flags anomalies in the mid and high price ranges, but it finds more anomalies in the high range than K-means clustering does;
  • One-Class SVM flags anomalies in the low, mid, and high price ranges, and it is the only method that flags anomalies in the low range;
  • The Gaussian distribution method (EllipticEnvelope) flags outliers in the upper price range, whereas the other techniques spread their outliers across several ranges.
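A comparison of this kind can be sketched with scikit-learn as shown below. This is an illustration under stated assumptions, not the project's actual code: the data is synthetic, the outlier fraction of 0.02, the three K-means clusters, and the helper names `kmeans_flags` and `model_flags` are all hypothetical choices. For K-means, points are flagged by their distance to the nearest centroid, which is one common way to turn clustering into an anomaly score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

OUTLIER_FRACTION = 0.02  # assumed share of anomalies, not taken from the project


def kmeans_flags(X, n_clusters=3, fraction=OUTLIER_FRACTION, seed=0):
    """Flag the points farthest from their assigned K-means centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    n_out = int(fraction * len(X))
    flags = np.zeros(len(X), dtype=bool)
    if n_out > 0:
        flags[np.argsort(dist)[-n_out:]] = True
    return flags


def model_flags(model, X):
    """Fit a scikit-learn outlier detector and mark predictions of -1 as anomalies."""
    return model.fit(X).predict(X) == -1


# Synthetic stand-in for (price_usd, srch_booking_window); values are made up.
rng = np.random.default_rng(0)
raw = np.column_stack([rng.normal(120, 15, 500), rng.integers(0, 60, 500)])
raw[:5, 0] = [400, 420, 20, 450, 15]  # a few injected extreme prices

X = StandardScaler().fit_transform(raw)

results = {
    "K-means": kmeans_flags(X),
    "Isolation Forest": model_flags(
        IsolationForest(contamination=OUTLIER_FRACTION, random_state=0), X),
    "One-Class SVM": model_flags(
        OneClassSVM(nu=OUTLIER_FRACTION, kernel="rbf", gamma="scale"), X),
    "Elliptic Envelope": model_flags(
        EllipticEnvelope(contamination=OUTLIER_FRACTION, random_state=0), X),
}

for name, flags in results.items():
    prices = np.sort(np.round(raw[flags, 0], 1))
    print(f"{name}: {flags.sum()} flagged, prices: {prices}")
```

Printing the flagged prices per method makes it easy to see which price ranges each detector targets, which is the kind of difference summarized in the list above.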

After building the models, we cannot tell how well they perform, because there is no ground truth to test them against. The results of these methods therefore need to be validated in the field before any of them is placed in a critical path.
