Giter Club home page Giter Club logo

disaster_response_pipelines's Introduction

Disaster Response Pipeline Project

Trained classifier result from input text "Need tents, water and food. Children are sick with fever."

The goal of this project:

  • To train a machine learning model that will classify any input message into 36 categories.

    • The categories are: related, request, offer, aid_related, medical_help, medical_products, search_and_rescue, security, military, child_alone, water, food, shelter, clothing, money, missing_people, refugees, death, other_aid, infrastructure_related, transport, buildings, electricity, tools, hospitals, shops, aid_centers, other_infrastructure, weather_related, floods, storm, fire, earthquake, cold, other_weather, direct_report.

The dataset:

sqlite> SELECT COUNT(*) FROM messages;
COUNT(*)
26180
sqlite> SELECT SUM(related) FROM messages;
SUM(related)
20067
sqlite> SELECT SUM(child_alone) FROM messages;
SUM(child_alone)
0
  • Cleaned data contains 26180 samples and is stored in 2 csv tables: disaster_categories.csv, disaster_messages.csv. The data is imbalanced in such way that some categories are present much frequently ("related" has 20067 occurrences) than others ("child_alone" has 0 occurences).

    • Table "disaster_categories.csv" contains columns:

      • "id" - Unique identifier of a sample message.
      • "categories" - Labeles (categories) of a sample message.
    • Table "disaster_messages.csv" contains columns:

      • "id" - Unique identifier of a sample message.
      • "message" - Sample text (input) for a machine learning model.
      • "original" - Sample text before translation into English.
      • "genre" - One of 3 sources the data was collected from.
        • News
        • Direct
        • Social

Steps to complete:

  • ETL

    • Extract from csv tables into Pandas DataFrame.
    • Transform into the multilabeled dataset and clean (remove duplicates). Then dump to SQL (sqlite3) DB.
    • Load from SQL DB.
  • Machine Learning

    • Pipeline

      • Vectorization (creating sparse matrix of vocabulary tokens). Using custom tokenizer:

        1. Converting text to lowercase.
        2. Replacing any URLs with "urlplaceholder".
        3. Removing any non-alphanumeric symbols.
        4. Keeping only words that are not in NLTK English stopwords.
        5. Lemmatization
        6. Stemming
      • Normalization (Term frequency inverse document frequency) of the values in the vocab tokens matrix on the scale from 0 to 1.

      • Classification (applying final estimator as MultiOutputClassifier).

    • Training includes choosing the best hyperparameters combination with GridSearchCV.

    • Evaluation of the best model on the test set and printing output of classification_report with "precision", "recall" and "f1-score" on each of 36 categories and total summary.

  • Deployment

    • Website with Flask web framework
    • SQLite database
    • Trained model saved in pickle file
    • Plotly charts

Trained model evaluation:

Labels Precision Recall F1-score Support
related 0.84 0.87 0.85 984
request 0.67 0.41 0.51 236
offer 1.00 0.00 0.00 10
aid_related 0.72 0.63 0.67 550
medical_help 0.61 0.24 0.34 106
medical_products 0.66 0.32 0.43 60
search_and_rescue 0.67 0.27 0.38 30
security 1.00 0.00 0.00 24
military 0.62 0.40 0.48 53
child_alone 1.00 1.00 1.00 0
water 0.67 0.78 0.72 86
food 0.79 0.77 0.78 137
shelter 0.76 0.61 0.67 112
clothing 0.56 0.50 0.53 20
money 0.43 0.10 0.17 29
missing_people 0.67 0.29 0.40 7
refugees 0.74 0.30 0.42 57
death 0.71 0.64 0.67 55
other_aid 0.48 0.11 0.19 174
infrastructure_related 0.44 0.05 0.09 83
transport 0.75 0.16 0.27 55
buildings 0.86 0.26 0.40 72
electricity 0.63 0.38 0.47 32
tools 1.00 0.00 0.00 7
hospitals 0.00 0.00 0.00 11
shops 1.00 0.00 0.00 7
aid_centers 1.00 0.00 0.00 16
other_infrastructure 1.00 0.00 0.00 56
weather_related 0.81 0.80 0.80 361
floods 0.85 0.60 0.70 102
storm 0.70 0.70 0.70 126
fire 0.67 0.67 0.67 9
earthquake 0.86 0.88 0.87 128
cold 0.59 0.37 0.45 27
other_weather 0.56 0.27 0.36 75
direct_report 0.63 0.36 0.46 257
--- --- --- --- ---
micro avg 0.76 0.58 0.66
macro avg 0.72 0.38 0.43
weighted avg 0.74 0.58 0.62
samples avg 0.77 0.72 0.60

Requirements

  • click==8.0.1
  • Flask==2.0.1
  • greenlet==1.1.1
  • itsdangerous==2.0.1
  • Jinja2==3.0.1
  • joblib==1.0.1
  • MarkupSafe==2.0.1
  • nltk==3.6.2
  • numpy==1.21.2
  • pandas==1.3.3
  • plotly==5.3.1
  • python-dateutil==2.8.2
  • pytz==2021.1
  • regex==2021.8.28
  • scikit-learn==0.24.2
  • scipy==1.7.1
  • six==1.16.0
  • sklearn==0.0
  • SQLAlchemy==1.4.23
  • tenacity==8.0.1
  • threadpoolctl==2.2.0
  • tqdm==4.62.2
  • Werkzeug==2.0.1

Instructions

  1. Run the following commands in the project's root directory to set up your database and model.

    • To run ETL pipeline that cleans data and stores in database

       python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/DisasterResponse.db
    • To run ML pipeline that trains classifier and saves

      python models/train_classifier.py data/DisasterResponse.db models/classifier.pkl
  2. Run the following command in the app's directory to run your web app.

    python run.py
  3. Go to http://0.0.0.0:3001/

disaster_response_pipelines's People

Contributors

abitfrosty avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.