Giter Club home page Giter Club logo

sparkify's Introduction

README

Installations

The following packages are needed to run the notebook:

  • numpy 1.16.2
  • python 3.7.2
  • pandas 0.24.2
  • numpy 1.16.2
  • pyspark 2.4.0
  • matplotlib 3.0.3
  • ua-parser 0.8.0

The sample data mini_sparkify_event_data.json of 128MB is not provided in this repository but can be accessed through the Spark Project Workspace at Udacity's Data Science Nanodegree. However, the transformed data transformed_user_data which is used in the modeling and tuning section is available in this repository.

Project Motivation

A common problem in service providing companies (e.g. Spotify or Netflix) is the prediction of churn. Churn is, simply put, the loss of clients and it is a very common KPI to track. Predicting churn allows the companies to take action to avoid the loss of customers before it happens, helping to increase their customer base. Iy is a complex problem that involves handling large amounts of data. This notebook sets out to demonstrate an approach to solving this problem using Spark and a sample event dataset.

The tasks of this project are:

  • Load large datasets into Spark and manipulate them using SparkSQL or Spark dataframes.
  • Using Spark dataframes, engineer features that will help in the prediction of churn.
  • Use Spark ML to build and tune machine learning models that will predict churn

At the end of this work the reader should have a better understanding of the problems faced when dealing with churn prediction. This medium article gathers the results of the project.

Repository Contents:

This repository contains:

  • A jupyter notebook (Sparkify.ipynb)containing the main core of work of this project.
  • A README file (README.md) describing the project purpose and how to interact with it.
  • A parquet file (transformed_user_data) containing the transformed data after the ETL process.
  • a picture folder with the analysis results of the jupyter notebook.

To run this project locally you'll need to download the repository to your machine and obtain the mini_sparkify_event_data.json file. This can be obtained at Udacity's DataScience Nanodegree. You can also use other sample datasets with the same schema as the one presented in the notebook. This work was performed on a sample dataset of 128MB. It is still posible to run the modeling and tuning section (5) of the notebook without the mini_sparkify_event_data.json file, since a transformed data set is provided with the repository.

Notebook structure:

  1. Import modules and create spark session
  2. Load and Clean Dataset
  3. Exploratory Data Analysis
    1. Explore Data
    2. User behaviour
    3. Define Churn
  4. Feature Engineering
    1. Event-based features
    2. User-based features
    3. Save transformed data
  5. Modeling
    1. Load and split data
    2. Define Evaluators
    3. Training and tuning
      • Logistic Regression Classifier
      • Random Forest Classifier
      • Multilayer Perceptron Classifier

sparkify's People

Contributors

pedro-hf avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.