
customer_churn_prediction_using_pyspark's Introduction

Customer Churn Prediction in OTT Platforms

  • This repository was created in partial fulfillment of the requirements for the Big Data Analytics course offered by the Master of Science in Business Analytics program at the Carlson School of Management, University of Minnesota.
  • The project was completed in collaboration with Yi Fang, Sridhar Iyer, Jayadev KP, Yufan Li, and Chandra Mouli Kambhampati.

Executive Summary

Over-the-top (OTT) platforms in the United States have seen tremendous growth in the past couple of years. The United States OTT market is expected to register a CAGR of approximately 11.22% in revenue and approximately 2.25% in subscribers over the forecast period (2021-2026). The COVID-19 pandemic has accelerated an already intensifying shift from cable to digital, and the entertainment industry is moving from TV and pay-per-view to subscription-based viewing.

But there is intensifying competition too. In the past year alone, more than 300 new subscription-based media companies have joined the OTT market. Everyone from news companies to lifestyle companies is introducing OTT platforms. The market is heading towards gradual consolidation. Therefore, user numbers are one of the most critical metrics to these OTT platforms in this competitive landscape. There are already signs of saturation in signing new customers, and customer attention is also dropping due to media fatigue. High churn has had OTT platforms scrambling to retain customers by offering higher personalization and incentives. Platforms that have taken a proactive data-backed approach toward customer retention have enjoyed comparatively lower churn rates (e.g., Netflix).

Our project aims to combine big data technologies with machine learning to help OTT platforms predict churn at the customer level in real time, so that their customer engagement teams can take timely action. We have also built a real-time dashboard that helps them understand churn by geography, subscriber type, and other metrics. We primarily use Amazon S3 for data storage, Databricks for data processing, Spark MLlib for machine learning, and Amazon QuickSight for visualization. We further envision that this setup can be extended to support personalization and A/B testing.

Project Description

The objective of this project is to predict customer churn, that is, whether a listener is likely to cancel their subscription. Our goal is to identify these customers from their interactions with the website. The dataset contains 18 features that can be used to predict the probability of churn.
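
The churn definition above can be sketched as a labeling step. This is a minimal illustration in plain Python rather than Spark, so it runs without a cluster. The field names `userId` and `page`, and the page value "Cancellation Confirmation" used to mark a cancelled subscription, are assumptions about the event log and not confirmed by this README.

```python
from collections import defaultdict

# Assumed event shape: each event is a dict with "userId" and "page" keys.
# "Cancellation Confirmation" is an assumed page name for a completed cancellation.
CHURN_PAGE = "Cancellation Confirmation"


def label_churn(events):
    """Return {userId: 1 if the user ever reached the churn page, else 0}."""
    labels = defaultdict(int)
    for event in events:
        user = event["userId"]
        # Touching labels[user] ensures non-churned users still get a 0 label.
        labels[user] = max(labels[user], int(event["page"] == CHURN_PAGE))
    return dict(labels)


# Tiny made-up sample, for illustration only.
sample_events = [
    {"userId": "u1", "page": "NextSong"},
    {"userId": "u1", "page": "Cancellation Confirmation"},
    {"userId": "u2", "page": "NextSong"},
    {"userId": "u2", "page": "Home"},
]

churn_labels = label_churn(sample_events)
print(churn_labels)  # {'u1': 1, 'u2': 0}
```

On the full dataset the same idea would normally be expressed as a grouped aggregation over users in Spark; the loop here only shows the logic.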

Project Process

[Figure: project process diagram]

Data description:

This is a public dataset, referred to here as the Million Song Dataset, prepared by Udacity and available for download in JSON format. It contains 18 columns covering customer information (gender, name, etc.) and API events (login, playing the next song, etc.). The experiment period runs from 2018-10-01 to 2018-12-01. The full data set is a 12 GB user log hosted in an AWS S3 bucket.
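
To make the shape of the log concrete, here is a small sketch that parses JSON-lines records and keeps only events inside the experiment period. It uses plain Python so it can be run locally. The field names `userId`, `page`, and `ts`, the assumption that `ts` is a Unix timestamp in milliseconds, and the choice of UTC for the window are all illustrative assumptions; the two sample records are made up.

```python
import json
from datetime import datetime, timezone

# Experiment window from the data description (UTC is an assumption).
START = datetime(2018, 10, 1, tzinfo=timezone.utc)
END = datetime(2018, 12, 1, tzinfo=timezone.utc)


def parse_events(lines):
    """Parse JSON-lines text into dicts, keeping events inside the window."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        # "ts" is assumed to be a Unix timestamp in milliseconds.
        when = datetime.fromtimestamp(event["ts"] / 1000, tz=timezone.utc)
        if START <= when <= END:
            kept.append(event)
    return kept


# Made-up records: the first falls in October 2018, the second in January 2019.
raw_lines = [
    '{"userId": "u1", "gender": "F", "page": "Login", "ts": 1539000000000}',
    '{"userId": "u2", "gender": "M", "page": "NextSong", "ts": 1546300800000}',
]

in_window = parse_events(raw_lines)
print([e["userId"] for e in in_window])  # ['u1']
```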

To access the dataset:

The dataset is available on S3 and can be accessed as follows:

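Step 1: Create a Spark session

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session for the Sparkify analysis
spark = (
    SparkSession.builder
    .appName("Sparkify")
    .getOrCreate()
)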

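Step 2: Read in the Sparkify dataset

# Full 12 GB event log
full_data = "s3n://udacity-dsnd/sparkify/sparkify_event_data.json"
# Smaller sample of the log, useful for quick experiments
mini_data = "s3n://udacity-dsnd/sparkify/mini_sparkify_event_data.json"

# Load the full dataset into a DataFrame
df = spark.read.json(full_data)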

EDA for the dataset through the dashboard

[Figure: EDA dashboard]

Target audience

The target audience of our analysis is digital music service firms and their stakeholders who are interested in preventing customer churn.

Big Data Tools Used

Ingestion, ETL, exploration, analysis: Amazon S3, Databricks (Spark)
Visualization: Amazon QuickSight

Instructions to run the scripts

  • Sparkify_code.ipynb is the first script to run. It performs data pre-processing, feature engineering, and model building.
  • Stream.ipynb is the second script to run. It assumes the customer churn model has already been built by the previous script and uses that model to perform streaming analysis.
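
The two-stage flow above (build a model first, then apply it to a stream) can be illustrated with a rough sketch. This is not the notebooks' code: it uses plain Python instead of Spark streaming, the weights in `WEIGHTS` are hypothetical stand-ins for a trained model, and the page names "Thumbs Down" and "NextSong" are assumptions about the event log. It only shows the idea of updating per-user features batch by batch and re-scoring the users seen in each batch.

```python
import math
from collections import defaultdict

# Hypothetical weights standing in for a trained churn model (not from the project).
WEIGHTS = {"bias": -1.0, "thumbs_down": 0.8, "songs": -0.05}


def score(features):
    """Logistic score between 0 and 1 computed from running per-user features."""
    z = (
        WEIGHTS["bias"]
        + WEIGHTS["thumbs_down"] * features["thumbs_down"]
        + WEIGHTS["songs"] * features["songs"]
    )
    return 1.0 / (1.0 + math.exp(-z))


def process_batches(batches):
    """Update per-user features for each micro-batch and yield fresh scores."""
    state = defaultdict(lambda: {"thumbs_down": 0, "songs": 0})
    for batch in batches:
        touched = set()
        for event in batch:
            feats = state[event["userId"]]
            if event["page"] == "Thumbs Down":
                feats["thumbs_down"] += 1
            elif event["page"] == "NextSong":
                feats["songs"] += 1
            touched.add(event["userId"])
        # Re-score only the users that appeared in this micro-batch.
        yield {user: score(state[user]) for user in touched}


# Two made-up micro-batches of events.
batches = [
    [{"userId": "u1", "page": "NextSong"}, {"userId": "u2", "page": "Thumbs Down"}],
    [{"userId": "u2", "page": "Thumbs Down"}],
]

results = list(process_batches(batches))
for i, scores in enumerate(results):
    print(i, sorted(scores.items()))
```

With these made-up weights, a user who keeps giving thumbs down sees their score rise from one batch to the next, which is the kind of signal a customer engagement team would act on.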

