
predict-customer-churn's Introduction

A Machine Learning Framework with an Application to Predicting Customer Churn

This project demonstrates how to apply a three-step, general-purpose framework to solve problems with machine learning. The framework provides scaffolding for rapidly developing machine learning solutions across industries and datasets.

The end outcome is both a specific solution to a customer churn use case, with a reduction of more than 10% in revenue lost to churn, and a general approach you can use to solve your own problems with machine learning.

Framework Steps

  1. Prediction Engineering
  • State the business need
  • Translate the business requirement into a machine learning task by specifying problem parameters
  • Develop a set of labels, along with cutoff times, for supervised machine learning
  2. Feature Engineering
  • Create features (predictor variables) out of raw data
  • Use cutoff times to make valid features for each label
  • Apply automated feature engineering to automatically make hundreds of relevant, valid features
  3. Modeling
  • Train a machine learning model to predict labels from features
  • Use a pre-built solution with common libraries
  • Optimize the model in line with business objectives
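The prediction-engineering step can be illustrated with a minimal labeling sketch (a hypothetical simplification, not the repository's actual code): pick a cutoff time per customer and assign a churn label using only what happens after the cutoff, while features may use only data from before it.

```python
from datetime import datetime, timedelta

def make_label(transactions, cutoff, churn_window=timedelta(days=31)):
    """Label a customer as churned if no transaction occurs within
    `churn_window` after the cutoff time. Only data at or before the
    cutoff may be used to build features, preventing label leakage.
    `transactions` is a sorted list of transaction datetimes."""
    future = [t for t in transactions if t > cutoff]
    churned = (not future) or (future[0] - cutoff) > churn_window
    return {"cutoff": cutoff, "label": churned}

# Hypothetical customer with monthly renewals that stop after March
txns = [datetime(2018, 1, 1), datetime(2018, 2, 1), datetime(2018, 3, 1)]
print(make_label(txns, datetime(2018, 3, 15)))  # no renewal follows -> churned
```

Sliding the cutoff earlier (say, to mid-January) yields a non-churn label for the same customer, which is how one dataset produces many valid training examples.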

Machine learning currently is an ad-hoc process requiring a custom solution for each problem. Even for the same dataset, a slightly different prediction problem requires an entirely new pipeline built from scratch. This has made it too difficult for many companies to take advantage of the benefits of machine learning. The standardized procedure presented here will make it easier to solve meaningful problems with machine learning, allowing more companies to harness this transformative technology.

Application to Customer Churn

The notebooks in this repository document a step-by-step application of the framework to a real-world use case and dataset - predicting customer churn. This is a critical need for subscription-based businesses and an ideal application of machine learning.

The dataset is provided by KKBOX, Asia's largest music streaming service, and can be downloaded from the KKBox Churn Prediction Challenge on Kaggle.

Within the overall scaffolding, several standard data science toolboxes are used to solve the problem, including Featuretools for automated feature engineering and Apache Spark with PySpark for parallelization.

Results

The final results comparing several models are shown below:

| Model | ROC AUC | Recall | Precision | F1 Score |
|-------|---------|--------|-----------|----------|
| Naive Baseline (no ML) | 0.5 | 3.47% | 1.04% | 0.016 |
| Logistic Regression | 0.577 | 0.51% | 2.91% | 0.009 |
| Random Forest Default | 0.929 | 65.2% | 14.7% | 0.240 |
| Random Forest Tuned for 75% Recall | 0.929 | 75% | 8.31% | 0.150 |
| Auto-optimized Model | 0.927 | 2.88% | 64.4% | 0.055 |
| Auto-optimized Model Tuned for 75% Recall | 0.927 | 75% | 9.58% | 0.170 |
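As a sanity check, the F1 scores above follow from the reported precision and recall via the harmonic mean, F1 = 2PR / (P + R):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Random Forest Default: precision 14.7%, recall 65.2% -> F1 of 0.240
print(round(f1(0.147, 0.652), 3))
# Random Forest Tuned for 75% Recall: precision 8.31% -> F1 of 0.150
print(round(f1(0.0831, 0.75), 3))
```

The trade-off is visible in the table: forcing recall to 75% drops precision, pulling F1 down even though ROC AUC is unchanged.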

Final Confusion Matrix

Feature Importances

Notebooks

  1. Partitioning Data: separate the data into independent subsets so operations can run in parallel.
  2. Prediction Engineering: create labels based on the business need and historical data.
  3. Feature Engineering: implement an automated feature engineering workflow using label times and raw data.
  4. Feature Engineering on Spark: parallelize feature engineering calculations by distributing them across multiple machines.
  5. Modeling: develop machine learning algorithms to predict labels from features; use automated genetic search tools to find the best model.

Feature Engineering with Spark

To scale the feature engineering to a large dataset, the data was partitioned and automated feature engineering was run in parallel using Apache Spark with PySpark.

Featuretools natively supports scaling to multiple cores on one machine, or to multiple machines using a Dask cluster. However, this approach shows that Spark can also be used to parallelize feature engineering, reducing run times even on large datasets.
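The partition-and-map pattern behind this approach can be sketched on a single machine with the standard library (a stand-in for the Spark job, not the actual PySpark code; `build_features` and the partition layout are hypothetical placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

def build_features(partition):
    """Stand-in for running automated feature engineering (e.g. ft.dfs)
    on one independent data partition. Real code would return a feature
    matrix; here we just summarize the partition."""
    return {"partition": partition["id"], "n_rows": len(partition["rows"])}

# Four hypothetical partitions of increasing size
partitions = [{"id": i, "rows": list(range(100 * (i + 1)))} for i in range(4)]

# Map the feature computation across partitions in parallel; Spark's
# map over an RDD/DataFrame does the same across machines instead of threads.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(build_features, partitions))
print(results)
```

Because the partitions from notebook 1 are independent, no shuffling is needed: each worker computes its feature matrix and the results are simply concatenated.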

The notebook Feature Engineering on Spark demonstrates the procedure. The article Featuretools on Spark documents the approach.

Feature Labs

Featuretools

Featuretools is an open source project created by Feature Labs. To see the other open source projects we're working on visit Feature Labs Open Source. If building impactful data science pipelines is important to you or your business, please get in touch.

Contact

Any questions can be directed to [email protected]

predict-customer-churn's People

Contributors

dependabot[bot], gsheni, kmax12, rwedge, thehomebrewnerd, willkoehrsen



predict-customer-churn's Issues

Issue when running "3. Feature Engineering.ipynb"

I was running the notebook (3. Feature Engineering.ipynb) in Colab. It fails with the error "tuple is not allowed for map key". Can you help me solve it? Thanks.

Question regarding "3. Feature Engineering.ipynb"

In the notebook under the "Interesting Values" section, you're using "is_cancel" as a feature. Doesn't that create a bias since you're providing future information to the model telling it that the customer has cancelled the subscription?

sample_submission_v2.csv

Hi,
I'm looking for the file sample_submission_v2.csv. Should I build it from Train....csv using train_test_split?

What does DIFF() measure exactly?

I have a dataframe of person information that includes each person's quality of health (from 0 to 100)

e.g. if I extract just the Name and HealthQuality columns:

John: 70
Mary: 20
Paul: 40

etc.

After applying featuretools I noticed a new DIFF(HealthQuality) variable.
Difference between what exactly?

According to the docs, this is what DIFF does:

"Compute the difference between the value in a list and the previous value in that list."

Is featuretools calculating the difference between Mary and John's health quality in this instance?
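Going by the quoted definition, yes: the primitive's arithmetic can be reproduced in a few lines of pure Python (a sketch of the behavior, not Featuretools' actual implementation; note that Featuretools typically orders rows by the dataframe's time index before differencing):

```python
def diff(values):
    """Sketch of the DIFF primitive: each value minus the previous
    value in the list; the first entry has no predecessor."""
    return [None] + [curr - prev for prev, curr in zip(values, values[1:])]

# The HealthQuality values from the question, in row order
print(diff([70, 20, 40]))  # [None, -50, 20]
```

So with John, Mary, Paul in that order, DIFF(HealthQuality) for Mary is indeed 20 - 70 = -50.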

Issue with customers having a single transaction

Hi, I was trying out the code in "2. Prediction Engineering.ipynb", particularly `label_customer()`, with my own dataset, which contains several customers who have only a single transaction. For example, if a customer registered for the service on 10th Jan and the membership expired on 15th Feb, the code labels this as a false churn, since there is no next transaction date. Is this a bug?

Potential data leakage due to some features?

Firstly, thanks for the detailed notebooks! I learnt a lot from them.

In https://github.com/Featuretools/predict-customer-churn/blob/main/churn/3.%20Feature%20Engineering.ipynb,
I assume that the features num_25, num_50, num_75, num_985, num_100, and num_unq, etc represent the number of specific types of transactions (e.g., number of songs played with 25% completion, 50% completion, etc.) for each customer.

Wouldn't using these features (and various transformations/aggregations of them) in the windows lead to data leakage?

Basically, we would be using information from the future, or from outside the given time window, to create features or train a model.

I have a similar use case where we have a few customer columns, such as ltv and nr_orders, which reflect values "as of today". I am not sure how to handle these in the windows that are created.

Issues about notebook "3. Feature Engineering.ipynb"

After running the Jupyter notebook "3. Feature Engineering.ipynb" in update-reqs, I get two errors:

(1) In cell [39]:
ValueError: ('Unknown transform primitive weekend. ', 'Call ft.primitives.list_primitives() to get', ' a list of available primitives')
I think this is because the transform primitives are specified as trans_primitives = ['weekend', 'cum_sum', 'day', 'month', 'diff', 'time_since_previous'], but 'weekend' is not a primitive; 'is_weekend' is.

(2) In cell [42]:
'AttributeError: Cutoff time DataFrame must contain a column with either the same name as the target entity time_index or a column named "time" '
I have checked the cutoff_time and members dataframes; they share the column "msno", so I have no idea what causes this issue.
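Both reported errors suggest small fixes, sketched here under the assumption that the surrounding notebook code is otherwise unchanged and that the column names below match the notebook's:

```python
import pandas as pd

# (1) 'is_weekend' is the valid primitive name; 'weekend' raises the ValueError
trans_primitives = ['is_weekend', 'cum_sum', 'day', 'month',
                    'diff', 'time_since_previous']

# (2) The cutoff-time DataFrame must carry either the target dataframe's
# time-index name or a column literally named "time"; renaming satisfies it.
cutoff_time = pd.DataFrame({
    "msno": ["a", "b"],
    "cutoff_time": pd.to_datetime(["2018-01-01", "2018-02-01"]),
})
cutoff_time = cutoff_time.rename(columns={"cutoff_time": "time"})
print(list(cutoff_time.columns))  # ['msno', 'time']
```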
