Customer-Behavior-Prediction

These scripts are used to predict wether a website customer will make a purchase in an online store based on their browsing history. They could be used as part of a realtime webpage control system in order to improve customer purchase behavior. The system would compute the buying probability given the current customer history combined with each of several possible webpage responses. The best response say, 'make button b larger and display ad C') would be actually executed.

The learning component could be periodically run offline on historical data

This directory contains the data files, scripts, outputs and plots.

Strategies

There are three learning strategies implemented - one (1) based simply on total time the user spends on the webpage; the other two based on a Naive Bayes classifier approach, using either the individual pages (2a) visited or the sequences of 3 pages (2b) visited.

Strategy 2b has the greatest area under the precision - recall curve, so this one is the best of the three. Its results marking the test.csv data are in predict2b.dat.

Probability of Buy

The files predict* are generated using the indicated strategies, marking the final element of each line with a number between 0 and 1. This number represents the predicted probability of buy : 1 is a certain buy, 0 is a certain no buy. To convert to a binary classification, simply select a threshold between 0 and 1. Then mark the lines with p_buy greater than the threshold as a buy and those less as a no-buy. Note that different choices of threshold will correpond to different amounts of false positives, false negatives, true negatives, and true positives. A higher threshold will produce fewer false positives but more false negatives. The overall p_buy in the training data is ~0.23. A threshold of ~0.4 is suggested.

The scorer.py script is used to evaluate the strategies, by training against either half or all the training data and then evaulating the perfomrance of each strategy using either the other half of the training data or all the training data as the source of truth.

The *.png files are the plots generated to analyze the training data and understand the performance of the strategies.

See comments in the Strat1.py and Strat2.py scripts for more information about the method, assumptions, and intermediate results.

libardo1 / customer-behavior-prediction Goto Github PK