Performance measures for a logistic regression model of customer churn at a telecom company
Customer churn is an important metric for service companies such as telecommunications providers. Retaining customers is more cost-effective than acquiring new ones (Gallo, 2014). A good predictive model can therefore help organizations anticipate and prevent customer churn. In this project, five performance measures are calculated for the telecommunications company's logistic regression model of customer churn: two use all variables, and three use the most significant predictors. A series of screenshots, a procedure summary, an interpretation of the results, and impressions of the experience follow.
The project calls getwd() to confirm the current working directory and read.csv() to read in the telecommunications CSV file, which is saved in the telco data frame. The str() command displays the telco object's structure, as seen in Figure 1: the number of observations and variables, and a list of the independent and dependent variables with their characteristics. After the caret package is installed and loaded with library(), the script that prepares the data and partitions it into training and testing sets is run, as seen in Figure 2. The intrain object holds the rows selected for the training set, drawn with respect to the dependent variable at a proportion of .7, so 70% of the observations go to training and the remaining 30% to testing.
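The read-and-partition step can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the CSV file name is hypothetical, a small synthetic data frame stands in for the real file so the sketch runs on its own, and createDataPartition() is assumed to be the caret function behind intrain. If caret is not installed, the sketch falls back to a plain (unstratified) base R sample.

```r
# In the project the data come from a CSV file in the working directory:
#   getwd()                                # show the current working directory
#   telco <- read.csv("telco_churn.csv")   # hypothetical file name
# A small synthetic stand-in keeps this sketch self-contained.
set.seed(2022)
telco <- data.frame(
  tenure         = sample(1:72, 200, replace = TRUE),
  MonthlyCharges = round(runif(200, 20, 120), 2),
  Churn          = factor(sample(c("No", "Yes"), 200, replace = TRUE,
                                 prob = c(0.73, 0.27)))
)
str(telco)  # observations, variables, and each variable's type

# Select about 70% of the rows for training.
if (requireNamespace("caret", quietly = TRUE)) {
  # Stratified on the dependent variable (assumed to match the project).
  intrain <- caret::createDataPartition(telco$Churn, p = 0.7, list = FALSE)
} else {
  # Base R fallback: simple random sample, not stratified.
  intrain <- sample(nrow(telco), size = round(0.7 * nrow(telco)))
}

training <- telco[intrain, ]
testing  <- telco[-intrain, ]
print(dim(training))
print(dim(testing))
```

Either way, intrain is a set of row indices rather than a data set of its own; the training and testing data frames are produced by indexing telco with it.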
Next, the logistic model for the churn variable is fit to the training data with the glm() function and the binomial family, as demonstrated in Figure 3. The model summary shows the most significant predictors of churn. Figure 4 evaluates the model error rate on the testing data by recoding the churn response so that “yes” equals 1 and then calling the predict() function. Any predicted value over 0.5 is classified as “yes” for churn, and all other values as “no.” The mean() function then gives the share of testing observations whose predicted class differs from the actual class, which is the misclassification error rate. Figure 5 calculates and prints the logistic model accuracy, one minus that error rate, using the paste() function and the misClasificError variable. Figure 6 displays the confusion matrix, built with the table() command from the fitted.results predictions. Finally, Figure 7 contains four images showing performance measures for further models: one for each of the three most significant predictors on its own, and one for the three predictors together in the last image. These use the same script as above, with modifications for each predictor.
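The fit-and-evaluate steps described above might look roughly like the sketch below. It uses only base R and a synthetic data set, so the predictor names, coefficients, and the LogModel object name are assumptions for illustration; only glm(), predict(), mean(), paste(), table(), fitted.results, and misClasificError come from the description.

```r
# Synthetic stand-in for the telco data (illustrative values only).
set.seed(42)
n <- 500
tenure  <- runif(n, 0, 72)
monthly <- runif(n, 20, 120)
p_churn <- plogis(0.5 - 0.05 * tenure + 0.01 * monthly)
toy <- data.frame(
  tenure         = tenure,
  MonthlyCharges = monthly,
  Churn          = factor(ifelse(runif(n) < p_churn, "Yes", "No"))
)
idx      <- sample(n, size = 0.7 * n)
training <- toy[idx, ]
testing  <- toy[-idx, ]

# Fit the logistic model on the training data using all predictors.
LogModel <- glm(Churn ~ ., family = binomial(link = "logit"), data = training)
print(summary(LogModel))  # coefficient table shows the significant predictors

# Recode the response so that "Yes" is 1 and "No" is 0.
testing$Churn <- ifelse(testing$Churn == "Yes", 1, 0)

# Predicted probabilities of churn, then a 0.5 cutoff to get classes.
fitted.results <- predict(LogModel, newdata = testing, type = "response")
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)

# Share of predictions that differ from the actual class.
misClasificError <- mean(fitted.results != testing$Churn)
print(paste("Logistic Regression Accuracy", 1 - misClasificError))

# Confusion matrix: rows are actual classes, columns are predictions.
print(table(testing$Churn, fitted.results > 0.5))
```

Because the comparison inside mean() yields TRUE or FALSE for each row, the mean is the proportion of misclassified rows, and accuracy is one minus that value.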
The logistic model identifies Contract, PaperlessBilling, and tenure_group as the most significant predictors. The evaluation on the testing data returns a value of 0.2011385, an error rate of about 20%, so the model is approximately 80% accurate. The printed accuracy of the logistic model confirms this with a value of 0.799. The confusion matrix breaks the errors down by class. Of the 1,704 actual “0” or “no” responses (customers who did not churn), 290 were misclassified as “1” or “yes.” This gives a Class 1 error rate of about 17% (290/1704). Because “yes” is coded as 1 and is therefore the positive class, the matching 83% is the specificity, the ability to correctly identify negative results (Berrier et al., 2018). Of the 404 “yes” responses, 134 were incorrectly classified as “no,” giving a Class 0 error rate of about 33% (134/404) and a sensitivity, the ability to correctly identify positive results, of about 67% (Berrier et al., 2018). Overall, the model predicts churn reasonably well at about 80% accuracy, although it misses about a third of the customers who churned.
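The arithmetic behind these rates can be checked directly from the counts reported above. The following is a minimal base R sketch; the variable names are illustrative, and the only inputs are the four counts given in the text.

```r
# Counts as reported in the text for the all-variable model (testing set).
n_no      <- 1704  # responses labelled "no" (0)
no_as_yes <- 290   # "no" responses classified as "yes"
n_yes     <- 404   # responses labelled "yes" (1)
yes_as_no <- 134   # "yes" responses classified as "no"

# Rebuild the 2 x 2 confusion matrix from those counts.
confusion <- matrix(
  c(n_no - no_as_yes, no_as_yes,
    yes_as_no,        n_yes - yes_as_no),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("no", "yes"), predicted = c("no", "yes"))
)
print(confusion)

# Overall error rate and accuracy.
error_rate <- (no_as_yes + yes_as_no) / sum(confusion)
accuracy   <- 1 - error_rate

# Class-wise error rates and the matching rates of correct classification.
class1_error     <- no_as_yes / n_no    # "no" classified as "yes"
class0_error     <- yes_as_no / n_yes   # "yes" classified as "no"
correct_no_rate  <- 1 - class1_error    # true negative rate when "yes" is positive
correct_yes_rate <- 1 - class0_error    # true positive rate when "yes" is positive

print(round(c(error_rate = error_rate, accuracy = accuracy,
              class1_error = class1_error, class0_error = class0_error,
              correct_no_rate = correct_no_rate,
              correct_yes_rate = correct_yes_rate), 4))
```

The two misclassified counts sum to 424 of 2,108 observations, which reproduces the 0.2011385 error rate returned by the evaluation step.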
Performance measures are also calculated for several models built from the three significant predictors. The contract, paperless billing, and tenure variables are first evaluated independently. All three show that 560 of the 2,108 values in the testing set were classified as “yes” when they should have been “no.” Because 560/2108 is taken over the whole testing set, this is an overall error rate of about 27% and an accuracy of about 73%, lower than the primary model's accuracy of about 80%. Collectively, the three significant predictors show a Class 1 error rate (actual “no” classified as “yes”) of approximately 20%, for a specificity of about 80%, and a Class 0 error rate (actual “yes” classified as “no”) of about 41%, for a sensitivity of about 59%. Overall, these are fair results, but the primary model with all variables considered appears to be the better fit.
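Running the same script once per predictor can be sketched as a small helper applied to a list of model formulas. This is an illustration under stated assumptions: the data are synthetic, the factor levels and the score_model() helper are hypothetical, and only the predictor names Contract, PaperlessBilling, and tenure_group come from the project.

```r
# Synthetic stand-in data with the three predictors named in the text.
set.seed(1)
n <- 600
toy <- data.frame(
  Contract = factor(sample(c("Month-to-month", "One year", "Two year"),
                           n, replace = TRUE)),
  PaperlessBilling = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  tenure_group = factor(sample(c("0-12", "12-24", "24-48", "48-60", "> 60"),
                               n, replace = TRUE))
)
lin <- -1 + 1.5 * (toy$Contract == "Month-to-month") +
  0.5 * (toy$PaperlessBilling == "Yes") +
  1.0 * (toy$tenure_group == "0-12")
toy$Churn <- as.integer(runif(n) < plogis(lin))  # 1 = churned, 0 = stayed

idx      <- sample(n, size = round(0.7 * n))
training <- toy[idx, ]
testing  <- toy[-idx, ]

# Hypothetical helper: fit one model and score it on the testing data.
score_model <- function(formula, training, testing) {
  model <- glm(formula, family = binomial(link = "logit"), data = training)
  prob  <- predict(model, newdata = testing, type = "response")
  predicted <- ifelse(prob > 0.5, 1, 0)
  list(
    error = mean(predicted != testing$Churn),
    confusion = table(actual = testing$Churn,
                      predicted = factor(predicted, levels = c(0, 1)))
  )
}

# One formula per single-predictor model, plus the three together.
formulas <- list(
  contract  = Churn ~ Contract,
  paperless = Churn ~ PaperlessBilling,
  tenure    = Churn ~ tenure_group,
  combined  = Churn ~ Contract + PaperlessBilling + tenure_group
)
results <- lapply(formulas, score_model, training = training, testing = testing)

for (name in names(results)) {
  print(paste(name, "accuracy:", round(1 - results[[name]]$error, 3)))
  print(results[[name]]$confusion)
}
```

Fixing the levels of the predicted factor keeps every confusion matrix 2 x 2, even when a weak single-predictor model assigns all observations to one class.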
Berrier, J., Nestler, S., Pardoe, I., Sturdivant, R. X., & Watts, K. (2018). Fundamentals of Data Analytics R. Zyante Inc.
Gallo, A. (2014, October 29). The value of keeping the right customers. Harvard Business Review. Retrieved on September 14, 2022, from https://hbr.org/2014/10/the-value-of-keeping-the-right-customers