Machine Learning case study

A ride-sharing company (Company X) is interested in predicting rider retention. To help explore this question, a sample dataset is used: a cohort of users who signed up for an account in January 2014. The data was pulled on July 1, 2014, so a user is considered retained if they were "active" (i.e. took a trip) in the 30 days preceding that date. In other words, a user is "active" if they have taken a trip since June 1, 2014.
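The retention rule above can be written as a small labeling function. This is a minimal sketch: it assumes `last_trip_date` arrives as a YYYYMMDD string (the format given in the data description) and that a trip taken exactly on June 1, 2014 counts as active. The names `PULL_DATE`, `CUTOFF` and `is_retained` are illustrative and not part of the dataset.

```python
from datetime import date, datetime, timedelta

PULL_DATE = date(2014, 7, 1)              # day the data was pulled
CUTOFF = PULL_DATE - timedelta(days=30)   # 30 days earlier: June 1, 2014


def is_retained(last_trip_date: str) -> bool:
    """Return True if the last trip falls on or after the cutoff date."""
    last_trip = datetime.strptime(last_trip_date, "%Y%m%d").date()
    # Assumption: the cutoff day itself is included in the active window.
    return last_trip >= CUTOFF
```

For example, `is_retained("20140615")` is true and `is_retained("20140531")` is false under these assumptions.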
The task is not only to build a model that minimizes error, but also one that makes it possible to interpret the factors that contributed to its predictions.
A model's performance is first estimated on the training set and is later evaluated on unseen data in the test set.
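One way to set aside unseen data is a shuffled hold-out split. The sketch below uses only the standard library. The 20% test fraction and the fixed seed are assumptions chosen for illustration (the case study does not specify a split), and `train_test_split` here is a hypothetical helper, not a library function.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and return (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

The test portion should be left untouched until the model is final, so that the reported score reflects data the model has not seen.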
Here is a detailed description of the data:
city
: city this user signed up in
phone
: primary device for this user
signup_date
: date of account registration; in the form YYYYMMDD
last_trip_date
: the last time this user completed a trip; in the form YYYYMMDD
avg_dist
: the average distance (in miles) per trip taken in the first 30 days after signup
avg_rating_by_driver
: the rider's average rating over all of their trips
avg_rating_of_driver
: the rider's average rating of their drivers over all of their trips
surge_pct
: the percent of trips taken with surge multiplier > 1
avg_surge
: the average surge multiplier over all of this user's trips
trips_in_first_30_days
: the number of trips this user took in the first 30 days after signing up
luxury_car_user
: TRUE if the user took a luxury car in their first 30 days; FALSE otherwise
weekday_pct
: the percent of the user's trips occurring during a weekday
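The field descriptions above can be captured as a typed record, which makes the expected types explicit before modeling. This is a sketch under stated assumptions: each raw row is a dictionary of strings keyed by the field names above (as a CSV reader would produce), dates use the YYYYMMDD form, booleans are the literal text TRUE or FALSE, and a blank rating means the value is missing. The `Rider` class and `parse_rider` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass
class Rider:
    city: str
    phone: str
    signup_date: date
    last_trip_date: date
    avg_dist: float                        # miles per trip, first 30 days
    avg_rating_by_driver: Optional[float]
    avg_rating_of_driver: Optional[float]
    surge_pct: float
    avg_surge: float
    trips_in_first_30_days: int
    luxury_car_user: bool
    weekday_pct: float


def _parse_date(text: str) -> date:
    return datetime.strptime(text, "%Y%m%d").date()


def _parse_optional_float(text: str) -> Optional[float]:
    # Assumption: a blank field means the rating is missing.
    return float(text) if text.strip() else None


def parse_rider(row: dict) -> Rider:
    """Convert one raw row of strings into a typed Rider record."""
    return Rider(
        city=row["city"],
        phone=row["phone"],
        signup_date=_parse_date(row["signup_date"]),
        last_trip_date=_parse_date(row["last_trip_date"]),
        avg_dist=float(row["avg_dist"]),
        avg_rating_by_driver=_parse_optional_float(row["avg_rating_by_driver"]),
        avg_rating_of_driver=_parse_optional_float(row["avg_rating_of_driver"]),
        surge_pct=float(row["surge_pct"]),
        avg_surge=float(row["avg_surge"]),
        trips_in_first_30_days=int(row["trips_in_first_30_days"]),
        luxury_car_user=row["luxury_car_user"].strip().upper() == "TRUE",
        weekday_pct=float(row["weekday_pct"]),
    )
```

Parsing up front surfaces malformed dates or non-numeric values as errors at load time rather than as silent problems during training.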