We are predicting the condition of Tanzanian water wells, classifying each well into one of three categories:
- Functional
- Non-Functional
- Functional, Needs Repair
I will work through a variety of models on the data I was provided and choose the best one. I want to submit my predictions to the competition here. I will keep editing and updating my models when possible in order to get the best score I can. The information we gain from these predictions will help with resource management and maintenance so that the people of Tanzania have access to clean water when they need it.
The data provided is from Taarifa and the Tanzanian Ministry of Water. There are a few things to note about the data.
- 59,400 entries with 40 features.
- The classification for each entry is in a separate CSV file.
- There are potentially a lot of null values depending on the column we consider.
- Some of the columns provide the same or very similar information.
- The test set for our submission contains some categorical values that do not appear in the training data; I will have to account for this later when editing.
There is a significant amount of cleaning to do. We will also have to decide which columns to keep, since some provide the same or very similar information as others.
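As a sketch of what that cleaning pass could look like. The column names here (`quantity_group`, `payment_type`, `installer`, `longitude`) are assumptions based on the dataset description, and the placeholder-value rules are illustrative; adjust them to the actual columns:

```python
import numpy as np
import pandas as pd

def clean_wells(df):
    """Sketch of a cleaning pass: drop near-duplicate columns and
    handle obvious placeholder values. Column names are assumed."""
    df = df.copy()
    # Some column pairs carry the same information; keep one of each.
    dupes = [c for c in ["quantity_group", "payment_type"] if c in df.columns]
    df = df.drop(columns=dupes)
    # A longitude of 0 is far outside Tanzania and marks missing GPS data.
    if "longitude" in df.columns:
        df.loc[df["longitude"] == 0, "longitude"] = np.nan
    # Fill missing categoricals with an explicit 'unknown' label.
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].fillna("unknown")
    return df
```

Keeping the rules inside one function makes it easy to apply the exact same cleaning to the training data and the submission test set.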
After cleaning and encoding the categorical variables, I will begin building models. I will first make a train-test split to hold out a final test set, then perform a second train-test split on my training set to create a validation set. This minimizes the influence the final test set has on model selection. I will build the following models and ensemble methods:
- K-Nearest Neighbors
- Naive Bayes
- Random Forest
- Bagging
- Extra Trees
- XGBoost
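The two-stage split described above can be sketched as follows, with small synthetic arrays standing in for the cleaned features and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the cleaned feature matrix and status labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.choice(["functional", "non functional", "needs repair"], size=100)

# First split: hold out a final test set never touched during development.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
# Second split: carve a validation set out of the training data for
# comparing models, so the held-out test set has no influence on selection.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, stratify=y_train_full,
    random_state=42)
```

Stratifying both splits keeps the class proportions roughly equal across train, validation, and test, which matters here given how rare the "needs repair" class is.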
I will also use SMOTE to help balance the classes, since wells classified as needing repair are heavily underrepresented. I will also run a grid search for hyperparameter tuning.
My best model so far is a Random Forest after using SMOTE to balance the classes equally. Here are the current metrics and confusion matrix. My confusion matrix includes the precision of our predicted values. A quick breakdown:
- Precision: Ratio of our correct guesses to the total guesses for that category.
- Recall: Ratio of our correct guesses to the total actual values for that category.
- f1-Score: Harmonic mean of precision and recall; it gives a better single measure of how the model is doing. The closer it is to 1, the better our model predicts that classification.
For a more in-depth explanation of these metrics, go here.
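The three definitions can be checked on a tiny made-up example with scikit-learn, treating "needs repair" as the positive class:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Five made-up wells: true labels vs. a model's guesses.
y_true = ["functional", "needs repair", "needs repair",
          "functional", "needs repair"]
y_pred = ["needs repair", "needs repair", "functional",
          "functional", "needs repair"]

# Precision: of the 3 'needs repair' guesses, 2 were right -> 2/3.
p = precision_score(y_true, y_pred, pos_label="needs repair")
# Recall: of the 3 wells that actually need repair, 2 were found -> 2/3.
r = recall_score(y_true, y_pred, pos_label="needs repair")
# f1: harmonic mean of the two, 2pr / (p + r).
f = f1_score(y_true, y_pred, pos_label="needs repair")
```

With precision and recall both at 2/3, the f1-score also comes out to 2/3; in general it sits between the two and is pulled toward the lower one.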
| | precision | recall | f1-score |
|---|---|---|---|
| functional | 0.81 | 0.87 | 0.84 |
| needs repair | 0.49 | 0.32 | 0.39 |
| non functional | 0.83 | 0.78 | 0.80 |
The X-axis is what our model predicts, and the Y-axis is the actual value. With each column normalized by the total predictions for that class, the diagonal of this confusion matrix gives the precision of each classification.
Here are the top 10 features my model is using to make classifications.
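A sketch of how a top-10 list like this can be pulled from a fitted Random Forest; synthetic data and generic feature names stand in for the encoded well features:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the encoded well features.
X, y = make_classification(n_samples=200, n_features=15,
                           n_informative=5, random_state=42)
feature_names = [f"feat_{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(random_state=42).fit(X, y)

# feature_importances_ sums to 1 across all features; sort and keep 10.
top10 = (pd.Series(rf.feature_importances_, index=feature_names)
         .sort_values(ascending=False)
         .head(10))
```

One caveat worth keeping in mind: impurity-based importances tend to favor high-cardinality features, which matters for columns like `installer` once they are encoded.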
Here I explore some of the features my model uses to predict our classifications.
This gives us a quick look at how each classification is spread across Tanzania. A significant number of non-functional wells are not only in highly populated areas, but also along the borders and in more remote areas.
This gives the percentage of wells that are non-functional for each well type.
This gives the percentage of wells that are non-functional for each water quantity group.
This gives us the top 5 well installers in Tanzania and the percentage of wells they made. The Department of Water Engineer (DWE) has made almost 60% of all the wells in Tanzania.
I want to continue refining the dataframe and which columns are considered. I would also love to engineer features where possible. A feature built from longitude and latitude would be my first choice: most likely picking a center point on the map and seeing how the distance from that point affects the prediction of a well's condition. I also want to take a closer look at some of the features and find a better way to categorize their values. For example, the installer feature has a large number of unique installers, many of which have built only one well. I want to find a better way to group them, and perhaps avoid an "other" bucket, since that may become the majority class after combining them all. I would also like to continue hyperparameter tuning to get a better estimator for my model.
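As a sketch of that distance-feature idea, a haversine (great-circle) distance from a chosen center point. The coordinates below are a hypothetical center near the middle of Tanzania, not values taken from the data:

```python
import numpy as np

def haversine_km(lat, lon, lat0, lon0):
    """Great-circle distance in km from (lat0, lon0); inputs in degrees."""
    lat, lon, lat0, lon0 = map(np.radians, (lat, lon, lat0, lon0))
    a = (np.sin((lat - lat0) / 2) ** 2
         + np.cos(lat) * np.cos(lat0) * np.sin((lon - lon0) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))  # Earth radius ~6371 km

# Hypothetical center point, roughly central Tanzania.
CENTER = (-6.2, 35.7)
# df["dist_from_center"] = haversine_km(df["latitude"], df["longitude"], *CENTER)
```

Because the function is vectorized with NumPy, it can be applied to the whole latitude/longitude columns at once; the rows with missing GPS coordinates would simply get NaN distances to impute later.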