A stroke is a medical condition in which poor blood flow to the brain causes cell death [1]. Predicting it is a binary classification problem. The aim of this study is to check how well we can predict whether a patient will have a brain stroke based on the available health data such as glucose level, age, and gender.
We first inspected the data and performed some exploratory data analysis, then processed the data as needed. Fortunately, there were no null values in any of the features. One of the major challenges we faced was that the data was imbalanced: 96.30% of the people had no stroke and only 3.70% had a stroke. The gap between these two classes was huge.
Since false negatives (predicting that someone will not have a stroke when they actually do) are far more dangerous than false positives (predicting that someone will have a stroke when they do not), it is better to try to maximize Recall.
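To make the choice of metric concrete, here is a minimal sketch of how Recall is computed from confusion-matrix counts. The counts are hypothetical (1000 patients, 37 of them with a stroke, mirroring the 3.70% ratio) and are not results from this study.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of actual stroke cases that the model catches: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp: int, fn: int, fp: int, tn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fn + fp + tn)


# Hypothetical baseline: a model that always predicts "no stroke".
base_tp, base_fn, base_fp, base_tn = 0, 37, 0, 963
baseline_accuracy = accuracy(base_tp, base_fn, base_fp, base_tn)
baseline_recall = recall(base_tp, base_fn)

# Hypothetical model that flags more patients and catches 26 of the 37 cases.
model_tp, model_fn, model_fp, model_tn = 26, 11, 150, 813
model_accuracy = accuracy(model_tp, model_fn, model_fp, model_tn)
model_recall = recall(model_tp, model_fn)

print(f"baseline: accuracy={baseline_accuracy:.3f}, recall={baseline_recall:.3f}")
print(f"model:    accuracy={model_accuracy:.3f}, recall={model_recall:.3f}")
```

In this made-up example the baseline looks better on accuracy but misses every stroke case, which is why accuracy alone is misleading on imbalanced data.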
After removing the outliers and checking the distribution of the numeric and categorical features, we noticed that the number of observations (data points) with a stroke history was very low, so we suspected that the ML algorithms would not generalize properly. That's why we decided to first compare the learning performance of the ML algorithms on the datasets with and without outliers, and then apply the imblearn techniques. We compared the following algorithms:
- Support Vector Classifier (SVC)
- Decision Tree Classifier (DTC)
- $k$-Nearest Neighbors ($k$-NN)
- Linear Discriminant Analysis (LDA)
- Gaussian Naive Bayes (GNB)
- Random Forest Classifier (RFC)
- Logistic Regression (LR)
- AdaBoost Classifier (ABC)
- Gradient Boosting Classifier (GBC)
As you may guess, the algorithms generalized better on the dataset with outliers.
Then we used the RandomUnderSampler class of the imblearn library with a sampling_strategy of 0.5 to make the labels more balanced. We checked the performance of the algorithms on the given dataset using 10-fold cross-validation. We preferred a stratified strategy because the dataset was imbalanced.
As you can see from the above image, the best mean Recall score with the lowest standard deviation was provided by GNB.
As a next step, we oversampled the data using the SMOTE [2] class of the imblearn library and checked the performance of the above-mentioned algorithms.
As you can see from the above image, $k$-NN now provides the best mean Recall score, at around 0.95, with the lowest standard deviation. However, this is to be expected, considering that SMOTE uses the k nearest minority-class neighbors to synthetically generate data points. That's why it is hard to rely on this model for a real-world implementation. It is also worth mentioning that adding more data points increased the model evaluation time dramatically.
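The reason a neighbor-based classifier benefits from SMOTE can be seen from how a synthetic point is built. The sketch below is a simplified NumPy illustration of the interpolation idea, not the imblearn implementation; the points and the helper `smote_like_sample` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
minority = rng.normal(size=(20, 2))  # hypothetical minority-class points


def smote_like_sample(points, k, rng):
    """Create one synthetic point between a sample and one of its k nearest neighbors."""
    base = points[rng.integers(len(points))]
    dists = np.linalg.norm(points - base, axis=1)
    neighbor_idx = np.argsort(dists)[1:k + 1]  # index 0 is the point itself
    neighbor = points[rng.choice(neighbor_idx)]
    gap = rng.random()  # random position along the segment, in [0, 1)
    return base, neighbor, base + gap * (neighbor - base)


base, neighbor, synthetic = smote_like_sample(minority, k=5, rng=rng)
print("base:     ", base)
print("neighbor: ", neighbor)
print("synthetic:", synthetic)
```

Every synthetic point lies on a segment between two nearby minority samples, so its own nearest neighbors are very likely minority samples too, which is exactly the structure a $k$-NN classifier exploits.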
We decided to continue with GNB on the under-sampled dataset for hyperparameter tuning.
After tuning the hyperparameters of GNB, we managed to achieve a Recall score of 0.71. There is still some room for improvement. We can enrich the dataset by collecting data points from other patients or by adding extra features such as weekly alcohol consumption, family history of stroke, and dietary habits. Apart from enriching the dataset, we can also try other algorithms such as XGBoost.
[1] Stroke
[2] SMOTE for Imbalanced Classification with Python