Stroke Prediction Using Machine Learning Approaches

Check out the code

Stroke Prediction Using Machine Learning:
https://www.kaggle.com/lcchennn/stroke-prediction-using-machine-learning
HTML Parsing with Beautiful Soup - NHANES Data Codebook with SAS label:
https://www.kaggle.com/lcchennn/html-parsing-nhanes-data-codebook-with-sas-label

I. Introduction

Stroke is the fifth leading cause of death in the United States, according to the Heart Disease and Stroke Statistics 2020 report. Those who survive a stroke may face expensive medical bills and lasting disability. Identifying the underlying risk factors of stroke is therefore highly valuable for stroke screening and prevention. In this project, National Health and Nutrition Examination Survey (NHANES) data from the National Center for Health Statistics (NCHS) is used to develop machine learning models. The NHANES dataset holds an abundance of variables, ranging from demographics, medical history, physical examinations, and biochemistry to dietary and lifestyle questionnaires. An XGBoost (eXtreme Gradient Boosting) classifier is used for feature selection because the model tolerates missing values and performs well with high-dimensional data.

II. Data Loading and Cleaning

Exclude rows with a null or NA value for the target variable "MCQ160F" (“Has a doctor or other health professional ever told you that you had a stroke?”).
Exclude columns with over 50% missing data.
Impute the remaining missing data with strategy='most_frequent'.
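
The three cleaning steps above can be sketched as follows. This is a minimal illustration, not the project's actual notebook code: the helper name `clean_nhanes`, the toy data frame, and the columns `AGE` and `SPARSE` are hypothetical, and `SimpleImputer` is assumed to be the imputer behind `strategy='most_frequent'`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def clean_nhanes(df: pd.DataFrame, target: str = "MCQ160F",
                 max_missing: float = 0.5) -> pd.DataFrame:
    """Hypothetical helper mirroring the three cleaning steps."""
    # 1. Drop rows where the target variable is missing.
    df = df.dropna(subset=[target])
    # 2. Drop columns whose fraction of missing values exceeds the threshold.
    df = df.loc[:, df.isna().mean() <= max_missing]
    # 3. Fill the remaining gaps with each column's most frequent value.
    imputer = SimpleImputer(strategy="most_frequent")
    filled = imputer.fit_transform(df)
    return pd.DataFrame(filled, columns=df.columns, index=df.index)


# Toy data: only MCQ160F is a real NHANES variable name; the rest is made up.
toy = pd.DataFrame({
    "MCQ160F": [1.0, 2.0, np.nan, 2.0, 2.0],
    "AGE":     [70.0, np.nan, 50.0, 45.0, 45.0],
    "SPARSE":  [np.nan, np.nan, np.nan, 1.0, np.nan],
})
cleaned = clean_nhanes(toy)
print(cleaned)
```

In the toy frame, the row with a missing target is dropped first, `SPARSE` is then removed because 75% of its remaining values are missing, and the one missing `AGE` value is filled with the column mode.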

III. Feature Selection

Train/Test split
Train and predict with XGBoost before any data preprocessing. The result shows misleadingly high accuracy, while recall for the minority class is very low.
Oversample with imblearn.over_sampling.SMOTE (Synthetic Minority Oversampling TEchnique).
Use XGBoost for feature selection after oversampling.

Class imbalance
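
To illustrate why accuracy alone is misleading on imbalanced data, the sketch below scores a degenerate predictor that always outputs the majority class. The labels are made up (5% positives) and do not reflect the actual NHANES class ratio.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 950 negatives and 50 positives.
y_true = np.array([0] * 950 + [1] * 50)
# A degenerate "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, pos_label=1)
print(f"accuracy={acc:.2f}, minority recall={rec:.2f}")
# prints: accuracy=0.95, minority recall=0.00
```

The predictor looks strong on accuracy while missing every positive case, which is the pattern described for the model trained before oversampling.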

24 top features selected

Feature correlation
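
One way to inspect feature correlation is a pairwise Pearson matrix. The sketch below uses made-up feature names and synthetic values, and the 0.9 threshold is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
features = pd.DataFrame({
    "feat_a": a,
    "feat_b": 2.0 * a + rng.normal(scale=0.1, size=n),  # nearly a copy of feat_a
    "feat_c": rng.normal(size=n),                        # independent of the others
})

corr = features.corr()  # Pearson correlation by default

# Keep only the upper triangle so each pair is listed once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()
high = pairs[pairs.abs() > 0.9]
print(high)
```

Pairs that exceed the threshold carry largely redundant information, so one feature from each pair is a candidate for removal.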

IV. Model Training

Train/Test split
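Data normalization with a scaler from sklearn.preprocessing (named here as NormalScaler, which scikit-learn does not provide; StandardScaler is the likely intent).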
Logistic Regression, Random Forest, Adaptive Boosting, Gradient Boosting, SVM, and XGBoost model training and prediction.
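
The training step might look like the sketch below. It uses synthetic data, default hyperparameters, and `StandardScaler` as an assumed scaler. XGBoost is left out to keep the example to scikit-learn; `xgboost.XGBClassifier` exposes the same fit/predict interface and could be added to the dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 24 selected features.
X, y = make_classification(n_samples=600, n_features=24, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Adaptive Boosting": AdaBoostClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "SVM": SVC(),
}

predictions = {}
for name, model in models.items():
    # The pipeline fits the scaler on the training split only,
    # then applies the same scaling to the test split.
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_train, y_train)
    predictions[name] = pipe.predict(X_test)
```

Wrapping the scaler and the model in one pipeline keeps test-set statistics out of the scaling step.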

V. Model Comparison

Accuracy score comparison.
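
A comparison of accuracy scores can be tabulated as in the sketch below. The labels, the predictions, and the model names `model_a` to `model_c` are hypothetical placeholders, not results from this project.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical test labels and per-model predictions.
y_test = [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]
predictions = {
    "model_a": [0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
    "model_b": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "model_c": [0, 1, 0, 0, 1, 1, 0, 1, 0, 1],
}

scores = pd.Series(
    {name: accuracy_score(y_test, pred) for name, pred in predictions.items()}
).sort_values(ascending=False)
print(scores)
```

Here `model_b` predicts only the majority class and still reaches 0.7 accuracy, so reporting per-class recall next to accuracy gives a fuller picture.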

VI. Future Work

Model optimization/hyperparameter tuning
Ensemble
ROC/AUC comparison
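
As a starting point for the ROC/AUC item, the sketch below computes ROC AUC for two scikit-learn classifiers. The synthetic data and the choice of models are illustrative assumptions, not part of the project so far.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 10% positives).
X, y = make_classification(n_samples=1000, n_features=24, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

aucs, curves = {}, {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    # ROC analysis needs scores or probabilities, not hard class labels.
    proba = model.predict_proba(X_test)[:, 1]
    aucs[name] = roc_auc_score(y_test, proba)
    fpr, tpr, _ = roc_curve(y_test, proba)
    curves[name] = (fpr, tpr)  # points that could be plotted later

for name, auc in aucs.items():
    print(f"{name}: AUC = {auc:.3f}")
```

Unlike accuracy, ROC AUC measures how well a model ranks positives above negatives across all thresholds, so it is less affected by the class ratio.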

References

NHANES Datasets from 2013-2014:
https://www.kaggle.com/cdc/national-health-and-nutrition-examination-survey
A data-driven approach to predicting diabetes and cardiovascular disease with machine learning:
https://pubmed.ncbi.nlm.nih.gov/31694707/
Codebook:
https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Demographics&CycleBeginYear=2013
