- Stroke Prediction Using Machine Learning:
https://www.kaggle.com/lcchennn/stroke-prediction-using-machine-learning
- HTML Parsing with Beautiful Soup - NHANES Data Codebook with SAS label:
https://www.kaggle.com/lcchennn/html-parsing-nhanes-data-codebook-with-sas-label
Stroke is the fifth leading cause of death in the United States, according to the Heart Disease and Stroke Statistics 2020 report. Those who survive a stroke may still face expensive medical bills and even disability. Identifying the underlying risk factors of stroke is therefore highly valuable for stroke screening and prevention. In this project, the National Health and Nutrition Examination Survey (NHANES) data from the National Center for Health Statistics (NCHS) is used to develop machine learning models. The NHANES dataset holds an abundance of variables, ranging from demographics, medical history, physical examinations, and biochemistry to dietary and lifestyle questionnaires. The XGBoost (eXtreme Gradient Boosting) classifier is used for feature selection because the model tolerates missing values and performs well with high-dimensional data.
- Exclude rows with a null or NA value for the target variable "MCQ160F" ("Has a doctor or other health professional ever told you that you had a stroke?").
- Exclude columns with over 50% missing data.
- Missing data imputation with strategy='most_frequent'.
- Train/Test split
- Train/predict with XGBoost before data preprocessing; the result showed deceptively high accuracy, while recall for the minority class was very low.
- Oversampling with imblearn.SMOTE (Synthetic Minority Oversampling TEchnique).
- Using XGBoost for feature selection after oversampling.
- Train/Test split
- Data normalization with sklearn.preprocessing (e.g. StandardScaler)
- Logistic Regression, Random Forest, Adaptive Boosting, Gradient Boosting, SVM, XGBoost model training and prediction.
- Model optimization/hyperparameter tuning
- Ensemble
- ROC/AUC comparison
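
The first steps of the list above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebook's actual code: every column except `MCQ160F` is hypothetical, the coding of `MCQ160F` (1 = yes, 2 = no) is an assumption to check against the codebook, and `StandardScaler` and `LogisticRegression` are stand-ins chosen for the example. SMOTE and XGBoost are left out so that the sketch depends only on pandas and scikit-learn. Unlike the order listed above, the imputer and scaler are fitted on the training split only, which avoids leaking test information.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in for the NHANES table (hypothetical columns).
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "systolic_bp": rng.normal(125, 15, n),
    "mostly_missing": rng.normal(0, 1, n),
})
# Assumed coding: 1 = yes (stroke), 2 = no. Strokes are rare and age-related here.
risk = 1 / (1 + np.exp(-(df["age"] - 85) / 6))
df["MCQ160F"] = np.where(rng.random(n) < risk, 1.0, 2.0)

# Inject missing values so the cleaning steps have something to do.
df.loc[rng.random(n) < 0.05, "MCQ160F"] = np.nan
df.loc[rng.random(n) < 0.10, "systolic_bp"] = np.nan
df.loc[rng.random(n) < 0.70, "mostly_missing"] = np.nan

# Exclude rows with a missing target.
df = df.dropna(subset=["MCQ160F"])
# Exclude columns with over 50% missing data.
df = df.loc[:, df.isna().mean() <= 0.5]

X = df.drop(columns="MCQ160F")
y = (df["MCQ160F"] == 1).astype(int)

# Stratified split keeps the rare positive class in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Impute with the most frequent value, then standardize.
imputer = SimpleImputer(strategy="most_frequent")
scaler = StandardScaler()
X_train_p = scaler.fit_transform(imputer.fit_transform(X_train))
X_test_p = scaler.transform(imputer.transform(X_test))

# A classifier that always predicts "no stroke" looks accurate but has zero recall.
baseline_pred = np.zeros(len(y_test), dtype=int)
baseline_acc = accuracy_score(y_test, baseline_pred)
baseline_recall = recall_score(y_test, baseline_pred, zero_division=0)

# Simple model scored with ROC AUC, which does not depend on a decision threshold.
model = LogisticRegression(max_iter=1000).fit(X_train_p, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test_p)[:, 1])

print(f"baseline accuracy={baseline_acc:.3f}, recall={baseline_recall:.3f}")
print(f"logistic regression ROC AUC={auc:.3f}")
```

The always-negative baseline shows why accuracy alone is misleading on an imbalanced target, which is the motivation for the oversampling step and for comparing models by ROC/AUC.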
- NHANES Datasets from 2013-2014:
https://www.kaggle.com/cdc/national-health-and-nutrition-examination-survey
- A data-driven approach to predicting diabetes and cardiovascular disease with machine learning:
https://pubmed.ncbi.nlm.nih.gov/31694707/
- Codebook:
https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Demographics&CycleBeginYear=2013