A machine learning project for cardiovascular risk prediction. The goal is to predict whether a patient is at risk of developing coronary heart disease (CHD) within the next 10 years.
cardiovascular_risk_prediction's Introduction
Objective: Predict the 10-year risk of future coronary heart disease (CHD) in patients.
Dataset: Cardiovascular study data from Framingham, Massachusetts.
Attributes: 15 attributes, including demographic, behavioral, and medical risk factors.
Target Variable: 10-year risk of CHD (binary: 1 = "Yes", 0 = "No").
Demographic:
Sex: Male or female ("M" or "F").
Age: Age of the patient (continuous).
Behavioral:
Is Smoking: Current smoker status ("YES" or "NO").
Cigs Per Day: Number of cigarettes smoked per day (continuous).
Medical (History):
BP Meds: On blood pressure medication (nominal).
Prevalent Stroke: History of stroke (nominal).
Prevalent Hyp: History of hypertension (nominal).
Diabetes: Diabetes status (nominal).
Medical (Current):
Tot Chol: Total cholesterol level (continuous).
Sys BP: Systolic blood pressure (continuous).
Dia BP: Diastolic blood pressure (continuous).
BMI: Body Mass Index (continuous).
Heart Rate: Heart rate (continuous).
Glucose: Glucose level (continuous).
Exploratory Data Analysis (EDA)
Comparisons: Target variable CHD vs. independent variables.
Techniques: Feature engineering, feature transformation, outlier handling, categorical and numerical data analysis.
Visualizations: Histograms, violin plots, bar plots to analyze correlations and attribute effects on CHD.
Handling Missing Values: Imputed with mean and mode.
Duplicate Values: None found.
Outliers: Imputed with median values rather than deleted.
Label Encoding: For categorical variables (sex, is smoking).
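The cleaning steps above can be sketched as follows. This is a minimal illustration on toy data, not the project's actual code: the column names, the assignment of mean to numeric columns and mode to categorical ones, and the 1.5 x IQR outlier rule are all assumptions, since the source names neither the columns involved nor the outlier criterion.

```python
import pandas as pd

# Toy rows. Column names mirror the attribute list but are assumptions.
df = pd.DataFrame({
    "sex": ["M", "F", "F", None, "M", "F"],
    "is_smoking": ["YES", "NO", "NO", "YES", "NO", "NO"],
    "glucose": [80.0, None, 75.0, 90.0, 85.0, 300.0],
})


def impute_missing(frame):
    """Fill numeric gaps with the column mean, other gaps with the mode."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


def impute_outliers_with_median(series, k=1.5):
    """Replace values outside the IQR fences with the median (assumed rule)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series.mask(is_outlier, series.median())


def label_encode(series):
    """Map each category to an integer code, in sorted category order."""
    return series.astype("category").cat.codes


clean = impute_missing(df)
clean["glucose"] = impute_outliers_with_median(clean["glucose"])
for col in ["sex", "is_smoking"]:
    clean[col] = label_encode(clean[col])

print(clean)
```

Replacing outliers instead of dropping rows keeps every patient record, which matters when the positive class is already small.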
Feature Selection:
Methods: Variance threshold, chi-square tests.
Outcome: Identified and eliminated less relevant features.
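A rough sketch of the two selection methods named above, using scikit-learn on synthetic data. The feature names are hypothetical, and the threshold value is one possible reading of a "99%" cut-off for binary features (drop a feature that takes the same value in at least 99% of rows); the source does not say how the threshold was defined.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold, chi2

rng = np.random.default_rng(0)
n_rows = 400
y = rng.integers(0, 2, size=n_rows)

# Hypothetical features: one tied to the target, one unrelated, one almost constant.
informative = np.where(rng.random(n_rows) < 0.85, y, 1 - y)
noise = rng.integers(0, 4, size=n_rows)
near_constant = np.zeros(n_rows, dtype=int)
near_constant[:2] = 1

X = np.column_stack([informative, noise, near_constant])
names = ["informative", "noise", "near_constant"]

# Step 1: drop low-variance features.
# Assumption: for a binary feature, variance is p * (1 - p) with p = 0.99.
selector = VarianceThreshold(threshold=0.99 * (1 - 0.99))
selector.fit(X)
kept = [name for name, keep in zip(names, selector.get_support()) if keep]

# Step 2: chi-square test of each remaining (non-negative) feature against the target.
X_kept = X[:, selector.get_support()]
scores, p_values = chi2(X_kept, y)

for name, score, p in zip(kept, scores, p_values):
    print(f"{name}: chi2={score:.2f}, p={p:.4f}")
```

Features with a large p-value show no detectable association with the target and are candidates for removal.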
Balancing Target Variable:
Technique: SMOTE (Synthetic Minority Oversampling Technique).
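To illustrate the idea behind SMOTE, here is a simplified NumPy sketch: each synthetic row is a random point on the line segment between a minority sample and one of its nearest minority neighbours. The function name `smote_sketch` and the toy data are hypothetical, and the project most likely used a library implementation (imbalanced-learn provides a `SMOTE` class with `fit_resample`) rather than hand-written code like this.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating toward neighbours.

    Simplified illustration; expects at least two minority rows.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)

    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a base row and one of its k neighbours for every synthetic row.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Move a random fraction of the way from the base row to the neighbour.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy imbalanced data: 90 negatives, 10 positives.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(3, 1, (10, 2))])
y = np.array([0] * 90 + [1] * 10)

X_min = X[y == 1]
synthetic = smote_sketch(X_min, n_new=80)
X_bal = np.vstack([X, synthetic])
y_bal = np.concatenate([y, np.ones(len(synthetic), dtype=int)])

print(np.bincount(y_bal))
```

Oversampling is normally applied to the training split only, so that synthetic rows do not leak into the evaluation data.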
Logistic Regression:
Accuracy: 85.84%, increased to 85.98% with hyper-parameter tuning.
K-Nearest Neighbors (KNN):
Accuracy: 84.36%, increased to 85.69% with hyper-parameter tuning.
Decision Tree:
Accuracy: 74.92%, increased to 85.69% with hyper-parameter tuning.
Random Forest Classifier:
Significant Features: BP Meds, prevalent stroke, diabetes, smoking status, age, systolic and diastolic BP.
Multicollinearity Handling: Retained cigsPerDay over is_smoking, retained diaBP over sysBP.
Variance Threshold: Discarded features that did not meet the 99% variance threshold.
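The hyper-parameter tuning and feature-importance steps reported above might look roughly like this with scikit-learn. The data here is synthetic, and the parameter grid, fold count, and number of trees are illustrative assumptions; the source does not list the settings that were actually searched.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared dataset.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Untuned baseline for comparison.
baseline = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Grid search with 5-fold cross-validation (grid values are assumptions).
param_grid = {"max_depth": [2, 4, 6, None], "min_samples_leaf": [1, 5, 20]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

print("untuned test accuracy:", baseline.score(X_test, y_test))
print("tuned test accuracy:", search.score(X_test, y_test))
print("best parameters:", search.best_params_)

# Impurity-based feature importances from a random forest, highest first.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
ranking = np.argsort(forest.feature_importances_)[::-1]
for idx in ranking:
    print(f"feature_{idx}: {forest.feature_importances_[idx]:.3f}")
```

The search picks the best setting by cross-validated accuracy on the training split, so an improvement on the held-out test set is likely but not guaranteed.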
Smoking: Higher CHD risk for heavy smokers.
Gender: Males smoke more, but females have higher CHD risk.
Age Group: Highest CHD risk in ages 35-50.
Feature Irrelevance: Education was irrelevant (identified via chi-square test).
Outliers: 14% of data were outliers, imputed with median values.
Categorical Features: Label encoded.
Irrelevant Features: Removed after checking for multicollinearity and low variance.
Imbalanced Target: Balanced using SMOTE.
Complex Computations: Required careful handling to avoid data loss.
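The multicollinearity check mentioned above, which led to keeping diaBP over sysBP, could be sketched with a pairwise correlation matrix. The synthetic values and the 0.7 cut-off are assumptions for illustration; the source does not say which measure or threshold was used.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 500
sys_bp = rng.normal(130, 20, n_rows)

# Synthetic data: diaBP is constructed to track sysBP, heartRate is independent.
df = pd.DataFrame({
    "sysBP": sys_bp,
    "diaBP": 0.5 * sys_bp + rng.normal(18, 5, n_rows),
    "heartRate": rng.normal(75, 12, n_rows),
})


def correlated_pairs(frame, threshold=0.7):
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    cols = list(corr.columns)
    return [
        (a, b, round(float(corr.loc[a, b]), 2))
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
        if corr.loc[a, b] > threshold
    ]


pairs = correlated_pairs(df)
print(pairs)

# Keep one column from each highly correlated pair (here diaBP, as in the project).
reduced = df.drop(columns=["sysBP"])
print(list(reduced.columns))
```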
Best Performing Models: Logistic Regression, KNN, and Random Forest.
Ease of Implementation: Logistic regression was simplest; Random Forest was time-intensive.
Overall Accuracy: Models improved with hyper-parameter tuning and feature selection.