cardiovascular_risk_prediction by shantanumokhale

Problem Statement

Objective: Predict the 10-year risk of future coronary heart disease (CHD) in patients.
Dataset: Cardiovascular study data from Framingham, Massachusetts.
Attributes: 15 attributes, including demographic, behavioral, and medical risk factors.
Target Variable: 10-year risk of CHD (binary: 1 = "Yes", 0 = "No").

Data Features

Demographic:
- Sex: Male or female ("M" or "F").
- Age: Age of the patient (continuous).
Behavioral:
- Is Smoking: Current smoker status ("YES" or "NO").
- Cigs Per Day: Number of cigarettes smoked per day (continuous).
Medical (History):
- BP Meds: On blood pressure medication (nominal).
- Prevalent Stroke: History of stroke (nominal).
- Prevalent Hyp: History of hypertension (nominal).
- Diabetes: Diabetes status (nominal).
Medical (Current):
- Tot Chol: Total cholesterol level (continuous).
- Sys BP: Systolic blood pressure (continuous).
- Dia BP: Diastolic blood pressure (continuous).
- BMI: Body Mass Index (continuous).
- Heart Rate: Heart rate (continuous).
- Glucose: Glucose level (continuous).

Exploratory Data Analysis (EDA)

Comparisons: The target variable (CHD) against each independent variable.
Techniques: Feature engineering, feature transformation, outlier handling, and analysis of categorical and numerical data.
Visualizations: Histograms, violin plots, and bar plots, used to analyze correlations and the effect of each attribute on CHD.

Data Preprocessing

Missing Values: Imputed with the column mean or mode.
Duplicate Values: None found.
Outliers: Imputed with median values rather than deleted.
Label Encoding: Applied to the categorical variables (sex, is smoking).
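
The imputation and label-encoding steps above can be sketched in plain Python. This is a minimal illustration, not the project's code: the toy records, the column names, and the choice of mean for the continuous column and mode for the nominal one are all assumptions.

```python
from statistics import mean, mode

# Hypothetical toy records; column names are illustrative only.
rows = [
    {"sex": "M", "is_smoking": "YES", "glucose": 80.0, "BPMeds": 0},
    {"sex": "F", "is_smoking": "NO", "glucose": None, "BPMeds": None},
    {"sex": "F", "is_smoking": "NO", "glucose": 100.0, "BPMeds": 0},
    {"sex": "M", "is_smoking": "YES", "glucose": 90.0, "BPMeds": 1},
]

def impute(rows, column, strategy):
    """Fill None entries in `column` with the mean or mode of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed) if strategy == "mean" else mode(observed)
    for r in rows:
        if r[column] is None:
            r[column] = fill
    return fill

def label_encode(rows, column):
    """Map each distinct category to an integer, assigned in sorted order."""
    mapping = {cat: i for i, cat in enumerate(sorted({r[column] for r in rows}))}
    for r in rows:
        r[column] = mapping[r[column]]
    return mapping

glucose_fill = impute(rows, "glucose", "mean")  # assumed: continuous -> mean
bpmeds_fill = impute(rows, "BPMeds", "mode")    # assumed: nominal -> mode
sex_map = label_encode(rows, "sex")
smoke_map = label_encode(rows, "is_smoking")
```

After this runs, every record is fully numeric, which is what most classifiers expect as input.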

Feature Engineering

Feature Selection:
- Methods: Variance threshold, chi-square tests.
- Outcome: Identified and eliminated less relevant features.
Balancing Target Variable:
- Technique: SMOTE (Synthetic Minority Oversampling Technique).
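
To make the balancing step concrete, here is a simplified sketch of the idea behind SMOTE: each synthetic minority sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. The function name `smote`, its parameters, and the toy data are hypothetical. A library implementation such as the `SMOTE` class in imbalanced-learn would normally be used instead, and the source does not say which implementation the project relied on.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical two-feature data: six majority samples, three minority samples.
majority = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.2, 0.4), (0.4, 0.2), (0.1, 0.1)]
minority = [(0.8, 0.9), (0.9, 0.8), (0.7, 0.7)]

# Generate just enough synthetic samples to match the majority class size.
new_points = smote(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
```

Because every synthetic point is an interpolation, it always lies within the range already spanned by the real minority samples.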

Algorithms and Models

Logistic Regression:
- Accuracy: 85.84%, increased to 85.98% with hyper-parameter tuning.
K-Nearest Neighbors (KNN):
- Accuracy: 84.36%, increased to 85.69% with hyper-parameter tuning.
Decision Tree:
- Accuracy: 74.92%, increased to 85.69% with hyper-parameter tuning.
Random Forest Classifier:
- Accuracy: 84.80%.
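
As an illustration of how hyper-parameter tuning can change accuracy, the sketch below implements a tiny K-Nearest Neighbors classifier and picks the number of neighbours `k` by accuracy on held-out data. The data, the candidate values of `k`, and the single validation split are assumptions made for the example. The project's actual search procedure and parameter grid are not stated in the source, and a tool such as scikit-learn's `GridSearchCV` with cross-validation is the more common choice in practice.

```python
import math
from collections import Counter

def knn_predict(train, point, k):
    """Majority vote among the k training samples nearest to `point`."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train, held_out, k):
    """Fraction of held-out samples whose predicted label matches the true one."""
    hits = sum(knn_predict(train, x, k) == y for x, y in held_out)
    return hits / len(held_out)

# Hypothetical (features, label) pairs; 1 = CHD within 10 years, 0 = no CHD.
train = [((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.3, 0.2), 0), ((0.2, 0.3), 0),
         ((0.8, 0.9), 1), ((0.9, 0.8), 1), ((0.7, 0.8), 1), ((0.45, 0.5), 1)]
validation = [((0.15, 0.15), 0), ((0.4, 0.4), 0), ((0.85, 0.85), 1), ((0.75, 0.7), 1)]

# Grid search: score each candidate k and keep the most accurate one.
scores = {k: accuracy(train, validation, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
```

With this toy data, `k = 1` is misled by a single unusual training point near the class boundary, while larger values of `k` outvote it.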

Model Insights

Significant Features: BP Meds, prevalent stroke, diabetes, smoking status, age, systolic and diastolic BP.
Multicollinearity Handling: Retained cigsPerDay over is_smoking, retained diaBP over sysBP.
Variance Threshold: Discarded features that did not meet the 99% variance threshold.
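
The source does not define the "99%" threshold precisely. One possible reading, assumed here, is that a binary feature is dropped when it takes the same value in at least 99% of rows, which corresponds to a Bernoulli variance of p(1 - p) with p = 0.99. The sketch below applies that reading to hypothetical columns. The names `feature_a` and `feature_b` are placeholders, not columns from the dataset.

```python
from statistics import pvariance

def low_variance_features(columns, threshold):
    """Return names of features whose population variance is at or below `threshold`."""
    return [name for name, values in columns.items() if pvariance(values) <= threshold]

# Assumption: "99%" means a binary feature is constant in at least 99% of rows.
p = 0.99
threshold = p * (1 - p)  # Bernoulli variance at p = 0.99, roughly 0.0099

# Hypothetical binary columns with 200 rows each.
columns = {
    "feature_a": [1] * 1 + [0] * 199,   # 0.5% positive: nearly constant
    "feature_b": [1] * 60 + [0] * 140,  # 30% positive: clearly varies
}
dropped = low_variance_features(columns, threshold)
```

scikit-learn's `VarianceThreshold` transformer provides the same kind of filter for array-shaped data.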

Observations

Smoking: Higher CHD risk for heavy smokers.
Gender: Males smoke more, but females have higher CHD risk.
Age Group: Highest CHD risk in ages 35-50.
Feature Irrelevance: Education was irrelevant (identified via chi-square test).
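
A chi-square test of independence, as mentioned for the education feature, compares the observed counts in a contingency table with the counts expected if the feature and the target were unrelated. The sketch below computes the Pearson statistic by hand. The counts are invented for illustration, the four education levels are an assumption, and the exact procedure the project used is not stated in the source.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows are four education levels, columns are (no CHD, CHD).
table = [
    [170, 30],
    [171, 29],
    [169, 31],
    [170, 30],
]
stat = chi_square_statistic(table)
dof = (len(table) - 1) * (len(table[0]) - 1)  # (4 - 1) * (2 - 1) = 3

# Chi-square critical value for 3 degrees of freedom at alpha = 0.05.
CRITICAL_0_05_DF3 = 7.815
independent = stat < CRITICAL_0_05_DF3
```

A statistic below the critical value means the data give no evidence of an association, which is the situation in which a feature would be judged irrelevant to the target.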

Challenges

Outliers: 14% of data were outliers, imputed with median values.
Categorical Features: Label encoded.
Irrelevant Features: Removed features that were multicollinear or had low variance.
Imbalanced Target: Balanced using SMOTE.
Complex Computations: Required careful handling to avoid data loss.
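
The median imputation of outliers described above keeps every row while neutralizing extreme values. The source does not say how outliers were detected. The sketch below assumes the common 1.5 x IQR rule, and the function name and sample readings are hypothetical.

```python
from statistics import median, quantiles

def impute_outliers_with_median(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the median."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the column
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    mid = median(values)
    return [mid if (v < low or v > high) else v for v in values]

# Hypothetical total-cholesterol readings with one extreme value.
tot_chol = [190, 200, 210, 220, 230, 240, 250, 600]
cleaned = impute_outliers_with_median(tot_chol)
```

The extreme reading is replaced by the median while the column keeps its original length, so no records are lost.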

Conclusion

Best Performing Models: Logistic Regression, KNN, and Random Forest.
Ease of Implementation: Logistic regression was simplest; Random Forest was time-intensive.
Overall Accuracy: Models improved with hyper-parameter tuning and feature selection.
