cardiovascular_risk_prediction's Introduction

Problem Statement

  • Objective: Predict the 10-year risk of future coronary heart disease (CHD) in patients.
  • Dataset: Cardiovascular study data from Framingham, Massachusetts.
  • Attributes: 15 attributes, including demographic, behavioral, and medical risk factors.
  • Target Variable: 10-year risk of CHD (binary: 1 = "Yes", 0 = "No").

Data Features

  • Demographic:
    • Sex: Male or female ("M" or "F").
    • Age: Age of the patient (continuous).
  • Behavioral:
    • Is Smoking: Current smoker status ("YES" or "NO").
    • Cigs Per Day: Number of cigarettes smoked per day (continuous).
  • Medical (History):
    • BP Meds: On blood pressure medication (nominal).
    • Prevalent Stroke: History of stroke (nominal).
    • Prevalent Hyp: History of hypertension (nominal).
    • Diabetes: Diabetes status (nominal).
  • Medical (Current):
    • Tot Chol: Total cholesterol level (continuous).
    • Sys BP: Systolic blood pressure (continuous).
    • Dia BP: Diastolic blood pressure (continuous).
    • BMI: Body Mass Index (continuous).
    • Heart Rate: Heart rate (continuous).
    • Glucose: Glucose level (continuous).

Exploratory Data Analysis (EDA)

  • Comparisons: Target variable CHD vs. independent variables.
  • Techniques: Feature engineering, feature transformation, outlier handling, categorical and numerical data analysis.
  • Visualizations: Histograms, violin plots, and bar plots used to analyze correlations and the effect of each attribute on CHD.
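
A comparison of the target against a categorical variable can be sketched as a grouped mean of the binary target. The column names (`sex`, `is_smoking`, `TenYearCHD`) and the toy values below are assumptions for illustration only, not the project's data or code.

```python
import pandas as pd

# Toy frame; column names and values are hypothetical stand-ins.
df = pd.DataFrame({
    "sex": ["M", "F", "M", "F", "M", "F"],
    "is_smoking": ["YES", "NO", "YES", "YES", "NO", "NO"],
    "TenYearCHD": [1, 0, 0, 1, 0, 0],
})


def chd_rate_by(frame: pd.DataFrame, column: str, target: str = "TenYearCHD") -> pd.Series:
    """Share of positive CHD cases within each level of `column`."""
    return frame.groupby(column)[target].mean()


rate_by_smoking = chd_rate_by(df, "is_smoking")
print(rate_by_smoking)
```

The resulting series can be passed to `.plot.bar()` to produce the kind of bar plot mentioned above.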

Data Preprocessing

  • Handling Missing Values: Imputed with mean and mode.
  • Duplicate Values: None found.
  • Outliers: Imputed with median values rather than deleted.
  • Label Encoding: For categorical variables (sex, is smoking).
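
The preprocessing steps above could look roughly like the following sketch. The toy data, the column names, the 1.5 × IQR outlier rule, and the explicit encoding maps are all assumptions; the README does not say how outliers were detected or which encoder was used.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real column names may differ.
df = pd.DataFrame({
    "sex": ["M", "F", None, "F", "M", "F"],
    "is_smoking": ["YES", "NO", "NO", "YES", "NO", "NO"],
    "glucose": [80.0, 85.0, np.nan, 90.0, 78.0, 82.0],
    "totChol": [200.0, 210.0, 190.0, 205.0, 195.0, 600.0],
})

# Missing values: mode for the categorical column, mean for the numerical one.
df["sex"] = df["sex"].fillna(df["sex"].mode()[0])
df["glucose"] = df["glucose"].fillna(df["glucose"].mean())


def impute_outliers_with_median(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the column median."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series.mask(is_outlier, series.median())


# Outliers are replaced, not dropped, so no rows are lost.
df["totChol"] = impute_outliers_with_median(df["totChol"])

# Label encoding of the two categorical columns (mapping chosen for illustration).
df["sex"] = df["sex"].map({"F": 0, "M": 1})
df["is_smoking"] = df["is_smoking"].map({"NO": 0, "YES": 1})
```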

Feature Engineering

  • Feature Selection:
    • Methods: Variance threshold, chi-square tests.
    • Outcome: Identified and eliminated less relevant features.
  • Balancing Target Variable:
    • Technique: SMOTE (Synthetic Minority Oversampling Technique).
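
As a minimal sketch of the two feature-selection methods named above, the following uses scikit-learn's `VarianceThreshold` and `chi2` on synthetic binary features. The feature names, the variance cutoff of 0.01, and the generated data are assumptions made for illustration, not the project's settings.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold, chi2

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)

# Three hypothetical binary features.
informative = np.where(rng.random(n) < 0.9, y, 1 - y)  # agrees with y about 90% of the time
noise = rng.integers(0, 2, size=n)                      # unrelated to y
near_constant = np.zeros(n, dtype=int)                  # almost no variation
near_constant[0] = 1

X = np.column_stack([informative, noise, near_constant])
names = np.array(["informative", "noise", "near_constant"])

# Step 1: drop features whose variance does not exceed the (assumed) cutoff.
selector = VarianceThreshold(threshold=0.01)
selector.fit(X)
kept = selector.get_support()

# Step 2: chi-square test of each remaining non-negative feature against the target.
chi2_stats, p_values = chi2(X[:, kept], y)

for name, p in zip(names[kept], p_values):
    print(f"{name}: p-value = {p:.4g}")
```

Features with a large p-value show no detectable association with the target and are candidates for removal.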

Algorithms and Models

  1. Logistic Regression:
    • Accuracy: 85.84%, increased to 85.98% with hyper-parameter tuning.
  2. K-Nearest Neighbors (KNN):
    • Accuracy: 84.36%, increased to 85.69% with hyper-parameter tuning.
  3. Decision Tree:
    • Accuracy: 74.92%, increased to 85.69% with hyper-parameter tuning.
  4. Random Forest Classifier:
    • Accuracy: 84.80%.
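
The comparison above can be sketched with scikit-learn as follows. This is an illustration on synthetic data from `make_classification`, so its accuracies will not match the figures reported here. The use of `GridSearchCV`, the parameter grids, and the tuning of the random forest are assumptions; the README does not state how the models were tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed dataset.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Each entry: (estimator, hypothetical hyper-parameter grid).
candidates = {
    "logistic_regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "knn": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 11]}),
    "decision_tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [2, 4, 8]}),
    "random_forest": (
        RandomForestClassifier(n_estimators=50, random_state=0),
        {"max_depth": [4, None]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    # 5-fold cross-validated grid search on the training split only.
    search = GridSearchCV(model, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    # Accuracy of the best configuration on the held-out test split.
    results[name] = search.score(X_test, y_test)

for name, accuracy in results.items():
    print(f"{name}: {accuracy:.2%}")
```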

Model Insights

  • Significant Features: BP Meds, prevalent stroke, diabetes, smoking status, age, systolic and diastolic BP.
  • Multicollinearity Handling: Retained cigsPerDay over is_smoking, retained diaBP over sysBP.
  • Variance Threshold: Discarded features that did not meet the 99% variance threshold.
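
One common way to spot multicollinear pairs such as systolic and diastolic BP is to scan the absolute correlation matrix. The sketch below does this on synthetic data; the 0.7 cutoff, the generated values, and the helper name `correlated_pairs` are assumptions, and the README does not say which diagnostic was actually used.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
sys_bp = rng.normal(130, 20, n)

# Synthetic data in which diaBP is constructed to track sysBP closely.
df = pd.DataFrame({
    "sysBP": sys_bp,
    "diaBP": 0.6 * sys_bp + rng.normal(0, 4, n),
    "heartRate": rng.normal(75, 10, n),
})


def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.7) -> list:
    """Return (col_a, col_b, |r|) for each column pair whose |r| exceeds the threshold."""
    corr = frame.corr().abs()
    cols = list(corr.columns)
    return [
        (a, b, float(corr.loc[a, b]))
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
        if corr.loc[a, b] > threshold
    ]


pairs = correlated_pairs(df)
print(pairs)

# Keep one column from each correlated pair, here diaBP as in the list above.
reduced = df.drop(columns=["sysBP"])
```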

Observations

  • Smoking: Higher CHD risk for heavy smokers.
  • Gender: Males smoke more, but females have higher CHD risk.
  • Age Group: Highest CHD risk in ages 35-50.
  • Feature Irrelevance: Education was irrelevant (identified via chi-square test).

Challenges

  • Outliers: 14% of data were outliers, imputed with median values.
  • Categorical Features: Label encoded.
  • Irrelevant Features: Removed after checking for multicollinearity and low variance.
  • Imbalanced Target: Balanced using SMOTE.
  • Complex Computations: Required careful handling to avoid data loss.
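
To illustrate the idea behind SMOTE, here is a simplified NumPy sketch that creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. It is not the `imbalanced-learn` implementation and not the project's code; the function name `smote_sketch`, the class sizes, and the toy data are assumptions.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Simplified SMOTE: interpolate between a minority sample and a random
    one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)

    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)          # starting points
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                             # position along the segment
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


# Hypothetical imbalanced data: 85 majority vs. 15 minority samples.
rng = np.random.default_rng(1)
X_majority = rng.normal(0.0, 1.0, size=(85, 2))
X_minority = rng.normal(2.0, 1.0, size=(15, 2))

synthetic = smote_sketch(X_minority, n_new=len(X_majority) - len(X_minority))
X_balanced = np.vstack([X_majority, X_minority, synthetic])
y_balanced = np.array([0] * len(X_majority) + [1] * (len(X_minority) + len(synthetic)))
```

In practice the `SMOTE` class from `imbalanced-learn` is the usual choice, and oversampling is normally applied to the training split only so that synthetic points do not leak into evaluation.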

Conclusion

  • Best Performing Models: Logistic Regression, KNN, and Random Forest.
  • Ease of Implementation: Logistic regression was simplest; Random Forest was time-intensive.
  • Overall Accuracy: Models improved with hyper-parameter tuning and feature selection.

Contributors

  • shantanumokhale
