Predict whether a mammogram mass is benign or malignant.
This project uses the public "mammographic masses" dataset from the UCI repository (source: https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass).
The dataset contains 961 instances of masses detected in mammograms, each described by the following attributes:
- BI-RADS assessment (ordinal) - 1 to 5
- Age (integer) - Patient's age in years
- Mass shape (nominal) - round=1; oval=2; lobular=3; irregular=4
- Mass margin (nominal) - circumscribed=1; micro-lobulated=2; obscured=3; ill-defined=4; spiculated=5
- Mass density (ordinal) - high=1; iso=2; low=3; fat-containing=4
- Severity (binomial) - benign=0 or malignant=1
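As a rough sketch of how the attributes above could be loaded with pandas: the column names below are hypothetical (chosen here for illustration), the sample rows are illustrative only, and the code assumes the raw file is comma-separated, has no header row, and marks missing entries with "?".

```python
import io

import pandas as pd

# Hypothetical column names, in the attribute order listed above.
COLUMNS = ["bi_rads", "age", "shape", "margin", "density", "severity"]

# Illustrative rows only (not a substitute for the real file);
# "?" is assumed to mark a missing value.
SAMPLE = """5,67,3,5,3,1
4,43,1,1,?,1
5,58,4,5,3,1
4,28,1,1,3,0
5,74,1,5,?,1
4,65,1,?,3,0
"""


def load_masses(source):
    """Read comma-separated rows, converting '?' entries to NaN."""
    return pd.read_csv(source, names=COLUMNS, na_values="?")


df = load_masses(io.StringIO(SAMPLE))
print(df.isna().sum())  # number of missing values per column
```

To read the real data, `load_masses` could be given the path of a locally downloaded copy of the data file instead of the in-memory sample.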
Data pre-processing:
- Data exploration
- Handling missing data
- Feature selection
- Normalization
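The four steps above might look roughly like the following sketch. It runs on a tiny made-up table with hypothetical column names, and it makes assumptions the project description does not state: incomplete rows are dropped (imputation would be another option), the kept features are age, shape, margin and density, and normalization is done with scikit-learn's `StandardScaler`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny made-up table; column names are hypothetical.
df = pd.DataFrame({
    "bi_rads": [5, 4, 5, 4, 5, 4],
    "age": [67, 43, 58, 28, 74, 65],
    "shape": [3, 1, 4, 1, 1, 1],
    "margin": [5, 1, 5, 1, 5, np.nan],
    "density": [3, np.nan, 3, 2, np.nan, 3],
    "severity": [1, 1, 1, 0, 1, 0],
})

# Data exploration: summary statistics and missing-value counts.
print(df.describe())
print(df.isna().sum())

# Handling missing data: drop incomplete rows (an assumption).
clean = df.dropna()

# Feature selection: an assumed subset of the attributes.
features = ["age", "shape", "margin", "density"]
X = clean[features].to_numpy(dtype=float)
y = clean["severity"].to_numpy(dtype=int)

# Normalization: rescale each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.shape)
```

Dropping rows is the simplest choice but discards data; whether that is acceptable depends on how the missing values are distributed, which the exploration step is meant to reveal.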
Apply several supervised classification techniques and compare them to see which one yields the highest accuracy.
Models tested:
- Logistic Regression
- KNN
- Naive Bayes
- Decision Tree
- Random Forest
- SVM
- Neural network
Model performance is measured using K-fold cross-validation (K=10).
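A minimal sketch of such a comparison with scikit-learn is shown below. It uses synthetic data from `make_classification` so that it runs on its own, and the specific estimators and settings are assumptions: `GaussianNB` stands in for Naive Bayes and `MLPClassifier` for the neural network, which may differ from what the project actually used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed features and labels.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Assumed estimator choices and hyperparameters, one per model listed above.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "SVM": SVC(kernel="rbf"),
    "Neural network": MLPClassifier(
        hidden_layer_sizes=(16,), max_iter=1000, random_state=0
    ),
}

results = {}
for name, model in models.items():
    # Scaling sits inside the pipeline so it is refit on each training fold.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=10, scoring="accuracy")
    results[name] = scores.mean()

# Report mean 10-fold accuracy, best model first.
for name, accuracy in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {accuracy:.3f}")
```

Placing the scaler inside the pipeline keeps statistics from the held-out fold out of training, which would otherwise make the accuracy estimates slightly optimistic.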