Learning Goals
Part 1:
- Work with the scikit-learn library: split data into train and test sets and report different scores.
- Decision Trees.
Part 2:
- Work with Pipelines (with DecisionTrees), imputers, scalers and encoders.
- Grid Search.
Exercise Statement
Part 1:
Train several Decision Tree models to detect breast cancer using the Breast cancer wisconsin (diagnostic) dataset (section 7.2.7 of the scikit-learn documentation).
The goal is to predict whether a breast cancer case is Malignant or Benign.
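The Part 1 workflow could be sketched roughly as follows. This is a minimal illustration, not the submitted solution: the 70/30 stratified split, the random_state values, the hyperparameter combinations and the choice of metrics are all assumptions made for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the features and labels; in scikit-learn's copy, 0 = malignant, 1 = benign.
X, y = load_breast_cancer(return_X_y=True)

# Assumed split: 30% held out for testing, stratified to keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Try a few "different" trees (assumed settings, chosen only for illustration).
results = {}
for criterion in ("gini", "entropy"):
    for max_depth in (3, None):
        clf = DecisionTreeClassifier(
            criterion=criterion, max_depth=max_depth, random_state=0
        )
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        # precision/recall/F1 use the default positive label 1 (benign).
        results[(criterion, max_depth)] = {
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision_score(y_test, y_pred),
            "recall": recall_score(y_test, y_pred),
            "f1": f1_score(y_test, y_pred),
        }

for params, scores in results.items():
    print(params, {name: round(value, 3) for name, value in scores.items()})
```

If the malignant class should be treated as the positive class when reporting precision and recall, pass pos_label=0 to the metric functions.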
Part 2:
Apply various transformations (imputers, encoders and scalers) using Pipelines with DecisionTreeClassifiers. Use grid search to find the best parameters. The goal is to predict whether income exceeds $50K/yr based on census data.
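One way the Part 2 pieces can fit together is sketched below. Because income.csv is not reproduced here, the example builds a small synthetic table; its column names, values, missing-value pattern and target rule are hypothetical stand-ins, and the parameter grid is an assumption rather than the grid used in the submission.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for income.csv (not the real dataset).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 70, n).astype(float),
    "hours_per_week": rng.integers(10, 60, n).astype(float),
    "workclass": rng.choice(["Private", "State-gov", "Self-emp"], n),
    "education": rng.choice(["HS-grad", "Bachelors", "Masters"], n),
})
# Made-up target rule so that the tree has something to learn.
y = ((toy["age"] > 40) & (toy["hours_per_week"] > 35)).astype(int)
# Inject some missing values so the imputers have work to do.
toy.loc[toy.index[::17], "age"] = np.nan
toy.loc[toy.index[::23], "workclass"] = np.nan

num_cols = ["age", "hours_per_week"]
cat_cols = ["workclass", "education"]

# Numeric columns: impute, then scale. Categorical columns: impute, then one-hot encode.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
pre = ColumnTransformer([
    ("num", numeric, num_cols),
    ("cat", categorical, cat_cols),
])
model = Pipeline([
    ("pre", pre),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Assumed grid: step names are joined with "__" to reach nested parameters.
param_grid = {
    "tree__max_depth": [2, 4, 8],
    "tree__min_samples_leaf": [1, 5],
    "pre__num__impute__strategy": ["mean", "median"],
}
search = GridSearchCV(model, param_grid, cv=3, scoring="accuracy")
search.fit(toy, y)
print(search.best_params_, round(search.best_score_, 3))
```

Keeping the preprocessing inside the Pipeline means the imputers, scaler and encoder are refit on each cross-validation training fold, so no statistics leak from the validation folds into the search.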
Prerequisites
DecisionTreeClassifier
Pipeline
SimpleImputer
StandardScaler
OneHotEncoder
ColumnTransformer
GridSearchCV
Data source/summary:
Part 1:
569 instances with 30 numeric attributes. Class distribution: 212 - Malignant, 357 - Benign
Follow the link below for the full description of the dataset.
https://scikit-learn.org/stable/datasets/#breast-cancer-wisconsin-diagnostic-dataset
Part 2:
income.csv is used as the training set.
32561 instances with 14 attributes: 6 numeric (e.g. age, capital gain, hours-per-week) and 8 categorical (e.g. workclass, education, race).
income_test.csv is used for testing and reporting scores.
15315 instances with 14 attributes: 6 numeric (e.g. age, capital gain, hours-per-week) and 8 categorical (e.g. workclass, education, race).
The goal is to predict whether income exceeds $50K/yr based on census data.
Link: http://archive.ics.uci.edu/ml/datasets/Adult
(Optional) Further Links/Credits to Relevant Resources:
This exercise was assigned in the machine learning course at Aristotle University of Thessaloniki, and the solution is my submission for it.