User Forest Cover Type Prediction

This project predicts forest cover types in Colorado using a range of machine learning classifiers. It covers baseline creation, feature selection, model comparison, and tuning to improve accuracy on the Forest Cover Type Prediction dataset. It was the final project for a Master's-level Machine Learning course at the University of Ottawa (2023).

Required libraries: scikit-learn, pandas, matplotlib.
Execute cells in a Jupyter Notebook environment.
The uploaded code has been executed and tested successfully within the Google Colab environment.

Multi-class classification problem

The task is to classify each sample in the Forest Cover Type Prediction dataset into one of seven cover types: Spruce/Fir, Lodgepole Pine, Ponderosa Pine, Cottonwood/Willow, Aspen, Douglas-fir, and Krummholz.

Independent Variables:

The 54 geographical features are 'Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways', 'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm', 'Horizontal_Distance_To_Fire_Points', 'Wilderness_Area1' to 'Wilderness_Area4', and 'Soil_Type1' to 'Soil_Type40'.

Target variable:

The 'Cover Type' column is the target, with 7 classes.
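
As a minimal sketch of how the variables described above could be separated, the snippet below builds the 54 feature names and splits a DataFrame into features and target. The target column name `Cover_Type` (with an underscore, matching the feature naming style), the helper `split_features_target`, and the random stand-in data are assumptions for illustration; they are not taken from the project notebook.

```python
import numpy as np
import pandas as pd

# The 10 continuous features listed in this README.
CONTINUOUS = [
    "Elevation", "Aspect", "Slope",
    "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology",
    "Horizontal_Distance_To_Roadways",
    "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm",
    "Horizontal_Distance_To_Fire_Points",
]
# One-hot indicator columns: 4 wilderness areas and 40 soil types.
WILDERNESS = [f"Wilderness_Area{i}" for i in range(1, 5)]
SOIL = [f"Soil_Type{i}" for i in range(1, 41)]
FEATURES = CONTINUOUS + WILDERNESS + SOIL  # 10 + 4 + 40 = 54

# Assumption: the target column is spelled with an underscore in the CSV.
TARGET = "Cover_Type"


def split_features_target(df, target=TARGET):
    """Return (X, y) with the 54 feature columns and the target column."""
    return df[FEATURES], df[target]


# Random stand-in data so the sketch runs without the real dataset.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.integers(0, 2, size=(20, len(FEATURES))), columns=FEATURES
)
demo[TARGET] = rng.integers(1, 8, size=20)  # class labels 1..7

X, y = split_features_target(demo)
```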

Key Tasks Undertaken

Problem Overview:
- Create a conceptual figure showcasing the end-to-end data flow.
- Illustrate insights into the problem through data flow visualization.
Dataset Overview (EDA):
- Present numerical information about the dataset.
General Flowchart:
- Develop a detailed flowchart illustrating each step of the project's implementation.
Visualize Training and Test Sets:
- Generate t-SNE plots separately for the training and test sets to understand the problem's complexity.
  - Problem Complexity
  - Reduction and Transformation
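
The t-SNE visualization step can be sketched roughly as follows. This is an illustrative example on synthetic data from `make_classification`, not the project's notebook code; the sample size, perplexity, scaling step, and the helper names `embed` and `plot_embedding` are all assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 54 features, 7 classes (assumed sizes for a quick demo).
X, y = make_classification(
    n_samples=300, n_features=54, n_informative=10,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Fit the scaler on the training set only, then reuse it for the test set.
scaler = StandardScaler().fit(X_train)


def embed(X_part):
    """Project one split to 2D; t-SNE is fitted separately per split."""
    tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
    return tsne.fit_transform(scaler.transform(X_part))


def plot_embedding(ax, emb, labels, title):
    """Scatter the 2D embedding, colored by class label."""
    ax.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=10)
    ax.set_title(title)


emb_train = embed(X_train)
emb_test = embed(X_test)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
plot_embedding(axes[0], emb_train, y_train, "t-SNE: training set")
plot_embedding(axes[1], emb_test, y_test, "t-SNE: test set")
```

Heavily overlapping class clusters in such a plot would suggest a harder classification problem; well-separated clusters would suggest an easier one.
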
Obtain Baseline Performance:
- Apply diverse ML methods (KNN, LogisticRegression, SVM, DecisionTreeClassifier, Naive Bayes Classifier) to establish a baseline.
- Identify the champion (best-performing) model.
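
A baseline comparison over the five model families named above might look like the sketch below. It runs on synthetic data, and the hyperparameters, the choice of `GaussianNB` as the Naive Bayes variant, and the scaling pipelines are assumptions rather than the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 54-feature, 7-class dataset.
X, y = make_classification(
    n_samples=700, n_features=54, n_informative=10,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Distance- and margin-based models are wrapped with a scaler (an assumption).
models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "LogisticRegression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "NaiveBayes": GaussianNB(),
}

# Test-set accuracy in percent for each baseline.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = 100 * model.score(X_test, y_test)

# The champion is simply the model with the highest baseline accuracy.
champion = max(scores, key=scores.get)
```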

First Improvement Strategy: Feature Selection:

Implement feature selection methods, including:
- Filter selection methods (Information Gain/Mutual Information, Feature Selection, Variance Threshold, Chi-Square)
- Wrapper selection methods (Forward Feature Selection, Backward Feature Elimination, Recursive Feature Elimination)

Proceed with the best-performing feature subset and ML model for subsequent stages.

Champion Model in Filter Selection: Information Gain

   Maximum of Feature Selection-K-Nearest Neighbors: 73.96721311475409
   Best number of n_components Feature Selection-K-Nearest Neighbors: 12

   Maximum of Feature Selection-Decision Tree Classifier: 76.65573770491804
   Best number of n_components Feature Selection-Decision Tree Classifier: 8
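
The figures above report a best accuracy and a best feature count per model (the "n_components" value presumably denotes the number of selected features). A sweep of that kind can be sketched as follows: rank features once by mutual information, then evaluate a model on the top-k features for several k. The data is synthetic, and the range of k and the helper name `sweep` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 54-feature, 7-class dataset.
X, y = make_classification(
    n_samples=700, n_features=54, n_informative=10,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Filter step: score each feature on the training set only,
# then order feature indices from most to least informative.
mi = mutual_info_classif(X_train, y_train, random_state=0)
ranking = np.argsort(mi)[::-1]


def sweep(model, k_values):
    """Return (best_k, best_accuracy_percent) over the top-k feature subsets."""
    scores = {}
    for k in k_values:
        cols = ranking[:k]
        model.fit(X_train[:, cols], y_train)
        scores[k] = 100 * model.score(X_test[:, cols], y_test)
    best_k = max(scores, key=scores.get)
    return best_k, scores[best_k]


best_k, best_acc = sweep(DecisionTreeClassifier(random_state=0), range(1, 21))
```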

Champion Model in Wrapper Selection: Recursive Feature Elimination

   Maximum of Recursive_FE-K-Nearest Neighbors: 73.96721311475409
   Best number of n_components Recursive_FE-K-Nearest Neighbors: 12

   Maximum of Recursive_FE-Decision Tree Classifier: 76.26229508196721
   Best number of n_components Recursive_FE-Decision Tree Classifier: 10
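
Recursive Feature Elimination can be sketched with scikit-learn's `RFE` as below. `RFE` needs an estimator that exposes `coef_` or `feature_importances_`, which K-Nearest Neighbors does not, so this sketch ranks features with a decision tree and then evaluates both models on the selected subset. That pairing, the synthetic data, and the choice of 10 features are assumptions; the README does not say how the notebook combined RFE with KNN.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 54-feature, 7-class dataset.
X, y = make_classification(
    n_samples=500, n_features=54, n_informative=10,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Wrapper step: repeatedly fit the tree and drop the least important
# feature (step=1) until 10 features remain.
selector = RFE(
    DecisionTreeClassifier(random_state=0), n_features_to_select=10, step=1
)
selector.fit(X_train, y_train)

X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Evaluate each model on the selected subset (accuracy in percent).
results = {}
for name, model in {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
}.items():
    model.fit(X_train_sel, y_train)
    results[name] = 100 * model.score(X_test_sel, y_test)
```

Repeating this for several values of `n_features_to_select` and keeping the best score would give a sweep comparable to the one reported above.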

Adding More Machine Learning Models:
- Implement more advanced models (Random Forest and other ensemble techniques) to improve performance.
- Compare the new models' performance with the results of the first improvement stage using confusion matrices.
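
As an illustration of this stage, the sketch below trains a Random Forest and a single decision tree on synthetic data and computes a confusion matrix for each, so per-class errors can be compared side by side. The number of trees and the choice of the decision tree as the earlier-stage reference model are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 54-feature, 7-class dataset.
X, y = make_classification(
    n_samples=700, n_features=54, n_informative=10,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
labels = np.arange(7)  # fix the class order so both matrices line up

models = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

accuracies = {}
matrices = {}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    accuracies[name] = 100 * accuracy_score(y_test, y_pred)
    # Rows are true classes, columns are predicted classes.
    matrices[name] = confusion_matrix(y_test, y_pred, labels=labels)
```

Off-diagonal entries show which cover types are confused with each other; comparing the two matrices shows where the ensemble reduces those confusions.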
