This notebook demonstrates how to create a Gradient-boosted tree classifier to predict wine quality based on characteristics.
Used: Python, Spark, seaborn
Steps:
Set environment variables for the project (Settings -> Engine -> Environmental Variables)
- make sure that the following PySpark environment variables point to the same Python interpreter (and that the interpreter exists):
- PYSPARK3_PYTHON
- PYSPARK_DRIVER_PYTHON
Example:
- echo $PYSPARK3_PYTHON => /usr/local/bin/python3
- ls /usr/local/bin/python3
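The same check can be done from a Python session. The following is a minimal sketch, not part of the project: the helper name `check_pyspark_env` is hypothetical, and it only assumes the two variable names listed above.

```python
import os

# Variable names taken from the step above.
PYSPARK_VARS = ("PYSPARK3_PYTHON", "PYSPARK_DRIVER_PYTHON")


def check_pyspark_env(env=None):
    """Return a list of problems found; an empty list means the setup looks fine."""
    env = os.environ if env is None else env
    problems = []
    paths = {name: env.get(name) for name in PYSPARK_VARS}

    for name, path in paths.items():
        if not path:
            problems.append(f"{name} is not set")
        elif not os.path.isfile(path):
            problems.append(f"{name} points to a missing file: {path}")

    # Both variables should name the same interpreter.
    distinct = {path for path in paths.values() if path}
    if len(distinct) > 1:
        problems.append("variables point to different interpreters: "
                        + ", ".join(sorted(distinct)))
    return problems
```

Calling `check_pyspark_env()` with no arguments reads the variables from the current session and returns an empty list when both point to the same existing interpreter.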
Open Workbench => Editor : Workbench Engine Kernel : Python 3
The setup script does the following:
- installs Python libraries (SKLearn)
- copies data to HDFS (/tmp/wine_pred)
- creates a Hive table (wineDS_ext)
Note: assumes that Hive is set up correctly and can talk to the cluster
Start a Python 3 session using the Workbench editor, then run the setup script from either the session input window or a terminal session:
- in session : !./setup/setup.sh
- in terminal : ./setup/setup.sh
Start a Python 3 session using the Workbench editor and run "analyse.py"
Start a Python 3 session using the Workbench editor and run "fit.py"
Note: uses RandomForest from the Spark ML library, with params:
- numTrees (number of trees to train)
- maxDepth (maximum depth of each tree)
=> Copies the model to the root of the project as "spark_rf.tar"
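As a rough illustration of this step, the sketch below configures a Spark ML random forest with the two parameters above and bundles a saved model directory into a tar archive. It is not the content of "fit.py": the helper names `build_rf` and `pack_model`, the column names, the default values, and the assumption that the model was saved to a local directory are all hypothetical.

```python
import os
import tarfile


def build_rf(num_trees=20, max_depth=5):
    """Return an unfitted Spark ML random forest (requires pyspark)."""
    # Imported lazily so the packaging helper below works without Spark.
    from pyspark.ml.classification import RandomForestClassifier

    return RandomForestClassifier(
        labelCol="label",        # assumed label column name
        featuresCol="features",  # assumed feature vector column name
        numTrees=num_trees,      # number of trees in the forest
        maxDepth=max_depth,      # maximum depth of each tree
    )


def pack_model(model_dir, archive_path="spark_rf.tar"):
    """Bundle a saved model directory into an uncompressed tar archive."""
    top_level = os.path.basename(os.path.normpath(model_dir))
    with tarfile.open(archive_path, "w") as tar:
        # Store the directory under its own name rather than its full path.
        tar.add(model_dir, arcname=top_level)
    return archive_path
```

After fitting, a Spark ML model can be written with `model.write().overwrite().save(path)`; if that path is on HDFS, it would have to be copied to local disk before `pack_model` can read it.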
Start a Python 3 session using the Jupyter Notebook editor
- use at least 2 CPUs and 4 GB of RAM
- wait 15-20 seconds for the notebook to show up
In the "Jupyter Notebook" folder, run "fit_SKLEARN.ipynb"
Note: uses RandomForest from SKLearn, with params:
- estimator (number of trees to train; n_estimators in scikit-learn)
- maxDepth (maximum depth of each tree; max_depth in scikit-learn)
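For orientation, here is a minimal sketch of how these two parameters map onto scikit-learn's `RandomForestClassifier`. It is not the notebook's code: the data is synthetic (11 features, matching the length of the example inputs in this README), and the parameter values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real notebook would use the wine dataset.
X, y = make_classification(
    n_samples=200, n_features=11, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = RandomForestClassifier(
    n_estimators=50,  # number of trees
    max_depth=5,      # maximum depth of each tree
    random_state=0,
)
clf.fit(X_train, y_train)

# Mean accuracy on the held-out split.
accuracy = clf.score(X_test, y_test)
```

More trees usually reduce variance at the cost of training time, while a smaller `max_depth` limits how closely each tree can fit the training data.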
Deploy the model:
- Give the model a name and description
- Configure runtime params
- Script: "model.py"
- Function : "predict"
- (optional) give example input :
{"feature": "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4"}
{"feature": "7.3;0.65;0.0;1.2;0.065;15.0;21.0;0.9946;3.39;0.47;10.0"}
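The example inputs pack the feature values into one semicolon-separated string. The sketch below shows one way a "predict" function could unpack such an input; it is not the project's "model.py". The helpers `parse_features` and `make_predict`, the `"result"` key, and the assumption that the model exposes a scikit-learn style `predict` method are all hypothetical.

```python
def parse_features(args):
    """Turn {"feature": "7.4;0.7;..."} into a list of floats."""
    return [float(value) for value in args["feature"].split(";")]


def make_predict(model):
    """Build a predict(args) function around an already loaded model.

    `model` is assumed to offer a scikit-learn style predict() that takes
    a list of rows and returns one prediction per row.
    """
    def predict(args):
        features = parse_features(args)
        prediction = model.predict([features])[0]
        # Convert to a plain float so the result is JSON serializable.
        return {"result": float(prediction)}

    return predict
```

Wrapping the model in a closure keeps loading separate from scoring, so the model is loaded once rather than on every request.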