I am trying to build a binary classification example modeled on the Titanic problem demo, but I am getting a null value error. I have already checked the CSV file and there are no null values in it, yet the error still reports null values.
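For reference, this is roughly how I scanned the file for empty cells (a simplified sketch in plain Scala, not my actual pipeline; the path comes from the command line):

```scala
import scala.io.Source

object NullScan {
  // Count cells that are empty or the literal "null" across the given data rows
  def countEmptyCells(rows: Seq[String]): Int =
    rows.flatMap(_.split(",", -1)).count { c =>
      val t = c.trim
      t.isEmpty || t.equalsIgnoreCase("null")
    }

  def main(args: Array[String]): Unit = {
    val lines = Source.fromFile(args(0)).getLines().toList
    // Skip the header row, then count missing cells
    println(s"Empty/null cells found: ${countEmptyCells(lines.drop(1))}")
  }
}
```

Running this over my CSV reports zero empty cells.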
My main file code looks like this:
/*
* Copyright (c) 2017, Salesforce.com, Inc.
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice,
* this list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of Salesforce.com nor the names of its contributors may
* be used to endorse or promote products derived from this software without
* specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
* LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
* CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
* SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
* INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
* CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
* ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
* POSSIBILITY OF SUCH DAMAGE.
*/
package com.salesforce.hw
import com.salesforce.op._
import com.salesforce.op.evaluators.Evaluators
import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._
import com.salesforce.op.readers.DataReaders
import com.salesforce.op.stages.impl.classification.BinaryClassificationModelSelector
import com.salesforce.op.stages.impl.classification.ClassificationModelsToTry._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
/**
 * Define a case class corresponding to our data file (nullable columns must be Option types)
 *
 * @param id row id
 * @param blue_ca_1 through blue_ca_17: predictor columns
 * @param outcome binary label (1: positive class, 0: negative class)
 */
case class Passenger
(
id: Option[Int],
blue_ca_1: Option[Int],
blue_ca_2: Option[Int],
blue_ca_3: Option[Int],
blue_ca_4: Option[Int],
blue_ca_5: Option[Int],
blue_ca_6: Option[Int],
blue_ca_7: Option[Int],
blue_ca_8: Option[Int],
blue_ca_9: Option[Int],
blue_ca_10: Option[Int],
blue_ca_11: Option[Int],
blue_ca_12: Option[Int],
blue_ca_13: Option[Int],
blue_ca_14: Option[Int],
blue_ca_15: Option[Int],
blue_ca_16: Option[Int],
blue_ca_17: Option[Int],
outcome: Int
)
/**
* A simplified TransmogrifAI binary classification app, adapted from the Titanic example
*/
object OpTitanicSimple {
/**
* Run this from the command line with
* ./gradlew sparkSubmit -Dmain=com.salesforce.hw.OpTitanicSimple -Dargs=/full/path/to/csv/file
*/
def main(args: Array[String]): Unit = {
if (args.isEmpty) {
println("You need to pass in the CSV file path as an argument")
sys.exit(1)
}
val csvFilePath = args(0)
println(s"Using user-supplied CSV file path: $csvFilePath")
// Set up a SparkSession as normal
val conf = new SparkConf().setAppName(this.getClass.getSimpleName.stripSuffix("$"))
implicit val spark = SparkSession.builder.config(conf).getOrCreate()
////////////////////////////////////////////////////////////////////////////////
// RAW FEATURE DEFINITIONS
/////////////////////////////////////////////////////////////////////////////////
// Define features using the OP types based on the data
val blue_ca_1 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_1.toIntegral).asPredictor
val blue_ca_2 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_2.toIntegral).asPredictor
val blue_ca_3 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_3.toIntegral).asPredictor
val blue_ca_4 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_4.toIntegral).asPredictor
val blue_ca_5 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_5.toIntegral).asPredictor
val blue_ca_6 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_6.toIntegral).asPredictor
val blue_ca_7 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_7.toIntegral).asPredictor
val blue_ca_8 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_8.toIntegral).asPredictor
val blue_ca_9 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_9.toIntegral).asPredictor
val blue_ca_10 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_10.toIntegral).asPredictor
val blue_ca_11 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_11.toIntegral).asPredictor
val blue_ca_12 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_12.toIntegral).asPredictor
val blue_ca_13 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_13.toIntegral).asPredictor
val blue_ca_14 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_14.toIntegral).asPredictor
val blue_ca_15 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_15.toIntegral).asPredictor
val blue_ca_16 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_16.toIntegral).asPredictor
val blue_ca_17 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_17.toIntegral).asPredictor
val outcome = FeatureBuilder.RealNN[Passenger].extract(_.outcome.toRealNN).asResponse
////////////////////////////////////////////////////////////////////////////////
// TRANSFORMED FEATURES
/////////////////////////////////////////////////////////////////////////////////
// Do some basic feature engineering using knowledge of the underlying dataset
// Define a feature of type vector containing all the predictors you'd like to use
val passengerFeatures = Seq(
blue_ca_1, blue_ca_2, blue_ca_3, blue_ca_4, blue_ca_5, blue_ca_6,
blue_ca_7, blue_ca_8, blue_ca_9, blue_ca_10, blue_ca_11,
blue_ca_12, blue_ca_13, blue_ca_14, blue_ca_15, blue_ca_16, blue_ca_17
).transmogrify()
// Optionally check the features with a sanity checker
val sanityCheck = true
val finalFeatures = if (sanityCheck) outcome.sanityCheck(passengerFeatures) else passengerFeatures
// Define the model we want to use (here a simple logistic regression) and get the resulting output
val (prediction, rawPrediction, prob) =
BinaryClassificationModelSelector.withTrainValidationSplit()
.setModelsToTry(LogisticRegression)
.setInput(outcome, finalFeatures).getOutput()
val evaluator = Evaluators.BinaryClassification()
.setLabelCol(outcome)
.setRawPredictionCol(rawPrediction)
.setPredictionCol(prediction)
.setProbabilityCol(prob)
////////////////////////////////////////////////////////////////////////////////
// WORKFLOW
/////////////////////////////////////////////////////////////////////////////////
import spark.implicits._ // Needed for Encoders for the Passenger case class
// Define a way to read data into our Passenger class from our CSV file
val trainDataReader = DataReaders.Simple.csvCase[Passenger](
path = Option(csvFilePath),
key = _.id.toString
)
// Define a new workflow and attach our data reader
val workflow =
new OpWorkflow()
.setResultFeatures(outcome, rawPrediction, prob, prediction)
.setReader(trainDataReader)
// Fit the workflow to the data
val fittedWorkflow = workflow.train()
println(s"Summary: ${fittedWorkflow.summary()}")
// Manifest the result features of the workflow
println("Scoring the model")
val (dataframe, metrics) = fittedWorkflow.scoreAndEvaluate(evaluator = evaluator)
println("Transformed dataframe columns:")
dataframe.columns.foreach(println)
println("Metrics:")
println(metrics)
}
}
And the Passenger schema file (the variable-definition Avro file) looks like this:
{
"type" : "record",
"name" : "Passenger",
"namespace" : "com.salesforce.hw.tpo",
"fields" : [ {
"name" : "blue_ca_1",
"type" : [ "double", "null" ]
}, {
"name" : "blue_ca_2",
"type" : [ "double", "null" ],
"default": 0
}, {
"name" : "blue_ca_3",
"type" : [ "double", "null" ]
}, {
"name" : "blue_ca_4",
"type" : [ "double", "null" ]
}, {
"name" : "blue_ca_5",
"type" : [ "double", "null" ]
}, {
"name" : "blue_ca_6",
"type" : [ "double", "null" ]
}, {
"name" : "blue_ca_7",
"type" : [ "double", "null" ]
}, {
"name" : "blue_ca_8",
"type" : [ "double", "null" ]
}, {
"name" : "blue_ca_9",
"type" : [ "double", "null" ]
}, {
"name" : "blue_ca_10",
"type" : [ "double", "null" ]
}, {
"name" : "blue_ca_11",
"type" : [ "int", "null" ]
}, {
"name" : "blue_ca_12",
"type" : [ "int", "null" ]
}, {
"name" : "blue_ca_13",
"type" : [ "int", "null" ]
}, {
"name" : "blue_ca_14",
"type" : [ "int", "null" ]
}, {
"name" : "blue_ca_15",
"type" : [ "int", "null" ]
}, {
"name" : "blue_ca_16",
"type" : [ "int", "null" ]
}, {
"name" : "blue_ca_17",
"type" : [ "int", "null" ]
}, {
"name" : "outcome",
"type" : [ "int", "null" ]
} ]
}
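One thing I notice: the schema declares blue_ca_1 through blue_ca_10 as double, while my case class declares them as Option[Int]. As a sketch of why this might matter (plain Scala, not my actual pipeline), a typed reader converting a double-formatted cell under an Int column fails, and the failed conversion comes back as None, which would look exactly like a null value even though the CSV itself has no blanks:

```scala
import scala.util.Try

object ParseCheck {
  // Simulate how a typed CSV reader handles one cell:
  // a failed conversion surfaces as None (i.e. null in the dataset)
  def parseInt(cell: String): Option[Int] = Try(cell.trim.toInt).toOption
  def parseDouble(cell: String): Option[Double] = Try(cell.trim.toDouble).toOption

  def main(args: Array[String]): Unit = {
    val cell = "1.0" // a double-formatted value, as the schema declares
    println(s"as Double: ${parseDouble(cell)}") // parses fine
    println(s"as Int:    ${parseInt(cell)}")    // fails, comes back as None
  }
}
```

Is this kind of type mismatch between the case class and the schema a plausible source of my error?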
Here are all the files: https://github.com/monk1337/TransmogrifAI-Auto-ml
I have two questions:
Should the response variable be cast to a real (float) type? Since the model's output will be a probability over the two classes, i.e. float values, should I use this:
val outcome = FeatureBuilder.RealNN[Passenger].extract(_.outcome.toRealNN).asResponse
or this:
val outcome = FeatureBuilder.Integral[Passenger].extract(_.outcome.toIntegral).asResponse
Second, how can I solve the null value issue and run this successfully?
Thanks in advance!