Comments (31)

tovbinm avatar tovbinm commented on May 18, 2024

Thanks @wsuchy and @CodingCat

from transmogrifai.

tovbinm avatar tovbinm commented on May 18, 2024

@albertodema I don't see any explicit errors emitted except the error status. I would recommend asking on https://github.com/dmlc/xgboost/issues

albertodema avatar albertodema commented on May 18, 2024

@tovbinm thanks for your input, but they will likely ask me how TransmogrifAI is calling their module, with which parameters, etc.
The error happens both in the IntelliJ IDE and on a Spark cluster (standalone) with the Titanic dataset. Can you do a quick test with the code provided and, if you get the same error, share with me how XGBoost is being called so I can raise a proper request with the XGBoost team?
Thanks,
Alberto.

tovbinm avatar tovbinm commented on May 18, 2024

@albertodema I think it might be related to this issue: dmlc/xgboost#2449, since when we do cross-validation we train multiple models in parallel. So I tried setting the parallelism to 1, and the error still happens sometimes. My bet is that there is some race condition, which I am not sure how to track down yet.

@CodingCat might have some ideas?

CodingCat avatar CodingCat commented on May 18, 2024

which version of xgb are you using?

tovbinm avatar tovbinm commented on May 18, 2024

The latest - 0.81 with Spark 2.3.2.

CodingCat avatar CodingCat commented on May 18, 2024

Ok..we are supposed to have fixed this issue in 0.81...and I can actually run cross validation without any issue...can you provide a way to reproduce it consistently?

albertodema avatar albertodema commented on May 18, 2024

Here is the code (use commit d0785f0). The input file, passed as the args(0) parameter, is here:
https://github.com/salesforce/TransmogrifAI/blob/master/helloworld/src/main/resources/TitanicDataset/TitanicPassengersTrainData.csv

import com.salesforce.op._
import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._
import com.salesforce.op.readers.DataReaders
import com.salesforce.op.stages.impl.classification.BinaryClassificationModelsToTry.OpXGBoostClassifier
import com.salesforce.op.stages.impl.classification._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, LogManager}

/**
 * A minimal Titanic Survival example with TransmogrifAI
 */
object OpTitanicMini {

  case class Passenger
  (
    id: Long,
    survived: Double,
    pClass: Option[Long],
    name: Option[String],
    sex: Option[String],
    age: Option[Double],
    sibSp: Option[Long],
    parCh: Option[Long],
    ticket: Option[String],
    fare: Option[Double],
    cabin: Option[String],
    embarked: Option[String]
  )

  def main(args: Array[String]): Unit = {
    LogManager.getLogger("com.salesforce.op").setLevel(Level.ERROR)
    implicit val spark = SparkSession.builder.config(new SparkConf()).getOrCreate()
    import spark.implicits._

    // Read Titanic data as a DataFrame
    val pathToData = Option(args(0))
    val passengersData = DataReaders.Simple.csvCase[Passenger](pathToData, key = _.id.toString).readDataset().toDF()
   
    // Automated feature engineering
    val (survived, features) = FeatureBuilder.fromDataFrame[RealNN](passengersData, response = "survived")
    val passengerId = features.find(_.name == "id").map(_.asInstanceOf[FeatureLike[Integral]]).get
    val featureVector = features.transmogrify()

    // Automated feature selection
    val checkedFeatures = survived.sanityCheck(featureVector, checkSample = 1.0, removeBadFeatures = true)

    // Automated model selection
    val prediction = BinaryClassificationModelSelector
      .withCrossValidation(modelTypesToUse = Seq(OpXGBoostClassifier))
      .setInput(survived, checkedFeatures).getOutput()
    val model = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(passengerId, checkedFeatures, prediction).train()

    println("Model summary:\n" + model.summaryPretty())
  }

}

CodingCat avatar CodingCat commented on May 18, 2024

@albertodema I will start looking into this...where did you run this, a laptop or a cluster?

albertodema avatar albertodema commented on May 18, 2024

@CodingCat on a laptop, first with IntelliJ and then inside a Docker container. I also tried launching Spark in single-core mode, but the error happens all the same.

tovbinm avatar tovbinm commented on May 18, 2024

Here is how to reproduce. I train 10 xgboost models in parallel and it fails:

val sparse = RandomVector.sparse(RandomReal.uniform[Real](), 1000).take(10000)
val labels = RandomBinary(0.5).withProbabilityOfEmpty(0.0).take(10000).map(b => b.toDouble.toRealNN(0.0))
val sample = sparse.zip(labels).toSeq
val (data, features, label) = TestFeatureBuilder(sample)

(1 to 10).par.map { _ =>
  val x = new XGBoostClassifier().setLabelCol(label.name).setFeaturesCol(features.name)
  x.set(x.trackerConf, TrackerConf(0L, "scala"))
  val xm = x.fit(data)
  val xtransformed = xm.transform(data)
  xtransformed.show()
}

Error:

ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed

	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:364)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:294)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:256)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.immutable.List.map(List.scala:296)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:255)
	at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:200)
	at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:48)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)

CodingCat avatar CodingCat commented on May 18, 2024

so it only happens with parallel model training?

tovbinm avatar tovbinm commented on May 18, 2024

With parallel execution it is consistently reproducible. Sometimes it also comes up when training multiple models sequentially, but that is rather rare.

zhenchuan avatar zhenchuan commented on May 18, 2024

Is this question still being followed up? I also encountered the same problem.

tovbinm avatar tovbinm commented on May 18, 2024

Yes, we are aware of the problem, but we have not been able to track down the cause yet. @zhenchuan perhaps you want to look into it? This would be a very valuable contribution :)

timsetsfire avatar timsetsfire commented on May 18, 2024

Is it possibly related to dmlc/xgboost#4054?

I had instances where it would sometimes work and sometimes not (within TransmogrifAI). So I went to plain xgboost4j-spark and found the same thing (in both straight model training and cross-validation). Training would fail, and then there would be an issue with dead letters.

tovbinm avatar tovbinm commented on May 18, 2024

@timsetsfire thanks. I will give it a try. Also xgboost 0.82 is out and might have some related fixes 🤞

tovbinm avatar tovbinm commented on May 18, 2024

The same error persists with xgboost 0.82.

Here is another error of the same type: dmlc/xgboost#3418

@CodingCat any suggestions on how to overcome it?

CodingCat avatar CodingCat commented on May 18, 2024

are you actually using the Scala version of the Rabit tracker?

CodingCat avatar CodingCat commented on May 18, 2024

@chenqin

tovbinm avatar tovbinm commented on May 18, 2024

Yes, it fails with the Scala tracker (the Python implementation of the Rabit tracker on Databricks works great).

CodingCat avatar CodingCat commented on May 18, 2024

ah......scala tracker.....out of maintenance for a while......

tovbinm avatar tovbinm commented on May 18, 2024

We will update the project to the upcoming 0.83 (once available).
@albertodema for now please use TrackerConf(0L, "python") and follow the instructions on the XGBoost project page on how to set up the Python RabitTracker.
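
A minimal sketch of that workaround, adapted from the reproduction snippet earlier in this thread by swapping "scala" for "python" in TrackerConf. It assumes xgboost4j-spark is on the classpath and that the Python Rabit tracker has been set up on the driver; the object and method names (PythonTrackerExample, classifierWithPythonTracker) are made up for illustration. It configures a plain XGBoostClassifier; how the TransmogrifAI wrapper exposes this setting is not shown in this thread.

```scala
import ml.dmlc.xgboost4j.scala.spark.{TrackerConf, XGBoostClassifier}

object PythonTrackerExample {

  // Hypothetical helper: builds a classifier that uses the Python Rabit tracker
  // instead of the Scala one. The column names are supplied by the caller.
  def classifierWithPythonTracker(labelCol: String, featuresCol: String): XGBoostClassifier = {
    val xgb = new XGBoostClassifier()
      .setLabelCol(labelCol)
      .setFeaturesCol(featuresCol)
    // Same call as in the reproduction snippet, with "python" in place of "scala".
    // 0L is the worker connection timeout used in that snippet.
    xgb.set(xgb.trackerConf, TrackerConf(0L, "python"))
    xgb
  }
}
```

The returned estimator can then be fitted as in the reproduction, for example classifierWithPythonTracker(label.name, features.name).fit(data).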

shenzgang avatar shenzgang commented on May 18, 2024

Hello, I also encountered the same problem. I am using Spark 2.3.2 and xgboost4j-spark 0.90, and model training fails with ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed.

Container id: container_e03_1568625988058_7223_01_000004
Exit code: 255
Stack trace: ExitCodeException exitCode=255: 
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
	at org.apache.hadoop.util.Shell.run(Shell.java:456)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
2019-09-25 10:31:13 [WARN] Model OpXGBoostClassifier attempted in model selector with failed with following issue: 
com.salesforce.op.stages.impl.tuning.OpValidator$$anonfun$9$$anonfun$10$$anonfun$apply$1.applyOrElse(OpValidator.scala:326)
org.apache.spark.SparkException: Job 45 cancelled because SparkContext was shut down
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:837)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:835)
	at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
	at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:835)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1848)
	at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83)
	at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1761)
	at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1931)
	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1361)
	at org.apache.spark.SparkContext.stop(SparkContext.scala:1930)
	at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:573)
	at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
	at scala.util.Try$.apply(Try.scala:192)
	at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
	at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
	at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
	at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1358)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.take(RDD.scala:1331)
	at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1372)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.first(RDD.scala:1371)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:352)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:294)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:256)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.immutable.List.map(List.scala:285)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:255)
	at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:200)
	at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:48)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
	at com.salesforce.op.stages.sparkwrappers.specific.OpPredictorWrapper.fit(OpPredictorWrapper.scala:99)
	at com.salesforce.op.stages.sparkwrappers.specific.OpPredictorWrapper.fit(OpPredictorWrapper.scala:67)
	at org.apache.spark.ml.Estimator.fit(Estimator.scala:61)
	at com.salesforce.op.stages.impl.tuning.OpValidator$$anonfun$9$$anonfun$10$$anonfun$apply$3.apply(OpValidator.scala:321)
	at com.salesforce.op.stages.impl.tuning.OpValidator$$anonfun$9$$anonfun$10$$anonfun$apply$3.apply(OpValidator.scala:320)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

My code is as follows:

val (response, feature) = FeatureBuilder.fromDataFrame[RealNN](frame, label)
println(s"response = ${response}")
val features = feature.dropWhile { case x => x.name == id }
println("============== opFeatures ==============")
features.foreach(println(_))

val transmogrifyFeature = features.transmogrify()
val checkedFeature = response.sanityCheck(transmogrifyFeature, removeBadFeatures = true)

val prediction = BinaryClassificationModelSelector.withTrainValidationSplit(
  modelTypesToUse = Seq(OpXGBoostClassifier)
).setInput(response, checkedFeature).getOutput()

val evaluator = Evaluators.BinaryClassification().setLabelCol(label).setPredictionCol(prediction)
val workflow = new OpWorkflow().setInputDataset(frame, (row: Row) => row.get(0).toString).setResultFeatures(prediction)
println("============training===========")
val model = workflow.train()
println(s"Model Summary:\n ${model.summaryPretty()}")

Thank you for reading. I look forward to your reply!

tovbinm avatar tovbinm commented on May 18, 2024

@zhenchuan which TransmogrifAI version are you using?

shenzgang avatar shenzgang commented on May 18, 2024

@tovbinm Hello, I am using version 0.60

tovbinm avatar tovbinm commented on May 18, 2024

@shenzgang The XGBoost fix for this issue comes with this PR - #402. So you can either try compiling your local version of TransmogrifAI by pulling the repo, checking out the branch revert-399-mt/revert-spark-2.4 and then running ./gradlew publishToMavenLocal.

Or you can wait until we release the next version of TransmogrifAI. Perhaps @gerashegalov @leahmcguire can comment on when, to be precise.

shenzgang avatar shenzgang commented on May 18, 2024

@tovbinm Also, when I use OpNaiveBayes for binary classification with cross-validation, the following exception is thrown:
Caused by: java.lang.IllegalArgumentException: requirement failed: Naive Bayes requires nonnegative feature values but found (2078,[0,1,2,3,4,6,7,8,10,11,12,13,15,21,26,30,118,460,599,1021,1022,1348,1356,1393,1948],[0.2588190451025203,-0.9659258262890684,-0.22252093395631434,0.9749279121818236,1.0,-0.861701759948068,0.5074150932938458,1205.0,4.0,6.0,33.0,76.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]).
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.classification.NaiveBayes$.requireNonnegativeValues(NaiveBayes.scala:232)
at org.apache.spark.ml.classification.NaiveBayes$$anonfun$4.apply(NaiveBayes.scala:140)
at org.apache.spark.ml.classification.NaiveBayes$$anonfun$4.apply(NaiveBayes.scala:140)
at org.apache.spark.ml.classification.NaiveBayes$$anonfun$7.apply(NaiveBayes.scala:165)
at org.apache.spark.ml.classification.NaiveBayes$$anonfun$7.apply(NaiveBayes.scala:163)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$aggregateByKey$1$$anonfun$apply$6.apply(PairRDDFunctions.scala:172)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:189)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:188)
at org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:144)
at org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:194)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
... 3 more
Some datasets produce this error. When OpNaiveBayes training fails, will the model selector try the other algorithms and continue training, or will it throw an exception?
My code:
modelTypesToUse = Seq(
  OpLogisticRegression,
  OpRandomForestClassifier,
  OpGBTClassifier,
  OpLinearSVC,
  OpNaiveBayes,
  OpDecisionTreeClassifier
)
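
For context on the exception above: the stack trace shows Spark's NaiveBayes rejecting a row in requireNonnegativeValues because the feature vector contains negative entries. The sketch below is a stand-alone illustration, not Spark or TransmogrifAI code. It uses only the Scala standard library, copies the first eight stored entries of the vector from the error message, and lists the negative ones (indices 1, 2 and 6). The names SparseRow, negativeEntries and exampleRow are hypothetical.

```scala
object NonNegativeCheck {

  // Hypothetical stand-in for a sparse vector: total size plus the stored (index, value) pairs.
  final case class SparseRow(size: Int, indices: Array[Int], values: Array[Double])

  // Returns every stored (index, value) pair whose value is negative.
  def negativeEntries(row: SparseRow): Seq[(Int, Double)] =
    row.indices.zip(row.values).filter { case (_, value) => value < 0.0 }.toSeq

  // First eight stored entries of the vector reported in the exception message.
  val exampleRow: SparseRow = SparseRow(
    size = 2078,
    indices = Array(0, 1, 2, 3, 4, 6, 7, 8),
    values = Array(
      0.2588190451025203, -0.9659258262890684, -0.22252093395631434, 0.9749279121818236,
      1.0, -0.861701759948068, 0.5074150932938458, 1205.0
    )
  )

  def main(args: Array[String]): Unit =
    negativeEntries(exampleRow).foreach { case (index, value) =>
      println(s"feature $index is negative: $value")
    }
}
```

A check like this on the transmogrified feature vector can show which features would make a dataset unsuitable for Naive Bayes before the model selector is run.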

tovbinm avatar tovbinm commented on May 18, 2024

I think this might have been fixed in this PR - #404

Try using the TransmogrifAI 0.6.1 release.

shenzgang avatar shenzgang commented on May 18, 2024

Ok, thanks! I'll keep following transmogrifai!

leahmcguire avatar leahmcguire commented on May 18, 2024

TransmogrifAI 0.6.1 was released 2 weeks ago. Are you asking when we will release with the updated spark version?
