Comments (31)
Thanks @wsuchy and @CodingCat
from transmogrifai.
@albertodema I don't see any explicit errors emitted except the error status. I would recommend asking on https://github.com/dmlc/xgboost/issues
from transmogrifai.
@tovbinm thanks for your input, but they will likely ask me how transmogrifai is calling their module, with which parameters, etc.
The error happens both in the IntelliJ IDE and on a standalone Spark cluster, using the Titanic dataset. Could you run a quick test with the code provided and, if you get the same error, share how XGBoost is being called so I can raise a proper request with the xgboost team?
Thanks,
Alberto.
from transmogrifai.
@albertodema I think it might be related to this issue, dmlc/xgboost#2449, since when we do cross validation we train multiple models in parallel. So I tried setting the parallelism to 1, and the error still happens sometimes. So my bet is that there is some race condition, which I am not sure how to track down yet.
@CodingCat might have some ideas?
from transmogrifai.
which version of xgb are you using?
from transmogrifai.
The latest, 0.81, with Spark 2.3.2.
from transmogrifai.
Ok..we are supposed to have fixed this issue in 0.81...and I can actually run cross validation without any issue...can you provide a way to reproduce it consistently?
from transmogrifai.
Here is the code (use commit d0785f0). The input file, passed as the args(0) parameter, is here:
https://github.com/salesforce/TransmogrifAI/blob/master/helloworld/src/main/resources/TitanicDataset/TitanicPassengersTrainData.csv
import com.salesforce.op._
import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._
import com.salesforce.op.readers.DataReaders
import com.salesforce.op.stages.impl.classification.BinaryClassificationModelsToTry.OpXGBoostClassifier
import com.salesforce.op.stages.impl.classification._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, LogManager}

/**
 * A minimal Titanic Survival example with TransmogrifAI
 */
object OpTitanicMini {
  case class Passenger
  (
    id: Long,
    survived: Double,
    pClass: Option[Long],
    name: Option[String],
    sex: Option[String],
    age: Option[Double],
    sibSp: Option[Long],
    parCh: Option[Long],
    ticket: Option[String],
    fare: Option[Double],
    cabin: Option[String],
    embarked: Option[String]
  )

  def main(args: Array[String]): Unit = {
    LogManager.getLogger("com.salesforce.op").setLevel(Level.ERROR)
    implicit val spark = SparkSession.builder.config(new SparkConf()).getOrCreate()
    import spark.implicits._

    // Read Titanic data as a DataFrame
    val pathToData = Option(args(0))
    val passengersData = DataReaders.Simple.csvCase[Passenger](pathToData, key = _.id.toString).readDataset().toDF()

    // Automated feature engineering
    val (survived, features) = FeatureBuilder.fromDataFrame[RealNN](passengersData, response = "survived")
    val passengerId = features.find(_.name == "id").map(_.asInstanceOf[FeatureLike[Integral]]).get
    val featureVector = features.transmogrify()

    // Automated feature selection
    val checkedFeatures = survived.sanityCheck(featureVector, checkSample = 1.0, removeBadFeatures = true)

    // Automated model selection
    val prediction = BinaryClassificationModelSelector
      .withCrossValidation(modelTypesToUse = Seq(OpXGBoostClassifier))
      .setInput(survived, checkedFeatures).getOutput()
    val model = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(passengerId, checkedFeatures, prediction).train()
    println("Model summary:\n" + model.summaryPretty())
  }
}
from transmogrifai.
@albertodema I will start looking into this...where did you run this, a laptop or a cluster?
from transmogrifai.
@CodingCat on a laptop, first with IntelliJ and then inside a Docker container. I also tried launching Spark in single-core mode, but the same error happens.
from transmogrifai.
Here is how to reproduce. I train 10 xgboost models in parallel and it fails:
val sparse = RandomVector.sparse(RandomReal.uniform[Real](), 1000).take(10000)
val labels = RandomBinary(0.5).withProbabilityOfEmpty(0.0).take(10000).map(b => b.toDouble.toRealNN(0.0))
val sample = sparse.zip(labels).toSeq
val (data, features, label) = TestFeatureBuilder(sample)
(1 to 10).par.map { _ =>
  val x = new XGBoostClassifier().setLabelCol(label.name).setFeaturesCol(features.name)
  x.set(x.trackerConf, TrackerConf(0L, "scala"))
  val xm = x.fit(data)
  val xtransformed = xm.transform(data)
  xtransformed.show()
}
Error:
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:364)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:294)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:256)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:255)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:200)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:48)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
from transmogrifai.
so it only happens with parallel model training?
from transmogrifai.
With parallel execution it is consistently reproducible. Sometimes it also comes up when training multiple models sequentially, but that is rather rare.
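To compare the two modes described here, the same set of fits can be pushed through a thread pool whose size is a parameter, so that parallelism 1 and parallelism N run identical code. What follows is only a minimal sketch using the Scala standard library: `BoundedTraining`, `runAll` and the `fit` function are hypothetical stand-ins, not TransmogrifAI or XGBoost API.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.Try

// Hypothetical helper for illustration only.
object BoundedTraining {
  /** Runs `fit` for every id on a pool of `parallelism` threads and collects each outcome. */
  def runAll[A](ids: Seq[Int], parallelism: Int)(fit: Int => A): Seq[Try[A]] = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      // Wrap each fit in Try so that one failure does not hide the other results.
      val futures = ids.map(id => Future(Try(fit(id))))
      Await.result(Future.sequence(futures), 1.minute)
    } finally pool.shutdown()
  }
}

// Stand-in fit that fails for one id, to show how failures are reported.
val results = BoundedTraining.runAll(1 to 10, parallelism = 1) { id =>
  if (id == 3) throw new RuntimeException(s"training $id failed") else id * 2
}
println(results.count(_.isFailure)) // prints 1
```

Raising `parallelism` to 10 gives the parallel case; keeping every outcome as a `Try` makes it easy to count how often a run fails in each mode.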
from transmogrifai.
Is this question still being followed up? I also encountered the same problem.
from transmogrifai.
Yes, we are aware of the problem, but we have not been able to track down the cause yet. Perhaps you want to look into it? @zhenchuan this would be a very valuable contribution :)
from transmogrifai.
Is it possibly related to dmlc/xgboost#4054?
I had instances where it would sometimes work and sometimes wouldn't (within transmogrifai). So I went to plain xgboost-spark and found the same thing (in both straight model training and cross validation). Training would fail, and then there would be an issue with dead letters.
from transmogrifai.
@timsetsfire thanks. I will give it a try. Also, xgboost 0.82 is out and might have some related fixes 🤞
from transmogrifai.
The same error persists with xgboost 0.82.
Here is another error of the same type: dmlc/xgboost#3418
@CodingCat any suggestions on how to overcome it?
from transmogrifai.
are you actually using scala-version of rabit tracker?
from transmogrifai.
Yes, it fails with the Scala tracker (the Python implementation of the Rabit tracker on Databricks works great).
from transmogrifai.
ah......scala tracker.....out of maintenance for a while......
from transmogrifai.
We will update the project with the upcoming 0.83 (once available).
@albertodema for now please use TrackerConf(0L, "python") and follow the instructions on the XGBoost project page on how to set up the Python RabitTracker.
from transmogrifai.
Hello, I also encountered the same problem. I am using Spark 2.3.2 and xgboost4j-spark 0.90, and model training fails with ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed.
Container id: container_e03_1568625988058_7223_01_000004
Exit code: 255
Stack trace: ExitCodeException exitCode=255:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2019-09-25 10:31:13 [WARN] Model OpXGBoostClassifier attempted in model selector with failed with following issue:
com.salesforce.op.stages.impl.tuning.OpValidator$$anonfun$9$$anonfun$10$$anonfun$apply$1.applyOrElse(OpValidator.scala:326)
org.apache.spark.SparkException: Job 45 cancelled because SparkContext was shut down
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:837)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:835)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:835)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1848)
at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1761)
at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1931)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1361)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1930)
at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:573)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1358)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.take(RDD.scala:1331)
at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1372)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.first(RDD.scala:1371)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:352)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:294)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:256)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:255)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:200)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:48)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
at com.salesforce.op.stages.sparkwrappers.specific.OpPredictorWrapper.fit(OpPredictorWrapper.scala:99)
at com.salesforce.op.stages.sparkwrappers.specific.OpPredictorWrapper.fit(OpPredictorWrapper.scala:67)
at org.apache.spark.ml.Estimator.fit(Estimator.scala:61)
at com.salesforce.op.stages.impl.tuning.OpValidator$$anonfun$9$$anonfun$10$$anonfun$apply$3.apply(OpValidator.scala:321)
at com.salesforce.op.stages.impl.tuning.OpValidator$$anonfun$9$$anonfun$10$$anonfun$apply$3.apply(OpValidator.scala:320)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
My code is as follows:
val (response, feature) = FeatureBuilder.fromDataFrame[RealNN](frame, label)
println(s"response = ${response}")
val features = feature.dropWhile { case x => x.name == id }
println("============== opFeatures ==============")
features.foreach(println(_))
val transmogrifyFeature = features.transmogrify()
val checkedFeature = response.sanityCheck(transmogrifyFeature, removeBadFeatures = true)
val prediction = BinaryClassificationModelSelector.withTrainValidationSplit(
  modelTypesToUse = Seq(OpXGBoostClassifier)
).setInput(response, checkedFeature).getOutput()
val evaluator = Evaluators.BinaryClassification().setLabelCol(label).setPredictionCol(prediction)
val workflow = new OpWorkflow().setInputDataset(frame, (row: Row) => row.get(0).toString).setResultFeatures(prediction)
println("============training===========")
val model = workflow.train()
println(s"Model Summary:\n ${model.summaryPretty()}")
Thank you for reading. I look forward to your reply!
from transmogrifai.
@zhenchuan which TransmogrifAI version are you using?
from transmogrifai.
@tovbinm Hello, I am using version 0.60
from transmogrifai.
@shenzgang The XGBoost fix for this issue comes with this PR: #402. So you can either compile a local version of TransmogrifAI by pulling the repo, checking out the branch revert-399-mt/revert-spark-2.4 and then running ./gradlew publishToMavenLocal.
Or you can wait until we release the next version of TransmogrifAI. Perhaps @gerashegalov @leahmcguire can comment on when, to be precise.
from transmogrifai.
@tovbinm Also, when I use OpNaiveBayes for binary classification with cross-validation, the following exception is thrown:
Caused by: java.lang.IllegalArgumentException: requirement failed: Naive Bayes requires nonnegative feature values but found (2078,[0,1,2,3,4,6,7,8,10,11,12,13,15,21,26,30,118,460,599,1021,1022,1348,1356,1393,1948],[0.2588190451025203,-0.9659258262890684,-0.22252093395631434,0.9749279121818236,1.0,-0.861701759948068,0.5074150932938458,1205.0,4.0,6.0,33.0,76.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]).
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.classification.NaiveBayes$.requireNonnegativeValues(NaiveBayes.scala:232)
at org.apache.spark.ml.classification.NaiveBayes$$anonfun$4.apply(NaiveBayes.scala:140)
at org.apache.spark.ml.classification.NaiveBayes$$anonfun$4.apply(NaiveBayes.scala:140)
at org.apache.spark.ml.classification.NaiveBayes$$anonfun$7.apply(NaiveBayes.scala:165)
at org.apache.spark.ml.classification.NaiveBayes$$anonfun$7.apply(NaiveBayes.scala:163)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$aggregateByKey$1$$anonfun$apply$6.apply(PairRDDFunctions.scala:172)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:189)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:188)
at org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:144)
at org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:194)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
... 3 more
Some data sets produce this error. When OpNaiveBayes training fails, will the model selector continue training with the other algorithms, or will it throw an exception?
My code:
modelTypesToUse = Seq(OpLogisticRegression,
  OpRandomForestClassifier,
  OpGBTClassifier,
  OpLinearSVC,
  OpNaiveBayes,
  OpDecisionTreeClassifier)
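The `requirement failed` message above is Spark's Naive Bayes rejecting a feature vector that contains negative values. As a minimal sketch in plain Scala (no Spark dependency; `NonNegativeCheck` and `negativeEntries` are hypothetical names, not Spark or TransmogrifAI API), this is one way to see which entries of the reported sparse vector trip that check:

```scala
// Hypothetical helper for illustration only.
object NonNegativeCheck {
  /** Returns the (index, value) pairs of a sparse vector whose value is negative. */
  def negativeEntries(indices: Seq[Int], values: Seq[Double]): Seq[(Int, Double)] = {
    require(indices.length == values.length, "indices and values must have the same length")
    indices.zip(values).filter { case (_, v) => v < 0.0 }
  }
}

// First seven entries of the vector quoted in the error message above.
val indices = Seq(0, 1, 2, 3, 4, 6, 7)
val values = Seq(0.2588190451025203, -0.9659258262890684, -0.22252093395631434,
  0.9749279121818236, 1.0, -0.861701759948068, 0.5074150932938458)

println(NonNegativeCheck.negativeEntries(indices, values).map(_._1)) // prints List(1, 2, 6)
```

Running a check of this kind on the feature vector before model selection shows whether a data set can be passed to Naive Bayes at all.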
from transmogrifai.
I think this might have been fixed in this PR - #404
Try using TransmogrifAI 0.6.1 release
from transmogrifai.
Ok, thanks! I'll keep following transmogrifai!
from transmogrifai.
TransmogrifAI 0.6.1 was released 2 weeks ago. Are you asking when we will release with the updated spark version?
from transmogrifai.