intel-analytics / analytics-zoo
Distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray
Home Page: https://analytics-zoo.readthedocs.io/
License: Apache License 2.0
Py4JJavaError Traceback (most recent call last)
in ()
----> 1 get_ipython().run_cell_magic(u'time', u'', u'# Boot training process\nlenet_model.fit(x=train_data,\n batch_size=2048,\n nb_epoch=20,\n validation_data=test_data)')
/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.pyc in run_cell_magic(self, magic_name, line, cell)
2115 magic_arg_s = self.var_expand(line, stack_depth)
2116 with self.builtin_trap:
-> 2117 result = fn(magic_arg_s, cell)
2118 return result
2119
in time(self, line, cell, local_ns)
/usr/local/lib/python2.7/dist-packages/IPython/core/magic.pyc in (f, *a, **k)
186 # but it's overkill for just that one bit of state.
187 def magic_deco(arg):
--> 188 call = lambda f, *a, **k: f(*a, **k)
189
190 if callable(arg):
/usr/local/lib/python2.7/dist-packages/IPython/core/magics/execution.pyc in time(self, line, cell, local_ns)
1187 if mode=='eval':
1188 st = clock2()
-> 1189 out = eval(code, glob, local_ns)
1190 end = clock2()
1191 else:
in ()
/tmp/zoo/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/zoo/pipeline/api/keras/engine/topology.py in fit(self, x, y, batch_size, nb_epoch, validation_data, distributed)
161 batch_size,
162 nb_epoch,
--> 163 validation_data)
164 else:
165 if validation_data:
/tmp/zoo/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py in callBigDlFunc(bigdl_type, name, *args)
586 error = e
587 if "does not exist" not in str(e):
--> 588 raise e
589 else:
590 return result
Py4JJavaError: An error occurred while calling o25.zooFit.
: java.lang.ExceptionInInitializerError
at com.intel.analytics.bigdl.optim.DistriOptimizer.optimize(DistriOptimizer.scala:893)
at com.intel.analytics.zoo.pipeline.api.keras.models.KerasNet.fit(Topology.scala:204)
at com.intel.analytics.zoo.pipeline.api.keras.models.KerasNet.fit(Topology.scala:220)
at com.intel.analytics.zoo.pipeline.api.keras.python.PythonZooKeras.zooFit(PythonZooKeras.scala:86)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException
at java.util.concurrent.ThreadPoolExecutor.&lt;init&gt;(ThreadPoolExecutor.java:1314)
at java.util.concurrent.ThreadPoolExecutor.&lt;init&gt;(ThreadPoolExecutor.java:1237)
at java.util.concurrent.Executors.newFixedThreadPool(Executors.java:151)
at com.intel.analytics.bigdl.parameters.AllReduceParameter$.&lt;init&gt;(AllReduceParameter.scala:47)
at com.intel.analytics.bigdl.parameters.AllReduceParameter$.&lt;clinit&gt;(AllReduceParameter.scala)
... 15 more
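The IllegalArgumentException comes from ThreadPoolExecutor's constructor, which rejects a non-positive thread count; AllReduceParameter sizes its pool from the detected node/core count, so this likely means that count resolved to zero (an inference, not a confirmed diagnosis for this report). Python's ThreadPoolExecutor enforces the same precondition, which makes the failure mode easy to sketch:

```python
from concurrent.futures import ThreadPoolExecutor

# Java's Executors.newFixedThreadPool(n) throws IllegalArgumentException
# for n <= 0, just as Python's ThreadPoolExecutor rejects max_workers <= 0.
# A detected core count of 0 would trip this in the static initializer.
detected_cores = 0  # hypothetical value illustrating the failure
try:
    pool = ThreadPoolExecutor(max_workers=detected_cores)
except ValueError as e:
    print("rejected:", e)
```

If this is the cause, verifying that the Spark conf reports a positive executor core count before calling fit should rule it out.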
We should copy this doc generation job into PR validation; otherwise the nightly build would fail even though there's no exception in PR validation.
In the doc (https://analytics-zoo.github.io/master/#ScalaUserGuide/install/), the Spark name should be lower case:
[SPARK_1.6.2|SPARK_2.1.1|SPARK_2.2.0|SPARK_2.3.1]
should be
[spark_1.6.2|spark_2.1.1|spark_2.2.0|spark_2.3.1]
I used the model=Net.load_keras("file:///"+BASE_PATH+"/cpu_300000_arda_model.json") API to load a model trained with Keras 1.2.2 on CPU, and it throws this error:
Traceback (most recent call last):
File "/root/workspace/analytics-zoo/pyzoo/zoo/examples/flappybird/flappybird_qlearning.py", line 225, in
main()
File "/root/workspace/analytics-zoo/pyzoo/zoo/examples/flappybird/flappybird_qlearning.py", line 217, in main
playGame(args)
File "/root/workspace/analytics-zoo/pyzoo/zoo/examples/flappybird/flappybird_qlearning.py", line 211, in playGame
trainNetwork(model,args)
File "/root/workspace/analytics-zoo/pyzoo/zoo/examples/flappybird/flappybird_qlearning.py", line 102, in trainNetwork
model=Net.load_keras("file:///"+BASE_PATH+"/gpu_keras1_2_model.json","file:///"+BASE_PATH+"/model_gpu_1540000.h5")
File "/root/workspace/analytics-zoo/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/zoo/pipeline/api/net.py", line 178, in load_keras
File "/root/workspace/analytics-zoo/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/nn/layer.py", line 791, in load_keras
File "/root/workspace/analytics-zoo/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/keras/converter.py", line 59, in load_weights_from_json_hdf5
File "/root/workspace/analytics-zoo/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/keras/converter.py", line 368, in from_json_path
File "/root/workspace/analytics-zoo/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/keras/converter.py", line 372, in from_json_str
File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 213, in model_from_json
return layer_from_config(config, custom_objects=custom_objects)
File "/usr/local/lib/python2.7/dist-packages/keras/utils/layer_utils.py", line 27, in layer_from_config
class_name = config['class_name']
TypeError: string indices must be integers
The same error occurs when I load the model trained with Keras 1.2.2 on GPU.
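For what it's worth, "string indices must be integers" at config['class_name'] means model_from_json parsed the file to a str rather than a dict, which is what happens when the JSON is double-encoded. A minimal reproduction of that failure mode (an assumption about the cause, for illustration only):

```python
import json

config = {"class_name": "Sequential", "config": []}

# One level of encoding round-trips to a dict, as model_from_json expects.
ok = json.loads(json.dumps(config))
print(ok["class_name"])  # Sequential

# Double-encoding yields a str, and indexing a str with "class_name"
# raises exactly the TypeError shown in the traceback above.
broken = json.loads(json.dumps(json.dumps(config)))
try:
    broken["class_name"]
except TypeError as e:
    print("TypeError:", e)
```

Inspecting whether json.loads(open(json_path).read()) returns a dict or a str for the saved model file would confirm or refute this.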
When I use TFOptimizer to train a TensorFlow model built with slim, I get an error.
import numpy as np
import tensorflow as tf
import tensorflow.contrib.slim as slim
from tensorflow.contrib.slim.nets import resnet_v1
from zoo.pipeline.api.net import TFDataset, TFOptimizer
from bigdl.optim.optimizer import Adam, MaxEpoch, TrainSummary

# sc, images, labels, SIZE_W, SIZE_H and label_to_num are defined earlier in my script
x_rdd = sc.parallelize(images)
y_rdd = sc.parallelize(labels)
train_rdd = x_rdd.zip(y_rdd).map(lambda rec_tuple: [rec_tuple[0], np.array(rec_tuple[1])])
dataset = TFDataset.from_rdd(train_rdd,
                             names=["features", "label"],
                             shapes=[[SIZE_W, SIZE_H, 3], [1]],
                             types=[tf.float32, tf.int32])
data_images, data_labels = dataset.tensors
squeezed_labels = tf.squeeze(data_labels)
with slim.arg_scope(resnet_v1.resnet_arg_scope()):
    logits, end_points = resnet_v1.resnet_v1_200(data_images, num_classes=len(label_to_num), is_training=True)
loss = tf.reduce_mean(tf.losses.sparse_softmax_cross_entropy(logits=logits, labels=squeezed_labels))

optimizer = TFOptimizer(loss, Adam(1e-3))
optimizer.set_train_summary(TrainSummary("/tmp/resnet_v2", "train"))
optimizer.optimize(end_trigger=MaxEpoch(5))
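tf.losses.sparse_softmax_cross_entropy requires rank-1 integer labels whose batch dimension matches the logits' batch dimension, which is likely the check behind the "Incompatible shapes: [0] vs. [3]" assert in the error below. A minimal NumPy sketch of that shape contract (an illustration of the requirement, not the TF implementation):

```python
import numpy as np

def sparse_softmax_xent(logits, labels):
    """logits: [batch, num_classes]; labels: [batch] of integer class ids."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    # The same rank/batch-dimension checks TF asserts before computing the loss.
    assert logits.ndim == 2 and labels.ndim == 1
    assert logits.shape[0] == labels.shape[0], "batch dims must match"
    # Numerically stable log-softmax, then pick the label's log-probability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

print(sparse_softmax_xent([[0.0, 0.0]], [0]))  # ln(2) ~= 0.6931
```

Printing the static shapes of logits and squeezed_labels before building the loss is a quick way to see which side collapses to rank 0.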
I run https://github.com/intel-analytics/analytics-zoo/blob/5212eb75956965fbedc64a0f0bb563bfc0b855b6/pyzoo/zoo/examples/tensorflow/distributed_training/train_lenet.py and get the same error.
Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most recent failure: Lost task 0.0 in stage 9.0 (TID 44, localhost, executor driver): java.util.concurrent.ExecutionException: Layer info: TFTrainingHelper[44456754]/TFNet[5a094281]
java.lang.IllegalArgumentException: Incompatible shapes: [0] vs. [3]
[[Node: sparse_softmax_cross_entropy_loss/xentropy/assert_equal/Equal = Equal[T=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](sparse_softmax_cross_entropy_loss/xentropy/Shape_1, sparse_softmax_cross_entropy_loss/xentropy/strided_slice)]]
at org.tensorflow.Session.run(Native Method)
at org.tensorflow.Session.access$100(Session.java:48)
at org.tensorflow.Session$Runner.runHelper(Session.java:298)
at org.tensorflow.Session$Runner.run(Session.java:248)
at com.intel.analytics.zoo.pipeline.api.net.TFNet.updateOutput(TFNet.scala:252)
at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:257)
at com.intel.analytics.zoo.pipeline.api.net.TFTrainingHelper.updateOutput(TFTrainingHelper.scala:100)
at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:257)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply$mcI$sp(DistriOptimizer.scala:252)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply(DistriOptimizer.scala:245)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply(DistriOptimizer.scala:245)
at com.intel.analytics.bigdl.utils.ThreadPool$$anonfun$1$$anon$4.call(ThreadPool.scala:112)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$8.apply(DistriOptimizer.scala:264)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$8.apply(DistriOptimizer.scala:264)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4.apply(DistriOptimizer.scala:264)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4.apply(DistriOptimizer.scala:202)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: Layer info: TFTrainingHelper[44456754]/TFNet[5a094281]
java.lang.IllegalArgumentException: Incompatible shapes: [0] vs. [3]
[[Node: sparse_softmax_cross_entropy_loss/xentropy/assert_equal/Equal = Equal[T=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](sparse_softmax_cross_entropy_loss/xentropy/Shape_1, sparse_softmax_cross_entropy_loss/xentropy/strided_slice)]]
at org.tensorflow.Session.run(Native Method)
at org.tensorflow.Session.access$100(Session.java:48)
at org.tensorflow.Session$Runner.runHelper(Session.java:298)
at org.tensorflow.Session$Runner.run(Session.java:248)
at com.intel.analytics.zoo.pipeline.api.net.TFNet.updateOutput(TFNet.scala:252)
at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:257)
at com.intel.analytics.zoo.pipeline.api.net.TFTrainingHelper.updateOutput(TFTrainingHelper.scala:100)
at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:257)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply$mcI$sp(DistriOptimizer.scala:252)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply(DistriOptimizer.scala:245)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply(DistriOptimizer.scala:245)
at com.intel.analytics.bigdl.utils.ThreadPool$$anonfun$1$$anon$4.call(ThreadPool.scala:112)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:263)
at com.intel.analytics.zoo.pipeline.api.net.TFTrainingHelper.updateOutput(TFTrainingHelper.scala:100)
at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:257)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply$mcI$sp(DistriOptimizer.scala:252)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply(DistriOptimizer.scala:245)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply(DistriOptimizer.scala:245)
at com.intel.analytics.bigdl.utils.ThreadPool$$anonfun$1$$anon$4.call(ThreadPool.scala:112)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
Driver stacktrace:
I get an error in AbstractInferenceModel here
modelQueue = new LinkedBlockingQueue<>(supportedConcurrentNum);
error: Diamond types are not supported at language level '5'
But my language level has been set to 8 in the project structure.
>>> import zoo
2018-05-14 14:00:00 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2018-05-14 14:00:00 WARN Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2018-05-14 14:00:00 WARN Utils:66 - Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/zoo/__init__.py", line 25, in <module>
check_version()
File "/usr/local/lib/python2.7/dist-packages/zoo/common/nncontext.py", line 46, in check_version
_check_spark_version(sc, report_warn)
File "/usr/local/lib/python2.7/dist-packages/zoo/common/nncontext.py", line 58, in _check_spark_version
version_info = _get_bigdl_verion_conf()
File "/usr/local/lib/python2.7/dist-packages/zoo/common/nncontext.py", line 105, in _get_bigdl_verion_conf
" is located in zoo/target/extra-resources")
RuntimeError: Error while locating file zoo-version-info.properties, please make sure the mvn generate-resources phase is executed and a zoo-version-info.properties file is located in zoo/target/extra-resources
We met errors when using examples/imageclassification/Predict.scala to predict with Inception v1 on the ImageNet validation set: it reported java.lang.IllegalArgumentException for 10k images and java.lang.ArrayIndexOutOfBoundsException for 5k images, while predicting 1000 images passes.
Execution script:
#!/bin/sh
master="local[28]"
modelPath=/mnt/disk1/analytics-zoo-dataset/imageclassification/analytics-zoo_inception-v1_imagenet_0.1.0
imagePath=/mnt/disk1/analytics-zoo-dataset/imageclassification/imagenet/
ZOO_HOME=/root/analytics-zoo
ZOO_JAR_PATH=${ZOO_HOME}/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-jar-with-dependencies.jar
spark-submit \
--verbose \
--master $master \
--conf spark.executor.cores=28 \
--conf spark.driver.maxResultSize=6g \
--total-executor-cores 28 \
--driver-memory 200g \
--executor-memory 40g \
--class com.intel.analytics.zoo.examples.imageclassification.Predict \
${ZOO_JAR_PATH} -f $imagePath --model $modelPath --partition 28 --topN 5
error when predicting 10000 images:
2018-05-24 15:07:37 INFO ThreadPool$:79 - Set mkl threads to 1 on thread 1
2018-05-24 15:07:39 INFO Engine$:103 - Auto detect executor number and executor cores number
2018-05-24 15:07:39 INFO Engine$:105 - Executor number is 1 and executor cores number is 28
2018-05-24 15:07:39 INFO Engine$:373 - Find existing spark context. Checking the spark conf...
[Stage 0:===============> (3 + 8) / 11]2018-05-24 15:10:57 ERROR Executor:91 - Exception in task 3.0 in stage 0.0 (TID 3)
Layer info: ImageClassifier[analytics-zoo_inception-v1_imagenet_0.1.0]/SpatialConvolution[conv1/7x7_s2](3 -> 64, 7 x 7, 2, 2, 3, 3)
java.lang.IllegalArgumentException: requirement failed: input channel size 2 is not the same as nInputPlane 3
at scala.Predef$.require(Predef.scala:224)
at com.intel.analytics.bigdl.nn.SpatialConvolution.updateOutput(SpatialConvolution.scala:262)
at com.intel.analytics.bigdl.nn.SpatialConvolution.updateOutput(SpatialConvolution.scala:54)
at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:243)
at com.intel.analytics.bigdl.nn.StaticGraph.updateOutput(StaticGraph.scala:59)
at com.intel.analytics.zoo.models.common.ZooModel.updateOutput(ZooModel.scala:79)
at com.intel.analytics.zoo.models.common.ZooModel.updateOutput(ZooModel.scala:79)
at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:243)
at com.intel.analytics.bigdl.optim.Predictor$$anonfun$predictSamples$1.apply(Predictor.scala:67)
at com.intel.analytics.bigdl.optim.Predictor$$anonfun$predictSamples$1.apply(Predictor.scala:66)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:800)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at com.intel.analytics.bigdl.optim.Predictor$.predictImageBatch(Predictor.scala:48)
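The "input channel size 2 is not the same as nInputPlane 3" failure suggests some images in the larger sample decode to fewer than three channels (the ImageNet validation set is known to contain a handful of grayscale and CMYK JPEGs). A sketch of normalizing arrays to three channels before prediction (the helper name and approach are mine, not part of the example code):

```python
import numpy as np

def to_three_channels(img):
    """Coerce an HxW or HxWxC array to HxWx3 for a conv expecting nInputPlane=3."""
    if img.ndim == 2:          # pure grayscale: add a channel axis
        img = img[:, :, None]
    if img.shape[2] < 3:       # tile existing channel(s) up to three
        img = np.concatenate([img] * 3, axis=2)
    return img[:, :, :3]       # drop extras such as an alpha channel

print(to_three_channels(np.zeros((2, 2))).shape)     # (2, 2, 3)
print(to_three_channels(np.zeros((2, 2, 4))).shape)  # (2, 2, 3)
```

If the 10k sample does contain such images, filtering or converting them in the image preprocessing step should make the channel counts consistent.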
error when predicting 5000 images:
[Stage 0:> (0 + 4) / 5]
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 0.0 failed 1 times, most recent failure: Lost task 4.0 in stage 0.0 (TID 4, localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at com.intel.analytics.bigdl.tensor.TensorNumericMath$TensorNumeric$NumericFloat$.arraycopy$mcF$sp(TensorNumeric.scala:721)
at com.intel.analytics.bigdl.tensor.TensorNumericMath$TensorNumeric$NumericFloat$.arraycopy(TensorNumeric.scala:715)
at com.intel.analytics.bigdl.tensor.TensorNumericMath$TensorNumeric$NumericFloat$.arraycopy(TensorNumeric.scala:503)
at com.intel.analytics.bigdl.dataset.MiniBatch$.copy(MiniBatch.scala:460)
at com.intel.analytics.bigdl.dataset.MiniBatch$.copyWithPadding(MiniBatch.scala:380)
at com.intel.analytics.bigdl.dataset.ArrayTensorMiniBatch.set(MiniBatch.scala:209)
at com.intel.analytics.bigdl.dataset.ArrayTensorMiniBatch.set(MiniBatch.scala:111)
at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:348)
at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:323)
I have migrated the Keras version of flappybird to the zoo.keras version. I use model.fit(state_t, targets) to train the model in distributed mode with the default batch size of 32, and I submit the code with submit-spark-with-zoo.sh, which is:
#!/bin/bash
export SPARK_HOME=/opt/work/spark-2.1.1-bin-hadoop2.7
export MASTER=spark://Almaren-Node-075:7077
export FTP_URI=$FTP_URI
export ANALYTICS_ZOO_HOME=/root/workspace/analytics-zoo
export ANALYTICS_ZOO_HOME_DIST=$ANALYTICS_ZOO_HOME/dist
export ANALYTICS_ZOO_JAR=$(find ${ANALYTICS_ZOO_HOME_DIST}/lib -type f -name "analytics-zoo*jar-with-dependencies.jar")
export ANALYTICS_ZOO_PYZIP=$(find ${ANALYTICS_ZOO_HOME_DIST}/lib -type f -name "analytics-zoo*python-api.zip")
export ANALYTICS_ZOO_CONF=${ANALYTICS_ZOO_HOME_DIST}/conf/spark-analytics-zoo.conf
export PYTHONPATH=${ANALYTICS_ZOO_PYZIP}:$PYTHONPATH
if [ -z "${ANALYTICS_ZOO_HOME}" ]; then
echo "Please set ANALYTICS_ZOO_HOME environment variable"
exit 1
fi
if [ -z "${SPARK_HOME}" ]; then
echo "Please set SPARK_HOME environment variable"
exit 1
fi
if [ ! -f ${ANALYTICS_ZOO_CONF} ]; then
echo "Cannot find ${ANALYTICS_ZOO_CONF}"
exit 1
fi
if [ ! -f ${ANALYTICS_ZOO_PYZIP} ]; then
echo "Cannot find ${ANALYTICS_ZOO_PYZIP}"
exit 1
fi
if [ ! -f ${ANALYTICS_ZOO_JAR} ]; then
echo "Cannot find ${ANALYTICS_ZOO_JAR}"
exit 1
fi
${SPARK_HOME}/bin/spark-submit \
    --master ${MASTER} \
    --driver-cores 32 \
    --driver-memory 180g \
    --total-executor-cores 128 \
    --executor-cores 32 \
    --executor-memory 180g \
    --properties-file ${ANALYTICS_ZOO_CONF} \
    --py-files ${ANALYTICS_ZOO_PYZIP},${ANALYTICS_ZOO_HOME}/pyzoo/zoo/examples/flappybird/flappybird_qlearning.py \
    --jars ${ANALYTICS_ZOO_JAR} \
    --conf spark.driver.extraClassPath=${ANALYTICS_ZOO_JAR} \
    --conf spark.executor.extraClassPath=${ANALYTICS_ZOO_JAR} \
    ${ANALYTICS_ZOO_HOME}/pyzoo/zoo/examples/flappybird/flappybird_qlearning.py \
    -m "Train" \
    $*
but once the training process begins, it throws this error:
TIMESTEP 400 / STATE observe / EPSILON 0.1 / ACTION 0 / REWARD 0.1 / Loss 0
TIMESTEP 401 / STATE explore / EPSILON 0.1 / ACTION 0 / REWARD 0.1 / Loss 0
2018-05-22 15:50:13 INFO DistriOptimizer$:871 - caching training rdd ...
2018-05-22 15:50:14 INFO DistriOptimizer$:664 - Cache thread models...
2018-05-22 15:50:14 INFO DistriOptimizer$:666 - Cache thread models... done
2018-05-22 15:50:14 INFO DistriOptimizer$:136 - Count dataset
2018-05-22 15:50:15 INFO DistriOptimizer$:140 - Count dataset complete. Time elapsed: 0.110093395s
2018-05-22 15:50:15 INFO DistriOptimizer$:148 - config {
maxDropPercentage: 0.0
computeThresholdbatchSize: 100
warmupIterationNum: 200
isLayerwiseScaled: false
dropPercentage: 0.0
}
2018-05-22 15:50:15 INFO DistriOptimizer$:152 - Shuffle data
2018-05-22 15:50:15 INFO DistriOptimizer$:155 - Shuffle data complete. Takes 0.031867857s
2018-05-22 15:50:15 ERROR TaskSetManager:70 - Task 1 in stage 10.0 failed 4 times; aborting job
2018-05-22 15:50:15 ERROR TaskSetManager:70 - Task 1 in stage 10.0 failed 4 times; aborting job
2018-05-22 15:50:15 ERROR DistriOptimizer$:939 - Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 10.0 failed 4 times, most recent failure: Lost task 1.3 in stage 10.0 (TID 284, 172.16.0.178, executor 3): java.lang.ArithmeticException: / by zero
at com.intel.analytics.bigdl.dataset.CachedDistriDataSet$$anonfun$data$2$$anon$2.next(DataSet.scala:280)
at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:331)
at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:323)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$5.apply(DistriOptimizer.scala:211)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$5.apply(DistriOptimizer.scala:202)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1988)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1026)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:1008)
at com.intel.analytics.bigdl.optim.DistriOptimizer$.optimize(DistriOptimizer.scala:312)
at com.intel.analytics.bigdl.optim.DistriOptimizer.optimize(DistriOptimizer.scala:914)
at com.intel.analytics.zoo.pipeline.api.keras.models.KerasNet.fit(Topology.scala:227)
at com.intel.analytics.zoo.pipeline.api.keras.models.KerasNet.fit(Topology.scala:249)
at com.intel.analytics.zoo.pipeline.api.keras.python.PythonZooKeras.zooFit(PythonZooKeras.scala:91)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArithmeticException: / by zero
at com.intel.analytics.bigdl.dataset.CachedDistriDataSet$$anonfun$data$2$$anon$2.next(DataSet.scala:280)
at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:331)
at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:323)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$5.apply(DistriOptimizer.scala:211)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$5.apply(DistriOptimizer.scala:202)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
Traceback (most recent call last):
File "/root/workspace/analytics-zoo/pyzoo/zoo/examples/flappybird/flappybird_qlearning.py", line 226, in
main()
File "/root/workspace/analytics-zoo/pyzoo/zoo/examples/flappybird/flappybird_qlearning.py", line 218, in main
playGame(args)
File "/root/workspace/analytics-zoo/pyzoo/zoo/examples/flappybird/flappybird_qlearning.py", line 212, in playGame
trainNetwork(model,args)
File "/root/workspace/analytics-zoo/pyzoo/zoo/examples/flappybird/flappybird_qlearning.py", line 167, in trainNetwork
model.fit(state_t,targets)
File "/root/workspace/analytics-zoo/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/zoo/pipeline/api/keras/engine/topology.py", line 162,
File "/root/workspace/analytics-zoo/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py", line 588, in callBigDlFunc
py4j.protocol.Py4JJavaError: An error occurred while calling o35.zooFit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 10.0 failed 4 times, most recent failure: Lost task 1.3 in stage 10.0 (TID 284, 172.16.0.178, executor 3): java.lang.ArithmeticException: / by zero
at com.intel.analytics.bigdl.dataset.CachedDistriDataSet$$anonfun$data$2$$anon$2.next(DataSet.scala:280)
at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:331)
at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:323)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$5.apply(DistriOptimizer.scala:211)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$5.apply(DistriOptimizer.scala:202)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1988)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1026)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:1008)
at com.intel.analytics.bigdl.optim.DistriOptimizer$.optimize(DistriOptimizer.scala:312)
at com.intel.analytics.bigdl.optim.DistriOptimizer.optimize(DistriOptimizer.scala:914)
at com.intel.analytics.zoo.pipeline.api.keras.models.KerasNet.fit(Topology.scala:227)
at com.intel.analytics.zoo.pipeline.api.keras.models.KerasNet.fit(Topology.scala:249)
at com.intel.analytics.zoo.pipeline.api.keras.python.PythonZooKeras.zooFit(PythonZooKeras.scala:91)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArithmeticException: / by zero
at com.intel.analytics.bigdl.dataset.CachedDistriDataSet$$anonfun$data$2$$anon$2.next(DataSet.scala:280)
at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:331)
at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:323)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$5.apply(DistriOptimizer.scala:211)
at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$5.apply(DistriOptimizer.scala:202)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
I also set the batch size to 128 with the same submit parameters, and encountered the same problem as before.
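A "/ by zero" at DataSet.scala:280 plus BigDL's batch-size requirement suggests two quick pre-flight checks before calling fit. The sketch below is hypothetical helper code, not part of the Zoo API; the empty-partition cause is an assumption inferred from the stack trace, not confirmed.

```python
# Hypothetical pre-flight checks, not Zoo API. BigDL's CachedDistriDataSet
# indexes cached samples modulo the partition length, so an empty partition
# can surface as "/ by zero" (an assumption, not confirmed from the trace).

def find_empty_partitions(partition_sizes):
    """Return indices of partitions that hold zero samples."""
    return [i for i, n in enumerate(partition_sizes) if n == 0]

def check_batch_size(batch_size, node_number, core_number):
    """BigDL requires the global batch size to divide by nodes * cores."""
    return batch_size % (node_number * core_number) == 0
```

If either check fails, repartitioning the data or adjusting the batch size is worth trying before digging deeper.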
https://github.com/intel-analytics/analytics-zoo/blob/master/pyzoo/zoo/examples/run-example-tests.sh
Change ANALYTICS_ZOO_HOME to ANALYTICS_ZOO_ROOT, and ANALYTICS_ZOO_HOME_DIST to ANALYTICS_ZOO_HOME, to keep the naming consistent and avoid confusion.
From zoo created by changlinzhang : intel-analytics/zoo#187
When I migrated the object detection example from the BigDL API to the Zoo API, it failed to call init_engine() in the code:
from bigdl.util.common import *
...
JavaCreator.set_creator_class("com.intel.analytics.zoo.models.pythonapi.PythonModels")
init_engine()
using
${SPARK_HOME}/bin/pyspark --properties-file ${BIGDL_CONF} --py-files ${ZOO_PY_ZIP} --jars ${ZOO_JAR} ...
The error message is as follows:
TypeError Traceback (most recent call last)
<ipython-input-2-eda297cc30af> in <module>()
26
27 JavaCreator.set_creator_class("com.intel.analytics.zoo.models.pythonapi.PythonModels")
---> 28 init_engine()
/tmp/spark-8206877b-9835-45ee-b27f-e7dd68d068cd/userFiles-a958e415-3434-4132-91ca-d18bce0ad9b8/zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py in init_engine(bigdl_type)
415
416 def init_engine(bigdl_type="float"):
--> 417 callBigDlFunc(bigdl_type, "initEngine")
418 # Spark context is supposed to have been created when init_engine is called
419 get_spark_context()._jvm.org.apache.spark.bigdl.api.python.BigDLSerDe.initialize()
/tmp/spark-8206877b-9835-45ee-b27f-e7dd68d068cd/userFiles-a958e415-3434-4132-91ca-d18bce0ad9b8/zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py in callBigDlFunc(bigdl_type, name, *args)
577 gateway = _get_gateway()
578 error = Exception("Cannot find function: %s" % name)
--> 579 for jinvoker in JavaCreator.instance(bigdl_type, gateway).value:
580 # hasattr(jinvoker, name) always return true here,
581 # so you need to invoke the method to check if it exist or not
/tmp/spark-8206877b-9835-45ee-b27f-e7dd68d068cd/userFiles-a958e415-3434-4132-91ca-d18bce0ad9b8/zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py in instance(cls, bigdl_type, *args)
54 with cls._lock:
55 if not cls._instance:
---> 56 cls._instance = cls(bigdl_type, *args)
57 return cls._instance
58
/tmp/spark-8206877b-9835-45ee-b27f-e7dd68d068cd/userFiles-a958e415-3434-4132-91ca-d18bce0ad9b8/zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py in __init__(self, bigdl_type, gateway)
91 jclass = getattr(gateway.jvm, creator_class)
92 if bigdl_type == "float":
---> 93 self.value.append(getattr(jclass, "ofFloat")())
94 elif bigdl_type == "double":
95 self.value.append(getattr(jclass, "ofDouble")())
TypeError: 'JavaPackage' object is not callable
Environment info:
The Spark version is 1.6, and Zoo was compiled against Spark 1.6.
Following the instructions here to try SSD with BigDL, I found many NPEs in the executor log; the command used is as below:
spark-submit --master spark://bb-node1:7077 --executor-cores 5 --num-executors 10 --total-executor-cores 50 --driver-memory 30G --executor-memory 200G --driver-class-path /mnt/disk1/SSD_Predict/models/ssd/jars/object-detection-0.1-SNAPSHOT-jar-with-dependencies-and-spark.jar --class com.intel.analytics.zoo.pipeline.ssd.example.Predict /mnt/disk1/SSD_Predict/models/ssd/jars/object-detection-0.1-SNAPSHOT-jar-with-dependencies-and-spark.jar -f hdfs://bb-node1:8020/dlbenchmark/data/PASCAL/seq/test/ --folderType seq --caffeDefPath /mnt/disk1/SSD_Predict/models/ssd/caffe/VGGNet/VOC0712/SSD_300x300/test.prototxt --caffeModelPath /mnt/disk1/SSD_Predict/models/ssd/caffe/VGGNet/VOC0712/SSD_300x300/VGG_VOC0712_SSD_300x300_iter_120000.caffemodel --classname /mnt/disk1/SSD_Predict/models/ssd/caffe/VGGNet/VOC0712/classname.txt -b 200 -r 300 -p 50 -q false
executor log:
18/03/02 13:30:41 WARN FeatureTransformer$: failed /mnt/disk1/SSD_Predict/data/PASCAL/VOCdevkit/VOC2007/JPEGImages/009934.jpg in transformer class com.intel.analytics.bigdl.transform.vision.image.augmentation.Resize
java.lang.NullPointerException
at org.opencv.imgproc.Imgproc.resize(Imgproc.java:2761)
at com.intel.analytics.bigdl.transform.vision.image.augmentation.Resize$.transform(Resize.scala:69)
at com.intel.analytics.bigdl.transform.vision.image.augmentation.Resize.transformMat(Resize.scala:53)
at com.intel.analytics.bigdl.transform.vision.image.FeatureTransformer.transform(FeatureTransformer.scala:58)
at com.intel.analytics.bigdl.transform.vision.image.ChainedFeatureTransformer.transform(FeatureTransformer.scala:111)
at com.intel.analytics.bigdl.transform.vision.image.ChainedFeatureTransformer.transform(FeatureTransformer.scala:111)
at com.intel.analytics.bigdl.transform.vision.image.ChainedFeatureTransformer.transform(FeatureTransformer.scala:111)
at com.intel.analytics.bigdl.transform.vision.image.FeatureTransformer$$anonfun$apply$1.apply(FeatureTransformer.scala:80)
at com.intel.analytics.bigdl.transform.vision.image.FeatureTransformer$$anonfun$apply$1.apply(FeatureTransformer.scala:80)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1076)
at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1091)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1128)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1132)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1760)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The code in WideAndDeepExample.scala does not seem to support multi-value indicators, like this:
//com.intel.analytics.zoo.models.recommendation.Utils.scala
// setup deep tensor
def getDeepTensor(r: Row, columnInfo: ColumnFeatureInfo): Tensor[Float] = {
  val deepColumns1 = columnInfo.indicatorCols
  val deepColumns2 = columnInfo.embedCols ++ columnInfo.continuousCols
  val deepLength = columnInfo.indicatorDims.sum + deepColumns2.length
  val deepTensor = Tensor[Float](deepLength).fill(0)
  // setup indicators
  var acc = 0
  (0 to deepColumns1.length - 1).map { i =>
    val index = r.getAs[Int](columnInfo.indicatorCols(i))
    val accIndex = if (i == 0) index
    else {
      acc = acc + columnInfo.indicatorDims(i - 1)
      acc + index
    }
    deepTensor.setValue(accIndex + 1, 1)
  }
  // setup embedding and continuous
  (0 to deepColumns2.length - 1).map { i =>
    deepTensor.setValue(i + 1 + columnInfo.indicatorDims.sum,
      r.getAs[Int](deepColumns2(i)).toFloat)
  }
  deepTensor
}
The example calls the Utils.row2Sample() method, which handles the indicator and continuous columns as in the code above. We can see that it assumes continuousCols are integers rather than floats, and it only takes the first value of each indicator. I understand it's only a demo, but wouldn't it be better to fix this?
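For illustration, a multi-value-aware version could look like the plain-Python sketch below. `get_deep_vector` and its dict-based "row" are hypothetical stand-ins for the Scala `getDeepTensor` and a Spark `Row`, not the Zoo API:

```python
def get_deep_vector(row, indicator_cols, indicator_dims, other_cols):
    """Sketch of a getDeepTensor variant that allows multi-valued
    indicators and float continuous columns. `row` is a plain dict
    standing in for a Spark Row; all names here are illustrative."""
    deep_len = sum(indicator_dims) + len(other_cols)
    vec = [0.0] * deep_len
    offset = 0
    for col, dim in zip(indicator_cols, indicator_dims):
        values = row[col]
        if not isinstance(values, (list, tuple)):  # single value still works
            values = [values]
        for idx in values:  # set every active index, not just the first
            vec[offset + idx] = 1.0
        offset += dim
    base = sum(indicator_dims)
    for i, col in enumerate(other_cols):
        vec[base + i] = float(row[col])  # accept floats, not only ints
    return vec
```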
Code like the following works well on Python 2.7 but fails on Python 3:
wide_n_deep = WideAndDeep(5, column_info, "wide_n_deep")
creating: createZooWideAndDeep
Traceback (most recent call last):
File "/opt/work/jenkins/workspace/ZOO-PR-Python-AppTests/apps/recommendation/wide_n_deep.py", line 142, in
wide_n_deep = WideAndDeep(5, column_info, "wide_n_deep")
File "/opt/work/jenkins/workspace/ZOO-PR-Python-AppTests/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/zoo/models/recommendation/wide_and_deep.py", line 118, in __init__
File "/opt/work/jenkins/workspace/ZOO-PR-Python-AppTests/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/nn/layer.py", line 667, in __init__
File "/opt/work/jenkins/workspace/ZOO-PR-Python-AppTests/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/nn/layer.py", line 130, in __init__
File "/opt/work/jenkins/workspace/ZOO-PR-Python-AppTests/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py", line 588, in callBigDlFunc
File "/opt/work/jenkins/workspace/ZOO-PR-Python-AppTests/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py", line 584, in callBigDlFunc
File "/opt/work/jenkins/workspace/ZOO-PR-Python-AppTests/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py", line 629, in callJavaFunc
File "/opt/work/jenkins/workspace/ZOO-PR-Python-AppTests/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py", line 629, in
File "/opt/work/jenkins/workspace/ZOO-PR-Python-AppTests/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py", line 656, in _py2java
File "/opt/work/jenkins/workspace/ZOO-PR-Python-AppTests/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py", line 656, in
File "/opt/work/jenkins/workspace/ZOO-PR-Python-AppTests/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-python-api.zip/bigdl/util/common.py", line 671, in _py2java
File "/opt/work/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call
File "/opt/work/spark-2.1.1/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/opt/work/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.bigdl.api.python.BigDLSerDe.loads.
: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
at org.apache.spark.bigdl.api.python.BigDLSerDeBase.loads(BigDLSerde.scala:57)
at org.apache.spark.bigdl.api.python.BigDLSerDe.loads(BigDLSerde.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
It looks like com.intel.analytics.zoo.pipeline.api.keras.layers.Recurrent is not used anywhere, and zoo.keras.layers.LSTM extends bigdl.nn.keras.Recurrent. If I update LSTM to extend zoo.keras.Recurrent, there is an exception:
(isKerasStyle=false):
InternalRecurrent[f454c033]ArrayBuffer(TimeDistributed[5379272d]Linear[53d69303](12 -> 128), LSTM(12, 32, 0.0))
at com.intel.analytics.bigdl.nn.abstractnn.InferShape$class.excludeInvalidLayers(InferShape.scala:98)
at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.excludeInvalidLayers(AbstractModule.scala:58)
at com.intel.analytics.bigdl.nn.abstractnn.InferShape$class.validateInput(InferShape.scala:108)
at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.validateInput(AbstractModule.scala:58)
at com.intel.analytics.bigdl.nn.keras.Sequential.add(Topology.scala:299)
at com.intel.analytics.zoo.pipeline.api.keras.layers.Recurrent.doBuild(Recurrent.scala:41)
at com.intel.analytics.bigdl.nn.keras.KerasLayer.build(KerasLayer.scala:225)
And I agree we do need a zoo.keras.Recurrent extending bigdl.keras.Recurrent so that we can add more functions.
It looks like the SSD Test.scala has been hard-coded with the VOC imageset.
I was just checking the code in ImageChannelNormalize, and in BigDL:
https://github.com/intel-analytics/BigDL/blob/master/spark/dl/src/main/scala/com/intel/analytics/bigdl/transform/vision/image/augmentation/ChannelNormalize.scala#L52
I found the following code:
def apply(meanR: Float, meanG: Float, meanB: Float,
    stdR: Float = 1, stdG: Float = 1, stdB: Float = 1): ChannelNormalize = {
  new ChannelNormalize(Array(meanB, meanG, meanR), Array(stdR, stdG, stdB))
}
Notice that in new ChannelNormalize(Array(meanB, meanG, meanR), Array(stdR, stdG, stdB)), the mean and std use inconsistent channel orders: the means are reordered to BGR but the stds stay in RGB. This looks like a bug.
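To illustrate why the order matters, here is a minimal sketch of the consistent behavior. `channel_normalize` is a hypothetical helper, not the BigDL API; it assumes OpenCV-style BGR pixel layout, which is why the means are reordered:

```python
def channel_normalize(pixel_bgr, mean_rgb, std_rgb):
    """Illustrative fix (not BigDL code): if the means are reordered to
    BGR to match OpenCV's channel layout, the stds must be reordered the
    same way, otherwise each channel is divided by the wrong std."""
    mean_b, mean_g, mean_r = mean_rgb[2], mean_rgb[1], mean_rgb[0]
    std_b, std_g, std_r = std_rgb[2], std_rgb[1], std_rgb[0]  # same reorder
    b, g, r = pixel_bgr
    return ((b - mean_b) / std_b, (g - mean_g) / std_g, (r - mean_r) / std_r)
```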
If using --model=local[*] as required in the readme:
Error: Caused by: java.lang.IllegalArgumentException: requirement failed: total batch size: 128 should be divided by total core number: 28
If using a corrected command, e.g. --model=local[16], nothing is output while running.
Script:
export ANALYTICS_ZOO_JAR=${ANALYTICS_ZOO_HOME}/lib/analytics-zoo-0.1.0-SNAPSHOT-jar-with-dependencies.jar
export BASE_DIR=/home/sangtian/zoo_test/textclassification/
spark-submit \
  --master=local[16] \
  --driver-memory 20g \
  --executor-memory 20g \
  --class com.intel.analytics.zoo.examples.textclassification.TextClassification \
  ${ANALYTICS_ZOO_JAR} \
  --baseDir ${BASE_DIR}
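The requirement behind the error above ("total batch size should be divided by total core number") can be sketched as a helper that rounds the batch size up to the next multiple of the total core number. `smallest_valid_batch_size` is a hypothetical helper, not part of the example:

```python
def smallest_valid_batch_size(desired, total_cores):
    """Round the desired batch size up to the nearest multiple of the
    total core number, satisfying BigDL's divisibility requirement."""
    return ((desired + total_cores - 1) // total_cores) * total_cores
```

For example, with 28 local cores a desired batch size of 128 would be bumped to 140.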
I need to implement a style-transfer example with CycleGAN, which combines a cyclic loss and an adversarial loss; however, the compile method of the Keras-style model in Zoo only supports a single loss.
Just checking whether we're using consistent preprocessing for ResNet.
In BigDL: https://github.com/intel-analytics/BigDL/blob/master/spark/dl/src/main/scala/com/intel/analytics/bigdl/example/loadmodel/DatasetUtil.scala#L94
In Zoo: https://github.com/intel-analytics/zoo/blob/master/zoo/src/main/scala/com/intel/analytics/zoo/models/image/imageclassification/ImageClassificationConfig.scala#L111
Right now they use different mean and std values. If our model comes from BigDL, shall we use the same preprocessing?
Since we are trying to use the Keras API for the seq2seq PR, it would be great if we could have ConvLSTM (2D/3D) Keras layers.
https://github.com/intel-analytics/analytics-zoo/tree/master/zoo/src/main/scala/com/intel/analytics/zoo/examples/tfnet
Following the guide to run this example, we noticed that it didn't process all the images.
When loading a TextClassifier model locally and doing text prediction, usually we just call loadModel and predict like this:
val textClassificationModel = TextClassifier.loadModel[Float]("file:///home/yidiyang/workspace/model/text.bigdl")
val results = textClassificationModel.predict(sampleRDD).collect()(0)
However it throws an error: module size should be 1 instead of 2.
It only works when this line is added before loading (then the module size is 1):
val model = TextClassifier(classNum, tokenLength, sequenceLength, param.encoder, param.encoderOutputDim)
This may be a BigDL issue, but I came across the problem when I tried to run the W&D model, so I decided to submit it here.
Here is the problem: when I train a model on Spark, I always fail on huge data with large feature sets. Spark executors keep running out of memory even when I set the heap and direct memory very large, and I notice that the storage memory actually used is quite low. I'm skilled with Spark, so please believe me that I know how to deal with OOM on Spark.
For instance, I assigned 2048 GB of memory in total to the application, and the cached data (the RDDs named "training rdd" and "thread models") only occupies 100 GB.
Then an executor fails, telling me it cannot find an RDD file:
spark.FetchFailedException: Failure while fetching StreamChunkId: no such file exception
It then has to recompute the RDD, and fails again. The executor log shows that the MemoryStore tries to store a single RDD partition of about 50 GB, which the executor memory cannot hold.
I checked the source code and found two places:
//DataSet.scala
override def cache(): Unit = {
  buffer.count()
  indexes.count()
  isCached = true
}
When the dataset tries to cache data, it simply calls an action, which triggers the code below:
def rdd[T: ClassTag](data: RDD[T]): DistributedDataSet[T] = {
  val nodeNumber = Engine.nodeNumber()
  new CachedDistriDataSet[T](
    data.coalesce(nodeNumber, true)
      .mapPartitions(iter => {
        Iterator.single(iter.toArray)
      }).setName("cached dataset")
      .cache()
  )
}
It calls Spark's cache() method, which caches the data in memory even if the executor memory cannot hold the partition. I believe that's the first reason for the RDD loss and the OOM exception. To avoid this, we should replace every cache() call in the dataset with persist(MEMORY_AND_DISK) or persist(MEMORY_AND_DISK_SER). This may make the application run more slowly, but better slow than failing, isn't it?
def rdd[T: ClassTag](data: RDD[T]): DistributedDataSet[T] = {
  val nodeNumber = Engine.nodeNumber()
  new CachedDistriDataSet[T](
    data.coalesce(nodeNumber, true)
      .mapPartitions(iter => {
        Iterator.single(iter.toArray)
      }).setName("cached dataset")
      .cache()
  )
}
Same code here. When I run my W&D application, I have about 100 million training Samples, and Iterator.single(iter.toArray) turns each whole partition into one big array. The array is of course huge, since it holds tens of millions of objects. I just can't understand why we convert the distributed RDD into big per-partition arrays. Just to zip with an immutable index? We could assign indexes with distributed RDDs, as with other operations like shuffle. Is it just a bug?
Sorry that I cannot provide a demo for this issue.
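The distributed index assignment suggested above does not require materializing each partition as one array: Spark's own zipWithIndex uses a two-pass scheme (count each partition, then stream elements with an offset). A plain-Python sketch, with lists of lists standing in for partitions:

```python
def zip_with_global_index(partitions):
    """Assign a global index to every element, Spark-zipWithIndex style:
    pass 1 collects only the per-partition counts; pass 2 streams each
    partition's elements with its starting offset. Illustrative sketch,
    not Spark code."""
    counts = [len(p) for p in partitions]                    # pass 1: sizes only
    offsets = [sum(counts[:i]) for i in range(len(counts))]  # start index per partition
    return [
        [(offsets[pid] + i, x) for i, x in enumerate(part)]  # pass 2: stream + offset
        for pid, part in enumerate(partitions)
    ]
```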
Per the readme, the batch size can be set by the user, but actually it's hardcoded as 8000.
integrated with maven.
For Python Apps, you can do something similar.
[ 1 ] Download instructions:
https://github.com/intel-analytics/analytics-zoo/tree/master/zoo/src/main/scala/com/intel/analytics/zoo/examples/textclassification#download-analytics-zoo
[ 2 ] Run command:
https://github.com/intel-analytics/analytics-zoo/tree/master/zoo/src/main/scala/com/intel/analytics/zoo/examples/textclassification#run-this-example
Also, to be safe, after modifying the run command, it's better to verify it manually or update the Jenkins scripts to make sure the command works.
The original issue refers to https://github.com/intel-analytics/zoo/issues/366
Feel free to raise suggestions if you have a better way to elaborate the README. (edited by Kai)
Some issues in the anomaly detection notebook:
In the Readme it says: export ZOO_HOME=the root directory of the Analytics Zoo project. But jupyter-with-zoo.sh needs ZOO_HOME to be the dist or build directory. We need to clarify this environment variable. I suggest introducing another variable, like ZOO_SOURCE, for the root directory of the Analytics Zoo project, and keeping ZOO_HOME pointing to the build. The notebook code also uses ZOO_HOME as the root directory of the project.
In jupyter-with-zoo.sh, the zoo jar and python zip are looked up as "zooxxx"; this needs to change to "analytics-zooxxx". Also, the --allow-root option is not working in my jupyter.
Running the notebook, I hit an issue in cell 8 with the command "df['hours'] = df['datetime'].dt.hour":
AttributeError Traceback (most recent call last)
in ()
1 # the hours and if it's night or day (7:00-22:00)
----> 2 df['hours'] = df['datetime'].dt.hour
3 df['daylight'] = ((df['hours'] >= 7) & (df['hours'] <= 22)).astype(int)
/usr/lib/python2.7/dist-packages/pandas/core/generic.pyc in __getattr__(self, name)
1813 return self[name]
1814 raise AttributeError("'%s' object has no attribute '%s'" %
-> 1815 (type(self).__name__, name))
1816
1817 def __setattr__(self, name, value):
AttributeError: 'Series' object has no attribute 'dt'
/usr/lib/python2.7/dist-packages/simplejson/encoder.py:262: DeprecationWarning: Interpreting naive datetime as local 2018-05-09 13:16:01.759843. Please add timezone info to timestamps.
chunks = self.iterencode(o, _one_shot=True)
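For reference, the .dt accessor only exists on datetime64 columns; if the CSV was loaded with an object-dtype datetime column, converting it first avoids this AttributeError. That is a likely cause, though not confirmed from the traceback alone:

```python
import pandas as pd

# A toy frame standing in for the notebook's data; the datetime column is
# loaded as plain strings, which is what typically triggers the error.
df = pd.DataFrame({"datetime": ["2014-07-01 00:00:00", "2014-07-01 01:00:00"]})
df["datetime"] = pd.to_datetime(df["datetime"])  # ensure datetime64 dtype
df["hours"] = df["datetime"].dt.hour
df["daylight"] = ((df["hours"] >= 7) & (df["hours"] <= 22)).astype(int)
```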
We only support "accuracy" for now, and the error message should also be enriched:
def to_bigdl_metrics(metrics):
    metrics = to_list(metrics)
    bmetrics = []
    for metric in metrics:
        if metric.lower() == "accuracy":
            bmetrics.append(Top1Accuracy())
        else:
            raise TypeError("Unsupported metrics: %s" % metric)
    return bmetrics
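A possible enriched version, sketched in plain Python: a lookup table makes new metrics a one-line addition, and the error message lists what is supported. The mapping values are stubbed as strings here; in the real code they would be BigDL validation-method instances such as Top1Accuracy:

```python
def to_bigdl_metrics_sketch(metrics):
    """Hypothetical enriched variant, not the current Zoo code.
    Metric constructors are stubbed as strings for illustration."""
    mapping = {"accuracy": "Top1Accuracy"}
    if isinstance(metrics, str):
        metrics = [metrics]
    result = []
    for metric in metrics:
        key = metric.lower()
        if key not in mapping:
            # Enriched error: name the bad metric AND what is supported.
            raise ValueError(
                "Unsupported metric: %r. Supported metrics are: %s"
                % (metric, ", ".join(sorted(mapping))))
        result.append(mapping[key])
    return result
```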
From zoo created by jason-dai : intel-analytics/zoo#185
Hi,
Is there any way to normalize image pixel values to between 0 and 1 in ChainedPreprocessing? I'm seeing an ImageChannelNormalize() function, but that seems to be used primarily for zero-centering each channel rather than for 0-1 normalization. E.g., if all the pixel values are between 0 and 255, is it possible to simply plug a lambda into the ChainedPreprocessing constructor to divide each pixel value by 255? Or would this require defining a custom Preprocessing step?
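One observation worth checking: a channel normalize with mean 0 and std 255 computes exactly a divide-by-255, so ImageChannelNormalize(0.0, 0.0, 0.0, 255.0, 255.0, 255.0) may achieve 0-1 scaling without a custom Preprocessing step. That is an assumption based on the std parameters of ChannelNormalize, not verified against ChainedPreprocessing. The arithmetic itself is trivial; a hypothetical plain-Python helper:

```python
def scale_to_unit(pixels):
    """Scale 0-255 pixel values into [0, 1] by dividing by 255; this is
    the same arithmetic as channel-normalizing with mean 0 and std 255.
    Hypothetical helper for illustration, not the Zoo API."""
    return [p / 255.0 for p in pixels]
```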
When implementing CycleGAN in Keras, I need to train the generator and discriminator alternately, which requires training per batch/iteration. However, we only support model.fit() now.
A reference bigdl issue is intel-analytics/ipex-llm#2500
Looks like there is a typo in the code below: modelPath is passed to dataPath.
optString
  .text("pretrained model path")
  .action((x, c) => c.copy(dataPath = x))