Comments (10)
@mayiming can you post your command line? Otherwise, you can try to run the example on a single box using the Spark Standalone instructions. Please note that in our YARN instructions, we were using 27GB as a proxy for a GPU, since YARN doesn't support scheduling by GPUs. So, if you're running on CPU, you should be able to run with much smaller memory.
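For reference, a single-box Standalone run might look something like the following. This is only a sketch: the master URL, 4G memory, cluster size, and relative paths are illustrative placeholders, not values taken from this thread.

```shell
# Hypothetical Spark Standalone submission on one machine.
# All values (master URL, memory, paths, cluster size) are illustrative --
# adjust them to your local checkout and Standalone master.
${SPARK_HOME}/bin/spark-submit \
  --master spark://localhost:7077 \
  --conf spark.cores.max=4 \
  --executor-memory 4G \
  --py-files tfspark.zip,examples/mnist/spark/mnist_dist.py \
  examples/mnist/spark/mnist_spark.py \
  --cluster_size 4 \
  --images mnist/csv/train/images \
  --labels mnist/csv/train/labels \
  --model mnist_model
```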
from tensorflowonspark.
Hi,
Thanks a lot for the reply. The command line is almost identical to the mnist example:
```shell
${SPARK_HOME}/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 27G \
  --py-files /export/home/yma/TensorFlowOnSpark/tfspark.zip,/export/home/yma/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.yarn.maxAppAttempts=1 \
  --conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server:./Python/lib" \
  --archives hdfs:///user/yma/Python.zip#Python \
  --conf spark.executorEnv.PYSPARK_PYTHON=./Python/bin/python \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./Python/bin/python \
  --conf spark.driver.extraLibraryPath="$JAVA_HOME/jre/lib/amd64/server:./Python/lib" \
  /export/home/yma/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
  --images /user/yma/mnist/csv/train/images \
  --labels /user/yma/mnist/csv/train/labels \
  --model "hdfs://default/user/yma/mnist_model"
```
If I reduce the executor memory to 10GB, the job gets stuck, whereas it should finish in about 5 minutes. I will give local Spark a try.
Thanks again,
Yiming.
@mayiming have you been able to get local spark working fine? FWIW, I tried running the MNIST example in a "low-memory" configuration, and I was able to successfully run at 2G (without tuning any spark memory settings). Note that the Spark executor itself requires some amount of memory to run.
@leewyang Thank you very much for looking into the issue. It appears that my Linux OS is having some issues with gRPC, which could cause extra memory overhead.
Could you let me know the Linux OS version that you used for your experiment, so I can replicate the result? I'm using a RHEL 6.6 cluster. Are you aware of any issues running TFoS on this platform?
Thanks a lot,
Yiming.
@mayiming We're running RHEL, but I'm not sure which specific version, and even then, it's likely customized for our env anyways.
That said, I'm not aware of anyone else reporting similar issues. Which version of tensorflow are you using? Public, pip-installed? Or git-cloned and compiled locally?
@leewyang I built it from source; the TF version is 0.12. To get it to compile, I needed to install devtoolset-4. The Python environment is 2.7.
Also, when I increase the number of executors to 10, I typically observe that the last worker just waits for the chief worker indefinitely. Have you observed this behavior?
Thanks a lot,
Yiming.
If you can, I'd recommend trying a pre-built pip package for TensorFlow, especially if you aren't using GPUs or RDMA. This should hopefully avoid any build/compile issues you might be seeing (e.g. gRPC).
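As a sketch of what that swap might look like: since the Python distribution here is shipped to executors via `--archives`, the pre-built wheel would need to be installed into that distribution and the archive re-uploaded. The exact version pin below is only an example matching the 0.12 line mentioned in this thread.

```shell
# Hypothetical: install a pre-built CPU wheel into the Python env that
# gets shipped via --archives, then re-zip and re-upload it so executors
# pick up the new package.  Version pin is illustrative.
./Python/bin/pip install tensorflow==0.12.1
zip -r Python.zip Python
hadoop fs -put -f Python.zip /user/yma/Python.zip
```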
As for the MNIST example hanging with an increased number of executors, you might need to increase the number of --steps and/or --epochs. Note that the default settings are tuned to "not take too long" with 4 executors. By increasing the number of executors, there's a chance that there's not enough data being produced to "fill" each worker's queue. You can see if this is the case by looking at the YARN logs of the executors and seeing where they hang.
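The under-fill situation can be sanity-checked with back-of-the-envelope arithmetic: each worker needs roughly steps × batch_size records, while Spark can only feed it about total_records × epochs / num_workers. The numbers below are purely illustrative, not the example's real defaults.

```python
# Illustrative check: with more workers, each one receives fewer records,
# so it may block waiting for data before reaching its step count.
def enough_data(total_records, epochs, num_workers, steps, batch_size):
    """True if an even split of the data can satisfy each worker's steps."""
    per_worker = total_records * epochs // num_workers
    return per_worker >= steps * batch_size

# Example numbers (illustrative): 60,000 MNIST training records.
print(enough_data(60000, 1, 4, 100, 100))    # 15000 >= 10000 -> True
print(enough_data(60000, 1, 10, 100, 100))   # 6000 < 10000  -> False
```

To see where a hanging executor is blocked, the aggregated container logs can be dumped with `yarn logs -applicationId <appId>`.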
@leewyang Thanks again for the pointers. After increasing the steps and number of epochs, it does run well now. However, I still need to allocate around 20GB of memory through the yarn.executor.memoryOverhead parameter. Have you observed a similar issue?
@mayiming I just re-tried a similar command line in my environment, and I'm able to go down to --executor-memory 1G with no memoryOverhead setting, so I'm not sure why you're seeing that. Again, you may want to try the pre-built pip package just to see if that helps.
Closing due to inactivity.