Comments (10)

leewyang commented on May 18, 2024

@mayiming can you post your command line? Otherwise, you can try to run the example on a single box using the Spark Standalone instructions. Please note that in our YARN instructions, we were using 27GB as a proxy for a GPU, since YARN doesn't support scheduling by GPUs. So, if you're running on CPU, you should be able to run with much smaller memory.

from tensorflowonspark.

mayiming commented on May 18, 2024

Hi,

Thanks a lot for the reply. The command line is almost identical to the mnist example:

${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 4 \
--executor-memory 27G \
--py-files /export/home/yma/TensorFlowOnSpark/tfspark.zip,/export/home/yma/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server:./Python/lib" \
--archives hdfs:///user/yma/Python.zip#Python \
--conf spark.executorEnv.PYSPARK_PYTHON=./Python/bin/python \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./Python/bin/python \
--conf spark.driver.extraLibraryPath="$JAVA_HOME/jre/lib/amd64/server:./Python/lib" \
/export/home/yma/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
--images /user/yma/mnist/csv/train/images \
--labels /user/yma/mnist/csv/train/labels \
--model "hdfs://default/user/yma/mnist_model"

If I reduce the executor memory to 10GB, the job gets stuck, whereas it should have finished in 5 minutes. I will give local Spark a try.

Thanks again,
Yiming.

leewyang commented on May 18, 2024

@mayiming have you been able to get local spark working fine? FWIW, I tried running the MNIST example in a "low-memory" configuration, and I was able to successfully run at 2G (without tuning any spark memory settings). Note that the Spark executor itself requires some amount of memory to run.

mayiming commented on May 18, 2024

@leewyang Thank you very much for looking into the issue. It appears that my Linux OS is having some issues with gRPC, which could cause extra memory overhead.

Could you let me know the Linux OS version that you used for your experiment, so I can replicate the result? I'm using a RHEL 6.6 cluster. Are you aware of any issues running TFoS on this platform?

Thanks a lot,
Yiming.

leewyang commented on May 18, 2024

@mayiming We're running RHEL, but I'm not sure which specific version, and even then, it's likely customized for our env anyways.

That said, I'm not aware of anyone else reporting similar issues. Which version of tensorflow are you using? Public, pip-installed? Or git-cloned and compiled locally?

mayiming commented on May 18, 2024

@leewyang I built it from source; the TF version is 0.12. To get it to compile, I needed to install devtoolset-4. The Python environment is 2.7.

Also, when I increase the number of executors to 10, I typically observe that the last worker just waits for the chief worker indefinitely. Have you observed this behavior?

Thanks a lot,
Yiming.

leewyang commented on May 18, 2024

If you can, I'd recommend trying a pre-built pip package for TensorFlow, especially if you aren't using GPUs or RDMA. This should hopefully avoid any build/compile issues you might be seeing (e.g. gRPC).

As for the MNIST example hanging with an increased number of executors, you might need to increase the number of --steps and/or --epochs. Note that the default settings are tuned to "not take too long" with 4 executors. By increasing the number of executors, there's a chance that there's not enough data being produced to "fill" each worker's queue. You can see if this is the case by looking at the yarn logs of the executors, and seeing where it's hanging...
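
To get a rough feel for the "not enough data per worker" effect, here is a back-of-the-envelope sketch in Python. It is not how TensorFlowOnSpark schedules data internally; it only assumes the records are split evenly across workers. The function name, the single parameter server, and the batch size of 100 are assumptions for illustration; 60,000 is the size of the standard MNIST training set.

```python
def batches_per_worker(total_records, epochs, num_executors, num_ps, batch_size):
    """Estimate full batches available to each worker.

    Simplifying assumption: records are split evenly across workers,
    and every executor that is not a parameter server is a worker.
    """
    num_workers = num_executors - num_ps
    if num_workers <= 0:
        raise ValueError("need at least one worker executor")
    # Each epoch feeds the whole dataset once.
    records_per_worker = (total_records * epochs) // num_workers
    return records_per_worker // batch_size


# Hypothetical settings: 1 parameter server, batch size 100.
for executors, epochs in [(4, 1), (10, 1), (10, 3)]:
    n = batches_per_worker(60000, epochs, executors, num_ps=1, batch_size=100)
    print(f"executors={executors} epochs={epochs} -> ~{n} batches per worker")
```

Under these assumptions, going from 4 to 10 executors cuts each worker's share of batches to about a third, and raising the number of epochs to 3 brings it back to the 4-executor level.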

mayiming commented on May 18, 2024

@leewyang Thanks again for the pointers. After increasing the steps and number of epochs, it runs well now. However, I still need to specify around 20GB of memory through the spark.yarn.executor.memoryOverhead parameter. Have you observed a similar issue?

leewyang commented on May 18, 2024

@mayiming I just re-tried a similar command-line in my environment, and I'm able to go down to --executor-memory 1G w/ no memoryOverhead setting, so not sure why you're seeing that. Again, you may want to try the pre-built pip package just to see if that helps.
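
For comparing the memory settings mentioned in this thread, the following Python sketch estimates what each executor asks YARN for. It assumes the documented Spark-on-YARN default of that era, where the overhead is max(384 MiB, 10% of executor memory) unless set explicitly via spark.yarn.executor.memoryOverhead; the factor differs in some Spark versions, so check yours. The function name is hypothetical, and the sketch ignores YARN rounding the request up to its allocation increment.

```python
MIN_OVERHEAD_MIB = 384  # documented minimum overhead (assumed for this Spark version)


def yarn_container_mib(executor_memory_mib, overhead_mib=None, overhead_factor=0.10):
    """Approximate per-executor YARN request: heap plus off-heap overhead."""
    if overhead_mib is None:
        # Default: a fraction of executor memory, with a fixed floor.
        overhead_mib = max(MIN_OVERHEAD_MIB, int(executor_memory_mib * overhead_factor))
    return executor_memory_mib + overhead_mib


# --executor-memory 1G with no explicit overhead
print(yarn_container_mib(1024))
# --executor-memory 27G with no explicit overhead
print(yarn_container_mib(27 * 1024))
# Hypothetical: --executor-memory 10G with a 20 GiB explicit overhead
print(yarn_container_mib(10 * 1024, overhead_mib=20 * 1024))
```

Processes launched from the Python worker, such as TensorFlow, use memory outside the JVM heap, so that usage counts against the overhead portion of the container rather than --executor-memory.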

leewyang commented on May 18, 2024

Closing due to inactivity.
