
Comments (22)

leewyang commented on May 18, 2024

It looks like you haven't installed tensorflow into your Python distribution that is shipped to the executors via --archives hdfs:///user/${USER}/Python.zip#Python, or you're not setting the following env vars:

export PYSPARK_PYTHON=${PYTHON_ROOT}/bin/python
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=Python/bin/python"
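
For reference, with those variables exported on the gateway, a minimal submit looks roughly like this (a sketch; the application script name is just a placeholder):

# Python.zip is the zipped Python distribution (with TensorFlow installed);
# YARN extracts it on each executor into a working-dir folder named "Python",
# which is what the PYSPARK_PYTHON setting above points at.
${SPARK_HOME}/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --archives hdfs:///user/${USER}/Python.zip#Python \
  your_app.py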


leewyang commented on May 18, 2024

@bobo2001281 You have several options:

  1. If your dataset fits into memory, you can just use any existing code to load it into the memory of each executor and train as usual. In this case, you'll need to ship the MAT file to the executors, much like the way we ship the "mnist.zip" file in the data conversion example (see the sketch after this list).
  2. If your dataset doesn't fit into memory, you'll need a way to split it into files that are easily "readable" by either Spark (e.g. sc.textFile() or sc.sequenceFile()) or TensorFlow (e.g. TFRecord)
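
For option 1, the shipping step might look roughly like this (a sketch; the file names are only illustrative):

# package the MAT file and upload it to HDFS, similar to the mnist.zip example
zip -r data.zip train.mat
hadoop fs -put data.zip

# then add it to the --archives list of your spark-submit, e.g.
#   --archives hdfs:///user/${USER}/Python.zip#Python,hdfs:///user/${USER}/data.zip#data
# and each executor will see the extracted file at ./data/train.mat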


bobo2001281 commented on May 18, 2024

Regarding the first option:
mnist_data_setup.py differs depending on what the data is.
The data in the other examples (cifar10, imagenet, slim) is downloaded directly from GitHub according to the README.md.

How can I ship my data? There does not seem to be any label data in my dataset.


leewyang commented on May 18, 2024

Shipping the data to the executors can be done with the --archives option. So, for the data conversion example, specifying --archives mnist/mnist.zip#mnist tells Spark to copy the mnist.zip file to each executor and extract it into a folder named mnist in its current working directory.


bobo2001281 commented on May 18, 2024

Now:

  1. I put the data onto HDFS with
    hadoop fs -put ./data
  2. I modified VDSR.py and added VDSR_spark.py
  3. ${SPARK_HOME}/bin/spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --queue ${QUEUE} \
    --num-executors 4 \
    --executor-memory 27G \
    --py-files TensorFlow_VDSR/tfspark.zip,TensorFlow_VDSR/VDSR.py \
    --conf spark.dynamicAllocation.enabled=false \
    --conf spark.yarn.maxAppAttempts=1 \
    --archives hdfs:///user/${USER}/Python.zip#Python \
    --conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server" \
    TensorFlow_VDSR/VDSR_spark.py \
    --images hdfs:///user/${USER}/data/train \
    --model_path ./model_VDSR_me
    and this error occurred:

17/05/03 16:39:30 INFO yarn.Client: Application report for application_1493036076768_0092 (state: ACCEPTED)
17/05/03 16:39:31 INFO yarn.Client: Application report for application_1493036076768_0092 (state: FAILED)
17/05/03 16:39:31 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1493036076768_0092 failed 1 times due to AM Container for appattempt_1493036076768_0092_000001 exited with exitCode: 1
For more detailed output, check application tracking page:http://u10-121-135-150:8088/cluster/app/application_1493036076768_0092Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1493036076768_0092_01_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1493800746814
final status: FAILED
tracking URL: http://u10-121-135-150:8088/cluster/app/application_1493036076768_0092
user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1493036076768_0092 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1132)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1175)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/05/03 16:39:31 INFO util.ShutdownHookManager: Shutdown hook called
17/05/03 16:39:31 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-b8f69730-babb-4b65-93e2-8bbc317bcebc


leewyang commented on May 18, 2024

Please grab the yarn logs via: yarn logs -applicationId application_1493036076768_0092 and search for any errors/exceptions.


bobo2001281 commented on May 18, 2024

hadoop@u10-121-135-150:~/hadoop-2.7.1/logs$ yarn logs -applicationId application_1493036076768_0106
/tmp/logs/hadoop/logs/application_1493036076768_0106 does not exist.
Log aggregation has not completed or is not enabled.

The log does not exist, either while the spark-submit is running or after the command finishes.


bobo2001281 commented on May 18, 2024

I have modified yarn-site.xml with the properties below, and now I can see my logs.

  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/app-logs</value>
  </property>
  <property>
    <name>yarn.nodemanager.remote-app-log-dir-suffix</name>
    <value>logs</value>
  </property>


bobo2001281 commented on May 18, 2024

😢
How do I attach a file?


bobo2001281 commented on May 18, 2024

17/05/05 11:40:07 INFO memory.MemoryStore: MemoryStore started with capacity 14.2 GB
17/05/05 11:40:07 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://[email protected]:42573
17/05/05 11:40:07 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
17/05/05 11:40:07 INFO executor.Executor: Starting executor ID 2 on host u10-121-135-150
17/05/05 11:40:07 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 36225.
17/05/05 11:40:07 INFO netty.NettyBlockTransferService: Server created on u10-121-135-150:36225
17/05/05 11:40:07 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(2, u10-121-135-150, 36225)
17/05/05 11:40:07 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(2, u10-121-135-150, 36225)
17/05/05 11:40:10 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 0
17/05/05 11:40:10 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
17/05/05 11:40:10 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
17/05/05 11:40:10 INFO client.TransportClientFactory: Successfully created connection to /10.121.135.150:39980 after 2 ms (0 ms spent in bootstraps)
17/05/05 11:40:10 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 7.8 KB, free 14.2 GB)
17/05/05 11:40:10 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 306 ms
17/05/05 11:40:11 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 11.3 KB, free 14.2 GB)
2017-05-05 11:40:11,669 INFO (MainThread-46435) connected to server at ('u10-121-135-150', 43125)
2017-05-05 11:40:11,670 INFO (MainThread-46435) TFSparkNode.reserve: {'authkey': '\x0c!\xee\xc3Y1K\xee\xa1A\x1efI\xebz\xca', 'worker_num': 0, 'host': 'u10-121-135-150', 'tb_port': 0, 'addr': ('u10-121-135-150', 42646), 'ppid': 46414, 'task_index': 0, 'job_name': 'ps', 'tb_pid': 0, 'port': 44238}
2017-05-05 11:40:12,678 INFO (MainThread-46435) node: {'addr': ('u10-121-135-150', 42646), 'task_index': 0, 'job_name': 'ps', 'authkey': '\x0c!\xee\xc3Y1K\xee\xa1A\x1efI\xebz\xca', 'worker_num': 0, 'host': 'u10-121-135-150', 'ppid': 46414, 'port': 44238, 'tb_pid': 0, 'tb_port': 0}
2017-05-05 11:40:12,678 INFO (MainThread-46435) node: {'addr': '/tmp/pymp-xTRaz/listener-zKctvL', 'task_index': 0, 'job_name': 'worker', 'authkey': '\x9e\xbd\xb8gJ\xc4@"\x93Q\x9bd\x8c\x85\x10S', 'worker_num': 1, 'host': 'u10-121-135-150', 'ppid': 46417, 'port': 42615, 'tb_pid': 0, 'tb_port': 0}
2017-05-05 11:40:12,678 INFO (MainThread-46435) node: {'addr': '/tmp/pymp-kXvJ0C/listener-J6qzjO', 'task_index': 1, 'job_name': 'worker', 'authkey': 'JhP\xff\xa2\xe2Nw\x9d\x02RG\x00 N]', 'worker_num': 2, 'host': 'u10-121-135-150', 'ppid': 46415, 'port': 42711, 'tb_pid': 0, 'tb_port': 0}
2017-05-05 11:40:12,678 INFO (MainThread-46435) node: {'addr': '/tmp/pymp-NPwGp/listener-YFUdU5', 'task_index': 2, 'job_name': 'worker', 'authkey': 'j):\x13\xf2\x00E\x80\x89\xa2\xe9\xd5\xa6*HI', 'worker_num': 3, 'host': 'u10-121-135-150', 'ppid': 46419, 'port': 40856, 'tb_pid': 0, 'tb_port': 0}
2017-05-05 11:40:13,526 INFO (MainThread-46435) Starting TensorFlow ps:0 on cluster node 0 on background process
Process Process-2:
Traceback (most recent call last):
File "/home/hadoop/Python/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/hadoop/Python/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/home/hadoop/hadoopSpace/tmp/nm-local-dir/usercache/hadoop/appcache/application_1493036076768_0112/container_1493036076768_0112_01_000003/pyfiles/VDSR.py", line 79, in map_fun
import tensorflow as tf
ImportError: No module named tensorflow
17/05/05 11:40:14 INFO executor.Executor: Executor is trying to kill task 0.0 in stage 0.0 (TID 0)
2017-05-05 11:40:15,381 INFO (MainThread-46435) Got msg: None
2017-05-05 11:40:15,381 INFO (MainThread-46435) Terminating PS
17/05/05 11:40:15 WARN python.PythonRunner: Incomplete task interrupted: Attempting to kill Python Worker
17/05/05 11:40:15 INFO executor.Executor: Executor killed task 0.0 in stage 0.0 (TID 0)
17/05/05 17:19:36 INFO executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown


bobo2001281 commented on May 18, 2024

But when I run python VDSR.py locally, no error occurs.


bobo2001281 commented on May 18, 2024

I installed TensorFlow in my local environment:
root@u10-121-135-150:~# python
Python 2.7.6 (default, Oct 26 2016, 20:30:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.

import tensorflow
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally

How can I install TensorFlow into my Python distribution?
It seems that TensorFlow is not in the distributed environment:

hadoop@u10-121-135-150:~$ Python/bin/python
Python 2.7.12 (default, Apr 21 2017, 16:35:26)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.

import tensorflow
Traceback (most recent call last):
File "", line 1, in
ImportError: No module named tensorflow


bobo2001281 commented on May 18, 2024

In the instructions:
https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_YARN#convert-the-mnist-zip-files-into-hdfs-files

Install and compile TensorFlow w/ RDMA Support

git clone git@github.com:yahoo/tensorflow.git

# For TensorFlow 0.12 w/ RDMA, checkout the 'yahoo' branch
# For TensorFlow 1.0 w/ RDMA, checkout the 'jun_r1.0' branch
# follow build instructions to install into ${PYTHON_ROOT}

Regarding the last line: how can I install TensorFlow into the Python distribution?


leewyang commented on May 18, 2024

Actually, if you do not need RDMA support, you should be able to just run something like:
Python/bin/pip install tensorflow

If you need specific versions (e.g. Python 2.7 vs. 3.x, CPU/GPU, etc), you can adapt these instructions from TensorFlow
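
In other words: install into the unzipped distribution, re-zip it, and push it back to HDFS. Roughly (a sketch, assuming ${PYTHON_ROOT} points at the unzipped Python folder):

${PYTHON_ROOT}/bin/pip install tensorflow

# re-zip the distribution and overwrite the copy on HDFS so that
# --archives hdfs:///user/${USER}/Python.zip#Python picks up the change
pushd ${PYTHON_ROOT}
zip -r ~/Python.zip *
popd
hadoop fs -put -f ~/Python.zip hdfs:///user/${USER}/Python.zip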


bobo200128docker commented on May 18, 2024

When I execute python VDSR.py, it takes about 10s.
(I have added BATCH_SIZE=256 and set MAX_EPOCH=1.)
When I run VDSR_spark.py (which calls VDSR.py, following the Conversion Guide), the state stays RUNNING for a long time and never finishes.
There are no logs in HDFS /app-logs.


bobo200128docker commented on May 18, 2024

2017-05-11 20:46:35,619 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1494227397798_0029_000001 State change from SCHEDULED to ALLOCATED_SAVING
2017-05-11 20:46:35,619 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1494227397798_0029_000001 State change from ALLOCATED_SAVING to ALLOCATED
2017-05-11 20:46:35,619 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching masterappattempt_1494227397798_0029_000001
2017-05-11 20:46:35,620 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting up container Container: [ContainerId: container_1494227397798_0029_01_000001, NodeId: u10-121-135-150:38391, NodeHttpAddress: u10-121-135-150:8042, Resource: <memory:2048, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.121.135.150:38391 }, ] for AM appattempt_1494227397798_0029_000001
2017-05-11 20:46:35,620 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command to launch container container_1494227397798_0029_01_000001 : LD_LIBRARY_PATH="/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH",{{JAVA_HOME}}/bin/java,-server,-Xmx1024m,-Djava.io.tmpdir={{PWD}}/tmp,-Dspark.yarn.app.container.log.dir=<LOG_DIR>,org.apache.spark.deploy.yarn.ApplicationMaster,--class,'org.apache.spark.deploy.PythonRunner',--primary-py-file,VDSR_spark.py,--arg,'--images',--arg,'hdfs:///user/hadoop/data/train',--arg,'--model',--arg,'./model_VDSR_me',--properties-file,{{PWD}}/spark_conf/spark_conf.properties,1>,<LOG_DIR>/stdout,2>,<LOG_DIR>/stderr
2017-05-11 20:46:35,620 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Create AMRMToken for ApplicationAttempt: appattempt_1494227397798_0029_000001
2017-05-11 20:46:35,620 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Creating password for appattempt_1494227397798_0029_000001
2017-05-11 20:46:35,627 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done launching container Container: [ContainerId: container_1494227397798_0029_01_000001, NodeId: u10-121-135-150:38391, NodeHttpAddress: u10-121-135-150:8042, Resource: <memory:2048, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.121.135.150:38391 }, ] for AM appattempt_1494227397798_0029_000001
2017-05-11 20:46:35,627 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1494227397798_0029_000001 State change from ALLOCATED to LAUNCHED
2017-05-11 20:46:36,618 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1494227397798_0029_01_000001 Container Transitioned from ACQUIRED to RUNNING
2017-05-11 20:46:42,576 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1494227397798_0029_000001 (auth:SIMPLE)
2017-05-11 20:46:42,580 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AM registration appattempt_1494227397798_0029_000001
2017-05-11 20:46:42,580 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop IP=10.121.135.150 OPERATION=Register App Master TARGET=ApplicationMasterService RESULT=SUCCESS APPID=application_1494227397798_0029 APPATTEMPTID=appattempt_1494227397798_0029_000001
2017-05-11 20:46:42,580 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1494227397798_0029_000001 State change from LAUNCHED to RUNNING
2017-05-11 20:46:42,581 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1494227397798_0029 State change from ACCEPTED to RUNNING
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1494227397798_0029_01_000002 Container Transitioned from NEW to ALLOCATED
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop OPERATION=AM Allocated Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1494227397798_0029 CONTAINERID=container_1494227397798_0029_01_000002
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_1494227397798_0029_01_000002 of capacity <memory:30720, vCores:1> on host u10-121-135-152:60402, which has 5 containers, <memory:124928, vCores:5> used and <memory:6144, vCores:3> available after allocation
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: assignedContainer application attempt=appattempt_1494227397798_0029_000001 container=Container: [ContainerId: container_1494227397798_0029_01_000002, NodeId: u10-121-135-152:60402, NodeHttpAddress: u10-121-135-152:8042, Resource: <memory:30720, vCores:1>, Priority: 1, Token: null, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:190464, vCores:9>, usedCapacity=0.7265625, absoluteUsedCapacity=0.7265625, numApps=3, numContainers=9 clusterResource=<memory:262144, vCores:16>
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting assigned queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:221184, vCores:10>, usedCapacity=0.84375, absoluteUsedCapacity=0.84375, numApps=3, numContainers=10
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.84375 absoluteUsedCapacity=0.84375 used=<memory:221184, vCores:10> cluster=<memory:262144, vCores:16>
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1494227397798_0029_01_000003 Container Transitioned from NEW to ALLOCATED
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop OPERATION=AM Allocated Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1494227397798_0029 CONTAINERID=container_1494227397798_0029_01_000003
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_1494227397798_0029_01_000003 of capacity <memory:30720, vCores:1> on host u10-121-135-150:38391, which has 6 containers, <memory:126976, vCores:6> used and <memory:4096, vCores:2> available after allocation
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: assignedContainer application attempt=appattempt_1494227397798_0029_000001 container=Container: [ContainerId: container_1494227397798_0029_01_000003, NodeId: u10-121-135-150:38391, NodeHttpAddress: u10-121-135-150:8042, Resource: <memory:30720, vCores:1>, Priority: 1, Token: null, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:221184, vCores:10>, usedCapacity=0.84375, absoluteUsedCapacity=0.84375, numApps=3, numContainers=10 clusterResource=<memory:262144, vCores:16>
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting assigned queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:251904, vCores:11>, usedCapacity=0.9609375, absoluteUsedCapacity=0.9609375, numApps=3, numContainers=11
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.9609375 absoluteUsedCapacity=0.9609375 used=<memory:251904, vCores:11> cluster=<memory:262144, vCores:16>
2017-05-11 20:46:44,050 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Sending NMToken for nodeId : u10-121-135-152:60402 for container : container_1494227397798_0029_01_000002
2017-05-11 20:46:44,051 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1494227397798_0029_01_000002 Container Transitioned from ALLOCATED to ACQUIRED
2017-05-11 20:46:44,051 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Sending NMToken for nodeId : u10-121-135-150:38391 for container : container_1494227397798_0029_01_000003
2017-05-11 20:46:44,052 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1494227397798_0029_01_000003 Container Transitioned from ALLOCATED to ACQUIRED
2017-05-11 20:46:44,622 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1494227397798_0029_01_000002 Container Transitioned from ACQUIRED to RUNNING
2017-05-11 20:46:44,622 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1494227397798_0029_01_000003 Container Transitioned from ACQUIRED to RUNNING


leewyang commented on May 18, 2024

Those logs don't reveal much... Please grab the yarn application logs and look for errors on the executors.


bobo200128docker commented on May 18, 2024

If I cancel (Ctrl+C) the application while it is running, where will the application log be saved?


leewyang commented on May 18, 2024

You will need to do the following:

yarn application -kill <your_applicationId>
yarn logs -applicationId <your_applicationId>   >yarn.log


bobo200128docker commented on May 18, 2024

hadoop@u10-121-135-150:~$ Python/bin/python
Python 2.7.12 (default, Apr 21 2017, 16:35:26)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.

import tensorflow as tf
tf.sub(None,None)
Traceback (most recent call last):
File "", line 1, in
AttributeError: 'module' object has no attribute 'sub'
tf.__version__
'1.0.1'

[4]+ Stopped Python/bin/python
hadoop@u10-121-135-150:~$ python
Python 2.7.6 (default, Oct 26 2016, 20:30:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.

import tensorflow as tf
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
tf.__version__
'0.12.1'

What is the difference between the above two TensorFlow installations?
Now the situation is: when I try VDSR.py,
0.12.1 is in the local environment and has the function tf.sub(),
while 1.0.1 is on the executors and reports the error "AttributeError: 'module' object has no attribute 'sub'".


bobo200128docker commented on May 18, 2024

I have installed scipy in the Python distribution that is shipped to the Spark executors.
Why does "ImportError: No module named scipy.io" still occur in the application log?


leewyang commented on May 18, 2024

A couple notes:

  • Python/bin/python refers to the custom python distribution that we create and zip up to ship to the executors. We do this because users often don't have control of the python version/packages on the executors.
  • python on the gateway node is just the local installation of python, which will not be distributed to the executors. If you install any dependencies into this local/gateway installation, the dependencies will not be automatically installed on the executors.
  • The TensorFlow APIs changed fairly significantly between 0.12 and 1.0. They have a migration script to help you update your code, but your code will not be cross-compatible between these two versions.

So, with that all said, I'd recommend picking the versions of TensorFlow and Python that you wish to move forward with, then create a custom python distribution (with all necessary dependencies), and then use ONLY this distribution to test your code going forward (for "local" and "distributed" testing).
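
Concretely, that might look something like this (a sketch; the pinned TensorFlow version and the package list depend on what you settle on):

# install every module the job imports into the shipped distribution,
# pinning TensorFlow to whichever API version the code is written for
Python/bin/pip install "tensorflow==0.12.1" scipy

# smoke-test locally with the SAME interpreter the executors will use
Python/bin/python VDSR.py

# then re-zip the Python folder and overwrite Python.zip on HDFS as before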

