I think I screwed up somewhere and the script doesn't understand where my directory is:
Process STDOUT and STDERR is being redirected to /tmp/raylogs/.
Waiting for redis server at 127.0.0.1:62651 to respond...
Waiting for redis server at 127.0.0.1:22361 to respond...
Starting local scheduler with the following resources: {'CPU': 8, 'GPU': 0}.
======================================================================
View the web UI at http://localhost:8894/notebooks/ray_ui34599.ipynb?token=298685d42e77e7e460e34c71da0e3d27257a1ad1a42a1c5a
======================================================================
== Status ==
Using FIFO scheduling algorithm.
Result logdir: /home/yxu/ray_results/awesome
PENDING trials:
- train_0_lr=0.55999,momentum=0.7021: PENDING
- train_1_lr=0.015444,momentum=0.7021: PENDING
- train_2_lr=0.55999,momentum=0.89643: PENDING
- train_3_lr=0.015444,momentum=0.89643: PENDING
.....
Final model stored at "/home/yxu/Documents/default-risk/checkpoint/net2018-08-04 13:37-best.pth.tar".
Test set: Average loss: 0.1230, Accuracy: 9613/10000 (96%)
Final model stored at "/home/yxu/Documents/default-risk/checkpoint/net2018-08-04 13:37-best.pth.tar".
================== TESTING ==================
Test set: Average loss: 2.3449, Accuracy: 958/10000 (10%)
Final model stored at "/home/yxu/Documents/default-risk/checkpoint/net2018-08-04 13:37-best.pth.tar".
Test set: Average loss: 0.1230, Accuracy: 9613/10000 (96%)
Final model stored at "/home/yxu/Documents/default-risk/checkpoint/net2018-08-04 13:37-best.pth.tar".
Remote function train failed with:
Traceback (most recent call last):
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/worker.py", line 891, in _process_task
*arguments)
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/actor.py", line 261, in actor_method_executor
method_returns = method(actor, *args)
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/tune/trainable.py", line 117, in train
result = self._train()
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/tune/function_runner.py", line 114, in _train
result = self._status_reporter._get_and_clear_status()
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/tune/function_runner.py", line 42, in _get_and_clear_status
raise TuneError("Trial finished without reporting result!")
ray.tune.error.TuneError: Trial finished without reporting result!
Error processing event: Traceback (most recent call last):
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 255, in _process_events
result = ray.get(result_id)
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/worker.py", line 2776, in get
raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(888e77e8b61177963bd332b03dec3ac3d6aa12f9). It was created by remote function train which failed with:
Remote function train failed with:
Traceback (most recent call last):
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/worker.py", line 891, in _process_task
*arguments)
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/actor.py", line 261, in actor_method_executor
method_returns = method(actor, *args)
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/tune/trainable.py", line 117, in train
result = self._train()
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/tune/function_runner.py", line 114, in _train
result = self._status_reporter._get_and_clear_status()
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/tune/function_runner.py", line 42, in _get_and_clear_status
raise TuneError("Trial finished without reporting result!")
ray.tune.error.TuneError: Trial finished without reporting result!
Suppressing duplicate error message.
Worker ip unknown, skipping log sync for /home/yxu/ray_results/awesome/train_2_lr=0.55999,momentum=0.89643_2018-08-04_13-37-27t_x_vep8
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 3/8 CPUs, 0/0 GPUs
Result logdir: /home/yxu/ray_results/awesome
ERROR trials:
- train_2_lr=0.55999,momentum=0.89643: ERROR, 1 failures: /home/yxu/ray_results/awesome/train_2_lr=0.55999,momentum=0.89643_2018-08-04_13-37-27t_x_vep8/error_2018-08-04_13-39-21.txt
RUNNING trials:
- train_0_lr=0.55999,momentum=0.7021: RUNNING
- train_1_lr=0.015444,momentum=0.7021: RUNNING
- train_3_lr=0.015444,momentum=0.89643: RUNNING
Error processing event: Traceback (most recent call last):
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 255, in _process_events
result = ray.get(result_id)
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/worker.py", line 2776, in get
raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(1af518971277c3ede6df2b728c40c5195a99e2b6). It was created by remote function train which failed with:
Remote function train failed with:
Traceback (most recent call last):
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/worker.py", line 891, in _process_task
*arguments)
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/actor.py", line 261, in actor_method_executor
method_returns = method(actor, *args)
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/tune/trainable.py", line 117, in train
result = self._train()
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/tune/function_runner.py", line 114, in _train
result = self._status_reporter._get_and_clear_status()
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/tune/function_runner.py", line 42, in _get_and_clear_status
raise TuneError("Trial finished without reporting result!")
ray.tune.error.TuneError: Trial finished without reporting result!
/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96,got 88
return f(*args, **kwds)
/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96,got 88
return f(*args, **kwds)
Worker ip unknown, skipping log sync for /home/yxu/ray_results/awesome/train_3_lr=0.015444,momentum=0.89643_2018-08-04_13-37-27ynfch_di
Validation set: Average loss: 0.1793, Accuracy: 11327/11968 (95%)
================== TESTING ==================
Test set: Average loss: 2.3101, Accuracy: 958/10000 (10%)
Final model stored at "/home/yxu/Documents/default-risk/checkpoint/net2018-08-04 13:37-best.pth.tar".
Error processing event: Traceback (most recent call last):
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 255, in _process_events
result = ray.get(result_id)
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/worker.py", line 2776, in get
raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(f17f0f114cee8c6d34a8a8a55feaabafee1496c1). It was created by remote function train which failed with:
Remote function train failed with:
Traceback (most recent call last):
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/worker.py", line 891, in _process_task
*arguments)
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/actor.py", line 261, in actor_method_executor
method_returns = method(actor, *args)
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/tune/trainable.py", line 117, in train
result = self._train()
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/tune/function_runner.py", line 114, in _train
result = self._status_reporter._get_and_clear_status()
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/tune/function_runner.py", line 42, in _get_and_clear_status
raise TuneError("Trial finished without reporting result!")
ray.tune.error.TuneError: Trial finished without reporting result!
Suppressing duplicate error message.
/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96,got 88
return f(*args, **kwds)
Worker ip unknown, skipping log sync for /home/yxu/ray_results/awesome/train_0_lr=0.55999,momentum=0.7021_2018-08-04_13-37-263b0iemnu
Test set: Average loss: 0.1684, Accuracy: 9464/10000 (95%)
Final model stored at "/home/yxu/Documents/default-risk/checkpoint/net2018-08-04 13:37-best.pth.tar".
Error processing event: Traceback (most recent call last):
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 255, in _process_events
result = ray.get(result_id)
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/worker.py", line 2776, in get
raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(6faf591f18be2eaed89f51cfe0f21c68f9075879). It was created by remote function train which failed with:
Remote function train failed with:
Traceback (most recent call last):
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/worker.py", line 891, in _process_task
*arguments)
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/actor.py", line 261, in actor_method_executor
method_returns = method(actor, *args)
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/tune/trainable.py", line 117, in train
result = self._train()
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/tune/function_runner.py", line 114, in _train
result = self._status_reporter._get_and_clear_status()
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/tune/function_runner.py", line 42, in _get_and_clear_status
raise TuneError("Trial finished without reporting result!")
ray.tune.error.TuneError: Trial finished without reporting result!
Suppressing duplicate error message.
/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96,got 88
return f(*args, **kwds)
Worker ip unknown, skipping log sync for /home/yxu/ray_results/awesome/train_1_lr=0.015444,momentum=0.7021_2018-08-04_13-37-271fw4grhb
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/8 CPUs, 0/0 GPUs
Result logdir: /home/yxu/ray_results/awesome
ERROR trials:
- train_0_lr=0.55999,momentum=0.7021: ERROR, 1 failures: /home/yxu/ray_results/awesome/train_0_lr=0.55999,momentum=0.7021_2018-08-04_13-37-263b0iemnu/error_2018-08-04_13-39-22.txt
- train_1_lr=0.015444,momentum=0.7021: ERROR, 1 failures: /home/yxu/ray_results/awesome/train_1_lr=0.015444,momentum=0.7021_2018-08-04_13-37-271fw4grhb/error_2018-08-04_13-39-23.txt
- train_2_lr=0.55999,momentum=0.89643: ERROR, 1 failures: /home/yxu/ray_results/awesome/train_2_lr=0.55999,momentum=0.89643_2018-08-04_13-37-27t_x_vep8/error_2018-08-04_13-39-21.txt
- train_3_lr=0.015444,momentum=0.89643: ERROR, 1 failures: /home/yxu/ray_results/awesome/train_3_lr=0.015444,momentum=0.89643_2018-08-04_13-37-27ynfch_di/error_2018-08-04_13-39-21.txt
Traceback (most recent call last):
File "dev/mnist-ray.py", line 291, in <module>
}
File "/home/yxu/.local/share/virtualenvs/default-risk-mgdyo4BW/lib/python3.6/site-packages/ray/tune/tune.py", line 104, in run_experiments
raise TuneError("Trials did not complete", errored_trials)
ray.tune.error.TuneError: ('Trials did not complete', [train_0_lr=0.55999,momentum=0.7021, train_1_lr=0.015444,momentum=0.7021, train_2_lr=0.55999,momentum=0.89643, train_3_lr=0.015444,momentum=0.89643])
/ray/src/local_scheduler/local_scheduler.cc:177: Killed worker pid 13852 which hadn't started yet.
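
Reading the traceback, the `TuneError` is raised from `_get_and_clear_status` in `function_runner.py`, which suggests each trial ran to the end (the "Test set" lines are printed) but never handed a result back to Tune, rather than failing on a path. Below is a minimal sketch of that check that runs without Ray. Everything in it is an assumption for illustration: `StubReporter`, `run_trial`, `train_silent` and `train_reporting` are hypothetical names, the `(config, reporter)` signature is assumed from how function-based trainables worked in this Ray version, and the accuracy value is a placeholder.

```python
class TuneError(Exception):
    """Stand-in for ray.tune.error.TuneError so the sketch runs without Ray."""


class StubReporter:
    """Hypothetical, simplified stand-in for Tune's status reporter."""

    def __init__(self):
        self._latest_result = None

    def __call__(self, **metrics):
        # Each call records the most recent metrics for the trial.
        self._latest_result = metrics

    def get_and_clear_status(self):
        # Mirrors the failing check in the traceback: nothing was ever reported.
        if self._latest_result is None:
            raise TuneError("Trial finished without reporting result!")
        result, self._latest_result = self._latest_result, None
        return result


def train_silent(config, reporter):
    # Prints a metric but never calls reporter(...), like the failing trials.
    print("Test set: Accuracy: 96%")


def train_reporting(config, reporter):
    # Same idea, but the metric is also passed to the reporter.
    accuracy = 0.96  # placeholder value, not a real measurement
    reporter(mean_accuracy=accuracy)


def run_trial(train_fn, config):
    # Run one trial function, then ask the reporter for what it collected.
    reporter = StubReporter()
    train_fn(config, reporter)
    return reporter.get_and_clear_status()
```

With this stub, `run_trial(train_reporting, {"lr": 0.01})` returns the reported metrics, while `run_trial(train_silent, {"lr": 0.01})` raises the same message seen in the log. If that reading is right, the equivalent change in `dev/mnist-ray.py` would be to call the reporter with the test accuracy inside `train`, but I have not confirmed that against the actual script.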