williamfalcon / test-tube
Python library to easily log experiments and parallelize hyperparameter search for neural networks
License: MIT License
Initial tests should include:
I'm attempting to train a PyTorch Lightning model on a Slurm cluster, and the PyTorch Lightning documentation recommends using the SlurmCluster class in this package to automate submission of Slurm scripts. The examples all involve running a hyperparameter scan; however, I would like to train just a single model. My attempt at doing so is as follows:
cluster = SlurmCluster()
[...] (set cluster.per_experiment_nb_cpus, cluster.job_time, etc.)
cluster.optimize_parallel_cluster_gpu(train, nb_trials=1, ...)
However, this fails with:
Traceback (most recent call last):
File "train.py", line 67, in hydra_main
train, nb_trials=1, job_name='pl-slurm', job_display_name='pl-slurm')
File "/global/u2/s/schuya/.local/cori/pytorchv1.5.0-gpu/lib/python3.7/site-packages/test_tube/hpc.py", line 127, in optimize_parallel_cluster_gpu
enable_auto_resubmit, on_gpu=True)
File "/global/u2/s/schuya/.local/cori/pytorchv1.5.0-gpu/lib/python3.7/site-packages/test_tube/hpc.py", line 167, in __optimize_parallel_cluster_internal
if self.is_from_slurm_object:
AttributeError: 'SlurmCluster' object has no attribute 'is_from_slurm_object'
Looking at the code, it seems that SlurmCluster.is_from_slurm_object was never set. This is because I did not pass in a hyperparam_optimizer, as I did not intend to perform a scan. What is the correct way to go about this?
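For reference, a sketch of one possible workaround under the current API: build a HyperOptArgumentParser with no tunable arguments and pass the parsed namespace as hyperparam_optimizer, so that nb_trials=1 runs exactly one configuration (the log_path below is a placeholder, and train is your existing training function):

from test_tube import HyperOptArgumentParser
from test_tube.hpc import SlurmCluster

# no opt_list/opt_range calls, so nothing is tunable and one trial = one model
parser = HyperOptArgumentParser(strategy='grid_search')
parser.add_argument('--learning_rate', default=0.002, type=float)
hparams = parser.parse_args()

cluster = SlurmCluster(
    hyperparam_optimizer=hparams,  # per the issue above, this is what sets is_from_slurm_object
    log_path='/path/to/logs',      # placeholder
)
# [...] (set cluster.per_experiment_nb_cpus, cluster.job_time, etc.)
cluster.optimize_parallel_cluster_gpu(train, nb_trials=1, job_name='pl-slurm')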
I'm getting this error using PyTorch DDP for TensorBoard's add_scalars (add_scalar works fine). Is there something I can do?
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/pytorch_lightning/trainer/ddp_mixin.py", line 181, in ddp_train
self.run_pretrain_routine(model)
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 471, in run_pretrain_routine
self.train()
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/pytorch_lightning/trainer/train_loop_mixin.py", line 60, in train
self.run_training_epoch()
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/pytorch_lightning/trainer/train_loop_mixin.py", line 114, in run_training_epoch
self.run_evaluation(test=self.testing)
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop_mixin.py", line 130, in run_evaluation
test)
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop_mixin.py", line 74, in evaluate
eval_results = model.validation_end(outputs)
File "/home/userfs/b/brm512/experiments/HomographyNet/lightning_module.py", line 547, in validation_end
self.logger.experiment.add_scalars('losses', {'train loss': self.loss_meter_training.avg, 'val loss':self.loss_meter_validation.avg} , self.epoch_nb)
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/torch/utils/tensorboard/writer.py", line 363, in add_scalars
fw_logdir = self._get_file_writer().get_logdir()
AttributeError: 'TTDummyFileWriter' object has no attribute 'get_logdir'
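As a hedged workaround sketch (not test-tube API): under DDP only one process owns a real TensorBoard writer, and the other ranks get the dummy writer that raised above, so guarding the add_scalars call by rank avoids the crash:

import torch.distributed as dist

def is_rank_zero():
    # only rank 0 has a real file writer under DDP; the other ranks receive
    # test-tube's TTDummyFileWriter, which has no get_logdir
    return not dist.is_available() or not dist.is_initialized() or dist.get_rank() == 0

# hypothetical guard inside validation_end:
# if is_rank_zero():
#     self.logger.experiment.add_scalars('losses', {...}, self.epoch_nb)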
The parameter tunnable should be spelled tunable.
Any plans to support Joseph Redmon's Darknet framework (https://pjreddie.com/darknet/) in the near future?
I tried to test this package and I initialized it like this:
exp = Experiment(name='word2vec',
debug=False,
save_dir='out/',
create_git_tag=True)
What went wrong when I wrote this:
exp.tag({'learning_rate': 0.002, 'nb_layers': 2})
Often you care about data on the log scale (.1, 1, 10, 100, etc.) rather than a uniform scale, since the order of magnitude is more important. It'd be nice to have a way to do that by default in test_tube.
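For illustration, a minimal numpy sketch of log-uniform sampling (not test-tube API): sample uniformly in exponent space and map back, so each order of magnitude is equally likely:

import numpy as np

low, high = 0.1, 100.0
# uniform in log10 space, then mapped back: .1-1, 1-10 and 10-100 are equally likely
samples = 10 ** np.random.uniform(np.log10(low), np.log10(high), size=5)
print(samples)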
Anyone using scientific python these days will be forced into python 3 only anyway as things like ipython force a migration. I think literally every relevant library that someone using test tube would need supports python 3 anyway. If test tube was 3.5+ only (reasonable since upgrading from 3.4 -> 3.5 doesn't require breaking anything), we could use type annotations in the codebase.
If interested in implementing hyperband (in a framework agnostic way), sign up for it here. Happy to brainstorm implementation design ideas.
I'm trying out pytorch-lightning and I'm having an issue after commits 1fad1c7 and 3fba70a. When I do
from test_tube import Experiment
exp = Experiment(save_dir=cfg['log_dir'])
trainer = Trainer(experiment=exp, max_nb_epochs=1, train_percent_check=0.01)
the program crashes with
File "/home/user/anaconda3/envs/pytorchenv/lib/python3.7/site-packages/test_tube/log.py", line 504, in _get_file_writer
if self.purge_step is not None:
AttributeError: 'Experiment' object has no attribute 'purge_step'
because there are four attributes of Experiment that are never set, among them self.purge_step, self.filename_suffix, and self.flush_secs.
Line 516 in 3d4ce06 makes the EventWriter save logs to save_dir rather than log_dir, which leaves the tf folder empty. This may be related to the new PyTorch update, since I am using v1.2.
I am following the guide to optimize hyperparameters over multiple GPUs: https://towardsdatascience.com/trivial-multi-node-training-with-pytorch-lightning-ff75dfb809bd
However, when I run the hyperparam opt, I get the following error:
RuntimeError: cuda runtime error (3) : initialization error at /pytorch/aten/src/THC/THCGeneral.cpp:54
Based on some reading, it seems to be an issue with initializing CUDA and multiprocessing, with the suggested change of adding multiprocessing.set_start_method('spawn', force=True). Looking at argparse_hopt.py, I see that that specific line is commented out. When I uncomment it, I get through that error but hit a pickle error:
AttributeError: Can't pickle local object 'HyperOptArgumentParser.optimize_parallel_gpu.<locals>.init'
Looking for suggestions on what to try, thanks!
Sometimes, there's a chance test-tube will try to create an experiment version which already exists. Need to add a small delay to avoid the race condition.
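A sketch of the kind of fix meant here (names hypothetical, not the actual patch): retry directory creation with a small random delay, and let exist_ok absorb the case where another worker wins the race:

import os
import random
import time

def makedirs_with_retry(path, attempts=3):
    for _ in range(attempts):
        try:
            # exist_ok=True tolerates a concurrent worker creating the dir first
            os.makedirs(path, exist_ok=True)
            return
        except OSError:
            # the "small delay": jitter so colliding workers spread out
            time.sleep(random.uniform(0.05, 0.25))
    os.makedirs(path, exist_ok=True)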
I tried the tensorflow_example.py to test the function of using multiple GPUs. When setting nb_workers to more than 1, I hit the exception below, and as a result I got only nb_trials - 1 tuning results rather than the expected nb_trials. Note that I am using Python 3.6.
Caught exception in worker thread [Errno 17] File exists: 'logs/multigpu/test_tube_data/dense_model/version_0'
Traceback (most recent call last):
File "/home/lchen/.local/lib/python3.6/site-packages/test_tube/argparse_hopt.py", line 30, in optimize_parallel_gpu_private
results = train_function(trial_params)
File "test_tube_multigpu.py", line 22, in train
autosave=False
File "/home/lchen/.local/lib/python3.6/site-packages/test_tube/log.py", line 58, in init
self.__init_cache_file_if_needed()
File "/home/lchen/.local/lib/python3.6/site-packages/test_tube/log.py", line 121, in __init_cache_file_if_needed
os.makedirs(exp_cache_file)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/os.py", line 220, in makedirs
mkdir(name, mode)
Pandas is there just for the csv writing
Thanks for sharing!
Could you show some examples, such as plots? I'm interested in using your tool.
When using SlurmCluster.optimize_parallel_cluster_gpu, is there a way to turn off the auto-resubmit for continuation? Would simply setting cluster.minutes_to_checkpoint_before_walltime = 0 do the trick?
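Judging by the traceback in an earlier issue above, optimize_parallel_cluster_gpu forwards an enable_auto_resubmit flag internally, so the following may be the cleaner switch (hedged, inferred from that call signature rather than documented):

cluster.optimize_parallel_cluster_gpu(
    train,
    nb_trials=1,
    job_name='my-job',
    enable_auto_resubmit=False,  # assumption: disables the walltime auto-resubmission
)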
I usually use python fire (https://github.com/google/python-fire); it creates the parser arguments by default by looking at the function to be called. Is there any possibility of integrating this into the existing framework?
Hi,
What is the best way to integrate with scikit-learn?
Still having trouble running the following demo script from the documentation:
from test_tube import HyperOptArgumentParser
# subclass of argparse
parser = HyperOptArgumentParser(strategy='random_search')
parser.add_argument('--learning_rate', default=0.002, type=float, help='the learning rate')
# let's enable optimizing over the number of layers in the network
parser.opt_list('--nb_layers', default=2, type=int, tunable=True, options=[2, 4, 8])
# and tune the number of units in each layer
parser.opt_range('--neurons', default=50, type=int, tunable=True, low=100, high=800, nb_samples=10)
# compile (because it's argparse underneath)
hparams = parser.parse_args()
Exception:
usage: ipykernel_launcher.py [-h] [--learning_rate LEARNING_RATE]
[--nb_layers NB_LAYERS] [--neurons NEURONS]
ipykernel_launcher.py: error: unrecognized arguments: -f /Users/me/Library/Jupyter/runtime/kernel-b655e4ee-01bb-427d-a041-327a8e2836bc.json
An exception has occurred, use %tb to see the full traceback.
SystemExit: 2
/Users/me/anaconda/envs/py36/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2971: UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
%tb
Traceback details:
---------------------------------------------------------------------------
SystemExit Traceback (most recent call last)
<ipython-input-16-90bd4dc39a7f> in <module>()
12
13 # compile (because it's argparse underneath)
---> 14 hparams = parser.parse_args()
~/anaconda/envs/py36/lib/python3.6/site-packages/test_tube/argparse_hopt.py in parse_args(self, args, namespace)
116 def parse_args(self, args=None, namespace=None):
117 # call superclass arg first
--> 118 results = super(HyperOptArgumentParser, self).parse_args(args=args, namespace=namespace)
119
120 # extract vals
~/anaconda/envs/py36/lib/python3.6/argparse.py in parse_args(self, args, namespace)
1731 if argv:
1732 msg = _('unrecognized arguments: %s')
-> 1733 self.error(msg % ' '.join(argv))
1734 return args
1735
~/anaconda/envs/py36/lib/python3.6/argparse.py in error(self, message)
2387 self.print_usage(_sys.stderr)
2388 args = {'prog': self.prog, 'message': message}
-> 2389 self.exit(2, _('%(prog)s: error: %(message)s\n') % args)
~/anaconda/envs/py36/lib/python3.6/argparse.py in exit(self, status, message)
2374 if message:
2375 self._print_message(message, _sys.stderr)
-> 2376 _sys.exit(status)
2377
2378 def error(self, message):
SystemExit: 2
> /Users/me/anaconda/envs/py36/lib/python3.6/argparse.py(2376)exit()
2374 if message:
2375 self._print_message(message, _sys.stderr)
-> 2376 _sys.exit(status)
2377
2378 def error(self, message):
Is this something to do with the fact that I am trying to run it in a Jupyter notebook?
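Most likely, yes: the Jupyter kernel injects a -f argument that the parser rejects. A common general argparse workaround (not specific to test-tube) is to pass an explicit argument list, which the parse_args signature in the traceback above accepts:

# bypass the kernel's injected -f flag by parsing an explicit (empty) list
hparams = parser.parse_args(args=[])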
Hi,
What is the best way to get a CSV with a summary of all the parameters and the final results?
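Not an official API, but a minimal pandas sketch assuming test-tube's default on-disk layout of one directory per version (the path pattern and the metrics.csv file name are assumptions; the save_dir and name match the word2vec example above):

import glob
import pandas as pd

frames = []
for path in glob.glob('out/word2vec/version_*/metrics.csv'):
    df = pd.read_csv(path)
    df['version'] = path.split('/')[-2]  # tag rows with their version dir
    frames.append(df)

summary = pd.concat(frames, ignore_index=True)
summary.to_csv('summary.csv', index=False)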
Right now HyperOptArgumentParser contains much of the logic for doing a hyperparameter search on a local machine.
https://github.com/williamFalcon/test-tube/blob/master/test_tube/argparse_hopt.py#L259
Why this is not great:
- A HyperOptArgumentParser object is passed into SlurmCluster.
- HyperOptArgumentParser should be independent of the mechanism of deployment.
Proposed change:
Have something like a Local or LocalSystem object that, similar to SlurmCluster, accepts a HyperOptArgumentParser and can be used to optimize hyperparameters locally:
hyperparams = parser.parse_args()

# Enable cluster training.
system = LocalSystem(
    hyperparam_optimizer=hyperparams,
    log_path=hyperparams.log_path,
    python_cmd='python3',
    test_tube_exp_name=hyperparams.test_tube_exp_name
)
system.max_cpus = 100
system.max_gpus = 5

# Each hyperparameter combination will use 200 cpus.
system.optimize_parallel_cpu(
    # Function to execute:
    train,
    # Number of hyperparameter combinations to search:
    nb_trials=24)
Downsides
Is it possible to install test_tube without pulling in torch? I am building a Docker image with TensorFlow and I do not want it to pull in torch.
Thanks
I am trying to extend my existing argparse with the HyperOptArgumentParser class. In my current argparse I have two options which are nargs lists.
When I do the following,
parser = HyperOptArgumentParser(strategy='random_search')
....
parser.add_argument('--input_idx', nargs='+', type=int, help='input indices from data', default=[0, 1])
parser.add_argument('--output_idx', nargs='+', type=int, help='output indices from data', default=[2])
I get this error:
Traceback (most recent call last):
File "train_process.py", line 11, in <module>
args = parser.parse_args()
File "/home/srivathsa/miniconda3/envs/py35gad/lib/python3.6/site-packages/test_tube/argparse_hopt.py", line 237, in parse_args
results = self.__parse_args(args, namespace)
File "/home/srivathsa/miniconda3/envs/py35gad/lib/python3.6/site-packages/test_tube/argparse_hopt.py", line 158, in __parse_args
args, argv = self.__whitelist_cluster_commands(args, argv)
File "/home/srivathsa/miniconda3/envs/py35gad/lib/python3.6/site-packages/test_tube/argparse_hopt.py", line 196, in __whitelist_cluster_commands
all_values.add(v)
TypeError: unhashable type: 'list'
How can I fix this?
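Until the whitelist handles unhashable defaults, one workaround sketch is to keep the parser values as hashable comma-separated strings and split them after parsing (this works around test-tube rather than fixing it):

parser.add_argument('--input_idx', type=str, default='0,1',
                    help='comma-separated input indices from data')
parser.add_argument('--output_idx', type=str, default='2',
                    help='comma-separated output indices from data')

hparams = parser.parse_args()
# convert outside the parser, so the values test-tube hashes stay plain strings
input_idx = [int(v) for v in hparams.input_idx.split(',')]
output_idx = [int(v) for v in hparams.output_idx.split(',')]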
Add Tensorboardx integration
(https://github.com/lanpa/tensorboard-pytorch)
Currently the hparams are logged as text in TensorBoard. Could we change this to use the add_hparams() function in SummaryWriter? This would allow some additional nice views of the hyperparameters, such as the parallel coordinates view.
I think this would be a relatively easy change and I can help out with it if needed.
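For reference, a minimal sketch of what the logging side could look like, using torch.utils.tensorboard directly rather than test-tube's current text logging (the log path and metric names are placeholders):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('logs/hparams_demo')
# hparam_dict holds the hyperparameters, metric_dict the resulting metrics;
# TensorBoard's HParams plugin then renders parallel-coordinates views of them
writer.add_hparams(
    {'learning_rate': 0.002, 'nb_layers': 2},
    {'hparam/val_loss': 0.45},
)
writer.close()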
I had a setup on Ubuntu with Python version 3.6 and manually installed the libraries below:
bleach==1.5.0
certifi==2016.2.28
cycler==0.10.0
decorator==4.1.2
html5lib==0.9999999
Markdown==2.6.9
matplotlib==2.1.0
networkx==2.0
nltk==3.2.5
numpy==1.13.3
olefile==0.44
pandas==0.21.0
Pillow==4.3.0
protobuf==3.4.0
pyparsing==2.2.0
python-dateutil==2.6.1
pytz==2017.3
PyWavelets==0.5.2
scikit-image==0.13.1
scikit-learn==0.19.1
scipy==1.0.0
six==1.11.0
sklearn==0.0
tensorflow==1.3.0
tensorflow-tensorboard==0.1.8
test-tube==0.46
Werkzeug==0.12.2
I face an issue while running the script and get the error below, so please advise. The requirements pin test-tube version 0.46, but I was not able to install that version, so I used test-tube==0.2:
from test_tube import HyperOptArgumentParser
ImportError: cannot import name 'HyperOptArgumentParser'
Hey, I installed the package but while importing it throws up the following error:
File "", line 1, in
File "/home/ece/anaconda2/envs/sameer27/lib/python2.7/site-packages/test_tube/init.py", line 6, in
from .log import Experiment
File "/home/ece/anaconda2/envs/sameer27/lib/python2.7/site-packages/test_tube/log.py", line 366
header = f'''###### {self.name}, version {self.version}\n---\n'''
^
SyntaxError: invalid syntax
Thanks
I have a grid search that results in ~300 processes running. I'd like to submit them to slurm as a slurm array, so as not to overload the scheduler. Is there any functionality to support that in test-tube? I didn't see anything in the SlurmCluster code to indicate that it is.
Some function names are a bit verbose. It'd be nice to at least alias the following:
add_metric_row -> add (or log), since that carries the same meaning and is a lot shorter
add_opt_argument_list -> add_argument_list
add_opt_argument_range -> add_argument_range
I installed this package in my py2.7 environment, but after I import it I get this error. I don't know how to fix it, because it seems all the code is right.
import test_tube
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/andy/.conda/envs/dynet/lib/python2.7/site-packages/test_tube/__init__.py", line 5, in <module>
from .argparse_hopt import HyperOptArgumentParser
File "/home/andy/.conda/envs/dynet/lib/python2.7/site-packages/test_tube/argparse_hopt.py", line 38
def add_opt_argument_list(self, *args, options=None, tunnable=False, **kwargs):
SyntaxError: invalid syntax
Hi! Thanks a lot for this neat package! :)
I was wondering whether there is a way to get the global_step in the metric.csv file? It is sent to Tensorboard but not added to self.metric in the Experiment.log method.
Cheers,
Emile
What do you think about making the nb_trials param default to None? I've gotten a few confused questions about how many to use when using grid search.
@zafarali
https://github.com/williamFalcon/test-tube/blob/master/test_tube/argparse_hopt.py#L262
When defining range-based arguments, e.g.
parser.opt_range('--neurons', default=50, type=int, tunable=True, low=100, high=800, nb_samples=10)
the calculated values are not correct. In this case only min and max will be used for hyperparameter optimization instead of the requested 10 parameters, which is caused by a missing assignment of the member variable (line 372).
Furthermore, if the log_base parameter is set, only a single value is determined, causing an exception in __flatten_params (argparse_hopt.py, line 339). As a fix, add nb_samples as an argument to the np.random.uniform function call (line 379).
When I first started using the package (great work btw, thanks a lot!) I assumed that the intended behaviour for HyperOptArgumentParser is to only iterate over the hyperparams if they are not specified in the CLI. In my opinion this would be the convenient designated behaviour, and if people agree I could implement it (currently I'm using a hacky work-around).
So currently, if I run python hyperparam.py --arg1 128 and arg1 is specified as
parser.opt_list('--arg1', options=[128, 256, 512])
it would still iterate over the arg1 options. My desired behaviour would be to only iterate over the options if the argument is not explicitly given.
Currently I use this hack:
import sys

class MyArgParser(HyperOptArgumentParser):
    def my_opt_list(self, name, default, dtype, options, **kwargs):
        # only tune the argument when it was not given explicitly on the CLI
        tunable = (name not in sys.argv)
        self.opt_list(name, default=default, type=dtype, tunable=tunable,
                      options=options, **kwargs)
Which I would be happy to implement if I'm not the only one facing this issue.
Add SMAC strategy
(http://www.cs.ubc.ca/labs/beta/Projects/SMAC/)
Wanted to try out test-tube but I can't even get the following demo script to work:
from test_tube import HyperOptArgumentParser
# subclass of argparse
parser = HyperOptArgumentParser(strategy='random_search')
parser.add_argument('--learning_rate', default=0.002, type=float, help='the learning rate')
# let's enable optimizing over the number of layers in the network
parser.opt_list('--nb_layers', default=2, type=int, tunable=True, options=[2, 4, 8])
# and tune the number of units in each layer
parser.opt_range('--neurons', default=50, type=int, tunable=True, low=100, high=800, nb_samples=10)
# compile (because it's argparse underneath)
hparams = parser.parse_args()
Output:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-37-90bd4dc39a7f> in <module>()
9
10 # and tune the number of units in each layer
---> 11 parser.opt_range('--neurons', default=50, type=int, tunable=True, low=100, high=800, nb_samples=10)
12
13 # compile (because it's argparse underneath)
~/anaconda/envs/py36/lib/python3.6/site-packages/test_tube/argparse_hopt.py in opt_range(self, *args, **kwargs)
100 log_base = kwargs.pop("log_base", None)
101
--> 102 self.add_argument(*args, **kwargs)
103 arg_name = args[-1]
104 self.opt_args[arg_name] = OptArg(
~/anaconda/envs/py36/lib/python3.6/site-packages/test_tube/argparse_hopt.py in add_argument(self, *args, **kwargs)
80
81 def add_argument(self, *args, **kwargs):
---> 82 super(HyperOptArgumentParser, self).add_argument(*args, **kwargs)
83
84 def opt_list(self, *args, **kwargs):
~/anaconda/envs/py36/lib/python3.6/argparse.py in add_argument(self, *args, **kwargs)
1332 if not callable(action_class):
1333 raise ValueError('unknown action "%s"' % (action_class,))
-> 1334 action = action_class(**kwargs)
1335
1336 # raise an error if the action type is not callable
TypeError: __init__() got an unexpected keyword argument 'nb_samples'
Tried:
help(parser.opt_range)
Help on method opt_range in module test_tube.argparse_hopt:
opt_range(*args, **kwargs) method of test_tube.argparse_hopt.HyperOptArgumentParser instance
And:
import test_tube
print(test_tube.__version__)
but module 'test_tube' has no attribute '__version__'
I installed as follows in a Python 3.6.5 environment managed by Anaconda:
(py36) me$ pip install test_tube
Collecting test_tube
Downloading https://files.pythonhosted.org/packages/be/ab/8fcffebc9764945024bf223b1c64525260bacb96f4bc74a973a8dba2e562/test_tube-0.602.tar.gz
Requirement already satisfied: pandas>=0.20.3 in /Users/me/anaconda/envs/py36/lib/python3.6/site-packages (from test_tube)
Collecting numpy>=1.13.3 (from test_tube)
Using cached https://files.pythonhosted.org/packages/8e/75/7a8b7e3c073562563473f2a61bd53e75d0a1f5e2047e576ee61d44113c22/numpy-1.14.3-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
Requirement already satisfied: imageio>=2.3.0 in /Users/me/anaconda/envs/py36/lib/python3.6/site-packages (from test_tube)
Requirement already satisfied: python-dateutil>=2.5.0 in /Users/me/anaconda/envs/py36/lib/python3.6/site-packages (from pandas>=0.20.3->test_tube)
Requirement already satisfied: pytz>=2011k in /Users/me/anaconda/envs/py36/lib/python3.6/site-packages (from pandas>=0.20.3->test_tube)
Requirement already satisfied: six>=1.5 in /Users/me/anaconda/envs/py36/lib/python3.6/site-packages (from python-dateutil>=2.5.0->pandas>=0.20.3->test_tube)
Building wheels for collected packages: test-tube
Running setup.py bdist_wheel for test-tube ... done
Stored in directory: /Users/me/Library/Caches/pip/wheels/aa/68/07/766d78fe06f6803325e2aaffa610df36814ebd2468713a8703
Successfully built test-tube
Installing collected packages: numpy, test-tube
Found existing installation: numpy 1.12.1
Uninstalling numpy-1.12.1:
Successfully uninstalled numpy-1.12.1
Successfully installed numpy-1.14.3 test-tube-0.602
Hi!
Thank you for the library! I'm using it together with pytorch-lightning to search a network's hyperparameters.
Right now the following snippet:
parser = HyperOptArgumentParser()
parser.opt_range('--batch-size', type=int, default=1500, tunable=True, low=16, high=8192, nb_samples=10, log_base=10)
hparams = parser.parse_args()
for trial_hparams in hparams.trials(10):
print(vars(trial_hparams))
will produce real values, though
parser = HyperOptArgumentParser()
parser.opt_range('--batch-size', type=int, default=1500, tunable=True, low=16, high=8192, nb_samples=10)
hparams = parser.parse_args()
for trial_hparams in hparams.trials(10):
print(vars(trial_hparams))
produces int values.
It would be nice to have a feature of sampling in log scale for integer values!
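As a sketch of what that could look like (log_uniform_int is a hypothetical helper, not test-tube API): sample log-uniformly and round back to integers:

import numpy as np

def log_uniform_int(low, high, nb_samples):
    exponents = np.random.uniform(np.log10(low), np.log10(high), nb_samples)
    # round back to ints so small and large batch sizes are equally represented
    return np.round(10 ** exponents).astype(int)

print(log_uniform_int(16, 8192, 10))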
I'm using pytorch-lightning and test_tube at the same time. I try to perform a hyperparameter search using optimize_parallel_gpu, but I see the strange error in the title: ChildProcessError: [Errno 10] No child processes
def main_local(hparams, gpu_ids=None):
    # init module
    # model = SparseNet(hparams)
    model = SparseNet(hparams)
    # most basic trainer, uses good defaults
    trainer = Trainer(
        max_nb_epochs=hparams.max_nb_epochs,
        gpus=gpu_ids,
        distributed_backend=hparams.distributed_backend,
        nb_gpu_nodes=hparams.nodes,
        # optional
        fast_dev_run=hparams.fast_dev_run,
        use_amp=hparams.use_amp,
        amp_level=("O1" if hparams.use_amp else "O0"),
    )
    trainer.fit(model)

...

if __name__ == "__main__":
    ...
    parser = SparseNet.add_model_specific_args(parser)
    # HyperParameter search
    parser.opt_list(
        "--n", default=2000, type=int, tunable=True, options=[2000, 3000, 4000]
    )
    parser.opt_list(
        "--k", default=50, type=int, tunable=True, options=[100, 200, 300, 400]
    )
    parser.opt_list(
        "--batch_size",
        default=32,
        type=int,
        tunable=True,
        options=[32, 64, 128, 256, 512],
    )
    # parse params
    hparams = parser.parse_args()
    # LR for different batch_size
    if hparams.batch_size <= 128:
        hparams.learning_rate = 0.001
    else:
        hparams.learning_rate = 0.002
    # run trials of random search over the hyperparams
    if torch.cuda.is_available():
        hparams.optimize_parallel_gpu(
            main_local, max_nb_trials=20, gpu_ids=["0, 1"]
        )
    else:
        hparams.gpus = None
        hparams.distributed_backend = None
        hparams.optimize_parallel_cpu(main_local, nb_trials=20)
    # main_local(hparams)  # this works
gpu available: True, used: True
VISIBLE GPUS: 0,1
Caught exception in worker thread [Errno 10] No child processes
Traceback (most recent call last):
File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 37, in optimize_parallel_gpu_private
results = train_function(trial_params, gpu_id_set)
File "sparse_trainer.py", line 29, in main_local
trainer.fit(model)
File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 746, in fit
mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model, ))
File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 156, in spawn
error_queue = mp.SimpleQueue()
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/context.py", line 112, in SimpleQueue
return SimpleQueue(ctx=self.get_context())
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/queues.py", line 332, in __init__
self._rlock = ctx.Lock()
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/context.py", line 67, in Lock
return Lock(ctx=self.get_context())
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 162, in __init__
SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 80, in __init__
register(self._semlock.name)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py", line 83, in register
self._send('REGISTER', name)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py", line 90, in _send
self.ensure_running()
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py", line 46, in ensure_running
pid, status = os.waitpid(self._pid, os.WNOHANG)
ChildProcessError: [Errno 10] No child processes
^CTraceback (most recent call last):
File "sparse_trainer.py", line 73, in <module>
Process ForkPoolWorker-2:
Process ForkPoolWorker-1:
Process ForkPoolWorker-4:
main_local, nb_trials=20, trials=hparams.trials(20), gpu_ids=["0, 1"]
File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 361, in optimize_trials_parallel_gpu
Traceback (most recent call last):
results = self.pool.map(optimize_parallel_gpu_private, self.trials)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
Traceback (most recent call last):
File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 29, in optimize_parallel_gpu_private
gpu_id_set = g_gpu_id_q.get(block=True)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 95, in __enter__
return self._semlock.__enter__()
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
KeyboardInterrupt
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 29, in optimize_parallel_gpu_private
gpu_id_set = g_gpu_id_q.get(block=True)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 95, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 651, in get
Traceback (most recent call last):
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 29, in optimize_parallel_gpu_private
gpu_id_set = g_gpu_id_q.get(block=True)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/queues.py", line 94, in get
res = self._recv_bytes()
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
KeyboardInterrupt
self.wait(timeout)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 648, in wait
self._event.wait(timeout)
File "/home/kyoungrok/anaconda3/lib/python3.7/threading.py", line 552, in wait
signaled = self._cond.wait(timeout)
File "/home/kyoungrok/anaconda3/lib/python3.7/threading.py", line 296, in wait
waiter.acquire()
KeyboardInterrupt
When using one GPU, I can obtain nb_trials results. However, when using more than one GPU, fewer results are obtained. For example, when running 8 trials, I only get the version_0 to version_6 directories; the expected version_7 has not been generated.
When I try to run the example tensorflow_example.py from a conda environment, I get a "permission denied" error. It seems the script is trying to write to the /Users folder, and I'm not sure why that's not allowed.
When I try running Python with sudo, there is another issue: sudo changes $PATH, so I end up using a different python, which does not have tensorflow installed. This answer also suggests not running python in sudo mode...
So I guess the real question is how to make os.makedirs() work on my system (Ubuntu 16.04). Found an answer here. While I'm looking into the os.makedirs() permission issue, I thought I would open an issue in case other users run into the same situation.
Here is the full error log:
Here is the full error log:
(deep) yuqiong@yuqiong-G7-7588:/media/yuqiong/DATA/test-tube$ python examples/tensorflow_example.py
Caught exception in worker thread [Errno 13] Permission denied: '/Users'
Caught exception in worker thread [Errno 13] Permission denied: '/Users'
Caught exception in worker thread [Errno 13] Permission denied: '/Users'
Caught exception in worker thread [Errno 13] Permission denied: '/Users'
Traceback (most recent call last):
Traceback (most recent call last):
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/argparse_hopt.py", line 29, in optimize_parallel_gpu_private
results = train_function(trial_params)
File "examples/tensorflow_example.py", line 20, in train
autosave=False,
Traceback (most recent call last):
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/log.py", line 58, in __init__
self.__init_cache_file_if_needed()
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/log.py", line 121, in __init_cache_file_if_needed
os.makedirs(exp_cache_file)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
[Previous line repeated 2 more times]
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 220, in makedirs
mkdir(name, mode)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/argparse_hopt.py", line 29, in optimize_parallel_gpu_private
results = train_function(trial_params)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/argparse_hopt.py", line 29, in optimize_parallel_gpu_private
results = train_function(trial_params)
File "examples/tensorflow_example.py", line 20, in train
autosave=False,
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/log.py", line 58, in __init__
self.__init_cache_file_if_needed()
File "examples/tensorflow_example.py", line 20, in train
autosave=False,
PermissionError: [Errno 13] Permission denied: '/Users'
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/log.py", line 121, in __init_cache_file_if_needed
os.makedirs(exp_cache_file)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/log.py", line 58, in __init__
self.__init_cache_file_if_needed()
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/log.py", line 121, in __init_cache_file_if_needed
os.makedirs(exp_cache_file)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
[Previous line repeated 2 more times]
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 220, in makedirs
mkdir(name, mode)
[Previous line repeated 2 more times]
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 220, in makedirs
mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/Users'
PermissionError: [Errno 13] Permission denied: '/Users'
Traceback (most recent call last):
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/argparse_hopt.py", line 29, in optimize_parallel_gpu_private
results = train_function(trial_params)
File "examples/tensorflow_example.py", line 20, in train
autosave=False,
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/log.py", line 58, in __init__
self.__init_cache_file_if_needed()
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/log.py", line 121, in __init_cache_file_if_needed
os.makedirs(exp_cache_file)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
[Previous line repeated 2 more times]
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 220, in makedirs
mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/Users'
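For anyone else hitting this: the example appears to save under a hard-coded macOS-style /Users path, so pointing the experiment at a user-writable directory sidesteps the permission problem without sudo. A minimal sketch (the name matches the example's logs above; the save_dir is a placeholder):

from test_tube import Experiment

exp = Experiment(
    name='dense_model',
    save_dir='./test_tube_logs',  # any directory the current user can write to
    autosave=False,
)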