
test-tube's Introduction


Test Tube

Log, organize and parallelize hyperparameter search for Deep Learning experiments


Docs

View the docs here


Test Tube is a Python library to track and parallelize hyperparameter search for deep learning and ML experiments. It's framework agnostic and built on top of the Python argparse API for ease of use.

pip install test_tube

Main test-tube uses

Compatible with any Python ML library, such as TensorFlow, Keras, PyTorch, Caffe, Caffe2, Chainer, MXNet, Theano, and scikit-learn


Examples

The Experiment object is a subclass of PyTorch's SummaryWriter.

Log and visualize with Tensorboard

from test_tube import Experiment
import numpy as np
import torch

exp = Experiment('/some/path')
exp.tag({'learning_rate': 0.02, 'layers': 4})

# exp is a subclass of SummaryWriter, so TensorBoard methods are available directly
features = torch.Tensor(100, 784)
labels = list(range(100))            # example metadata for the embedding
images = torch.zeros(100, 28, 28)    # example thumbnails for the embedding
exp.add_embedding(features, metadata=labels, label_img=images.unsqueeze(1))

# simulate training
for n_iter in range(2000):
    exp.log({'testtt': n_iter * np.sin(n_iter)})

# save and close
exp.save()
exp.close()
# visualize with TensorBoard
pip install tensorflow

tensorboard --logdir /some/path

Run grid search on SLURM GPU cluster

from test_tube.hpc import SlurmCluster

# hyperparams comes from a test-tube HyperOptArgumentParser (see the next example)
hyperparams = parser.parse_args()

# init cluster
cluster = SlurmCluster(
    hyperparam_optimizer=hyperparams,
    log_path='/path/to/log/results/to',
    python_cmd='python3'
)

# let the cluster know where to email for a change in job status (ie: complete, fail, etc...)
cluster.notify_job_status(email='[email protected]', on_done=True, on_fail=True)

# set the job options. In this instance, we'll run 20 different models
# each with its own set of hyperparameters giving each one 1 GPU (ie: taking up 20 GPUs)
cluster.per_experiment_nb_gpus = 1
cluster.per_experiment_nb_nodes = 1

# run the models on the cluster
cluster.optimize_parallel_cluster_gpu(train, nb_trials=20, job_name='first_tt_batch', job_display_name='my_batch')   

# we just ran 20 different hyperparameter configurations on 20 GPUs in the HPC cluster!
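For reference, here is a minimal sketch of the train function passed to the cluster above. It assumes the cluster invokes the function once per trial with that trial's hyperparameter namespace; depending on the test-tube version it may also receive extra arguments (such as a cluster handle), which are absorbed with *args here.

def train(hparams, *args):
    # hparams behaves like an argparse.Namespace holding one sampled value per option
    print('learning rate for this trial:', hparams.learning_rate)
    # ... build the model and run the training loop here ...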

Optimize hyperparameters across GPUs

from test_tube import HyperOptArgumentParser

# subclass of argparse
parser = HyperOptArgumentParser(strategy='random_search')
parser.add_argument('--learning_rate', default=0.002, type=float, help='the learning rate')

# let's enable optimizing over the number of layers in the network
parser.opt_list('--nb_layers', default=2, type=int, tunable=True, options=[2, 4, 8])

# and tune the number of units in each layer
parser.opt_range('--neurons', default=50, type=int, tunable=True, low=100, high=800, nb_samples=10)

# compile (because it's argparse underneath)
hparams = parser.parse_args()

# optimize across 4 gpus
# GPUs 2 and 3 run together; GPUs 0 and 1 each run trials on their own
hparams.optimize_parallel_gpu(MyModel.fit, gpu_ids=['1', '2,3', '0'], nb_trials=192, nb_workers=4)

Or... across CPUs

hparams.optimize_parallel_cpu(MyModel.fit, nb_trials=192, nb_workers=12)
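For illustration, a minimal sketch of what the MyModel.fit callback above could look like. This is an assumption based on the tracebacks further down this page: the GPU variant calls the function with (trial_params, gpu_id_set), while the CPU variant passes only trial_params; exact signatures may vary between test-tube versions.

class MyModel:
    @staticmethod
    def fit(trial_params, gpu_id_set=None):
        # trial_params is an argparse-style namespace with one sampled value per option
        print('nb_layers:', trial_params.nb_layers, 'neurons:', trial_params.neurons)
        if gpu_id_set is not None:
            print('assigned GPU ids:', gpu_id_set)
        # ... build, train, and return a metric here ...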

You can also optimize on a log scale to allow better search over magnitudes of hyperparameter values, with a chosen base (disabled by default). Keep in mind that the range you search over must be strictly positive.

from test_tube import HyperOptArgumentParser

# subclass of argparse
parser = HyperOptArgumentParser(strategy='random_search')

# Randomly searches over the (log-transformed) range [100,800).

parser.opt_range('--neurons', default=50, type=int, tunable=True, low=100, high=800, nb_samples=10, log_base=10)


# compile (because it's argparse underneath)
hparams = parser.parse_args()

# run 20 trials of random search over the hyperparams
for hparam_trial in hparams.trials(20):
    train_network(hparam_trial)

Convert your argparse params into searchable params by changing 1 line

import argparse
from test_tube import HyperOptArgumentParser

# these lines are equivalent
parser = argparse.ArgumentParser(description='Process some integers.')
parser = HyperOptArgumentParser(description='Process some integers.', strategy='grid_search')

# do normal argparse stuff
...
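For illustration, the "normal argparse stuff" might look like the following; the argument names are made up for this example.

parser.add_argument('--data_path', type=str, default='./data')
parser.opt_list('--batch_size', default=32, type=int, tunable=True, options=[32, 64, 128])

hparams = parser.parse_args()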

Log images inline with metrics

import imageio

# the key name must contain jpg, png or jpeg for the value to be saved as an image
img = imageio.imread('a.jpg')
exp.log({'test_jpg': img, 'val_err': 0.2})

# saves the image to ../exp/version/media/test_0.jpg
# the csv stores the file path to that image in the corresponding cell

Demos

How to contribute

Feel free to fix bugs and make improvements!

1. Check out the current bugs or feature requests.
2. To work on a bug or feature, head over to our project page and assign yourself the bug.
3. We'll add contributor names periodically as people improve the library!

Bibtex

To cite the framework use:

@misc{Falcon2017,
  author = {Falcon, W.A.},
  title = {Test Tube},
  year = {2017},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/williamfalcon/test-tube}}
}    

License

In addition to the terms outlined in the license, this software is U.S. Patent Pending.

test-tube's People

Contributors

akhti, alok, backpropper, borda, dlfelps, expectopatronum, felix-petersen, jtamir, karanchahal, kvhooreb, oscmansan, schneider-mathias, seantrue, tullie, williamfalcon, zafarali


test-tube's Issues

Consider dropping python 2 support

Anyone using scientific Python these days will be forced into Python 3 anyway, as things like IPython force a migration. I think literally every relevant library that someone using test tube would need supports Python 3. If test tube were 3.5+ only (reasonable, since upgrading from 3.4 -> 3.5 doesn't require breaking anything), we could use type annotations in the codebase.

Implement Hyperband (strategy)

If interested in implementing hyperband (in a framework agnostic way), sign up for it here. Happy to brainstorm implementation design ideas.

Proposal: Independent Local hyperparameter optimization module

Right now HyperOptArgumentParser contains much of the logic for doing a hyperparameter search on a local machine.

https://github.com/williamFalcon/test-tube/blob/master/test_tube/argparse_hopt.py#L259

Why this is not great:

  • This is the opposite of the SLURM code where the HyperOptArgumentParser object is passed into SlurmCluster.
  • Code duplication and entanglement.
  • Hard to test HyperOptArgumentParser independent of the mechanism of deployment.

Proposed change:

Add something like a Local or LocalSystem object that, similar to SlurmCluster, accepts a HyperOptArgumentParser and can be used to optimize hyperparameters locally:

    hyperparams = parser.parse_args()

    # Enable cluster training.
    system = LocalSystem(
        hyperparam_optimizer=hyperparams,
        log_path=hyperparams.log_path,
        python_cmd='python3',
        test_tube_exp_name=hyperparams.test_tube_exp_name
    )
    
    system.max_cpus = 100
    system.max_gpus = 5

    # Each hyperparameter combination will use 200 cpus.
    system.optimize_parallel_cpu(
        # Function to execute:
        train,
        # Number of hyperparameter combinations to search:
        nb_trials=24)

Downsides

  • Probably breaks backward compatibility.

Multiprocessing error running optimize_parallel_gpu with pytorch + pytorch-lightning

I am following the guide to optimize hyperparameters over multiple GPUs: https://towardsdatascience.com/trivial-multi-node-training-with-pytorch-lightning-ff75dfb809bd

However, when I run the hyperparam opt, I get the following error:

RuntimeError: cuda runtime error (3) : initialization error at /pytorch/aten/src/THC/THCGeneral.cpp:54

Based on some reading, it seems to be an issue with initializing CUDA and multiprocessing, with the suggested change of adding multiprocessing.set_start_method('spawn', force=True).

Looking at argparse_hopt.py, I see that that specific line is commented out. When I uncomment it, I get through that error but hit a pickle error:

AttributeError: Can't pickle local object 'HyperOptArgumentParser.optimize_parallel_gpu.<locals>.init'

Looking for suggestions on what to try, thanks!
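For anyone hitting the same CUDA initialization error, here is a minimal sketch of the suggested change applied in user code rather than inside test-tube. It assumes the start method is set before any CUDA work or worker pool is created.

import multiprocessing

if __name__ == '__main__':
    # must run before CUDA is initialized or any pool is created;
    # 'spawn' avoids re-initializing CUDA in forked workers
    multiprocessing.set_start_method('spawn', force=True)
    # ... parse hparams and call hparams.optimize_parallel_gpu(...) as usual ...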

HyperOptArgumentParser explicit argument behaviour

When I first started using the package (great work btw, thanks a lot!) I assumed that the intended behaviour for HyperOptArgumentParser is to only iterate over the hyperparams if they are not specified on the CLI.

In my opinion this would be the convenient designated behaviour, and if people agree I could implement it (currently I'm using a hacky work-around).

So currently if I run python hyperparam.py --arg1 128 and the arg1 is specified as

parser.opt_list('--arg1', options=[128, 256, 512])

It would still iterate over the arg1 options. My desired behaviour would be to only iterate over the options if the argument is not explicitly given.

Currently I use this hack:

class MyArgParser(HyperOptArgumentParser):
    def my_opt_list(self, name, default, dtype, options, **kwargs):
        tunable = (name not in sys.argv)
        self.opt_list(name, default=default, type=dtype, tunable=tunable,
                      options=options,**kwargs)

Which I would be happy to implement if I'm not the only one facing this issue.

ImportError: cannot import name 'HyperOptArgumentParser'

I have a setup on Ubuntu with Python 3.6 and manually installed the libraries below:

bleach==1.5.0
certifi==2016.2.28
cycler==0.10.0
decorator==4.1.2
html5lib==0.9999999
Markdown==2.6.9
matplotlib==2.1.0
networkx==2.0
nltk==3.2.5
numpy==1.13.3
olefile==0.44
pandas==0.21.0
Pillow==4.3.0
protobuf==3.4.0
pyparsing==2.2.0
python-dateutil==2.6.1
pytz==2017.3
PyWavelets==0.5.2
scikit-image==0.13.1
scikit-learn==0.19.1
scipy==1.0.0
six==1.11.0
sklearn==0.0
tensorflow==1.3.0
tensorflow-tensorboard==0.1.8
test-tube==0.46
Werkzeug==0.12.2

I get the error below when running my script. The requirements file pins test-tube version 0.46, but I was not able to install that version, so I used test-tube==0.2 instead. Any suggestions?
from test_tube import HyperOptArgumentParser
ImportError: cannot import name 'HyperOptArgumentParser'

TypeError: __init__() got an unexpected keyword argument 'nb_samples'

Wanted to try out test-tube but I can't even get the following demo script to work:

from test_tube import HyperOptArgumentParser

# subclass of argparse
parser = HyperOptArgumentParser(strategy='random_search')
parser.add_argument('--learning_rate', default=0.002, type=float, help='the learning rate')

# let's enable optimizing over the number of layers in the network
parser.opt_list('--nb_layers', default=2, type=int, tunable=True, options=[2, 4, 8])

# and tune the number of units in each layer
parser.opt_range('--neurons', default=50, type=int, tunable=True, low=100, high=800, nb_samples=10)

# compile (because it's argparse underneath)
hparams = parser.parse_args()

Output:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-37-90bd4dc39a7f> in <module>()
      9 
     10 # and tune the number of units in each layer
---> 11 parser.opt_range('--neurons', default=50, type=int, tunable=True, low=100, high=800, nb_samples=10)
     12 
     13 # compile (because it's argparse underneath)

~/anaconda/envs/py36/lib/python3.6/site-packages/test_tube/argparse_hopt.py in opt_range(self, *args, **kwargs)
    100         log_base = kwargs.pop("log_base", None)
    101 
--> 102         self.add_argument(*args, **kwargs)
    103         arg_name = args[-1]
    104         self.opt_args[arg_name] = OptArg(

~/anaconda/envs/py36/lib/python3.6/site-packages/test_tube/argparse_hopt.py in add_argument(self, *args, **kwargs)
     80 
     81     def add_argument(self, *args, **kwargs):
---> 82         super(HyperOptArgumentParser, self).add_argument(*args, **kwargs)
     83 
     84     def opt_list(self, *args, **kwargs):

~/anaconda/envs/py36/lib/python3.6/argparse.py in add_argument(self, *args, **kwargs)
   1332         if not callable(action_class):
   1333             raise ValueError('unknown action "%s"' % (action_class,))
-> 1334         action = action_class(**kwargs)
   1335 
   1336         # raise an error if the action type is not callable

TypeError: __init__() got an unexpected keyword argument 'nb_samples'

Tried:

help(parser.opt_range)

Help on method opt_range in module test_tube.argparse_hopt:

opt_range(*args, **kwargs) method of test_tube.argparse_hopt.HyperOptArgumentParser instance

And:

import test_tube
print(test_tube.__version__)

but module 'test_tube' has no attribute '__version__'

I installed as follows in a Python 3.6.5 environment managed by Anaconda:

(py36) me$ pip install test_tube
Collecting test_tube
  Downloading https://files.pythonhosted.org/packages/be/ab/8fcffebc9764945024bf223b1c64525260bacb96f4bc74a973a8dba2e562/test_tube-0.602.tar.gz
Requirement already satisfied: pandas>=0.20.3 in /Users/me/anaconda/envs/py36/lib/python3.6/site-packages (from test_tube)
Collecting numpy>=1.13.3 (from test_tube)
  Using cached https://files.pythonhosted.org/packages/8e/75/7a8b7e3c073562563473f2a61bd53e75d0a1f5e2047e576ee61d44113c22/numpy-1.14.3-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
Requirement already satisfied: imageio>=2.3.0 in /Users/me/anaconda/envs/py36/lib/python3.6/site-packages (from test_tube)
Requirement already satisfied: python-dateutil>=2.5.0 in /Users/me/anaconda/envs/py36/lib/python3.6/site-packages (from pandas>=0.20.3->test_tube)
Requirement already satisfied: pytz>=2011k in /Users/me/anaconda/envs/py36/lib/python3.6/site-packages (from pandas>=0.20.3->test_tube)
Requirement already satisfied: six>=1.5 in /Users/me/anaconda/envs/py36/lib/python3.6/site-packages (from python-dateutil>=2.5.0->pandas>=0.20.3->test_tube)
Building wheels for collected packages: test-tube
  Running setup.py bdist_wheel for test-tube ... done
  Stored in directory: /Users/me/Library/Caches/pip/wheels/aa/68/07/766d78fe06f6803325e2aaffa610df36814ebd2468713a8703
Successfully built test-tube
Installing collected packages: numpy, test-tube
  Found existing installation: numpy 1.12.1
    Uninstalling numpy-1.12.1:
      Successfully uninstalled numpy-1.12.1
Successfully installed numpy-1.14.3 test-tube-0.602

Log Argparse hparams with add_hparams in SummaryWriter

Currently the hparams are logged as text in TensorBoard. Could we change this to start using the add_hparams() function in SummaryWriter? This would allow some additional nice views of the hyperparameters, such as the parallel coordinates view.

I think this would be a relatively easy change and I can help out with it if needed.
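For reference, a sketch of what the proposed logging could look like using the underlying torch.utils.tensorboard API directly (the values here are illustrative, not test-tube's current behaviour):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('/some/path')
writer.add_hparams(
    {'learning_rate': 0.02, 'nb_layers': 4},   # hyperparameters for this run
    {'hparam/val_loss': 0.31},                 # metrics to associate with them
)
writer.close()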

Consider aliasing/renaming some API functions

Some function names are a bit verbose. It'd be nice to at least alias the following:

  • add_metric_row -> add (or log) since that carries the same meaning and is a lot shorter
  • add_opt_argument_list -> add_argument_list
  • add_opt_argument_range -> add_argument_range

ipykernel_launcher.py: error: unrecognized arguments: -f /Users/...

Still having trouble running the following demo script from the documentation:

from test_tube import HyperOptArgumentParser

# subclass of argparse
parser = HyperOptArgumentParser(strategy='random_search')
parser.add_argument('--learning_rate', default=0.002, type=float, help='the learning rate')

# let's enable optimizing over the number of layers in the network
parser.opt_list('--nb_layers', default=2, type=int, tunable=True, options=[2, 4, 8])

# and tune the number of units in each layer
parser.opt_range('--neurons', default=50, type=int, tunable=True, low=100, high=800, nb_samples=10)

# compile (because it's argparse underneath)
hparams = parser.parse_args()

Exception:

usage: ipykernel_launcher.py [-h] [--learning_rate LEARNING_RATE]
                             [--nb_layers NB_LAYERS] [--neurons NEURONS]
ipykernel_launcher.py: error: unrecognized arguments: -f /Users/me/Library/Jupyter/runtime/kernel-b655e4ee-01bb-427d-a041-327a8e2836bc.json
An exception has occurred, use %tb to see the full traceback.

SystemExit: 2


/Users/me/anaconda/envs/py36/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2971: UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)

%tb

Traceback details:

---------------------------------------------------------------------------
SystemExit                                Traceback (most recent call last)
<ipython-input-16-90bd4dc39a7f> in <module>()
     12 
     13 # compile (because it's argparse underneath)
---> 14 hparams = parser.parse_args()

~/anaconda/envs/py36/lib/python3.6/site-packages/test_tube/argparse_hopt.py in parse_args(self, args, namespace)
    116     def parse_args(self, args=None, namespace=None):
    117         # call superclass arg first
--> 118         results = super(HyperOptArgumentParser, self).parse_args(args=args, namespace=namespace)
    119 
    120         # extract vals

~/anaconda/envs/py36/lib/python3.6/argparse.py in parse_args(self, args, namespace)
   1731         if argv:
   1732             msg = _('unrecognized arguments: %s')
-> 1733             self.error(msg % ' '.join(argv))
   1734         return args
   1735 

~/anaconda/envs/py36/lib/python3.6/argparse.py in error(self, message)
   2387         self.print_usage(_sys.stderr)
   2388         args = {'prog': self.prog, 'message': message}
-> 2389         self.exit(2, _('%(prog)s: error: %(message)s\n') % args)

~/anaconda/envs/py36/lib/python3.6/argparse.py in exit(self, status, message)
   2374         if message:
   2375             self._print_message(message, _sys.stderr)
-> 2376         _sys.exit(status)
   2377 
   2378     def error(self, message):

SystemExit: 2

> /Users/me/anaconda/envs/py36/lib/python3.6/argparse.py(2376)exit()
   2374         if message:
   2375             self._print_message(message, _sys.stderr)
-> 2376         _sys.exit(status)
   2377 
   2378     def error(self, message):

Is this something to do with the fact that I am trying to run it in a Jupyter notebook?
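One possible workaround, based on standard argparse behaviour rather than anything test-tube specific: the Jupyter kernel injects its own -f argument into sys.argv, so passing an explicit argument list avoids it.

# ignore sys.argv (and the kernel's -f flag) inside a notebook
hparams = parser.parse_args(args=[])

# or supply exactly the arguments you want to parse
hparams = parser.parse_args(args=['--learning_rate', '0.01'])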

Bug: OptArg values

When defining range based arguments, e.g.
parser.opt_range('--neurons', default=50, type=int, tunable=True, low=100, high=800, nb_samples=10)
the calculated values are not correct. In this case only min and max will be used for hyperparameter optimization instead of the requested 10 values, which is caused by a missing assignment of the member variable (line 372).

Furthermore, if the log_base parameter is set, only a single value is determined, causing an exception in __flatten_params (argparse_hopt.py, line 339). As a fix, add nb_samples as an argument to the np.random.uniform function call (line 379).
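For reference, a sketch of the described fix using NumPy directly (illustrative values, not the library's actual code): passing nb_samples as the size argument draws the requested number of samples instead of one.

import numpy as np

low, high, nb_samples = 100, 800, 10
samples = np.random.uniform(low, high, nb_samples)   # array of 10 values in [100, 800)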

Log-scale search?

Often you care about data on the log scale (0.1, 1, 10, 100, etc.) rather than a uniform scale, since the order of magnitude is more important. It'd be nice to have a way to do that by default in test_tube.

SlurmCluster - submit multiple jobs as slurm array

I have a grid search that results in ~300 processes running. I'd like to submit them to Slurm as a Slurm array, so as not to overload the scheduler. Is there any functionality to support that in test-tube? I didn't see anything in the SlurmCluster code to indicate that there is.

SlurmCluster without hyperparameters

I'm attempting to train a PyTorch Lightning model on a Slurm cluster, and the PyTorch Lightning documentation recommends using the SlurmCluster class in this package to automate submission of Slurm scripts. The examples all involve running a hyperparameter scan; however, I would like to train just a single model. My attempt at doing so is as follows:

cluster = SlurmCluster()
[...] (set cluster.per_experiment_nb_cpus, cluster.job_time, etc.)
cluster.optimize_parallel_cluster_gpu(train, nb_trials=1, ...)

However, this fails with:

Traceback (most recent call last):
  File "train.py", line 67, in hydra_main
    train, nb_trials=1, job_name='pl-slurm', job_display_name='pl-slurm')
  File "/global/u2/s/schuya/.local/cori/pytorchv1.5.0-gpu/lib/python3.7/site-packages/test_tube/hpc.py", line 127, in optimize_parallel_cluster_gpu
    enable_auto_resubmit, on_gpu=True)
  File "/global/u2/s/schuya/.local/cori/pytorchv1.5.0-gpu/lib/python3.7/site-packages/test_tube/hpc.py", line 167, in __optimize_parallel_cluster_internal
    if self.is_from_slurm_object:
AttributeError: 'SlurmCluster' object has no attribute 'is_from_slurm_object'

Looking at the code, it seems that SlurmCluster.is_from_slurm_object was never set. This is because I did not pass in a hyperparam_optimizer, as I did not intend to perform a scan. What is the correct way to go about this?

TTDummyFileWriter error with multi processing using add_scalars (tensorboard)

I'm getting this error using PyTorch DDP with TensorBoard's add_scalars (add_scalar works fine). Is there something I can do?

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/pytorch_lightning/trainer/ddp_mixin.py", line 181, in ddp_train
self.run_pretrain_routine(model)
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 471, in run_pretrain_routine
self.train()
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/pytorch_lightning/trainer/train_loop_mixin.py", line 60, in train
self.run_training_epoch()
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/pytorch_lightning/trainer/train_loop_mixin.py", line 114, in run_training_epoch
self.run_evaluation(test=self.testing)
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop_mixin.py", line 130, in run_evaluation
test)
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop_mixin.py", line 74, in evaluate
eval_results = model.validation_end(outputs)
File "/home/userfs/b/brm512/experiments/HomographyNet/lightning_module.py", line 547, in validation_end
self.logger.experiment.add_scalars('losses', {'train loss': self.loss_meter_training.avg, 'val loss':self.loss_meter_validation.avg} , self.epoch_nb)
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/torch/utils/tensorboard/writer.py", line 363, in add_scalars
fw_logdir = self._get_file_writer().get_logdir()
AttributeError: 'TTDummyFileWriter' object has no attribute 'get_logdir'

Write basic tests for HyperOptArgumentParser

Initial tests should include:

  • Testing grid search generation
  • Testing random search generation
  • CPU parallelization
  • GPU parallelization
  • Releasing free GPUs and CPUs continuously so there isn't idle time on either

Issue with _get_file_writer after recent commits

I'm trying out pytorch-lightning and I'm having an issue after commits 1fad1c7 and 3fba70a. When I do

from test_tube import Experiment
exp = Experiment(save_dir=cfg['log_dir'])
trainer = Trainer(experiment=exp, max_nb_epochs=1, train_percent_check=0.01)

the program crashes with

File "/home/user/anaconda3/envs/pytorchenv/lib/python3.7/site-packages/test_tube/log.py", line 504, in _get_file_writer
if self.purge_step is not None:
AttributeError: 'Experiment' object has no attribute 'purge_step'

because there are attributes of Experiment that are never set, including self.purge_step, self.filename_suffix, and self.flush_secs.

example/tensorflow_example.py had an exception (about dir making) when nb_workers > 1

I tried tensorflow_example.py to test the multi-GPU functionality. When setting nb_workers to more than 1, I get the exception below. As a result, I only got nb_trials - 1 tuning results rather than the expected nb_trials. Note that I am using Python 3.6.

Caught exception in worker thread [Errno 17] File exists: 'logs/multigpu/test_tube_data/dense_model/version_0'
Traceback (most recent call last):
File "/home/lchen/.local/lib/python3.6/site-packages/test_tube/argparse_hopt.py", line 30, in optimize_parallel_gpu_private
results = train_function(trial_params)
File "test_tube_multigpu.py", line 22, in train
autosave=False
File "/home/lchen/.local/lib/python3.6/site-packages/test_tube/log.py", line 58, in init
self.__init_cache_file_if_needed()
File "/home/lchen/.local/lib/python3.6/site-packages/test_tube/log.py", line 121, in __init_cache_file_if_needed
os.makedirs(exp_cache_file)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/os.py", line 220, in makedirs
mkdir(name, mode)

Syntax Error in log.py

Hey, I installed the package but while importing it throws up the following error:

File "", line 1, in
File "/home/ece/anaconda2/envs/sameer27/lib/python2.7/site-packages/test_tube/init.py", line 6, in
from .log import Experiment
File "/home/ece/anaconda2/envs/sameer27/lib/python2.7/site-packages/test_tube/log.py", line 366
header = f'''###### {self.name}, version {self.version}\n---\n'''
^
SyntaxError: invalid syntax

Thanks

Support of log_base for opt_range(type=int)

Hi!

Thank you for the library! I'm using it in tandem with pytorch-lightning to search a network's hyperparameters.
Right now the following line:

parser = HyperOptArgumentParser()
parser.opt_range('--batch-size', type=int, default=1500, tunable=True, low=16, high=8192, nb_samples=10, log_base=10)

hparams = parser.parse_args()
for trial_hparams in hparams.trials(10):
    print(vars(trial_hparams))

will produce real values, though

parser = HyperOptArgumentParser()
parser.opt_range('--batch-size', type=int, default=1500, tunable=True, low=16, high=8192, nb_samples=10)

hparams = parser.parse_args()
for trial_hparams in hparams.trials(10):
    print(vars(trial_hparams))

produces int values.

It would be nice to have log-scale sampling for integer values as well!
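A sketch of the requested behaviour, assuming sampling uniformly in log space and then rounding back to integers (illustrative only, not current test-tube code):

import numpy as np

low, high, nb_samples, log_base = 16, 8192, 10, 10
log_low = np.log(low) / np.log(log_base)
log_high = np.log(high) / np.log(log_base)
samples = np.power(log_base, np.random.uniform(log_low, log_high, nb_samples))
int_samples = np.round(samples).astype(int)   # e.g. batch sizes between 16 and 8192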

"ChildProcessError: [Errno 10] No child processes" when doing `optimize_parallel_gpu`

I'm using pytorch-lightning and test_tube at the same time. I try to perform hyperparameter search using optimize_parallel_gpu, but I see the strange error in the title: ChildProcessError: [Errno 10] No child processes

Code

def main_local(hparams, gpu_ids=None):
    # init module
    # model = SparseNet(hparams)
    model = SparseNet(hparams)

    # most basic trainer, uses good defaults
    trainer = Trainer(
        max_nb_epochs=hparams.max_nb_epochs,
        gpus=gpu_ids,
        distributed_backend=hparams.distributed_backend,
        nb_gpu_nodes=hparams.nodes,
        # optional
        fast_dev_run=hparams.fast_dev_run,
        use_amp=hparams.use_amp,
        amp_level=("O1" if hparams.use_amp else "O0"),
    )
    trainer.fit(model)

...
if __name__ == "__main__":
    ...
    parser = SparseNet.add_model_specific_args(parser)

    # HyperParameter search
    parser.opt_list(
        "--n", default=2000, type=int, tunable=True, options=[2000, 3000, 4000]
    )
    parser.opt_list(
        "--k", default=50, type=int, tunable=True, options=[100, 200, 300, 400]
    )
    parser.opt_list(
        "--batch_size",
        default=32,
        type=int,
        tunable=True,
        options=[32, 64, 128, 256, 512],
    )

    # parse params
    hparams = parser.parse_args()

    # LR for different batch_size
    if hparams.batch_size <= 128:
        hparams.learning_rate = 0.001
    else:
        hparams.learning_rate = 0.002

    # run trials of random search over the hyperparams
    if torch.cuda.is_available():
        hparams.optimize_parallel_gpu(
            main_local, max_nb_trials=20, gpu_ids=["0, 1"]
        )
    else:
        hparams.gpus = None
        hparams.distributed_backend = None
        hparams.optimize_parallel_cpu(main_local, nb_trials=20)

    # main_local(hparams) # this works

console log

gpu available: True, used: True
VISIBLE GPUS: 0,1
Caught exception in worker thread [Errno 10] No child processes
Traceback (most recent call last):
  File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 37, in optimize_parallel_gpu_private
    results = train_function(trial_params, gpu_id_set)
  File "sparse_trainer.py", line 29, in main_local
    trainer.fit(model)
  File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 746, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model, ))
  File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 156, in spawn
    error_queue = mp.SimpleQueue()
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/context.py", line 112, in SimpleQueue
    return SimpleQueue(ctx=self.get_context())
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/queues.py", line 332, in __init__
    self._rlock = ctx.Lock()
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/context.py", line 67, in Lock
    return Lock(ctx=self.get_context())
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 162, in __init__
    SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 80, in __init__
    register(self._semlock.name)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py", line 83, in register
    self._send('REGISTER', name)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py", line 90, in _send
    self.ensure_running()
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py", line 46, in ensure_running
    pid, status = os.waitpid(self._pid, os.WNOHANG)
ChildProcessError: [Errno 10] No child processes
gpu available: True, used: True
VISIBLE GPUS: 0,1
Caught exception in worker thread [Errno 10] No child processes
Traceback (most recent call last):
  File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 37, in optimize_parallel_gpu_private
    results = train_function(trial_params, gpu_id_set)
  File "sparse_trainer.py", line 29, in main_local
    trainer.fit(model)
  File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 746, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model, ))
  File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 156, in spawn
    error_queue = mp.SimpleQueue()
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/context.py", line 112, in SimpleQueue
    return SimpleQueue(ctx=self.get_context())
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/queues.py", line 332, in __init__
    self._rlock = ctx.Lock()
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/context.py", line 67, in Lock
    return Lock(ctx=self.get_context())
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 162, in __init__
    SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 80, in __init__
    register(self._semlock.name)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py", line 83, in register
    self._send('REGISTER', name)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py", line 90, in _send
    self.ensure_running()
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py", line 46, in ensure_running
    pid, status = os.waitpid(self._pid, os.WNOHANG)
ChildProcessError: [Errno 10] No child processes
^CTraceback (most recent call last):
  File "sparse_trainer.py", line 73, in <module>
Process ForkPoolWorker-2:
Process ForkPoolWorker-1:
Process ForkPoolWorker-4:
    main_local, nb_trials=20, trials=hparams.trials(20), gpu_ids=["0, 1"]
  File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 361, in optimize_trials_parallel_gpu
Traceback (most recent call last):
    results = self.pool.map(optimize_parallel_gpu_private, self.trials)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
Traceback (most recent call last):
  File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 29, in optimize_parallel_gpu_private
    gpu_id_set = g_gpu_id_q.get(block=True)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/queues.py", line 93, in get
    with self._rlock:
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
KeyboardInterrupt
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 29, in optimize_parallel_gpu_private
    gpu_id_set = g_gpu_id_q.get(block=True)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/queues.py", line 93, in get
    with self._rlock:
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 651, in get
Traceback (most recent call last):
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 29, in optimize_parallel_gpu_private
    gpu_id_set = g_gpu_id_q.get(block=True)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/queues.py", line 94, in get
    res = self._recv_bytes()
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt
    self.wait(timeout)
  File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 648, in wait
    self._event.wait(timeout)
  File "/home/kyoungrok/anaconda3/lib/python3.7/threading.py", line 552, in wait
    signaled = self._cond.wait(timeout)
  File "/home/kyoungrok/anaconda3/lib/python3.7/threading.py", line 296, in wait
    waiter.acquire()
KeyboardInterrupt

SyntaxError: invalid syntax. Can not import the package

I installed this package in my Python 2.7 environment, but when I import it I get this error.
I don't know how to fix it, because it seems all the code is right.

import test_tube
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/andy/.conda/envs/dynet/lib/python2.7/site-packages/test_tube/__init__.py", line 5, in <module>
    from .argparse_hopt import HyperOptArgumentParser
  File "/home/andy/.conda/envs/dynet/lib/python2.7/site-packages/test_tube/argparse_hopt.py", line 38
    def add_opt_argument_list(self, *args, options=None, tunnable=False, **kwargs):

SyntaxError: invalid syntax

TypeError: Unhashable type: 'list'

I am trying to extend my existing argparse setup with the HyperOptArgumentParser class. In my current argparse setup I have two options that are nargs lists.

When I do the following,

parser = HyperOptArgumentParser(strategy='random_search')

....


parser.add_argument('--input_idx', nargs='+', type=int, help='input indices from data', default=[0, 1])
parser.add_argument('--output_idx', nargs='+', type=int, help='output indices from data', default=[2])

I get this error:

Traceback (most recent call last):
  File "train_process.py", line 11, in <module>
    args = parser.parse_args()
  File "/home/srivathsa/miniconda3/envs/py35gad/lib/python3.6/site-packages/test_tube/argparse_hopt.py", line 237, in parse_args
    results = self.__parse_args(args, namespace)
  File "/home/srivathsa/miniconda3/envs/py35gad/lib/python3.6/site-packages/test_tube/argparse_hopt.py", line 158, in __parse_args
    args, argv = self.__whitelist_cluster_commands(args, argv)
  File "/home/srivathsa/miniconda3/envs/py35gad/lib/python3.6/site-packages/test_tube/argparse_hopt.py", line 196, in __whitelist_cluster_commands
    all_values.add(v)
TypeError: unhashable type: 'list'

How can I fix this?

Examples of visualizations

Thanks for sharing!
Could you show some examples such as plots or something?
I'm interested in using your tool.

Permission denied when running examples, possibly because of os.makedirs()

When I try to run the example examples/tensorflow_example.py from a conda environment, I get a "permission denied" error. It seems the script is trying to write to the /Users folder, and I'm not sure why that's not allowed.

When I try running Python with sudo, there is another issue: sudo changes $PATH, so a different Python is used, one which does not have TensorFlow installed. This answer also suggests not running Python with sudo...

So I guess the real question is how to make os.makedirs() work on my system (Ubuntu 16.04). Found an answer here. While I'm looking into the os.makedirs() permission issue, I thought I'd open an issue in case other users run into the same situation.

Here is the full error log:

(deep) yuqiong@yuqiong-G7-7588:/media/yuqiong/DATA/test-tube$ python examples/tensorflow_example.py
Caught exception in worker thread [Errno 13] Permission denied: '/Users'
Caught exception in worker thread [Errno 13] Permission denied: '/Users'
Caught exception in worker thread [Errno 13] Permission denied: '/Users'
Caught exception in worker thread [Errno 13] Permission denied: '/Users'
Traceback (most recent call last):
Traceback (most recent call last):
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/argparse_hopt.py", line 29, in optimize_parallel_gpu_private
    results = train_function(trial_params)
  File "examples/tensorflow_example.py", line 20, in train
    autosave=False,
Traceback (most recent call last):
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/log.py", line 58, in __init__
    self.__init_cache_file_if_needed()
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/log.py", line 121, in __init_cache_file_if_needed
    os.makedirs(exp_cache_file)
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  [Previous line repeated 2 more times]
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/argparse_hopt.py", line 29, in optimize_parallel_gpu_private
    results = train_function(trial_params)
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/argparse_hopt.py", line 29, in optimize_parallel_gpu_private
    results = train_function(trial_params)
  File "examples/tensorflow_example.py", line 20, in train
    autosave=False,
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/log.py", line 58, in __init__
    self.__init_cache_file_if_needed()
  File "examples/tensorflow_example.py", line 20, in train
    autosave=False,
PermissionError: [Errno 13] Permission denied: '/Users'
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/log.py", line 121, in __init_cache_file_if_needed
    os.makedirs(exp_cache_file)
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/log.py", line 58, in __init__
    self.__init_cache_file_if_needed()
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/log.py", line 121, in __init_cache_file_if_needed
    os.makedirs(exp_cache_file)
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  [Previous line repeated 2 more times]
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
  [Previous line repeated 2 more times]
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/Users'
PermissionError: [Errno 13] Permission denied: '/Users'
Traceback (most recent call last):
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/argparse_hopt.py", line 29, in optimize_parallel_gpu_private
    results = train_function(trial_params)
  File "examples/tensorflow_example.py", line 20, in train
    autosave=False,
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/log.py", line 58, in __init__
    self.__init_cache_file_if_needed()
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/log.py", line 121, in __init_cache_file_if_needed
    os.makedirs(exp_cache_file)
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  [Previous line repeated 2 more times]
  File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/Users'
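A possible workaround, assuming the example script hard-codes a path under /Users (this is a sketch of user-side code, not a change in test-tube itself): point the Experiment at a directory the current user can write to.

from test_tube import Experiment

# use a writable location instead of the hard-coded /Users path
exp = Experiment(name='dense_model', save_dir='./test_tube_logs', autosave=False)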

No auto-resubmit

When using SlurmCluster.optimize_parallel_cluster_gpu, is there a way to turn off the auto-resubmit for continuation? Would simply setting cluster.minutes_to_checkpoint_before_walltime = 0 do the trick?

Wrong directory for logging

self.file_writer = FileWriter(self.save_dir, self.max_queue,

This line makes the FileWriter save logs to save_dir rather than log_dir, which leaves the tf folder empty.

This may be related to the new PyTorch update, since I am using v1.2.

