williamfalcon / test-tube
Python library to easily log experiments and parallelize hyperparameter search for neural networks
License: MIT License
Initial tests should include:
I'm attempting to train a PyTorch Lightning model on a Slurm cluster, and the PyTorch Lightning documentation recommends using the SlurmCluster class in this package to automate submission of Slurm scripts. The examples all involve running a hyperparameter scan; however, I would like to train just a single model. My attempt at doing so is as follows:
cluster = SlurmCluster()
[...] (set cluster.per_experiment_nb_cpus, cluster.job_time, etc.)
cluster.optimize_parallel_cluster_gpu(train, nb_trials=1, ...)
However, this fails with:
Traceback (most recent call last):
File "train.py", line 67, in hydra_main
train, nb_trials=1, job_name='pl-slurm', job_display_name='pl-slurm')
File "/global/u2/s/schuya/.local/cori/pytorchv1.5.0-gpu/lib/python3.7/site-packages/test_tube/hpc.py", line 127, in optimize_parallel_cluster_gpu
enable_auto_resubmit, on_gpu=True)
File "/global/u2/s/schuya/.local/cori/pytorchv1.5.0-gpu/lib/python3.7/site-packages/test_tube/hpc.py", line 167, in __optimize_parallel_cluster_internal
if self.is_from_slurm_object:
AttributeError: 'SlurmCluster' object has no attribute 'is_from_slurm_object'
Looking at the code, it seems that SlurmCluster.is_from_slurm_object was never set. This is because I did not pass in a hyperparam_optimizer, as I did not intend to perform a scan. What is the correct way to go about this?
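For reference, a sketch of one possible workaround under the current API: build a HyperOptArgumentParser with no tunable arguments and pass the parsed namespace as hyperparam_optimizer, so that nb_trials=1 runs exactly one configuration (the log_path below is a placeholder, and train is your existing training function):

from test_tube import HyperOptArgumentParser
from test_tube.hpc import SlurmCluster

# no opt_list/opt_range calls, so nothing is tunable and one trial = one model
parser = HyperOptArgumentParser(strategy='grid_search')
parser.add_argument('--learning_rate', default=0.002, type=float)
hparams = parser.parse_args()

cluster = SlurmCluster(
    hyperparam_optimizer=hparams,  # per the issue above, this is what sets is_from_slurm_object
    log_path='/path/to/logs',      # placeholder
)
# [...] (set cluster.per_experiment_nb_cpus, cluster.job_time, etc.)
cluster.optimize_parallel_cluster_gpu(train, nb_trials=1, job_name='pl-slurm')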
I'm getting this error using PyTorch DDP for TensorBoard's add_scalars (add_scalar works fine). Is there something I can do?
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/pytorch_lightning/trainer/ddp_mixin.py", line 181, in ddp_train
self.run_pretrain_routine(model)
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 471, in run_pretrain_routine
self.train()
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/pytorch_lightning/trainer/train_loop_mixin.py", line 60, in train
self.run_training_epoch()
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/pytorch_lightning/trainer/train_loop_mixin.py", line 114, in run_training_epoch
self.run_evaluation(test=self.testing)
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop_mixin.py", line 130, in run_evaluation
test)
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop_mixin.py", line 74, in evaluate
eval_results = model.validation_end(outputs)
File "/home/userfs/b/brm512/experiments/HomographyNet/lightning_module.py", line 547, in validation_end
self.logger.experiment.add_scalars('losses', {'train loss': self.loss_meter_training.avg, 'val loss':self.loss_meter_validation.avg} , self.epoch_nb)
File "/scratch/staff/brm512/anaconda3/envs/ln1/lib/python3.7/site-packages/torch/utils/tensorboard/writer.py", line 363, in add_scalars
fw_logdir = self._get_file_writer().get_logdir()
AttributeError: 'TTDummyFileWriter' object has no attribute 'get_logdir'
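As a hedged workaround sketch (not test-tube API): under DDP only one process owns a real TensorBoard writer, and the other ranks get the dummy writer that raised above, so guarding the add_scalars call by rank avoids the crash:

import torch.distributed as dist

def is_rank_zero():
    # only rank 0 has a real file writer under DDP; the other ranks receive
    # test-tube's TTDummyFileWriter, which has no get_logdir
    return not dist.is_available() or not dist.is_initialized() or dist.get_rank() == 0

# hypothetical guard inside validation_end:
# if is_rank_zero():
#     self.logger.experiment.add_scalars('losses', {...}, self.epoch_nb)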
The parameter tunnable should be spelled tunable.
Any plans to support Joseph Redmon's Darknet framework (https://pjreddie.com/darknet/) in the near future?
I tried to test this package and I initialized it like this:
exp = Experiment(name='word2vec',
debug=False,
save_dir='out/',
create_git_tag=True)
What went wrong when I wrote this:
exp.tag({'learning_rate': 0.002, 'nb_layers': 2})
Often you care about data on the log scale (.1, 1, 10, 100, etc.) rather than a uniform scale, since the order of magnitude is more important. It'd be nice to have a way to do that by default in test_tube.
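For illustration, a minimal numpy sketch of log-uniform sampling (not test-tube API): sample uniformly in exponent space and map back, so each order of magnitude is equally likely:

import numpy as np

low, high = 0.1, 100.0
# uniform in log10 space, then mapped back: .1-1, 1-10 and 10-100 are equally likely
samples = 10 ** np.random.uniform(np.log10(low), np.log10(high), size=5)
print(samples)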
Anyone using scientific python these days will be forced into python 3 only anyway as things like ipython force a migration. I think literally every relevant library that someone using test tube would need supports python 3 anyway. If test tube was 3.5+ only (reasonable since upgrading from 3.4 -> 3.5 doesn't require breaking anything), we could use type annotations in the codebase.
If interested in implementing hyperband (in a framework agnostic way), sign up for it here. Happy to brainstorm implementation design ideas.
I'm trying out pytorch-lightning and I'm having an issue after commits 1fad1c7 and 3fba70a. When I do
from test_tube import Experiment
exp = Experiment(save_dir=cfg['log_dir'])
trainer = Trainer(experiment=exp, max_nb_epochs=1, train_percent_check=0.01)
the program crashes with
File "/home/user/anaconda3/envs/pytorchenv/lib/python3.7/site-packages/test_tube/log.py", line 504, in _get_file_writer
if self.purge_step is not None:
AttributeError: 'Experiment' object has no attribute 'purge_step'
because there are four attributes of Experiment that are never set, among them self.purge_step, self.filename_suffix, and self.flush_secs.
Line 516 in 3d4ce06 makes the EventWriter save logs to save_dir rather than log_dir, which leaves the tf folder empty. This may be related to the new PyTorch update, since I am using v1.2.
I am following the guide to optimize hyperparameters over multiple GPUs: https://towardsdatascience.com/trivial-multi-node-training-with-pytorch-lightning-ff75dfb809bd
However, when I run the hyperparam opt, I get the following error:
RuntimeError: cuda runtime error (3) : initialization error at /pytorch/aten/src/THC/THCGeneral.cpp:54
Based on some reading, it seems to be an issue with initializing CUDA and multiprocessing, with the suggested change of adding multiprocessing.set_start_method('spawn', force=True). Looking at argparse_hopt.py, I see that that specific line is commented out. When I uncomment it, I get through that error but hit a pickle error:
AttributeError: Can't pickle local object 'HyperOptArgumentParser.optimize_parallel_gpu.<locals>.init'
Looking for suggestions on what to try, thanks!
Sometimes, there's a chance test-tube will try to create an experiment version which already exists. Need to add a small delay to avoid the race condition.
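A sketch of the kind of fix meant here (names hypothetical, not the actual patch): retry directory creation with a small random delay, and let exist_ok absorb the case where another worker wins the race:

import os
import random
import time

def makedirs_with_retry(path, attempts=3):
    for _ in range(attempts):
        try:
            # exist_ok=True tolerates a concurrent worker creating the dir first
            os.makedirs(path, exist_ok=True)
            return
        except OSError:
            # the "small delay": jitter so colliding workers spread out
            time.sleep(random.uniform(0.05, 0.25))
    os.makedirs(path, exist_ok=True)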
I tried the tensorflow_example.py to test the function of using multiple GPUs. When setting nb_workers to more than 1, I hit the exception below, and as a result I got only nb_trials - 1 tuning results rather than the expected nb_trials. Note that I am using Python 3.6.
Caught exception in worker thread [Errno 17] File exists: 'logs/multigpu/test_tube_data/dense_model/version_0'
Traceback (most recent call last):
File "/home/lchen/.local/lib/python3.6/site-packages/test_tube/argparse_hopt.py", line 30, in optimize_parallel_gpu_private
results = train_function(trial_params)
File "test_tube_multigpu.py", line 22, in train
autosave=False
File "/home/lchen/.local/lib/python3.6/site-packages/test_tube/log.py", line 58, in init
self.__init_cache_file_if_needed()
File "/home/lchen/.local/lib/python3.6/site-packages/test_tube/log.py", line 121, in __init_cache_file_if_needed
os.makedirs(exp_cache_file)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/os.py", line 220, in makedirs
mkdir(name, mode)
Pandas is there just for the csv writing
Thanks for sharing!
Could you show some examples, such as plots? I'm interested in using your tool.
When using SlurmCluster.optimize_parallel_cluster_gpu, is there a way to turn off the auto-resubmit for continuation? Would simply setting cluster.minutes_to_checkpoint_before_walltime = 0 do the trick?
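Judging by the traceback in an earlier issue above, optimize_parallel_cluster_gpu forwards an enable_auto_resubmit flag internally, so the following may be the cleaner switch (hedged, inferred from that call signature rather than documented):

cluster.optimize_parallel_cluster_gpu(
    train,
    nb_trials=1,
    job_name='my-job',
    enable_auto_resubmit=False,  # assumption: disables the walltime auto-resubmission
)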
I usually use python fire (https://github.com/google/python-fire); it creates the parser arguments by default by looking at the function to be called. Is there any possibility of integrating this into the existing framework?
Hi,
What is the best way to integrate with scikit-learn?
Still having trouble running the following demo script from the documentation:
from test_tube import HyperOptArgumentParser
# subclass of argparse
parser = HyperOptArgumentParser(strategy='random_search')
parser.add_argument('--learning_rate', default=0.002, type=float, help='the learning rate')
# let's enable optimizing over the number of layers in the network
parser.opt_list('--nb_layers', default=2, type=int, tunable=True, options=[2, 4, 8])
# and tune the number of units in each layer
parser.opt_range('--neurons', default=50, type=int, tunable=True, low=100, high=800, nb_samples=10)
# compile (because it's argparse underneath)
hparams = parser.parse_args()
Exception:
usage: ipykernel_launcher.py [-h] [--learning_rate LEARNING_RATE]
[--nb_layers NB_LAYERS] [--neurons NEURONS]
ipykernel_launcher.py: error: unrecognized arguments: -f /Users/me/Library/Jupyter/runtime/kernel-b655e4ee-01bb-427d-a041-327a8e2836bc.json
An exception has occurred, use %tb to see the full traceback.
SystemExit: 2
/Users/me/anaconda/envs/py36/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2971: UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
%tb
Traceback details:
---------------------------------------------------------------------------
SystemExit Traceback (most recent call last)
<ipython-input-16-90bd4dc39a7f> in <module>()
12
13 # compile (because it's argparse underneath)
---> 14 hparams = parser.parse_args()
~/anaconda/envs/py36/lib/python3.6/site-packages/test_tube/argparse_hopt.py in parse_args(self, args, namespace)
116 def parse_args(self, args=None, namespace=None):
117 # call superclass arg first
--> 118 results = super(HyperOptArgumentParser, self).parse_args(args=args, namespace=namespace)
119
120 # extract vals
~/anaconda/envs/py36/lib/python3.6/argparse.py in parse_args(self, args, namespace)
1731 if argv:
1732 msg = _('unrecognized arguments: %s')
-> 1733 self.error(msg % ' '.join(argv))
1734 return args
1735
~/anaconda/envs/py36/lib/python3.6/argparse.py in error(self, message)
2387 self.print_usage(_sys.stderr)
2388 args = {'prog': self.prog, 'message': message}
-> 2389 self.exit(2, _('%(prog)s: error: %(message)s\n') % args)
~/anaconda/envs/py36/lib/python3.6/argparse.py in exit(self, status, message)
2374 if message:
2375 self._print_message(message, _sys.stderr)
-> 2376 _sys.exit(status)
2377
2378 def error(self, message):
SystemExit: 2
> /Users/me/anaconda/envs/py36/lib/python3.6/argparse.py(2376)exit()
2374 if message:
2375 self._print_message(message, _sys.stderr)
-> 2376 _sys.exit(status)
2377
2378 def error(self, message):
Is this something to do with the fact that I am trying to run it in a Jupyter notebook?
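Most likely, yes: the Jupyter kernel injects a -f argument that the parser rejects. A common general argparse workaround (not specific to test-tube) is to pass an explicit argument list, which the parse_args signature in the traceback above accepts:

# bypass the kernel's injected -f flag by parsing an explicit (empty) list
hparams = parser.parse_args(args=[])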
Hi,
What is the best way to get a CSV with a summary of all the parameters and the final results?
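Not an official API, but a minimal pandas sketch assuming test-tube's default on-disk layout of one directory per version (the path pattern and the metrics.csv file name are assumptions; the save_dir and name match the word2vec example above):

import glob
import pandas as pd

frames = []
for path in glob.glob('out/word2vec/version_*/metrics.csv'):
    df = pd.read_csv(path)
    df['version'] = path.split('/')[-2]  # tag rows with their version dir
    frames.append(df)

summary = pd.concat(frames, ignore_index=True)
summary.to_csv('summary.csv', index=False)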
Right now HyperOptArgumentParser contains much of the logic for doing a hyperparameter search on a local machine.
https://github.com/williamFalcon/test-tube/blob/master/test_tube/argparse_hopt.py#L259
Why this is not great:
- A HyperOptArgumentParser object is passed into SlurmCluster.
- HyperOptArgumentParser should be independent of the mechanism of deployment.
Proposed change:
Have something like a Local or LocalSystem object that, similar to SlurmCluster, accepts a HyperOptArgumentParser and can be used to optimize hyperparameters locally:
hyperparams = parser.parse_args()

# Enable cluster training.
system = LocalSystem(
    hyperparam_optimizer=hyperparams,
    log_path=hyperparams.log_path,
    python_cmd='python3',
    test_tube_exp_name=hyperparams.test_tube_exp_name
)
system.max_cpus = 100
system.max_gpus = 5

# Each hyperparameter combination will use 200 cpus.
system.optimize_parallel_cpu(
    # Function to execute:
    train,
    # Number of hyperparameter combinations to search:
    nb_trials=24)
Downsides
Is it possible to install test_tube without pulling in torch? I am building a Docker image with TensorFlow and I do not want it to pull in torch.
Thanks
I am trying to extend my existing argparse with the HyperOptArgumentParser class. In my current argparse I have two options which are nargs lists.
When I do the following,
parser = HyperOptArgumentParser(strategy='random_search')
....
parser.add_argument('--input_idx', nargs='+', type=int, help='input indices from data', default=[0, 1])
parser.add_argument('--output_idx', nargs='+', type=int, help='output indices from data', default=[2])
I get this error:
Traceback (most recent call last):
File "train_process.py", line 11, in <module>
args = parser.parse_args()
File "/home/srivathsa/miniconda3/envs/py35gad/lib/python3.6/site-packages/test_tube/argparse_hopt.py", line 237, in parse_args
results = self.__parse_args(args, namespace)
File "/home/srivathsa/miniconda3/envs/py35gad/lib/python3.6/site-packages/test_tube/argparse_hopt.py", line 158, in __parse_args
args, argv = self.__whitelist_cluster_commands(args, argv)
File "/home/srivathsa/miniconda3/envs/py35gad/lib/python3.6/site-packages/test_tube/argparse_hopt.py", line 196, in __whitelist_cluster_commands
all_values.add(v)
TypeError: unhashable type: 'list'
How can I fix this?
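Until the whitelist handles unhashable defaults, one workaround sketch is to keep the parser values as hashable comma-separated strings and split them after parsing (this works around test-tube rather than fixing it):

parser.add_argument('--input_idx', type=str, default='0,1',
                    help='comma-separated input indices from data')
parser.add_argument('--output_idx', type=str, default='2',
                    help='comma-separated output indices from data')

hparams = parser.parse_args()
# convert outside the parser, so the values test-tube hashes stay plain strings
input_idx = [int(v) for v in hparams.input_idx.split(',')]
output_idx = [int(v) for v in hparams.output_idx.split(',')]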
Add Tensorboardx integration
(https://github.com/lanpa/tensorboard-pytorch)
Currently the hparams are logged as text in TensorBoard. Could we change this to use the add_hparams() function in SummaryWriter? This would allow some additional nice views of the hyperparameters, such as the parallel coordinates view.
I think this would be a relatively easy change and I can help out with it if needed.
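For reference, a minimal sketch of what the logging side could look like, using torch.utils.tensorboard directly rather than test-tube's current text logging (the log path and metric names are placeholders):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('logs/hparams_demo')
# hparam_dict holds the hyperparameters, metric_dict the resulting metrics;
# TensorBoard's HParams plugin then renders parallel-coordinates views of them
writer.add_hparams(
    {'learning_rate': 0.002, 'nb_layers': 2},
    {'hparam/val_loss': 0.45},
)
writer.close()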
I had a setup on Ubuntu with Python version 3.6 and manually installed the libraries below:
bleach==1.5.0
certifi==2016.2.28
cycler==0.10.0
decorator==4.1.2
html5lib==0.9999999
Markdown==2.6.9
matplotlib==2.1.0
networkx==2.0
nltk==3.2.5
numpy==1.13.3
olefile==0.44
pandas==0.21.0
Pillow==4.3.0
protobuf==3.4.0
pyparsing==2.2.0
python-dateutil==2.6.1
pytz==2017.3
PyWavelets==0.5.2
scikit-image==0.13.1
scikit-learn==0.19.1
scipy==1.0.0
six==1.11.0
sklearn==0.0
tensorflow==1.3.0
tensorflow-tensorboard==0.1.8
test-tube==0.46
Werkzeug==0.12.2
I face an issue while running the script and get the error below, so please advise. The requirements pin test-tube version 0.46, but I was not able to install that version, so I used test-tube==0.2:
from test_tube import HyperOptArgumentParser
ImportError: cannot import name 'HyperOptArgumentParser'
Hey, I installed the package but while importing it throws up the following error:
File "", line 1, in
File "/home/ece/anaconda2/envs/sameer27/lib/python2.7/site-packages/test_tube/init.py", line 6, in
from .log import Experiment
File "/home/ece/anaconda2/envs/sameer27/lib/python2.7/site-packages/test_tube/log.py", line 366
header = f'''###### {self.name}, version {self.version}\n---\n'''
^
SyntaxError: invalid syntax
Thanks
I have a grid search that results in ~300 processes running. I'd like to submit them to slurm as a slurm array, so as not to overload the scheduler. Is there any functionality to support that in test-tube? I didn't see anything in the SlurmCluster code to indicate that it is.
Some function names are a bit verbose. It'd be nice to at least alias the following:
add_metric_row -> add (or log), since that carries the same meaning and is a lot shorter
add_opt_argument_list -> add_argument_list
add_opt_argument_range -> add_argument_range
I installed this package in my py2.7 environment, but after I import it I get this error. I don't know how to fix it, because it seems all the code is right.
import test_tube
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/andy/.conda/envs/dynet/lib/python2.7/site-packages/test_tube/__init__.py", line 5, in <module>
from .argparse_hopt import HyperOptArgumentParser
File "/home/andy/.conda/envs/dynet/lib/python2.7/site-packages/test_tube/argparse_hopt.py", line 38
def add_opt_argument_list(self, *args, options=None, tunnable=False, **kwargs):
SyntaxError: invalid syntax
Hi! Thanks a lot for this neat package! :)
I was wondering whether there is a way to get the global_step in the metric.csv file? It is sent to Tensorboard but not added to self.metric in the Experiment.log method.
Cheers,
Emile
What do you think about making the nb_trials param default to None? I've gotten a few confused questions about how many to use when using grid search.
@zafarali
https://github.com/williamFalcon/test-tube/blob/master/test_tube/argparse_hopt.py#L262
When defining range-based arguments, e.g.
parser.opt_range('--neurons', default=50, type=int, tunable=True, low=100, high=800, nb_samples=10)
the calculated values are not correct. In this case only min and max will be used for hyperparameter optimization instead of the requested 10 parameters, which is caused by a missing assignment of the member variable (line 372).
Furthermore, if the log_base parameter is set, only a single value is determined, causing an exception in __flatten_params (argparse_hopt.py, line 339). As a fix, add nb_samples as an argument to the np.random.uniform function call (line 379).
When I first started using the package (great work btw, thanks a lot!) I assumed that the intended behaviour for HyperOptArgumentParser is to only iterate over the hyperparams if they are not specified in the CLI. In my opinion this would be the convenient designated behaviour, and if people agree I could implement it (currently I'm using a hacky work-around).
So currently, if I run python hyperparam.py --arg1 128 and arg1 is specified as
parser.opt_list('--arg1', options=[128, 256, 512])
it would still iterate over the arg1 options. My desired behaviour would be to only iterate over the options if the argument is not explicitly given.
Currently I use this hack:
import sys

class MyArgParser(HyperOptArgumentParser):
    def my_opt_list(self, name, default, dtype, options, **kwargs):
        # only tune the argument when it was not given explicitly on the CLI
        tunable = (name not in sys.argv)
        self.opt_list(name, default=default, type=dtype, tunable=tunable,
                      options=options, **kwargs)
Which I would be happy to implement if I'm not the only one facing this issue.
Add SMAC strategy
(http://www.cs.ubc.ca/labs/beta/Projects/SMAC/)
Wanted to try out test-tube but I can't even get the following demo script to work:
from test_tube import HyperOptArgumentParser
# subclass of argparse
parser = HyperOptArgumentParser(strategy='random_search')
parser.add_argument('--learning_rate', default=0.002, type=float, help='the learning rate')
# let's enable optimizing over the number of layers in the network
parser.opt_list('--nb_layers', default=2, type=int, tunable=True, options=[2, 4, 8])
# and tune the number of units in each layer
parser.opt_range('--neurons', default=50, type=int, tunable=True, low=100, high=800, nb_samples=10)
# compile (because it's argparse underneath)
hparams = parser.parse_args()
Output:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-37-90bd4dc39a7f> in <module>()
9
10 # and tune the number of units in each layer
---> 11 parser.opt_range('--neurons', default=50, type=int, tunable=True, low=100, high=800, nb_samples=10)
12
13 # compile (because it's argparse underneath)
~/anaconda/envs/py36/lib/python3.6/site-packages/test_tube/argparse_hopt.py in opt_range(self, *args, **kwargs)
100 log_base = kwargs.pop("log_base", None)
101
--> 102 self.add_argument(*args, **kwargs)
103 arg_name = args[-1]
104 self.opt_args[arg_name] = OptArg(
~/anaconda/envs/py36/lib/python3.6/site-packages/test_tube/argparse_hopt.py in add_argument(self, *args, **kwargs)
80
81 def add_argument(self, *args, **kwargs):
---> 82 super(HyperOptArgumentParser, self).add_argument(*args, **kwargs)
83
84 def opt_list(self, *args, **kwargs):
~/anaconda/envs/py36/lib/python3.6/argparse.py in add_argument(self, *args, **kwargs)
1332 if not callable(action_class):
1333 raise ValueError('unknown action "%s"' % (action_class,))
-> 1334 action = action_class(**kwargs)
1335
1336 # raise an error if the action type is not callable
TypeError: __init__() got an unexpected keyword argument 'nb_samples'
Tried:
help(parser.opt_range)
Help on method opt_range in module test_tube.argparse_hopt:
opt_range(*args, **kwargs) method of test_tube.argparse_hopt.HyperOptArgumentParser instance
And:
import test_tube
print(test_tube.__version__)
but module 'test_tube' has no attribute '__version__'
I installed as follows in a Python 3.6.5 environment managed by Anaconda:
(py36) me$ pip install test_tube
Collecting test_tube
Downloading https://files.pythonhosted.org/packages/be/ab/8fcffebc9764945024bf223b1c64525260bacb96f4bc74a973a8dba2e562/test_tube-0.602.tar.gz
Requirement already satisfied: pandas>=0.20.3 in /Users/me/anaconda/envs/py36/lib/python3.6/site-packages (from test_tube)
Collecting numpy>=1.13.3 (from test_tube)
Using cached https://files.pythonhosted.org/packages/8e/75/7a8b7e3c073562563473f2a61bd53e75d0a1f5e2047e576ee61d44113c22/numpy-1.14.3-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
Requirement already satisfied: imageio>=2.3.0 in /Users/me/anaconda/envs/py36/lib/python3.6/site-packages (from test_tube)
Requirement already satisfied: python-dateutil>=2.5.0 in /Users/me/anaconda/envs/py36/lib/python3.6/site-packages (from pandas>=0.20.3->test_tube)
Requirement already satisfied: pytz>=2011k in /Users/me/anaconda/envs/py36/lib/python3.6/site-packages (from pandas>=0.20.3->test_tube)
Requirement already satisfied: six>=1.5 in /Users/me/anaconda/envs/py36/lib/python3.6/site-packages (from python-dateutil>=2.5.0->pandas>=0.20.3->test_tube)
Building wheels for collected packages: test-tube
Running setup.py bdist_wheel for test-tube ... done
Stored in directory: /Users/me/Library/Caches/pip/wheels/aa/68/07/766d78fe06f6803325e2aaffa610df36814ebd2468713a8703
Successfully built test-tube
Installing collected packages: numpy, test-tube
Found existing installation: numpy 1.12.1
Uninstalling numpy-1.12.1:
Successfully uninstalled numpy-1.12.1
Successfully installed numpy-1.14.3 test-tube-0.602
Hi!
Thank you for the library! I'm using it together with pytorch-lightning to search a network's hyperparameters.
Right now the following snippet:
parser = HyperOptArgumentParser()
parser.opt_range('--batch-size', type=int, default=1500, tunable=True, low=16, high=8192, nb_samples=10, log_base=10)
hparams = parser.parse_args()
for trial_hparams in hparams.trials(10):
print(vars(trial_hparams))
will produce real values, though
parser = HyperOptArgumentParser()
parser.opt_range('--batch-size', type=int, default=1500, tunable=True, low=16, high=8192, nb_samples=10)
hparams = parser.parse_args()
for trial_hparams in hparams.trials(10):
print(vars(trial_hparams))
produces int values.
It would be nice to have a feature of sampling in log scale for integer values!
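As a sketch of what that could look like (log_uniform_int is a hypothetical helper, not test-tube API): sample log-uniformly and round back to integers:

import numpy as np

def log_uniform_int(low, high, nb_samples):
    exponents = np.random.uniform(np.log10(low), np.log10(high), nb_samples)
    # round back to ints so small and large batch sizes are equally represented
    return np.round(10 ** exponents).astype(int)

print(log_uniform_int(16, 8192, 10))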
I'm using pytorch-lightning and test_tube at the same time. I try to perform a hyperparameter search using optimize_parallel_gpu, but I see the strange error in the title: ChildProcessError: [Errno 10] No child processes
def main_local(hparams, gpu_ids=None):
    # init module
    # model = SparseNet(hparams)
    model = SparseNet(hparams)
    # most basic trainer, uses good defaults
    trainer = Trainer(
        max_nb_epochs=hparams.max_nb_epochs,
        gpus=gpu_ids,
        distributed_backend=hparams.distributed_backend,
        nb_gpu_nodes=hparams.nodes,
        # optional
        fast_dev_run=hparams.fast_dev_run,
        use_amp=hparams.use_amp,
        amp_level=("O1" if hparams.use_amp else "O0"),
    )
    trainer.fit(model)

...

if __name__ == "__main__":
    ...
    parser = SparseNet.add_model_specific_args(parser)
    # HyperParameter search
    parser.opt_list(
        "--n", default=2000, type=int, tunable=True, options=[2000, 3000, 4000]
    )
    parser.opt_list(
        "--k", default=50, type=int, tunable=True, options=[100, 200, 300, 400]
    )
    parser.opt_list(
        "--batch_size",
        default=32,
        type=int,
        tunable=True,
        options=[32, 64, 128, 256, 512],
    )
    # parse params
    hparams = parser.parse_args()
    # LR for different batch_size
    if hparams.batch_size <= 128:
        hparams.learning_rate = 0.001
    else:
        hparams.learning_rate = 0.002
    # run trials of random search over the hyperparams
    if torch.cuda.is_available():
        hparams.optimize_parallel_gpu(
            main_local, max_nb_trials=20, gpu_ids=["0, 1"]
        )
    else:
        hparams.gpus = None
        hparams.distributed_backend = None
        hparams.optimize_parallel_cpu(main_local, nb_trials=20)
    # main_local(hparams)  # this works
gpu available: True, used: True
VISIBLE GPUS: 0,1
Caught exception in worker thread [Errno 10] No child processes
Traceback (most recent call last):
File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 37, in optimize_parallel_gpu_private
results = train_function(trial_params, gpu_id_set)
File "sparse_trainer.py", line 29, in main_local
trainer.fit(model)
File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 746, in fit
mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model, ))
File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 156, in spawn
error_queue = mp.SimpleQueue()
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/context.py", line 112, in SimpleQueue
return SimpleQueue(ctx=self.get_context())
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/queues.py", line 332, in __init__
self._rlock = ctx.Lock()
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/context.py", line 67, in Lock
return Lock(ctx=self.get_context())
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 162, in __init__
SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 80, in __init__
register(self._semlock.name)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py", line 83, in register
self._send('REGISTER', name)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py", line 90, in _send
self.ensure_running()
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py", line 46, in ensure_running
pid, status = os.waitpid(self._pid, os.WNOHANG)
ChildProcessError: [Errno 10] No child processes
^CTraceback (most recent call last):
File "sparse_trainer.py", line 73, in <module>
Process ForkPoolWorker-2:
Process ForkPoolWorker-1:
Process ForkPoolWorker-4:
main_local, nb_trials=20, trials=hparams.trials(20), gpu_ids=["0, 1"]
File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 361, in optimize_trials_parallel_gpu
Traceback (most recent call last):
results = self.pool.map(optimize_parallel_gpu_private, self.trials)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
Traceback (most recent call last):
File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 29, in optimize_parallel_gpu_private
gpu_id_set = g_gpu_id_q.get(block=True)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 95, in __enter__
return self._semlock.__enter__()
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
KeyboardInterrupt
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 29, in optimize_parallel_gpu_private
gpu_id_set = g_gpu_id_q.get(block=True)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 95, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 651, in get
Traceback (most recent call last):
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/home/kyoungrok/anaconda3/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 29, in optimize_parallel_gpu_private
gpu_id_set = g_gpu_id_q.get(block=True)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/queues.py", line 94, in get
res = self._recv_bytes()
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
KeyboardInterrupt
self.wait(timeout)
File "/home/kyoungrok/anaconda3/lib/python3.7/multiprocessing/pool.py", line 648, in wait
self._event.wait(timeout)
File "/home/kyoungrok/anaconda3/lib/python3.7/threading.py", line 552, in wait
signaled = self._cond.wait(timeout)
File "/home/kyoungrok/anaconda3/lib/python3.7/threading.py", line 296, in wait
waiter.acquire()
KeyboardInterrupt
When using one GPU, I can obtain nb_trials results. However, when using more than one GPU, fewer results are obtained. For example, when running 8 trials, I only get the version_0 to version_6 directories; the expected version_7 has not been generated.
When I try to run the example tensorflow_example.py from a conda environment, I get a "permission denied" error. It seems the script is trying to write to the /Users folder, and I'm not sure why that's not allowed.
When I try running Python with sudo, there is another issue: sudo changes $PATH, so I end up using a different python, which does not have tensorflow installed. This answer also suggests not running python in sudo mode...
So I guess the real question is how to make os.makedirs() work on my system (Ubuntu 16.04). Found an answer here. While I'm looking into the os.makedirs() permission issue, I thought I would open an issue in case other users run into the same situation.
Here is the full error log:
Here is the full error log:
(deep) yuqiong@yuqiong-G7-7588:/media/yuqiong/DATA/test-tube$ python examples/tensorflow_example.py
Caught exception in worker thread [Errno 13] Permission denied: '/Users'
Caught exception in worker thread [Errno 13] Permission denied: '/Users'
Caught exception in worker thread [Errno 13] Permission denied: '/Users'
Caught exception in worker thread [Errno 13] Permission denied: '/Users'
Traceback (most recent call last):
Traceback (most recent call last):
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/argparse_hopt.py", line 29, in optimize_parallel_gpu_private
results = train_function(trial_params)
File "examples/tensorflow_example.py", line 20, in train
autosave=False,
Traceback (most recent call last):
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/log.py", line 58, in __init__
self.__init_cache_file_if_needed()
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/log.py", line 121, in __init_cache_file_if_needed
os.makedirs(exp_cache_file)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
[Previous line repeated 2 more times]
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 220, in makedirs
mkdir(name, mode)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/argparse_hopt.py", line 29, in optimize_parallel_gpu_private
results = train_function(trial_params)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/argparse_hopt.py", line 29, in optimize_parallel_gpu_private
results = train_function(trial_params)
File "examples/tensorflow_example.py", line 20, in train
autosave=False,
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/log.py", line 58, in __init__
self.__init_cache_file_if_needed()
File "examples/tensorflow_example.py", line 20, in train
autosave=False,
PermissionError: [Errno 13] Permission denied: '/Users'
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/log.py", line 121, in __init_cache_file_if_needed
os.makedirs(exp_cache_file)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/log.py", line 58, in __init__
self.__init_cache_file_if_needed()
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/log.py", line 121, in __init_cache_file_if_needed
os.makedirs(exp_cache_file)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
[Previous line repeated 2 more times]
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 220, in makedirs
mkdir(name, mode)
[Previous line repeated 2 more times]
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 220, in makedirs
mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/Users'
PermissionError: [Errno 13] Permission denied: '/Users'
Traceback (most recent call last):
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/argparse_hopt.py", line 29, in optimize_parallel_gpu_private
results = train_function(trial_params)
File "examples/tensorflow_example.py", line 20, in train
autosave=False,
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/log.py", line 58, in __init__
self.__init_cache_file_if_needed()
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/site-packages/test_tube/log.py", line 121, in __init_cache_file_if_needed
os.makedirs(exp_cache_file)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
[Previous line repeated 2 more times]
File "/media/yuqiong/DATA/Anaconda3/envs/deep/lib/python3.6/os.py", line 220, in makedirs
mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/Users'
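For anyone else hitting this: the example appears to save under a hard-coded macOS-style /Users path, so pointing the experiment at a user-writable directory sidesteps the permission problem without sudo. A minimal sketch (the name matches the example's logs above; the save_dir is a placeholder):

from test_tube import Experiment

exp = Experiment(
    name='dense_model',
    save_dir='./test_tube_logs',  # any directory the current user can write to
    autosave=False,
)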