Comments (22)
Sure, I was thinking about keeping it really lightweight, so we could even drop the arg parser class or move it into this framework... About the logging, what about having one more parameter like log_callback, which would be used if it is specified... Then in the examples you keep your proposal using test-tube, but if someone comes along and wants to use anything else, they just change the callback function... I think Keras uses callbacks in a similar way...
Good suggestion, but I think that's just adding another thing for the user to remember. I'd rather they just plug in the logger and have it work.
So, in the meantime, why don't we just internally call the correct function depending on the logger?
something like:
def log(logger, metrics):
    package_name = logger.__module__.split('.')[0]
    if package_name == 'test_tube':
        logger.test_tube_way_of_logging(metrics)  # placeholder
    elif package_name == 'mlflow':
        logger.ml_flow_way_of_logging(metrics)  # placeholder
And replace the lines of
self.experiment.log(x)
# replace with
self.log(logger, x)
Just tried this:
(ddt) comp:~ user$ python
Python 3.7.3 (default, Mar 27 2019, 16:54:48)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from test_tube import Experiment
WARNING:root:This caffe2 python run does not have GPU support. Will run in CPU only mode.
>>> Experiment.__module__
'test_tube.log'
>>>
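Putting the `__module__` trick from the transcript together, a dispatch helper keyed on the package name might look like the sketch below. The test_tube branch uses Experiment.log, which takes a dict of metrics; the mlflow call shape is an assumption about its API.

```python
def log(logger, metrics):
    """Dispatch to the right logging call based on the logger's package.

    The package is recovered from the module path, e.g.
    'test_tube.log' -> 'test_tube'.
    """
    package_name = type(logger).__module__.split('.')[0]
    if package_name == 'test_tube':
        logger.log(metrics)  # test-tube's Experiment.log takes a dict
    elif package_name == 'mlflow':
        for key, value in metrics.items():
            logger.log_metric(key, value)  # mlflow logs one metric at a time
    else:
        raise ValueError('no logging dispatch for package {!r}'.format(package_name))
```

The downside of this approach is that every new framework needs another branch inside the trainer, which is part of why the base-class proposal below is cleaner.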
from lightning.
I would keep #44 simple, just adding CI, so I would leave resolving test-tube to another PR/user 😄
Alright, started taking a look at this. I think it's going to be more straightforward than I initially thought. I'd like some feedback on a proposal though, before I start working on it.
I'd like to create a base class for experiment loggers. Loggers for individual tracking frameworks would inherit from this base class. The particular logger that a user wants can be passed to the Trainer
constructor, just like the test-tube experiment is today.
Proposed base class:
class LightningLoggerBase:
    def log_metrics(self, metrics_dict, epoch_num):
        pass

    def log_hyperparams(self, param_dict):
        pass

    def save(self):
        pass
Implementation of a logger for a particular tracking framework would look like this:
class MLFlowLogger(LightningLoggerBase):
...
And using it would be straightforward:
from pytorch_lightning.loggers import MLFlowLogger
logger = MLFlowLogger(...)
trainer = pl.Trainer(logger=logger, ...)
trainer.fit(model)
Any feedback on this? In particular, I'm wondering if I'm missing methods that LightningLoggerBase should have. Off the top of my head, do we want properties for experiment_name and experiment_version? Or perhaps an optional method to save a model or other artifact?
I'm also wondering what we do with the existing test-tube integration. We could rip the existing stuff out and replace it with a test-tube logger that inherits from LightningLoggerBase. Or we could leave it in, since it's optional, and add other frameworks using the new interface. Thoughts?
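If we go the rip-and-replace route, the test-tube adapter could be a thin wrapper over the proposed base class. A sketch, where the delegation targets (Experiment.log, Experiment.tag, Experiment.save) are my assumptions about test-tube's API:

```python
class LightningLoggerBase:
    """Proposed base class (repeated here so the sketch is self-contained)."""

    def log_metrics(self, metrics_dict, epoch_num):
        pass

    def log_hyperparams(self, param_dict):
        pass

    def save(self):
        pass


class TestTubeLogger(LightningLoggerBase):
    """Adapter around a test-tube Experiment (delegated method names assumed)."""

    def __init__(self, experiment):
        self.experiment = experiment

    def log_metrics(self, metrics_dict, epoch_num):
        # test-tube takes one dict per row; fold the epoch in as a column
        self.experiment.log(dict(metrics_dict, epoch=epoch_num))

    def log_hyperparams(self, param_dict):
        self.experiment.tag(param_dict)

    def save(self):
        self.experiment.save()
```

The trainer then only ever talks to the LightningLoggerBase interface and never needs per-framework branches.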
Beautiful solution. Then we don't have to wrap them; each user can wrap their own?
I think we should also move these options from the trainer to this logger class:
log_save_interval=100,
add_log_row_interval=10,
Because if no logger is passed in, these flags won't be used, so it makes sense to define them in the logger.
The root experiment also needs:
- rank property (to make sure it only logs from process 0 in ddp mode).
- get_non_ddp_exp() - when launching ddp, PT pickles everything. If a logger isn't picklable, ddp won't work. Either we let those loggers fail (i.e. be unusable) or we allow the user to create a pickle-safe version. The test-tube exp has the following meta-data class:
class DDPExperiment(object):
    def __init__(self, exp):
        """
        Used as meta-data storage if the experiment needs to be pickled.

        :param exp: the Experiment whose metadata should be snapshotted
        """
        self.tag_markdown_saved = exp.tag_markdown_saved
        self.no_save_dir = exp.no_save_dir
        self.metrics = exp.metrics
        self.tags = exp.tags
        self.name = exp.name
        self.debug = exp.debug
        self.version = exp.version
        self.autosave = exp.autosave
        self.description = exp.description
        self.create_git_tag = exp.create_git_tag
        self.exp_hash = exp.exp_hash
        self.created_at = exp.created_at
        self.save_dir = exp.save_dir

    def get_non_ddp_exp(self):
        return Experiment(
            name=self.name,
            debug=self.debug,
            version=self.version,
            save_dir=self.save_dir,
            autosave=self.autosave,
            description=self.description,
            create_git_tag=self.create_git_tag,
        )
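A more generic way to get the same pickle safety, without a dedicated meta-data class per framework, is to drop the unpicklable handle in __getstate__ and rebuild it after unpickling. A sketch under the assumption that the logger can be reconstructed from plain metadata (the class and attribute names are hypothetical):

```python
import pickle
import threading


class PickleSafeLogger:
    """Sketch: keep only plain metadata when pickled, rebuild the rest later."""

    def __init__(self, name, version):
        self.name = name
        self.version = version
        # Stand-in for an unpicklable resource (e.g. a SummaryWriter's open file).
        self._handle = threading.Lock()

    def __getstate__(self):
        # Drop what pickle cannot serialize; keep the plain metadata.
        state = self.__dict__.copy()
        state['_handle'] = None
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # _handle would be recreated lazily on first use after unpickling.
```

This keeps the burden on each logger implementation rather than the trainer, which matches the "each user can wrap their own" idea above.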
Experiment name and version
A nice thing about the test-tube logger is that you get an organized set of experiments, each with a version and a set of hyperparams saved in meta_tags.csv.
This should just be provided by the core class which people use to wrap up their experiments. We can add a flag to turn this off in the case where an MLTracker already does this.
I think those are all the main points we need to account for. If you have a cleaner way of handling the above needs we can go with that too. These are just points to account for.
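The test-tube organization mentioned above amounts to one meta_tags.csv of key/value pairs per version directory. A minimal sketch of what the core class could provide (directory layout and column names are assumptions modeled on test-tube, not its actual code):

```python
import csv
import os


def save_hparams(param_dict, save_dir, version):
    """Write hyperparams as key/value rows, one file per experiment version."""
    version_dir = os.path.join(save_dir, 'version_{}'.format(version))
    os.makedirs(version_dir, exist_ok=True)
    path = os.path.join(version_dir, 'meta_tags.csv')
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['key', 'value'])
        for key, value in param_dict.items():
            writer.writerow([key, value])
    return path
```

The flag to turn this off (for trackers that already persist hyperparams) would just skip this call.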
I was envisioning that Lightning could provide loggers for some of the more common frameworks (maybe Test Tube, MLFlow, and polyaxon?). Of course, users would be free to write their own as well.
I'm not sure I agree about log_save_interval and add_log_row_interval. Those control how often the trainer calls the logger methods. IMO they should stay in the trainer, but I don't feel strongly about it. On a similar note, IMO the Trainer should be responsible for logging only from the rank 0 process.
What were you running into that made test-tube Experiments non-pickleable? I would have guessed that it would just work.
I don't remember if it was test-tube itself; I think it's the PyTorch SummaryWriter. I didn't spend too much time debugging because the PT code is a bit messy there.
It might be helpful to try a quick pickle of an experiment instance, though.
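The quick check suggested here fits in a few lines; anything that raises inside pickle.dumps (an open file handle, a SummaryWriter holding one, a lock, etc.) is exactly what would break ddp:

```python
import pickle


def is_picklable(obj):
    """Return True if obj survives a pickle round-trip, False otherwise."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
```

For example, running is_picklable(exp) on an Experiment instance before handing it to a ddp Trainer would tell us whether the meta-data workaround is needed at all.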
I think the trainer should be the one to save hyperparameters; that will allow loading when people don't use a logger.
Same for weights.
Let's do the .load separately because we need to think it through. On one hand it is more flexible to allow users to load using arbitrary args, but it will break reproducibility. So, most likely I think we should force users to use the hyperparams to restore the model. This means the trainer should save hyperparams automatically instead of the logger. This will all be part of the style guide I'm creating.
I also wonder if SlurmManager should just be brought into lightning. Thoughts?
So far as I can see, all the methods used are the following:
from test_tube import HyperOptArgumentParser, Experiment, SlurmCluster
I would remove the test_tube dependency from pytorch_lightning.testing.lm_test_module so the pytorch_lightning package is independent of these additional frameworks, and then I would keep test_tube just for testing/examples...
And instead of HyperOptArgumentParser I would use the standard/basic Python argparse:
https://docs.python.org/3/library/argparse.html
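For comparison, the plain-argparse route looks like this (the hyperparameter names are made up for illustration; HyperOptArgumentParser accepts the same add_argument calls since it subclasses ArgumentParser):

```python
from argparse import ArgumentParser

parser = ArgumentParser(description='model hyperparameters')
parser.add_argument('--learning_rate', type=float, default=0.001)
parser.add_argument('--batch_size', type=int, default=32)

# In a script you would call parser.parse_args() with no arguments so it
# reads sys.argv; an explicit list is passed here just for illustration.
hparams = parser.parse_args(['--learning_rate', '0.01'])
```

The resulting namespace (hparams.learning_rate, hparams.batch_size) is what would be handed to the model constructor and to log_hyperparams.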
There are also these lines, which need to be generic for each logger (but that seems hard, seeing that they all have different interfaces). So, maybe have a function that calls the appropriate thing:
here, here, here, here (basically search for self.experiment).
I kind of like the idea of limiting support for a few things here so that people aren't lost in what they can use. We can support many different ones, but would be nice in the docs if people were told which ones they can choose from.
A big help that lightning provides is helping people who don't know what they don't know. For instance some people not in CS/ML have never heard of tensorboard. But if they see it listed here, they'll know they can get the support for free if they use it.
Also remember that the experiment object in test-tube is a subclass of PyTorch SummaryWriter.
HyperOptArgumentParser is a subclass of ArgumentParser. It's the same thing, except it adds more features for hyperparameter optimization.
Most people also use ArgumentParser to pass in args to their models. Thus, this is the standard way of doing it in lightning. Remove the choices for things that should be standard.
A core idea that I like about Lightning is that it removes the unnecessary choices you might need to make. So it's highly opinionated about standard practices. (ie: to log we all do this, to viz we all do this, to grid search we all do this, etc...). Except instead of doing something a million ways, there's one way.
I'd like to pick one of the frameworks up, MLflow if it isn't already taken. I'll look at it tomorrow morning and try to figure out how it can be integrated. I'm a bit confused though: can't tensorboard be used to track experiments and different model runs?
Also, could lightning add a feature for quantisation or pruning?
Maybe expose an API so that people can try out different quantisation and pruning strategies to compress their Network.
I was recently able to quantise an fp32 model to 8 bits (only the fully connected and conv layers) with next to no loss in accuracy; this was an MNIST network. I'm following and using the same algos the TensorFlow Lite people are using. I can share a Google Colab notebook of a sample implementation if you think it's a viable thing for this project.
@karanchahal I agree with you... most people I know use tensorboard, but happy to allow this to be flexible. There's no real need to force an opinion on this particular thing from the framework.
I think @Borda is probably going to look at it (but we can deconflict here if you want to look at it).
However, a way of doing network pruning and quantisation would be good. I don't know those in detail, so maybe make a proposal and describe the use cases (separate issue with the enhancement tag)?
@karanchahal feel free to give this a shot!
link to this issue
@karanchahal still interested in making these improvements?
@neggert For the test tube logger, I think that it would be best to make it just another logger and remove the special-casing it has right now. Since this is meant to become more generic in the PyTorch ecosystem, I think making it more pluggable is a better idea.
The rank == 0 check would have to be inside the abstraction.
The user will for sure call self.logger(...) at some point in their code. They won't know what process they're on, and adding an if statement every time would get annoying. More room for user error.
we can keep the logging interval stuff in trainer for now.
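One way to put the rank check inside the abstraction is a small decorator on the base class's logging methods. A sketch where rank_zero_only and the rank attribute are hypothetical names, not existing API:

```python
import functools


def rank_zero_only(fn):
    """Decorator: run the wrapped logger method only on process rank 0."""
    @functools.wraps(fn)
    def wrapped(self, *args, **kwargs):
        if getattr(self, 'rank', 0) == 0:
            return fn(self, *args, **kwargs)
    return wrapped


class LoggerBase:
    """Sketch: the trainer would set .rank to the ddp process rank."""

    def __init__(self):
        self.rank = 0
        self.history = []

    @rank_zero_only
    def log_metrics(self, metrics, epoch_num):
        self.history.append((epoch_num, metrics))
```

With this, user code can call the logger from anywhere without an if statement; calls from non-zero ranks are silently dropped.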
Ah, okay. Makes sense. I was thinking that the user wouldn't ever call the logger themselves, but I guess there's no way to stop them :P.
Any insight into what was causing the problems with pickling that you ran into? It's not clear to me if this is a problem we'll run into frequently, or if there's something specific to Test Tube that causes it.
Does it make sense to do #183 along the way? This would let users load models from whatever logger they want.
I'm imagining adding a couple of methods to LightningLoggerBase:
def get_hyperparams(self):
    pass

def save_weights(self, trainer, filename):
    pass

def get_weights_path(self, filename):
    pass
Then a model can be loaded like so:
new_hparams = logger.get_hyperparams()
new_weights_path = logger.get_weights_path("save_test.ckpt")
model_2 = LightningTemplateModel.load(new_weights_path, new_hparams, on_gpu=False)
Much of the logic in the existing load_from_metrics would get moved into TestTubeLogger.get_hyperparams.