
lightning's Introduction

Lightning

The deep learning framework to pretrain, finetune and deploy AI models.

NEW: Lightning 2.0 features a clean and stable API!



Install Lightning

Simple installation from PyPI

pip install lightning
Other installation options

Install with optional dependencies

pip install lightning['extra']

Conda

conda install lightning -c conda-forge

Install stable version

Install future release from the source

pip install https://github.com/Lightning-AI/lightning/archive/refs/heads/release/stable.zip -U

Install bleeding-edge

Install nightly from the source (no guarantees)

pip install https://github.com/Lightning-AI/lightning/archive/refs/heads/master.zip -U

or from testing PyPI

pip install -U -i https://test.pypi.org/simple/ pytorch-lightning

Lightning has 4 core packages

PyTorch Lightning: Train and deploy PyTorch at scale.
Lightning Fabric: Expert control.
Lightning Data: Blazing fast, distributed streaming of training data from cloud storage.
Lightning Apps: Build AI products and ML workflows.

Lightning gives you granular control over how much abstraction you want to add over PyTorch.


PyTorch Lightning: Train and Deploy PyTorch at Scale

PyTorch Lightning is just organized PyTorch - Lightning disentangles PyTorch code to decouple the science from the engineering.

[Figure: the same PyTorch code reorganized as PyTorch Lightning]


Hello simple model

# main.py
# ! pip install torchvision
import torch, torch.nn as nn, torch.utils.data as data, torchvision as tv, torch.nn.functional as F
import lightning as L

# --------------------------------
# Step 1: Define a LightningModule
# --------------------------------
# A LightningModule (nn.Module subclass) defines a full *system*
# (ie: an LLM, diffusion model, autoencoder, or simple image classifier).


class LitAutoEncoder(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 28 * 28))

    def forward(self, x):
        # in lightning, forward defines the prediction/inference actions
        embedding = self.encoder(x)
        return embedding

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop. It is independent of forward
        x, _ = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


# -------------------
# Step 2: Define data
# -------------------
dataset = tv.datasets.MNIST(".", download=True, transform=tv.transforms.ToTensor())
train, val = data.random_split(dataset, [55000, 5000])

# -------------------
# Step 3: Train
# -------------------
autoencoder = LitAutoEncoder()
trainer = L.Trainer()
trainer.fit(autoencoder, data.DataLoader(train), data.DataLoader(val))

Run the model on your terminal

pip install torchvision
python main.py
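
Once training finishes, the forward method defined above handles inference. A minimal sketch, assuming the trained autoencoder object from main.py:

# use the trained encoder for inference
autoencoder.eval()
with torch.no_grad():
    x = torch.randn(1, 28 * 28)   # stand-in for a flattened MNIST image
    embedding = autoencoder(x)    # calls forward(), returning the 3-dim embedding
print(embedding.shape)            # torch.Size([1, 3])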

Advanced features

Lightning has 40+ advanced features designed for professional AI research at scale.

Here are some examples:

Train on 1000s of GPUs without code changes
# 8 GPUs
# no code changes needed
trainer = Trainer(accelerator="gpu", devices=8)

# 256 GPUs
trainer = Trainer(accelerator="gpu", devices=8, num_nodes=32)
Train on other accelerators like TPUs without code changes
# no code changes needed
trainer = Trainer(accelerator="tpu", devices=8)
16-bit precision
# no code changes needed
trainer = Trainer(precision=16)
Experiment managers
from lightning import loggers

# tensorboard
trainer = Trainer(logger=loggers.TensorBoardLogger("logs/"))

# weights and biases
trainer = Trainer(logger=loggers.WandbLogger())

# comet
trainer = Trainer(logger=loggers.CometLogger())

# mlflow
trainer = Trainer(logger=loggers.MLFlowLogger())

# neptune
trainer = Trainer(logger=loggers.NeptuneLogger())

# ... and dozens more
Early Stopping
from lightning.pytorch.callbacks import EarlyStopping

es = EarlyStopping(monitor="val_loss")
trainer = Trainer(callbacks=[es])
Checkpointing
from lightning.pytorch.callbacks import ModelCheckpoint

checkpointing = ModelCheckpoint(monitor="val_loss")
trainer = Trainer(callbacks=[checkpointing])
Export to torchscript (JIT) (production use)
# torchscript
autoencoder = LitAutoEncoder()
torch.jit.save(autoencoder.to_torchscript(), "model.pt")
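
The exported file can be loaded back without the original class definition; a minimal sketch:

# load the scripted model back (no LitAutoEncoder import needed)
model = torch.jit.load("model.pt")
embedding = model(torch.randn(1, 28 * 28))  # runs the module's forward()
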
Export to ONNX (production use)
# onnx
import os, tempfile

with tempfile.NamedTemporaryFile(suffix=".onnx", delete=False) as tmpfile:
    autoencoder = LitAutoEncoder()
    input_sample = torch.randn((1, 28 * 28))  # matches the encoder's 28*28 input
    autoencoder.to_onnx(tmpfile.name, input_sample, export_params=True)
    os.path.isfile(tmpfile.name)
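
The exported ONNX file can then be served from any ONNX runtime. A hedged sketch, assuming the onnxruntime package is installed:

# run the exported model with onnxruntime (pip install onnxruntime)
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(tmpfile.name)
input_name = session.get_inputs()[0].name
(embedding,) = session.run(None, {input_name: np.random.randn(1, 28 * 28).astype(np.float32)})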

Advantages over unstructured PyTorch

  • Models become hardware agnostic
  • Code is clear to read because engineering code is abstracted away
  • Easier to reproduce
  • Make fewer mistakes because lightning handles the tricky engineering
  • Keeps all the flexibility (LightningModules are still PyTorch modules), but removes a ton of boilerplate
  • Lightning has dozens of integrations with popular machine learning tools.
  • Tested rigorously with every new PR. We test every combination of PyTorch and Python supported versions, every OS, multi GPUs and even TPUs.
  • Minimal running speed overhead (about 300 ms per epoch compared with pure PyTorch).


Lightning Fabric: Expert control.

Run on any device at any scale with expert-level control over PyTorch training loop and scaling strategy. You can even write your own Trainer.

Fabric is designed for the most complex models of any size: foundation model scaling, LLMs, diffusion models, transformers, reinforcement learning, and active learning.

What to change:
+ import lightning as L
  import torch; import torchvision as tv

 dataset = tv.datasets.CIFAR10("data", download=True,
                               train=True,
                               transform=tv.transforms.ToTensor())

+ fabric = L.Fabric()
+ fabric.launch()

  model = tv.models.resnet18()
  optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
- device = "cuda" if torch.cuda.is_available() else "cpu"
- model.to(device)
+ model, optimizer = fabric.setup(model, optimizer)

  dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)
+ dataloader = fabric.setup_dataloaders(dataloader)

  model.train()
  num_epochs = 10
  for epoch in range(num_epochs):
      for batch in dataloader:
          inputs, labels = batch
-         inputs, labels = inputs.to(device), labels.to(device)
          optimizer.zero_grad()
          outputs = model(inputs)
          loss = torch.nn.functional.cross_entropy(outputs, labels)
-         loss.backward()
+         fabric.backward(loss)
          optimizer.step()
          print(loss.data)
Resulting Fabric code (copy me!):

import lightning as L
import torch; import torchvision as tv

dataset = tv.datasets.CIFAR10("data", download=True,
                              train=True,
                              transform=tv.transforms.ToTensor())

fabric = L.Fabric()
fabric.launch()

model = tv.models.resnet18()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
model, optimizer = fabric.setup(model, optimizer)

dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)
dataloader = fabric.setup_dataloaders(dataloader)

model.train()
num_epochs = 10
for epoch in range(num_epochs):
    for batch in dataloader:
        inputs, labels = batch
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = torch.nn.functional.cross_entropy(outputs, labels)
        fabric.backward(loss)
        optimizer.step()
        print(loss.data)

Key features

Easily switch from running on CPU to GPU (Apple Silicon, CUDA, …), TPU, multi-GPU or even multi-node training
# Use your available hardware
# no code changes needed
fabric = Fabric()

# Run on GPUs (CUDA or MPS)
fabric = Fabric(accelerator="gpu")

# 8 GPUs
fabric = Fabric(accelerator="gpu", devices=8)

# 256 GPUs, multi-node
fabric = Fabric(accelerator="gpu", devices=8, num_nodes=32)

# Run on TPUs
fabric = Fabric(accelerator="tpu")
Use state-of-the-art distributed training strategies (DDP, FSDP, DeepSpeed) and mixed precision out of the box
# Use state-of-the-art distributed training techniques
fabric = Fabric(strategy="ddp")
fabric = Fabric(strategy="deepspeed")
fabric = Fabric(strategy="fsdp")

# Switch the precision
fabric = Fabric(precision="16-mixed")
fabric = Fabric(precision="64")
All the device logic boilerplate is handled for you
  # no more of this!
- model.to(device)
- batch.to(device)
Build your own custom Trainer using Fabric primitives for training, checkpointing, logging, and more
import lightning as L


class MyCustomTrainer:
    def __init__(self, accelerator="auto", strategy="auto", devices="auto", precision="32-true"):
        self.fabric = L.Fabric(accelerator=accelerator, strategy=strategy, devices=devices, precision=precision)

    def fit(self, model, optimizer, dataloader, max_epochs):
        self.fabric.launch()

        model, optimizer = self.fabric.setup(model, optimizer)
        dataloader = self.fabric.setup_dataloaders(dataloader)
        model.train()

        for epoch in range(max_epochs):
            for batch in dataloader:
                input, target = batch
                optimizer.zero_grad()
                output = model(input)
                loss = loss_fn(output, target)  # loss_fn: your task's loss, e.g. torch.nn.functional.cross_entropy
                self.fabric.backward(loss)
                optimizer.step()
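
A usage sketch for the custom trainer above; the model, data, and loss_fn here are placeholders:

import torch

loss_fn = torch.nn.functional.cross_entropy  # referenced inside fit()

model = torch.nn.Linear(32, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)

trainer = MyCustomTrainer(accelerator="auto")
trainer.fit(model, optimizer, dataloader, max_epochs=3)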

You can find a more extensive example in our examples directory.



Lightning Apps: Build AI products and ML workflows

Lightning Apps remove the cloud infrastructure boilerplate so you can focus on solving the research or business problems. Lightning Apps can run on the Lightning Cloud, your own cluster or a private cloud.

Hello Lightning app world

# app.py
import lightning as L


class TrainComponent(L.LightningWork):
    def run(self, x):
        print(f"train a model on {x}")


class AnalyzeComponent(L.LightningWork):
    def run(self, x):
        print(f"analyze model on {x}")


class WorkflowOrchestrator(L.LightningFlow):
    def __init__(self) -> None:
        super().__init__()
        self.train = TrainComponent(cloud_compute=L.CloudCompute("cpu"))
        self.analyze = AnalyzeComponent(cloud_compute=L.CloudCompute("gpu"))

    def run(self):
        self.train.run("CPU machine 1")
        self.analyze.run("GPU machine 2")


app = L.LightningApp(WorkflowOrchestrator())

Run on the cloud or locally

# run on the cloud
lightning run app app.py --setup --cloud

# run locally
lightning run app app.py


Examples

Self-supervised Learning
Convolutional Architectures
Reinforcement Learning
GANs
Classic ML

Continuous Integration

Lightning is rigorously tested across multiple CPUs, GPUs and TPUs and against major Python and PyTorch versions.

*Codecov coverage is above 90%, but build delays may cause the badge to show less.
Current build statuses
System / PyTorch ver.              | 1.13         | 2.0          | 2.1
Linux py3.9 [GPUs]                 | Build Status |              |
Linux py3.9 [TPUs]                 | Test PyTorch - TPU |        |
Linux (multiple Python versions)   | Test PyTorch | Test PyTorch | Test PyTorch
OSX (multiple Python versions)     | Test PyTorch | Test PyTorch | Test PyTorch
Windows (multiple Python versions) | Test PyTorch | Test PyTorch | Test PyTorch

Community

The lightning community is maintained by

  • 10+ core contributors who are a mix of professional engineers, research scientists, and PhD students from top AI labs.
  • 800+ community contributors.

Want to help us build Lightning and reduce boilerplate for thousands of researchers? Learn how to make your first contribution here.

Lightning is also part of the PyTorch ecosystem which requires projects to have solid testing, documentation and support.

Asking for help

If you have any questions please:

  1. Read the docs.
  2. Search through existing Discussions, or add a new question.
  3. Join our Discord.

lightning's People

Contributors

akihironitta, ananthsub, arnaudgelas, awaelchli, borda, carmocca, daniellepintz, dependabot[bot], duyicong515, edenlightning, ethanwharris, four4fish, jeremyjordan, jerome-habana, jjenniferdai, justusschock, kaushikb11, krishnakalyan3, krshrimali, mauvilsa, neggert, nicolai86, otaj, rohitgr7, s-rog, seannaren, skaftenicki, tchaton, victorprins, williamfalcon


lightning's Issues

Relax requirement for DistributedSampler with ddp

Is your feature request related to a problem? Please describe.
I have an application where I'm using a custom BatchSampler to construct batches for the N-Pairs metric learning loss. I need all of the data to be available on all processes when using DistributedDataParallel, so I wouldn't want to use DistributedSampler, even if it was compatible with a custom BatchSampler. Right now, I've hit a wall because lightning throws this exception:

pytorch_lightning.utilities.debugging.MisconfigurationException: 
when using multiple gpus and multiple nodes you must pass
 a DistributedSampler to DataLoader(sampler).

i.e., this:
dataset = myDataset()
dataloader = DataLoader(dataset)

becomes:
dataset = myDataset()
dist_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
dataloader = DataLoader(dataset, sampler=dist_sampler)

Describe the solution you'd like
Could this exception be turned into a warning? I'm all for letting the user know when they're violating best practices, but throwing an exception removes flexibility for advanced users.

Describe alternatives you've considered
I looked at using the dp backend, but that's not going to work because the n-pairs loss needs the entire batch to compute the loss. Splitting it into chunks breaks things.

If I'm understanding correctly, this is actually another limitation introduced by Lightning. In a usual DataParallel setting, the batch would be merged back together before computing the loss and everything would be fine.

Returning None in validation_end method raises error

Hey,
If we define a validation_end method like

    def validation_end(self, outputs):
        return

it raises an error:

AttributeError: 'NoneType' object has no attribute 'items'

Is this intended? If not, shouldn't this part of the code initialize the metrics dict
https://github.com/williamFalcon/pytorch-lightning/blob/018b8da50e90638e8aa8d3eda1f8637656c25f2d/pytorch_lightning/models/trainer.py#L987

like here

https://github.com/williamFalcon/pytorch-lightning/blob/018b8da50e90638e8aa8d3eda1f8637656c25f2d/pytorch_lightning/models/trainer.py#L886

Arbitrary lr_scheduler?

Currently the only learning rate scheduler supported is MultiStepLR, specified through the params of the Trainer() constructor.
What do you think about a more flexible approach for the lr scheduler, maybe an optional user-defined function in Trainer() similar to configure_optimizers?
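
For reference, a hypothetical sketch of one shape such an API could take — configure_optimizers returning a scheduler alongside the optimizer (not the API at the time of this issue):

def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
    # hypothetical: hand back any scheduler, not just MultiStepLR
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
    return [optimizer], [scheduler]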

Quantisation and Pruning Support

Is your feature request related to a problem? Please describe.
Nowadays, there is a need to take the floating point models that have been trained and deploy them to edge devices. One popular way is to quantise the weights and activations of a neural network to a lower bit width (e.g. 8 bits or even 4 bits). The benefits of this are twofold:

  1. Some accelerators perform computation at lower bit widths much faster than fp16 or fp32 computation.
  2. The model takes less space, and the savings increase by a substantial factor every time we reduce a bit from the tensor data type.

People have tried other means to compress a model, one of them is pruning.
Pruning basically means that some of the weights of a neural network are zero, hence we seek to introduce sparsity in the network.

The benefits of this are that you potentially do not have to perform the useless multiplications with zeros, providing a potential computation saving. Research has shown that even after pruning ~80% of weights (this is fine-grained pruning), the network preserves its accuracy. This is a very surprising result. Coarse-grained pruning (setting all weights of a channel to zero) also works to an extent but results in significantly more accuracy loss. This is an active research area.

Describe the solution you'd like
Generally, quantisation works through the use of a scale value and a zero-point value, so each quantised tensor needs to carry the quantised data, its scale, and its zero point. The scale and zero point are needed to convert between quantised and dequantised tensors.
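
To make the scale/zero-point mechanics concrete, a minimal affine-quantisation sketch:

import torch

def quantize(x, scale, zero_point, bits=8):
    # map float values onto the integer grid [0, 2^bits - 1]
    qmin, qmax = 0, 2 ** bits - 1
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return q.to(torch.uint8)

def dequantize(q, scale, zero_point):
    # recover approximate float values from the integer grid
    return (q.float() - zero_point) * scale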

There are 2 ways to quantize a model:

  1. Post training quantisation: Quantises a trained model, no retraining required (works well for down to 8 bits).
  2. Quantisation Aware Training: A way to train a model to induce robustness to quantisation (works well for aggressive quantisation schemes, down to 4 bits).

I have successfully implemented the post-training quantisation algorithms and was able to get a quantised MNIST model down to 8 bits with next to no accuracy loss. Going down to 4 bits resulted in the model diverging. I am currently working on quantisation-aware training. If you want to see how post-training quantisation works, please check out this Google Colab notebook.

Now, let's come to pruning:

Pruning is a very general thing; there could be a lot of ways to perform it. As far as I know, there is generally a "pruning schedule". The researcher decides when to prune and what percentage of weights to prune (aka the degree of sparsity of the layer). They could prune some layers and leave some as is, or slowly increase the sparsity degree of the pruned layers during training. There are also different types of pruning: a structured way to prune weights (e.g. take off full channels of a conv kernel or reduce a dimension of a fully connected layer by 1) or an unstructured way to prune (randomly zero out weights).
Lightning could potentially offer a structured and unstructured way to prune to help out researchers. If you would like to see pruning in action, I have tried pruning an MNIST model using the algorithm from the Google paper "To Prune or not to Prune". It is unstructured pruning with 90% sparsity, and I was able to reach roughly the same accuracy as the un-pruned model. This is the Google Colab link for it.

Describe alternatives you've considered
Right now PyTorch doesn't have quantisation and pruning support; however, that is in the works. We could either wait for them to complete their work or we could implement a small library ourselves.

The use case I was trying to target is that Lightning could become a playground where researchers test out quantisation and pruning on their models and potentially implement novel algorithms through its base support.

Additional context
If any of you want to learn more about quantization, I have embedded the resources I learnt from below. They were indeed invaluable.

Jacob Benoit et al’s Quantisation Paper (Google)
Raghuraman’s Paper on Quantisation (Google, he’s now at Facebook)
Distiller Docs on Quantisation
Gemmlowp’s Quantisation Tutorial

Add support for ReduceLROnPlateau

Is your feature request related to a problem? Please describe.
As of now, it does not seem possible to use ReduceLROnPlateau, as a metric has to be passed to the lr_scheduler's step method.

Describe the solution you'd like
A possibility to use ReduceLROnPlateau on some or any of the metrics calculated during training or validation.

Describe alternatives you've considered
In my use case, I want to do the step based on a metric calculated on the validation set. As a workaround, I define the lr_scheduler in the __init__ of the model and perform the step in the validation_end function.
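
A sketch of that workaround, using the era's validation_end hook (the metric names are illustrative):

class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        ...  # layers
        # workaround: own the scheduler instead of handing it to the Trainer
        self.optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        self.scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(self.optimizer, mode="min")

    def validation_end(self, outputs):
        val_loss = torch.stack([o["val_loss"] for o in outputs]).mean()
        self.scheduler.step(val_loss)  # pass the monitored metric explicitly
        return {"val_loss": val_loss}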

pip install -e . crash

Ran:

pip install -e .

Obtaining file:///Users/williamfalcon/Developer/opensource/pytorch-lightning
Complete output from command python setup.py egg_info:
error in pytorch-lightning setup command: ("EntryPoint must be in 'name=module:attrs [extras]' format", 'pytorch-lightning=pytorch-lightning.cli:main')

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /Users/williamfalcon/Developer/opensource/pytorch-lightning/

@shreyasbapat

Enable multiple dataset in validation_step

Allow validation step to use multiple datasets.

Need to also decide how it will be called (especially to handle a case where the two datasets aren't the same length).

Option A:

for batch_a in dataset_a:
    model.validation_step(batch_a, batch_nb, dataset_index)

for batch_b in dataset_b:
    model.validation_step(batch_b, batch_nb, dataset_index)

Option B:

for batches in zip(dataset_a, dataset_b):
    # use both
    # new dynamic signature validation_step(batch_a, batch_b, batch_n, batch_nb)
    model.validation_step(*batches, batch_nb)

Option C:
(I think this is the only real generic way). The user would have to make sure datasets are the same length or be ok with iterating only as far as the shortest one.

for batches in zip(dataset_a, dataset_b, ..., dataset_n):
    # use both
    # new dynamic signature validation_step(batch_a, batch_b, batch_n, batch_nb)
    model.validation_step(*batches, batch_nb)

@cinjon @ppwwyyxx

How to set hyperparameters search range and run the search?

Thanks for your powerful project; it's really helpful. I'd like to try it for my current research if everything goes well.

My problem is: how do I set the hyperparameter search range and run the search? I've read the chapters 'CPU hyperparameter search' and 'Running grid search on a cluster' in your documentation; however, it is not very clear, as there are only a few lines of code in the 'CPU hyperparameter search' chapter without explanation (and main_local appears in the code without declaration).

Here is my attempt to change LightningTemplateModel and single_cpu_template.py to be able to perform a hyperparameter search:

  1. set tunable=True for some params in def add_model_specific_args(parent_parser, root_dir) in LightningTemplateModel, e.g., parser.opt_list('--learning_rate', default=0.001*8, type=float, options=[0.0001, 0.0005, 0.001, 0.005], tunable=True)
  2. comment out main(hyperparams) and add hyperparams.optimize_parallel_cpu( main, nb_trials=20, nb_workers=1 )

However, it doesn't seem to work. So how can I set the hyperparameter search range and run the search?

Sorry if my presentation is unclear (I'm not a native speaker). Thanks.

Cannot load saved model.

I cannot load model back after saved, using load_from_metrics:

Traceback (most recent call last):
  File "eval-metric.py", line 60, in <module>
    main(hyperparams)
  File "eval-metric.py", line 32, in main
    on_gpu=False, map_location=None)
  File "/opt/anaconda3/lib/python3.7/site-packages/pytorch_lightning/root_module/root_module.py", line 112, in load_from_metrics
    model = cls(hparams)
  File "/home/jupyter/kaggle-CellSignal/arcface_module.py", line 74, in __init__
    super().__init__(hparams)
  File "/opt/anaconda3/lib/python3.7/site-packages/pytorch_lightning/root_module/root_module.py", line 12, in __init__
    super(LightningModule, self).__init__(*args, **kwargs)
TypeError: __init__() takes 1 positional argument but 2 were given

I really don't know what went wrong.
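
For context, load_from_metrics instantiates the class as cls(hparams), so __init__ must accept hparams without forwarding it to the parent constructor. A sketch of the expected shape (the module name here is hypothetical):

class ArcFaceModule(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()      # don't forward hparams up to nn.Module
        self.hparams = hparams  # keep them around for checkpoint reloading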

revert to absolute imports

Recent relative imports are causing issues. In addition, PEP 8 recommends absolute imports for clarity as well.

Let's go back to absolute imports.

Is it possible to make `validation_step` and `val_dataloader` no-ops?

Is your feature request related to a problem? Please describe.
Sometimes I don't have a separate validation split, only a train/test split. I'm trying out pytorch-lightning to prototype / experiment, and trying to see what the best way of doing this is.

I could make the train dataset and then do torch.utils.data.random_split or use torch.utils.data.SubsetRandomSampler to build a validation set as well, but if I don't have enough data (or just don't want to do a separate validation step) this isn't ideal.

Describe the solution you'd like
I'd like to be able to implement only the training_step, train_dataloader, and test_dataloader methods and then have the validation step and validation metrics be omitted (maybe explicit no-ops). Right now, I'm experimenting with having an empty DataLoader for the validation data.

Describe alternatives you've considered

  • Implement val_dataloader with an empty (dummy) DataLoader
    • Not sure if this will work yet (if lightning will still call validation_step and validation_end).

No real time Experiment logging

Currently I'm using your library in a simple setup

exp = Experiment(save_dir=save_dir)
trainer = Trainer(max_nb_epochs=1, experiment=exp)
trainer.fit(my_model)

on Google Colab. The folder default/version_0/tf/ gets created immediately, but sadly the TF experiment logs are only saved when the training finishes or is aborted by me with KeyboardInterrupt. So I can't watch the training process in TensorBoard. Do you have any suggestions on what to change to receive real-time updates?

codecov doesn't respect ignore

We need to add some files to codecov's ignore list... but it seems not to respect the ignore section:

coverage:
  precision: 0  # 2 = xx.xx%, 0 = xx%
  round: nearest # how coverage is rounded: down/up/nearest
  range: 40...100 # custom range of coverage colors from red -> yellow -> green
  status:
    # https://codecov.readme.io/v1.0/docs/commit-status
    project:
      default:
        against: auto
        target: 99% # specify the target coverage for each commit status
        threshold: 20% # allow this little decrease on project
        # https://github.com/codecov/support/wiki/Filtering-Branches
        # branches: master
        if_ci_failed: error
    # https://github.com/codecov/support/wiki/Patch-Status
    patch:
      default:
        against: auto
        target: 40% # specify the target "X%" coverage to hit
        # threshold: 50% # allow this much decrease on patch
    changes: false
    ignore:
      - "pytorch_lightning/utilities/arg_parse.py"
      - "raise *"

@Borda

how to setup slurm in a cluster

This is a great repo wrapping PyTorch for DDP use, especially with Slurm, for which few repos exist. I found Slurm support in mmdetection as well.

In my group, we have 3 nodes, each of which has 4 GPUs.
I want to set up a Slurm cluster to fully use these nodes, but little documentation can be found.
So could you please share a tutorial on setting up Slurm in a cluster? My nodes all run Ubuntu 18.04 server.

self-balancing architecture

This is a really awesome feature we're looking to add. Super hard problem also if any ninjas want to try to tackle it :) (you'll be legendary haha).

Problem:
Some models are too big to fit in memory. Thus can't do any distributed training currently available (even in PyTorch).

But... we can break up the model and put parts on each GPU. The trick though is to do it automatically, because manually doing this is a PITA (trust me, I spent weeks dealing with this haha).

Proposed solution:
User hook in LightningModule where user returns the modules they want balanced.

class MyModule(LightningModule):
    def __init__(self, ...):
        self.model_a = SomeModel()
        self.layer_1 = Linear(...)
        self.layer_2 = Linear(...)

    def forward(self, x):
        # in each of these module calls, auto place the input x on the gpu of the module
        x = self.model_a(x)
        x = self.layer_1(x)
        x = self.layer_2(x)
        return x

    def self_balance(self):
        return [self.model_a, self.layer_1, self.layer_2]

So the above does two cool things:

  1. user says how they want to break up the model.
  2. In the forward, we auto put the input on that module's GPU.

That's the easy part lol... the hard part is deciding how to balance... optimizing for speed so you minimize data transfer across GPUs while not blowing up the RAM and using the RAM efficiently.

Anyone want to give this a shot?

Unable to import trainer

Hey, I am able to import pytorch_lightning but not the trainer. I am new to Python and have no idea how to deal with it. It throws the following error:

File "", line 1, in
ImportError: cannot import name Trainer

Thanks

ModuleNotFoundError: No module named 'demo'

simon:~/Desktop/pytorch-lightning/demo$ python fully_featured_trainer.py
Traceback (most recent call last):
File "fully_featured_trainer.py", line 20, in
from demo.example_model import ExampleModel
ModuleNotFoundError: No module named 'demo'

Adding visualization module

Do you consider adding visualization ability? For example adding TensorBoard utility to visualize validation curve, or scalar changes, etc.

Allow optimizers to alternate at arbitrary intervals

For GANs or similar approaches, we may want optimizer A to step every batch while optimizer B might step every k batches.

This feature will enable this behavior.

Approach still needs to be scoped out. Open to suggestions here.

Incorrect Implementation for Accumulating Batch Gradients in Trainer

Current Behavior:
If accumulate_grad_batches is greater than the default of 1, the Trainer takes the loss from each batch and runs loss.backward() for each accumulated batch, running optimizer.step() once the desired number of batches has undergone backprop.
Loss averaging is only done for batch_loss_value.

Correct Behavior:
The loss from the output needs to be divided by accumulate_grad_batches before loss.backward() is run; otherwise the overall magnitude of the gradient could be up to N times greater for a simulated batch size N times bigger than the actual one.
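
Sketched on a bare training loop (compute_loss is a placeholder for the model's loss computation):

accumulate_grad_batches = 4

optimizer.zero_grad()
for i, batch in enumerate(dataloader):
    loss = compute_loss(model, batch)
    # scale so the accumulated gradient matches a real batch N times larger
    (loss / accumulate_grad_batches).backward()
    if (i + 1) % accumulate_grad_batches == 0:
        optimizer.step()
        optimizer.zero_grad()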

moving examples out of the package

Hello, nice piece of work. I was wondering if it would be easier to have the examples outside the package (more intuitive to find, and keeps the package simple), as well as all the tests?
(e.g. pytorch-lightning/pytorch_lightning/testing_models/lm_test_module.py)

Training accuracy

I was wondering whether there is something like validation_end but for training (e.g., training_end). I want to compute the training accuracy at the end of each epoch. Thanks!
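
Until such a hook exists, one workaround is to accumulate the statistics manually; a sketch, assuming an on_epoch_end hook is available and self.correct/self.total are ad-hoc counters initialised in __init__:

def training_step(self, batch, batch_idx):
    x, y = batch
    logits = self(x)
    loss = F.cross_entropy(logits, y)
    # ad-hoc epoch-accuracy bookkeeping
    self.correct += (logits.argmax(dim=1) == y).sum().item()
    self.total += y.size(0)
    return loss

def on_epoch_end(self):
    print(f"train acc: {self.correct / self.total:.3f}")
    self.correct, self.total = 0, 0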

DDP support on Jupyter Notebook

I'm trying to get DDP, fp16, etc. running; however, just setting the gpus and distributed_backend='ddp' makes the CoolModel demo crash on my machine. I'm running it from a Jupyter notebook with Python 3.6 on an Ubuntu 18.04 machine with 2xV100.
The errors directly in the notebook are:

Exception Traceback (most recent call last)
in
69
70 # train (1 epoch only here for demo)
---> 71 trainer.fit(model)
72
73 # view tensorflow logs

/opt/miniconda3/envs/dev_pytorch_lightning36/lib/python3.6/site-packages/pytorch_lightning/models/trainer.py in fit(self, model)
425 """
426 warnings.warn(msg)
--> 427 mp.spawn(self.ddp_train, nprocs=len(self.data_parallel_device_ids), args=(model, ))
428
429 # 1 gpu or dp option triggers training using DP module

/opt/miniconda3/envs/dev_pytorch_lightning36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in spawn(fn, args, nprocs, join, daemon)
165
166 # Loop on join until it returns True or raises an exception.
--> 167 while not spawn_context.join():
168 pass

/opt/miniconda3/envs/dev_pytorch_lightning36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
106 raise Exception(
107 "process %d terminated with exit code %d" %
--> 108 (error_index, exitcode)
109 )
110

Exception: process 0 terminated with exit code 1

and in the jupyter server output window:

Traceback (most recent call last):
File "", line 1, in
File "/opt/miniconda3/envs/dev_pytorch_lightning36/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "/opt/miniconda3/envs/dev_pytorch_lightning36/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'CoolModel' on <module 'main' (built-in)>

I installed apex as described in the repository.

AttributeError: 'TTNamespace' object has no attribute 'drop_prob'

Got the following error while running the Demo examples:

python single_gpu_node_template.py --gpus "0,1,2,3"

Traceback (most recent call last):
File "single_gpu_node_template.py", line 112, in
main(hyperparams)
File "single_gpu_node_template.py", line 33, in main
model = LightningTemplateModel(hparams)
File "/home/dgueraco/projects/pytorch-lightning/pytorch_lightning/examples/new_project_templates/lightning_module_template.py", line 37, in init
self.__build_model()
File "/home/dgueraco/projects/pytorch-lightning/pytorch_lightning/examples/new_project_templates/lightning_module_template.py", line 49, in __build_model
self.c_d1_drop = nn.Dropout(self.hparams.drop_prob)
AttributeError: 'TTNamespace' object has no attribute 'drop_prob'

Update Lightning compatibility with PyTorch 1.2.0

Is your feature request related to a problem? Please describe.
PyTorch 1.2.0 has breaking changes for the experiment object.
Likely underlying changes to SummaryWriter.

For now, Lightning requires pytorch 1.1.0 but need to update compatibility.

Consider: ability to set seed

I dunno if this is in scope (feel free to close if not), but when experimenting, setting a fixed seed is handy since you can remove one source of randomness (Karpathy's recipe even includes it as an important beginning step).

Basically, being able to set the seeds for the random, numpy, torch, and other common modules in the config would be handy.
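
A minimal helper of the kind being requested:

import os
import random

import numpy as np
import torch

def set_seed(seed: int = 42):
    # pin down the common sources of randomness
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)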

Issue install the library

Hey, I wanted to give this library a try, so I did pip install pytorch-lightning, which gave the following error

C:\Users\cs>pip install pytorch-lightning
Collecting pytorch-lightning
  Using cached https://files.pythonhosted.org/packages/7e/3e/599dfe7b8c35ef9c72d
f4825d876c023fafe5e2618483ee3f3f2f4cdc3a9/pytorch-lightning-0.0.2.tar.gz
Collecting test-tube (from pytorch-lightning)
  Using cached https://files.pythonhosted.org/packages/3a/50/47ea5613be804c8e6e0
b01b1719e1f8186b8bc626441002b141c8a962abb/test_tube-0.631.tar.gz
Collecting torch (from pytorch-lightning)
  Using cached https://files.pythonhosted.org/packages/5f/e9/bac4204fe9cb1a002ec
6140b47f51affda1655379fe302a1caef421f9846/torch-0.1.2.post1.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\cs\AppData\Local\Temp\pip-install-bufxsmuo\torch\setup.py",
 line 11, in <module>
        raise RuntimeError(README)
    RuntimeError: PyTorch does not currently provide packages for PyPI (see stat
us at https://github.com/pytorch/pytorch/issues/566).

    Please follow the instructions at http://pytorch.org/ to install with minico
nda instead.


    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in C:\Users\cs\AppDa
ta\Local\Temp\pip-install-bufxsmuo\torch\

C:\Users\cs>

So I went to pytorch.org and got the following set of commands to install PyTorch on my system:

pip3 install https://download.pytorch.org/whl/cpu/torch-1.0.1-cp37-cp37m-win_amd64.whl
pip3 install torchvision

The first command generated an error

C:\Users\cs>pip3 install https://download.pytorch.org/whl/cpu/torch-1.0.1-cp37-c
p37m-win_amd64.whl
torch-1.0.1-cp37-cp37m-win_amd64.whl is not a supported wheel on this platform.

C:\Users\cs>

Now the only option for me is to install PyTorch from source. I was wondering if we could provide pytorch-lightning as a Docker image. We could provide a template Dockerfile where people only provide the path to their test_python.py file. Is that a viable option?

My system: Windows 7, 32-bit, Python 3.7

Enable any ML experiment tracking framework

People seem to have strong preferences for either using MLFlow, test-tube, polyaxon, etc...

Let's just add generic support for whatever people want to use. I don't know if generic support is possible, but each can easily be supported individually.

To make this work we'd need to:

  • change the logging to be non test-tube specific. Logging only happens in 2 places (train and validation completion).
  • Each call to log needs to be process-safe. Meaning when using distributed only rank=0 will log.
  • the experiment param in Trainer will need to be generalized (signature the same), to take any logger.

I think that's all that's needed to add this support.

Any suggestions and takers for working on this integration?
@Borda @alok
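
One possible shape for such a generic interface (a sketch, not an existing API):

from abc import ABC, abstractmethod

class LoggerBase(ABC):
    # hypothetical adapter: MLFlow, test-tube, polyaxon, etc. each implement this
    @abstractmethod
    def log_metrics(self, metrics: dict, step: int): ...

    @abstractmethod
    def log_hyperparams(self, params: dict): ...

    def finalize(self):
        # optional: flush/close the backend; rank-0 guarding happens in the caller
        pass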

codecov not updating

Awesome improvements to coverage and tests! thanks @Borda

Wondering what has to be done to update the badge now. I pushed a report from GPU coverage but no updates.

@Borda

Dataset only available when the trainer is instantiated

Is your feature request related to a problem? Please describe.
This is half feedback, half feature request. Maybe our approach is not right, but here is what we felt when trying this awesome library:

We would like to use a LightningModule in our pipelines, but we have some constraints which makes this difficult.

We have an experiment framework where we can register models (e.g. a LightningModule) by instantiating them. Then the framework trains the various models using some train/val/test data specified at runtime and generates performance reports.

Pseudo code:

class TorchModel:
  def fit(x_train, y_train, x_val, y_val):
    trainer = Trainer(...)
    trainer.fit(self.module)

models = [
  ModelA(...),
  TorchModel(module=CoolModel()),  # TorchModel is actually a wrapper which exposes a common interface to Sklearn/Keras/Torch models
]

experiment_runner = Runner(models)
experiment_runner.run(train_dataset, val_dataset, test_dataset)

Or Uber's Ludwig would do:

from ludwig.api import LudwigModel

# train a model
model_definition = {...}
model = LudwigModel(model_definition)
train_stats = model.train(training_dataframe)

Describe the solution you'd like
For us, the datasets / input tensors don't belong in the definition of the module. We understand that it improves reproducibility, but it may reduce the portability of models.

They probably should be provided to the trainer at instantiation:

Trainer(train_dataset=..., val_dataset=...)

# And maybe
class CoolModel(pl.LightningModule):
    ...

    @pl.data_loader
    def tng_dataloader(self, dataset):
        return DataLoader(dataset, batch_size=32)

   ...

Describe alternatives you've considered
A temporary solution could be:

class TorchModel:
  def fit(x_train, y_train, x_val, y_val):
    self.module.set_train_dataset(x_train, y_train)
    self.module.set_val_dataset(x_val, y_val)
    trainer = Trainer(...)
    trainer.fit(self.module)

Additional context
Thanks for creating this library, this makes pytorch so much easier to use!

Streamlined UX in saving, loading, continue training.

Currently, each time I run the command with the same experiment name, a new version is created and trained from scratch.
If I want to load the model back to continue training, I have to create a different script and input the path to the previous checkpoints and meta_tags.

Proposal:

  • Start training, input only 1 log_path and 1 experiment_name. Test_tube data and saved models will be included in the same folder.
  • In each experiment, the default is to create a new model and train from scratch (current behavior)
  • If user passed in --continue-training, the model will load the latest model (not necessarily the best model) from the latest version, then create a new experiment version and continue training from there.
  • If user passed in --continue-training--best, the model will load the best model from the latest version, then create a new experiment version and continue training from there.
  • If user passed in --continue-training(--best) --version=X, auto load the model from version X and start training.

In the end, all paths and directories are determined only from the log_path and the experiment name, and the default behavior is the same as now:
python train.py --exp_name=exp1 --....
If users want to change some hyperparams, or start finetuning, they will only need to add:
python train.py --exp_name=exp1 --.... --continue-training--best --version=x

What do you think?
Happy to discuss more, as I am not sure whether this is suitable for cluster training or not.
I can make a PR if you're interested.

Trainer.fit() crashes if no checkpoint callback is provided

I hope it's okay that I keep posting issues...
Now that I can circumvent the GitHub installation issues, I pulled in the latest master and let my simple CoolModel demo code run. But now calling trainer.fit() crashes with:

AttributeError Traceback (most recent call last)
in
21 )
22
---> 23 trainer.fit(model)
24 # exp.close()

/opt/miniconda3/envs/dev_pytorch_lightning36/lib/python3.6/site-packages/pytorch_lightning/models/trainer.py in fit(self, model)
494 self.optimizers, self.lr_schedulers = self.optimizers
495
--> 496 self.__run_pretrain_routine(model)
497
498 # return 1 when finished

/opt/miniconda3/envs/dev_pytorch_lightning36/lib/python3.6/site-packages/pytorch_lightning/models/trainer.py in __run_pretrain_routine(self, model)
680
681 # restore training and model before hpc call
--> 682 self.restore_state_if_existing_checkpoint()
683
684 # enable cluster checkpointing

/opt/miniconda3/envs/dev_pytorch_lightning36/lib/python3.6/site-packages/pytorch_lightning/models/trainer.py in restore_state_if_existing_checkpoint(self)
261
262 # find last epoch
--> 263 checkpoints = os.listdir(self.checkpoint_callback.filepath)
264 for name in checkpoints:
265 # ignore hpc ckpts

AttributeError: 'NoneType' object has no attribute 'filepath'

Looking at the code, it appears to happen because I did not provide a checkpoint callback, and it tries to access it in restore_state_if_existing_checkpoint.

Nitpick: `ptl` may be better as `pl`

(Feel free to ignore.)

All the usage examples do import pytorch_lightning as ptl. Instead of ptl, pl may be better as it doesn't clash with any library I know of, is 2 characters like NumPy's np, and is harder to mistype as plt, which many researchers probably also have imported. Since the library is in its early days, I don't think it would be that dramatic a change, and it is a little easier to read for people like me who often mix up letters like that.

On the other hand, it's pretty clear that it's not matplotlib from context, is yet another change, and is an aesthetic choice at its root, so it may not be worth it.

pip installation using github repository incomplete

I tried to install pytorch-lightning using pip and the github repository.
Importing the module results in the following errors:

ModuleNotFoundError Traceback (most recent call last)
in
8 from torchvision import ops
9
---> 10 import pytorch_lightning as ptl
11 from pytorch_lightning import Trainer
12 from test_tube import Experiment

/opt/miniconda3/envs/dev_pytorch_lightning36/lib/python3.6/site-packages/pytorch_lightning/__init__.py in
----> 1 from .models.trainer import Trainer
2 from .root_module.root_module import LightningModule
3 from .root_module.decorators import data_loader

ModuleNotFoundError: No module named 'pytorch_lightning.models'

Following the error, I found out that there is no models folder under the path site-packages\pytorch_lightning.

0.4.0 release - final checks (releasing later today)

Want to release the last 2 core required features we were missing:

  1. continue training (and session) from checkpoint (added).
  2. 16-bit with single GPU and no DP or DDP (added).

Any other stability things to consider before releasing? (3.7)

Proposal for help

Hi @williamFalcon! I saw your project and I am very pleased by the idea. I wish to help you write production-level code. Please let me know in what way I can help!

Predict method for test set

The main Lightning module requires you to define the test_dataloader function, but I'm not able to find any method that takes the test loader as input. Is there a model.predict() method to call on the test set?

change Checkpoint callback's `save_best_only` to `save_top_k`

Is your feature request related to a problem? Please describe.
save_best_only is a special case of save_top_k. However, the save_top_k checkpoints can be used to create an ensemble model at test time.

Describe the solution you'd like
Keep a dict of {epoch: monitor} of length k; save any new checkpoint that can enter this dict and remove the worst checkpoint.
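
A sketch of that bookkeeping (update_top_k and ckpt_path are hypothetical names):

import os

def update_top_k(best, epoch, monitor, k, ckpt_path, mode="min"):
    # `best` maps epoch -> monitored value for the current top-k checkpoints
    best[epoch] = monitor
    if len(best) > k:
        worst = max(best, key=best.get) if mode == "min" else min(best, key=best.get)
        del best[worst]
        os.remove(ckpt_path.format(epoch=worst))  # drop the evicted checkpoint
    return best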
