
Comments (13)

clementpoiret commented on May 29, 2024

Update on my issue, it seems that the optimizer setter is not triggered by the callback: https://github.com/Lightning-AI/lightning/blob/618e1c8061753e767e7ae628cf55098b8fa6ad55/src/lightning/pytorch/strategies/strategy.py#L107

This means that the resulting LightningOptimizer is not properly initialized by LightningOptimizer._to_lightning_optimizer.

One quick hack would be to modify the already existing object, like so:

    def on_fit_start(self, trainer: Trainer,
                     pl_module: LightningModule) -> None:
        if len(trainer.strategy.optimizers) > 1:
            raise MisconfigurationException(
                "SparseML only supports training with one optimizer.")
        optimizer = trainer.strategy.optimizers[0]

        # Let SparseML's modifier manager wrap the raw optimizer.
        optimizer = self.manager.modify(
            pl_module,
            optimizer,
            steps_per_epoch=trainer.estimated_stepping_batches,
            epoch=0)

        # Update both the strategy's optimizer list and the optimizer held by
        # the already-created LightningOptimizer wrapper.
        trainer.strategy._optimizers = [optimizer]
        trainer.strategy._lightning_optimizers[0]._optimizer = optimizer

What do you think about this possible solution? I'm afraid this might break older versions of PyTorch Lightning, as I have not tested it on them.

mgoin commented on May 29, 2024

Hey @clementpoiret, we added support for torch<2.2 on the latest sparseml-nightly. Please give it a try and let me know if you need any other assistance.

bfineran commented on May 29, 2024

Hi @clementpoiret, yes it looks like it does get overwritten here -

kwargs["training"] = torch.onnx.TrainingMode.PRESERVE

The reason why seems to be lost in the commit history, but we can dig a bit more. If you can make edits to your local install, you can try switching PRESERVE to EVAL here.
Additionally, you can try running an eval and benchmark on your exported model to see if it meets your accuracy/performance needs, in case the warning is not material.
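
As a minimal sketch of such a check (assuming onnxruntime is installed; the path, input name, and shape below are placeholders for your export), something like:

import numpy as np
import onnx
import onnxruntime as ort

# Placeholder path to the exported model.
onnx_path = "./sparseml_models/sparse_model.onnx"

# Structural validity check of the exported graph.
onnx.checker.check_model(onnx.load(onnx_path))

# Quick inference sanity check on a random input.
session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
dummy = np.random.randn(1, 1, 28, 28).astype(np.float32)
outputs = session.run(None, {"input": dummy})
print([o.shape for o in outputs])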

jeanniefinks commented on May 29, 2024

Hi @clementpoiret
As it's been some time without further discussion, I am going to go ahead and close this thread. However, feel free to re-open if there is further comment.
Thank you! 👋🏼

Jeannie / Neural Magic

mgoin commented on May 29, 2024

Hey @clementpoiret, I think the Lightning integration isn't heavily used, so I wouldn't have concerns about backwards compatibility for that interface. My main concern would be torch 2.1 affecting other flows. It sounds like that upgrade would still be needed for your DINO use case, correct?

clementpoiret commented on May 29, 2024

Yes, unfortunately I can't export dinov2 to ONNX using torch 2.0, as I am getting torch.onnx.errors.UnsupportedOperatorError: Exporting the operator 'aten::scaled_dot_product_attention' to ONNX opset version 17 is not supported.
This issue was fixed by the 2.1 update.

clementpoiret commented on May 29, 2024

Hey @mgoin, thanks for the update.
I can confirm everything works on my side!

clementpoiret commented on May 29, 2024

Okay @mgoin, I'll correct what I said 😂 My fix of the lightning callback was just (I believe) incorrectly applying the optimizer, thus dismissing the error. Actually, I end up with non-sparse models, meaning the optimizer step isn't applied correctly... So I still have the issue with sparseml 1.6.0: the ScheduledModifierManager and the LightningOptimizer classes have issues working together. I ended up with the same issue as above: self._strategy is None.

clementpoiret commented on May 29, 2024

Okay, I believe it's just a mix of ambiguities here and there; what I said above is weird. I still don't understand why modifying the base optimizer resets Lightning's strategy, but modifying the optimizer of the LightningOptimizer seems to be okay now.

What happened is quite simple: trainer.estimated_stepping_batches was incorrect, so the modifiers were almost never applied, with start_epoch and end_epoch being completely off. Manually indicating steps_per_epoch fixed the issue and applied all modifiers.

My last weird error is a warning when exporting to ONNX, saying that I should disable constant folding when training=TrainingMode.PRESERVE or TrainingMode.TRAINING is used. I don't get how to modify this behavior. I keep getting the warning even when setting export_onnx(..., training=TrainingMode.EVAL) or setting model.training = False.

For the sake of reference, here is the updated SparseMLCallback:

from typing import Any, Optional

import torch
from lightning.pytorch import Callback, LightningModule, Trainer
from lightning.pytorch.utilities.exceptions import MisconfigurationException
from sparseml.pytorch.optim import ScheduledModifierManager
from sparseml.pytorch.utils import ModuleExporter


class SparseMLCallback(Callback):
    """Enables SparseML aware training. Requires a recipe to run during training.

    Args:
        recipe_path: Path to a SparseML compatible yaml recipe.
            More information at https://docs.neuralmagic.com/sparseml/source/recipes.html

    """

    def __init__(self,
                 recipe_path: str,
                 steps_per_epoch: Optional[int] = None) -> None:
        self.manager = ScheduledModifierManager.from_yaml(recipe_path)
        self.steps_per_epoch = steps_per_epoch

    def on_fit_start(self, trainer: Trainer,
                     pl_module: LightningModule) -> None:
        # Grab the raw optimizer wrapped inside Lightning's LightningOptimizer.
        optimizer = trainer.strategy._lightning_optimizers[0]._optimizer

        if len(trainer.optimizers) > 1:
            raise MisconfigurationException(
                "SparseML only supports training with one optimizer.")

        # trainer.estimated_stepping_batches can be off for some setups,
        # so allow the user to pass steps_per_epoch explicitly.
        if self.steps_per_epoch is None:
            self.steps_per_epoch = trainer.estimated_stepping_batches

        optimizer = self.manager.modify(pl_module,
                                        optimizer,
                                        steps_per_epoch=self.steps_per_epoch,
                                        epoch=0)

        # Patch the wrapped optimizer in place so Lightning keeps using its
        # own LightningOptimizer while SparseML's modifiers are applied.
        trainer.strategy._lightning_optimizers[0]._optimizer = optimizer

    def on_fit_end(self, trainer: Trainer, pl_module: LightningModule) -> None:
        self.manager.finalize(pl_module)

    @staticmethod
    # TODO: check for TrainingMode.EVAL
    def export_to_sparse_onnx(model: LightningModule,
                              output_dir: str,
                              sample_batch: Optional[torch.Tensor] = None,
                              name: str = "model.onnx",
                              opset: int = 14,
                              disable_bn_fusing: bool = True,
                              convert_qat: bool = True,
                              **export_kwargs: Any) -> None:
        """Exports the model to ONNX format."""
        exporter = ModuleExporter(model, output_dir=output_dir)
        sample_batch = sample_batch if sample_batch is not None else model.example_input_array  # type: ignore[assignment] # noqa: E501
        if sample_batch is None:
            raise MisconfigurationException(
                "To export the model, a sample batch must be passed via "
                "``SparseMLCallback.export_to_sparse_onnx(model, output_dir, sample_batch=sample_batch)`` "
                "or an ``example_input_array`` property within the LightningModule"
            )
        exporter.export_onnx(
            sample_batch=sample_batch,
            name=name,
            opset=opset,
            disable_bn_fusing=disable_bn_fusing,
            convert_qat=convert_qat,
            **export_kwargs,
        )
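
And a minimal usage sketch (model, dataloader, and recipe path below are placeholders; steps_per_epoch is passed explicitly to avoid the estimated_stepping_batches issue mentioned above):

from lightning.pytorch import Trainer

# `model` and `train_loader` are placeholders for your LightningModule / DataLoader.
callback = SparseMLCallback(
    recipe_path="recipe.yaml",
    steps_per_epoch=len(train_loader),  # pass explicitly, see the note above
)
trainer = Trainer(max_epochs=10, callbacks=[callback])
trainer.fit(model, train_dataloaders=train_loader)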

bfineran commented on May 29, 2024

Hi @clementpoiret does the export with the warning still produce a valid ONNX model? Could you also paste the warning if you have it?

clementpoiret commented on May 29, 2024

@bfineran it sounds like it is valid. Here is the code I use to save the model:

# Export to ONNX
clf.eval()
clf.training = False
sparseml.export_to_sparse_onnx(
    clf,
    output_dir="./sparseml_models/",
    name="sparse_model.onnx",
    sample_batch=torch.randn(1, 1, 28, 28),
    # opset_version=17,
    opset=14,
    disable_bn_fusing=False,
    convert_qat=True,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {
            0: "batch_size"
        },
        "output": {
            0: "batch_size"
        },
    },
)

And the warning:

/home/clementpoiret/micromamba/envs/torch211/lib/python3.10/site-packages/torch/onnx/utils.py:823: UserWarning: It is recommended that constant folding be turned off ('do_constant_folding=False') when exporting the model in training-amenable mode, i.e. with 'training=TrainingMode.TRAIN' or 'training=TrainingMode.PRESERVE' (when model is in training mode). Otherwise, some learnable model parameters may not translate correctly in the exported ONNX model because constant folding mutates model parameters. Please consider turning off constant folding or setting the training=TrainingMode.EVAL.

Passing training=TrainingMode.EVAL has no effect, as it seems to be overwritten later on.
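
If the extra kwargs are forwarded to torch.onnx.export the same way input_names and dynamic_axes appear to be, one workaround I could try (untested) is simply turning constant folding off, as the warning suggests:

# Untested sketch: assumes ModuleExporter.export_onnx forwards extra kwargs
# (like do_constant_folding) straight to torch.onnx.export.
sparseml.export_to_sparse_onnx(
    clf,
    output_dir="./sparseml_models/",
    name="sparse_model.onnx",
    sample_batch=torch.randn(1, 1, 28, 28),
    do_constant_folding=False,  # what the warning recommends in PRESERVE/TRAIN mode
)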

clementpoiret commented on May 29, 2024

Thanks for your answer. Do you know the practical implications of saving a model in training mode? Will the ONNX file be bigger or the inference slower?

bfineran commented on May 29, 2024

Not sure of any specific examples, but if the model behaves differently in training vs eval (batch norm updates, dropout, etc.), these operations may be represented in the trace.
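
For instance, a quick sketch (assuming the onnx Python package is installed; the path below is a placeholder) to see which op types ended up in the exported graph:

from collections import Counter

import onnx

# Placeholder path; point this at your exported model.
model = onnx.load("./sparseml_models/sparse_model.onnx")
op_counts = Counter(node.op_type for node in model.graph.node)
print(op_counts)

# Dropout nodes left in the graph are a strong hint that training behavior
# was preserved in the trace; in eval mode they are normally folded away.
print("Dropout nodes:", op_counts.get("Dropout", 0))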
