
Comments (13)

clementpoiret commented on May 29, 2024

Update on my issue, it seems that the optimizer setter is not triggered by the callback: https://github.com/Lightning-AI/lightning/blob/618e1c8061753e767e7ae628cf55098b8fa6ad55/src/lightning/pytorch/strategies/strategy.py#L107

This means that the resulting LightningOptimizer is not properly initialized by LightningOptimizer._to_lightning_optimizer.

One quick hack would be to modify the already existing object, like so:

    def on_fit_start(self, trainer: Trainer,
                     pl_module: LightningModule) -> None:
        if len(trainer.strategy.optimizers) > 1:
            raise MisconfigurationException(
                "SparseML only supports training with one optimizer.")
        optimizer = trainer.strategy.optimizers[0]

        # Let SparseML's modifier manager wrap the raw optimizer.
        optimizer = self.manager.modify(
            pl_module,
            optimizer,
            steps_per_epoch=trainer.estimated_stepping_batches,
            epoch=0)

        # Update both the strategy's optimizer list and the optimizer held by
        # the already-created LightningOptimizer wrapper.
        trainer.strategy._optimizers = [optimizer]
        trainer.strategy._lightning_optimizers[0]._optimizer = optimizer

What do you think about this possible solution? I'm afraid this might break older versions of PyTorch Lightning, as I have not tested it on them.

mgoin commented on May 29, 2024

Hey @clementpoiret, we added support for torch<2.2 on the latest sparseml-nightly. Please give it a try and let me know if you need any other assistance.

bfineran commented on May 29, 2024

Hi @clementpoiret, yes it looks like it does get overwritten here -

kwargs["training"] = torch.onnx.TrainingMode.PRESERVE

The reason why seems to be lost in the commit history, but we can dig a bit more. If you can make edits to your local install, you can try switching PRESERVE to EVAL here.
Additionally, you can try running an eval and benchmark on your exported model to see if it meets your accuracy/performance needs, in case the warning is not material.
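
As a minimal sketch of such a check (assuming onnxruntime is installed; the path, input name, and shape below are placeholders for your export), something like:

import numpy as np
import onnx
import onnxruntime as ort

# Placeholder path to the exported model.
onnx_path = "./sparseml_models/sparse_model.onnx"

# Structural validity check of the exported graph.
onnx.checker.check_model(onnx.load(onnx_path))

# Quick inference sanity check on a random input.
session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
dummy = np.random.randn(1, 1, 28, 28).astype(np.float32)
outputs = session.run(None, {"input": dummy})
print([o.shape for o in outputs])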

jeanniefinks commented on May 29, 2024

Hi @clementpoiret
As it's been some time without further discussion, I am going to go ahead and close this thread. However, feel free to re-open if there is further comment.
Thank you! 👋🏼

Jeannie / Neural Magic

mgoin commented on May 29, 2024

Hey @clementpoiret, I think the Lightning integration isn't heavily used, so I wouldn't have concerns about backwards compatibility for that interface. My main concern would be torch 2.1 affecting other flows. It sounds like that upgrade would still be needed for your DINO use case, correct?

clementpoiret commented on May 29, 2024

Yes, unfortunately I can't export dinov2 to ONNX using torch 2.0, as I am getting torch.onnx.errors.UnsupportedOperatorError: Exporting the operator 'aten::scaled_dot_product_attention' to ONNX opset version 17 is not supported.
This issue was fixed by the 2.1 update.

clementpoiret commented on May 29, 2024

Hey @mgoin, thanks for the update.
I can confirm everything works on my side!

clementpoiret commented on May 29, 2024

Okay @mgoin, I'll correct what I said 😂 My fix of the lightning callback was just (I believe) incorrectly applying the optimizer, thus dismissing the error. Actually, I end up with non-sparse models, meaning the optimizer step isn't applied correctly... So I still have the issue with sparseml 1.6.0: the ScheduledModifierManager and the LightningOptimizer classes have issues working together. I ended up with the same issue as above: self._strategy is None.

clementpoiret commented on May 29, 2024

Okay, I believe it's just a mix of ambiguities here and there; what I said above is weird. I still don't understand why modifying the base optimizer resets Lightning's strategy, but modifying the optimizer of the LightningOptimizer seems to be okay now.

What happened is quite simple: trainer.estimated_stepping_batches was incorrect, so the modifiers were almost never applied, with start_epoch and end_epoch being completely off. Manually indicating steps_per_epoch fixed the issue and applied all modifiers.

My last weird error is a warning when exporting to ONNX, saying that I should disable constant folding when training=TrainingMode.PRESERVE or TrainingMode.TRAINING is used. I don't get how to modify this behavior. I keep getting the warning even when setting export_onnx(..., training=TrainingMode.EVAL) or setting model.training = False.

For the sake of reference, here is the updated SparseMLCallback:

from typing import Any, Optional

import torch
from lightning.pytorch import Callback, LightningModule, Trainer
from lightning.pytorch.utilities.exceptions import MisconfigurationException
from sparseml.pytorch.optim import ScheduledModifierManager
from sparseml.pytorch.utils import ModuleExporter


class SparseMLCallback(Callback):
    """Enables SparseML aware training. Requires a recipe to run during training.

    Args:
        recipe_path: Path to a SparseML compatible yaml recipe.
            More information at https://docs.neuralmagic.com/sparseml/source/recipes.html

    """

    def __init__(self,
                 recipe_path: str,
                 steps_per_epoch: Optional[int] = None) -> None:
        self.manager = ScheduledModifierManager.from_yaml(recipe_path)
        self.steps_per_epoch = steps_per_epoch

    def on_fit_start(self, trainer: Trainer,
                     pl_module: LightningModule) -> None:
        # Grab the raw optimizer wrapped inside Lightning's LightningOptimizer.
        optimizer = trainer.strategy._lightning_optimizers[0]._optimizer

        if len(trainer.optimizers) > 1:
            raise MisconfigurationException(
                "SparseML only supports training with one optimizer.")

        # trainer.estimated_stepping_batches can be off for some setups,
        # so allow the user to pass steps_per_epoch explicitly.
        if self.steps_per_epoch is None:
            self.steps_per_epoch = trainer.estimated_stepping_batches

        optimizer = self.manager.modify(pl_module,
                                        optimizer,
                                        steps_per_epoch=self.steps_per_epoch,
                                        epoch=0)

        # Patch the wrapped optimizer in place so Lightning keeps using its
        # own LightningOptimizer while SparseML's modifiers are applied.
        trainer.strategy._lightning_optimizers[0]._optimizer = optimizer

    def on_fit_end(self, trainer: Trainer, pl_module: LightningModule) -> None:
        self.manager.finalize(pl_module)

    @staticmethod
    # TODO: check for TrainingMode.EVAL
    def export_to_sparse_onnx(model: LightningModule,
                              output_dir: str,
                              sample_batch: Optional[torch.Tensor] = None,
                              name: str = "model.onnx",
                              opset: int = 14,
                              disable_bn_fusing: bool = True,
                              convert_qat: bool = True,
                              **export_kwargs: Any) -> None:
        """Exports the model to ONNX format."""
        exporter = ModuleExporter(model, output_dir=output_dir)
        sample_batch = sample_batch if sample_batch is not None else model.example_input_array  # type: ignore[assignment] # noqa: E501
        if sample_batch is None:
            raise MisconfigurationException(
                "To export the model, a sample batch must be passed via "
                "``SparseMLCallback.export_to_sparse_onnx(model, output_dir, sample_batch=sample_batch)`` "
                "or an ``example_input_array`` property within the LightningModule"
            )
        exporter.export_onnx(
            sample_batch=sample_batch,
            name=name,
            opset=opset,
            disable_bn_fusing=disable_bn_fusing,
            convert_qat=convert_qat,
            **export_kwargs,
        )
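
And a minimal usage sketch (model, dataloader, and recipe path below are placeholders; steps_per_epoch is passed explicitly to avoid the estimated_stepping_batches issue mentioned above):

from lightning.pytorch import Trainer

# `model` and `train_loader` are placeholders for your LightningModule / DataLoader.
callback = SparseMLCallback(
    recipe_path="recipe.yaml",
    steps_per_epoch=len(train_loader),  # pass explicitly, see the note above
)
trainer = Trainer(max_epochs=10, callbacks=[callback])
trainer.fit(model, train_dataloaders=train_loader)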

bfineran commented on May 29, 2024

Hi @clementpoiret does the export with the warning still produce a valid ONNX model? Could you also paste the warning if you have it?

clementpoiret commented on May 29, 2024

@bfineran it sounds like it is valid. Here is the code I use to save the model:

# Export to ONNX
clf.eval()
clf.training = False
sparseml.export_to_sparse_onnx(
    clf,
    output_dir="./sparseml_models/",
    name="sparse_model.onnx",
    sample_batch=torch.randn(1, 1, 28, 28),
    # opset_version=17,
    opset=14,
    disable_bn_fusing=False,
    convert_qat=True,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {
            0: "batch_size"
        },
        "output": {
            0: "batch_size"
        },
    },
)

And the warning:

/home/clementpoiret/micromamba/envs/torch211/lib/python3.10/site-packages/torch/onnx/utils.py:823: UserWarning: It is recommended that constant folding be turned off ('do_constant_folding=False') when exporting the model in training-amenable mode, i.e. with 'training=TrainingMode.TRAIN' or 'training=TrainingMode.PRESERVE' (when model is in training mode). Otherwise, some learnable model parameters may not translate correctly in the exported ONNX model because constant folding mutates model parameters. Please consider turning off constant folding or setting the training=TrainingMode.EVAL.

Passing training=TrainingMode.EVAL has no effect, as it seems to be overwritten later on.
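
If the extra kwargs are forwarded to torch.onnx.export the same way input_names and dynamic_axes appear to be, one workaround I could try (untested) is simply turning constant folding off, as the warning suggests:

# Untested sketch: assumes ModuleExporter.export_onnx forwards extra kwargs
# (like do_constant_folding) straight to torch.onnx.export.
sparseml.export_to_sparse_onnx(
    clf,
    output_dir="./sparseml_models/",
    name="sparse_model.onnx",
    sample_batch=torch.randn(1, 1, 28, 28),
    do_constant_folding=False,  # what the warning recommends in PRESERVE/TRAIN mode
)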

clementpoiret commented on May 29, 2024

Thanks for your answer. Do you know the practical implications of saving a model in training mode? Will the ONNX file be bigger or the inference slower?

bfineran commented on May 29, 2024

Not sure of any specific examples, but if the model behaves differently in training vs eval (batch norm updates, dropout, etc.), these operations may be represented in the trace.
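
For instance, a quick sketch (assuming the onnx Python package is installed; the path below is a placeholder) to see which op types ended up in the exported graph:

from collections import Counter

import onnx

# Placeholder path; point this at your exported model.
model = onnx.load("./sparseml_models/sparse_model.onnx")
op_counts = Counter(node.op_type for node in model.graph.node)
print(op_counts)

# Dropout nodes left in the graph are a strong hint that training behavior
# was preserved in the trace; in eval mode they are normally folded away.
print("Dropout nodes:", op_counts.get("Dropout", 0))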
