
optimum's Introduction


Hugging Face Optimum

🤗 Optimum is an extension of 🤗 Transformers and Diffusers, providing a set of optimization tools enabling maximum efficiency to train and run models on targeted hardware, while keeping things easy to use.

Installation

🤗 Optimum can be installed using pip as follows:

python -m pip install optimum

If you'd like to use the accelerator-specific features of 🤗 Optimum, you can install the required dependencies according to the table below:

Accelerator | Installation
ONNX Runtime | pip install --upgrade --upgrade-strategy eager optimum[onnxruntime]
Intel Neural Compressor | pip install --upgrade --upgrade-strategy eager optimum[neural-compressor]
OpenVINO | pip install --upgrade --upgrade-strategy eager optimum[openvino]
NVIDIA TensorRT-LLM | docker run -it --gpus all --ipc host huggingface/optimum-nvidia
AMD Instinct GPUs and Ryzen AI NPU | pip install --upgrade --upgrade-strategy eager optimum[amd]
AWS Trainium & Inferentia | pip install --upgrade --upgrade-strategy eager optimum[neuronx]
Habana Gaudi Processor (HPU) | pip install --upgrade --upgrade-strategy eager optimum[habana]
FuriosaAI | pip install --upgrade --upgrade-strategy eager optimum[furiosa]

The --upgrade --upgrade-strategy eager option is needed to ensure the different packages are upgraded to the latest possible version.

To install from source:

python -m pip install git+https://github.com/huggingface/optimum.git

For the accelerator-specific features, append optimum[accelerator_type] to the above command:

python -m pip install optimum[onnxruntime]@git+https://github.com/huggingface/optimum.git

Accelerated Inference

๐Ÿค— Optimum provides multiple tools to export and run optimized models on various ecosystems:

  • ONNX / ONNX Runtime
  • TensorFlow Lite
  • OpenVINO
  • Habana first-gen Gaudi / Gaudi2, more details here
  • AWS Inferentia 2 / Inferentia 1, more details here
  • NVIDIA TensorRT-LLM, more details here

The export and optimizations can be done both programmatically and with a command line.
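For example, here is a rough programmatic sketch of the ONNX export, assuming the main_export helper from optimum.exporters.onnx (check the documentation for the exact signature):

from optimum.exporters.onnx import main_export

# Programmatic counterpart of `optimum-cli export onnx`; the output directory name is illustrative
main_export(
    "distilbert-base-uncased-finetuned-sst-2-english",
    output="distilbert_onnx",
)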

Features summary

Features | ONNX Runtime | Neural Compressor | OpenVINO | TensorFlow Lite
Graph optimization | ✔️ | N/A | ✔️ | N/A
Post-training dynamic quantization | ✔️ | ✔️ | N/A | ✔️
Post-training static quantization | ✔️ | ✔️ | ✔️ | ✔️
Quantization Aware Training (QAT) | N/A | ✔️ | ✔️ | N/A
FP16 (half precision) | ✔️ | N/A | ✔️ | ✔️
Pruning | N/A | ✔️ | ✔️ | N/A
Knowledge Distillation | N/A | ✔️ | ✔️ | N/A

OpenVINO

Before you begin, make sure you have all the necessary libraries installed:

pip install --upgrade --upgrade-strategy eager optimum[openvino]

It is possible to export 🤗 Transformers and Diffusers models to the OpenVINO format easily:

optimum-cli export openvino --model distilbert-base-uncased-finetuned-sst-2-english distilbert_sst2_ov

If you add --weight-format int8, the weights will be quantized to int8; check out our documentation for more details on weight-only quantization. To apply quantization on both weights and activations, you can find more information here.
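For example, exporting with int8 weight-only quantization directly (the output directory name is illustrative):

optimum-cli export openvino --model distilbert-base-uncased-finetuned-sst-2-english --weight-format int8 distilbert_sst2_ov_int8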

To load a model and run inference with OpenVINO Runtime, you can just replace your AutoModelForXxx class with the corresponding OVModelForXxx class. To load a PyTorch checkpoint and convert it to the OpenVINO format on-the-fly, you can set export=True when loading your model.

- from transformers import AutoModelForSequenceClassification
+ from optimum.intel import OVModelForSequenceClassification
  from transformers import AutoTokenizer, pipeline

  model_id = "distilbert-base-uncased-finetuned-sst-2-english"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForSequenceClassification.from_pretrained(model_id)
+ model = OVModelForSequenceClassification.from_pretrained("distilbert_sst2_ov")

  classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
  results = classifier("He's a dreadful magician.")
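Alternatively, as mentioned above, the PyTorch checkpoint can be converted on the fly instead of loading the pre-exported folder:

model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)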

You can find more examples in the documentation and in the examples.

Neural Compressor

Before you begin, make sure you have all the necessary libraries installed:

pip install --upgrade --upgrade-strategy eager optimum[neural-compressor]

Dynamic quantization can be applied on your model:

optimum-cli inc quantize --model distilbert-base-cased-distilled-squad --output ./quantized_distilbert

To load a model quantized with Intel Neural Compressor, hosted locally or on the 🤗 Hub, you can do as follows:

from optimum.intel import INCModelForSequenceClassification

model_id = "Intel/distilbert-base-uncased-finetuned-sst-2-english-int8-dynamic"
model = INCModelForSequenceClassification.from_pretrained(model_id)
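The loaded model can then be used like any other Transformers model, for example in a pipeline (a short sketch reusing the snippet above):

from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained(model_id)
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
results = classifier("He's a dreadful magician.")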

You can find more examples in the documentation and in the examples.

ONNX + ONNX Runtime

Before you begin, make sure you have all the necessary libraries installed:

pip install optimum[exporters,onnxruntime]

It is possible to export 🤗 Transformers and Diffusers models to the ONNX format and perform graph optimization as well as quantization easily:

optimum-cli export onnx -m deepset/roberta-base-squad2 --optimize O2 roberta_base_qa_onnx

The model can then be quantized using onnxruntime:

optimum-cli onnxruntime quantize \
  --avx512 \
  --onnx_model roberta_base_qa_onnx \
  -o quantized_roberta_base_qa_onnx

These commands will export deepset/roberta-base-squad2 and perform O2 graph optimization on the exported model, and finally quantize it with the avx512 configuration.
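A rough programmatic equivalent of the quantization step, assuming the current ORTQuantizer API (see the documentation for the exact signatures):

from optimum.onnxruntime import ORTModelForQuestionAnswering, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Load the ONNX model exported above and quantize it with the avx512 configuration
model = ORTModelForQuestionAnswering.from_pretrained("roberta_base_qa_onnx")
quantizer = ORTQuantizer.from_pretrained(model)
qconfig = AutoQuantizationConfig.avx512(is_static=False, per_channel=False)
quantizer.quantize(save_dir="quantized_roberta_base_qa_onnx", quantization_config=qconfig)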

For more information on the ONNX export, please check the documentation.

Run the exported model using ONNX Runtime

Once the model is exported to the ONNX format, we provide Python classes enabling you to run the exported ONNX model in a seamless manner, using ONNX Runtime as the backend:

- from transformers import AutoModelForQuestionAnswering
+ from optimum.onnxruntime import ORTModelForQuestionAnswering
  from transformers import AutoTokenizer, pipeline

  model_id = "deepset/roberta-base-squad2"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForQuestionAnswering.from_pretrained(model_id)
+ model = ORTModelForQuestionAnswering.from_pretrained("roberta_base_qa_onnx")
  qa_pipe = pipeline("question-answering", model=model, tokenizer=tokenizer)
  question = "What's Optimum?"
  context = "Optimum is an awesome library everyone should use!"
  results = qa_pipe(question=question, context=context)
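If the model has not been exported beforehand, recent Optimum releases can also, to the best of my knowledge, convert the PyTorch checkpoint on the fly (mirroring the export=True behavior shown for OpenVINO above):

model = ORTModelForQuestionAnswering.from_pretrained(model_id, export=True)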

More details on how to run ONNX models with ORTModelForXXX classes here.

TensorFlow Lite

Before you begin, make sure you have all the necessary libraries installed:

pip install optimum[exporters-tf]

Just as for ONNX, it is possible to export models to TensorFlow Lite and quantize them:

optimum-cli export tflite \
  -m deepset/roberta-base-squad2 \
  --sequence_length 384  \
  --quantize int8-dynamic roberta_tflite_model

Accelerated training

🤗 Optimum provides wrappers around the original 🤗 Transformers Trainer to enable training on powerful hardware easily. We support many providers:

  • Habana's Gaudi processors
  • AWS Trainium instances, check here
  • ONNX Runtime (optimized for GPUs)

Habana

Before you begin, make sure you have all the necessary libraries installed:

pip install --upgrade --upgrade-strategy eager optimum[habana]
- from transformers import Trainer, TrainingArguments
+ from optimum.habana import GaudiTrainer, GaudiTrainingArguments

  # Download a pretrained model from the Hub
  model = AutoModelForXxx.from_pretrained("bert-base-uncased")

  # Define the training arguments
- training_args = TrainingArguments(
+ training_args = GaudiTrainingArguments(
      output_dir="path/to/save/folder/",
+     use_habana=True,
+     use_lazy_mode=True,
+     gaudi_config_name="Habana/bert-base-uncased",
      ...
  )

  # Initialize the trainer
- trainer = Trainer(
+ trainer = GaudiTrainer(
      model=model,
      args=training_args,
      train_dataset=train_dataset,
      ...
  )

  # Use Habana Gaudi processor for training!
  trainer.train()

You can find more examples in the documentation and in the examples.

ONNX Runtime

- from transformers import Trainer, TrainingArguments
+ from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments

  # Download a pretrained model from the Hub
  model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

  # Define the training arguments
- training_args = TrainingArguments(
+ training_args = ORTTrainingArguments(
      output_dir="path/to/save/folder/",
      optim="adamw_ort_fused",
      ...
  )

  # Create an ONNX Runtime Trainer
- trainer = Trainer(
+ trainer = ORTTrainer(
      model=model,
      args=training_args,
      train_dataset=train_dataset,
      ...
  )

  # Use ONNX Runtime for training!
  trainer.train()

You can find more examples in the documentation and in the examples.

optimum's People

Contributors

adamlouly, baskrahmer, carzh, changwangss, echarlaix, fxmarty, ierezell, ilyasmoutawwakil, jingyahuang, jingyanwangms, jplu, kunal-vaishnavi, lewtun, madlag, mfuntowicz, mht-sharma, michaelbenayoun, mishig25, penghuicheng, philschmid, prathikr, regisss, rui-ren, ryanrussell, sunmarc, vivekkhandelwal1, vrdn-23, xenova, xin3he, younesbelkada


optimum's Issues

Possibility to load an ORTQuantizer or ORTOptimizer from ONNX

First, thanks a lot for this library, it makes work so much easier.

I was wondering if it's possible to quantize and then optimize a model (or the reverse), but looking at the docs, it seems possible only by passing a vanilla Hugging Face model.

Is it possible to do so with already compiled models?

Like: MyFineTunedModel ---optimize----> MyFineTunedOnnxOptimizedModel -----quantize-----> MyFinalReallyLightModel

# Note that self.model_dir is my local folder with my custom fine-tuned huggingface model
onnx_path = self.model_dir.joinpath("model.onnx")
onnx_quantized_path = self.model_dir.joinpath("quantized_model.onnx")
onnx_chad_path = self.model_dir.joinpath("chad_model.onnx")
onnx_path.unlink(missing_ok=True)
onnx_quantized_path.unlink(missing_ok=True)
onnx_chad_path.unlink(missing_ok=True)

quantizer = ORTQuantizer.from_pretrained(self.model_dir, feature="token-classification")
quantized_path = quantizer.export(
    onnx_model_path=onnx_path, onnx_quantized_model_output_path=onnx_quantized_path,
    quantization_config=AutoQuantizationConfig.arm64(is_static=False, per_channel=False),
)
quantizer.model.save_pretrained(quantized_path.parent) # To have the model config.json
quantized_path.parent.joinpath("pytorch_model.bin").unlink() # To ensure that we're not loading the vanilla pytorch model

# Load an Optimizer from an onnx path... 
# optimizer = ORTOptimizer.from_pretrained(quantized_path.parent, feature="token-classification")  <-- this fails
# optimizer.export(
#     onnx_model_path=onnx_path,
#     onnx_optimized_model_output_path=onnx_chad_path,
#     optimization_config=OptimizationConfig(optimization_level=99),
# )
model = ORTModelForTokenClassification.from_pretrained(quantized_path.parent, file_name="quantized_model.onnx")
# Ideally would load onnx_chad_path (with chad_model.onnx) if the commented section works.

tokenizer: PreTrainedTokenizer = AutoTokenizer.from_pretrained(self.model_dir)
self.pipeline = cast(TokenClassificationPipeline, pipeline(
    model=model, tokenizer=tokenizer,
    task="token-classification", accelerator="ort",
    aggregation_strategy=AggregationStrategy.SIMPLE,
    device=device_number(self.device),
))

Note that optimization alone works perfectly fine, and quantization too, but I was hoping that both would be feasible... unless optimization also does some kind of quantization or produces a lighter model?

Thanks in advance.
Have a great day

Issue when running transformers from GPU / CPU

Following this unit test, I was able to successfully run text-generation with the dummy model; however, when I run it with the actual GPT2 model I get a GPU / CPU resource inconsistency. Namely, I run:

model_id = "gpt2"
onnx_model = ORTModelForCausalLM.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

pp = pipeline("text-generation", model=onnx_model, tokenizer=tokenizer)
text = "My Name is Philipp and i live"
outputs = pp(text)

However I get the following GPU/CPU error:

File ~/miniconda3/envs/optimum-mlserver/lib/python3.8/site-packages/transformers/generation_utils.py:587, in GenerationMixin._expand_inputs_for_generation(input_ids, expand_size, is_encoder_decoder, attention_mask, encoder_outputs, **model_kwargs)
    584     model_kwargs["token_type_ids"] = token_type_ids.index_select(0, expanded_return_idx)
    586 if attention_mask is not None:
--> 587     model_kwargs["attention_mask"] = attention_mask.index_select(0, expanded_return_idx)
    589 if is_encoder_decoder:
    590     if encoder_outputs is None:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)

When I change the pipeline, unfortunately I get an error where the ORT model does not support the to function (and if that is circumvented there are other issues, like the ORT model not supporting the config param, etc.). Namely, if I update the pipeline to pp = pipeline("text-generation", model=onnx_model, tokenizer=tokenizer, device=0), then the error is:

File ~/miniconda3/envs/optimum-mlserver/lib/python3.8/site-packages/transformers/pipelines/base.py:757, in Pipeline.__init__(self, model, tokenizer, feature_extractor, modelcard, framework, task, args_parser, device, binary_output, **kwargs)
    755 # Special handling
    756 if self.framework == "pt" and self.device.type == "cuda":
--> 757     self.model = self.model.to(self.device)
    759 # Update config with task specific parameters
    760 task_specific_params = self.model.config.task_specific_params

AttributeError: 'ORTModelForCausalLM' object has no attribute 'to'

For completeness I am getting the notice The model 'ORTModelForCausalLM' is not supported for text-generation. - is that expected or does that suggest my transformers/hugging-face library is also not compatible with the latest live branch of optimum? I am using the following dependencies:

optimum (from main)
huggingface==0.0.1
huggingface-hub==0.5.1
transformers==4.18.0

Support for computer vision tasks

Thanks for the amazing library! It would be great to add support for computer vision tasks to 🤗 optimum!

In particular, it would be amazing to use the new inference pipelines to take advantage of ONNX for inference when using image classification models.

This is partially selfishly motivated for use in (https://github.com/davanstrien/flyswot/) which is being used to 'deploy' transformer models via command line (running on CPU). I have previously had good success using onnx to speed things up but found the process of exporting onnx models, getting data ready for inference etc. not very fun. The API for pipelines looks like it will make this task much more enjoyable and flexible so it would be amazing to have support for image classification models in optimum.

Tried to train using run_glue.py but it threw an ImportError

I tried to train a text classification model using run_glue.py with:

python run_glue.py \
    --model_name_or_path bert-base-uncased \
    --task_name sst2 \
    --do_train \
    --do_eval \
    --output_dir /tmp/ort-bert-sst2/

I get the below error:

Traceback (most recent call last):
  File "run_glue.py", line 571, in <module>
    main()
  File "run_glue.py", line 490, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/nirmal/miniconda3/envs/optimum/lib/python3.8/site-packages/optimum/onnxruntime/trainer.py", line 295, in train
    model = ORTModule(self.model)
  File "/home/nirmal/miniconda3/envs/optimum/lib/python3.8/site-packages/onnxruntime/training/ortmodule/ortmodule.py", line 90, in __init__
    self._fallback_manager.handle_exception(exception=e,
  File "/home/nirmal/miniconda3/envs/optimum/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_fallback.py", line 151, in handle_exception
    raise exception
  File "/home/nirmal/miniconda3/envs/optimum/lib/python3.8/site-packages/onnxruntime/training/ortmodule/ortmodule.py", line 66, in __init__
    raise ortmodule._FALLBACK_INIT_EXCEPTION
  File "/home/nirmal/miniconda3/envs/optimum/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_fallback_exceptions.py", line 72, in wrap_exception
    raise new_exception(raised_exception) from raised_exception
onnxruntime.training.ortmodule._fallback_exceptions.ORTModuleInitException: ORTModule's extensions were not detected at '/home/nirmal/miniconda3/envs/optimum/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions' folder. Run `python -m torch_ort.configure` before using `ORTModule` frontend.
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/nirmal/miniconda3/envs/optimum/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/__init__.py", line 3, in clear_all_grad_fns
    from onnxruntime.training.ortmodule.torch_cpp_extensions import torch_interop_utils
ImportError: cannot import name 'torch_interop_utils' from 'onnxruntime.training.ortmodule.torch_cpp_extensions' (/home/nirmal/miniconda3/envs/optimum/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/__init__.py)

Support Custom Pipelines

I would like to create a custom pipeline as seen here that can leverage a quantized model. I don't see the capability to extend some sort of OrtPipeline class in the same way using optimum. I have a need for a custom pipeline because of a cross-encoding task which requires some custom code, but I also want to leverage the awesome speedups I get from quantizing my model 😢

Let me know if this is currently possible or if this feature has already been requested

Support for GPT2-Neo

Attempt to provide initial support for GPT2-Neo

  • Include support within transformers for GPT2-Neo
  • Include support within transformers for GPT2-Neo + past
  • Optimizations + Unittests

Static quantization issue

Hello,

thanks a lot for your code and examples! I'm trying to get the static quantization working in the example code, but I always get

NotImplementedError: Could not run 'quantized::linear' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'quantized::linear' is only available for these backends: [QuantizedCPU, BackendSelect, Named, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, UNKNOWN_TENSOR_TYPE_ID, AutogradMLC, Tracer, Autocast, Batched, VmapMode].

Could you please give me a hint on how to get this running? From what I have found out, we need to add Quant() and DeQuant() layers to the beginning and end of the BERT model? If yes, is there already a class that I can use?

`ORTTrainer` doesn't work with distributed training and/or DeepSpeed

ORTTrainer.train fails with distributed training and/or deepspeed.

The line inference_manager = model._torch_module._execution_manager._inference_manager assumes that model is of type ORTModule. However, when deepspeed is enabled, it is of type DeepSpeedEngine. Its type is DistributedDataParallel during distributed training.

This leads to an AttributeError for such cases.

Add Pixelated Butterfly for efficient sparse training

Description


Pixelated butterfly (Pixelfly) is a training technique that uses a simple fixed sparsity pattern, based on flat block butterfly and low-rank matrices, to sparsify most network layers. From the paper:

On the ImageNet classification and WikiText-103 language modeling tasks, our sparse models train up to 2.5× faster than the dense MLP-Mixer, Vision Transformer, and GPT-2 medium with no drop in accuracy.

This seems like an interesting approach to consider for integration with optimum, and builds on prior work by @madlag using pytorch_block_sparse.

Message "ORTModelForX is not supported for Y" when loading in pipeline

As initially reported in #161, when loading an ORT model in a transformers pipeline there is a warning that suggests it's not supported. As proposed, a different message would be more intuitive.

To reproduce:

model_id = "gpt2"
onnx_model = ORTModelForCausalLM.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pp = pipeline("text-generation", model=onnx_model, tokenizer=tokenizer)

output:

The model 'ORTModelForCausalLM' is not supported for text-generation. Supported models are ['XGLMForCausalLM', 'PLBartForCausalLM', 'QDQBertLMHeadModel', 'TrOCRForCausalLM', 'GPTJForCausalLM', 'RemBertForCausalLM', 'RoFormerForCausalLM', 'BigBirdPegasusForCausalLM', 'GPTNeoForCausalLM', 'BigBirdForCausalLM', 'CamembertForCausalLM', 'XLMRobertaXLForCausalLM', 'XLMRobertaForCausalLM', 'RobertaForCausalLM', 'BertLMHeadModel', 'OpenAIGPTLMHeadModel', 'GPT2LMHeadModel', 'TransfoXLLMHeadModel', 'XLNetLMHeadModel', 'XLMWithLMHeadModel', 'ElectraForCausalLM', 'CTRLLMHeadModel', 'ReformerModelWithLMHead', 'BertGenerationDecoder', 'XLMProphetNetForCausalLM', 'ProphetNetForCausalLM', 'BartForCausalLM', 'MBartForCausalLM', 'PegasusForCausalLM', 'MarianForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'MegatronBertForCausalLM', 'Speech2Text2ForCausalLM', 'Data2VecTextForCausalLM'].

Optimum Pruning and Quantization Current Limitation

I just added a topic on the Huggingface forum about limitations that I found while trying out the Huggingface Optimum on text classification and text summarization tasks.

https://discuss.huggingface.co/t/optimum-pruning-and-quantization-current-limitation/13978


The following is a copy of the text I wrote there:

We are checking out the Huggingface Optimum. There are some issues that we would like to clarify:

  • Pruning does not always speed up the model, and it may increase the model's storage size, which is not expected.

  • Dynamic quantization works only on CPU (running it on GPU shows a CPU/GPU conflict error).

Could someone, or a developer working in this area, explain this behavior? We have high hopes for Hugging Face Optimum for model compression.

If some details are necessary, I would be glad to clarify more.

`*with_loss` wrappers failed for transformers 4.19.0

Problem

As the new argument num_choices was added to transformers.onnx.config.generate_dummy_inputs, the wrappers fail to match the correct arguments.

Error Message


Traceback (most recent call last):
  File "test_onnxruntime_train.py", line 116, in test_ort_trainer
    ort_eval_metrics = trainer.evaluate(inference_with_ort=inference_with_ort)
  File "/workspace/optimum/onnxruntime/trainer.py", line 631, in evaluate
    output = eval_loop(
  File "/workspace/optimum/onnxruntime/trainer.py", line 767, in evaluation_loop_ort
    self._export(onnx_model_path, with_loss=with_loss)
  File "/workspace/optimum/onnxruntime/trainer.py", line 1230, in _export
    _ = export(preprocessor=self.tokenizer, model=model, config=onnx_config, opset=opset, output=model_path)
  File "/usr/local/lib/python3.8/dist-packages/transformers/onnx/convert.py", line 313, in export
    return export_pytorch(preprocessor, model, config, opset, output, tokenizer=tokenizer)
  File "/usr/local/lib/python3.8/dist-packages/transformers/onnx/convert.py", line 138, in export_pytorch
    model_inputs = config.generate_dummy_inputs(preprocessor, framework=TensorType.PYTORCH)
  File "/workspace/optimum/onnx/configuration.py", line 170, in generate_dummy_inputs
    dummy_inputs = super().generate_dummy_inputs(
  File "/usr/local/lib/python3.8/dist-packages/transformers/onnx/config.py", line 308, in generate_dummy_inputs
    token_to_add = preprocessor.num_special_tokens_to_add(is_pair)
  File "/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_fast.py", line 289, in num_special_tokens_to_add
    return self._tokenizer.num_special_tokens_to_add(pair)
TypeError: Can't convert <TensorType.PYTORCH: 'pt'> to PyBool

Solution

Pass the arguments to generate_dummy_inputs by keyword, so that the newly added num_choices parameter does not shift the remaining positional arguments (in the traceback above, the framework value ends up where is_pair is expected).

ONNX Runtime Training

Add ONNX Runtime Training capability in the library.

It would be interesting to look at a potential Trainer wrapper which would provide an easy interface to all our examples.

Quantization support

Initial support for quantization through ONNX Runtime for:

  • BERT

  • DistilBERT

  • Unittests

Not possible to configure GPU in pipelines nor leveraging batch_size parallelisation

When setting the device variable in the pipeline function/class to >= 0, an error appears AttributeError: 'ORTModelForCausalLM' object has no attribute 'to' - when running in GPU. This was initially reported in #161 so opening this issue to encompass supporting the device parameter in the ORT classes. This is important as otherwise it won't be possible to allow configuration of CPU/GPU similar to normal transformer libraries.

Is there currently a workaround to ensure that the class is run on GPU? By default it seems this would be set to CPU even when a GPU is available:

>>> m = ORTModelForCausalLM.from_pretrained("gpt2", from_transformers=True)
>>> t = AutoTokenizer.from_pretrained("gpt2")
>>> pp = pipeline("text-generation", model=m, tokenizer=t)
>>> pp.device

device(type='cpu')

This is still the case even with the optimum[onnxruntime-gpu] package. I have validated this by testing against a normal transformer with batch_size=X (i.e. pp = pipeline("text-generation", model=m, tokenizer=t, batch_size=128)), and it seems there is no parallel-processing optimization with optimum, whereas the normal transformer is orders of magnitude faster (most likely because optimum is not utilizing the parallelism).

I can confirm that the model is loaded with GPU correctly:

>>> m.device

device(type='cuda', index=0)

And GPU is configured correctly:

>>> from optimum.onnxruntime.utils import _is_gpu_available
>>> _is_gpu_available()

True

Is there a way to enable GPU for processing with batching in optimum?

optimum inference for summarization

With reference to the blog: https://huggingface.co/blog/optimum-inference, I am able to do this:

from transformers import AutoTokenizer, pipeline
-from transformers import AutoModelForQuestionAnswering
+from optimum.onnxruntime import ORTModelForQuestionAnswering

-model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2") # pytorch checkpoint
+model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2") # onnx checkpoint
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

optimum_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)

question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."
pred = optimum_qa(question, context)

I need to do similar inference for summarization, for the following code:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6")
model = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-cnn-12-6")

I get the following error:

>>> from optimum.onnxruntime import ORTModelForSeq2SeqLM
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'ORTModelForSeq2SeqLM' from 'optimum.onnxruntime' (/datadrive/shilpa/work/virtual_environments/venv_hf_optimum/lib/python3.9/site-packages/optimum/onnxruntime/__init__.py)

https://huggingface.co/docs/optimum/main/en/pipelines mentions "text-generation" as one of the supported tasks; I assumed "summarization" comes under this category. Am I right?

Create benchmarking suite for optimised models

Now that we have tight Hub integration coming via #113, it could be useful to implement a simple benchmarking suite that allows users to:

  • Select a dataset on the Hub
  • Select a metric on the Hub
  • Select N models (could already be optimised models)
  • Optimise the models (if needed)
  • Report a table of results comparing the gains in latency and impact on the model metric

In a first step, we might simply want to benchmark latency with some dummy input at various sequence lengths etc.

Support initial set of models optimizations

Let's focus on BERT like models at first:

  • BERT
  • DistilBERT

When adding optimizations, it would be great to have an identified onnxruntime test suite in the CI that ensures the model outputs are close to the reference model.

Exporting to onnx fails with unexpected keyword argument 'preprocessor'

I am in a Linux environment, running optimum 1.2.2, transformers 4.16.2, onnx 1.11.0 and onnxruntime-gpu 1.11.1, and am trying to run the text-classification sample published here: https://huggingface.co/docs/optimum/main/en/pipelines

When it goes to export the model to onnx it crashes with an error regarding preprocessor. Can anyone help me resolve this - is it a library issue or a bug?

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [7], in <cell line: 1>()
----> 1 classifier = pipeline(task="text-classification", accelerator="ort")

File ~/anaconda3/envs/deep-learning/lib/python3.8/site-packages/optimum/pipelines.py:85, in pipeline(task, model, tokenizer, feature_extractor, use_fast, use_auth_token, accelerator, **kwargs)
     83 if model is None:
     84     model_id = SUPPORTED_TASKS[task]["default"]
---> 85     model = SUPPORTED_TASKS[task]["class"][0].from_pretrained(model_id, from_transformers=True)
     86 elif isinstance(model, str):
     87     model_id = model

File ~/anaconda3/envs/deep-learning/lib/python3.8/site-packages/optimum/modeling_base.py:201, in OptimizedModel.from_pretrained(cls, model_id, from_transformers, force_download, use_auth_token, cache_dir, **model_kwargs)
    198     model_kwargs.update({"config": config})
    200 if from_transformers:
--> 201     return cls._from_transformers(
    202         model_id=model_id,
    203         revision=revision,
    204         cache_dir=cache_dir,
    205         force_download=force_download,
    206         use_auth_token=use_auth_token,
    207         **model_kwargs,
    208     )
    209 else:
    210     return cls._from_pretrained(
    211         model_id=model_id,
    212         revision=revision,
   (...)
    216         **model_kwargs,
    217     )

File ~/anaconda3/envs/deep-learning/lib/python3.8/site-packages/optimum/onnxruntime/modeling_ort.py:242, in ORTModel._from_transformers(cls, model_id, save_dir, use_auth_token, revision, force_download, cache_dir, **kwargs)
    239 onnx_config = model_onnx_config(model.config)
    241 # export model
--> 242 export(
    243     preprocessor=tokenizer,
    244     model=model,
    245     config=onnx_config,
    246     opset=onnx_config.default_onnx_opset,
    247     output=save_dir.joinpath(ONNX_WEIGHTS_NAME),
    248 )
    249 kwargs["config"] = model.config.__dict__
    250 # 3. load normal model

TypeError: export() got an unexpected keyword argument 'preprocessor'

How to optimize bart models?

I tried running the same code as in the readme using facebook/bart-large-mnli. I am getting KeyError: 'decoder_input_ids'.

from optimum.onnxruntime.configuration import OptimizationConfig
optimization_config = OptimizationConfig(optimization_level=99, optimize_for_gpu=True)

from optimum.onnxruntime import ORTOptimizer
model_name = "facebook/bart-large-mnli"
optimizer = ORTOptimizer.from_pretrained(
    model_name,
    feature="sequence-classification",
)

optimizer.export(
    onnx_model_path="op_fbbart.onnx",
    onnx_optimized_model_output_path="op_fbbart-optimized.onnx",
    optimization_config=optimization_config,
)

from optimum.onnxruntime import ORTModel
from functools import partial
from datasets import Dataset

ort_model = ORTModel("op_fbbart-optimized.onnx", optimizer._onnx_config)
ds = Dataset.from_dict({"sentence": ["I love burritos!"]})
def preprocess_fn(ex, tokenizer):
    return tokenizer(ex["sentence"])

tokenized_ds = ds.map(partial(preprocess_fn, tokenizer=optimizer.tokenizer))
ort_outputs = ort_model.evaluation_loop(tokenized_ds)
ort_outputs.predictions  #Key Error

Traceback:

KeyError                                  Traceback (most recent call last)
Input In [13], in <cell line: 2>()
      1 tokenized_ds = ds.map(partial(preprocess_fn, tokenizer=optimizer.tokenizer))
----> 2 ort_outputs = ort_model.evaluation_loop(tokenized_ds)
      3 # Extract logits!
      4 ort_outputs.predictions

File ~/miniconda3/envs/optimum/lib/python3.9/site-packages/optimum/onnxruntime/model.py:93, in ORTModel.evaluation_loop(self, dataset)
     91 else:
     92     labels = None
---> 93 onnx_inputs = {key: np.array([inputs[key]]) for key in self.onnx_config.inputs}
     94 preds = session.run(self.onnx_named_outputs, onnx_inputs)
     95 if len(preds) == 1:

File ~/miniconda3/envs/optimum/lib/python3.9/site-packages/optimum/onnxruntime/model.py:93, in <dictcomp>(.0)
     91 else:
     92     labels = None
---> 93 onnx_inputs = {key: np.array([inputs[key]]) for key in self.onnx_config.inputs}
     94 preds = session.run(self.onnx_named_outputs, onnx_inputs)
     95 if len(preds) == 1:

KeyError: 'decoder_input_ids'

🚀 Add built-in support for autoregressive text generation with ONNX models.

After converting an autoregressive model to ONNX, it would be nice to be able to generate text with it via something like:

from transformers import OnnxTextGenerationModel, AutoTokenizer

model_path = "gpt-something.onnx"
tokenizer_name = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
model = OnnxTextGenerationModel(model_path)

# and then

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model.generate(encoded_input)

With support to using past_key_values internally in the most efficient way.

Motivation

When trying to accelerate inference with transformers, being unable to load our ONNX model with the lib and run a model.generate method to seamlessly generate sequences and perform Beam Search is somehow frustrating. That leads us to rely on custom implementations, which take time and are a lot more prone to bugs.

Support for electra model

I came across this tool and it looks very interesting, but I am trying to use an electra model and I can see this is not supported, given this message:
"electra is not supported yet. Only ['albert', 'bart', 'mbart', 'bert', 'ibert', 'camembert', 'distilbert', 'longformer', 'marian', 'roberta', 't5', 'xlm-roberta', 'gpt2', 'gpt-neo', 'layoutlm'] are supported. If you want to support electra please propose a PR or open up an issue."
Are there any plans to support electra models in the future?
Example of models: https://huggingface.co/german-nlp-group/electra-base-german-uncased

How to use the quantised model for inference

I was able to quantise and save a Hugging Face model with my custom data; how do I use the model for inference? Could anybody help with how to load and use the quantised model with something like the Trainer / pipeline APIs?

I have 2 files in the directory to which I saved my quantised model, best_configure.yaml and best_model_weights.pt. How do I use these files for inference?

Quantizing ONNX text classification model based on "setu4993/smaller-LaBSE" causes much lower precision scores

System Info

- `transformers` version: 4.18.0
- Platform: Linux-5.13.0-39-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Huggingface_hub version: 0.5.1
- PyTorch version (GPU?): 1.11.0+cu102 (True)
- Tensorflow version (GPU?): 2.3.0 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
 
optimum==1.1.0

Who can help?

@LysandreJik since LaBSE is based on BERT

Not sure who to ping for optimum.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The following code trains a simple binary text classification model using pytorch, then converts it to ONNX and then quantizes the ONNX model, printing evaluation results for each. It first fine-tunes on "distilbert-base-uncased" and then "setu4993/smaller-LaBSE". Scores don't change much between the three model versions on distilbert, but quantization lowers scores on LaBSE.

import os
import shutil

import numpy as np
import torch
from datasets import load_dataset, Dataset, load_metric
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from optimum.onnxruntime import ORTQuantizer, ORTModel
from sklearn.metrics import precision_recall_fscore_support
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding, TrainingArguments, Trainer


os.environ["WANDB_DISABLED"] = "true"

imdb = load_dataset("imdb").shuffle()

# Taking a subset to speed up training/testing times, same effect occurs on full dataset
imdb['train'] = Dataset.from_dict(imdb['train'][:1000])
imdb['test'] = Dataset.from_dict(imdb['test'][:1000])

metric = load_metric('f1')


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    metrics = metric.compute(predictions=predictions, references=labels)
    prec_rec = precision_recall_fscore_support(labels, predictions, average='binary')
    metrics['precision'] = prec_rec[0]
    metrics['recall'] = prec_rec[1]
    return metrics


def train_eval_demo(model_name):
    """
    Train a simple binary text classification model using the IMDB dataset
    Convert to ONNX and then quantize the ONNX model and return evaluation results
    """

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def preprocess_function(examples):
        return tokenizer(examples["text"], truncation=True, max_length=128)

    tokenized_imdb = imdb.map(preprocess_function, batched=True)

    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    output_dir = "model_debug"
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)

    training_args = TrainingArguments(
        output_dir=f"./{output_dir}",
        learning_rate=2e-5,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=5,
        weight_decay=0.01,
        overwrite_output_dir=True,
        no_cuda=not torch.cuda.is_available(),
        save_steps=1000,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_imdb["train"],
        eval_dataset=tokenized_imdb["test"],
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )

    trainer.train(resume_from_checkpoint=False)
    trainer.save_model()

    metrics = trainer.evaluate(metric_key_prefix="eval")

    results = {}
    results["1. PyTorch"] = metrics

    # The model we wish to quantize
    # The type of quantization to apply
    qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
    quantizer = ORTQuantizer.from_pretrained(output_dir, feature="sequence-classification")

    # Quantize the model!
    quantizer.export(
        onnx_model_path=f"{output_dir}/onnx_model.onnx",
        onnx_quantized_model_output_path=f"{output_dir}/onnx_model-quantized.onnx",
        quantization_config=qconfig,
    )

    ort_model = ORTModel(f"{output_dir}/onnx_model.onnx", quantizer._onnx_config)
    ort_outputs = ort_model.evaluation_loop(tokenized_imdb["test"])
    onnx_metrics = compute_metrics((ort_outputs.predictions, tokenized_imdb["test"]["label"]))
    results["2. ONNX"] = onnx_metrics

    ort_model = ORTModel(f"{output_dir}/onnx_model-quantized.onnx", quantizer._onnx_config)
    ort_outputs = ort_model.evaluation_loop(tokenized_imdb["test"])
    onnx_quant_metrics = compute_metrics((ort_outputs.predictions, tokenized_imdb["test"]["label"]))
    results["3. ONNX Quant"] = onnx_quant_metrics

    return results


# Models to compare
model_names = ["distilbert-base-uncased", "setu4993/smaller-LaBSE"]  #
model_scores = {}

for model_name in model_names:
    results = train_eval_demo(model_name)
    model_scores[model_name] = results

for model_name in model_names:
    print("* " + model_name)
    for iteration in sorted(model_scores[model_name]):
        print(iteration)
        print(f"\t{model_scores[model_name][iteration]}")
    print()

Expected behavior

I'm building text classification models based on LaBSE (specifically the smaller version "setu4993/smaller-LaBSE").  After conversion to ONNX, the f1, precision and recall values are the same.  However, after quantization scores drop a lot (precision 83.2 to 62.7):

* setu4993/smaller-LaBSE
1. PyTorch
	{'eval_f1': 0.8576998050682261, 'eval_precision': 0.831758034026465, 'eval_recall': 0.8853118712273642}
2. ONNX
	{'f1': 0.8576998050682261, 'precision': 0.831758034026465, 'recall': 0.8853118712273642}
3. ONNX Quant
	{'f1': 0.7473598700243704, 'precision': 0.6267029972752044, 'recall': 0.9255533199195171}

What I would expect to happen is the scores only change by at most 1 or so points after quantization.  The above code will reproduce these scores.  It also occurs on larger LaBSE models, such as 'pvl/labse_bert' and on token classification tasks.

For comparison, here is the output using distilbert-base-uncased.  No dramatic score changes.

* distilbert-base-uncased
1. PyTorch
	{'eval_f1': 0.8359683794466404, 'eval_precision': 0.8213592233009709, 'eval_recall': 0.8511066398390342}
2. ONNX
	{'f1': 0.8359683794466404, 'precision': 0.8213592233009709, 'recall': 0.8511066398390342}
3. ONNX Quant
	{'f1': 0.8288822947576657, 'precision': 0.8151750972762646, 'recall': 0.8430583501006036}

Expose a SUPPORTED_OPTIMUM_TASKS analogous to SUPPORTED_TASKS

Currently having access to the transformers.pipelines.SUPPORTED_TASKS is quite useful for application development to ensure access to the respective classes based on relevant tasks is done correctly.

For explicitness the structure of the SUPPORTED_TASKS object is:

SUPPORTED_TASKS = {
    "audio-classification": {
        "impl": AudioClassificationPipeline,
        "tf": (),
        "pt": (AutoModelForAudioClassification,) if is_torch_available() else (),
        "default": {"model": {"pt": "superb/wav2vec2-base-superb-ks"}},
        "type": "audio",
    },
    "automatic-speech-recognition": {
        "impl": AutomaticSpeechRecognitionPipeline,
        "tf": (),
   # ...etc

At the moment we have had to create our own SUPPORTED_OPTIMUM_TASKS to map each task to its relevant class, as below. However, it would be more robust if we had access to an Optimum object that we can rely on to provide access to the relevant classes.

SUPPORTED_OPTIMUM_TASKS = {
    "feature-extraction": ORTModelForFeatureExtraction,
    "sentiment-analysis": ORTModelForSequenceClassification,
    "ner": ORTModelForTokenClassification,
    "question-answering": ORTModelForQuestionAnswering,
    "text-generation": ORTModelForCausalLM,
}
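For illustration, such a mapping makes task-based dispatch straightforward (a sketch reusing the dictionary above; the checkpoint is the ONNX one mentioned earlier on this page):

# Resolve the ORT class for a task and load an ONNX checkpoint with it
model_cls = SUPPORTED_OPTIMUM_TASKS["question-answering"]
model = model_cls.from_pretrained("optimum/roberta-base-squad2")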

support for Tapas model

Hi,
When I try to quantize the tapas model, it gives me an error. Is it supported?
When I try the below:
tapas_model = "google/tapas-base-finetuned-wtq"
qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
quantizer = ORTQuantizer.from_pretrained(tapas_model, feature="question-answering")

quantizer.export(
    onnx_model_path="model.onnx",
    onnx_quantized_model_output_path="tapas_model-quantized.onnx",
    quantization_config=qconfig,
)
I get,
ValueError: Unrecognized configuration class <class 'transformers.models.tapas.configuration_tapas.TapasConfig'> for this kind of AutoModel: AutoModelForQuestionAnswering.
Model type should be one of YosoConfig, NystromformerConfig, QDQBertConfig, FNetConfig, GPTJConfig, LayoutLMv2Config, RemBertConfig, CanineConfig, RoFormerConfig, BigBirdPegasusConfig, BigBirdConfig, ConvBertConfig, LEDConfig, IBertConfig, MobileBertConfig, DistilBertConfig, AlbertConfig, CamembertConfig, XLMRobertaXLConfig, XLMRobertaConfig, MBartConfig, MegatronBertConfig, MPNetConfig, BartConfig, ReformerConfig, LongformerConfig, RobertaConfig, DebertaV2Config, DebertaConfig, FlaubertConfig, SqueezeBertConfig, BertConfig, XLNetConfig, XLMConfig, ElectraConfig, FunnelConfig, LxmertConfig, SplinterConfig, Data2VecTextConfig.

When I use "table-question-answering" as the feature, then there is a key error.

Please let me know if it can be supported.

Installation issues

I have Python 3.6.9 and I get the following installation issues.

(venv_hf_optimum)$ pip install "optimum[onnxruntime]==1.2.0"
Collecting optimum[onnxruntime]==1.2.0
  Cache entry deserialization failed, entry ignored
  Could not find a version that satisfies the requirement optimum[onnxruntime]==1.2.0 (from versions: 0.0.1, 0.1.0, 0.1.1, 0.1.2a0, 0.1.2, 0.1.3, 1.0.0, 1.1.0, 1.1.1)
No matching distribution found for optimum[onnxruntime]==1.2.0
(venv_hf_optimum)$ python -m pip install optimum
Collecting optimum
  Cache entry deserialization failed, entry ignored
  Downloading https://files.pythonhosted.org/packages/a5/05/4f31c8ff3b01f8d99a6352528d221210341bf4b38859e8747cfc19c5cd9d/optimum-1.1.1.tar.gz (66kB)
    100% |████████████████████████████████| 71kB 755kB/s
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-vwtwg4zk/optimum/setup.py", line 3, in <module>
        from setuptools import find_namespace_packages, setup
    ImportError: cannot import name 'find_namespace_packages'
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-vwtwg4zk/optimum/

run_glue script ValueError: cannot convert float NaN to integer

run_glue script not working when applying quantization with the following parameters:

python run_glue.py --model_name_or_path valhalla/distilbart-mnli-12-6 --task_name mnli --quantize --quantization_approach static --do_eval --output_dir run_glue_model
>>> File "/home/jspablo/Documents/ml-experiments/Categorizer/optimum_quantization/env/lib/python3.8/site-packages/onnxruntime/quantization/quant_utils.py", line 144, in compute_scale_zp
    zero_point = round(qmin - rmin/scale)
ValueError: cannot convert float NaN to integer

Are all the tasks supported?

Error: Expected shape from model of {} does not match actual shape of {1,1,1} for output

Problem

I'm getting the following error when I'm trying to apply static quantization (ONNX) with the ORTQuantizer.


Tests

This error occurs for:

More

  • The resulting model-quantized.onnx can be loaded but produces very bad results.
  • dynamic quantization works seamlessly
  • using:
    • Python 3.9
    • tested on two different devices with different operating systems:
      • MacOS Monterey (with Intel)
      • WSL for Windows 11 (Ubuntu)

disable or remove non-implemented `entry_points`

The code for the following three entry_points does not currently exist. Please either disable them by commenting them out until implemented, or remove them entirely.

  • optimum.onnxruntime.convert:main
  • optimum.onnxruntime.optimize_model:main
  • optimum.onnxruntime.convert_and_optimize:main

optimum/setup.py

Lines 68 to 73 in 1ac1f76

entry_points={
"console_scripts": [
"optimum_export=optimum.onnxruntime.convert:main",
"optimum_optimize=optimum.onnxruntime.optimize_model:main",
"optimum_export_optimize=optimum.onnxruntime.convert_and_optimize:main",
],

Current repo structure: [screenshot of the repository tree]

Pruning feature

Hello,

In the library page, pruning is mentioned as a capability of the optimum library. Since it doesn't seem to be the case as of yet, I was wondering if this is a feature that should come out soon? I am guessing it will be linked to the work done in https://github.com/huggingface/nn_pruning?

From your tests, do you have any figures for what the expected speed gains could look like by running optimum with both quantization and pruning (with limited accuracy drops, of course)? More generally, would it be possible to have a bit more information in the Readme about the performance of the different tools in the optimum toolbox on standard benchmarks (training_aware_quant, dynamic_quant, etc.)?

I know it's a lot of open-ended questions but thank you in advance for the great work on the library, and I am excited for what is to come,

Cheers,
Manuel

Consistent use of `"sequence-classification"` vs. `"text-classification", "audio-classification"`

Currently, transformers' FeaturesManager._TASKS_TO_AUTOMODELS is used to handle the strings passed when loading models. Notably, this is used in the ORTQuantizer.from_pretrained() method (where here, for example, feature="sequence-classification"):

model_class = FeaturesManager.get_model_class_for_feature(feature)

In the meanwhile, the pipeline abstraction for text classification expects pipeline(..., task="text-classification"). Hence it could be troublesome for users to have to pass both "text-classification" and "sequence-classification".

A handy workflow could be the following:

from onnxruntime.quantization import QuantFormat, QuantizationMode, QuantType
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import QuantizationConfig
from optimum.onnxruntime.modeling_ort import ORTModel

from optimum.pipelines import pipeline as _optimum_pipeline
from transformers import pipeline as _transformers_pipeline

from optimum.onnxruntime.modeling_ort import ORTModelForSequenceClassification

static_quantization = False
task = "text-classification"

# Create the quantization configuration containing all the quantization parameters
qconfig = QuantizationConfig(
    is_static=static_quantization,
    format=QuantFormat.QDQ if static_quantization else QuantFormat.QOperator,
    mode=QuantizationMode.QLinearOps if static_quantization else QuantizationMode.IntegerOps,
    activations_dtype=QuantType.QInt8 if static_quantization else QuantType.QUInt8,
    weights_dtype=QuantType.QInt8,
    per_channel=False,
    reduce_range=False,
    operators_to_quantize=["Add"],
)

quantizer = ORTQuantizer.from_pretrained(
    "Bhumika/roberta-base-finetuned-sst2",
    feature=task,
    opset=15,
)

tokenizer = quantizer.tokenizer

model_path = "model.onnx"
quantized_model_path = "quantized_model.onnx"

quantization_preprocessor = None
ranges = None

# Export the quantized model
quantizer.export(
    onnx_model_path=model_path,
    onnx_quantized_model_output_path=quantized_model_path,
    calibration_tensors_range=ranges,
    quantization_config=qconfig,
    preprocessor=quantization_preprocessor,
)

ort_session = ORTModel.load_model(quantized_model_path)
ort_model = ORTModelForSequenceClassification(ort_session, config=quantizer.model.config)

task_alias = "text-classification"
ort_pipeline = _optimum_pipeline(
    task=task,
    model=ort_model,
    tokenizer=tokenizer,
    feature_extractor=None,
    accelerator="ort"
)

which currently raises KeyError: "Unknown task: text-classification" for ORTQuantizer.from_pretrained().

Right now we need to pass something like

task = "text-classification"
feature = "sequence-classification"

and provide the feature to ORTQuantizer, which is troublesome.

Possible solutions are:

  • Have an auto-mapping from "tasks" (as in https://huggingface.co/models ) to "features" ("text-classification" --> "sequence-classification", "audio-classification" --> "sequence-classification"); see the sketch after this list
  • Modify transformers.onnx.FeaturesManager to use real tasks and not "sequence-classification"
  • Add abstraction classes like ForTextClassification, ForAudioClassification classes just inheriting from ForSequenceClassification and modify transformers.onnx.FeaturesManager accordingly
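A minimal sketch of the first option above (the mapping name is illustrative, not an existing Optimum API):

# Hypothetical task -> feature auto-mapping
TASK_TO_FEATURE = {
    "text-classification": "sequence-classification",
    "audio-classification": "sequence-classification",
}

feature = TASK_TO_FEATURE.get(task, task)
quantizer = ORTQuantizer.from_pretrained(
    "Bhumika/roberta-base-finetuned-sst2",
    feature=feature,
    opset=15,
)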

@lewtun

Structure the repository src/ tests/

  • src/ should be the home for all the library code
  • tests/ should be the home for all the library unittests

Top-level package should be clearly identified as optimum

add `optimum` to conda-forge channel on `conda`

๐Ÿ“ It will be nice to have optimum on conda-forge for installation and for downstream applications that use optimum and want to make themselves available on conda-forge channel.

๐Ÿ‘‰ I have already started the work on this. The PR is very close to getting merged. ๐Ÿ”ฅ

โ„น๏ธ I will share updates on the availability of optimum on conda-forge, once it gets merged.

โšก With this addition, users will be able to install optimum with:

conda install -c conda-forge optimum

Optimum `from_pretrained` does not seem to honour TRANSFORMERS_CACHE env var

We currently use the TRANSFORMERS_CACHE environment variable to ensure that downloads go to a folder with the relevant write permissions, as we are deploying inside a container in Kubernetes. It seems that the only way of setting this would be through an explicit parameter to the model (along the lines of the sketch below). This results in quite a fiddly implementation to cover cases for both the optimum pipeline and the non-optimum pipeline. This issue would encompass supporting the TRANSFORMERS_CACHE env var that is used by the non-optimum transformer classes.
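A hedged sketch of the kind of explicit parameter this refers to (the model name and fallback path are illustrative):

import os

from optimum.onnxruntime import ORTModelForSequenceClassification

# Workaround: pass the cache location explicitly instead of relying on TRANSFORMERS_CACHE
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    from_transformers=True,
    cache_dir=os.environ.get("TRANSFORMERS_CACHE", "/tmp/hf_cache"),  # illustrative fallback path
)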

Using `onnxruntime-gpu`, specifying the execution provider is necessary

This issue should be fixed with #137.

Reproduce

One may want to run inference on a GPU.

  1. pip uninstall onnx onnxruntime optimum
  2. pip install onnx onnxruntime-gpu
  3. pip install optimum[dev]
  4. run e.g. https://github.com/huggingface/optimum/blob/main/examples/onnxruntime/quantization/text-classification/run_glue.py with python run_glue.py --model_name_or_path howey/bert-base-uncased-sst2 --task_name sst2 --quantization_approach static --do_eval --output_dir /tmp/quantized_distilbert_sst2 --max_eval_samples 50 --num_calibration_samples 50

Error raised

Traceback (most recent call last):
  File "/home/fxmarty/hf_internship/optimum/examples/onnxruntime/quantization/text-classification/run_glue.py", line 519, in <module>
    main()
  File "/home/fxmarty/hf_internship/optimum/examples/onnxruntime/quantization/text-classification/run_glue.py", line 481, in main
    outputs = ort_model.evaluation_loop(eval_dataset)
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/site-packages/optimum/onnxruntime/model.py", line 84, in evaluation_loop
    session = InferenceSession(self.model_path.as_posix(), options)
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 335, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/home/fxmarty/anaconda3/envs/hf-inf/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 363, in _create_inference_session
    raise ValueError("This ORT build has {} enabled. ".format(available_providers) +
ValueError: This ORT build has ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider'] enabled. Since ORT 1.9, you are required to explicitly set the providers parameter when instantiating InferenceSession. For example, onnxruntime.InferenceSession(..., providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider'], ...)

i.e. https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/onnxruntime_inference_collection.py#L361

Why this issue

We install onnxruntime-gpu and not onnxruntime, hence we have access to additional providers ('TensorrtExecutionProvider', 'CUDAExecutionProvider'), which triggers the error. Installing onnxruntime on top of onnxruntime-gpu makes the additional providers invisible, and only 'CPUExecutionProvider' remains available.
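For reference, the explicit form suggested by the error message looks like this (a sketch; the model path is illustrative):

from onnxruntime import InferenceSession

# With onnxruntime-gpu installed (ORT >= 1.9), the providers must be listed explicitly
session = InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)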

Unable to load huggingface models with intel neural compressor's IncQuantizer.

from optimum.intel.neural_compressor.quantization import IncQuantizerForSequenceClassification
quantizer = IncQuantizerForSequenceClassification.from_config('typeform/distilbert-base-uncased-mnli')

The above code gives
HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/typeform/distilbert-base-uncased-mnli/resolve/main/best_configure.yaml

Add a `optimum-cli`

Transformers has a transformers-cli that can perform various tasks (see transformers-cli --help). It could be useful to have a similar command for optimum; it would, for example, be good for bug reports to collect info on the user's packages.

Add Support for DeBERTaV2

I would like to use DeBERTaV2 for sequence classification as a quantized model. Please let me know what needs to be done to open a PR to add this support!
