Describe the issue: I directly export Whisper models to ONNX.


[Performance] Whisper model inference results incorrect after Transformer Optimizer

XciciciX commented on July 17, 2024

[Performance] Whisper model inference results incorrect after Transformer Optimizer


Comments (2)

tianleiwu commented on July 17, 2024

@XciciciX, could you share the detailed steps to reproduce the issue?

For example, the command lines used to export and optimize the ONNX model, plus the test script. Alternatively, share the optimized ONNX model. You can also check the operator spec if you suspect an attention node was not fused correctly: https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md
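One way to check whether attention was fused, as a minimal sketch: count the operator types in the original and the optimized model and compare the attention-related entries. The helper names below are hypothetical, the sketch assumes the onnx Python package is installed, and it only looks at top-level graph nodes (not subgraphs).

```python
from collections import Counter


def count_op_types(model):
    """Return a Counter of operator types over the top-level nodes of model.graph."""
    return Counter(node.op_type for node in model.graph.node)


def attention_ops(counts):
    """Keep only operators whose type name contains 'Attention'."""
    return {op: n for op, n in counts.items() if "Attention" in op}


def report_attention_fusion(path):
    """Load an ONNX file and return its attention-related operator counts."""
    import onnx  # assumption: the onnx package is installed; imported lazily

    return attention_ops(count_op_types(onnx.load(path)))


# Example usage (paths are placeholders for the original and optimized models):
#   print(report_attention_fusion("decoder.onnx"))
#   print(report_attention_fusion("decoder__mha.onnx"))
```

If the optimized model reports no attention operators, or fewer than expected for the number of decoder layers, the fusion probably did not match the exported graph pattern.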


XciciciX commented on July 17, 2024

Thank you for your response. @tianleiwu

Here is part of the code related to model export.

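import torch
import whisper

# compute_features and x_tokens are defined elsewhere (not shown in this excerpt).
model = whisper.load_model("medium")
x_mel = compute_features("./data/test.mp3")

# Encoder output, used as the example "audio" input when exporting the decoder.
x_audio = model.encoder(x_mel)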
torch.onnx.export(
    model.encoder,
    (x_mel,),  # one-element tuple of positional inputs
    "./models/encoder.onnx",
    input_names=["x"],
    output_names=["out"],
    dynamic_axes={
        "x": {0: "batch"},
        "out": {0: "batch"},
    },
)
torch.onnx.export(
    model.decoder,
    (x_tokens, x_audio),
    "./models/decoder.onnx",
    input_names=["tokens", "audio"],
    output_names=["out"],
    dynamic_axes={
        "tokens": {0: "batch", 1: "seq"},
        "audio": {0: "batch"},
        "out": {0: "batch", 1: "seq"},
    },
)

Then, they are optimized with:

python -m onnxruntime.transformers.optimizer --input ./whisper-medium-onnx/decoder.onnx --output ./whisper-medium-onnx-test/decoder__mha.onnx --float16 --model_type bart --num_heads 16 --hidden_size 1024 --use_multi_head_attention

Here are the exported models
https://drive.google.com/drive/folders/16tbQ46OB91hQtIC4XJJvwNVnl5YaVU60?usp=drive_link

encoder.onnx and decoder.onnx are not optimized. The ones with _mha are optimized.

Here is the test script. The original models run correctly. The optimized models also run, but their results are wrong.

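import time

import numpy as np
import onnxruntime
import torch

# compute_features, tokenizer, and max_tokens are defined elsewhere (not shown in this excerpt).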

sess_encoder = onnxruntime.InferenceSession("./models/encoder.onnx", providers=["CUDAExecutionProvider"])
sess_decoder = onnxruntime.InferenceSession("./models/decoder.onnx",  providers=["CUDAExecutionProvider"])

start = time.time()

x_mel_fp32 = compute_features("./data/test.mp3")
x_mel_fp16 = x_mel_fp32.to(dtype=torch.float16)


out_encoder, = sess_encoder.run(["out"], {"x": x_mel_fp32.numpy()})


tokens = list(tokenizer.sot_sequence_including_notimestamps)
next_token = tokenizer.sot

while len(tokens) <= max_tokens and next_token != tokenizer.eot:
    out_decoder, = sess_decoder.run(
        ["out"],
        {
            "tokens": np.asarray([tokens], dtype="int64"),
            "audio": out_encoder,
        },
    )

    # Greedy decoding: take the most likely token at the last position.
    next_token = int(out_decoder[0, -1].argmax())
    tokens.append(next_token)

print("took", time.time() - start, "seconds")

print(tokenizer.decode(tokens))
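To narrow down where the results go wrong, it may help to compare the decoder logits from the original and optimized models on identical inputs, rather than only the decoded text. The following is a rough sketch; compare_logits and the atol value are illustrative assumptions, not part of onnxruntime, and some drift is expected after the --float16 conversion.

```python
import numpy as np


def compare_logits(reference, candidate, atol=1e-2):
    """Summarize how far candidate logits drift from reference logits.

    Both arrays are expected to have shape (batch, seq, vocab).
    """
    ref = np.asarray(reference, dtype=np.float32)
    cand = np.asarray(candidate, dtype=np.float32)
    if ref.shape != cand.shape:
        raise ValueError(f"shape mismatch: {ref.shape} vs {cand.shape}")
    abs_diff = np.abs(ref - cand)
    return {
        "max_abs_diff": float(abs_diff.max()),
        "within_atol": bool((abs_diff <= atol).all()),
        # Greedy decoding only depends on the argmax at the last position.
        "same_next_token": bool(ref[0, -1].argmax() == cand[0, -1].argmax()),
    }


# Example usage, assuming sess_decoder_opt is a second InferenceSession
# created from the optimized decoder (hypothetical name):
#   feeds = {"tokens": np.asarray([tokens], dtype="int64"), "audio": out_encoder}
#   out_ref, = sess_decoder.run(["out"], feeds)
#   out_opt, = sess_decoder_opt.run(["out"], feeds)
#   print(compare_logits(out_ref, out_opt))
```

A large max_abs_diff on the very first decoding step would point at the fused graph itself, while small differences that only flip the argmax later would be more consistent with float16 rounding.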

