
Comments (6)

kha-white commented on May 20, 2024

Ok, so I don't know exactly how to do the inference in ONNX (I played around with it a bit, but it seemed rather tricky, so I abandoned/postponed it). I'll just tell you what I know.

This model has an encoder-decoder architecture. The encoder takes an image as input and outputs a feature vector (this is encoder_last_hidden_state). This feature vector is then passed to the decoder, which is run iteratively, outputting one token at a time until it reaches a special token indicating the end of the sequence. At each step, the decoder is fed the sequence of all the tokens it has output so far (this is decoder_input_ids). The decoder outputs the tokens as logits - each token is represented by a vector of 6144 values corresponding to the vocab, as you correctly noticed. You could take the argmax of the logits vector to get the most probable token, but what actually happens is that the logits are converted to probabilities and the top N hypotheses are kept at each step (beam search). This is done by HuggingFace's generate method, which is unfortunately quite complex.

The tricky part is replicating the beam search (although it could be replaced with a simpler greedy search at the cost of some accuracy) and getting all the little details right when passing the tensors around.
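The greedy variant of that loop is small enough to sketch. This is a sketch, not manga-ocr's actual code: it assumes a merged ONNX export that takes pixel_values and decoder_input_ids and returns logits, and the bos_id/eos_id defaults (2/3) are placeholders that must be checked against the tokenizer's special-token ids.

```python
import numpy as np

# Greedy decoding against a merged encoder-decoder ONNX export: the whole
# model is re-run each step with all tokens generated so far. `session` is
# anything with onnxruntime InferenceSession's run() interface.
# bos_id/eos_id are assumptions - check the tokenizer's special tokens.
def greedy_decode(session, pixel_values, bos_id=2, eos_id=3, max_len=300):
    token_ids = [bos_id]
    for _ in range(max_len):
        (logits,) = session.run(
            ["logits"],
            {
                "pixel_values": pixel_values,
                "decoder_input_ids": np.array([token_ids], dtype=np.int64),
            },
        )
        next_id = int(logits[0, -1].argmax())  # argmax over the 6144-way vocab
        if next_id == eos_id:
            break
        token_ids.append(next_id)
    return token_ids[1:]  # drop the BOS token
```

Beam search would instead keep the top-N partial sequences at each step, but a greedy loop like this is usually enough to validate that an export works at all.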

BTW, I suppose there is something wrong with both your ONNX export and that bot's. I think there should be separate ONNX files for the encoder and the decoder, since the encoder is run only once per inference, while the decoder is run iteratively until the end of the sequence is reached.

from manga-ocr.

kha-white commented on May 20, 2024

Conversion to ONNX is possible and has been done, but inference in other languages is not trivial, mainly because HuggingFace's generate method (basically beam search) would have to be reimplemented. Tokenizer and post-processing are relatively easy to substitute; tokenizer's decode can be replaced with a look-up table and the rest are some rather simple operations on strings.
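A sketch of that substitution, assuming a BERT-style vocab.txt with one WordPiece token per line; the exact post-processing manga-ocr applies should be checked against its source rather than taken from this example:

```python
# Replace the tokenizer's decode with a plain look-up table built from
# vocab.txt: token id -> token string, skip special tokens, join, and strip
# WordPiece continuation markers. The [..] special-token convention and the
# "##" prefix are assumptions from BERT-style tokenizers.
def load_vocab(path="vocab.txt"):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

def decode(token_ids, vocab):
    pieces = [vocab[i] for i in token_ids]
    # drop special tokens like [CLS], [SEP], [PAD]
    pieces = [p for p in pieces if not (p.startswith("[") and p.endswith("]"))]
    return "".join(pieces).replace("##", "")
```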


Mar2ck commented on May 20, 2024

An automated conversion of the model to ONNX can be found here: https://huggingface.co/kha-white/manga-ocr-base/blob/refs%2Fpr%2F3/model.onnx


LancerComet commented on May 20, 2024

@kha-white Thanks for your reply. I have converted an ONNX model with the following input and output shapes:

Input Details:
Name: pixel_values, Shape: ['batch_size', 'num_channels', 'height', 'width'], Type: tensor(float)
Name: decoder_input_ids, Shape: ['batch_size', 'decoder_sequence_length'], Type: tensor(int64)

Output Details:
Name: logits, Shape: ['batch_size', 'decoder_sequence_length', 6144], Type: tensor(float)

I understand that pixel_values is a tensor of the bitmap, but I am unclear about decoder_input_ids. Does it come from data in the manga109 training set?

logits has a final dimension of 6144; I am currently guessing it is to be used in conjunction with vocab.txt.
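For reference, pixel_values can be built without the HuggingFace image processor. The 224x224 size and the 0.5 mean/std normalization below are the ViT image-processor defaults and are assumptions here; verify them against the model's preprocessor_config.json:

```python
import numpy as np
from PIL import Image

# Turn a PIL image into the NCHW float tensor the encoder expects.
# Sizes and normalization constants are assumed ViT defaults - check
# preprocessor_config.json for the real values.
def preprocess(img: Image.Image) -> np.ndarray:
    img = img.convert("L").convert("RGB")         # grayscale content, 3 channels
    img = img.resize((224, 224), Image.BILINEAR)
    x = np.asarray(img, dtype=np.float32) / 255.0
    x = (x - 0.5) / 0.5                           # normalize to [-1, 1]
    return x.transpose(2, 0, 1)[None]             # HWC -> NCHW with batch dim
```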

@Mar2ck Wow, I didn't know there is even a bot that turns models into ONNX automatically! But I see it is a little different from mine:

Input Details:
Name: pixel_values, Shape: ['batch_size', 'num_channels', 'height', 'width'], Type: tensor(float)
Name: decoder_input_ids, Shape: ['batch_size', 'decoder_sequence_length'], Type: tensor(int64)

Output Details:
Name: logits, Shape: ['batch_size', 'decoder_sequence_length', 6144], Type: tensor(float)
Name: encoder_last_hidden_state, Shape: ['batch_size', 'encoder_sequence_length', 768], Type: tensor(float)

It has encoder_last_hidden_state in its outputs, which mine doesn't. I have no clue why that happens.


LancerComet commented on May 20, 2024

@kha-white Thank you for your reply, I'm starting to understand the whole workflow. I am currently looking for a solution for the generate function. I saw some stuff about BeamSearch in the onnxruntime repository but haven't delved into it yet. As for the ONNX model issue, it is indeed possible to create two separate models; I had merged them during export because I didn't have a deep understanding of the model at the time. Thank you again for your response.


mayocream commented on May 20, 2024

@LancerComet Hi, would you like to share your method for exporting the pre-trained model to ONNX format? I am getting the errors below when exporting with optimum-cli:

$ optimum-cli export onnx --model kha-white/manga-ocr-base bin/
Framework not specified. Using pt to export to ONNX.
Automatic task detection to image-to-text-with-past.
Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.
/Users/mayo/miniconda3/lib/python3.11/site-packages/transformers/models/vit/feature_extraction_vit.py:28: FutureWarning: The class ViTFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use ViTImageProcessor instead.
  warnings.warn(
Traceback (most recent call last):
  File "/Users/mayo/miniconda3/bin/optimum-cli", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/mayo/miniconda3/lib/python3.11/site-packages/optimum/commands/optimum_cli.py", line 163, in main
    service.run()
  File "/Users/mayo/miniconda3/lib/python3.11/site-packages/optimum/commands/export/onnx.py", line 232, in run
    main_export(
  File "/Users/mayo/miniconda3/lib/python3.11/site-packages/optimum/exporters/onnx/__main__.py", line 399, in main_export
    onnx_config, models_and_onnx_configs = _get_submodels_and_onnx_configs(
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mayo/miniconda3/lib/python3.11/site-packages/optimum/exporters/onnx/__main__.py", line 82, in _get_submodels_and_onnx_configs
    onnx_config = onnx_config_constructor(
                  ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mayo/miniconda3/lib/python3.11/site-packages/optimum/exporters/onnx/base.py", line 623, in with_past
    return cls(
           ^^^^
  File "/Users/mayo/miniconda3/lib/python3.11/site-packages/optimum/exporters/onnx/model_configs.py", line 1231, in __init__
    super().__init__(
  File "/Users/mayo/miniconda3/lib/python3.11/site-packages/optimum/exporters/onnx/config.py", line 322, in __init__
    raise ValueError(
ValueError: The decoder part of the encoder-decoder model is bert which does not need past key values.

Updated:

Got the export to succeed. Run:
optimum-cli export onnx --model kha-white/manga-ocr-base --task vision2seq-lm bin/

