Giter Club home page Giter Club logo

Comments (5)

zucchini-nlp avatar zucchini-nlp commented on August 16, 2024 1

@yinsong1986 Right, I didn't notice that first time I looked. So, now I did more digging and compared both implementations. From what I see, there is no bug and it's simply the naming which is a bit different from the original LLaVarepo.

If we compare select_best_resolution as you pointed out, the height and width are swapped (only names since the resulting best resolution is same regardless of how you call it). Later in this piece of code we still follow the "height, width" naming,

height, width = select_best_resolution(image_size, grid_pinpoints)
return height // patch_size, width // patch_size

but we swap back the names as it should be here

num_patch_width, num_patch_height = get_anyres_image_grid_shape(

So if my understanding is correct, at the end we end up with the width and height in the places where they should be. Also we ran an equivalence test between two implementations and got nearly same logits, which I believe supports my claim that it's not a bug.

But I agree that it's quite counter-intuitive to see a sudden swap between the two in above lines. I will fix the naming next week :)

from transformers.

yinsong1986 avatar yinsong1986 commented on August 16, 2024 1

@zucchini-nlp

FYI: in the original implementation from https://github.com/LLaVA-VL/LLaVA-NeXT, they didn't do any swap of the (width, height), when calling get_anyres_image_grid_shape. The source code is as below:

Hope it helps with your refactoring code. Thank you!

from transformers.

zucchini-nlp avatar zucchini-nlp commented on August 16, 2024

Same as #31327. I asked the authors of LLaVa-NeXT and didn't get any reply yet.

For me it also look like swapped and should be the other way, but since that is how LLaVa-NeXT authors implemented it in their repo and I didn't see much difference by running a few examples between the two swaps, I decided to not flag it as a bug yet and wait for the authors' reply.

Let me know if you ran an evaluation and found that swapping back to num_patch_height, num_patch_width is better in some (OCR, high-res images?) or all tasks!

cc @NielsRogge also, who added the model

from transformers.

yinsong1986 avatar yinsong1986 commented on August 16, 2024

Hi @zucchini-nlp Thanks for your reply!

I think in the original implementation, they kind of keeping the order as (width, height), but for this hf implementation, you kind of keep the order as (height, width) almost everywhere. An example of comparing the two can be found below:

so your current implementation probably is not quite implemented same as the original implement, as far as I understand it :)

Pls correct me if I am wrong.Thanks!

from transformers.

yinsong1986 avatar yinsong1986 commented on August 16, 2024

Thank you and look forward to the updated code!

from transformers.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.