Potential Bug in llava_next when calling pack_image_features function. (transformers, OPEN)

yinsong1986 commented on August 16, 2024

Potential Bug in llava_next when calling pack_image_features function.

Comments (5)

zucchini-nlp commented on August 16, 2024

@yinsong1986 Right, I didn't notice that the first time I looked. I did some more digging and compared both implementations. From what I see, there is no bug; the naming is simply a bit different from the original LLaVA repo.

If we compare select_best_resolution, as you pointed out, the height and width are swapped (only the names, since the resulting best resolution is the same regardless of what you call it). Later, in this piece of code, we still follow the "height, width" naming:

transformers/src/transformers/models/llava_next/modeling_llava_next.py

Lines 73 to 74 in 730a440

 height, width = select_best_resolution(image_size, grid_pinpoints) 

 return height // patch_size, width // patch_size

but we swap the names back, as they should be, here:

transformers/src/transformers/models/llava_next/modeling_llava_next.py

Line 656 in 730a440

num_patch_width, num_patch_height = get_anyres_image_grid_shape(

So, if my understanding is correct, we end up with the width and height in the places where they should be. We also ran an equivalence test between the two implementations and got nearly the same logits, which I believe supports my claim that it's not a bug.

But I agree that it's quite counter-intuitive to see a sudden swap between the two in the lines above. I will fix the naming next week :)
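The name swap described in this comment can be sketched as follows. This is only an illustration, not the transformers code: `pick_resolution_stub`, `grid_shape_mislabelled`, and `grid_shape_consistent` are hypothetical names, and the sketch assumes the selected resolution tuple is really in (width, height) order, which is the point under discussion in this thread.

```python
from typing import Tuple


def pick_resolution_stub() -> Tuple[int, int]:
    """Hypothetical stand-in for select_best_resolution.

    Assumption for this sketch: the tuple is in (width, height) order.
    """
    return (672, 336)  # width=672, height=336


def grid_shape_mislabelled(patch_size: int) -> Tuple[int, int]:
    # Mirrors the naming quoted above: the first value is *called* height,
    # although under our assumption it actually holds the width.
    height, width = pick_resolution_stub()
    return height // patch_size, width // patch_size


def grid_shape_consistent(patch_size: int) -> Tuple[int, int]:
    # Same computation with names that match the assumed tuple order.
    width, height = pick_resolution_stub()
    return width // patch_size, height // patch_size


# The caller unpacks as (width, height), which undoes the mislabelling.
num_patch_width, num_patch_height = grid_shape_mislabelled(336)
ref_width, ref_height = grid_shape_consistent(336)

assert (num_patch_width, num_patch_height) == (ref_width, ref_height)
```

Under that assumption the two naming schemes produce the same values in the same positions, so only readability is affected. If the tuple were actually in (height, width) order, the caller's unpacking would be a real swap.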

yinsong1986 commented on August 16, 2024

@zucchini-nlp

FYI: the original implementation at https://github.com/LLaVA-VL/LLaVA-NeXT does not swap (width, height) when calling get_anyres_image_grid_shape. The source code is as below:

Hope it helps with your refactoring code. Thank you!

zucchini-nlp commented on August 16, 2024

Same as #31327. I asked the authors of LLaVa-NeXT and haven't received a reply yet.

To me it also looks swapped and should be the other way around, but since that is how the LLaVa-NeXT authors implemented it in their repo, and I didn't see much difference between the two orderings when running a few examples, I decided not to flag it as a bug yet and to wait for the authors' reply.

Let me know if you run an evaluation and find that swapping back to num_patch_height, num_patch_width is better in some tasks (OCR, high-res images?) or in all of them!
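When choosing images for such an evaluation, it may help to note that swapping the two grid dimensions can only change anything when the selected grid is non-square. A minimal sketch of that check, where the candidate resolutions, the patch size of 336, and the helper names are all assumptions made for illustration rather than values taken from this thread:

```python
from typing import List, Tuple

Resolution = Tuple[int, int]


def grid_shape(resolution: Resolution, patch_size: int) -> Tuple[int, int]:
    """Number of patches along each of the two dimensions."""
    first, second = resolution
    return first // patch_size, second // patch_size


def affected_by_swap(resolution: Resolution, patch_size: int) -> bool:
    """True if exchanging the two grid dimensions gives a different grid."""
    a, b = grid_shape(resolution, patch_size)
    return (a, b) != (b, a)


# Hypothetical candidate resolutions, for illustration only.
candidates: List[Resolution] = [
    (336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008),
]

affected = [r for r in candidates if affected_by_swap(r, 336)]
unaffected = [r for r in candidates if not affected_by_swap(r, 336)]
```

Images that map to a square resolution land in `unaffected`, so they would not reveal a difference between the two orderings; elongated images would be the more informative test cases.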

cc @NielsRogge also, who added the model

yinsong1986 commented on August 16, 2024

Hi @zucchini-nlp Thanks for your reply!

I think the original implementation keeps the order as (width, height), whereas this HF implementation keeps the order as (height, width) almost everywhere. An example comparing the two can be found below:

https://github.com/LLaVA-VL/LLaVA-NeXT/blob/6944062b9bb2e61c48436f1a65c3ea339095ec91/llava/mm_utils.py#L234

transformers/src/transformers/models/llava_next/modeling_llava_next.py

Line 73 in 12b1620

height, width = select_best_resolution(image_size, grid_pinpoints)

So, as far as I understand it, your current implementation is probably not quite the same as the original :)

Please correct me if I am wrong. Thanks!

yinsong1986 commented on August 16, 2024

Thank you, and I look forward to the updated code!
