Comments (5)
@yinsong1986 Right, I didn't notice that the first time I looked. I have now done more digging and compared both implementations. From what I see, there is no bug; only the naming differs from the original LLaVA repo.
If we compare `select_best_resolution`, the height and width are swapped as you pointed out, but only in name: the resulting best resolution is the same regardless of what the two values are called. Later in this piece of code we still follow the "height, width" naming, but we swap the names back to what they should be here.
So if my understanding is correct, we end up with the width and height in the places where they should be. We also ran an equivalence test between the two implementations and got nearly the same logits, which I believe supports my claim that it's not a bug.
But I agree that it's quite counter-intuitive to see a sudden swap between the two in the lines above. I will fix the naming next week :)
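To make the "only the names are swapped" argument concrete, here is a minimal sketch of a best-resolution search written without any axis names. It is modelled on the logic discussed in this thread and is not copied from either repository; the candidate resolutions, the image size, and the `transpose` helper are illustrative assumptions.

```python
def select_best_resolution(original_size, possible_resolutions):
    """Pick the candidate that keeps the most image area and wastes the least.

    Both arguments must use the SAME axis order, either (width, height)
    or (height, width). The function itself never needs to know which.
    """
    first, second = original_size
    best_fit = None
    max_effective = 0
    min_wasted = float("inf")
    for cand_first, cand_second in possible_resolutions:
        # Scale the image to fit inside the candidate, preserving aspect ratio.
        scale = min(cand_first / first, cand_second / second)
        scaled_first, scaled_second = int(first * scale), int(second * scale)
        # Area of real image content, capped at the original area.
        effective = min(scaled_first * scaled_second, first * second)
        wasted = cand_first * cand_second - effective
        if effective > max_effective or (
            effective == max_effective and wasted < min_wasted
        ):
            max_effective, min_wasted = effective, wasted
            best_fit = (cand_first, cand_second)
    return best_fit


def transpose(pairs):
    """Hypothetical helper: flip the axis order of every pair."""
    return [(b, a) for a, b in pairs]


# Assumed example data, written in (width, height) order.
resolutions_wh = [(336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008)]
image_wh = (1024, 512)

best_wh = select_best_resolution(image_wh, resolutions_wh)
# Same call with every pair flipped to (height, width) order.
best_hw = select_best_resolution(image_wh[::-1], transpose(resolutions_wh))

# best_hw is best_wh with its two entries swapped: the same resolution.
print(best_wh, best_hw)
```

As long as the image size and the candidate list follow the same order, the choice of names inside the function does not affect which resolution is selected. A real mismatch would require one argument in (height, width) order and the other in (width, height) order.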
from transformers.
FYI: the original implementation at https://github.com/LLaVA-VL/LLaVA-NeXT does not swap (width, height) when calling `get_anyres_image_grid_shape`. The relevant source code is linked below:
- https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/llava/mm_utils.py#L235
- https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/llava/model/llava_arch.py#L200
Hope it helps with your refactoring. Thank you!
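For readers following along, a small sketch of why the unpacking order at this call site matters. The helper name `grid_shape_from_resolution` and the patch size of 336 are assumptions made for illustration; this is not the code of `get_anyres_image_grid_shape` in either repository.

```python
def grid_shape_from_resolution(best_resolution, patch_size):
    """Return the number of patches per axis, in the same order as the input."""
    first, second = best_resolution
    return first // patch_size, second // patch_size


PATCH_SIZE = 336  # assumed vision-tower input size, for illustration only

# A non-square best resolution, given in (width, height) order.
best_wh = (672, 336)

# Unpacking in the same order as the input keeps the axes straight.
num_patch_width, num_patch_height = grid_shape_from_resolution(best_wh, PATCH_SIZE)

# Unpacking with the names reversed silently transposes the grid.
swapped_height, swapped_width = grid_shape_from_resolution(best_wh, PATCH_SIZE)

# For a square resolution the two unpackings are indistinguishable.
square = grid_shape_from_resolution((672, 672), PATCH_SIZE)

print(num_patch_width, num_patch_height, swapped_height, swapped_width, square)
```

The returned tuple is identical in both calls; only the names attached to it change. That is why a swap can go unnoticed on square inputs and only shows up when the grid is non-square and the values are later used as actual height and width.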
Same as #31327. I asked the authors of LLaVa-NeXT and haven't gotten a reply yet.
To me it also looks swapped and should be the other way around. But since that is how the LLaVa-NeXT authors implemented it in their repo, and I didn't see much difference between the two orderings when running a few examples, I decided not to flag it as a bug yet and to wait for the authors' reply.
Let me know if you run an evaluation and find that swapping back to `num_patch_height, num_patch_width` is better in some (OCR, high-res images?) or all tasks!
cc @NielsRogge as well, who added the model
Hi @zucchini-nlp Thanks for your reply!
I think the original implementation keeps the order as (width, height), whereas this HF implementation keeps the order as (height, width) almost everywhere. An example comparing the two can be found below:
so, as far as I understand it, your current implementation is probably not quite the same as the original implementation :)
Please correct me if I am wrong. Thanks!
Thank you, and I look forward to the updated code!