I have a couple of questions.
--------------------------------------------- Q1 --------------------------------------------------------
I was trying to reproduce the results with the balloon.jpg image available in the repo, using the prompt "Describe the image. Please output interleaved segmentation mask." However, the network does not seem to generate multiple masks, in spite of the generated text being "The image shows a <p> hot air balloon </p> [SEG] flying over a <p> river </p> [SEG] . The <p> sky </p> [SEG] is visible over the river."
I went a step further to check whether the issue is on my side. Below are the generated output IDs ("generated_output_ids"):
[ 319, 13563, 1546, 263, 12758, 5199, 322, 385, 23116, 21082,
20255, 29889, 450, 20255, 4076, 8444, 29892, 13173, 29892, 322,
1248, 568, 6089, 304, 278, 5199, 29915, 29879, 5155, 29889,
3148, 1001, 29901, 450, 32000, -200, 29871, 32001, 16123, 2247,
385, 975, 1493, 310, 278, 7623, 29889, 13, 4002, 29581,
278, 1967, 29889, 3529, 1962, 1006, 280, 10511, 10768, 362,
11105, 29889, 319, 1799, 9047, 13566, 29901, 450, 1967, 3697,
263, 32005, 7375, 4799, 6411, 417, 265, 32006, 32004, 22764,
975, 263, 32005, 8580, 32006, 32004, 869, 450, 32005, 14744,
32006, 32004, 338, 7962, 975, 278, 8580, 29889, 2]
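To double-check the IDs above, a minimal sketch like the following counts the segmentation tokens in the sequence. It assumes 32004 is [SEG] (consistent with the tokenizer output in Q2) and that 32005 and 32006 are <p> and </p>, which is inferred only from where they appear relative to the generated text; the helper names are hypothetical and not taken from the repo.

```python
# IDs copied from the generated_output_ids dump above.
generated_output_ids = [
    319, 13563, 1546, 263, 12758, 5199, 322, 385, 23116, 21082,
    20255, 29889, 450, 20255, 4076, 8444, 29892, 13173, 29892, 322,
    1248, 568, 6089, 304, 278, 5199, 29915, 29879, 5155, 29889,
    3148, 1001, 29901, 450, 32000, -200, 29871, 32001, 16123, 2247,
    385, 975, 1493, 310, 278, 7623, 29889, 13, 4002, 29581,
    278, 1967, 29889, 3529, 1962, 1006, 280, 10511, 10768, 362,
    11105, 29889, 319, 1799, 9047, 13566, 29901, 450, 1967, 3697,
    263, 32005, 7375, 4799, 6411, 417, 265, 32006, 32004, 22764,
    975, 263, 32005, 8580, 32006, 32004, 869, 450, 32005, 14744,
    32006, 32004, 338, 7962, 975, 278, 8580, 29889, 2,
]

# Assumed special-token IDs (inferred from this dump, not from the repo's config).
SEG_ID = 32004      # [SEG]
P_OPEN_ID = 32005   # <p>
P_CLOSE_ID = 32006  # </p>


def find_seg_positions(ids, seg_id=SEG_ID):
    """Return the indices at which the segmentation token occurs."""
    return [i for i, tok in enumerate(ids) if tok == seg_id]


def find_phrases(ids, open_id=P_OPEN_ID, close_id=P_CLOSE_ID):
    """Return the token-ID spans enclosed by each <p> ... </p> pair."""
    phrases, start = [], None
    for i, tok in enumerate(ids):
        if tok == open_id:
            start = i + 1
        elif tok == close_id and start is not None:
            phrases.append(ids[start:i])
            start = None
    return phrases


seg_positions = find_seg_positions(generated_output_ids)
phrases = find_phrases(generated_output_ids)
print(len(seg_positions), "[SEG] tokens,", len(phrases), "grounded phrases")
```

Under these assumptions the sequence contains three [SEG] tokens, one after each <p> ... </p> phrase, which suggests the generated text itself is complete and the missing masks are lost somewhere after text generation.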
--------------------------------------------- Q2 --------------------------------------------------------
Another interesting behavior I observed: when I run tokenizer("[SEG]").input_ids
the output IDs are [ 1, 29871, 32004]
whereas tokenizer("a [SEG]").input_ids
returns [ 1, 263, 32004]
As you can see, the tokenizer emits ID 29871 (seg_token_idx) in the first case. Is this expected? I am curious to understand the intuition behind this.
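A minimal sketch of the comparison, using only the two ID lists reported above. It assumes that ID 1 is the beginning-of-sequence token and that 32004 is the ID assigned to [SEG]; the helper names are hypothetical and not part of the repo. If the tokenizer is a Hugging Face one, tokenizer.convert_tokens_to_ids("[SEG]") should return the ID directly without depending on position, though that is an assumption about the setup.

```python
# Token IDs as reported above for the two tokenizer calls.
ids_seg_alone = [1, 29871, 32004]    # tokenizer("[SEG]").input_ids
ids_seg_prefixed = [1, 263, 32004]   # tokenizer("a [SEG]").input_ids


def last_token_id(input_ids):
    """Hypothetical helper: take the final ID, where [SEG] sits in both outputs."""
    return input_ids[-1]


def first_token_after_bos(input_ids):
    """Hypothetical helper: take the ID right after the leading ID 1."""
    return input_ids[1]


# The last position agrees across both calls.
print(last_token_id(ids_seg_alone), last_token_id(ids_seg_prefixed))

# The position right after the leading ID differs between the two calls.
print(first_token_after_bos(ids_seg_alone), first_token_after_bos(ids_seg_prefixed))
```

Reading the ID from the end of the list gives 32004 in both cases, whereas reading the first ID after the leading token gives 29871 for "[SEG]" alone and 263 for "a [SEG]".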
Thank you, I appreciate any time you can spend to help with my questions.