[Bug/Model Request]: LateInteractionTextEmbedding("colbert-ir/colbertv2.0") creates different size of embeddings for large set of documents (fastembed issue, closed)

shima-khoshraftar commented on August 18, 2024

[Bug/Model Request]: LateInteractionTextEmbedding("colbert-ir/colbertv2.0") creates different size of embeddings for large set of documents

joein commented on August 18, 2024

Hello @shima-khoshraftar

It's not really a bug.

ColBERT differs from the usual BERT-like models: instead of emitting a single embedding for the [CLS] token, it emits an embedding for each token in the document.

If you have the sentence "I have an apple" and the tokenizer splits it into ["I", "have", "an", "apple"], the output shape will be (4, 128).

If you have the sentence "I have an apple and an orange" and the tokenizer splits it into ["I", "have", "an", "apple", "and", "an", "orange"], the output shape will be (7, 128).

Fastembed pads sequences to the maximum number of tokens in a batch, so the padded length can differ from one batch to another.
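To make the per-batch padding concrete, here is a minimal sketch in plain Python. It does not call fastembed: `batch_shapes` and the token counts are hypothetical, and a real tokenizer also adds special tokens, so actual lengths will differ. It only shows why documents in different batches can end up with different first dimensions.

```python
from typing import List, Tuple

EMBEDDING_DIM = 128  # per-token vector size, taken from the shapes quoted above


def batch_shapes(token_counts: List[int], batch_size: int) -> List[Tuple[int, int]]:
    """Return the padded output shape of each document when every batch
    is padded to the longest sequence in that batch (hypothetical helper)."""
    shapes = []
    for start in range(0, len(token_counts), batch_size):
        batch = token_counts[start:start + batch_size]
        padded_len = max(batch)  # longest document in this batch only
        shapes.extend([(padded_len, EMBEDDING_DIM)] * len(batch))
    return shapes


# Four documents with made-up token counts, embedded in batches of two.
shapes = batch_shapes([4, 7, 5, 12], batch_size=2)
print(shapes)
```

The first two documents share one shape and the last two share another, so stacking all four into a single array would fail.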

It is meant to be used with vector databases like Qdrant, so you do not need to compute scores yourself; you can leave that to those tools. (Qdrant will support ColBERT as of the next release.)

shima-khoshraftar commented on August 18, 2024

Thanks for your reply. For now, though, I was using the compute_relevance_scores function defined in the Late Interaction Text Embedding Models example: https://qdrant.github.io/fastembed/examples/ColBERT_with_FastEmbed/

Exactly: if fastembed padded sequences to the max number of tokens in the whole dataset (rather than in each batch), this issue would not happen. Do you think it would be inefficient to compute the max number of tokens across the dataset rather than per batch? If it is feasible, this could be added to fastembed and used until the next Qdrant release. Thanks.
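For reference, late-interaction (MaxSim) scoring can also be done one document at a time, in which case the documents do not need a common shape. The sketch below is an assumption-laden illustration in plain Python, not the `compute_relevance_scores` function from the linked example: `max_sim_score` and `rank_documents` are hypothetical names, and it says nothing about how fastembed represents padded positions in its output.

```python
from typing import List, Sequence

Vector = Sequence[float]


def max_sim_score(query_emb: Sequence[Vector], doc_emb: Sequence[Vector]) -> float:
    """ColBERT-style MaxSim: for each query token, take the best dot product
    against any document token, then sum over the query tokens."""
    score = 0.0
    for q in query_emb:
        score += max(sum(a * b for a, b in zip(q, d)) for d in doc_emb)
    return score


def rank_documents(query_emb: Sequence[Vector],
                   doc_embs: Sequence[Sequence[Vector]]) -> List[int]:
    """Return document indices sorted from most to least relevant."""
    scores = [max_sim_score(query_emb, doc) for doc in doc_embs]
    return sorted(range(len(doc_embs)), key=lambda i: scores[i], reverse=True)


# Toy 2-dimensional embeddings; the two documents have different token counts.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0]]
doc_b = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(rank_documents(query, [doc_a, doc_b]))
```

With NumPy arrays the same loop over documents applies; only the inner dot products would be vectorized.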

joein commented on August 18, 2024

No. If the dataset is large (tens of millions of records), that would mean reading all of it and tokenizing it, then either saving those tokens or dropping them and tokenizing again, which is not really feasible.

shima-khoshraftar commented on August 18, 2024

Right. Thanks for the reply.

joein commented on August 18, 2024

If you really want all of the embeddings to have the same shape, you can modify the padding so that it pads sequences to a predefined length. For example, if you set the length to 100, the embeddings for each document will have shape (100, 128).

This is not exposed to users, but you can still do it (I haven't thoroughly tested it):

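from fastembed import LateInteractionTextEmbedding

colbert = LateInteractionTextEmbedding('colbert-ir/colbertv2.0')
# Reuse the tokenizer's current padding settings and only override the length.
padding = colbert.model.tokenizer.padding
padding['length'] = 100  # pad every sequence to 100 tokens
colbert.model.tokenizer.enable_padding(**padding)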
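If changing the tokenizer's internal padding is not an option, a different technique is to pad the embeddings after they are computed. The sketch below is a hypothetical plain-Python helper (`pad_or_truncate` is not part of fastembed) that appends zero rows or cuts extra rows so every document has the same number of token vectors. It assumes zero vectors are acceptable as padding; a zero row contributes a similarity of 0, which can change a max-based score if all real similarities for a query token are negative.

```python
from typing import List, Sequence


def pad_or_truncate(embedding: Sequence[Sequence[float]], length: int) -> List[List[float]]:
    """Return a copy of a (n_tokens, dim) embedding with exactly `length` rows,
    truncating extra rows or appending zero rows as needed."""
    if not embedding:
        raise ValueError("embedding must contain at least one token vector")
    dim = len(embedding[0])
    rows = [list(row) for row in embedding[:length]]  # truncate if too long
    rows.extend([0.0] * dim for _ in range(length - len(rows)))  # pad if too short
    return rows


# A 1-token document padded to 3 tokens.
print(pad_or_truncate([[1.0, 2.0]], 3))
```

Note that truncation discards token vectors, so the target length should be chosen to cover the longest documents you care about.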

shima-khoshraftar commented on August 18, 2024

Great, thanks, I will try it.
