Comments (6)
Hello @shima-khoshraftar
It's not really a bug.
ColBERT differs from the usual BERT-like models: instead of a single [CLS] embedding, it emits an embedding for each token in the document.
If you have the sentence "I have an apple" and the tokenizer splits it into ["I", "have", "an", "apple"], the output shape will be (4, 128).
If you have the sentence "I have an apple and an orange" and the tokenizer splits it into ["I", "have", "an", "apple", "and", "an", "orange"], the output shape will be (7, 128).
Fastembed pads sequences to the maximum number of tokens in a batch, so the padded length can differ from one batch to the next.
It is meant to be used with vector databases like Qdrant, so you would not need to compute scores yourself and could leave that to those tools. (Qdrant will support ColBERT as of the next release.)
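To illustrate the batch-wise padding described above, here is a small sketch that uses NumPy only. `fake_batch_embeddings` is a hypothetical stand-in for the model call, not part of fastembed; it only mimics padding every document to the longest one in its own batch.

```python
import numpy as np

DIM = 128  # per-token embedding size from the example above


def fake_batch_embeddings(token_counts, dim=DIM):
    """Hypothetical stand-in for a model call (not a fastembed API).

    Returns one (max_len, dim) array per document, where max_len is the
    longest document in this batch only.
    """
    max_len = max(token_counts)
    rng = np.random.default_rng(0)
    return [rng.normal(size=(max_len, dim)) for _ in token_counts]


batch_1 = fake_batch_embeddings([4, 7])   # longest document has 7 tokens
batch_2 = fake_batch_embeddings([12, 9])  # longest document has 12 tokens

shapes = [emb.shape for emb in batch_1 + batch_2]
print(shapes)  # [(7, 128), (7, 128), (12, 128), (12, 128)]

# Stacking embeddings from different batches into one array fails,
# because their first dimensions differ.
try:
    np.stack(batch_1 + batch_2)
    stacked = True
except ValueError:
    stacked = False
print(stacked)  # False
```

Within a single batch the shapes agree, so stacking only breaks once embeddings from several batches are combined.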
from fastembed.
Thanks for your reply. To use this in the meantime, I was using the compute_relevance_scores function defined on the Late Interaction Text Embedding Models page: https://qdrant.github.io/fastembed/examples/ColBERT_with_FastEmbed/
Exactly: if fastembed padded the sequences to the maximum number of tokens in the dataset (rather than in each batch), this issue would not happen. Do you think it would be inefficient to compute the maximum number of tokens across the whole dataset rather than per batch? If it is efficient enough, I was thinking it could be added to fastembed and used until the next Qdrant release. Thanks.
from fastembed.
No. If the dataset is large (tens of millions of records), that would mean reading all of it and tokenizing it just to find the maximum length, then either storing those tokens or dropping them and tokenizing everything again for the embedding pass, which is not really feasible.
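For scoring locally in the meantime, one option that avoids dataset-wide padding is to score each document separately, since late-interaction (MaxSim) scoring does not require documents to share a length. The sketch below assumes NumPy arrays of shape (tokens, dim); `maxsim_scores` is a hypothetical helper written for this thread, not the compute_relevance_scores function from the linked page and not a fastembed API.

```python
import numpy as np


def maxsim_scores(query_emb, doc_embs):
    """Late-interaction (MaxSim) score of one query against each document.

    query_emb: array of shape (q_tokens, dim)
    doc_embs:  list of arrays of shape (d_tokens_i, dim); lengths may differ
    """
    scores = []
    for doc_emb in doc_embs:
        sim = query_emb @ doc_emb.T  # (q_tokens, d_tokens_i)
        # Best-matching document token for each query token, summed up.
        scores.append(float(sim.max(axis=1).sum()))
    return np.array(scores)


# Tiny hand-made example with dim=2 and documents of different lengths.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
docs = [
    np.array([[1.0, 0.0], [0.0, 0.5], [0.0, 0.0]]),  # 3 tokens
    np.array([[0.0, 1.0]]),                          # 1 token
]

scores = maxsim_scores(query, docs)
ranking = np.argsort(-scores)  # document indices, best first
print(scores)   # [1.5 1. ]
print(ranking)  # [0 1]
```

The Python loop is slower than one batched matmul over a stacked array, but it works regardless of how each batch was padded.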
from fastembed.
Right. Thanks for the reply.
from fastembed.
If you really want all of the embeddings to have the same shape, you can modify the padding so that sequences are padded to a predefined length. For example, if you set the length to 100, the embeddings for each document will have shape (100, 128).
This is not exposed to users, but you can still do it (I haven't tested it thoroughly):
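from fastembed import LateInteractionTextEmbedding
# Load the ColBERT model and read the tokenizer's current padding settings
colbert = LateInteractionTextEmbedding('colbert-ir/colbertv2.0')
padding = colbert.model.tokenizer.padding
# Pad every sequence to a fixed 100 tokens instead of the batch maximum
padding['length'] = 100
colbert.model.tokenizer.enable_padding(**padding)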
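If changing the tokenizer is not an option, a similar effect can be approximated after embedding. This is only a sketch under assumptions: `pad_or_truncate` is a hypothetical helper (not part of fastembed) that appends zero rows or cuts extra rows so every array ends up with the same number of token rows.

```python
import numpy as np


def pad_or_truncate(emb, length):
    """Return a (length, dim) copy of emb, zero-padded or truncated."""
    out = np.zeros((length, emb.shape[1]), dtype=emb.dtype)
    n = min(length, emb.shape[0])
    out[:n] = emb[:n]  # rows beyond n stay zero
    return out


# Two embeddings with different token counts, as in the example above.
a = np.ones((4, 128))
b = np.ones((7, 128))

fixed = np.stack([pad_or_truncate(e, 6) for e in (a, b)])
print(fixed.shape)  # (2, 6, 128)
```

Two caveats: truncation drops tokens from longer documents, and zero rows contribute a similarity of 0, which can change a max-based score when all real similarities for a query token are negative.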
from fastembed.
Great, thanks, I will try it.
from fastembed.