
nanocolbert's People

Contributors: hannibal046

nanocolbert's Issues

Why does `embedding2pid` raise "list index out of range"?

When I run retrieve.py, I hit an IndexError on this line:

 top_relevant_doc_pids = [embedding2pid[x] for y in I for x in y]

Upon inspection, I found that some of the retrieved indices are out of range for embedding2pid. Why is this happening?

My debug output (note that total_embeddings has 180 more rows than embedding2pid):

total_embeddings.shape= torch.Size([596847651, 128])
len(embedding2pid)= 596847471

I:
[[406822639 117817889 368295592 ... 340840928  44116417  58849792]
 [ 47088131 301290898  47088044 ... 126308015 284803824 288093606]
 [ 47088131 442551456  47088044 ... 540478083 300737160 316488359]
 ...
 [150980019 102136117 410337895 ...  39484138 289150555 571997339]
 [404093947 376114314 404093855 ... 166449253  32960475 549894389]
 [421370365  89232576 399436776 ... 501989308 415117993 378354340]]

IndexError occurred. Details:
Current y: [406822639 117817889 368295592 ... 340840928  44116417  58849792]
Current x: 596867001

Traceback (most recent call last):
  File "retrieve.py", line 140, in <module>
    raise e
  File "retrieve.py", line 138, in <module>
    raise e
  File "retrieve.py", line 132, in <module>
    pid = embedding2pid[x]
IndexError: list index out of range
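An error like this usually means the ANN index and the pid map were built from different embedding dumps, so the index returns positions that do not exist in `embedding2pid`. As a minimal sketch (the helper name and the guard are illustrative, not code from the repo), one can validate the search result before mapping it to pids:

```python
import numpy as np

def pids_from_search(I, embedding2pid):
    """Map ANN search results I (2-D array of embedding indices) to pids,
    failing loudly if any index falls outside embedding2pid."""
    n = len(embedding2pid)
    flat = np.asarray(I).ravel()
    bad = flat[(flat < 0) | (flat >= n)]
    if bad.size:
        raise ValueError(
            f"search returned {bad.size} indices outside embedding2pid "
            f"(len={n}); the index and the pid map were likely built from "
            f"different embedding dumps"
        )
    return [embedding2pid[int(x)] for x in flat]
```

A quick consistency check before searching is to compare the number of vectors in the index (e.g. `index.ntotal` for a FAISS index) with `len(embedding2pid)`; if they differ, rebuild both from the same encoding run.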

Data link changes in scripts/download.sh

The following links currently work:

wget https://msmarco.z22.web.core.windows.net/msmarcoranking/collectionandqueries.tar.gz
wget https://msmarco.z22.web.core.windows.net/msmarcoranking/triples.train.small.tar.gz
wget https://msmarco.z22.web.core.windows.net/msmarcoranking/top1000.eval.tar.gz

A question about a detail in the code

When loading data, trim_padding is applied to the docs. My understanding is that the goal is to exclude a doc's pad tokens from the query-doc score during training. But the current implementation only shrinks a batch of docs from the preprocessing max_len down to the length of the longest doc in that batch.
So for the shorter docs in the batch, doesn't the problem remain?
Or is trim_padding only there to shorten seq_len, and is it fine for a doc's padding tokens to contribute to the score during training, as long as they are excluded at prediction time?

@staticmethod
def collate_fn(samples, tokenizer):

    def trim_padding(input_ids, padding_id):
        ## the preprocessing script pads every sequence to a fixed max length;
        ## trim the 2-D tensor down to the longest non-padded sequence in the batch
        non_pad_mask = input_ids != padding_id
        non_pad_lengths = non_pad_mask.sum(dim=1)
        max_length = non_pad_lengths.max().item()
        trimmed_tensor = input_ids[:, :max_length]
        return trimmed_tensor

    queries  = [x[0] for x in samples]
    pos_docs = [x[1] for x in samples]
    neg_docs = [x[2] for x in samples]

    query_input_ids = torch.from_numpy(np.stack(queries).astype(np.int32))
    query_attention_mask = (query_input_ids != tokenizer.mask_token_id).int()  ## queries are padded with [MASK], not [PAD] -- the *query augmentation* from the paper

    doc_input_ids = torch.from_numpy(np.stack(pos_docs + neg_docs).astype(np.int32))
    doc_input_ids = trim_padding(doc_input_ids, padding_id=tokenizer.pad_token_id)
    doc_attention_mask = (doc_input_ids != tokenizer.pad_token_id).int()
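To make the question concrete, here is a toy illustration of what trim_padding does, rewritten in pure NumPy with made-up values:

```python
import numpy as np

PAD = 0

# A batch padded to max_len=8 in preprocessing; the longest real doc has 5 tokens.
batch = np.array([
    [5, 6, 7, 0, 0, 0, 0, 0],   # 3 real tokens
    [5, 6, 7, 8, 9, 0, 0, 0],   # 5 real tokens (longest in the batch)
])

non_pad_lengths = (batch != PAD).sum(axis=1)
max_length = non_pad_lengths.max()
trimmed = batch[:, :max_length]                  # shape (2, 5): trimmed to the batch max

attention_mask = (trimmed != PAD).astype(np.int32)
# the shorter doc still carries pad tokens after trimming ...
assert (trimmed[0] == PAD).any()
# ... so trimming alone does not remove all pads; the remaining ones
# can only be excluded from the score via the attention mask
```

That is, trimming only reduces seq_len for the whole batch; whether the leftover pad tokens of shorter docs count toward the score is exactly what the question above is asking.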

Why reproduce this model at all?

A question for the author: in an era when everyone optimizes bi-encoders directly (bge, m3e, etc.), what is the value of this model? For reranking it is probably not as good as full-interaction cross-encoders, and for retrieval it is heavier than a bi-encoder.

Data links in download.sh are broken

Running scripts/download.sh fails with the following errors:

# bash scripts/download.sh 
--2024-02-18 05:52:59--  https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz
Resolving msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)... 20.150.34.4
Connecting to msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)|20.150.34.4|:443... connected.
HTTP request sent, awaiting response... 404 The specified resource does not exist.
2024-02-18 05:53:00 ERROR 404: The specified resource does not exist..

--2024-02-18 05:53:00--  https://msmarco.blob.core.windows.net/msmarcoranking/triples.train.small.tar.gz
Resolving msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)... 20.150.34.4
Connecting to msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)|20.150.34.4|:443... connected.
HTTP request sent, awaiting response... 404 The specified resource does not exist.
2024-02-18 05:53:01 ERROR 404: The specified resource does not exist..

--2024-02-18 05:53:01--  https://msmarco.blob.core.windows.net/msmarcoranking/top1000.eval.tar.gz
Resolving msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)... 20.150.34.4
Connecting to msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)|20.150.34.4|:443... connected.
HTTP request sent, awaiting response... 404 The specified resource does not exist.
2024-02-18 05:53:02 ERROR 404: The specified resource does not exist..

tar (child): top1000.eval.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
tar (child): triples.train.small.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
tar (child): collectionandqueries.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
rm: cannot remove '*.gz': No such file or directory

Question about a cross-encoder for reranking

Could you recommend or point me to a simple BERT implementation for reranking? For instance, something akin to Figure (c) of the ColBERT paper: all-to-all interaction, i.e. a cross-encoder.
I am aware of the implementation at https://github.com/nyu-dl/dl4marco-bert, but I find it a bit complex, and that repository uses TensorFlow rather than PyTorch. Thank you for your help.
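For reference, the core of a cross-encoder is small enough to sketch. Below is a structural sketch of all-to-all interaction in PyTorch; a tiny `nn.TransformerEncoder` stands in for BERT so the example is self-contained, and all names and sizes are illustrative. In practice one would load a pretrained model (e.g. via Hugging Face transformers) and feed it tokenized "[CLS] query [SEP] doc [SEP]" pairs.

```python
import torch
import torch.nn as nn

class CrossEncoderReranker(nn.Module):
    """Sketch of a cross-encoder: query and doc are concatenated into one
    sequence, every token attends to every other token (all-to-all
    interaction), and a single relevance score is read off the first
    ([CLS]) position."""

    def __init__(self, vocab_size=30522, dim=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.score = nn.Linear(dim, 1)

    def forward(self, input_ids, attention_mask):
        h = self.encoder(
            self.embed(input_ids),
            src_key_padding_mask=(attention_mask == 0),  # ignore pad positions
        )
        return self.score(h[:, 0]).squeeze(-1)  # (batch,) relevance scores

# usage: score already-concatenated (query, doc) pairs, then sort docs by score
model = CrossEncoderReranker()
input_ids = torch.randint(1, 30522, (2, 16))      # 2 concatenated q+doc pairs
attention_mask = torch.ones(2, 16, dtype=torch.long)
scores = model(input_ids, attention_mask)          # shape: (2,)
```

The key contrast with ColBERT is that here the query must be re-encoded jointly with every candidate doc, so it is used for reranking a small candidate set rather than full retrieval.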
