
nanocolbert's People

Contributors: hannibal046

nanocolbert's Issues

Why does `embedding2pid` raise "list index out of range"?

When I run retrieve.py, I hit an IndexError on this line:

 top_relevant_doc_pids = [embedding2pid[x] for y in I for x in y]

Upon inspection, I found that some of the retrieved indices are out of range for embedding2pid. Why is this happening?

My debug output (note that total_embeddings has 180 more rows than embedding2pid):

total_embeddings.shape= torch.Size([596847651, 128])
len(embedding2pid)= 596847471

I:
[[406822639 117817889 368295592 ... 340840928  44116417  58849792]
 [ 47088131 301290898  47088044 ... 126308015 284803824 288093606]
 [ 47088131 442551456  47088044 ... 540478083 300737160 316488359]
 ...
 [150980019 102136117 410337895 ...  39484138 289150555 571997339]
 [404093947 376114314 404093855 ... 166449253  32960475 549894389]
 [421370365  89232576 399436776 ... 501989308 415117993 378354340]]

IndexError occurred. Details:
Current y: [406822639 117817889 368295592 ... 340840928  44116417  58849792]
Current x: 596867001

Traceback (most recent call last):
  File "retrieve.py", line 140, in <module>
    raise e
  File "retrieve.py", line 138, in <module>
    raise e
  File "retrieve.py", line 132, in <module>
    pid = embedding2pid[x]
IndexError: list index out of range
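An error like this usually means the ANN index and the pid map were built from different embedding dumps, so the index returns positions that do not exist in `embedding2pid`. As a minimal sketch (the helper name and the guard are illustrative, not code from the repo), one can validate the search result before mapping it to pids:

```python
import numpy as np

def pids_from_search(I, embedding2pid):
    """Map ANN search results I (2-D array of embedding indices) to pids,
    failing loudly if any index falls outside embedding2pid."""
    n = len(embedding2pid)
    flat = np.asarray(I).ravel()
    bad = flat[(flat < 0) | (flat >= n)]
    if bad.size:
        raise ValueError(
            f"search returned {bad.size} indices outside embedding2pid "
            f"(len={n}); the index and the pid map were likely built from "
            f"different embedding dumps"
        )
    return [embedding2pid[int(x)] for x in flat]
```

A quick consistency check before searching is to compare the number of vectors in the index (e.g. `index.ntotal` for a FAISS index) with `len(embedding2pid)`; if they differ, rebuild both from the same encoding run.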

Data link changes in scripts/download.sh

The following links currently work:

wget https://msmarco.z22.web.core.windows.net/msmarcoranking/collectionandqueries.tar.gz
wget https://msmarco.z22.web.core.windows.net/msmarcoranking/triples.train.small.tar.gz
wget https://msmarco.z22.web.core.windows.net/msmarcoranking/top1000.eval.tar.gz

A question about a detail in the code

When loading data, trim_padding is applied to the docs. My understanding is that the goal is to exclude a doc's pad tokens from the query-doc score during training. But the current implementation only shrinks a batch of docs from the preprocessing max_len down to the length of the longest doc in that batch.
So for the shorter docs in the batch, doesn't the problem remain?
Or is trim_padding only there to shorten seq_len, and is it fine for a doc's padding tokens to contribute to the score during training, as long as they are excluded at prediction time?

@staticmethod
def collate_fn(samples, tokenizer):

    def trim_padding(input_ids, padding_id):
        ## the preprocessing script pads every sequence to a fixed max length;
        ## trim the 2-D tensor down to the longest non-padded sequence in the batch
        non_pad_mask = input_ids != padding_id
        non_pad_lengths = non_pad_mask.sum(dim=1)
        max_length = non_pad_lengths.max().item()
        trimmed_tensor = input_ids[:, :max_length]
        return trimmed_tensor

    queries  = [x[0] for x in samples]
    pos_docs = [x[1] for x in samples]
    neg_docs = [x[2] for x in samples]

    query_input_ids = torch.from_numpy(np.stack(queries).astype(np.int32))
    query_attention_mask = (query_input_ids != tokenizer.mask_token_id).int()  ## queries are padded with [MASK], not [PAD] -- the *query augmentation* from the paper

    doc_input_ids = torch.from_numpy(np.stack(pos_docs + neg_docs).astype(np.int32))
    doc_input_ids = trim_padding(doc_input_ids, padding_id=tokenizer.pad_token_id)
    doc_attention_mask = (doc_input_ids != tokenizer.pad_token_id).int()
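To make the question concrete, here is a toy illustration of what trim_padding does, rewritten in pure NumPy with made-up values:

```python
import numpy as np

PAD = 0

# A batch padded to max_len=8 in preprocessing; the longest real doc has 5 tokens.
batch = np.array([
    [5, 6, 7, 0, 0, 0, 0, 0],   # 3 real tokens
    [5, 6, 7, 8, 9, 0, 0, 0],   # 5 real tokens (longest in the batch)
])

non_pad_lengths = (batch != PAD).sum(axis=1)
max_length = non_pad_lengths.max()
trimmed = batch[:, :max_length]                  # shape (2, 5): trimmed to the batch max

attention_mask = (trimmed != PAD).astype(np.int32)
# the shorter doc still carries pad tokens after trimming ...
assert (trimmed[0] == PAD).any()
# ... so trimming alone does not remove all pads; the remaining ones
# can only be excluded from the score via the attention mask
```

That is, trimming only reduces seq_len for the whole batch; whether the leftover pad tokens of shorter docs count toward the score is exactly what the question above is asking.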

Why reproduce this model at all?

A question for the author: in an era when everyone optimizes bi-encoders directly (bge, m3e, etc.), what is the value of this model? For reranking it is probably not as good as full-interaction cross-encoders, and for retrieval it is heavier than a bi-encoder.

Data links in download.sh are broken

Running scripts/download.sh fails with the following errors:

# bash scripts/download.sh 
--2024-02-18 05:52:59--  https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz
Resolving msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)... 20.150.34.4
Connecting to msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)|20.150.34.4|:443... connected.
HTTP request sent, awaiting response... 404 The specified resource does not exist.
2024-02-18 05:53:00 ERROR 404: The specified resource does not exist..

--2024-02-18 05:53:00--  https://msmarco.blob.core.windows.net/msmarcoranking/triples.train.small.tar.gz
Resolving msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)... 20.150.34.4
Connecting to msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)|20.150.34.4|:443... connected.
HTTP request sent, awaiting response... 404 The specified resource does not exist.
2024-02-18 05:53:01 ERROR 404: The specified resource does not exist..

--2024-02-18 05:53:01--  https://msmarco.blob.core.windows.net/msmarcoranking/top1000.eval.tar.gz
Resolving msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)... 20.150.34.4
Connecting to msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)|20.150.34.4|:443... connected.
HTTP request sent, awaiting response... 404 The specified resource does not exist.
2024-02-18 05:53:02 ERROR 404: The specified resource does not exist..

tar (child): top1000.eval.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
tar (child): triples.train.small.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
tar (child): collectionandqueries.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
rm: cannot remove '*.gz': No such file or directory

Question about a cross-encoder for reranking

Could you recommend or point me to a simple BERT implementation for reranking? For instance, something akin to Figure (c) of the ColBERT paper: all-to-all interaction, i.e. a cross-encoder.
I am aware of the implementation at https://github.com/nyu-dl/dl4marco-bert, but I find it a bit complex, and that repository uses TensorFlow rather than PyTorch. Thank you for your help.
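For reference, the core of a cross-encoder is small enough to sketch. Below is a structural sketch of all-to-all interaction in PyTorch; a tiny `nn.TransformerEncoder` stands in for BERT so the example is self-contained, and all names and sizes are illustrative. In practice one would load a pretrained model (e.g. via Hugging Face transformers) and feed it tokenized "[CLS] query [SEP] doc [SEP]" pairs.

```python
import torch
import torch.nn as nn

class CrossEncoderReranker(nn.Module):
    """Sketch of a cross-encoder: query and doc are concatenated into one
    sequence, every token attends to every other token (all-to-all
    interaction), and a single relevance score is read off the first
    ([CLS]) position."""

    def __init__(self, vocab_size=30522, dim=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.score = nn.Linear(dim, 1)

    def forward(self, input_ids, attention_mask):
        h = self.encoder(
            self.embed(input_ids),
            src_key_padding_mask=(attention_mask == 0),  # ignore pad positions
        )
        return self.score(h[:, 0]).squeeze(-1)  # (batch,) relevance scores

# usage: score already-concatenated (query, doc) pairs, then sort docs by score
model = CrossEncoderReranker()
input_ids = torch.randint(1, 30522, (2, 16))      # 2 concatenated q+doc pairs
attention_mask = torch.ones(2, 16, dtype=torch.long)
scores = model(input_ids, attention_mask)          # shape: (2,)
```

The key contrast with ColBERT is that here the query must be re-encoded jointly with every candidate doc, so it is used for reranking a small candidate set rather than full retrieval.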
