Hi! I see that retriv's speed is really impressive in speed.md. Did you also compare t

Compare retriv's performance to rank_bm25 and pyserini

MarshtompCS commented on July 28, 2024

Compare retriv's performance to rank_bm25 and pyserini

Comments (4)

MarshtompCS commented on July 28, 2024

That's great! Thanks for reporting this!

AmenRa commented on July 28, 2024

Hi, performance should be roughly the same for pyserini and retriv.
pyserini is built on top of Lucene, and retriv's BM25 implementation is based on that of Elasticsearch, which is also built on top of Lucene. The only real difference could be the BM25 hyper-parameter settings: retriv uses the same defaults as Elasticsearch, while pyserini probably uses those of Lucene. Text pre-processing could also differ slightly. In the end, you can make them behave the same, and they should both perform similarly out of the box.
I dunno about rank_bm25. I never looked at its source code.
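To make the hyper-parameter point concrete, here is a minimal, self-contained sketch of BM25 scoring. It is not the code of retriv, pyserini, or Lucene; it uses one common formulation (IDF as log(1 + (N - df + 0.5) / (df + 0.5)), with the constant (k1 + 1) factor omitted since it does not change the ranking). The function name `bm25_scores`, the toy documents, and the two (k1, b) pairs are assumptions for illustration: 1.2/0.75 is, as far as I know, the Elasticsearch default, and 0.9/0.4 is the setting I believe pyserini ships with, so check each library's documentation before relying on them.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every tokenized document against a tokenized query.

    k1 controls term-frequency saturation, b controls length normalization.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # This IDF variant is always positive.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length-normalized saturation term.
            norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] / (tf[term] + norm)
        scores.append(score)
    return scores


# Toy collection (hypothetical data, whitespace tokenization only).
docs = [
    "bm25 is a ranking function".split(),
    "lucene implements bm25 and many other ranking functions for full text search".split(),
    "retriv is a python search engine".split(),
]
query = "bm25 ranking".split()

for k1, b in [(1.2, 0.75), (0.9, 0.4)]:
    print(k1, b, [round(s, 3) for s in bm25_scores(query, docs, k1, b)])
```

The same index and the same query produce different scores under the two settings, which is why two Lucene-based libraries can report slightly different metrics out of the box.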

MarshtompCS commented on July 28, 2024

I think it is really necessary to compare performance on benchmark datasets. pyserini's authors have said that there are many weak BM25 implementations, which lead to poor results. https://arxiv.org/pdf/2104.05740.pdf

AmenRa commented on July 28, 2024

The main problem with BM25 baselines is that most people do not optimize the hyper-parameters when performing comparisons. That's one of the main reasons retriv has a feature that lets you do that very easily.
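As a rough idea of what optimizing the hyper-parameters means, here is a minimal grid-search sketch. It is not retriv's tuning feature or its API. `grid_search` and `evaluate` are hypothetical names: `evaluate(k1, b)` stands for whatever runs your retriever with those values on a validation set and returns a metric where higher is better (MRR@10, for example). The placeholder below is a made-up function so the snippet runs on its own.

```python
from itertools import product


def grid_search(evaluate, k1_values, b_values):
    """Return (best_score, best_k1, best_b) over all (k1, b) combinations."""
    best = None
    for k1, b in product(k1_values, b_values):
        score = evaluate(k1, b)
        if best is None or score > best[0]:
            best = (score, k1, b)
    return best


def placeholder_evaluate(k1, b):
    # Stand-in for a real validation run; peaks at k1=0.9, b=0.4 by construction.
    return -((k1 - 0.9) ** 2 + (b - 0.4) ** 2)


best_score, best_k1, best_b = grid_search(
    placeholder_evaluate,
    k1_values=[0.6, 0.9, 1.2, 1.5],
    b_values=[0.2, 0.4, 0.75],
)
print(best_k1, best_b)
```

In a fair comparison the search should run on held-out validation queries, and the best (k1, b) should then be reported on the test set for every system being compared.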

Regarding performance, as of now, retriv performs as follows out of the box:
MSMARCO Dev MRR@10: 0.185 Recall: 0.873
TREC DL 2019 NDCG@10: 0.479 Recall: 0.753
TREC DL 2020 NDCG@10: 0.496 Recall: 0.811

Pyserini out-of-the-box performs as follows:
MSMARCO Dev MRR@10: 0.184 Recall: 0.853
TREC DL 2019 NDCG@10: 0.506 Recall: 0.750
TREC DL 2020 NDCG@10: 0.480 Recall: 0.786

The differences you see are mainly due to the two libraries' different default BM25 hyper-parameter settings and to slightly different text pre-processing pipelines.
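For anyone who wants to reproduce this kind of comparison, here is a small sketch of how MRR@10 and recall can be computed from ranked lists. It is not the evaluation code used for the numbers above. The function names, the toy `run` and `qrels`, and the recall cutoff are assumptions (the cutoff behind the Recall figures is not stated in this thread), and averaging conventions differ between evaluation tools.

```python
def mrr_at_k(run, qrels, k=10):
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for qid, ranked in run.items():
        relevant = qrels.get(qid, set())
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(run)


def recall_at_k(run, qrels, k=1000):
    """Mean fraction of relevant documents found in the top k.

    Queries without any relevant document are skipped (an assumption).
    """
    total, counted = 0.0, 0
    for qid, ranked in run.items():
        relevant = qrels.get(qid, set())
        if not relevant:
            continue
        total += len(relevant & set(ranked[:k])) / len(relevant)
        counted += 1
    return total / counted


# Toy data: run maps query id -> ranked doc ids, qrels maps query id -> relevant ids.
run = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d4"]}
qrels = {"q1": {"d1"}, "q2": {"d4", "d9"}}

print(mrr_at_k(run, qrels, k=10))    # 0.5
print(recall_at_k(run, qrels, k=2))  # 0.75
```

Running the same functions over the output of both libraries, with the same queries and relevance judgments, keeps the comparison down to the retrieval settings themselves.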
