

MP4 Neural Ranking Methods

Due Nov 13th, 2022 at 11:59pm

In this MP you will explore how to use neural-network-based methods for ranking in information retrieval. The goal of this assignment is to get you familiar with some popular approaches for applying neural methods to information retrieval. For each of the two portions of the assignment you need to generate candidate rankings and a discussion in which you evaluate performance and the factors that may be impacting it. To complete this assignment you will submit your code, 3 candidate ranking files, and 2 discussion files. Details on what to include in each file can be found in the deliverables sections.

IMPORTANT MAKE SURE TO READ THIS

For each of the candidate ranking files, create the ranking for the top 20 documents only! Otherwise your files will be very large! Each candidate ranking should match the TREC format, where each ranking is on its own line with the format #QUERY_ID\t0 DOCUMENTID\tRANK\tSCORE\trun_id. Remember that query IDs start at 1, document IDs at 0, and ranks at 0. The score and run_id do not impact scoring but can be useful for debugging.
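As a minimal sketch of writing a run file in this format (the helper name, the exact separators, and the results data structure are assumptions for illustration; check the Evaluation notebook for the exact layout it expects):

```python
# Minimal sketch: write top-20 rankings per query in the TREC-style format described above.
# The helper name and the `results` structure are assumptions, not part of the assignment code.
def write_trec_run(results, run_id, out_path):
    # results: dict mapping query_id -> list of (doc_id, score), sorted by score descending
    with open(out_path, "w") as f:
        for query_id, ranked in results.items():
            for rank, (doc_id, score) in enumerate(ranked[:20]):  # top 20 only
                f.write(f"{query_id}\t0\t{doc_id}\t{rank}\t{score}\t{run_id}\n")
```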

Retrieval Using Bi-Encoders and a Vector Database

While term-based retrieval methods are effective, they commonly struggle to capture the semantic differences that are common in text. Bi-encoders (also called dual encoders) are a popular method because they are incredibly efficient. Using sentence transformers and the approximate nearest neighbor (ANN) index FAISS, you will search semantically over the CS410 document corpus. The goal in this portion of the assignment is to retrieve documents using a bi-encoder and a vector database and produce a candidate ranking. Please generate a few candidate rankings and associated scores for the queries and collection found in the data folder. What differences in relevance do you see? How wide are the variations? Your output ranking should match the TREC format (#QUERY_ID\t0 DOCUMENTID\tRANK\tSCORE\trunid) and you can use the TRECEVAL notebook to explore how well different models perform.

To help you explore how bi-encoders work and how they can be used with vector databases, check out the Bi-Encoder notebook. Details on how to format submissions and evaluate them can be found in the Evaluation notebook. To avoid long inference times we have gone ahead and generated a few sets of embeddings for the document and query corpus. If you want to explore further and look for other models, they can be found in the sentence transformers library.
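As a rough sketch of the retrieval flow (the model name, the placeholder corpus, and the top-k are assumptions; the Bi-Encoder notebook and the precomputed embeddings are the authoritative starting point):

```python
import faiss
from sentence_transformers import SentenceTransformer

# Assumed model; any bi-encoder from the sentence-transformers library can be swapped in.
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = ["text of document 0", "text of document 1"]   # placeholder corpus
queries = ["example query"]                            # placeholder queries

# Encode and normalize so that inner product equals cosine similarity.
doc_emb = model.encode(docs, convert_to_numpy=True, normalize_embeddings=True)
query_emb = model.encode(queries, convert_to_numpy=True, normalize_embeddings=True)

# Build a flat inner-product FAISS index and retrieve the top 20 documents per query.
index = faiss.IndexFlatIP(doc_emb.shape[1])
index.add(doc_emb)
scores, doc_ids = index.search(query_emb, 20)
```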

Deliverables

For this portion of the assignment you will submit two things: a candidate ranking file matching the TREC format discussed above, and a brief paragraph in which you discuss what model or representation you used, its performance, and how its relevance varies with depth. For each of the candidate ranking files, create the ranking for the top 20 documents only! The files for this part of the submission should be named as follows:

  1. MP4.1-candidate-ranking.trec
  2. MP4.1-discussion.txt

Reranking using Cross Encoders

While bi-encoders can be effective, they do not always produce the most relevant candidate sets. A common approach to improve this is to use a cross encoder to rerank the candidate sets generated by either bi-encoders or term-based systems like BM25. A brief summary of how to use a cross encoder can be found in the CrossEncoder notebook. In part 1 you generated candidate rankings using bi-encoders. Using that set, and the BM25 candidates found in data/bm25-top1000, you will improve the ranking using a cross encoder. For this portion of the assignment you will need to write a reranking function that uses a cross encoder to rerank a given set of documents for a query. Running a cross encoder is more computationally expensive than a bi-encoder, so pick a reasonable depth at which to rerank the candidates. Try out some of the cross encoders found in the hub. How does performance differ? How does the size of the reranking set impact how well cross encoders work?
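As a minimal sketch of such a reranking function (the model name, the candidate data structure, and the default depth are assumptions; see the CrossEncoder notebook for the intended workflow):

```python
from sentence_transformers import CrossEncoder

# Assumed model; other cross-encoders from the hub can be substituted.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, depth=100):
    # candidates: list of (doc_id, doc_text) pairs, best-first, from the bi-encoder or BM25 run
    top = candidates[:depth]
    scores = model.predict([(query, text) for _, text in top])
    # Sort the reranked slice by cross-encoder score, highest first.
    reranked = sorted(zip(top, scores), key=lambda pair: pair[1], reverse=True)
    return [(doc_id, float(score)) for (doc_id, _text), score in reranked]
```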

Deliverables

For this portion of the assignment you will submit two candidate TREC ranking files (one reranking the bi-encoder candidates and one reranking the BM25 candidates) and a brief paragraph that discusses how you went about reranking the inputs, what models you used, how reranking impacted performance, and any similarities or differences you see between reranking the bi-encoder and BM25 candidate sets. The files for this part of the submission should be named as follows:

  1. MP4.2-bi-encoder-candidate-reranking.trec
  2. MP4.2-bm25-candidate-reranking.trec
  3. MP4.2-discussion.txt

Tips and Suggestions

To learn more about how to use ANN retrieval, we suggest checking out this demo. Another great resource is Pinecone.

To learn more about the models which can be used for semantic search, check out SBERT.

To learn more about TREC Eval and the ir-measures wrapper, check out their website.
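As a quick example of scoring a run with the ir-measures library (the qrels and run file paths and the choice of measures are assumptions; adapt them to the assignment's data and your own run):

```python
import ir_measures
from ir_measures import nDCG, P, RR

# Assumed paths; use the qrels shipped with the assignment and your own run file.
qrels = ir_measures.read_trec_qrels("data/qrels.txt")
run = ir_measures.read_trec_run("MP4.1-candidate-ranking.trec")

# Aggregate scores over all queries for a few common measures.
print(ir_measures.calc_aggregate([nDCG@10, P@10, RR], qrels, run))
```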

The BM25 candidate set was generated using pyserini, one of the most popular Python-based search libraries out there. If you want to explore further experiments or see how ranking methods work together, check out the examples in their repo.
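For reference, a minimal BM25 search with pyserini looks roughly like the following (the index path is an assumption; the assignment already ships the BM25 top-1000 run, so this is only for your own exploration):

```python
from pyserini.search.lucene import LuceneSearcher

# Assumed index path; build or download a Lucene index over the corpus before running this.
searcher = LuceneSearcher("indexes/cs410-corpus")

hits = searcher.search("example query", k=1000)
for rank, hit in enumerate(hits[:20]):  # top 20 for the TREC run file
    print(hit.docid, rank, hit.score)
```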
