

MP4 Neural Ranking Methods

Due Nov 13th, 2022 at 11:59pm

In this MP you will explore how to use neural-network-based methods for ranking in information retrieval. The goal of this assignment is to get you familiar with some popular approaches for applying neural methods to information retrieval. For each of the two portions of the assignment you need to generate candidate rankings and a discussion in which you evaluate performance and the factors that may be impacting it. To complete this assignment you will submit your code, 3 candidate ranking files, and 2 discussion files. Details on what to include in each file can be found in the deliverables sections.

IMPORTANT MAKE SURE TO READ THIS

For each of the candidate ranking files, create the ranking for the top 20 documents only! Otherwise your files will be very large! Each candidate ranking should match the TREC format, where each ranking is on its own line with the format #QUERY_ID\t0 DOCUMENTID\tRANK\tSCORE\trun_id. Remember that query IDs start at 1, document IDs at 0, and ranks at 0. The score and run_id do not impact scoring but can be useful for debugging.
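As a minimal sketch of writing a run file in this format (the helper name, the exact separators, and the results data structure are assumptions for illustration; check the Evaluation notebook for the exact layout it expects):

```python
# Minimal sketch: write top-20 rankings per query in the TREC-style format described above.
# The helper name and the `results` structure are assumptions, not part of the assignment code.
def write_trec_run(results, run_id, out_path):
    # results: dict mapping query_id -> list of (doc_id, score), sorted by score descending
    with open(out_path, "w") as f:
        for query_id, ranked in results.items():
            for rank, (doc_id, score) in enumerate(ranked[:20]):  # top 20 only
                f.write(f"{query_id}\t0\t{doc_id}\t{rank}\t{score}\t{run_id}\n")
```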

Retrieval Using Bi-Encoders and a Vector Database

While term-based retrieval methods are effective, they commonly struggle to capture the semantic differences that are common in text. Bi-encoders (also called dual encoders) are a popular method because they are incredibly efficient. Using sentence transformers and the approximate nearest neighbor (ANN) index FAISS, you will search semantically over the CS410 document corpus. The goal in this portion of the assignment is to retrieve documents using a bi-encoder and a vector database and produce a candidate ranking. Please generate a few candidate rankings and associated scores for the queries and collection found in the data folder. What differences in relevance do you see? How wide are the variations? Your output ranking should match the TREC format (#QUERY_ID\t0 DOCUMENTID\tRANK\tSCORE\trunid) and you can use the TRECEVAL notebook to explore how well different models perform.

To help you explore how bi-encoders work and how they can be used with vector databases, check out the Bi-Encoder notebook. Details on how to format submissions and evaluate them can be found in the Evaluation notebook. To avoid long inference times we have gone ahead and generated a few sets of embeddings for the document and query corpus. If you want to explore further and look for other models, they can be found in the sentence transformers library.
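As a rough sketch of the retrieval flow (the model name, the placeholder corpus, and the top-k are assumptions; the Bi-Encoder notebook and the precomputed embeddings are the authoritative starting point):

```python
import faiss
from sentence_transformers import SentenceTransformer

# Assumed model; any bi-encoder from the sentence-transformers library can be swapped in.
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = ["text of document 0", "text of document 1"]   # placeholder corpus
queries = ["example query"]                            # placeholder queries

# Encode and normalize so that inner product equals cosine similarity.
doc_emb = model.encode(docs, convert_to_numpy=True, normalize_embeddings=True)
query_emb = model.encode(queries, convert_to_numpy=True, normalize_embeddings=True)

# Build a flat inner-product FAISS index and retrieve the top 20 documents per query.
index = faiss.IndexFlatIP(doc_emb.shape[1])
index.add(doc_emb)
scores, doc_ids = index.search(query_emb, 20)
```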

Deliverables

For this portion of the assignment you will submit two things: a candidate ranking file matching the TREC format discussed above, and a brief paragraph in which you discuss what model or representation you used, its performance, and how its relevance varies with depth. For each of the candidate ranking files, create the ranking for the top 20 documents only! The files for this part of the submission should be named as follows:

  1. MP4.1-candidate-ranking.trec
  2. MP4.1-discussion.txt

Reranking using Cross Encoders

While bi-encoders can be effective, they do not always produce the most relevant candidate sets. A common approach to improve this is to use a cross encoder to rerank the candidate sets generated by either bi-encoders or term-based systems like BM25. A brief summary of how to use a cross encoder can be found in the CrossEncoder notebook. In part 1 you generated candidate rankings using bi-encoders. Using that set, and the BM25 candidates found in data/bm25-top1000, you will improve the ranking using a cross encoder. For this portion of the assignment you will need to write a reranking function that uses a cross encoder to rerank a given set of documents for a query. Running a cross encoder is more computationally expensive than a bi-encoder, so pick a reasonable depth at which to rerank the candidates. Try out some of the cross encoders found in the hub. How does performance differ? How does the size of the reranking set impact how well cross encoders work?
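As a minimal sketch of such a reranking function (the model name, the candidate data structure, and the default depth are assumptions; see the CrossEncoder notebook for the intended workflow):

```python
from sentence_transformers import CrossEncoder

# Assumed model; other cross-encoders from the hub can be substituted.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, depth=100):
    # candidates: list of (doc_id, doc_text) pairs, best-first, from the bi-encoder or BM25 run
    top = candidates[:depth]
    scores = model.predict([(query, text) for _, text in top])
    # Sort the reranked slice by cross-encoder score, highest first.
    reranked = sorted(zip(top, scores), key=lambda pair: pair[1], reverse=True)
    return [(doc_id, float(score)) for (doc_id, _text), score in reranked]
```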

Deliverables

For this portion of the assignment you will submit two candidate TREC ranking files (one reranking the bi-encoder candidates and one reranking the BM25 candidates) and a brief paragraph that discusses how you went about reranking the inputs, what models you used, how reranking impacted performance, and any similarities or differences you see between reranking the bi-encoder and BM25 candidate sets. The files for this part of the submission should be named as follows:

  1. MP4.2-bi-encoder-candidate-reranking.trec
  2. MP4.2-bm25-candidate-reranking.trec
  3. MP4.2-discussion.txt

Tips and Suggestions

To learn more about how to use ANN retrieval, we suggest checking out this demo. Another great resource is Pinecone.

To learn more about the models which can be used for semantic search, check out SBERT.

To learn more about TREC Eval and the ir-measures wrapper, check out their website.
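As a quick example of scoring a run with the ir-measures library (the qrels and run file paths and the choice of measures are assumptions; adapt them to the assignment's data and your own run):

```python
import ir_measures
from ir_measures import nDCG, P, RR

# Assumed paths; use the qrels shipped with the assignment and your own run file.
qrels = ir_measures.read_trec_qrels("data/qrels.txt")
run = ir_measures.read_trec_run("MP4.1-candidate-ranking.trec")

# Aggregate scores over all queries for a few common measures.
print(ir_measures.calc_aggregate([nDCG@10, P@10, RR], qrels, run))
```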

The BM25 candidate set was generated using pyserini, one of the most popular Python-based search libraries out there. If you want to explore further experiments or see how ranking methods work together, check out the examples in their repo.
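For reference, a minimal BM25 search with pyserini looks roughly like the following (the index path is an assumption; the assignment already ships the BM25 top-1000 run, so this is only for your own exploration):

```python
from pyserini.search.lucene import LuceneSearcher

# Assumed index path; build or download a Lucene index over the corpus before running this.
searcher = LuceneSearcher("indexes/cs410-corpus")

hits = searcher.search("example query", k=1000)
for rank, hit in enumerate(hits[:20]):  # top 20 for the TREC run file
    print(hit.docid, rank, hit.score)
```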
