tripleR

This is the repository for tripleR: Recontextualize, Revise and Retrieve, the final project for SNU 23 Spring Natural Language Processing.

This project is based on GPL: Generative Pseudo Labeling (NAACL 2022).

You can also refer to the presentation slides for details.

Our contributions are as follows:

  1. We address a problem of LM-based retrieval: the generated pseudo-documents contain errors.

  2. We hypothesize that thoroughly revising the pseudo-document against the original document will improve retrieval performance.

  3. We propose tripleR, a set of novel methods for recontextualizing and revising pseudo-documents for dense retrieval.

  4. We conduct extensive experiments and achieve better performance than the original paper.

Installation

Clone the repository and install the necessary packages with pip:

git clone https://github.com/lilys012/tripleR.git && cd tripleR
pip install -e .

Please also make sure that the PyTorch version you have installed matches your CUDA version.

Dataset

tripleR accepts data in the BeIR format. Please make sure the dataset is in the tripleR/dataset directory.

Otherwise, you can modify the evaluation_output argument below.
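As a quick sanity check before training, the sketch below tests whether a directory follows the usual BeIR layout (corpus.jsonl, queries.jsonl and qrels/<split>.tsv). The helper names missing_beir_files and write_toy_dataset are hypothetical and not part of this repository, and the toy records exist only to illustrate the expected fields.

```python
import json
import tempfile
from pathlib import Path

# File names follow the public BeIR convention. The helpers below are
# hypothetical illustrations, not functions shipped with tripleR.
REQUIRED_FILES = ("corpus.jsonl", "queries.jsonl")


def missing_beir_files(dataset_dir, split="test"):
    """Return the relative paths of BeIR files that are absent from dataset_dir."""
    root = Path(dataset_dir)
    expected = [*REQUIRED_FILES, f"qrels/{split}.tsv"]
    return [name for name in expected if not (root / name).is_file()]


def write_toy_dataset(dataset_dir, split="test"):
    """Write a one-document, one-query dataset in BeIR layout (illustration only)."""
    root = Path(dataset_dir)
    (root / "qrels").mkdir(parents=True, exist_ok=True)
    doc = {"_id": "d1", "title": "Toy title", "text": "Toy passage."}
    query = {"_id": "q1", "text": "toy query"}
    (root / "corpus.jsonl").write_text(json.dumps(doc) + "\n", encoding="utf-8")
    (root / "queries.jsonl").write_text(json.dumps(query) + "\n", encoding="utf-8")
    # The qrels file is tab-separated with a header row.
    (root / "qrels" / f"{split}.tsv").write_text(
        "query-id\tcorpus-id\tscore\nq1\td1\t1\n", encoding="utf-8"
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        write_toy_dataset(tmp)
        print(missing_beir_files(tmp))
```

An empty list means all three files were found. Point the function at your own dataset directory to check it the same way.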

Usage

We currently support three datasets: arguana, nfcorpus and scifact.

You can run our code with the command below.

python tripleR.py \
    --dataset "arguana" \
    --method 0
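To sweep several datasets and methods, the command above can be generated programmatically. The following is a minimal sketch that only uses the two flags shown above and assumes the method integers 0 through 8 listed in this README; the build_command helper is hypothetical and not part of the repository.

```python
import shlex

# Dataset names and method integers are taken from this README.
DATASETS = ("arguana", "nfcorpus", "scifact")
IMPLEMENTED_METHODS = range(0, 9)  # methods 0 through 8


def build_command(dataset, method):
    """Build the argument list for one tripleR.py run (hypothetical helper)."""
    if dataset not in DATASETS:
        raise ValueError(f"unsupported dataset: {dataset!r}")
    if method not in IMPLEMENTED_METHODS:
        raise ValueError(f"method {method} is not implemented")
    return ["python", "tripleR.py", "--dataset", dataset, "--method", str(method)]


if __name__ == "__main__":
    # Print every command instead of running it, so the sweep can be reviewed first.
    for dataset in DATASETS:
        for method in IMPLEMENTED_METHODS:
            print(shlex.join(build_command(dataset, method)))
```

To actually launch a run, pass the returned list to subprocess.run(cmd, check=True) from the repository root.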

We currently offer 8 methods including the modified default GPL, but 13 methods were initially sketched.

The integer assigned to each method is listed below.

methods

0 : default
1 : generate pseudo-document w/ flan-t5-xl and task-specific prompt
2 : pseudo-document is revised w/ flan-t5-xl based on doc
3 : erase pseudo-document based on confidence
4 : [MASK] random 1 + put to distilbert-base-uncased {pseudo-doc [SEP] doc} to revise
5 : [MASK] pseudo-document based on confidence + put to distilbert-base-uncased {pseudo-doc [SEP] doc} to revise
6 : concatenate v3 with generated query
7 : concatenate v5 with generated query
8 : revise v3 with flan-t5-xl

-- not implemented --
9 : generate pseudo-document w/ flan-t5-xl and one-shot example
10 : [MASK] pseudo-document based on confidence + put to decoder {prompt, pseudo-doc, doc} to revise (X)
11 : [MASK] pseudo-document based on confidence + retrieve relevant docs to revise
12 : generate queries in style of msmarco queries (distribution, few-shot, etc..)

Methods 2 and 4 are implemented in data_loader.py, while methods 1, 3 and 5 through 8 are implemented in auto_model.py. In addition, train.py is modified from the original GPL code. You can easily find the if-statements for each method, or refer to the commits named "comments".
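To give a rough idea of what "based on confidence" means in methods 3 and 5, here is a toy sketch that erases or masks low-confidence tokens of a pseudo-document. It is not the code in auto_model.py: the function name, the per-token confidence scores and the 0.5 threshold are all assumptions made for illustration.

```python
def apply_confidence_filter(tokens, confidences, threshold=0.5, mask_token="[MASK]"):
    """Mask (or erase) tokens whose confidence is below threshold.

    Hypothetical helper: pass mask_token=None to erase low-confidence tokens
    (in the spirit of method 3) instead of replacing them (method 5).
    """
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")
    kept = []
    for token, confidence in zip(tokens, confidences):
        if confidence >= threshold:
            kept.append(token)
        elif mask_token is not None:
            kept.append(mask_token)
        # When mask_token is None the low-confidence token is simply dropped.
    return kept


if __name__ == "__main__":
    tokens = ["paris", "is", "in", "spain"]
    confidences = [0.9, 0.95, 0.9, 0.2]  # made-up scores for illustration
    print(apply_confidence_filter(tokens, confidences))
    print(apply_confidence_filter(tokens, confidences, mask_token=None))
```

In the masked variant, the [MASK] positions are what a masked language model such as distilbert-base-uncased would then be asked to fill in, given the original document as context.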

Modification

We slightly modified the training configuration of GPL due to limited resources.

We decreased the number of training steps from 140K to 70K and changed |corpus size| x queries per passage from 250K to 100K. Furthermore, we used neither TSDAE nor TAS-B.
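The effect of the smaller |corpus size| x queries per passage budget can be worked out with simple arithmetic. The sketch below assumes the number of queries per passage is the budget divided by the corpus size, rounded up, with a minimum of one; the rounding rule and the corpus size of 5,000 are illustrative assumptions, not values taken from this repository.

```python
import math


def queries_per_passage(corpus_size, target_pairs=100_000):
    """Queries to generate per passage so that corpus_size * result >= target_pairs.

    Rounding up and the floor of 1 are assumptions made for this sketch.
    """
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return max(1, math.ceil(target_pairs / corpus_size))


if __name__ == "__main__":
    corpus_size = 5_000  # hypothetical corpus size
    print(queries_per_passage(corpus_size, target_pairs=250_000))  # original GPL budget
    print(queries_per_passage(corpus_size, target_pairs=100_000))  # reduced budget
```

For this hypothetical corpus the reduced budget lowers the number of generated queries per passage from 50 to 20.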

Results

We report our experimental results below.

Note that v1 and v2 for nfcorpus and arguana were run with the original GPL training settings, and v8 was run on the arguana dataset only.
