(CO-)PACRR Neural IR models

This is a Keras (TensorFlow) implementation of the model described in:

Kai Hui, Andrew Yates, Klaus Berberich, Gerard de Melo. PACRR: A Position-Aware Neural IR Model for Relevance Matching. In EMNLP, 2017.

Kai Hui, Andrew Yates, Klaus Berberich, Gerard de Melo. Co-PACRR: A Context-Aware Neural IR Model for Ad-hoc Retrieval. In WSDM, 2018.

Contact

Code authors: Kai Hui and Andrew Yates

Pull requests and issues: @khui @andrewyates

Model overview

The (RE-)PACRR model captures the interaction between a query-document pair and scores the relevance of the document to the query. The input is the similarity matrix between a query and a document, and the output is a scalar. The full pipeline of the model will be published soon.

Getting Started

Install Required Packages

First install Anaconda (recommended).

Then install TensorFlow and Keras in the Anaconda environment.

Preparation: word2vec and similarity matrices

The preprocessing step generates a similarity matrix for each query-document pair.

There are three phases:

  1. Extract the text from the WARC files;

  2. Prepare the word embeddings for the individual terms in the queries and documents, using pre-trained word2vec vectors. Terms that are missing from the pre-trained word2vec need vectors of their own. In the preprocess/wordembedding directory, train_w2v.py fills in the missing vectors by continuing to train word2vec while keeping the vectors that are already present fixed;

  3. Compute the similarity matrices for the individual query-document pairs (not included in the code).
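Since the third phase is not part of the code, the following is a minimal sketch of how such a matrix could be computed, not the authors' implementation. It assumes cosine similarity between term embeddings; the function names (`cosine`, `similarity_matrix`) and the toy two-dimensional embeddings are hypothetical and only serve as an illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(query_terms, doc_terms, embeddings):
    """Return a len(query_terms) x len(doc_terms) matrix of term-term similarities.

    Row i corresponds to the i-th query term and column j to the j-th
    document term, so the positions of document terms are preserved.
    """
    return [
        [cosine(embeddings[q], embeddings[d]) for d in doc_terms]
        for q in query_terms
    ]


# Toy embeddings (hypothetical values, standing in for word2vec vectors).
toy_embeddings = {
    "neural": [1.0, 0.0],
    "network": [0.0, 1.0],
    "model": [1.0, 1.0],
}

sim = similarity_matrix(
    ["neural", "model"], ["neural", "network", "model"], toy_embeddings
)
for row in sim:
    print([round(value, 3) for value in row])
```

Each row holds one query term and each column one document position, which is the kind of query-by-document matrix the model takes as input.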

For now, we provide pre-computed similarity matrices. Download the similarity matrices for ClueWeb as described in PACRR and RE-PACRR, then unpack them by running the following:

       tar xvf simmat.tar.gz

Usage: train, predict and evaluation

Set $parentdir in the *.sh scripts to the root directory for all outputs.

Set sim_dir in utils/config.py to the directory holding the similarity matrices.

Train

python -m train_model with expname=$expname train_years=$train_years {param_name=param_val}

or use the script

bash bin/train_model.sh

Configure different parameters in train_model.sh or utils/config.py

Predict

python -m pred_per_epoch with expname=$expname train_years=$train_years test_year=$test_year {param_name=param_val}

or use the script

bash bin/pred_per_epoch.sh

Configure different parameters in pred_per_epoch.sh or utils/config.py

Evaluation

Evaluate the predictions on the three benchmarks described in our RE-PACRR paper. Note that for the Rerank-ALL benchmark, all TREC runs and their corresponding evaluation results must be placed under data/trec_runs and data/eval_trec_runs, respectively.

The evaluation code corresponds exactly to the experiments reported in the PACRR and RE-PACRR papers, but it can serve as the basis for an evaluation pipeline on a different dataset or different benchmarks. For example, the use of six years of Web Track data for training, validation, and testing is hard-coded in train_test_years in utils/config.py, which must be edited when fewer years are used.

python -m evals.docpairs with expname=$expname train_years=$train_years {param_name=param_val}

python -m evals.rerank with expname=$expname train_years=$train_years {param_name=param_val}

or use the script

bash bin/evals.sh

Configure different parameters in evals.sh or utils/config.py

Citation

If you use the code, please cite the following papers:

PACRR

@inproceedings{hui2017pacrr,
  title={PACRR: A Position-Aware Neural IR Model for Relevance Matching},
  author={Hui, Kai and Yates, Andrew and Berberich, Klaus and de Melo, Gerard},
  booktitle={Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},
  pages={1060--1069},
  year={2017}
}

CO-PACRR

@inproceedings{hui2018co,
  title={Co-PACRR: A Context-Aware Neural IR Model for Ad-hoc Retrieval},
  author={Hui, Kai and Yates, Andrew and Berberich, Klaus and de Melo, Gerard},
  booktitle={Proceedings of Web Search and Data Mining 2018},
  year={2018},
  location={Los Angeles, CA USA}
}
