Sphere

About

In our paper "The Web Is Your Oyster - Knowledge-Intensive NLP against a Very Large Web Corpus", we propose to use a web corpus as a universal, uncurated and unstructured knowledge source for multiple KI-NLP tasks at once.

We leverage an open web corpus coupled with strong retrieval baselines instead of a black-box, commercial search engine - an approach which facilitates transparent and reproducible research and opens up a path for future studies comparing search engines optimised for humans with retrieval solutions designed for neural networks. We use a subset of CCNet covering 134M documents split into 906M passages as the web corpus which we call Sphere.

In this repository we open source indices of Sphere for both the sparse retrieval baseline, compatible with Pyserini, and our best dense model, compatible with distributed-faiss. We also provide instructions on how to evaluate retrieval performance with both standard and newly introduced retrieval metrics, using the KILT API.

Reference

If you use the content of this repository in your research, please cite the following:

@article{DBLP:journals/corr/abs-2112-09924,
  author    = {Aleksandra Piktus and Fabio Petroni
               and Vladimir Karpukhin and Dmytro Okhonko
               and Samuel Broscheit and Gautier Izacard
               and Patrick Lewis and Barlas Oguz
               and Edouard Grave and Wen{-}tau Yih
               and Sebastian Riedel},
  title     = {The Web Is Your Oyster - Knowledge-Intensive {NLP} against a Very
               Large Web Corpus},
  journal   = {CoRR},
  volume    = {abs/2112.09924},
  year      = {2021},
  url       = {https://arxiv.org/abs/2112.09924},
  eprinttype = {arXiv},
  eprint    = {2112.09924},
  timestamp = {Tue, 04 Jan 2022 15:59:27 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2112-09924.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Installation

git clone git@github.com:facebookresearch/Sphere.git
cd Sphere
conda create -n sphere -y python=3.7 && conda activate sphere
pip install -e .

Index download

We open source pre-built Sphere indices for both the sparse and the dense retriever. You can download and unpack the respective index files directly, e.g. via the browser or wget:

mkdir -p faiss_index

wget -P faiss_index https://dl.fbaipublicfiles.com/sphere/sphere_sparse_index.tar.gz
tar -xzvf faiss_index/sphere_sparse_index.tar.gz -C faiss_index

wget -P faiss_index https://dl.fbaipublicfiles.com/sphere/sphere_dense_index.tar.gz
tar -xzvf faiss_index/sphere_dense_index.tar.gz -C faiss_index

Evaluation with KILT

We implement the retrieval metrics introduced in the paper:

  • the answer-in-context@k,
  • the answer+entity-in-context@k,
  • as well as the entity-in-input ablation metric

within the KILT repository. Follow the instructions below to perform and evaluate retrieval on KILT tasks for both the sparse and the dense Sphere index.
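
As an illustration of what an answer-in-context@k style metric measures, here is a minimal sketch. It assumes a query counts as a hit when any gold answer string occurs in at least one of the top-k retrieved passages after a simple normalization. The function names and the normalization are hypothetical and chosen for illustration; the implementation in the KILT repository is the reference.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace (an assumed, simplified normalization)."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def answer_in_context_at_k(passages, answers, k):
    """Return True if any gold answer occurs in one of the top-k passages."""
    top_k = [normalize(p) for p in passages[:k]]
    gold = [normalize(a) for a in answers if a.strip()]
    return any(a in p for a in gold for p in top_k)


def mean_answer_in_context_at_k(examples, k):
    """Average the per-query hit indicator over (passages, answers) pairs."""
    hits = [answer_in_context_at_k(p, a, k) for p, a in examples]
    return sum(hits) / len(hits) if hits else 0.0


# Toy data, invented for illustration: ranked passages and gold answers per query.
examples = [
    (["Paris is the capital of France.", "Berlin is in Germany."], ["Paris"]),
    (["The Nile flows north.", "The Amazon is in South America."], ["Amazon"]),
]
score_at_1 = mean_answer_in_context_at_k(examples, k=1)
score_at_2 = mean_answer_in_context_at_k(examples, k=2)
print(score_at_1, score_at_2)
```

In the toy data the second query only finds its answer in the second-ranked passage, so the score rises when k goes from 1 to 2.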

KILT dependencies

pip install -e git+https://github.com/facebookresearch/KILT#egg=KILT

Download KILT data. Check out instructions in the KILT repo for more details.

mkdir -p data
python src/kilt/scripts/download_all_kilt_data.py
python src/kilt/scripts/get_triviaqa_input.py

Dense index

Install dependencies

pip install -e git+https://github.com/facebookresearch/distributed-faiss#egg=distributed-faiss
pip install -e git+https://github.com/facebookresearch/DPR@multi_task_training#egg=DPR
pip install spacy==2.1.8
python -m spacy download en

Launch distributed-faiss server

See the distributed-faiss repository for more details.

python src/distributed-faiss/scripts/server_launcher.py \
    --log-dir logs \
    --discovery-config faiss_index/disovery_config.txt \
    --num-servers 32 \
    --num-servers-per-node 4 \
    --timeout-min 4320 \
    --save-dir faiss_index/ \
    --mem-gb 500 \
    --base-port 13034 \
    --partition dev &

Download assets

mkdir -p checkpoints
wget -P checkpoints http://dl.fbaipublicfiles.com/sphere/dpr_web_biencoder.cp

mkdir -p configs
wget -P configs https://dl.fbaipublicfiles.com/sphere/dpr_web_sphere.yaml

Then update the following fields in the dpr_web_sphere.yaml configuration file:

n_docs: 100 # the number of documents to retrieve per query
model_file: checkpoints/dpr_web_biencoder.cp # path to the downloaded model file
rpc_retriever_cfg_file: faiss_index/disovery_config.txt # path to the discovery config file used when launching the distributed-faiss server
rpc_index_id: dense # the name of the folder containing dense index partitions

Execute retrieval

To perform retrieval from the dense index, you first need to launch the distributed-faiss server as described above. You can control which KILT datasets retrieval runs on by modifying the respective config files, e.g. src/kilt/kilt/configs/dev_data.json.

python src/kilt/scripts/execute_retrieval.py \
    --model_name dpr_distr \
    --model_configuration configs/dpr_web_sphere.yaml \
    --test_config src/kilt/kilt/configs/dev_data.json \
    --output_folder output/dense/
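
To restrict retrieval to a few datasets, one option is to write a reduced copy of the test config and pass it via --test_config. The sketch below assumes the config is a JSON object mapping a task group to a {dataset name: file path} object; that layout, the group names and the helper name are assumptions for illustration, so check the actual dev_data.json before relying on it.

```python
import json


def select_datasets(config, keep):
    """Keep only the named datasets, dropping groups that become empty."""
    selected = {}
    for group, datasets in config.items():
        kept = {name: path for name, path in datasets.items() if name in keep}
        if kept:
            selected[group] = kept
    return selected


# Hypothetical config contents, for illustration only.
example_config = {
    "QA": {
        "nq": "data/nq-dev-kilt.jsonl",
        "triviaqa": "data/triviaqa-dev-kilt.jsonl",
    },
    "Fact Checking": {"fever": "data/fever-dev-kilt.jsonl"},
}
subset = select_datasets(example_config, keep={"nq"})
print(json.dumps(subset, indent=2))
```

Writing the result to a new file, rather than editing the original config in place, keeps the full dataset list available for later runs.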

Sparse index

Install dependencies

Our sparse index relies on Pyserini, and therefore requires Java 11 to be installed on the machine.

pip install jnius
pip install pyserini==0.9.4.0

Next, download the following file:

mkdir -p configs
wget -P configs https://dl.fbaipublicfiles.com/sphere/bm25_sphere.json

Then update the following fields in the bm25_sphere.json configuration file (the # annotations below are explanatory only; JSON itself does not allow comments):

"k": 100, # the number of documents to retrieve per query
"index": "faiss_index/sparse", # path to the unpacked sparse BM25 index

Execute retrieval

python src/kilt/scripts/execute_retrieval.py \
    --model_name bm25 \
    --model_configuration configs/bm25_sphere.json \
    --test_config src/kilt/kilt/configs/dev_data.json \
    --output_folder output/sparse/
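
The update of bm25_sphere.json described above can also be scripted. This is a minimal sketch that assumes "k" and "index" are top-level keys of the downloaded file; the helper name is hypothetical, and the demo runs on a throwaway file rather than the real config.

```python
import json
import tempfile
from pathlib import Path


def update_bm25_config(path, k, index_dir):
    """Set the retrieval depth and index path, leaving other keys untouched."""
    config_path = Path(path)
    config = json.loads(config_path.read_text())
    config["k"] = k  # number of documents to retrieve per query
    config["index"] = index_dir  # path to the unpacked sparse BM25 index
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Demo on a temporary file; in practice pass configs/bm25_sphere.json instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "bm25_sphere.json"
    demo_path.write_text(json.dumps({"k": 10, "index": "path/to/index"}))
    updated = update_bm25_config(demo_path, k=100, index_dir="faiss_index/sparse")
    reloaded = json.loads(demo_path.read_text())
    print(updated)
```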

Retrieval evaluation

# $index is the output subfolder (dense or sparse), $dataset the KILT dataset name
# first argument: retrieval results - the output of running execute_retrieval.py
# second argument: gold KILT file (available for download in the KILT repo)
python src/kilt/kilt/eval_retrieval.py \
    output/$index/$dataset-dev-kilt.jsonl \
    data/$dataset-dev-kilt.jsonl \
    --ks="1,20,100"
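
When evaluating several index and dataset combinations, the command can be assembled programmatically. The sketch below only builds and prints the command lines, following the path pattern used in this section; the helper name and the dataset names "nq" and "fever" are illustrative assumptions.

```python
import shlex


def build_eval_command(index, dataset, ks=(1, 20, 100)):
    """Build the eval_retrieval.py argument list for one index/dataset pair."""
    return [
        "python",
        "src/kilt/kilt/eval_retrieval.py",
        "output/{}/{}-dev-kilt.jsonl".format(index, dataset),  # retrieval results
        "data/{}-dev-kilt.jsonl".format(dataset),  # gold KILT file
        "--ks=" + ",".join(str(k) for k in ks),
    ]


commands = [
    build_eval_command(index, dataset)
    for index in ("dense", "sparse")
    for dataset in ("nq", "fever")  # example dataset names
]
for cmd in commands:
    # Quote each argument so the printed line can be pasted into a shell.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

To execute a command instead of printing it, pass the list to subprocess.run(cmd, check=True).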

Standalone dense index usage

Install and launch distributed-faiss. See the distributed-faiss repository for more details on the server.

pip install -e git+https://github.com/facebookresearch/distributed-faiss#egg=distributed-faiss
python src/distributed-faiss/scripts/server_launcher.py \
    --log-dir logs/ \
    --discovery-config faiss_index/disovery_config.txt \
    --num-servers 32 \
    --num-servers-per-node 4 \
    --timeout-min 4320 \
    --save-dir faiss_index/ \
    --mem-gb 500 \
    --base-port 13034 \
    --partition dev &

Standalone client example

For a minimal working example of querying the Sphere dense index, we suggest interacting with the DPR model via the transformers API. To that end, please install the dependencies:

pip install transformers==4.17.0

Using the DPR checkpoint with the transformers API requires reformatting the original checkpoint. You can download and unpack the transformers-compatible DPR_web query encoder here:

mkdir -p checkpoints
wget -P checkpoints https://dl.fbaipublicfiles.com/sphere/dpr_web_query_encoder_hf.tar.gz
tar -xzvf checkpoints/dpr_web_query_encoder_hf.tar.gz -C checkpoints/

Alternatively, you can convert the dpr_web_biencoder.cp model yourself using available scripts.

Then you can run the interactive demo:

python scripts/sphere_client_demo_hf.py \
    --encoder checkpoints/dpr_web_query_encoder_hf \
    --discovery-config faiss_index/disovery_config.txt \
    --index-id dense

License

Sphere is released under the CC-BY-NC 4.0 license. See the LICENSE file for details.
