
genre's Introduction

The GENRE (Generative ENtity REtrieval) system, as presented in Autoregressive Entity Retrieval, implemented in PyTorch.

@inproceedings{decao2020autoregressive,
  title={Autoregressive Entity Retrieval},
  author={Nicola De Cao and Gautier Izacard and Sebastian Riedel and Fabio Petroni},
  booktitle={International Conference on Learning Representations},
  url={https://openreview.net/forum?id=5k8F6UU39V},
  year={2021}
}

The mGENRE system as presented in Multilingual Autoregressive Entity Linking

@inproceedings{decao2020multilingual,
  title={Multilingual Autoregressive Entity Linking}, 
  author={Nicola De Cao and Ledell Wu and Kashyap Popat and Mikel Artetxe and 
          Naman Goyal and Mikhail Plekhanov and Luke Zettlemoyer and 
          Nicola Cancedda and Sebastian Riedel and Fabio Petroni},
  booktitle={arXiv pre-print 2103.12528},
  url={https://arxiv.org/abs/2103.12528},
  year={2021},
}

Please consider citing our works if you use code from this repository.

In a nutshell, (m)GENRE uses a sequence-to-sequence approach to entity retrieval (e.g., linking), based on a fine-tuned BART architecture, or mBART for the multilingual setting. (m)GENRE performs retrieval by generating the unique entity name conditioned on the input text, using constrained beam search so that only valid identifiers are generated.

[Figure: example of generation for Wikipedia page retrieval for open-domain question answering.]

For end-to-end entity linking, GENRE re-generates the input text annotated with markup.
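To make the annotated-output idea concrete, here is a minimal sketch of reading mention/entity pairs back out of a re-generated sentence. It assumes the markup has the form `{ mention } [ Entity title ]`; that format, the example sentence, and the helper names `extract_links` and `strip_markup` are assumptions for illustration and are not defined by the text above.

```python
import re
from typing import List, Tuple

# ASSUMPTION: annotations look like "{ mention } [ Entity title ]".
ANNOTATION = re.compile(
    r"\{\s*(?P<mention>[^{}]+?)\s*\}\s*\[\s*(?P<entity>[^\[\]]+?)\s*\]"
)


def extract_links(annotated: str) -> List[Tuple[str, str]]:
    """Return (mention, entity title) pairs found in the annotated text."""
    return [(m.group("mention"), m.group("entity"))
            for m in ANNOTATION.finditer(annotated)]


def strip_markup(annotated: str) -> str:
    """Recover the plain input text by keeping only the mention of each annotation."""
    return ANNOTATION.sub(lambda m: m.group("mention"), annotated)


# Hypothetical annotated output, written by hand for this sketch.
annotated = ("In 1921, { Einstein } [ Albert Einstein ] received the "
             "{ Nobel Prize } [ Nobel Prize in Physics ].")

print(extract_links(annotated))
print(strip_markup(annotated))
```

Because the output is the input text plus markup, stripping the annotations should give back the original sentence, which is a cheap sanity check on any generated annotation.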

GENRE achieves state-of-the-art results on multiple datasets.

mGENRE performs multilingual entity linking in 100+ languages, treating the language as a latent variable and marginalizing over it.
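The marginalization idea can be sketched in a few lines: several generated strings (for example the same entity named in different languages) may map to one entity, so their probabilities are summed, which in log space is a log-sum-exp. The function names, entity IDs, and probabilities below are made up for illustration; the library's own scoring may differ (for instance through length normalization), so treat this as a sketch of the principle only.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def logsumexp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def marginalize(hypotheses: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Group (entity_id, log_prob) pairs by entity and sum their probabilities.

    Returns (entity_id, marginal_log_prob) pairs, best first.
    """
    grouped: Dict[str, List[float]] = defaultdict(list)
    for entity_id, log_prob in hypotheses:
        grouped[entity_id].append(log_prob)
    scored = {eid: logsumexp(lps) for eid, lps in grouped.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical beam: two surface forms for entity "E1", one for "E2".
hypotheses = [
    ("E1", math.log(0.3)),
    ("E2", math.log(0.4)),
    ("E1", math.log(0.2)),
]
ranked = marginalize(hypotheses)
print(ranked[0][0])  # "E1": 0.3 + 0.2 = 0.5 outranks 0.4
```

Note that "E2" has the single most probable string, yet "E1" wins once its two surface forms are combined; this is the effect marginalization is meant to capture.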

Main dependencies

  • python>=3.7
  • pytorch>=1.6
  • fairseq>=0.10 (optional for training GENRE) NOTE: fairseq is going through changes without backward compatibility. Install fairseq from source and use this commit for reproducibility. See here for the current PR that should fix fairseq/master.
  • transformers>=4.2 (optional for inference of GENRE)

Examples & Usage

For a full review of the (m)GENRE API, see:

GENRE

After importing and loading the model and a prefix tree (trie), you would generate predictions (in this example for Entity Disambiguation) with a simple call like:

import pickle
from genre.trie import Trie
from genre.fairseq_model import GENRE

# load the prefix tree (trie)
with open("../data/kilt_titles_trie_dict.pkl", "rb") as f:
    trie = Trie.load_from_dict(pickle.load(f))

# load the model
model = GENRE.from_pretrained("models/fairseq_entity_disambiguation_aidayago").eval()

# generate Wikipedia titles
model.sample(
    sentences=["Einstein was a [START_ENT] German [END_ENT] physicist."],
    prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
)
[[{'text': 'Germany', 'score': tensor(-0.1856)},
  {'text': 'Germans', 'score': tensor(-0.5461)},
  {'text': 'German Empire', 'score': tensor(-2.1858)}]]
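The `prefix_allowed_tokens_fn` callback in the call above receives the tokens generated so far and returns the tokens that may come next, which is how decoding is restricted to valid titles. The following is a minimal pure-Python sketch of that mechanism. `ToyTrie` is a hypothetical stand-in written for this example, not the repository's `genre.trie.Trie`, and it uses whole words as tokens where a real model would use integer token IDs.

```python
from typing import Dict, Hashable, Iterable, List


class ToyTrie:
    """Hypothetical prefix tree over token sequences, stored as nested dicts."""

    def __init__(self, sequences: Iterable[List[Hashable]]):
        self.root: Dict = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                # Create the child node on first use, then descend into it.
                node = node.setdefault(tok, {})

    def get(self, prefix: List[Hashable]) -> List[Hashable]:
        """Return the tokens that can follow `prefix`; empty if the prefix is invalid."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []
            node = node[tok]
        return list(node.keys())


# Toy "titles", each ending with an end-of-sequence marker.
titles = [
    ["German", "Empire", "</s>"],
    ["German", "language", "</s>"],
    ["Albert", "Einstein", "</s>"],
]
trie = ToyTrie(titles)

# Same shape as the callback above: (batch_id, tokens so far) -> allowed next tokens.
allowed = lambda batch_id, sent: trie.get(list(sent))

print(allowed(0, []))           # ['German', 'Albert']
print(allowed(0, ["German"]))   # ['Empire', 'language']
print(allowed(0, ["French"]))   # []
```

At each decoding step, beam search would only consider the returned tokens, so every finished hypothesis is guaranteed to spell out one of the stored sequences.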

mGENRE

Making predictions with mGENRE is very similar, but we additionally need to map (title, language_ID) to Wikidata IDs and (optionally) marginalize over predictions of the same entity:

import pickle
from genre.trie import Trie, MarisaTrie
from genre.fairseq_model import mGENRE

with open("../data/lang_title2wikidataID-normalized_with_redirect.pkl", "rb") as f:
    lang_title2wikidataID = pickle.load(f)

# memory efficient prefix tree (trie) implemented with `marisa_trie`
with open("../data/titles_lang_all105_marisa_trie_with_redirect.pkl", "rb") as f:
    trie = pickle.load(f)

# load the model
model = mGENRE.from_pretrained("../models/fairseq_multilingual_entity_disambiguation").eval()

# generate Wikipedia titles and language IDs
model.sample(
    sentences=["[START] Einstein [END] era un fisico tedesco."],
    # Italian for "[START] Einstein [END] was a German physicist."
    prefix_allowed_tokens_fn=lambda batch_id, sent: [
        e for e in trie.get(sent.tolist()) if e < len(model.task.target_dictionary)
    ],
    text_to_id=lambda x: max(lang_title2wikidataID[
        tuple(reversed(x.split(" >> ")))
    ], key=lambda y: int(y[1:])),
    marginalize=True,
)
[[{'id': 'Q937',
   'texts': ['Albert Einstein >> it',
    'Alberto Einstein >> it',
    'Einstein >> it'],
   'scores': tensor([-0.0808, -1.4619, -1.5765]),
   'score': tensor(-0.0884)},
  {'id': 'Q60197',
   'texts': ['Alfred Einstein >> it'],
   'scores': tensor([-1.4337]),
   'score': tensor(-3.2058)},
  {'id': 'Q15990626',
   'texts': ['Albert Einstein (disambiguation) >> en'],
   'scores': tensor([-1.0998]),
   'score': tensor(-3.6478)}]]
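The `text_to_id` lambda in the call above is dense, so here is a small standalone sketch of what it does: it splits a generated string such as `Albert Einstein >> it` into title and language, looks up the `(language, title)` key, and, when several Wikidata IDs are listed, keeps the one with the largest numeric part. The toy dictionary below is hypothetical (only `Q937` comes from the output shown above), and it assumes each value is a collection of ID strings.

```python
from typing import Dict, Set, Tuple

# Hypothetical stand-in for the pickled mapping; keys are (language, title).
lang_title2wikidataID: Dict[Tuple[str, ...], Set[str]] = {
    ("it", "Albert Einstein"): {"Q937"},
    ("en", "Some Title"): {"Q3", "Q20"},  # made-up ambiguous entry
}


def text_to_id(text: str) -> str:
    """Map 'Title >> lang' to a single Wikidata ID, as the lambda above does."""
    # "Albert Einstein >> it" -> ["Albert Einstein", "it"] -> ("it", "Albert Einstein")
    key = tuple(reversed(text.split(" >> ")))
    candidates = lang_title2wikidataID[key]
    # Compare IDs by their integer part: "Q20" beats "Q3", unlike string comparison.
    return max(candidates, key=lambda qid: int(qid[1:]))


print(text_to_id("Albert Einstein >> it"))  # Q937
print(text_to_id("Some Title >> en"))       # Q20
```

Like the original lambda, this raises `KeyError` for a string that is not in the mapping; constrained generation is what keeps such strings from being produced in the first place.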

Models & Datasets

For GENRE, use this script to download all models and this to download all datasets. See here for the list of individual models for each task, for both PyTorch fairseq and Hugging Face transformers. See the example on how to download additional optional files like the prefix tree (trie) for KILT Wikipedia.

For mGENRE we only have a model available here. See the example on how to download additional optional files like the prefix tree (trie) for Wikipedia in all languages and the mapping between titles and Wikidata IDs.

A pre-trained mBART model covering 125 languages is available here.

Troubleshooting

If the module cannot be found, preface the python command with PYTHONPATH=.

Licence

GENRE is licensed under the CC-BY-NC 4.0 license. The text of the license can be found here.
