xling-el's Introduction

Code for running the entity linking model. This is part of the code for the xelms project.

Requirements

  1. pytorch (0.2.0+21f8ad4): installed from source, and patched for sparse tensor operations (instructions below).
  2. python3.
  3. cogcomp-nlpy.
  4. Download the resources and trained models here and place them in the folder xling-el/data. Right now, pre-trained models are available for German, Spanish, French, Italian, and Chinese.

Resources for Candidate Generation

  1. First set up candidate generation and the other resources as described in the wikidump_preprocessing and wiki_candgen projects.
  2. A mongo daemon needs to be running. This is where the resources generated in wiki_candgen will be kept for fast (and parallel) access.

Note: These resources are included in the resources directory downloaded in step 4 of the requirements above, so you should not need to regenerate them unless you plan to use a newer Wikipedia dump or a larger knowledge base.
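
Before running the model, it can help to confirm that the mongo daemon is reachable. The following is a minimal sketch, assuming mongod listens on localhost at its default port 27017. The function name mongo_is_reachable is hypothetical and not part of this repository, and the check only confirms that a TCP connection can be opened, not that the wiki_candgen resources have been loaded.

```python
import socket


def mongo_is_reachable(host: str = "localhost", port: int = 27017,
                       timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Assumption: mongod uses its default port (27017) unless it was
    started with a different one.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

Calling mongo_is_reachable() with no arguments checks the default local daemon. Pass host and port explicitly if your daemon runs elsewhere.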

Patching Pytorch for Sparse Tensor Operations

This is best done in a new conda environment.

  1. First, check out the sparse_patch branch of the patched pytorch repository:
git clone https://github.com/shyamupa/pytorch
cd pytorch
git checkout sparse_patch
  2. Install the patched code from source using the following commands:
export CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" # [anaconda root directory]

# Install basic dependencies
conda install numpy pyyaml mkl mkl-include setuptools cmake cffi typing
conda install -c mingfeima mkldnn
# Build from the root of the cloned repository (the pytorch directory from step 1)
python setup.py install

Verify that the patched pytorch was installed successfully:

>>> import torch
>>> torch.__version__
'0.2.0+43662e7'

Mention Detection using NER

  1. For German, Spanish, French, and Italian, download the relevant spaCy NER models:
pip install spacy
python -m spacy download de_core_news_sm
python -m spacy download es_core_news_md
python -m spacy download fr_core_news_md
python -m spacy download it_core_news_sm
  2. For Chinese, download the Stanford CoreNLP jar and the Chinese model jar, and place them in a stanford_jars directory:
$ ls stanford_jars/
stanford-corenlp-full-2018-10-05
$ ls stanford_jars/stanford-corenlp-full-2018-10-05
...
...
stanford-chinese-corenlp-2018-10-05-models.jar
...

Then set the environment variable CORENLP_HOME to path/to/stanford_jars/stanford-corenlp-full-2018-10-05:

export CORENLP_HOME=path/to/stanford_jars/stanford-corenlp-full-2018-10-05
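
A missing or misplaced Chinese model jar is easy to overlook, so a quick check can save a failed run. The following is a minimal sketch, assuming the model jar sits directly inside the CORENLP_HOME directory and follows the naming pattern shown in the listing above. The helper find_chinese_models is hypothetical and not part of this repository.

```python
import os
from pathlib import Path


def find_chinese_models(corenlp_home=None):
    """Return the Chinese model jars found in CORENLP_HOME (possibly empty).

    Raises RuntimeError if CORENLP_HOME is unset or is not a directory.
    """
    home = corenlp_home or os.environ.get("CORENLP_HOME")
    if not home:
        raise RuntimeError("CORENLP_HOME is not set")
    root = Path(home)
    if not root.is_dir():
        raise RuntimeError(f"CORENLP_HOME is not a directory: {root}")
    # Assumption: the jar is named like
    # stanford-chinese-corenlp-2018-10-05-models.jar
    return sorted(root.glob("stanford-chinese-corenlp-*-models.jar"))
```

An empty list means the directory exists but no Chinese model jar was found in it.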

Running the Model

To run the model, use the command:

./run_inference_on_doc.sh <lang> <infile> <outfile>

For instance, to run the model on the German document test_docs/de_doc.txt:

./run_inference_on_doc.sh de test_docs/de_doc.txt test_docs/de_doc_output.txt

The JSON output is written to test_docs/de_doc_output.txt.
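
To process several documents in one go, the script can be wrapped in a short driver. The following is a minimal sketch, assuming it is run from the repository root where run_inference_on_doc.sh lives. The helpers build_inference_command and run_on_directory are hypothetical, and the derived output name simply mirrors the de_doc.txt to de_doc_output.txt example above.

```python
import subprocess
from pathlib import Path


def build_inference_command(lang, infile, outfile=None):
    """Build the argument list for run_inference_on_doc.sh.

    If outfile is omitted, derive it from infile as in the example above:
    test_docs/de_doc.txt -> test_docs/de_doc_output.txt.
    """
    infile = Path(infile)
    if outfile is None:
        outfile = infile.with_name(f"{infile.stem}_output{infile.suffix}")
    return ["./run_inference_on_doc.sh", lang, str(infile), str(outfile)]


def run_on_directory(lang, doc_dir):
    """Run inference on every .txt file in doc_dir, skipping earlier outputs."""
    for infile in sorted(Path(doc_dir).glob("*.txt")):
        if infile.stem.endswith("_output"):
            continue
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run(build_inference_command(lang, infile), check=True)
```

For example, run_on_directory("de", "test_docs") would invoke the script once per German document in test_docs.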

Output

The output file is a JSON-serialized text annotation with a view named NEURAL_XEL_<lang>. The view consists of a list of the constituents that have been linked to a Wikipedia title. Below is an excerpt of the output for the German test document provided in the repo:

...
"viewName": "NEURAL_XEL_de",
...
...
"constituents": [
      {
       "end": 2,
       "label": "en.wikipedia.org/wiki/Angela_Merkel",
       "score": 0.5128146075318596,
       "start": 0,
       "tokens": "Angela Merkel"
      },
      {
       "end": 5,
       "label": "NULLTITLE",
       "score": 0.05000000074505806,
       "start": 4,
       "tokens": "Elim-Krankenhaus"
      },
      ...

The label field of each constituent is the predicted Wikipedia entity for the span identified by the start and end token indices. As the example shows, start is inclusive and end is exclusive: "Angela Merkel" covers tokens 0 and 1. A label of NULLTITLE means that the named entity found by the mention detection system could not be linked to any entity.
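
The linked entities can be pulled out of the output file with a few lines of Python. The following is a minimal sketch: because the excerpt above elides the surrounding structure, it makes no assumption about where the view sits in the file and instead searches the whole JSON tree for the dict whose viewName matches, then for the constituents list inside it. The function names walk_dicts, linked_entities, and load_linked_entities are hypothetical and not part of this repository.

```python
import json


def walk_dicts(node):
    """Yield every dict nested anywhere inside node, depth-first."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from walk_dicts(value)
    elif isinstance(node, list):
        for item in node:
            yield from walk_dicts(item)


def linked_entities(annotation, lang):
    """Return the constituents of NEURAL_XEL_<lang> that were actually linked."""
    view_name = f"NEURAL_XEL_{lang}"
    for view in walk_dicts(annotation):
        if view.get("viewName") != view_name:
            continue
        # The constituents list may sit on this dict or on one nested below it.
        for inner in walk_dicts(view):
            constituents = inner.get("constituents")
            if isinstance(constituents, list):
                # Drop mentions that could not be linked to any entity.
                return [c for c in constituents if c.get("label") != "NULLTITLE"]
    return []


def load_linked_entities(path, lang):
    """Read an output file produced by the model and return its linked entities."""
    with open(path, encoding="utf-8") as handle:
        return linked_entities(json.load(handle), lang)
```

For the German example, load_linked_entities("test_docs/de_doc_output.txt", "de") would keep the Angela Merkel constituent and drop the NULLTITLE one.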

xling-el's People

Contributors

shyamupa

xling-el's Issues

dataset

Hello, sorry to disturb you. I cannot open the link to the resources for candidate generation. Could you please share this dataset with me?

The data is lost

The file named "data_release_v1.tar.gz" is not found on the server.
Could you upload it again?
Thank you.

How did you deal with nil mention?

If there are no candidates, it's easy to tag the mention with nil. However sometimes a mention has generated candidates, but all of them are wrong. How did you deal with this case?
