
path2vec

This repository contains code related to this paper:

Andrey Kutuzov, Mohammad Dorgham, Oleksiy Oliynyk, Chris Biemann, Alexander Panchenko (2019)

Making Fast Graph-based Algorithms with Graph Metric Embeddings

Path2vec is a new approach for learning graph embeddings that relies on structural measures of pairwise node similarities. The model learns representations for nodes in a dense space that approximate a given user-defined graph distance measure, such as the shortest path distance or distance measures that take information beyond the graph structure into account. Evaluation of the model on semantic similarity and word sense disambiguation tasks, using various WordNet-based similarity measures, shows that our approach yields competitive results, outperforming strong graph embedding baselines. The model is computationally efficient, being orders of magnitude faster than the direct computation of graph-based distances.

Pre-trained models and datasets

You can download pre-trained dense vector representations of WordNet synsets approximating several different graph distance metrics:

Prepared training datasets are also available:

Models training

Train your own graph embedding model with:

python3 embeddings.py --input_file TRAINING_DATASET --vocab_file synsets_vocab.json.gz --use_neighbors

Run python3 embeddings.py -h for help on tunable hyperparameters.

Models evaluation

python3 evaluation.py MODELFILE SIMFILE0 SIMFILE1

MODELFILE is the file with synset vectors in word2vec text format.

SIMFILE is one of the semantic similarity datasets. It is expected that SIMFILE0 contains WordNet similarities, SIMFILE1 contains SimLex999 similarities, and that both correspond to the graph distance metric on which the model was trained. The model will be tested on both of these test sets, and additionally on the raw SimLex999 (with synsets assigned to lemmas dynamically).

For example, to evaluate on the shortest path metric (shp):

python3 evaluation.py shp.vec.gz simlex/simlex_shp.tsv simlex/simlex_synsets/max_shp_human.tsv

Model   WordNet   Static   Dynamic
shp     0.9473    0.5121   0.5551

The first score 0.9473 is the Spearman rank correlation between model-produced similarities and WordNet similarities (using SIMFILE0). The second score 0.5121 is calculated on SIMFILE1 (human judgments). The third score (0.5551 in the example) is always calculated on the original SimLex with dynamically selected synsets (see below for details).

Evaluation with dynamic synset selection

One can also evaluate using dynamic synset selection on the original SimLex test set.

'Dynamic synset selection' here means that the test set contains lemmas, not synsets. From all possible WordNet synsets for words A and B in each test set pair, we choose the synset combination which yields maximum similarity in the model under evaluation. For example, for the words weekend and week we choose the synsets weekend.n.01 and workweek.n.01, etc.

To evaluate the model this way, use the evaluate_lemmas.py script:

python3 evaluate_lemmas.py MODELFILE simlex/simlex_original.tsv
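
For illustration, the following is a minimal sketch of what dynamic synset selection amounts to, assuming a gensim KeyedVectors model whose keys are WordNet synset names; the function name and the restriction to noun synsets are illustrative and not taken from evaluate_lemmas.py:

from gensim.models import KeyedVectors
from nltk.corpus import wordnet as wn

model = KeyedVectors.load_word2vec_format('shp.vec.gz', binary=False)

def dynamic_similarity(lemma_a, lemma_b):
    # maximum model similarity over all noun synset pairs of the two lemmas
    best = None
    for syn_a in wn.synsets(lemma_a, pos=wn.NOUN):
        for syn_b in wn.synsets(lemma_b, pos=wn.NOUN):
            if syn_a.name() in model and syn_b.name() in model:
                sim = model.similarity(syn_a.name(), syn_b.name())
                if best is None or sim > best:
                    best = sim
    return best

print(dynamic_similarity('weekend', 'week'))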

BibTeX

@inproceedings{kutuzov-etal-2019-making,
    title = "Making Fast Graph-based Algorithms with Graph Metric Embeddings",
    author = "Kutuzov, Andrey  and
      Dorgham, Mohammad  and
      Oliynyk, Oleksiy  and
      Biemann, Chris  and
      Panchenko, Alexander",
    booktitle = "Proceedings of the 57th Conference of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1325",
    pages = "3349--3355",
    abstract = "Graph measures, such as node distances, are inefficient to compute. We explore dense vector representations as an effective way to approximate the same information. We introduce a simple yet efficient and effective approach for learning graph embeddings. Instead of directly operating on the graph structure, our method takes structural measures of pairwise node similarities into account and learns dense node representations reflecting user-defined graph distance measures, such as e.g. the shortest path distance or distance measures that take information beyond the graph structure into account. We demonstrate a speed-up of several orders of magnitude when predicting word similarity by vector operations on our embeddings as opposed to directly computing the respective path-based measures, while outperforming various other graph embeddings on semantic similarity and word sense disambiguation tasks.",
}


path2vec's Issues

Obtain results for the SemEval 2013 (WSD)

We need to address another comment of the reviewer, who complained about the fact that the WSD dataset is old.

Please obtain the WSD evaluation results also for this dataset:

"SemEval-13 task 12 (Navigli et al., 2013). This dataset includes thirteen documents from various domains. In this case the original sense inventory was WordNet 3.0, which is the same that we use for all datasets. The number of sense annotations is 1644, although only nouns are considered."

This dataset is in exactly the same format (http://lcl.uniroma1.it/wsdeval/evaluation-data), so it should be just a matter of running the script against the new data. We need to reproduce Table 5, but for the new dataset (https://arxiv.org/pdf/1808.05611.pdf).

Redraw the plot


  • remove number prefix for words

  • add the original words below

  • try making labels bold font

  • make the connections and nodes blue (but the selected senses are red)

  • make connections blue as well

  • encode the weight of the connection with the thickness of the line

baseline based on DeepWalk

  • the goal is to train a baseline model based on DeepWalk on the WordNet graph of nouns (verbs and adjectives are not relevant)

  • to generate the graph, the NLTK library should be used: http://www.nltk.org/howto/wordnet.html . Iterate over all noun synsets and generate all hyponyms and hypernyms for each synset. Alternatively, add transitive relations (hyponyms and hypernyms of distance n = 2, 3, 4, 5). See the sketch after this list.

  • as a sanity check: load the trained embeddings in gensim and look at the most similar words

  • try various configurations (dim, batch size, num. of walks, ...) and save the embeddings in the word2vec format

  • evaluate each w2v model using the scripts from this repository (simlex dir.)
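
The graph generation step could look roughly like the following minimal sketch using NLTK and networkx; the file name is illustrative and only direct hypernym/hyponym edges are added:

import networkx as nx
from nltk.corpus import wordnet as wn

graph = nx.Graph()
for synset in wn.all_synsets(pos=wn.NOUN):
    # connect each noun synset to its direct hypernyms and hyponyms
    for neighbor in synset.hypernyms() + synset.hyponyms():
        graph.add_edge(synset.name(), neighbor.name())

# save in adjacency-list format for the DeepWalk training step
nx.write_adjlist(graph, 'wordnet_nouns.adjlist')
print(graph.number_of_nodes(), graph.number_of_edges())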

Perform meta-parameter search using PyTorch implementation

Report the results of the new implementation in comparison to the old one. Of particular interest is whether increasing the number of nearest neighbors would now NOT decrease the performance (as is the case with the current implementation). Try to optimize other meta-parameters as well; they can differ from the Keras/TF implementation for purely technical reasons.

Rank the nearest neighbours for the regularization

  1. Best NN strategy. Use the NLTK WordNet similarity package to rank the nearest neighbors used for the regularization (select the most semantically relevant); see the sketch after this list.

  2. All NN strategy. Alternatively, just rotate the neighbours across iterations. E.g. with k=1 nearest neighbour used in the model and n=5 neighbours available, different iterations should use different neighbours.
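
A minimal sketch of the 'Best NN' strategy, assuming a gensim model keyed by synset names and LCH as the ranking measure; function and file names are illustrative:

from gensim.models import KeyedVectors
from nltk.corpus import wordnet as wn

model = KeyedVectors.load_word2vec_format('shp.vec.gz', binary=False)

def best_neighbors(synset_name, n=5, k=1):
    # rank the n nearest model neighbours by LCH similarity and keep the top k
    target = wn.synset(synset_name)
    neighbors = [name for name, _ in model.most_similar(synset_name, topn=n)
                 if wn.synset(name).pos() == target.pos()]
    ranked = sorted(neighbors,
                    key=lambda name: wn.synset(name).lch_similarity(target),
                    reverse=True)
    return ranked[:k]

print(best_neighbors('weekend.n.01'))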

node2vec baseline

  1. Download the WordNet graph here: http://ltdata1.informatik.uni-hamburg.de/shortest_path/graph/

  2. Convert the adjlist file to an edgelist file (which node2vec takes as input) using networkx; see the sketch after this list.

  3. Train the graph embeddings using node2vec (https://github.com/snap-stanford/snap/tree/master/examples/node2vec) for all combinations of the following parameters:

Number of dimensions: 50, 100, 200, 300
Number of walks per source: 5, 10, 25
Context size for optimization: 5, 10, 25
Number of epochs in SGD: 1, 5, 10
Graph is directed: true, false

Overall, you need to output 4 * 3 * 3 * 3 * 2 = 216 models (the name of each embedding file should contain its hyperparameters).

  4. Look up the names of the synsets for each file from the pickle file (http://ltdata1.informatik.uni-hamburg.de/shortest_path/graph/nodes.pkl).

You can do it using this script: https://github.com/uhh-lt/shortpath2vec/blob/master/deepwalk/convert_embedding.py

OR manually, like ...


import pickle

with open('nodes.pkl', 'rb') as f:
    pkl = pickle.load(f)  # mapping from node index to synset name
synset_name = pkl.get(node_index)  # node_index: integer node id from the embedding file
  5. Perform the evaluation of each model using the evaluation script. See details here: https://github.com/uhh-lt/shortpath2vec.

  6. Save the results in this table in the 'node2vec' sheet (similarly to the deepwalk method):
    https://docs.google.com/spreadsheets/d/1KjNns16ld3pVUY1K7aA0HH9Lrb1PiKBH7-ZIbWrD9Kc/edit#gid=1803816318
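
For step 2, the conversion could look like this minimal sketch with networkx; the file names are illustrative:

import networkx as nx

graph = nx.read_adjlist('wordnet.adjlist')
# node2vec expects an edge list: one "node_a node_b" pair per line
nx.write_edgelist(graph, 'wordnet.edgelist', data=False)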

Improvements in the embeddings script

  1. Add a command line interface using argparse. These parameters should be configurable (see the sketch after this list):

  trainfile = sys.argv[1]  # Gzipped file with pairs and their similarities
  embedding_dimension = int(sys.argv[2])  # vector size (for example, 20)
  batch_size = int(sys.argv[3])  # number of pairs in a batch (for example, 10)
  learn_rate = float(sys.argv[4])   # Learning rate
  
  negative = 3  # number of negative samples

  2. Merge embeddings_2.py and embeddings.py. Add the changes from the new script to the old one, making them optional. Add special arguments (turned off by default) like --use-neighbours and --number-of-neighbours which control the number of used nearest neighbours.
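
A minimal sketch of the requested argparse interface; argument names and defaults are illustrative, not the script's actual flags:

import argparse

parser = argparse.ArgumentParser(description='Train path2vec embeddings')
parser.add_argument('--input_file', required=True,
                    help='Gzipped file with pairs and their similarities')
parser.add_argument('--vsize', type=int, default=20, help='Vector size')
parser.add_argument('--bsize', type=int, default=10, help='Number of pairs in a batch')
parser.add_argument('--lrate', type=float, default=0.001, help='Learning rate')
parser.add_argument('--negative', type=int, default=3, help='Number of negative samples')
parser.add_argument('--use_neighbors', action='store_true',
                    help='Turn on nearest-neighbour regularization (off by default)')
parser.add_argument('--neighbor_count', type=int, default=3,
                    help='Number of nearest neighbours to use')
args = parser.parse_args()
print(args.input_file, args.vsize, args.bsize, args.lrate)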

Training the model on the entire WordNet

Hi,

First of all thanks for publishing this code. Second, I want to re-train the model on the entire WordNet, taking all parts-of-speech into account. Where do I need to make changes to accomplish this?

Thanks in advance.

Evaluation based on word sense disambiguation

Motivation

We plan to submit a paper to the EMNLP conference http://emnlp2018.org about the graph embedding approach implemented in this repository. In this experiment, we use our model to learn vector representations of WordNet nodes (synsets). Each vector is in the word2vec format.

One of the evaluation methods for the obtained embeddings is to apply them to word sense disambiguation (https://en.wikipedia.org/wiki/Word-sense_disambiguation). In particular, it is straightforward to do this with a pre-existing graph-based algorithm described in the following paper:
http://www.cse.unt.edu/~tarau/research/misc/wsd/papers/sinha.ieee07.pdf

Unfortunately, the implementation of the algorithm is not available. The goal is to re-implement the algorithm presented in the paper and to reproduce the experiment presented there. After this, the goal would be to replace the WordNet-based similarity metrics used in the paper with the graph embedding similarities and to show that the difference in the results is small.

Implementation

  1. First read this paper carefully to understand the method which has to be implemented. Make sure you understand Algorithm 1.

  2. Implement the word sense disambiguation (WSD) method presented in the paper as Algorithm 1 in Python 3. Use the NetworkX library to (a) represent the graph structure and (b) rely on its PageRank implementation: https://networkx.github.io/documentation/networkx-1.10/reference/generated/networkx.algorithms.link_analysis.pagerank_alg.pagerank.html . A rough sketch appears after this list.

  • Implement two options of the algorithm: using the PageRank and InDegree (see the paper). Do not implement other versions of the algorithm.

  • Use the NLTK library for implementations of the JCN (Jiang-Conrath) and LCH (Leacock-Chodorow) similarity measures between words: see the 'similarity' section here: http://www.nltk.org/howto/wordnet.html. Do not implement options for other measures described in the paper.

  3. For conducting the evaluation experiment on the SensEval 2/3 datasets, rely on this framework: http://lcl.uniroma1.it/wsdeval/evaluation-data

  4. Report the results in a Google Sheets table. Compare them to the original scores from the paper (they should be reasonably close).

  5. Use the gensim library to load graph embeddings of the synsets. Replace the JCN and LCH word similarity measures with their vectorized versions. Models in the word2vec format are available here: http://ltdata1.informatik.uni-hamburg.de/shortest_path/models/

  6. Make a version of the algorithm which uses not the original JCN / LCH measures but their vectorized counterparts.

  7. Report in the Google Sheet the performance of the WSD algorithm on the same dataset based on the vector-based synset similarities.
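
A rough sketch of the PageRank variant described in step 2, using networkx and the NLTK LCH measure, restricted to noun candidates; names are illustrative, not from this repository:

import networkx as nx
from nltk.corpus import wordnet as wn

def disambiguate(words):
    # pick one synset per word by running PageRank over a synset similarity graph
    graph = nx.Graph()
    candidates = {w: wn.synsets(w, pos=wn.NOUN) for w in words}
    for w1 in words:
        for w2 in words:
            if w1 >= w2:
                continue  # skip self-pairs and duplicate pairs
            for s1 in candidates[w1]:
                for s2 in candidates[w2]:
                    # edge weight = LCH similarity between candidate senses
                    graph.add_edge(s1, s2, weight=s1.lch_similarity(s2))
    ranks = nx.pagerank(graph, weight='weight')
    # for each word, keep the candidate synset with the highest PageRank score
    return {w: max(candidates[w], key=lambda s: ranks.get(s, 0.0))
            for w in words if candidates[w]}

print(disambiguate(['customer', 'bank', 'account']))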

How to get the right embeddings for ambiguous words?

I want to combine the embeddings of all found word synsets to represent sentence embeddings.

Consider a sentence containing words with multiple meanings. For example, "the customer has opened a bank account". The word "bank" has multiple synsets in WordNet. So how can I get the right one given the context of the sentence?

My guess was to use sentence_wsd() in wsd\graph_wsd_test_v2.py. I adjusted the code a little bit to support any sentence (without having the SensEval annotations), but the resulting synset selection was not really satisfying.

Results of regularized models on WordNet similarities

@m-dorgham can you add to the 'regularizers' tab in the Google Spreadsheet the evaluation scores calculated on WordNet similarities?
What you have there now are the scores calculated either in the 'dynamic' setup (on the simlex_original.tsv test set) or in the 'static' setup (on the test sets from simlex/simlex_synsets subdirectory).

What I'm asking for now are the Spearman correlation values calculated on the test sets from the simlex subdirectory:

  • simlex_jcn_brown.tsv
  • simlex_jcn_semcor.tsv
  • simlex_lch.tsv

It seems we should focus more on these metrics, since they are more closely related to our original aim, that is, approximating arbitrary graph distance metrics with dense vectors.
