
path2vec

This repository contains code related to this paper:

Andrey Kutuzov, Mohammad Dorgham, Oleksiy Oliynyk, Chris Biemann, Alexander Panchenko (2019)

Making Fast Graph-based Algorithms with Graph Metric Embeddings

Path2vec is a new approach for learning graph embeddings that relies on structural measures of pairwise node similarities. The model learns representations for nodes in a dense space that approximate a given user-defined graph distance measure, such as the shortest path distance or distance measures that take information beyond the graph structure into account. Evaluation of the model on semantic similarity and word sense disambiguation tasks, using various WordNet-based similarity measures, shows that our approach yields competitive results, outperforming strong graph embedding baselines. The model is computationally efficient, being orders of magnitude faster than the direct computation of graph-based distances.

Pre-trained models and datasets

You can download pre-trained dense vector representations of WordNet synsets approximating several different graph distance metrics:

Prepared training datasets are also available:

Models training

Train your own graph embedding model with:

python3 embeddings.py --input_file TRAINING_DATASET --vocab_file synsets_vocab.json.gz --use_neighbors

Run python3 embeddings.py -h for help on tunable hyperparameters.

Models evaluation

python3 evaluation.py MODELFILE SIMFILE0 SIMFILE1

MODELFILE is the file with synset vectors in word2vec text format.

SIMFILE is one of the semantic similarity datasets. It is expected that SIMFILE0 contains WordNet similarities, SIMFILE1 contains SimLex999 similarities, and that both correspond to the graph distance metric on which the model was trained. The model will be tested on both of these test sets, and additionally on the raw SimLex999 (with synsets assigned to lemmas dynamically).

For example, to evaluate on the shortest path metric (shp):

python3 evaluation.py shp.vec.gz simlex/simlex_shp.tsv simlex/simlex_synsets/max_shp_human.tsv

Model   WordNet   Static   Dynamic
shp     0.9473    0.5121   0.5551

The first score 0.9473 is the Spearman rank correlation between model-produced similarities and WordNet similarities (using SIMFILE0). The second score 0.5121 is calculated on SIMFILE1 (human judgments). The third score (0.5551 in the example) is always calculated on the original SimLex with dynamically selected synsets (see below for details).

Evaluation with dynamic synset selection

One can also evaluate using dynamic synset selection on the original SimLex test set.

'Dynamic synset selection' here means that the test set contains lemmas, not synsets. From all possible WordNet synsets for words A and B in each test set pair, we choose the synset combination which yields maximum similarity in the model under evaluation. For example, for the words weekend and week we choose the synsets weekend.n.01 and workweek.n.01, etc.

To evaluate the model this way, use the evaluate_lemmas.py script:

python3 evaluate_lemmas.py MODELFILE simlex/simlex_original.tsv
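
For illustration, the following is a minimal sketch of what dynamic synset selection amounts to, assuming a gensim KeyedVectors model whose keys are WordNet synset names; the function name and the restriction to noun synsets are illustrative and not taken from evaluate_lemmas.py:

from gensim.models import KeyedVectors
from nltk.corpus import wordnet as wn

model = KeyedVectors.load_word2vec_format('shp.vec.gz', binary=False)

def dynamic_similarity(lemma_a, lemma_b):
    # maximum model similarity over all noun synset pairs of the two lemmas
    best = None
    for syn_a in wn.synsets(lemma_a, pos=wn.NOUN):
        for syn_b in wn.synsets(lemma_b, pos=wn.NOUN):
            if syn_a.name() in model and syn_b.name() in model:
                sim = model.similarity(syn_a.name(), syn_b.name())
                if best is None or sim > best:
                    best = sim
    return best

print(dynamic_similarity('weekend', 'week'))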

BibTeX

@inproceedings{kutuzov-etal-2019-making,
    title = "Making Fast Graph-based Algorithms with Graph Metric Embeddings",
    author = "Kutuzov, Andrey  and
      Dorgham, Mohammad  and
      Oliynyk, Oleksiy  and
      Biemann, Chris  and
      Panchenko, Alexander",
    booktitle = "Proceedings of the 57th Conference of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1325",
    pages = "3349--3355",
    abstract = "Graph measures, such as node distances, are inefficient to compute. We explore dense vector representations as an effective way to approximate the same information. We introduce a simple yet efficient and effective approach for learning graph embeddings. Instead of directly operating on the graph structure, our method takes structural measures of pairwise node similarities into account and learns dense node representations reflecting user-defined graph distance measures, such as e.g. the shortest path distance or distance measures that take information beyond the graph structure into account. We demonstrate a speed-up of several orders of magnitude when predicting word similarity by vector operations on our embeddings as opposed to directly computing the respective path-based measures, while outperforming various other graph embeddings on semantic similarity and word sense disambiguation tasks.",
}


path2vec's Issues

Obtain results for the SemEval 2013 (WSD)

We need to address another comment of the reviewer, who complained about the fact that the WSD dataset is old.

Please obtain the WSD evaluation results also for this dataset:

"SemEval-13 task 12 (Navigli et al., 2013). This dataset includes thirteen documents from various domains. In this case the original sense inventory was WordNet 3.0, which is the same that we use for all datasets. The number of sense annotations is 1644, although only nouns are considered."

This dataset is in exactly the same format (http://lcl.uniroma1.it/wsdeval/evaluation-data), so it should be just a matter of running the script against the new data. We need to reproduce Table 5, but for the new dataset (https://arxiv.org/pdf/1808.05611.pdf).

Redraw the plot


  • remove number prefix for words

  • add the original words below

  • try making labels bold font

  • make the connections and nodes blue (but the selected senses are red)

  • make connections blue as well

  • encode the weight of the connection with the thickness of the line

baseline based on DeepWalk

  • the goal is to train a baseline model based on DeepWalk on the WordNet graph of nouns (verbs and adjectives are not relevant)

  • to generate the graph, the NLTK library should be used: http://www.nltk.org/howto/wordnet.html . Iterate over all noun synsets and generate all hyponyms and hypernyms for each synset. Alternatively, add transitive relations (hyponyms and hypernyms of distance n = 2, 3, 4, 5). See the sketch after this list.

  • as a sanity check: load the trained embeddings in gensim and look at the most similar words

  • try various configurations (dim, batch size, num. of walks, ...) and save the embeddings in the word2vec format

  • evaluate each w2v model using the scripts from this repository (simlex dir.)
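
The graph generation step could look roughly like the following minimal sketch using NLTK and networkx; the file name is illustrative and only direct hypernym/hyponym edges are added:

import networkx as nx
from nltk.corpus import wordnet as wn

graph = nx.Graph()
for synset in wn.all_synsets(pos=wn.NOUN):
    # connect each noun synset to its direct hypernyms and hyponyms
    for neighbor in synset.hypernyms() + synset.hyponyms():
        graph.add_edge(synset.name(), neighbor.name())

# save in adjacency-list format for the DeepWalk training step
nx.write_adjlist(graph, 'wordnet_nouns.adjlist')
print(graph.number_of_nodes(), graph.number_of_edges())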

Perform meta-parameter search using PyTorch implementation

Report the results of the new implementation in comparison to the old one. Of particular interest is whether increasing the number of nearest neighbors would now NOT decrease the performance (as is the case with the current implementation). Try to optimize other meta-parameters as well; they can differ from the Keras/TF implementation for purely technical reasons.

Rank the nearest neighbours for the regularization

  1. Best NN strategy. Use the NLTK WordNet similarity package to rank the nearest neighbors used for the regularization (select the most semantically relevant); see the sketch after this list.

  2. All NN strategy. Alternatively, just rotate the neighbours across iterations. E.g. with k=1 nearest neighbour used in the model and n=5 neighbours available, different iterations should use different neighbours.
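
A minimal sketch of the 'Best NN' strategy, assuming a gensim model keyed by synset names and LCH as the ranking measure; function and file names are illustrative:

from gensim.models import KeyedVectors
from nltk.corpus import wordnet as wn

model = KeyedVectors.load_word2vec_format('shp.vec.gz', binary=False)

def best_neighbors(synset_name, n=5, k=1):
    # rank the n nearest model neighbours by LCH similarity and keep the top k
    target = wn.synset(synset_name)
    neighbors = [name for name, _ in model.most_similar(synset_name, topn=n)
                 if wn.synset(name).pos() == target.pos()]
    ranked = sorted(neighbors,
                    key=lambda name: wn.synset(name).lch_similarity(target),
                    reverse=True)
    return ranked[:k]

print(best_neighbors('weekend.n.01'))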

node2vec baseline

  1. Download the WordNet graph here: http://ltdata1.informatik.uni-hamburg.de/shortest_path/graph/

  2. Convert the adjlist file to an edgelist file (which node2vec takes as input) using networkx; see the sketch after this list.

  3. Train the graph embeddings using node2vec (https://github.com/snap-stanford/snap/tree/master/examples/node2vec) for all combinations of the following parameters:

Number of dimensions: 50, 100, 200, 300
Number of walks per source: 5, 10, 25
Context size for optimization: 5, 10, 25
Number of epochs in SGD: 1, 5, 10
Graph is directed: true, false

Overall, you need to output 4 * 3 * 3 * 3 * 2 = 216 models (the name of each embedding file should contain its hyperparameters).

  4. Look up the names of the synsets for each file from the pickle file (http://ltdata1.informatik.uni-hamburg.de/shortest_path/graph/nodes.pkl).

You can do it using this script: https://github.com/uhh-lt/shortpath2vec/blob/master/deepwalk/convert_embedding.py

OR manually, like ...


import pickle

with open('nodes.pkl', 'rb') as f:
    pkl = pickle.load(f)  # mapping from node index to synset name
synset_name = pkl.get(node_index)  # node_index: integer node id from the embedding file
  5. Perform the evaluation of each model using the evaluation script. See details here: https://github.com/uhh-lt/shortpath2vec.

  6. Save the results in this table in the 'node2vec' sheet (similarly to the deepwalk method):
    https://docs.google.com/spreadsheets/d/1KjNns16ld3pVUY1K7aA0HH9Lrb1PiKBH7-ZIbWrD9Kc/edit#gid=1803816318
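
For step 2, the conversion could look like this minimal sketch with networkx; the file names are illustrative:

import networkx as nx

graph = nx.read_adjlist('wordnet.adjlist')
# node2vec expects an edge list: one "node_a node_b" pair per line
nx.write_edgelist(graph, 'wordnet.edgelist', data=False)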

Improvements in the embeddings script

  1. Add a command line interface using argparse. These parameters should be configurable (see the sketch after this list):

  trainfile = sys.argv[1]  # Gzipped file with pairs and their similarities
  embedding_dimension = int(sys.argv[2])  # vector size (for example, 20)
  batch_size = int(sys.argv[3])  # number of pairs in a batch (for example, 10)
  learn_rate = float(sys.argv[4])   # Learning rate
  
  negative = 3  # number of negative samples

  2. Merge embeddings_2.py and embeddings.py. Add the changes from the new script to the old one, making them optional. Add special arguments (turned off by default) like --use-neighbours and --number-of-neighbours which control the number of used nearest neighbours.
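
A minimal sketch of the requested argparse interface; argument names and defaults are illustrative, not the script's actual flags:

import argparse

parser = argparse.ArgumentParser(description='Train path2vec embeddings')
parser.add_argument('--input_file', required=True,
                    help='Gzipped file with pairs and their similarities')
parser.add_argument('--vsize', type=int, default=20, help='Vector size')
parser.add_argument('--bsize', type=int, default=10, help='Number of pairs in a batch')
parser.add_argument('--lrate', type=float, default=0.001, help='Learning rate')
parser.add_argument('--negative', type=int, default=3, help='Number of negative samples')
parser.add_argument('--use_neighbors', action='store_true',
                    help='Turn on nearest-neighbour regularization (off by default)')
parser.add_argument('--neighbor_count', type=int, default=3,
                    help='Number of nearest neighbours to use')
args = parser.parse_args()
print(args.input_file, args.vsize, args.bsize, args.lrate)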

Training the model on the entire WordNet

Hi,

First of all thanks for publishing this code. Second, I want to re-train the model on the entire WordNet, taking all parts-of-speech into account. Where do I need to make changes to accomplish this?

Thanks in advance.

Evaluation based on word sense disambiguation

Motivation

We plan to submit a paper to the EMNLP conference http://emnlp2018.org about the graph embedding approach implemented in this repository. In this experiment, we use our model to learn vector representations of WordNet nodes (synsets). Each vector is in the word2vec format.

One of the evaluation methods for the obtained embeddings is to apply them to word sense disambiguation (https://en.wikipedia.org/wiki/Word-sense_disambiguation). In particular, it is straightforward to do this with a pre-existing graph-based algorithm described in the following paper:
http://www.cse.unt.edu/~tarau/research/misc/wsd/papers/sinha.ieee07.pdf

Unfortunately, the implementation of the algorithm is not available. The goal is to re-implement the algorithm presented in the paper and to reproduce the experiment presented there. After this, the goal would be to replace the WordNet-based similarity metrics used in the paper with the graph embedding similarities and to show that the difference in the results is small.

Implementation

  1. First read this paper carefully to understand the method which has to be implemented. Make sure you understand Algorithm 1.

  2. Implement the word sense disambiguation (WSD) method presented in the paper as Algorithm 1 in Python 3. Use the NetworkX library to (a) represent the graph structure and (b) rely on its PageRank implementation: https://networkx.github.io/documentation/networkx-1.10/reference/generated/networkx.algorithms.link_analysis.pagerank_alg.pagerank.html . A rough sketch appears after this list.

  • Implement two options of the algorithm: using the PageRank and InDegree (see the paper). Do not implement other versions of the algorithm.

  • Use the NLTK library for implementations of the JCN (Jiang-Conrath) and LCH (Leacock-Chodorow) similarity measures between words: see the 'similarity' section here: http://www.nltk.org/howto/wordnet.html. Do not implement options for other measures described in the paper.

  3. For conducting the evaluation experiment on the SensEval 2/3 datasets, rely on this framework: http://lcl.uniroma1.it/wsdeval/evaluation-data

  4. Report the results in a Google Sheets table. Compare them to the original scores from the paper (they should be reasonably close).

  5. Use the gensim library to load graph embeddings of the synsets. Replace the JCN and LCH word similarity measures with their vectorized versions. Models in the word2vec format are available here: http://ltdata1.informatik.uni-hamburg.de/shortest_path/models/

  6. Make a version of the algorithm which uses not the original JCN / LCH measures but their vectorized counterparts.

  7. Report in the Google Sheet the performance of the WSD algorithm on the same dataset based on the vector-based synset similarities.
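
A rough sketch of the PageRank variant described in step 2, using networkx and the NLTK LCH measure, restricted to noun candidates; names are illustrative, not from this repository:

import networkx as nx
from nltk.corpus import wordnet as wn

def disambiguate(words):
    # pick one synset per word by running PageRank over a synset similarity graph
    graph = nx.Graph()
    candidates = {w: wn.synsets(w, pos=wn.NOUN) for w in words}
    for w1 in words:
        for w2 in words:
            if w1 >= w2:
                continue  # skip self-pairs and duplicate pairs
            for s1 in candidates[w1]:
                for s2 in candidates[w2]:
                    # edge weight = LCH similarity between candidate senses
                    graph.add_edge(s1, s2, weight=s1.lch_similarity(s2))
    ranks = nx.pagerank(graph, weight='weight')
    # for each word, keep the candidate synset with the highest PageRank score
    return {w: max(candidates[w], key=lambda s: ranks.get(s, 0.0))
            for w in words if candidates[w]}

print(disambiguate(['customer', 'bank', 'account']))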

How to get the right embeddings for ambiguous words?

I want to combine the embeddings of all found word synsets to represent sentence embeddings.

Consider a sentence containing words with multiple meanings. For example, "the customer has opened a bank account". The word "bank" has multiple synsets in WordNet. So how can I get the right one given the context of the sentence?

My guess was to use sentence_wsd() in wsd\graph_wsd_test_v2.py. I adjusted the code a little bit to support any sentence (without having the SensEval annotations), but the resulting synset selection was not really satisfying.

Results of regularized models on WordNet similarities

@m-dorgham can you add to the 'regularizers' tab in the Google Spreadsheet the evaluation scores calculated on WordNet similarities?
What you have there now are the scores calculated either in the 'dynamic' setup (on the simlex_original.tsv test set) or in the 'static' setup (on the test sets from simlex/simlex_synsets subdirectory).

What I'm asking for now are the Spearman correlation values calculated on the test sets from the simlex subdirectory:

  • simlex_jcn_brown.tsv
  • simlex_jcn_semcor.tsv
  • simlex_lch.tsv

It seems we should focus more on these metrics, since they are more closely related to our original aim, that is, approximating arbitrary graph distance metrics with dense vectors.
