malllabiisc / wordgcn Goto Github PK

View Code? Open in Web Editor NEW

289.0 14.0 64.0 5.19 MB

ACL 2019: Incorporating Syntactic and Semantic Information in Word Embeddings using Graph Convolutional Networks

License: Apache License 2.0

C++ 6.78% Python 93.01% Makefile 0.21%

graph-convolutional-networks gcn tensorflow deep-learning-tutorial word-embeddings natural-language-processing acl2019

wordgcn's Introduction

WordGCN

Incorporating Syntactic and Semantic Information in Word Embeddings using Graph Convolutional Networks

Overview of WordGCN

Overview of SynGCN: SynGCN employs Graph Convolution Network for utilizing dependency context for learning word embeddings. For each word in vocabulary, the model learns its representation by aiming to predict each word based on its dependency context encoded using GCNs. Please refer Section 5 of the paper for more details.

Dependencies

Compatible with TensorFlow 1.x and Python 3.x.
Dependencies can be installed using requirements.txt.
- pip3 install -r requirements.txt
Install word-embedding-benchmarks used for evaluating learned embeddings.
- The test and valid dataset splits used in the paper can be downloaded from this link. Replace the original ~/web_data folder with the provided one.
- For switching between valid and test split execute python switch_evaluation_data.py -split <valid/valid>

Dataset:

We used Wikipedia corpus. The processed version can be downloaded from here or using the script below:

pip install gdown
gdown --id 1iFpuKFpDnXCD9QpUw8wStG3ndKl7-KwX -O data.zip
unzip data.zip
rm data.zip

The processed dataset includes:
- voc2id.txt mapping of words to to their unique identifiers.
- id2freq.txt contains frequency of words in the corpus.
- de2id.txt mapping of dependency relations to their unique identifiers.
- data.txt contains the entire Wikipedia corpus with each sentence of corpus stored in the following format:
```
<num_words> <num_dep_rels> tok1 tok2 tok3 ... tokn dep_e1 dep_e2 .... dep_em
```
  - Here, num_words is the number of words and num_dep_rels denotes the number of dependency relations in the sentence.
  - tok_1, tok_2 ... is the list of tokens in the sentence and dep_e1, dep_e2 ...is the list of dependency relations where each is of form source_token|destination_token|dep_rel_label.

Training SynGCN embeddings:

Download the processed Wikipedia corpus (link) and extract it in ./data directory.
Execute make to compile the C++ code for creating batches.

To start training run:

python syngcn.py -name test_embeddings -gpu 0 -dump 
                 -maxsentlen <max_sentence_length in your data.txt> 
                 -maxdeplen <max_dependency_length in your data.txt> 
                 -embed_dim 300

The trained embeddings will be stored in ./embeddings directory with the provided name test_embeddings .
Note: As reported in TensorFlow issue #13048. The current SynGCN's TF-based implementation is slow compared to Mikolov's word2vec implementation. For training SynGCN on a very large corpus might require multi-GPU or C++ based implementation.

Fine-tuning embedding using SemGCN:

Pre-trained 300-dimensional SynGCN embeddings can be downloaded from here.

For incorporating semantic information in given embeddings run:

python semgcn.py -embed ./embeddings/pretrained_embed.txt 
                 -semantic synonyms -embed_dim 300 
                 -name fine_tuned_embeddings -dump -gpu 0

The fine-tuned embeddings will be saved in ./embeddings directory with name fine_tuned_embeddings.

Extrinsic Evaluation:

For extrinsic evaluation of embeddings the models from the following papers were used:

NCR (Neural Co-reference Resolution): Higher-order Coreference Resolution with Coarse-to-fine Inference.
NER (Named Entity Recognition): NeuroNER: an easy-to-use program for named-entity recognition based on neural networks.
POS (Part-of-speech tagging): BiLSTM-CNN-CRF architecture for sequence tagging.
SQuAD (Question Answering): Simple and Effective Multi-Paragraph Reading Comprehension

Citation:

Please cite the following paper if you use this code in your work.

@inproceedings{wordgcn2019,
    title = "Incorporating Syntactic and Semantic Information in Word Embeddings using Graph Convolutional Networks",
    author = "Vashishth, Shikhar  and
      Bhandari, Manik  and
      Yadav, Prateek  and
      Rai, Piyush  and
      Bhattacharyya, Chiranjib  and
      Talukdar, Partha",
    booktitle = "Proceedings of the 57th Conference of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1320",
    pages = "3308--3318"
}

For any clarification, comments, or suggestions please create an issue or contact Shikhar.

wordgcn's People

Contributors

Stargazers

Watchers

wordgcn's Issues

urllib.error.URLError

Hello, I am running "python syngcn.py -name test_embeddings -gpu 0 -dump "to get the URL error. I want to ask how to solve this error and how to crawl the dataset after it has been downloaded? Thank you very much for your help.

Verse BERT

I think your work is really interesting. But Does it have any meaning after the BERT model?
It seems your work is like word2vec or Glove, these static word embedding. Can it encode text dynamically like BERT?

How can I replace

Hi,
To test and validate the embeddings, I need to "replace the original ~/web_data folder with the provided one", but I don't know who to replace the dataset in word-embeddings-benchmark project. The evaluation tool will automatically download datasets from google drive.
Can you provide a more detailed instruction?
Thank you very much!

“from web.embedding import Embedding” What package for this？

I can‘t find a package for “from web.embedding import Embedding”

No embeddings generated?

Hi,

I came up with a problem while using semgcn to fine tune the given 'syngcn_embeddings.txt'.
sudo python3 semgcn.py -embed ./embeddings/syngcn_embeddings.txt -semantic synonyms -embed_dim 300 -name fine_tuned_embeddings -epoch 10 -gpu 0

Everything seems going well. However, after the training finished (and it printed success message), I found nothing in the ./embeddings directory except for syngcn_embeddings.txt, which I put it there as the training set.
In addition, the log files were output to ./log successfully, and those evaluations in them accord with the data given in the paper.

I tried several times and it turned out the same.
Could anyone please tell me why this happened? Thanks!

Hi, what can I use this model?

If I pre-train a model, how can I use to process the text? Is it same as word2vec or Glove?

About the stopping criteria

Hi @svjan5 ,

I am curious about the stopping criteria for training you used in the paper. Is it the same as in the code by depending on the average score of all sorts of word similarity/analogy/categorisation tasks? Because I found that using the average score to save best model brought so much stochastics. Though the final average scores are similar for multiple runs, the score for specific tasks could have huge differences. Besides, do you think it is plausible to use those intrinsic tasks to decide the best model during training given the fact that you would evaluate your model on those tasks for comparison with other models?

Best,
Qiwei

"ModuleNotFoundError: No module named 'web"

Hello,there are some problem with me.
When I run "python syngcn.py -name test_embeddings -gpu 0", It has a error
"ModuleNotFoundError: No module named 'web"
So I run "pip install web.py",
Then "ModuleNotFoundError: No module named 'web.embedding".
I wanna to know how can I use web.embedding.
Thanks.

About using own text data for SynGCN and SemGCN

Your WordGCN paper is very exciting and very well written, so I want to try to use your code in my current work, and I would like to ask you some questions.
For training SynGCN and SemGCN, If I try to use other text data such as transcripts of speech recognition benchmark corpus (AMI) rather than the Wikipedia corpus and receive the AMI corpus-based SynGCN and SemGCN word embeddings, what is the first step I need to do, or how to process my own text data.
Thanks!

Shih-Hsuan

Segmentation fault (core dumped)

Hi, when I run "python syngcn.py -name test_embeddings -gpu 0". A problem has arisen, Segmentation fault (core dumped).Have you encountered such a problem?
Thank you a lot.

Question about the edge direction in SemGCN

In Figure 2 of your paper, for hypernym relation the edge direction is water -> liquid. In NLTK WordNet API, liquid is the hypernym of water, why the edge direction is not water <- liquid?

About SemGCN embeddings

I downloaded the pretrained SynGCN embeddings from your WordGCN github and then run the script "python semgcn.py -embed ./embeddings/syngcn_embeddings.txt -gpu 0 -epoch 10 -name fine_tuned_embeddings", but after the model was successfully trained, I cannot find/get the finetuned SemGCN embeddings. What should I do?
Thanks!

About the gating mechanism

Hi @svjan5 ,
After reading your source code, I have found some places that make me confused. In the paper, the formula you mentioned is like:

And where:

while in your code, it is like this:

with tf.name_scope("in_arcs-%s_name-%s_layer-%d" % (lbl, name, layer)):
	inp_in     = tf.tensordot(gcn_in, w_in, axes=[2,0]) + tf.expand_dims(b_in, axis=0)
	adj_matrix = tf.transpose(adj_mat[lbl], [0,2,1])
	in_t 	   = self.aggregate(inp_in, adj_matrix)							
	if self.p.dropout != 1.0: in_t    = tf.nn.dropout(in_t, keep_prob=self.p.dropout)
	if w_gating:
		inp_gin = tf.tensordot(gcn_in, tf.sigmoid(w_gin), axes=[2,0]) + tf.expand_dims(b_gin, axis=0)
		in_act  = self.aggregate(inp_gin, adj_matrix)
	else:
		in_act   = in_t

It seems to me that the calculated in_t or inp_in is never used when enable gating which might not align with the formula where there is a multiplication in between. And the weight w_in and w_out would never be updated in the code. May you please give me some information how the calculated in_t or the w_in and w_out are used under gating mechanism in your code?

Many thanks.

SSLError: Certificate verify failed

Hi,
Nice paper!

I had some problem during running 'semgcn.py'.
I downloaded pretrained 300-dimensional SynGCN embeddings from your README.md, and tried to fine tune them using semgcn.py. Here is how I typed:
sudo python3 semgcn.py -embed ./embeddings/syngcn_embeddings.txt -semantic synonyms -embed_dim 300 -name fine_tuned_embeddings -gpu 0

However, after the percentile hit 100, some Error occurred:

2019-09-20 19:57:05,665 - [INFO] - E:0 (Sents: 64640/64640 [100.0]): Train Loss0.36636	fine_tuned_embeddings_20_09_2019_19:44:56	0.0

Traceback (most recent call last):
  File "/usr/lib/python3.5/urllib/request.py", line 1254, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "/usr/lib/python3.5/http/client.py", line 1122, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python3.5/http/client.py", line 1167, in _send_request
    self.endheaders(body)
  File "/usr/lib/python3.5/http/client.py", line 1118, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python3.5/http/client.py", line 944, in _send_output
    self.send(msg)
  File "/usr/lib/python3.5/http/client.py", line 887, in send
    self.connect()
  File "/usr/lib/python3.5/http/client.py", line 1276, in connect
    server_hostname=server_hostname)
  File "/usr/lib/python3.5/ssl.py", line 377, in wrap_socket
    _context=self)
  File "/usr/lib/python3.5/ssl.py", line 752, in __init__
    self.do_handshake()
  File "/usr/lib/python3.5/ssl.py", line 988, in do_handshake
    self._sslobj.do_handshake()
  File "/usr/lib/python3.5/ssl.py", line 633, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "semgcn.py", line 613, in <module>
    model.fit(sess)
  File "semgcn.py", line 567, in fit
    self.checkpoint(train_loss, epoch, sess)
  File "semgcn.py", line 494, in checkpoint
    results		= evaluate_on_all(embedding)
  File "/usr/local/lib/python3.5/dist-packages/web-0.0.1-py3.5.egg/web/evaluate.py", line 370, in evaluate_on_all
    "TR9856": fetch_TR9856(),
  File "/usr/local/lib/python3.5/dist-packages/web-0.0.1-py3.5.egg/web/datasets/similarity.py", line 335, in fetch_TR9856
    'similarity', uncompress=True, verbose=0),
  File "/usr/local/lib/python3.5/dist-packages/web-0.0.1-py3.5.egg/web/datasets/utils.py", line 741, in _fetch_file
    handlers=handlers)
  File "/usr/local/lib/python3.5/dist-packages/web-0.0.1-py3.5.egg/web/datasets/utils.py", line 648, in _fetch_helper
    data = url_opener.open(request)
  File "/usr/lib/python3.5/urllib/request.py", line 466, in open
    response = self._open(req, data)
  File "/usr/lib/python3.5/urllib/request.py", line 484, in _open
    '_open', req)
  File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.5/urllib/request.py", line 1297, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/usr/lib/python3.5/urllib/request.py", line 1256, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645)>

I'm sure that my Ubuntu is properly connected to the Internet. So why does this happen?

Thanks!

i need help！！！

why i will have this question？

Traceback (most recent call last):
  File "D:\environment\Anaconda\lib\logging\config.py", line 562, in configure
    handler = self.configure_handler(handlers[name])
  File "D:\environment\Anaconda\lib\logging\config.py", line 735, in configure_handler
    result = factory(**kwargs)
  File "D:\environment\Anaconda\lib\logging\__init__.py", line 1087, in __init__
    StreamHandler.__init__(self, self._open())
  File "D:\environment\Anaconda\lib\logging\__init__.py", line 1116, in _open
    return open(self.baseFilename, self.mode, encoding=self.encoding)
OSError: [Errno 22] Invalid argument: 'D:\\workspace\\WordGCN-master\\WordGCN-master\\log\\test_run_15_06_2020_08:32:34'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:/workspace/WordGCN-master/WordGCN-master/syngcn.py", line 592, in <module>
    model = SynGCN(args)
  File "D:/workspace/WordGCN-master/WordGCN-master/syngcn.py", line 439, in __init__
    self.logger = get_logger(self.p.name, self.p.log_dir, self.p.config_dir)
  File "D:\workspace\WordGCN-master\WordGCN-master\helper.py", line 66, in get_logger
    logging.config.dictConfig(config_dict)
  File "D:\environment\Anaconda\lib\logging\config.py", line 799, in dictConfig
    dictConfigClass(config).configure()
  File "D:\environment\Anaconda\lib\logging\config.py", line 570, in configure
    '%r' % name) from e
ValueError: Unable to configure handler 'file_handler'

Process finished with exit code 1

If i just want to train the model, do i need to replace the data?

If i just want to train the model, do i need to replace the data with the web_data you provided, such as synGCN?
And if i fine-tuned the symGCN using your provided embedding, should I replace the data?

how to build a syntactic graph

Hi,Thank you for your amazing work
However,this code doesn't seem to contain the part of how to build the syntactic graph. can i ask for sharing the code of this part?
thank you very much.

How do you acquire semantic_info?

I wonder how did you obtained lists of semantic relation pairs in the folder semantic_info?

Handling of outgoing and incoming arcs

Hello,
Thank you for your paper, as well as releasing the code.

My question is about Processing of edges in the GCN, the original paper differentiated incomings and outgoing arcs by modeling two matrices, each one for each direction. They do so to avoid overparametrize the model (when adding reversed edges), also I saw that you compute the self loop vector, but you did not include it in the update formulae.

Could you tell me if I misunderstood the code.

Thank you

Kindly regards

Can we upload our own dataset?

Do you have scripts available/any easy way to convert raw data to your processed dataset files . So that i can test your on my own dataset .

Can not compile the batch_generator.cpp

Hi,

The WordGCN is an interesting thing. However, I can't compile the batch_generator.cpp with the command "make" following the Readme. Meanwhile, the requirements can not be found in the repository.

g++ batch_generator.cpp -o batchGen.so -fPIC -shared -pthread -O3 -march=native -std=c++11
batch_generator.cpp:96:3: error: expected identifier before ‘)’ token
) {
^
makefile:2: recipe for target 'all' failed
make: *** [all] Error 1

Thanks.

Detailed experimental parameters settings

Hi,
Thank you for your paper, as well as releasing the code.
I follow your source code with your default settings, we obtain a poor result. The experimental setup as shown below:

2019-09-22 11:04:39,606 - test_embeddings_22_09_2019_11:04:39 - [INFO] - {'embed_loc': None, 'gcn_layer': 1, 'batch_size': 512, 'sample': 0.0001, 'lr': 0.001, 'config_dir': './config/', 'dropout': 1.0, 'max_epochs': 5, 'total_sents': 56974869, 'num_neg': 25, 'log_dir': './log/', 'side_int': 10000, 'log_db': 'aaai_runs', 'emb_dir': './embeddings/', 'opt': 'adam', 'onlyDump': False, 'restore': False, 'l2': 0.0, 'context': False, 'gpu': '0', 'seed': 1234, 'name': 'test_embeddings_22_09_2019_11:04:39', 'embed_dim': 300}

| WS353S | WS353R | SimLex999 | RW | AP | Battig | BLESS | SemEval2012 | MSR|
SynGCN | 73.2 | 45.7 | 45.5 | 33.7 | 69.3 | 45.2 | 85.2 | 23.4 | 52.8
our imp. | 75.4 | 39.9 | 44.7 | 30.1 | 66.8 | 44.9 | 77.0 | 21.5 | 41.3

Where did I go wrong?

Would it possible for you to release your pre-trained model checkpoint?

Hi,

Thanks very much for your work, it's really impressive. I have managed to run the code with default setting on given dataset which consists of 57 million sentences on a Titan V, it takes around 18 hours to go over just one epoch(I noticed that your negative samples are set to be 100, wouldn't it be too large?). I wonder would it be possible for you to also release a pre-trained checkpoint and may I also ask your gpu and runtime?

Many thanks.

Segmentation fault (core dumped)

I clone the bug fixed code and run it with setting max length to 50, 70, 90, 150, segmentation fault comes as before. My TF version is gpu-1.12 and run it on two pieces of Tesla K80.

Issue with GetBatches function

I am trying to replicate your code with the same dataset which you used for training. My code stops running once it enters run_epoch function in syngcn.py code.
The issue seems to appear in self.getBatches(shuffle), however I ran make command and BatchGen.so is created. So not really sure why my code stops running without any error.

What does data.txt mean?

Hi,
I got a problem while trying to generate my own data.txt.
Specifically, I found that the initial data.txt is not of the format you have mentioned in README.md. (As follows)
<num_words> <num_dep_rels> tok1 tok2 tok3 ... tokn dep_e1 dep_e2 .... dep_em
They are actually organized like this (the first line of the initial data.txt file)
15 14 15 24351 24351 10 7 436 2083 26 8385 121958 4986 215 13 6932 2293 2 1|0|26 5|1|11 5|2|23 5|3|34 5|4|7 7|6|11 5|7|9 9|8|7 7|9|38 9|10|13 13|11|2 13|12|7 10|13|16 5|14|10 21854 21854 3 15 659 2324 0 2397 0 479 328 4 5905 7965 0
which have 4 parts, the first part '15 14 15' --I guess they are the numbers of the latter three parts? So what does the latter three parts represent?

I re-read the 'batch_generator.cpp', and it seems the last part of each line (i.e. the sequence of numbers after the dependency relations) are read but not stored.
Therefore, would it work if I set the first three numbers as (number of words in the sentence, number of dependency relations, 0), and leave the last part empty?

This problem confuses me for a long time... I tried setting the last part the same as the sentence tokens, and it kept showing segmentation fault...

Would you please give a description of data.txt, and also update the README.md?

Thanks!
@svjan5

from models import model？？？

from models import model
where is the models

Reproduction Problem

Hi @svjan5 ,

Thanks for your paper, as well as releasing the code.

I follow your current code and default settings, after several runs it seems hard to reproduce your reported result on test set.

My results on five runs with deviation are like below:

Analogy task:

	Google	MSR	SemEval2012_2
our	45.16±1.61	49.41±0.60	16.28±1.73
reported		52.8	23.4

Similarity task

	MEN	WS353	WS353R	WS353S	SimLex999	RW	RG65	MTurk	TR9856
our	69.99±0.19	58.35±0.52	43.51±1.68	70.68±0.51	47.61±0.29	37.91±0.47	58.19±1.33	59.90±0.86	17.23±0.26
reported			45.7	73.2	45.5	33.7

Categorisation task

	AP	BLESS	Batting	ESSLI_2c	ESSLI_2b	ESSLI_1a
our	59.22±1.97	69.04±0.86	39.50±1.34	67.41±3.63	77.78±6.20	80.67±1.33
reported	69.3	85.2	45.2

As you can see, there is a large gap for tasks like SemEval2012_2 and categorisation tasks. The deviations for several tasks are also a little bit large.

I wonder where did I go wrong? Forgive my carelessness, Is there anything I missed?

StopIteration error from referenced web package in python 3.7.0

Described in detail here:

https://stackoverflow.com/questions/51700960/runtimeerror-generator-raised-stopiteration-every-time-i-try-to-run-app

Using python 3.5.0 resolved the issue