ibm / grapher

Code that implements efficient knowledge graph extraction from textual descriptions

License: Apache License 2.0

Python 95.32% Shell 4.68%

grapher's Introduction

Knowledge Graph Generation From Text

Description

Grapher is an end-to-end, multi-stage Knowledge Graph (KG) construction system that separates the overall generation process into two stages.

The graph nodes are generated first using a pretrained language model, such as T5. The input text is transformed into a sequence of text entities. The features corresponding to each entity (node) are extracted and then sent to the edge generation module.

The edges are then constructed using either a generation head (e.g., a GRU) or a classifier head. Blue circles represent the features corresponding to the actual graph edges (solid lines), and the white circles are the features that are decoded into ⟨NO_EDGE⟩ (dashed lines).
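The two stages can be pictured with a small, dependency-free sketch. It is not the repository's implementation: `build_triples`, `toy_predict_edge`, the lookup table, and the relation labels are hypothetical stand-ins for the language model and the edge head, and querying every ordered node pair is an assumption made for illustration.

```python
from itertools import permutations

NO_EDGE = "NO_EDGE"  # stands in for the special no-edge label


def build_triples(nodes, predict_edge):
    """Stage 2 sketch: query each ordered node pair and keep only real edges."""
    triples = []
    for head, tail in permutations(nodes, 2):
        relation = predict_edge(head, tail)
        if relation != NO_EDGE:
            triples.append((head, relation, tail))
    return triples


# Stage 1 stand-in: nodes a language model might emit for an input sentence.
nodes = ["Danielle Harris", "Super Capers", "98 minutes"]

# Toy lookup table standing in for the GRU or classifier edge head.
# The relation names are illustrative, not actual model output.
toy_edges = {
    ("Super Capers", "Danielle Harris"): "starring",
    ("Super Capers", "98 minutes"): "runtime",
}


def toy_predict_edge(head, tail):
    """Return a relation label for the pair, or NO_EDGE if none is known."""
    return toy_edges.get((head, tail), NO_EDGE)


triples = build_triples(nodes, toy_predict_edge)
for triple in triples:
    print(triple)
```

Most node pairs have no relation, which is why a dedicated no-edge label is needed: the edge head must be able to answer "nothing here" for the majority of queries.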

Environment

To run this code, please install PyTorch and PyTorch Lightning (we tested the code on PyTorch 1.13 and PyTorch Lightning 1.8.1).
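A quick way to compare an environment against the tested versions is sketched below. It uses only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names `torch` and `pytorch-lightning` are the usual PyPI names rather than something stated in this README.

```python
from importlib import metadata

# Versions the README reports as tested.
TESTED = {"torch": "1.13", "pytorch-lightning": "1.8.1"}


def installed_versions(packages):
    """Return {package: installed version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


for name, version in installed_versions(TESTED).items():
    print(f"{name}: installed={version}, tested={TESTED[name]}")
```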

Setup

Install dependencies

# clone project   
git clone git@github.com:IBM/Grapher.git

# navigate to the directory
cd Grapher

# clone an external repository for reading the data
git clone https://gitlab.com/webnlg/corpus-reader.git corpusreader

# clone another external repository for scoring the results
git clone https://github.com/WebNLG/WebNLG-Text-to-triples.git WebNLG_Text_to_triples

Data

WebNLG 3.0 dataset

# download the dataset   
git clone https://gitlab.com/shimorina/webnlg-dataset.git

How to train

There are two scripts, one for each version of the algorithm:

# navigate to scripts directory
cd scripts

# run Grapher with the edge generation head
bash train_gen.sh

# run Grapher with the classifier edge head
bash train_class.sh

How to test

# run the test on experiment "webnlg_version_1" using latest checkpoint last.ckpt
python main.py --run test --version 1 --default_root_dir output --data_path webnlg-dataset/release_v3.0/en

# run the test on experiment "webnlg_version_1" using checkpoint at iteration 5000
python main.py --run test --version 1 --default_root_dir output --data_path webnlg-dataset/release_v3.0/en --checkpoint_model_id 5000

How to run inference

# run inference on experiment "webnlg_version_1" using latest checkpoint last.ckpt
python main.py --run inference --version 1 --default_root_dir output --inference_input_text "Danielle Harris had a main role in Super Capers, a 98 minute long movie."
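When inference is driven from another Python program, the same invocation can be assembled as an argument list. This is only a sketch: `inference_command` is a hypothetical helper, it uses just the flags shown in the command above, and it assumes the command is launched from the directory that contains main.py with a trained checkpoint already present under the root directory.

```python
import shlex


def inference_command(text, version=1, root_dir="output"):
    """Build the inference invocation shown above as an argument list."""
    return [
        "python", "main.py",
        "--run", "inference",
        "--version", str(version),
        "--default_root_dir", root_dir,
        "--inference_input_text", text,
    ]


cmd = inference_command(
    "Danielle Harris had a main role in Super Capers, a 98 minute long movie."
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging

# To execute it (requires a trained checkpoint):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems when the input text contains quotes or commas.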

Results

Results can be visualized in TensorBoard:

tensorboard --logdir output

Citation

@inproceedings{grapher2022,
  title={Knowledge Graph Generation From Text},
  author={Igor Melnyk and Pierre Dognin and Payel Das},
  booktitle={Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (Findings of EMNLP)},
  year={2022}
}

grapher's Issues

Timeout errors

Unable to train the model either way due to connection timeout errors.

Segmentation fault: core dumped

bash train_gen.sh 

train_gen.sh: line 29: 3411753 Segmentation fault      (core dumped) python main.py --version 2 --default_root_dir output --run train --max_epochs 100 --accelerator gpu --num_nodes 1 --num_data_workers 4 --lr 1e-4 --batch_size 11 --num_sanity_val_steps 0 --fast_dev_run 0 --overfit_batches 0 --limit_train_batches 1.0 --limit_val_batches 1.0 --limit_test_batches 1.0 --accumulate_grad_batches 10 --detect_anomaly True --data_path webnlg-dataset/release_v3.0/en --log_every_n_steps 100 --val_check_interval 1000 --checkpoint_step_frequency 1000 --focal_loss_gamma 3 --dropout_rate 0.5 --num_layers 2 --edges_as_classes 0 --checkpoint_model_id -1

Sorry, I have no idea why this could be a problem.
linux, pytorch_lightning 1.8.5, pytorch 1.13 cuda 113, python 3.8

xml.etree.ElementTree.ParseError

Traceback (most recent call last):
  File "/public/home/jsj_duanpf/zhyifang/llm4kg/Grapher/main.py", line 170, in <module>
    main(args)
  File "/public/home/jsj_duanpf/zhyifang/llm4kg/Grapher/main.py", line 40, in main
    dm.prepare_data()
  File "/public/home/jsj_duanpf/zhyifang/llm4kg/Grapher/data/dataset.py", line 100, in prepare_data
    self.prepareWebNLG()
  File "/public/home/jsj_duanpf/zhyifang/llm4kg/Grapher/data/dataset.py", line 59, in prepareWebNLG
    b.fill_benchmark(files)
  File "/public/home/jsj_duanpf/zhyifang/llm4kg/Grapher/corpusreader/benchmark_reader.py", line 144, in fill_benchmark
    tree = Et.parse(myfile)
  File "/public/home/jsj_duanpf/miniconda3/envs/grapher/lib/python3.9/xml/etree/ElementTree.py", line 1222, in parse
    tree.parse(source, parser)
  File "/public/home/jsj_duanpf/miniconda3/envs/grapher/lib/python3.9/xml/etree/ElementTree.py", line 580, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: syntax error: line 1, column 0
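The message says the parser rejected some file at its very first character, but the traceback does not name the file. A diagnostic along the following lines may help narrow it down. It is a sketch, not part of the repository: `find_unparseable_xml` is a hypothetical helper, and it only checks files ending in .xml, so a non-XML file that the data loader picks up by other means would have to be found separately.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_unparseable_xml(root_dir):
    """Return [(path, error message)] for every .xml file that fails to parse."""
    failures = []
    for path in sorted(Path(root_dir).rglob("*.xml")):
        try:
            ET.parse(path)
        except ET.ParseError as err:
            failures.append((str(path), str(err)))
    return failures


# Example usage, with the path taken from the README's --data_path argument:
# for path, message in find_unparseable_xml("webnlg-dataset/release_v3.0/en"):
#     print(f"{path}: {message}")
```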

Unable to display the output.

Hi, I am trying to run the code on Google Colab, but at the end I am unable to get the final output.

2023-02-27 08:39:09.639300: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-27 08:39:09.769144: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-02-27 08:39:10.469656: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-27 08:39:10.469742: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-27 08:39:10.469758: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.11.2 at http://localhost:6006/ (Press CTRL+C to quit)

Encoding in dataset.py:61 json.load()

When starting training, I got the following error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 238339: character maps to <undefined>

By adding an encoding parameter, the error was resolved:

D = json.load(open(os.path.join(self.output_path, f'{split}.json'), encoding="utf-8"))
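The same fix can be written with a context manager so that the file handle is closed promptly. This is a minimal sketch: `load_split` is a hypothetical helper name, while `output_path` and `split` mirror the names in the line above.

```python
import json
import os


def load_split(output_path, split):
    """Read '<split>.json' as UTF-8, independent of the platform's default encoding."""
    path = os.path.join(output_path, f"{split}.json")
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Without an explicit `encoding`, `open` falls back to the locale's preferred encoding, which on Windows is often a legacy code page; that is consistent with the 'charmap' codec named in the error.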

Last checkpoint Error

Hi @imelnyk
I hope you are well. I intended to use its inference option without training the model and convert my text into a graph. Do I have a way other than model training?

Or do you not put the weights of your model here?

A problem I ran into when I ran the code

I would highly appreciate it if you can give me some suggestions.

requirements.txt

It is hard to find the exact versions you used. Would you be so kind as to provide a requirements.txt file with all libraries and version numbers?

Dataset preprocessing

Hi @imelnyk.
In the paper, three datasets are mentioned:

WebNLG
NYT
TekGen

However, code is provided only for WebNLG. Could you kindly guide me on the others? Where can I get access to them, and how can they be used?

Also, would you be open to sharing the trained model, maybe via email, if I just wish to run inference without training? I'm currently working on a review of models for my research and just want to check the performance of the system.

bug(loss not ignoring padding token in sequence)

I think the loss on this line (https://github.com/IBM/Grapher/blob/main/misc/utils.py#L19) should ignore the padding tokens in the sequence, because they are only there due to variable sequence lengths. Let me know what you think.
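To make the suggestion concrete, here is a dependency-free sketch of a sequence loss that skips padding positions. It is not the repository's loss: `PAD_ID`, `token_nll`, and `masked_mean_loss` are hypothetical names, and the padding id of 0 is an assumption. In PyTorch, the `ignore_index` argument of `torch.nn.functional.cross_entropy` serves the same purpose.

```python
import math

PAD_ID = 0  # hypothetical padding token id


def token_nll(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def masked_mean_loss(seq_logits, targets, pad_id=PAD_ID):
    """Average the per-token loss over non-padding positions only."""
    losses = [
        token_nll(logits, target)
        for logits, target in zip(seq_logits, targets)
        if target != pad_id
    ]
    return sum(losses) / len(losses) if losses else 0.0


# Two real tokens followed by one padding position.
seq_logits = [[0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [5.0, 0.0, 0.0]]
targets = [1, 2, PAD_ID]
print(masked_mean_loss(seq_logits, targets))
```

With the mask in place, whatever the model predicts at padded positions has no effect on the loss, so short sequences in a batch are not rewarded or penalized for their padding.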

CUDA Out of Memory

Hi, I've been following your instructions on the README to train the model. However, I ran into CUDA out-of-memory issues even with a 16 GB GPU.

I have tried to solve the above with the following solutions, but none has worked so far:

1. Decreasing batch_size to 4 -> 2 -> 1
2. Decreasing num_data_workers to 2 -> 1
3. Use torch.cuda.empty_cache()
4. Use gc.collect()
5. Use Google Colab and Kaggle to run a Notebook version

For point 5, the training is able to run up until the end of validation at the first epoch before the entire website crashes.

The following is a more detailed log of the error I received:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 139, in _wrapping_function
    results = function(*args, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 624, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1061, in _run
    results = self._run_stage()
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1140, in _run_stage
    self._run_train()
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1163, in _run_train
    self.fit_loop.run()
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 214, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 200, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 239, in _run_optimization
    closure()
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 147, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 133, in closure
    step_output = self._step_fn()
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 406, in _training_step
    training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1443, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 280, in training_step
    return self.model(*args, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/overrides/base.py", line 98, in forward
    output = self._forward_module.training_step(*inputs, **kwargs)
  File "/home/chenweiyi/Grapher/model/litgrapher.py", line 102, in training_step
    logits_nodes, logits_edges= self.model(text_input_ids,
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/Grapher/model/grapher.py", line 58, in forward
    output = self.transformer(input_ids=text,
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 1660, in forward
    decoder_outputs = self.decoder(
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 1052, in forward
    layer_outputs = layer_module(
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 684, in forward
    self_attention_outputs = self.layer[0](
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 590, in forward
    attention_output = self.SelfAttention(
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 520, in forward
    scores = torch.matmul(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.92 GiB total capacity; 10.48 GiB already allocated; 11.50 MiB free; 10.68 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
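The figures in the final error line can be read off programmatically, which makes it easier to judge whether the fragmentation hint applies. The sketch below is a hypothetical helper (`memory_summary`), not part of PyTorch, and it assumes the message format shown in this log.

```python
import re


def memory_summary(message):
    """Extract the GiB figures from a CUDA OOM message and compute the unused reserve."""
    def gib(label):
        match = re.search(r"([\d.]+) GiB " + re.escape(label), message)
        return float(match.group(1)) if match else None

    allocated = gib("already allocated")
    reserved = gib("reserved in total")
    gap = None
    if allocated is not None and reserved is not None:
        gap = round(reserved - allocated, 2)
    return {
        "total_gib": gib("total capacity"),
        "allocated_gib": allocated,
        "reserved_gib": reserved,
        "reserved_minus_allocated_gib": gap,
    }


msg = (
    "CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.92 GiB total capacity; "
    "10.48 GiB already allocated; 11.50 MiB free; 10.68 GiB reserved in total by PyTorch)"
)
print(memory_summary(msg))
```

For this log, reserved memory exceeds allocated memory by only about 0.2 GiB, so the max_split_size_mb hint seems unlikely to help much; that reading is an interpretation, not a confirmed diagnosis. The log also reports a total capacity of 10.92 GiB for GPU 0, which is less than the 16 GB mentioned in the report.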

How are query nodes implemented in the code?

Hi! I am interested in the query node part of node generation in Grapher, but I cannot find anything about it in the code. Is it implemented in the code?

codes running problem

Excuse me, I would like to ask you a question about how to run your code. Do I have to install all the configuration you require?

When I try to install the external repositories, I run into the problem shown in the photo. I can't get the required repositories; please tell me how to solve this problem.

query node

Hello author, do you still have the archive of the query node code? I am more interested in this piece, so I want to study it.

bug(edge feature generation batch collation)

I think this line in your code assumes that a single token is enough to describe each edge feature. I don't think that is a fair assumption, given that the edge features are combinations of two words with no space between them rather than single words. Please let me know what you think.

Add documentation on how to use a model after it is trained.

How do we actually use the trained model?
How would we feed in text and get the graph associated with that text?

TEKGEN implementation

Hi @imelnyk,
I hope you are well. It is a good project for me. I wonder if there is a TEKGEN implementation for your code. I am trying to train your model with TEKGEN dataset and to see whether there is an improvement for my task using the trained model. Thanks!
Best