
hgat's Introduction

An implementation of the EMNLP 2019 paper "Heterogeneous Graph Attention Networks for Semi-supervised Short Text Classification" and its extension "HGAT: Heterogeneous Graph Attention Networks for Semi-supervised Short Text Classification" (TOIS 2021).

Thank you for your interest in our work! 😄

Requirements

  • Anaconda3 (python 3.6)
  • PyTorch 1.3.1
  • gensim 3.6.0

Easy Run

cd ./model/code/
python train.py

You can change the dataset by modifying the variable "dataset = 'example'" at the top of train.py, or by passing command-line arguments (see train.py).

Our datasets can be downloaded from Google Drive. PS: I accidentally deleted some files but have tried to restore them; I hope everything still runs correctly.

Prepare for your own dataset

The following files are required:

./model/data/YourData/
    ---- YourData.cites                // the adjacencies
    ---- YourData.content.text         // the features of texts
    ---- YourData.content.entity       // the features of entities
    ---- YourData.content.topic        // the features of topics
    ---- train.map                     // the indices of the training nodes
    ---- vali.map                      // the indices of the validation nodes
    ---- test.map                      // the indices of the testing nodes

The formats are as follows:

  • YourData.cites

    Each line contains an edge: "idx1\tidx2\n". eg: "98 13"

  • YourData.content.text

    Each line contains a node: "idx\t[features]\t[category]\n". Note that [features] is a list of floats with '\t' as the delimiter. eg: "59 1.0 0.5 0.751 0.0 0.659 0.0 computers". For multi-label classification, [category] must be a one-hot vector with space as the delimiter, eg: "59 1.0 0.5 0.751 0.0 0.659 0.0 0 1 1 0 1 0".

  • YourData.content.entity

    Similar with .text, just change the [category] to "entity". eg: "13 0.0 0.0 1.0 0.0 0.0 entity"

  • YourData.content.topic

    Similar with .text, just change the [category] to "topic". eg: "64 0.10 1.21 8.09 0.10 topic"

  • *.map

    Each line contains an index: "idx\n". eg: "98"

You can see an example in ./model/data/example/*
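
To make the formats concrete, here is a minimal sketch that writes a toy dataset in the layout above. All indices, feature values, and the "computers" label are made up for illustration; only the file layout follows this README.

import os

root = './model/data/YourData/'
os.makedirs(root, exist_ok=True)

# YourData.cites: one tab-separated edge per line
with open(root + 'YourData.cites', 'w') as f:
    f.write('98\t13\n')

# YourData.content.*: index, tab-separated float features, category
with open(root + 'YourData.content.text', 'w') as f:
    f.write('98\t1.0\t0.5\t0.0\tcomputers\n')
with open(root + 'YourData.content.entity', 'w') as f:
    f.write('13\t0.0\t1.0\t0.0\tentity\n')
with open(root + 'YourData.content.topic', 'w') as f:
    f.write('64\t0.1\t0.2\t0.7\ttopic\n')

# *.map: one node index per line
for split in ('train', 'vali', 'test'):
    with open(root + split + '.map', 'w') as f:
        f.write('98\n')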


A simple data-preprocessing pipeline is provided. Running it successfully requires a TagMe API token (my personal token is provided in tagme.py, but it may become invalid in the future), Wikipedia entity descriptions, and a word2vec model containing entity embeddings. You can prepare these yourself or obtain our files from Google Drive and unzip them to ./data/ .
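
For reference, here is a minimal sketch of calling TagMe's REST API with the requests library. The endpoint and parameter names follow TagMe's public documentation; the token is a placeholder, and this is not the exact code in tagme.py.

import requests

TAGME_ENDPOINT = 'https://tagme.d4science.org/tagme/tag'
TOKEN = 'your-gcube-token'  # placeholder: register with d4science to obtain one

def annotate(text, lang='en'):
    resp = requests.get(TAGME_ENDPOINT,
                        params={'gcube-token': TOKEN, 'text': text, 'lang': lang})
    resp.raise_for_status()
    # each annotation includes the matched span ('spot'), the linked
    # Wikipedia title ('title'), and a confidence score ('rho')
    return resp.json().get('annotations', [])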

Then, you should prepare a data file like ./data/example/example.txt, whose format is: "[idx]\t[category]\t[content]\n".
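
A minimal sketch of reading such a file (the helper name is ours, not from the repo):

def read_raw(path):
    docs = []
    with open(path) as f:
        for line in f:
            idx, category, content = line.rstrip('\n').split('\t', 2)
            docs.append((idx, category, content))
    return docs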

Finally, modify the variable "dataset = 'example'" at the top of each of the following scripts and run:

python tagMe.py
python build_network.py
python build_features.py
python build_data.py

Use HGAT as GNN

If you just want to use the HGAT model as a generic graph neural network, prepare the following files in the format described above:

 ./model/data/YourData/
    ---- YourData.cites                // the adjacencies
    ---- YourData.content.*            // the features of each node type, namely node_type1, node_type2, ...
    ---- train.map                     // the indices of the training nodes
    ---- vali.map                      // the indices of the validation nodes
    ---- test.map                      // the indices of the testing nodes

Then change "load_data()" in ./model/code/utils.py:

type_list = [node_type1, node_type2, ...]
type_have_label = node_type

See the codes for more details.
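
For orientation, here is an illustrative sketch of the kind of per-type loop load_data() runs over these files. All names and parsing details below are assumptions; check ./model/code/utils.py for the real logic.

type_list = ['node_type1', 'node_type2']
type_have_label = 'node_type1'

features = {}
for t in type_list:
    path = './model/data/YourData/YourData.content.' + t
    rows = []
    with open(path) as f:
        for line in f:
            parts = line.rstrip('\n').split('\t')
            # parts[0] is the node index, parts[-1] the category;
            # everything in between is the float feature vector
            rows.append([float(x) for x in parts[1:-1]])
    features[t] = rows
# labels are read only for the node type named in type_have_label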

Citation

If you make use of the HGAT model in your research, please cite the following in your manuscript:

@article{yang2021hgat,
  author = {Yang, Tianchi and Hu, Linmei and Shi, Chuan and Ji, Houye and Li, Xiaoli and Nie, Liqiang},
  title = {HGAT: Heterogeneous Graph Attention Networks for Semi-Supervised Short Text Classification},
  year = {2021},
  publisher = {Association for Computing Machinery},
  volume = {39},
  number = {3},
  doi = {10.1145/3450352},
  journal = {ACM Transactions on Information Systems},
  month = may,
  articleno = {32},
  numpages = {29},
}

@inproceedings{linmei2019heterogeneous,
  title={Heterogeneous graph attention networks for semi-supervised short text classification},
  author={Linmei, Hu and Yang, Tianchi and Shi, Chuan and Ji, Houye and Li, Xiaoli},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
  pages={4823--4832},
  year={2019}
}


hgat's Issues

Is the code complete?

I can't find the definitions of experimentType, write_embeddings, and suPara.
Could anybody tell me how to use them? Thank you very much!

RuntimeError: CUDA out of memory for tagmynews

Hello, when I run the demo script for tagmynews, I get this error:
"RuntimeError: CUDA out of memory. Tried to allocate 3.95 GiB (GPU 0; 10.73 GiB total capacity; 8.07 GiB already allocated; 1.69 GiB free; 41.40 MiB cached)"
My device is a GeForce RTX 2080 Ti.
What device did you use to run the program on tagmynews?

Or what is your distributed solution for HGAT?

How to use unlabeled data

The paper says "all the left documents are for testing, which are also used as unlabeled documents during training", but I didn't find an explanation of how the unlabeled data are used. Could anyone make sense of it?

Maybe ER.py is missing?

Hi, when I run utile_inductive.py, it reports an error (screenshot omitted).
I can't find ER.py; is it missing?
I searched your code base but did not find it. It appears in utile_inductive.py, line 93.

Waiting

Although I found the code on Prof. Chuan Shi's homepage, I'm looking forward to a version with a README, and I'd also like to see the data preprocessing part. Waiting!

Generating entities with TagMe

Hello, when using TagMe to generate entities, did you ever run into timeout problems like this (screenshot omitted)?

neighbors

Where is the code for sampling neighbors? Thank you.

Where can I obtain the .pkl files mentioned in the code?

with open(datapath + 'topic_word_distribution.pkl', 'rb') as f:
    topic_word = pickle.load(f)
with open(datapath + 'doc_topic_distribution.pkl', 'rb') as f:
    doc_topic = pickle.load(f)
with open(datapath + 'doc_index_LDA.pkl', 'rb') as f:

Could you tell me how to obtain these .pkl files? I'd be very grateful.
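
These filenames suggest the outputs of an LDA topic model, so one plausible way to regenerate them is with gensim (a stated requirement of this repo). The topic count and preprocessing below are guesses, and the exact contents of doc_index_LDA.pkl are undocumented.

import pickle
from gensim.corpora import Dictionary
from gensim.models import LdaModel

datapath = './data/example/'
raw_docs = ['a short text', 'another short text']  # placeholder corpus
texts = [doc.split() for doc in raw_docs]

dictionary = Dictionary(texts)
bows = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(bows, num_topics=15, id2word=dictionary)  # 15 topics is a guess

topic_word = lda.get_topics()  # shape: (num_topics, vocab_size)
doc_topic = [lda.get_document_topics(b, minimum_probability=0.0) for b in bows]

with open(datapath + 'topic_word_distribution.pkl', 'wb') as f:
    pickle.dump(topic_word, f)
with open(datapath + 'doc_topic_distribution.pkl', 'wb') as f:
    pickle.dump(doc_topic, f)
# doc_index_LDA.pkl presumably maps document ids to LDA corpus rows;
# it is not reconstructed here because its format is undocumented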

train.py

Some variables in “train.py” are undefined, such as file_hid_para and experimentType.

Looking forward to your reply~

data

Could you please provide the original datasets?

code

Where is the code that builds the HIN for short texts?

build_network.py: 'NoneType' object is not iterable

On this line:
entityList = json.loads(entityList)
When the input string equals "null" (which tagMe.py produced for my dataset), entityList becomes None, which causes an error when iterating over it in the following lines.

For now, I assign entityList an empty list when it is None.
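
A minimal standalone demonstration of that workaround (not the repo's exact code):

import json

entityList = json.loads('null')   # what tagMe.py wrote for a text with no entities
if entityList is None:            # json.loads('null') returns None
    entityList = []               # safe to iterate afterwards
for entity in entityList:
    print(entity)                 # skipped entirely for the null case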

word2vec_gensim_5

你好,能帮忙发一下word2vec_gensim_5或者源数据集吗?给的链接需要链接外网才能下载

waiting~

老哥,等你呢~加油coding

Can you explain what these two lines of code mean?

attention = torch.mul(attention, adj_dense.sum(1).repeat(M, 1).t())
attention = torch.add(attention * self.gamma, adj_dense * (1 - self.gamma))

Confusion when compared with the Paper.

It seems that in the paper, type-level attention is calculated before node-level attention, but in the code it looks like the opposite. Also, in node-level attention there is no concatenation operation as in the paper. And even after applying softmax there are extra steps, and I don't understand what they have to do with the equations in the paper.

attention = F.softmax(attention, dim=1)
attention = torch.mul(attention, adj.sum(1).repeat(M, 1).t())
attention = torch.add(attention * self.gamma, adj.to_dense() * (1 - self.gamma))
h_prime = torch.matmul(attention, g)

What is the significance of these steps? And can someone explain the node-level attention part of the code?
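
One possible reading of those extra steps, with our own comments (this interpretation is an assumption, not the authors' explanation):

attention = F.softmax(attention, dim=1)        # normalize: each row sums to 1
attention = torch.mul(attention, adj.sum(1).repeat(M, 1).t())
                                               # rescale each row by the node's degree,
                                               # so high-degree nodes keep more total mass
attention = torch.add(attention * self.gamma, adj.to_dense() * (1 - self.gamma))
                                               # interpolate between the learned attention
                                               # and plain GCN-style adjacency propagation,
                                               # weighted by the hyperparameter gamma
h_prime = torch.matmul(attention, g)           # aggregate transformed neighbor features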

Experimental results in the paper

Hello, the result of the baseline TextGCN on the MR dataset differs considerably from the result reported in the TextGCN paper. Did the authors make any changes? I'd appreciate an answer.

Possible missing code (build_features.py)

build_features.py, line 111 (screenshot omitted):
Is this line trying to modify the original corpus? I think it currently does nothing. The same information is also used in the function preprocess_corpus_notDropEntity.

Can I replace that line with this?
corpus[int(index2ind[int(ind)])] = corpus[int(index2ind[int(ind)])].replace(ori, ent)
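
If the original line indeed calls str.replace without assigning the result, the suggested fix is right: Python strings are immutable, so replace returns a new string and the corpus entry is untouched unless you assign the result back. A standalone demonstration:

corpus = ['the apple was founded']
corpus[0].replace('apple', 'Apple_Inc')              # return value discarded; no effect
print(corpus[0])                                      # 'the apple was founded'
corpus[0] = corpus[0].replace('apple', 'Apple_Inc')  # assign the result back
print(corpus[0])                                      # 'the Apple_Inc was founded'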

model

Is the model in the code the same as the one described in the paper? I have some doubts about models.py.

data

Can you provide a multi-label dataset?

hgat_for_preprocess.tar.bz2

Could some kind soul share the hgat_for_preprocess.tar.bz2 file? I've tried countless methods, but my connection keeps dropping and I can't download it. Help!
