yangji9181 / hne

Heterogeneous Network Embedding: Survey, Benchmark, Evaluation, and Beyond

Python 59.67% Shell 2.72% Makefile 0.63% C++ 25.30% C 11.68%

hne's Introduction

Heterogeneous Network Representation Learning: Benchmark with Data and Code

Citation

Please cite the following work if you find the data/code useful.

@article{yang2020heterogeneous,
  title={Heterogeneous Network Representation Learning: A Unified Framework with Survey and Benchmark},
  author={Yang, Carl and Xiao, Yuxin and Zhang, Yu and Sun, Yizhou and Han, Jiawei},
  journal={TKDE},
  year={2020}
}

Contact

Please contact us if you have problems with the data/code, and also if you think your work is relevant but missing from the survey.

Yuxin Xiao ([email protected]), Carl Yang ([email protected])

Guideline

Stage 1: Data

We provide 4 HIN benchmark datasets: DBLP, Yelp, Freebase, and PubMed.

Each dataset contains:

3 data files (node.dat, link.dat, label.dat);
2 evaluation files (link.dat.test, label.dat.test);
2 description files (meta.dat, info.dat);
1 recording file (record.dat).

Please refer to the Data folder for more details.
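
As a quick sanity check before moving on to the Transform stage, the file list above can be verified with a few lines of Python. This is a minimal sketch: the file names come from the list above, while the helper name `missing_files` and the example folder path are assumptions made for illustration.

```python
from pathlib import Path

# File names as listed above; the grouping comments mirror that list.
EXPECTED_FILES = [
    "node.dat", "link.dat", "label.dat",   # data files
    "link.dat.test", "label.dat.test",     # evaluation files
    "meta.dat", "info.dat",                # description files
    "record.dat",                          # recording file
]


def missing_files(dataset_dir):
    """Return the expected file names that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_files("Data/PubMed")` (a hypothetical path) returns an empty list when all eight files are present.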

Stage 2: Transform

This stage transforms a dataset from its original format to the training input format.

Users need to specify the target dataset, the target model, and the training settings.

Please refer to the Transform folder for more details.

Stage 3: Model

We provide 13 HIN baseline implementations:

5 Proximity-Preserving Methods (metapath2vec-ESim, PTE, HIN2Vec, AspEm, HEER);
4 Message-Passing Methods (R-GCN, HAN, MAGNN, HGT);
4 Relation-Learning Methods (TransE, DistMult, ComplEx, ConvE).

Please refer to the Model folder for more details.

Stage 4: Evaluate

This stage evaluates the output embeddings based on specific tasks.

Users need to specify the target dataset, the target model, and the evaluation tasks.

Please refer to the Evaluate folder for more details.

hne's Issues

Data missing

The file "path.dat" is missing from all datasets, but it is required by MAGNN/utils.py.

Hi,
I followed your steps for running DistMult and got this error:
  File "src/main.py", line 182, in <module>
    main(args)
  File "src/main.py", line 83, in main
    train_batcher = StreamBatcher(args.data, 'train', args.batch_size, randomize=True, keys=input_keys, loader_threads=args.loader_threads)
  File ".....HNE/Model/DistMult/src/spodernet/spodernet/preprocessing/batching.py", line 217, in __init__
    log.error('Path {0} does not exists! Have you forgotten to preprocess your dataset?', config_path)
  File "....HNE/Model/DistMult/src/spodernet/spodernet/utils/logger.py", line 106, in error
    raise Exception(message.format(*args))
Exception: Path ...../.data/PubMed/train/hdf5_config.pkl does not exists! Have you forgotten to preprocess your dataset?
Exception ignored in: <bound method StreamBatcher.__del__ of <spodernet.preprocessing.batching.StreamBatcher object at 0x7f4869e3b5c0>>
Traceback (most recent call last):
  File ".....HNE/Model/DistMult/src/spodernet/spodernet/preprocessing/batching.py", line 268, in __del__
    for worker in self.loaders:
AttributeError: 'StreamBatcher' object has no attribute 'loaders'

Question about R-GCN training

In model.eval(), the code iterates through every batch to get the node embeddings:
main.py line 112: for batch_num in range(batch_total)

I am wondering why, in model.train(), you don't iterate through all batches in each epoch?

Segmentation Fault [TransE]

Hi,
I tried to run TransE on two different datasets. Both times it throws a segmentation fault.
### Program terminated with signal 11, Segmentation fault.
#0 _int_malloc (av=av@entry=0x7fda5cf9b760 <main_arena>, bytes=bytes@entry=800000) at malloc.c:3780
3780 set_head(remainder, remainder_size | PREV_INUSE);

It works on PubMed and my other dataset (133k triplets).
Any pointers on how to solve this problem?

PS: The system has around 700 GB of RAM.

Embedding of Relations

Hi,
I ran your code for DistMult on the PubMed dataset.
It saves the embeddings of entities, but where does it save the embeddings of relations?
I want to use them in the scoring function to calculate MRR and Hits.
Thanks.
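
For reference, the evaluation described in this question can be sketched in plain Python once both entity and relation embeddings are at hand. This is only an illustration: it assumes the relation embeddings have been obtained somehow (which is exactly what the question asks about), it ranks against all entities without filtering known triples, and the function names are made up for this sketch rather than taken from the repository.

```python
def distmult_score(head, relation, tail):
    """DistMult score: sum over dimensions of head[i] * relation[i] * tail[i]."""
    return sum(h * r * t for h, r, t in zip(head, relation, tail))


def rank_of_true_tail(head, relation, true_tail_id, entity_embeddings):
    """1-based rank of the true tail among all candidate entities (unfiltered)."""
    true_score = distmult_score(head, relation, entity_embeddings[true_tail_id])
    higher = sum(
        1
        for emb in entity_embeddings.values()
        if distmult_score(head, relation, emb) > true_score
    )
    return higher + 1


def mrr(ranks):
    """Mean reciprocal rank."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of test triples whose true tail is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)
```

Here `entity_embeddings` is assumed to be a dict mapping an entity id to a list of floats; ties are resolved optimistically, which is one of several common conventions.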

About source of HAN

Hi, I am curious why you refer to the source code of HAN as NLAH. Is the model in the HAN folder the HAN model?

https://github.com/yangji9181/HNE/tree/master/Model/HAN

May I know how you find the best hyperparameters for the unsupervised task?

May I know how you find the best hyperparameters for the unsupervised task and then use the resulting embeddings for downstream tasks? Is the same data (link.dat.test) used both for tuning the hyperparameters of the unsupervised task and for the downstream tasks? Thank you.

Data

Error in MAGNN

My dgl version is 0.5.2.
In MAGNN utils.py, line 206:
g.from_networkx(ng)
raises an error:
raise DGLError('DGLGraph.from_networkx is deprecated. Please call the following\n\n'
dgl._ffi.base.DGLError: DGLGraph.from_networkx is deprecated. Please call the following
dgl.from_networkx(nx_graph, node_attrs, edge_attrs)
, which creates a new DGLGraph from the networkx graph.

Is it possible to train HAN on weighted graph?

HAN training input file:
link.dat: Each line is formatted as {head_node_id}\t{tail_node_id}\t{link_type}

Is it possible to add edge weights to the input and train HAN on a weighted graph? If so, please advise where I should modify the code. Thanks.
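
To make the question concrete, here is a minimal sketch of reading link.dat lines in the format quoted above, extended with an optional fourth column for an edge weight. The fourth column is purely hypothetical and is not a format the repository is known to accept; the model code would still have to be changed to consume the weight.

```python
def parse_link_line(line):
    """Parse a tab-separated link.dat line.

    Expected fields: head_node_id, tail_node_id, link_type, and optionally
    a weight (hypothetical fourth column; defaults to 1.0 when absent).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (3, 4):
        raise ValueError(f"expected 3 or 4 tab-separated fields, got {len(fields)}")
    head, tail, link_type = fields[:3]
    weight = float(fields[3]) if len(fields) == 4 else 1.0
    return head, tail, link_type, weight
```

Ids are kept as strings here so the sketch makes no assumption about how node ids are encoded.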

DistMult running error

Traceback (most recent call last):
  File "src/main.py", line 13, in <module>
    from spodernet.hooks import LossHook, ETAHook
  File "/home/usr/anaconda3/lib/python3.7/site-packages/spodernet/hooks.py", line 6, in <module>
    from spodernet.utils.util import Timer
ModuleNotFoundError: No module named 'spodernet.utils'

HAN kill!

Hello,
Thanks for this valuable work.
I am trying to run HAN on Colab.
I installed PyTorch and, after running transform.sh, tried to execute run.sh in the HAN folder,
but I see this message in the orange box:

could you please guide me in this issue?
Thank you

Dimension out of range (expected to be in range of [-1, 0], but got -2) [HGT]

pytorch 1.4.0
Dimension out of range (expected to be in range of [-1, 0], but got -2)

dataset : freebase
model: HGT

maybe the data format is wrong

Traceback (most recent call last):
  File "src/main.py", line 137, in <module>
    node_rep, _ = model.forward(node_feature.to(device), node_type.to(device), edge_time.to(device), edge_type.to(device), edge_index.to(device))
  File "/home/bopa/project/HNE/Model/HGT/src/model.py", line 170, in forward
    meta_xs = gc(meta_xs, node_type, edge_index, edge_type, edge_time)
  File "/home/bopa/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/bopa/project/HNE/Model/HGT/src/model.py", line 49, in forward
    return self.propagate(edge_index, node_inp=node_inp, node_type=node_type, edge_type=edge_type, edge_time=edge_time)
  File "/home/bopa/.local/lib/python3.6/site-packages/torch_geometric/nn/conv/message_passing.py", line 233, in propagate
    kwargs)
  File "/home/bopa/.local/lib/python3.6/site-packages/torch_geometric/nn/conv/message_passing.py", line 156, in __collect__
    self.__set_size__(size, dim, data)
  File "/home/bopa/.local/lib/python3.6/site-packages/torch_geometric/nn/conv/message_passing.py", line 119, in __set_size__
    elif the_size != src.size(self.node_dim):
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got -2)

unsupervised training with HAN and MAGNN

Happy new year! I ran into a problem running HAN and MAGNN with unsupervised training on the Yelp data: both run out of my 16 GB of memory. ( T^T ) I am not sure whether there are other problems, but I wonder if this is related to the number of links in the data. Are there any ways to make this code run, such as mini-batching?
I also tried MAGNN on my own data, where the numbers of items in "batch_node_features" and "batch_targets" are not equal, so computing the unsupervised loss raises a KeyError on node_features. (PubMed, of course, runs fine.) I suspect something is wrong with my path.dat file, but I can't figure out what the problem is.

R-GCN RuntimeError: Found dtype Long but expected Float

Dataset: Yelp

Supervised = 'True'

Traceback (most recent call last):
  File "src/main.py", line 188, in <module>
    main(args)
  File "src/main.py", line 94, in main
    if args.supervised=='True': loss = model.get_supervised_loss(pred, matched_labels, matched_index, multi)
  File "/home/lqd/code/HNE-master/Model/R-GCN/src/model.py", line 142, in get_supervised_loss
    predict_loss = F.binary_cross_entropy(torch.sigmoid(embed[matched_index]), matched_labels)
  File "/home/lqd/software/anaconda3/envs/HNE/lib/python3.7/site-packages/torch/nn/functional.py", line 2526, in binary_cross_entropy
    input, target, weight, reduction_enum)
RuntimeError: Found dtype Long but expected Float

About config.dat in HAN

I am confused about what to put in config.dat for the HAN model. The documentation says: "config.dat: The first line specifies the targeting node type. The second line specifies the targeting link type. The third line specifies the information related to each link type, e.g., {head_node_type}\t{tail_node_type}\t{link_type}." I don't fully understand this.

What does the meta='1, 2, 4, 8' mean in HAN run.sh file?

Hi, can you give a few more examples or explanations on the meta meaning in run.sh? I noticed that you wrote a comment saying
"# Choose the meta-paths used for training. Suppose the targeting node type is 1 and link type 1 is between node type 0 and 1, then meta="1" means that we use meta-paths "101"." Do the metapaths all end with targeting node type 1? So, 1, 2, 4, 8 actually mean four meta-paths: 101, 201, 401 and 801? Thank you very much!

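One way to read the comment quoted in the question above is that each number in `meta` names a link type, and the resulting meta-path starts at the targeting node type, crosses that link to the node type at its other end, and comes back. The sketch below encodes only that one-hop reading; the function name, the `link_info` mapping, and the interpretation itself are assumptions and have not been confirmed by the authors.

```python
def expand_meta(link_type, link_info, target_type):
    """Expand one link type into a node-type meta-path string.

    link_info maps link_type -> (head_node_type, tail_node_type).
    Under the assumed reading, the path is target -> other end -> target.
    """
    head, tail = link_info[link_type]
    if target_type == head:
        other = tail
    elif target_type == tail:
        other = head
    else:
        raise ValueError(f"link type {link_type} does not touch node type {target_type}")
    return f"{target_type}{other}{target_type}"
```

With the example from the quoted comment (target node type 1, link type 1 between node types 0 and 1), `expand_meta(1, {1: (0, 1)}, 1)` gives "101". Under this reading the digits of the result are node types, so the result for another link type depends on which node types that link connects.
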
DistMult running error

Sorry, since issue #9 is closed, I am reopening it here.

I installed spodernet. However,

>>> import spodernet
>>> import spodernet.utils
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'spodernet.utils'

As you can see, spodernet itself is imported normally, but spodernet.utils is not found.
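
When a package imports but one of its submodules does not, it can help to check which copy of the package Python is actually loading. The sketch below uses only the standard library; the helper name `describe_module` is made up for this illustration. One possibility worth ruling out, given that an earlier traceback on this page shows a spodernet under Model/DistMult/src/ while this one resolves from site-packages, is that a different or incomplete installed copy is being picked up; this is a guess, not a confirmed diagnosis.

```python
import importlib
import importlib.util


def describe_module(package, submodule):
    """Report where `package` is loaded from and whether `package.submodule` resolves."""
    try:
        pkg = importlib.import_module(package)
    except ImportError:
        return {"package_file": None, "submodule_found": False}
    try:
        spec = importlib.util.find_spec(f"{package}.{submodule}")
    except ImportError:
        # Raised, for example, when `package` is a plain module without submodules.
        spec = None
    return {
        "package_file": getattr(pkg, "__file__", None),
        "submodule_found": spec is not None,
    }
```

Running `describe_module("spodernet", "utils")` in the failing environment would show the path of the spodernet that is imported and whether its `utils` subpackage can be located.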

MAGNN training memory consumption

Hello,

First of all, thank you for your paper and your code, it is a big help for anyone interested in HNE problem.
I have a question regarding MAGNN training. I tried fitting it to unattributed PubMed (smallest dataset of 4) in an unsupervised fashion. However, I couldn't do it - after the training started, all 32 gigs of RAM I have available were taken and then the script crashed.
I didn't change MAGNN parameters after cloning the repo and ran the data transform stage as indicated in readme. My question is, am I doing something wrong? Not entirely sure how 118 MB dataset and 2-layer net could do this. How much memory did you need for this task while obtaining the results for the paper?
I have seen in the closed issues that someone else has run into the same problem but there was no definite resolution there :(
Please help :)

Can you add the weight information into link.dat.test?

In some cases, we want to split a smaller dataset off from your preprocessed dataset, but I find that the weight information is missing. Could you add the weight information to link.dat.test?

dgl version conflict

When I run the MAGNN model, the dgl version it needs is 0.3, but this version is not suitable for the R-GCN model. This awesome repo contains 13 algorithms, each with its own original git repo, so some packages may conflict. If anyone has successfully run all the algorithms, it would be very nice to share the versions of the important packages, such as dgl.
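
For anyone who wants to report a working environment, the installed versions can be collected with the standard library (Python 3.8 or later). This is a small sketch; the helper name is an assumption, and the names passed in must be distribution names as installed by pip, which can differ from import names (for example `dgl-cu102`, as seen in another issue on this page).

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

For example, `installed_versions(["dgl", "torch"])` returns a dict that can be pasted into an issue alongside the model name.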

Cannot assign node feature "h" on device cuda:0 to a graph on device cpu.

R-GCN Model
dataset="PubMed"
python 3.7 pytorch 1.7.0 dgl-cu102 0.5.2

Traceback (most recent call last):
  File "src/main.py", line 185, in <module>
    main(args)
  File "src/main.py", line 90, in main
    embed, pred = model(g, node_id, edge_type, edge_norm)
  File "/home/lqd/software/anaconda3/envs/HNE/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/lqd/code/HNE-master/Model/R-GCN/src/model.py", line 117, in forward
    output = self.rgcn.forward(g, h, r, norm)
  File "/home/lqd/code/HNE-master/Model/R-GCN/src/model.py", line 52, in forward
    h = layer(g, h, r, norm)
  File "/home/lqd/software/anaconda3/envs/HNE/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/lqd/software/anaconda3/envs/HNE/lib/python3.7/site-packages/dgl/nn/pytorch/conv/relgraphconv.py", line 274, in forward
    g.srcdata['h'] = feat
  File "/home/lqd/software/anaconda3/envs/HNE/lib/python3.7/site-packages/dgl/view.py", line 81, in __setitem__
    self._graph._set_n_repr(self._ntid, self._nodes, {key : val})
  File "/home/lqd/software/anaconda3/envs/HNE/lib/python3.7/site-packages/dgl/heterograph.py", line 3811, in _set_n_repr
    ' same device.'.format(key, F.context(val), self.device))
dgl._ffi.base.DGLError: Cannot assign node feature "h" on device cuda:0 to a graph on device cpu. Call DGLGraph.to() to copy the graph to the same device.

Negative samples

Hi,
DistMult uses negative sampling for training, but I couldn't find a negative sampling parameter in the code.
How did you generate the negative samples, and how many negative samples are used per positive triplet?
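
For context, a common generic way to produce negative samples for a knowledge-graph triple is to corrupt its tail with entities that do not form a known triple. The sketch below illustrates that generic technique only; it does not describe how this repository trains DistMult, and the function name and parameters are assumptions made for illustration.

```python
import random


def corrupt_tails(triple, entities, known_triples, num_negatives, rng=None):
    """Build negatives by replacing the tail with entities that yield unseen triples."""
    rng = rng or random.Random()
    head, relation, _ = triple
    # Keep only replacements that do not recreate a known positive triple.
    candidates = [e for e in entities if (head, relation, e) not in known_triples]
    if len(candidates) <= num_negatives:
        chosen = candidates
    else:
        chosen = rng.sample(candidates, num_negatives)
    return [(head, relation, e) for e in chosen]
```

The number of negatives per positive (`num_negatives`) is a free parameter in this sketch.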