
subgraph-sketching's People

Contributors

melifluos, shir994


subgraph-sketching's Issues

Missing (1,3) and (3, 1) Subgraph Features when max_hash_hops is 3

Hey,

Thanks for sharing the code!

I observed that when running with --max_hash_hops equal to 3, the counts for the (1, 3) and (3, 1) labels are 0.

Here's an example on ogbl-collab. Below are the mean subgraph features for the train/valid/test sets. The 4th, 5th, 11th, and 12th entries of each are all 0. Per the lookup table in hashing.py, the 4th and 5th indices correspond to the (3, 1) and (1, 3) labels.

tensor([   6.5106,    7.8842,   12.4787,   70.1434,    0.0000,    0.0000,
         107.7395,  164.5204,  837.0035,    3.4168,    3.3445,    0.0000,
           0.0000, 1545.4868, 2061.3445])

tensor([   3.5791,    4.6430,    8.1396,   53.2897,    0.0000,    0.0000,
          83.6231,  136.0730,  767.9641,    4.1845,    4.8364,    0.0000,
           0.0000, 1907.1871, 2633.1541])

tensor([2.3125e+00, 4.6931e+00, 7.8365e+00, 4.6649e+01, 0.0000e+00, 0.0000e+00,
        9.1180e+01, 1.4621e+02, 8.0607e+02, 6.8422e+00, 7.3138e+00, 0.0000e+00,
        0.0000e+00, 2.1475e+03, 2.8840e+03])

This seems to be caused by the function ElphHashes.get_subgraph_features() (here) when the --use_zero_one flag isn't set.

[screenshot of the relevant code in get_subgraph_features()]

When max_hops == 2 the 4th and 5th entries correspond to the (0, 1) and (1, 0) labels, but when max_hops == 3 they are the (1, 3) and (3, 1) labels. The same problem also seems to be present in the code here.

From my understanding this is a bug, and the indices should be changed to 9 and 10 when max_hops == 3?
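
If it's useful, one way I can imagine avoiding the hard-coded positions is to derive them from the ordering of the hop-label pairs. This is just an illustrative sketch, not the repo's code; subgraph_types is a hypothetical stand-in for the lookup table in hashing.py:

# illustrative sketch (not the repo's code): locate the (0, 1)/(1, 0) positions from an
# ordered list of hop-label pairs instead of hard-coding indices 4 and 5
def zero_one_indices(subgraph_types):
    # subgraph_types: ordered (d_u, d_v) hop-label pairs, e.g. the lookup table in hashing.py
    return [i for i, (du, dv) in enumerate(subgraph_types) if {du, dv} == {0, 1}]

# for max_hops == 2 this would return [4, 5] (matching the current hard-coded values);
# for max_hops == 3 it would return 9 and 10, per the lookup table discussed above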

Please let me know if I'm misunderstanding what's happening in some way.

Thanks!

Harry

Possible mistake in ElphHashes

I think the following code is incorrect:

features[:, 7] = cards1[:, 1] - features[:, 0] - torch.sum(features[:, 0:4], dim=1) - features[:, 5]  # (2, 0)

It should be:

features[:, 7] = cards1[:, 1] - torch.sum(features[:, 0:4], dim=1) - features[:, 5]  # (2, 0)

since features[:, 0] is already included in torch.sum(features[:, 0:4], dim=1) and is therefore subtracted twice in the current version.
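
A toy check of the double subtraction (the values below are made up purely for illustration):

import torch

# features[:, 0] is already part of features[:, 0:4], so the original line removes it twice
features = torch.tensor([[3., 1., 2., 1., 0., 5., 0., 0.]])
cards1 = torch.tensor([[0., 20.]])
original = cards1[:, 1] - features[:, 0] - torch.sum(features[:, 0:4], dim=1) - features[:, 5]
proposed = cards1[:, 1] - torch.sum(features[:, 0:4], dim=1) - features[:, 5]
print(original.item(), proposed.item())  # 5.0 vs 8.0 -- they differ by exactly features[:, 0]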

Reproducing the ppa results

Dear authors,

Thanks for providing the code. We are having trouble reproducing the results for the ppa dataset. I tuned the parameters myself: --lr in the range [0.01, 0.001], --label_dropout in the range [0.1, 0.3, 0.5], --feature_dropout in the range [0.1, 0.3, 0.5], --hidden_channels set to 256, and I also added --use_RA to include the RA feature. Selecting parameters based on validation performance, I got a Hits@100 of around 38, which is quite a bit lower than the 49 reported in the paper.

What parameters do I need to tune to reproduce the result? Could you please give some suggestions?
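
For reference, this is roughly how I ran the sweep. It is only an illustrative sketch: it assumes runners/run.py accepts the flags listed above, treats --use_RA as a boolean flag, and assumes the BUDDY model is the one being evaluated.

import itertools
import subprocess

# hypothetical grid over the parameters mentioned above; selection on validation Hits@100
grid = itertools.product([0.01, 0.001],     # --lr
                         [0.1, 0.3, 0.5],   # --label_dropout
                         [0.1, 0.3, 0.5])   # --feature_dropout
for lr, label_dp, feature_dp in grid:
    subprocess.run(['python', 'runners/run.py', '--dataset', 'ogbl-ppa',
                    '--lr', str(lr), '--label_dropout', str(label_dp),
                    '--feature_dropout', str(feature_dp),
                    '--hidden_channels', '256', '--use_RA',
                    '--model', 'BUDDY'], check=True)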

The results in the paper cannot be reproduced

Thank you for sharing your code! When I reproduced BUDDY's results using the parameter settings provided, I got lower results than reported in the paper, with a 3%-5% drop on almost every dataset. Is some additional setup required?

Sincerely

Is a node its own neighbour?

Hi @melifluos, I have finished an open-source implementation in Rust of the sketching features described in this paper, but after running it on a graph as sparse as directed WikiData I realized that many, if not all, features on leaf nodes are zero unless I consider each node as part of its own neighbourhood, i.e. unless I introduce self-loops.

I was wondering what your opinion on the matter is, as I am not sure whether your paper suggests including them or not.

I guess I'll just add a flag and let the user decide for himself; as you may see in my implementation, I have added the option to ingest several other features into the sketch, such as node types and edge types.
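
To make the flag idea concrete, here is a minimal sketch in Python (the function and argument names are mine, not from either implementation):

import torch
from torch_geometric.utils import add_self_loops

def edge_index_for_sketching(edge_index, num_nodes, include_self=True):
    # optionally treat every node as part of its own neighbourhood before hashing,
    # so leaf nodes of very sparse graphs still yield non-zero sketch features
    if include_self:
        edge_index, _ = add_self_loops(edge_index, num_nodes=num_nodes)
    return edge_index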

Please find attached the performance of running it on WikiData (over 1 billion nodes and 5 billion edges) on my desktop.

[Screenshot 2023-05-31 23:36: performance of the WikiData run]

Problem running the ddi dataset

Dear authors,

Thanks for providing the source code of ELPH/BUDDY. When I tried to run ddi with the command python runners/run.py --dataset ogbl-ddi --K 20 --train_node_embedding --propagate_embeddings --epochs 120 --num_negs 6 --model BUDDY, I got the error AttributeError: 'BUDDY' object has no attribute 'sign_embedding'. Could you please help?

Best,
Juanhui

torch_sparse

My versions are all correct, but I keep getting this error, "FileNotFoundError: Could not find module 'D:\Anacondas\envs\ss1\Lib\site-packages\torch_sparse_convert_cuda.pyd' (or one of its dependencies). Try using the full path with the constructor syntax."
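
For what it's worth, a minimal check to confirm the installed builds agree (purely illustrative):

# print the torch version and the CUDA toolkit it was built against; a mismatch between
# this and the torch_sparse wheel is a common cause of the .pyd failing to load on Windows
import torch
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())

import torch_sparse  # importing this reproduces the FileNotFoundError if the DLL can't load
print(torch_sparse.__version__)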

Reproduction of GCN and SAGE

Hi, when I reproduce SAGE on the Planetoid datasets Citeseer and Pubmed, I get an HR@100 of 37 on Citeseer but 57 on Pubmed, so I want to know whether the results may have been mixed up; it seems a bit weird...

Suggestions for hyperparameter tuning on other non-attributed datasets

Hi,

I am trying to apply ELPH/BUDDY to other non-attributed graphs (graphs with no node features attached). There seem to be a lot of hyperparameters to tweak. Can you provide some general suggestions on how to find good hyperparameters?

I plan to play with the non-default parameters given in the README for ogbl-ddi and ogbl-ppa. More advice is greatly appreciated.

Initialize the hashing table of ELPH at every step/epoch

if self.init_hashes == None:
    self.init_hashes = self.elph_hashes.initialise_minhash(num_nodes).to(x.device)
if self.init_hll == None:
    self.init_hll = self.elph_hashes.initialise_hll(num_nodes).to(x.device)
# initialise data tensors for storing k-hop hashes
cards = torch.zeros((num_nodes, self.num_layers))
node_hashings_table = {}
for k in range(self.num_layers + 1):
    logger.info(f"Calculating hop {k} hashes")
    node_hashings_table[k] = {
        'hll': torch.zeros((num_nodes, self.hll_size), dtype=torch.int8, device=edge_index.device),
        'minhash': torch.zeros((num_nodes, self.num_perm), dtype=torch.int64, device=edge_index.device)}
    start = time()
    if k == 0:
        node_hashings_table[k]['minhash'] = self.init_hashes
        node_hashings_table[k]['hll'] = self.init_hll
        if self.feature_prop in {'residual', 'cat'}:  # need to get features to the hidden dim
            x = self._encode_features(x)
    else:
        node_hashings_table[k]['hll'] = self.elph_hashes.hll_prop(node_hashings_table[k - 1]['hll'],
                                                                  hash_edge_index)
        node_hashings_table[k]['minhash'] = self.elph_hashes.minhash_prop(node_hashings_table[k - 1]['minhash'],
                                                                          hash_edge_index)
        cards[:, k - 1] = self.elph_hashes.hll_count(node_hashings_table[k]['hll'])
        x = self.feature_conv(x, edge_index, k)

In the ELPH model, the node hashes are calculated on the fly, but the hash table is only initialized once for the entire training process. In that case, I think the propagation of the hashes also only needs to happen once, since neither the initial hashes nor the graph structure change.

Alternatively, is it possible to re-initialize the hash table at every training step/epoch? That would indeed require repeated propagation of the hashes, but it might also reduce the variance of the structural feature estimates.
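
To illustrate the first option, a rough sketch of what I mean by computing the hash tables once and reusing them. The names are purely illustrative; it only reuses the attributes and calls already shown in the snippet above.

import torch

def get_hash_tables(self, num_nodes, hash_edge_index):
    # illustrative caching: the propagated hashes can be reused across steps/epochs,
    # since neither the initial hashes nor the graph structure change during training
    if getattr(self, '_cached_hash_tables', None) is not None:
        return self._cached_hash_tables
    cards = torch.zeros((num_nodes, self.num_layers))
    node_hashings_table = {0: {'minhash': self.init_hashes, 'hll': self.init_hll}}
    for k in range(1, self.num_layers + 1):
        node_hashings_table[k] = {
            'hll': self.elph_hashes.hll_prop(node_hashings_table[k - 1]['hll'], hash_edge_index),
            'minhash': self.elph_hashes.minhash_prop(node_hashings_table[k - 1]['minhash'], hash_edge_index)}
        cards[:, k - 1] = self.elph_hashes.hll_count(node_hashings_table[k]['hll'])
    self._cached_hash_tables = (node_hashings_table, cards)
    return self._cached_hash_tables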

Does the order of src/dst node matter for structural features?

When generating the structural features, the counts for (2, 1) and (1, 2) are calculated as two separate values and both appear in the feature vector. Does this violate the permutation invariance (symmetry under swapping the endpoints) expected for an undirected graph?

Under such an implementation, the model may give different predictions for the same edge when the order of the src/dst nodes is flipped.
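
To make the concern concrete, one generic way to remove the dependence on node order is to score both orderings and average them. This is only a minimal sketch; the function names are illustrative, not the repo's API.

import torch

def symmetric_link_score(score_fn, features_uv, features_vu):
    # score_fn: any model mapping a structural-feature vector to a logit
    # features_uv / features_vu: subgraph features for (src, dst) and for (dst, src)
    # averaging the two orderings makes the prediction invariant to flipping src and dst
    return 0.5 * (score_fn(features_uv) + score_fn(features_vu))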

I have some problems when running the project

I am nearly going mad. I spent the whole afternoon trying to run the code, but it seems I can't execute it by following the given instructions. It seems that we should use "pip install torch-geometric" instead of "conda install pyg -c pyg" on GPU. I don't know why. Can someone tell me about the difference?
