
graphsage's Introduction

GraphSage: Representation Learning on Large Graphs

Overview

This directory contains code necessary to run the GraphSage algorithm. GraphSage can be viewed as a stochastic generalization of graph convolutions, and it is especially useful for massive, dynamic graphs that contain rich feature information. See our paper for details on the algorithm.

Note: GraphSage now also has better support for training on smaller, static graphs and graphs that don't have node features. The original algorithm and paper are focused on the task of inductive generalization (i.e., generating embeddings for nodes that were not present during training), but many benchmarks/tasks use simple static graphs that do not necessarily have features. To support this use case, GraphSage now includes optional "identity features" that can be used with or without other node attributes. Including identity features will increase the runtime, but also potentially increase performance (at the usual risk of overfitting). See the section on "Running the code" below.

Note: GraphSage is intended for use on large graphs (>100,000 nodes). The overhead of subsampling will start to outweigh its benefits on smaller graphs.

The example_data subdirectory contains a small example of the protein-protein interaction data, which includes 3 training graphs + one validation graph and one test graph. The full Reddit and PPI datasets (described in the paper) are available on the project website.

If you make use of this code or the GraphSage algorithm in your work, please cite the following paper:

 @inproceedings{hamilton2017inductive,
     author = {Hamilton, William L. and Ying, Rex and Leskovec, Jure},
     title = {Inductive Representation Learning on Large Graphs},
     booktitle = {NIPS},
     year = {2017}
   }

Requirements

Recent versions of TensorFlow, numpy, scipy, sklearn, and networkx are required (but networkx must be <=1.11). You can install all the required packages using the following command:

$ pip install -r requirements.txt

To guarantee that you have the right package versions, you can use docker to easily set up a virtual environment. See the Docker subsection below for more info.

Docker

If you do not have Docker installed, you will need to install it first (the installation is pretty painless).

You can run GraphSage inside a docker image. After cloning the project, build and run the image as follows:

$ docker build -t graphsage .
$ docker run -it graphsage bash

or start a Jupyter Notebook instead of bash:

$ docker run -it -p 8888:8888 graphsage

You can also run the GPU image using nvidia-docker:

$ docker build -t graphsage:gpu -f Dockerfile.gpu .
$ nvidia-docker run -it graphsage:gpu bash

Running the code

The example_unsupervised.sh and example_supervised.sh files contain example usages of the code, which use the unsupervised and supervised variants of GraphSage, respectively.

If your benchmark/task does not require generalizing to unseen data, we recommend you try setting the "--identity_dim" flag to a value in the range [64,256]. This flag will make the model embed unique node ids as attributes, which will increase the runtime and number of parameters but also potentially increase the performance. Note that you should set this flag and not try to pass dense one-hot vectors as features (due to sparsity). The "dimension" of identity features specifies how many parameters there are per node in the sparse identity-feature lookup table.
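As a rough illustration of why a lookup table is preferred over dense one-hot inputs, the following NumPy sketch compares the two. The sizes and the names `num_nodes`, `identity_dim`, and `table` are made up for the example and are not taken from the GraphSage code.

```python
import numpy as np

num_nodes, identity_dim = 1000, 64  # illustrative sizes only
rng = np.random.default_rng(0)

# One trainable row per node: num_nodes * identity_dim parameters in total.
table = rng.normal(size=(num_nodes, identity_dim))

batch = np.array([3, 17, 999])  # integer node indices in a minibatch

# Sparse lookup: index only the rows that are needed.
looked_up = table[batch]

# Dense equivalent: materialize one-hot vectors and multiply.
one_hot = np.zeros((len(batch), num_nodes))
one_hot[np.arange(len(batch)), batch] = 1.0
dense = one_hot @ table

# Both routes give the same features; the lookup never builds one_hot.
assert np.allclose(looked_up, dense)
print(looked_up.shape, table.size)
```

The lookup touches only `len(batch)` rows, while the dense route allocates a `len(batch) x num_nodes` matrix that is almost entirely zeros.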

Note that example_unsupervised.sh sets a very small max iteration number, which can be increased to improve performance. We generally found that performance continued to improve even after the loss was very near convergence (i.e., even when the loss was decreasing at a very slow rate).

Note: For the PPI data, and any other multi-output dataset that allows individual nodes to belong to multiple classes, it is necessary to set the --sigmoid flag during supervised training. By default the model assumes that the dataset is in the "one-hot" categorical setting.
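To illustrate the difference between the default one-hot setting and the multi-label setting that --sigmoid targets, here is a small NumPy sketch. It is not the GraphSage loss code, and the logits are made-up values.

```python
import numpy as np

logits = np.array([2.0, 1.0, -1.0])  # made-up scores for three classes

# One-hot setting: softmax scores compete and sum to 1, so one class wins.
softmax = np.exp(logits - logits.max())
softmax /= softmax.sum()

# Multi-label setting: each class gets an independent sigmoid score.
sigmoid = 1.0 / (1.0 + np.exp(-logits))

predicted_single = int(np.argmax(softmax))
predicted_multi = [i for i, p in enumerate(sigmoid) if p > 0.5]
print(predicted_single, predicted_multi)
```

With softmax a node can be assigned only one class, whereas thresholding independent sigmoid scores lets a node belong to several classes at once.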

Input format

As input, the code requires at minimum a --train_prefix option, which specifies the following data files:

  • <train_prefix>-G.json -- A networkx-specified json file describing the input graph. Nodes have 'val' and 'test' attributes specifying if they are a part of the validation and test sets, respectively.
  • <train_prefix>-id_map.json -- A json-stored dictionary mapping the graph node ids to consecutive integers.
  • <train_prefix>-class_map.json -- A json-stored dictionary mapping the graph node ids to classes.
  • <train_prefix>-feats.npy [optional] --- A numpy-stored array of node features; ordering given by id_map.json. Can be omitted and only identity features will be used.
  • <train_prefix>-walks.txt [optional] --- A text file specifying random walk co-occurrences (one pair per line) (*only for unsupervised version of graphsage)

To run the model on a new dataset, you need to make data files in the format described above. To run random walks for the unsupervised model (and to generate the -walks.txt file), you can use the run_walks function in graphsage.utils.
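For readers who want to see what "random walk co-occurrences (one pair per line)" can look like, here is a minimal standalone sketch. It is not the run_walks function from graphsage.utils; the function name `random_walk_pairs`, the walk parameters, and the tab separator are assumptions made for illustration.

```python
import random

def random_walk_pairs(adj, num_walks=2, walk_len=3, seed=0):
    """Return (start, visited) pairs from short uniform random walks.

    adj maps each node id to a list of neighbor ids. The graph is assumed
    to be undirected, so every visited node has at least one neighbor.
    """
    rng = random.Random(seed)
    pairs = []
    for start in sorted(adj):
        if not adj[start]:
            continue  # isolated nodes produce no co-occurrences
        for _ in range(num_walks):
            current = start
            for _ in range(walk_len):
                current = rng.choice(adj[current])
                if current != start:  # skip self co-occurrences
                    pairs.append((start, current))
    return pairs

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
pairs = random_walk_pairs(adj)

# One whitespace-separated pair per line (a tab is assumed here).
lines = ["{}\t{}".format(a, b) for a, b in pairs]
print("\n".join(lines[:5]))
```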

Model variants

The user must also specify a --model, the variants of which are described in detail in the paper:

  • graphsage_mean -- GraphSage with mean-based aggregator
  • graphsage_seq -- GraphSage with LSTM-based aggregator
  • graphsage_maxpool -- GraphSage with max-pooling aggregator (as described in the NIPS 2017 paper)
  • graphsage_meanpool -- GraphSage with mean-pooling aggregator (a variant of the pooling aggregator, where the element-wise mean replaces the element-wise max).
  • gcn -- GraphSage with GCN-based aggregator
  • n2v -- an implementation of DeepWalk (called n2v for short in the code.)

Logging directory

Finally, a --base_log_dir should be specified (it defaults to the current directory). The output of the model and log files will be stored in a subdirectory of the base_log_dir. The path to the logged data will be of the form <sup/unsup>-<data_prefix>/graphsage-<model_description>/. The supervised model will output F1 scores, while the unsupervised model will train embeddings and store them. The unsupervised embeddings will be stored in a numpy-formatted file named val.npy, with val.txt specifying the order of embeddings as a per-line list of node ids. Note that the full log outputs and stored embeddings can be 5-10 GB in size (on the full data when running with the unsupervised variant).
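The pairing of val.npy and val.txt described above can be read back along the following lines. This is a sketch only: it writes a tiny fake pair of files into a temporary directory so that it runs on its own, and the node ids `n7`, `n2`, `n5` are invented.

```python
import os
import tempfile

import numpy as np

# Stand-in output directory with a tiny fake val.npy / val.txt pair.
log_dir = tempfile.mkdtemp()
np.save(os.path.join(log_dir, "val.npy"),
        np.arange(6, dtype=float).reshape(3, 2))
with open(os.path.join(log_dir, "val.txt"), "w") as fp:
    fp.write("\n".join(["n7", "n2", "n5"]))

# Load the embeddings and the per-line list of node ids that orders them.
embeds = np.load(os.path.join(log_dir, "val.npy"))
with open(os.path.join(log_dir, "val.txt")) as fp:
    node_ids = [line.strip() for line in fp if line.strip()]

# Row i of embeds belongs to node_ids[i].
embed_of = {node_id: embeds[i] for i, node_id in enumerate(node_ids)}
print(embed_of["n2"])
```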

Using the output of the unsupervised models

The unsupervised variants of GraphSage will output embeddings to the logging directory as described above. These embeddings can then be used in downstream machine learning applications. The eval_scripts directory contains examples of feeding the embeddings into simple logistic classifiers.

Acknowledgements

This code base was originally forked from https://github.com/tkipf/gcn/, and we owe many thanks to Thomas Kipf for making his code available. We also thank Yuanfang Li and Xin Li who contributed to a course project that was based on this work. Please see the paper for funding details and additional (non-code related) acknowledgements.

graphsage's People

Contributors

aksakalli, bkj, gokceneraslan, m30m, rexying, williamleif


graphsage's Issues

how to feed data when the number of nodes is big?

Hello, I have 7 million nodes. When I train the model, memory usage is high, and TensorFlow throws an error when creating the adj_info variable. How should I feed the data when the number of nodes is this large? Thanks!
TensorFlow throws an error like this:

ValueErrorTraceback (most recent call last)
in ()
----> 1 train(train_data)

in train(train_data, test_data)
149 max_degree=FLAGS.max_degree,
150 context_pairs=context_pairs)
--> 151 adj_info = tf.Variable(tf.constant(minibatch.adj, dtype=tf.int32), trainable=False, name="adj_info")
152
153 if FLAGS.model == 'graphsage_mean':

/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/constant_op.pyc in constant(value, dtype, shape, name, verify_shape)
100 tensor_value = attr_value_pb2.AttrValue()
101 tensor_value.tensor.CopyFrom(
--> 102 tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
103 dtype_value = attr_value_pb2.AttrValue(type=tensor_value.tensor.dtype)
104 const_tensor = g.create_op(

/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/tensor_util.pyc in make_tensor_proto(values, dtype, shape, verify_shape)
431 if nparray.size * nparray.itemsize >= (1 << 31):
432 raise ValueError(
--> 433 "Cannot create a tensor proto whose content is larger than 2GB.")
434 tensor_proto.tensor_content = nparray.tostring()
435 return tensor_proto

ValueError: Cannot create a tensor proto whose content is larger than 2GB.
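A back-of-the-envelope check shows why 7 million nodes can trip the size test in the traceback (`nparray.size * nparray.itemsize >= (1 << 31)`). The sketch below assumes the adjacency array is a dense int32 array of shape (num_nodes + 1, max_degree); that shape and the max_degree values tried are assumptions, not values reported in the issue.

```python
def adj_bytes(num_nodes, max_degree, itemsize=4):
    """Bytes needed for a dense (num_nodes + 1, max_degree) int32 array."""
    return (num_nodes + 1) * max_degree * itemsize

LIMIT = 1 << 31  # threshold checked in make_tensor_proto

for max_degree in (128, 64, 32):
    size = adj_bytes(7000000, max_degree)
    print(max_degree, size, size >= LIMIT)
```

Under these assumptions the array passes the 2GB threshold at max_degree=128 but not at 64, so the error depends on max_degree as well as on the node count.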

Question about sampling methods in minibatch

Hi

I am reading your code (only the mean aggregator so far). Why is the sampling done in reversed layer order?

def sample(self, inputs, layer_infos, batch_size=None):
    """ Sample neighbors to be the supportive fields for multi-layer convolutions.

    Args:
        inputs: batch inputs
        batch_size: the number of inputs (different for batch inputs and negative samples).
    """

    if batch_size is None:
        batch_size = self.batch_size
    samples = [inputs]
    # size of convolution support at each layer per node
    support_size = 1
    support_sizes = [support_size]
    for k in range(len(layer_infos)):
        t = len(layer_infos) - k - 1
        support_size *= layer_infos[t].num_samples
        sampler = layer_infos[t].neigh_sampler
        node = sampler((samples[k], layer_infos[t].num_samples))
        samples.append(tf.reshape(node, [support_size * batch_size,]))
        support_sizes.append(support_size)
    return samples, support_sizes
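To make the effect of the reversed index concrete, the size bookkeeping of the loop quoted above can be reproduced without TensorFlow. The helper `support_sizes` below is a hypothetical standalone function that copies only the arithmetic, using the list [25, 10] as an example of per-layer sample counts.

```python
def support_sizes(num_samples_per_layer):
    """Mirror the size bookkeeping of the loop quoted above."""
    sizes = [1]
    n_layers = len(num_samples_per_layer)
    for k in range(n_layers):
        t = n_layers - k - 1  # walk the layer list from last to first
        sizes.append(sizes[-1] * num_samples_per_layer[t])
    return sizes

print(support_sizes([25, 10]))  # -> [1, 10, 250]
```

Because `t` counts down, the last entry of the layer list determines how many direct neighbors of the batch nodes are sampled, and the first entry determines the fan-out at the farthest hop.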

how to gain embedding of new nodes

Hi, I want to do cross validation using the node embeddings as input to a classifier, but I do not know how to obtain the embedding vectors of new nodes. Could you point me to the function or file that does this? Thank you very much!

Number nodes at each layers

In the unsupervised model, I notice that it uses the second layer's num_samples as the number of neighbor nodes for the first layer. This leads to the samples passed to the next aggregate function having this shape:

  • batch_size x [1, 10, 25*10] (which is different from the paper, where the first layer has 25 neighbor nodes and the second layer has 10)

These line:

Thank you, I learned a lot from the code

codes problem

hi, I don't understand the meaning of the second 'for' loop in

models.py
--class SampleAndAggregate(GeneralizedModel):
----def aggregate(...)

It seems that it is not consistent with pseudo-code in your paper.

(By the way, your code is a little hard to follow T_T)

question about tf.stop_gradient in deepwalk test part

isn't it wrong to simply assign model.context_embeds with update_nodes+no_update_nodes, where no_update_nodes = tf.stop_gradient(...), for which you meant to stop the gradient for the already trained nodes? because the old model.context_embeds is still there in the tf graph, used by model.opt_op, right? plz correct me if I'm wrong, thx.

Generating graph from data & Questions

I am quite new to graphs and am trying to translate my datasets (in HDFS, which I can read using scala/python/hive) to networkX graphs of the format -G.json. The datasets are obviously not in graph format but are guest transactions, which I can translate to a graph.

First, is there any utility to do that? Secondly, what is the use of the features and label in the graph description? How are these features different from the -feats.npy features for nodes? Is the label only for supervised learning?

Third, is there any talk or detailed slides about the implementation? I got some hints from the paper, but a talk makes it easier to follow, I guess (I have seen Jure's recent talks on this but they are overview talks; I was looking for detailed ones).

Across graphs learning example

Hi William,

Are there example code and data for the across learning example described in section 4.2 in your paper?
Currently, the PPI data and evaluation code seem like a single graph example.
Thank you very much for sharing the resources!

Cheers,
Oh-Hyun

code problem

I think line 480 in models.py:

neg_aff = tf.matmul(self.outputs2, tf.transpose(self.neg_outputs)) + self.neg_outputs_bias

should be self.outputs1 instead of outputs2.

Is it a mistake?

Unsupervised XENT loss

Are you able to provide some explanation about the definition of the XENT loss?

    def _xent_loss(self, inputs1, inputs2, neg_samples, hard_neg_samples=None):

        aff = self.affinity(inputs1, inputs2)
        neg_aff = self.neg_cost(inputs1, neg_samples, hard_neg_samples)

        true_xent = tf.nn.sigmoid_cross_entropy_with_logits(
                labels=tf.ones_like(aff), logits=aff)

        negative_xent = tf.nn.sigmoid_cross_entropy_with_logits(
                labels=tf.zeros_like(neg_aff), logits=neg_aff)

        loss = tf.reduce_sum(true_xent) + 0.01*tf.reduce_sum(negative_xent)
        return loss

That doesn't (appear to) follow Equation 1 in the paper -- specifically, where's the 0.01 constant coming from?
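For reference, the quantity computed by the quoted function can be written out in NumPy. The sketch below re-implements the standard numerically stable form of sigmoid cross-entropy and checks that the loss equals a negative-sampling objective with the negative term scaled by the constant in the quoted code. The affinity values are made up, and the sketch does not explain where 0.01 comes from.

```python
import numpy as np

def sigmoid_xent(labels, logits):
    """Numerically stable elementwise sigmoid cross-entropy."""
    return (np.maximum(logits, 0) - logits * labels
            + np.log1p(np.exp(-np.abs(logits))))

def log_sigmoid(x):
    return -np.log1p(np.exp(-x))

aff = np.array([2.0, -0.5])                    # affinities of positive pairs
neg_aff = np.array([[0.3, -1.2], [1.5, 0.0]])  # affinities to negative samples
neg_weight = 0.01                              # constant from the quoted code

loss = (sigmoid_xent(np.ones_like(aff), aff).sum()
        + neg_weight * sigmoid_xent(np.zeros_like(neg_aff), neg_aff).sum())

# Same quantity written as a negative-sampling objective.
closed_form = (-log_sigmoid(aff).sum()
               - neg_weight * log_sigmoid(-neg_aff).sum())
assert np.isclose(loss, closed_form)
```

So the code has the same two log-sigmoid terms as a negative-sampling loss; what differs is the weight on the negative term, which here is a fixed 0.01.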

Questions about the algorithm

Hi, dear authors

In algorithm 1 line 4 of your paper, the aggregator is operating on the neighbors of each node. But I notice in your code, you have the adj_info and UniformNeighborSampler, which is used to generate the list of neighbor of each node. My question is are you sampling the same node in the neighborhood for multiple times and feed it to the aggregator?

Another question is that during the training, with different Weight , we can correspondingly update h, just in the first iteration you initialized h as x. Do we need to update h based on algorithm 1, in each iteration? If so, isn't this computationally expensive?

The last question is that, I assume the mini batch data is used for the following unsupervised or supervised task, not for algorithm 1?

Thanks a lot!

problem concerning ppi_eval.py

Hi, I downloaded your code and tried the unsupervised training. When I run ppi_eval.py strictly following the instructions, it outputs F1 scores far from those shown in your paper; for example, some of them look like this:
F1 score 0.6524257784214338
F1 score 0.7677407675597393
F1 score 0.7941708906589428
F1 score 0.7668356263577119
F1 score 0.8696596669080376
F1 score 0.8081100651701665
F1 score 0.7624909485879797
The only thing I changed in the code is that I replaced dict.iteritems() with dict.items(), which I don't think is the real problem. I wonder if there is something wrong? Are the scores in your paper "Micro F1" or "Macro F1"?

default parameters + mean aggregator + unsupervised training on ppi dataset(not the toy)

Why neg_samples can take the neighbor nodes but not only the disparate nodes

In the _build() method of the SampleAndAggregate class, there is the following code:

labels = tf.reshape(
                tf.cast(self.placeholders['batch2'], dtype=tf.int64),
                [self.batch_size, 1])
self.neg_samples, _, _ = (tf.nn.fixed_unigram_candidate_sampler(
            true_classes=labels,
            num_true=1,
            num_sampled=FLAGS.neg_sample_size,
            unique=False,
            range_max=len(self.degrees),
            distortion=0.75,
            unigrams=self.degrees.tolist()))

Note that we only take the 'sampled_candidates' output when calling tf.nn.fixed_unigram_candidate_sampler, and the 'true_classes' parameter does not seem to be used, so why do we need 'labels' here? The 'sampled_candidates' output can also contain the elements in 'labels', which means that 'self.neg_samples' contains not only disparate nodes but also neighbor nodes; the sampling rule is based only on the degree of each node. But the paper "Inductive Representation Learning on Large Graphs" says "The graph-based loss function encourages nearby nodes to have similar representations while enforcing that the representations of disparate nodes are highly distinct". I think this means that 'neg_samples' in the code should contain only disparate nodes, not neighbor nodes. Or should neighbor nodes be highly distinct when they have high degree?
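The degree-based sampling rule discussed above can be illustrated without TensorFlow. The sketch below only computes probabilities proportional to degree raised to the distortion power (0.75 in the quoted call); it does not model how the sampler treats true_classes, and the degree values are invented.

```python
def unigram_probs(degrees, distortion=0.75):
    """Sampling probabilities proportional to degree ** distortion."""
    weights = [d ** distortion for d in degrees]
    total = sum(weights)
    return [w / total for w in weights]

degrees = [1, 4, 16, 81]  # toy node degrees
probs = unigram_probs(degrees)
print([round(p, 3) for p in probs])
```

The distortion flattens the distribution: high-degree nodes are still sampled most often, but less often than their raw share of the total degree. Nothing in this rule depends on which node the negatives are being drawn for.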

Add a requirements.txt

Some specific versions of certain packages are needed, and we should make this easier for users.

In particular, we need networkx<2.0.

Replicating the results in the paper

Hi,

In the paper, the F1 on the PPI data set for the supervised setting is 0.598 (for GraphSAGE-mean). The default hyperparameters yield an F1 of around 0.576 on the full data set.

Can I know whether the difference is due to the different hyperparameter choices or just some randomness from stochastic sampling? If the former, please share with me the concrete choice of parameters if possible.

Thanks!

Jianbo

Question regarding sample numbers in layer 1 and 2

Hi there,

I have a question on sample numbers in layer 1 and 2.

flags.DEFINE_integer('samples_1', 25, 'number of samples in layer 1')
flags.DEFINE_integer('samples_2', 10, 'number of users samples in layer 2')

I'm wondering which layer contains the 1-hop neighbors of the center node?
I first thought center node is 1-hop away from nodes in layer 1, and 2-hop away from nodes in layer 2.
That means the sampled result should be [1 (center node), 25 (1-hop nodes), 25*10 (2-hop nodes)].
But it turns out samples from SampleAndAggregate.sample() to have the shape [1, 10, 250]. I'm pretty confused here.

Thanks,
Serena

Graph classification

What can I do to apply graphSAGE in graph classification?
Certain pooling will be required I guess but how can I train them?
Intuitively, that requires every sample in a batch to come from the same graph.
Am I right? or is there any better way of doing this

how to save well-trained model?

Hi, I use the model's save method to keep well-trained models, but I get the following error:

raise ValueError("No variables to save")
ValueError: No variables to save

So how do I save well-trained models? Thanks!

Add easy support for featureless training / identity features.

For small/non-dynamic graphs, we should incorporate an option to use one-hot identity vectors as features, since this will give better predictive performance (at the cost of much slower runtime). We will need to tweak the code to deal with the sparsity though. I think the best solution would be to use tf.nn.embedding_lookup, rather than materializing the sparse one-hot vectors.

normalization in algorithm 1 is missing in paper

Hello,

On page 4 of the paper https://arxiv.org/pdf/1706.02216.pdf, there is a normalization step in line 7 of Algorithm 1. However, I cannot find this step in your code. The code in aggregators.py implements the algorithm up to line 5 (activation) and returns the output to line 326 in models.py for the next iteration.

Is there something I missed, or did you intentionally skip the normalization line for other benefits?

Thank you very much,
Kai

Reddit Source data query on Google BigQuery

Hi William,

I am trying to replicate your results on Reddit data. However, I would like to get understanding of source data. Can you specify what query you fired on Google Big Query for retrieving source data?

Thanks,
Ayush

input-format description wrong

In the input-format description, id_map.json should change to class_map.json
<train_prefix>-id_map.json -- A json-stored dictionary mapping the graph node ids to classes.

Weight decay was not applied

In this line:

self.loss = self.link_pred_layer.loss(self.outputs1, self.outputs2, self.neg_outputs)

The loss is re-assigned to the link prediction loss. It should be self.loss += ...

(It doesn't affect anything since the weight decay is set to 0 anyways)

how to understand the mini-batch operation in your paper?

hi Rex Ying,
I am confused about the mini-batch operation in Algorithm 2 in your paper. Why don't we just sample a mini-batch of nodes and find the corresponding neighbours to train on? Why do you use a more complex operation to define this instead?
Thank you very much.

Is f1_score evaluation wrong in ppi_eval.py?

print("F1 score", f1_score(test_labels[:,i], log.predict(test_embeds)[:,i], average="micro"))

In ppi_eval.py, your f1_score evaluation seems weird; it is different from your implementations in citation_eval.py and reddit_eval.py. According to my understanding, test_labels and log.predict(test_embeds) are of the shape [batch_size(5124), num_classes(121)], so you actually calculate the per-class f1_score. Is this what you intended, and why?
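The distinction raised here, one score per class column versus one score pooled over all node/class entries, can be seen on toy data. The sketch below uses a hand-written binary F1 rather than scikit-learn, so it may not match how f1_score with average="micro" treats a single binary column. The labels and predictions are invented.

```python
def f1(y_true, y_pred):
    """Binary F1 from flat lists of 0/1 values."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2.0 * tp / denom if denom else 0.0

# rows = nodes, columns = classes (toy data)
labels = [[1, 0], [1, 1], [0, 1], [1, 0]]
preds = [[1, 0], [0, 1], [0, 0], [1, 1]]

# One score per class column, as in a per-column loop.
per_class = [f1([r[i] for r in labels], [r[i] for r in preds])
             for i in range(2)]

# One score pooled over every (node, class) entry.
flat_true = [v for r in labels for v in r]
flat_pred = [v for r in preds for v in r]
pooled = f1(flat_true, flat_pred)
print(per_class, pooled)
```

A per-column loop prints one number per class, which is consistent with the many separate "F1 score" lines reported elsewhere in these issues.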

A question about the example file

Hello! I have been reading this article recently and have come across some problems. I'm not clear on what the file "x-class_map.json" stands for, and I want to know which method in the "networkx" module can produce the file. Hoping for your answer. Thanks~

Embeddings for the supervised example

Hi!

Great work!
Is there a simple way to output the embeddings also for the supervised version of graphSage? I was trying to add save_val_embeddings function, but got stuck.

Question about the deepwalk implementation

Hi authors,

I have a question about the Node2Vec(DeepWalk) model you implemented. It seems that it only applies SkipGram model on a window size of 1, which means it only predicts context embeddings of immediate neighbors. Is this true?

Best,
Minjie

Training with multiple GPU?

Thank the authors for the paper and the code. We are going to train the model on multiple GPU. I also know that you did some experiments on multiple GPU.
Can the provided code run directly on multiple GPUs, or do we have to modify it? Please help us with some explanation of how to train with multiple GPUs.

Parameters for LSTM model on Reddit dataset

Do you happen to have the parameters for the supervised LSTM model on the Reddit dataset that yields 0.954? I'm running

python -m graphsage.supervised_train \
    --train_prefix ./data/reddit \
    --model graphsage_seq \
    --sigmoid

on my machine, and getting results closer to ~0.92.

Similar question, do you have any sense of the statistical significance in the differences between the various supervised GraphSAGE models? mean, LSTM and pool appear to outperform gcn, and certainly outperform the baseline models, but is there strong evidence that LSTM performs better than mean, given different random starts, etc? (I ask because LSTM is significantly slower, so it'd be nice if I didn't have to use it...)

Thanks

Fix python 3 support

This still appears to be broken due to the api changes to dict.items/dict.iteritems etc.

Issue when running the image

Hi, can I ask you a simple question?
When I tried to run the image in Docker, I ran into the problem shown below:
[screenshot: issuegraphsage]
I have tried my best to fix it, but it doesn't work. Could you give me any suggestion about fixing it?
Thank you.

Why is validation/test set needed for unsupervised setting?

Hi,

I am trying to use graphsage to train embeddings in completely unsupervised way. But when I prepare the data in -G.json format, if I have 0 nodes with 'val' attribute set to True, then I get a crash when the model starts to train. What is the significance of validation in unsupervised setting? Ideally for unsupervised setting 'val' and 'test' attributes shouldn't be needed at all.

Also does it make sense to use the graphsage embeddings obtained in unsupervised way for some other task like link prediction?

Thanks.

What are batch1 and batch2 (input1 and input2)?

Hi there,

I'm trying to understand the model but am a little bit confused here.
self.inputs1 = placeholders["batch1"]
self.inputs2 = placeholders["batch2"]
I'm wondering why there are two inputs and two outputs? I saw that the final node embeddings are saved in model.output1. So then what is output2?

Thanks,
Serena

Question about the theorem proof in the paper

Hi ! I have a question about Lemma 3 in the appendices of the paper.

The proof of Lemma 3 says:

[image: excerpt from the proof of Lemma 3]

However, nodes that co-occur in a certain node's 3-hop neighborhood actually may not be adjacent in A^3, so the chromatic number cannot guarantee that these nodes have different colors. From another aspect of view, if nodes that co-occur in any node's 3-hop neighborhood are assigned different colors, then the maximum degree of all nodes in the A^3 graph is no more than the chromatic number, which is certainly not a general case.

If I'm wrong please correct me. Thank you.

Expected behavior of `example_unsupervised.sh`

Can you clarify the expected behavior of the unsupervised example script? I'm getting something like:

...
Iter: 3150 train_loss= 0.00115 train_mrr= 0.16871 train_mrr_ema= 0.18267 val_loss= 0.00271 val_mrr= 0.18308 val_mrr_ema= 0.18308 time= 0.01506
Iter: 3200 train_loss= 0.00115 train_mrr= 0.17344 train_mrr_ema= 0.18270 val_loss= 0.00271 val_mrr= 0.18308 val_mrr_ema= 0.18308 time= 0.01506
Iter: 3250 train_loss= 0.00115 train_mrr= 0.15224 train_mrr_ema= 0.18271 val_loss= 0.00271 val_mrr= 0.18308 val_mrr_ema= 0.18308 time= 0.01504
Iter: 3300 train_loss= 0.00115 train_mrr= 0.17956 train_mrr_ema= 0.18197 val_loss= 0.00271 val_mrr= 0.18308 val_mrr_ema= 0.18308 time= 0.01504
Iter: 3350 train_loss= 0.00115 train_mrr= 0.20141 train_mrr_ema= 0.18416 val_loss= 0.00271 val_mrr= 0.18308 val_mrr_ema= 0.18308 time= 0.01506
Iter: 3400 train_loss= 0.00115 train_mrr= 0.20716 train_mrr_ema= 0.18435 val_loss= 0.00271 val_mrr= 0.18308 val_mrr_ema= 0.18308 time= 0.01507
Iter: 3450 train_loss= 0.00114 train_mrr= 0.19132 train_mrr_ema= 0.18373 val_loss= 0.00271 val_mrr= 0.18308 val_mrr_ema= 0.18308 time= 0.01506
Iter: 3500 train_loss= 0.00115 train_mrr= 0.18440 train_mrr_ema= 0.18375 val_loss= 0.00271 val_mrr= 0.18308 val_mrr_ema= 0.18308 time= 0.01508
Iter: 3550 train_loss= 0.00115 train_mrr= 0.19627 train_mrr_ema= 0.18545 val_loss= 0.00271 val_mrr= 0.18308 val_mrr_ema= 0.18308 time= 0.01512
Iter: 3600 train_loss= 0.00115 train_mrr= 0.17484 train_mrr_ema= 0.18656 val_loss= 0.00271 val_mrr= 0.18308 val_mrr_ema= 0.18308 time= 0.01513
Iter: 3650 train_loss= 0.00114 train_mrr= 0.19171 train_mrr_ema= 0.18581 val_loss= 0.00271 val_mrr= 0.18308 val_mrr_ema= 0.18308 time= 0.01513
Iter: 3700 train_loss= 0.00114 train_mrr= 0.19534 train_mrr_ema= 0.18489 val_loss= 0.00271 val_mrr= 0.18308 val_mrr_ema= 0.18308 time= 0.01514
Optimization Finished!

So the train_loss is decreasing, but the val_loss appears to be exactly the same. Any thoughts?

Thanks
Ben

Input files for simple homogeneous network

I would like to test GraphSAGE on a static simple network with no features and class labels. All we know about the network is the connection of nodes. The input file is similar to:

0 1 
0 2
...

which means node 0 is connected to 1 and 2, and the output is just embeddings based on the proximity information.

How do I need to modify the input data to get those embeddings? Thanks for any details!
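One possible way to turn such an edge list into the files named in the README's "Input format" section is sketched below using only the standard library. Several things are assumptions: the node-link JSON layout, node ids that are consecutive integers starting at 0 (so that link endpoints read the same whether interpreted as ids or as list indices), the dummy class labels, the choice of validation node, and the `toy` prefix. The README lists <train_prefix>-feats.npy as optional, so no feature file is written.

```python
import json
import os
import tempfile

edge_lines = ["0 1", "0 2", "1 2", "2 3"]  # toy edge list, one "u v" per line
edges = [tuple(int(x) for x in line.split()) for line in edge_lines]
nodes = sorted({n for e in edges for n in e})

# Hypothetical split: mark the last node as validation, none as test.
graph = {
    "directed": False,
    "multigraph": False,
    "graph": {},
    "nodes": [{"id": n, "val": n == nodes[-1], "test": False} for n in nodes],
    "links": [{"source": u, "target": v} for u, v in edges],
}
id_map = {str(n): i for i, n in enumerate(nodes)}
class_map = {str(n): 0 for n in nodes}  # dummy labels; no real classes here

prefix = os.path.join(tempfile.mkdtemp(), "toy")  # hypothetical train prefix
for suffix, obj in [("-G.json", graph), ("-id_map.json", id_map),
                    ("-class_map.json", class_map)]:
    with open(prefix + suffix, "w") as fp:
        json.dump(obj, fp)
```

Random walk pairs for the unsupervised model would still need to be generated separately, and with no feature file the model would have to rely on identity features (the --identity_dim flag).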

How is the initialization of identity features?

Hi there,

I'm not sure how the identity features for the nodes are initialized.
I saw it's from
self.embeds = tf.get_variable("node_embeddings", [adj.get_shape().as_list()[0], identity_dim])
Does it mean it initializes from some uniform distribution for a given range using glorot_uniform_initializer?
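As background for this question: in TensorFlow 1.x, tf.get_variable is documented to fall back to glorot_uniform_initializer when neither the call nor the enclosing variable scope supplies an initializer. The sketch below shows what Glorot uniform sampling amounts to for a [num_nodes, identity_dim] matrix. It is a plain-Python illustration, the sizes are made up, and whether the GraphSage code sets a scope-level initializer is not checked here.

```python
import math
import random

def glorot_uniform(fan_in, fan_out, seed=0):
    """Sample a fan_in x fan_out matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    rng = random.Random(seed)
    rows = [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]
    return rows, limit

# Toy sizes: 1000 nodes, identity_dim = 64.
embeds, limit = glorot_uniform(1000, 64)
print(round(limit, 4))
```

Since the limit shrinks as num_nodes grows, identity embeddings for a large graph would start very close to zero under this scheme.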

Thanks,
Serena

Run supervised LSTM-based model with default hyperparameters yield an F1 of around 0.46

Hi:
I used this command :
python -m graphsage.supervised_train --train_prefix ./example_data/ppi --model graphsage_seq --sigmoid
and I get this result:
Optimization Finished! Full validation stats: loss= 0.52820 f1_micro= 0.46528 f1_macro= 0.23076 time= 0.33810 Writing test set stats to file (don't peak!)
Are the default hyperparameters the optimal parameters? The result (f1_micro=0.46) is far from the result (f1_micro=0.61) listed in the paper.
I set epoch=50 and got results ranging from 0.60 to 0.64. So, what are the optimal parameters for the supervised LSTM model on PPI?

How to generate the files in example_data

Hi,

I downloaded a dataset and, from it, I need to create the files in example_data folder. Should I create my own code to generate it or is there code ready for this?

Thanks in advance

Saving GraphSAGE model -- ValueError: GraphDef cannot be larger than 2GB

I am trying to save the model obtained by running GraphSAGE on a graph with a large number of nodes (in the order of few million nodes). I would like to use the model later on data not in the original training, validation or test sets. I tried using a tf.Saver object as follows, in supervised_train.py:

saver = tf.train.Saver()
for epoch in range(FLAGS.epochs):
    # training loop body goes here
    saver.save(sess, os.path.join(log_dir(), "model.ckpt"), global_step=total_steps)

This worked nicely for some of the small examples I tried, but now I am facing an error message saying the GraphDef to be written is too large:

<cut off a bunch of lines of traceback>

  File "/work/GraphSAGE/graphsage/supervised_train.py", line 272, in train
    saver.save(sess, os.path.join(log_dir(), "model.ckpt"), global_step=total_steps)
  File "tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1494, in save
    self.export_meta_graph(meta_graph_filename)
  File "tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1522, in export_meta_graph
    graph_def=ops.get_default_graph().as_graph_def(add_shapes=True),
  File "tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2361, in as_graph_def
    result, _ = self._as_graph_def(from_version, add_shapes)
  File "tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2324, in _as_graph_def
    raise ValueError("GraphDef cannot be larger than 2GB.")
ValueError: GraphDef cannot be larger than 2GB.

I tried to cut down on the amount of data to save by specifying tf.Saver to only write the trainable parameters:

saver = tf.train.Saver(var_list=tf.trainable_variables())

However, the error persists.

Any ideas how I could save the necessary parts of the computational graph to be able to later load it from disk and apply it to new data?
