
Inductive Representation Learning in Temporal Networks via Causal Anonymous Walks

Introduction

This is the reference PyTorch implementation of the paper:
Inductive Representation Learning in Temporal Networks via Causal Anonymous Walks.

The project website is: https://snap.stanford.edu/caw/

Authors

Yanbang Wang, Yen-Yu Chang, Yunyu Liu, Jure Leskovec, Pan Li

Requirements

  • Python >= 3.7 and PyTorch >= 1.4; please refer to their official websites for installation details.
  • Other dependencies:
pandas==0.24.2
tqdm==4.41.1
numpy==1.16.4
scikit_learn==0.22.1
matplotlib==3.3.1
numba==0.51.2

Refer to environment.yml for more details.

Dataset and preprocessing

Option 1: Use our preprocessed data

We provide preprocessed datasets: Reddit, Wikipedia, Enron, and UCI. Download them from here to processed/. Then run the following:

cd processed/
unzip data.zip

You may check that each dataset corresponds to three files: one .csv containing the timestamped links, and two .npy files containing the node and link features. Note that some datasets do not have node or link features, in which case the .npy files are all zeros.
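For a quick sanity check after unzipping, a minimal sketch like the following (assuming the ml_${DATA_NAME} file naming described under Option 2 below, with wikipedia as an example) loads one dataset and verifies the expected files and shapes:

import numpy as np
import pandas as pd

# Load the three files of one dataset (wikipedia as an example).
df = pd.read_csv('processed/ml_wikipedia.csv')
edge_feat = np.load('processed/ml_wikipedia.npy')       # link features
node_feat = np.load('processed/ml_wikipedia_node.npy')  # node features

print(df.columns.tolist())               # should include u, i, ts, label, idx
print(edge_feat.shape, node_feat.shape)  # [#edges + 1, d_e], [#nodes + 1, d_n]
# Datasets without real features ship all-zero arrays:
print('all-zero link features:', not edge_feat.any())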

Option 2: Use your own data

Put your data under the processed/ folder. The required input files are ml_${DATA_NAME}.csv, ml_${DATA_NAME}.npy, and ml_${DATA_NAME}_node.npy. They store the edge list, edge features, and node features, respectively.

The .csv file has the following columns:

u, i, ts, label, idx

which represent the source node index, target node index, timestamp, edge label, and edge index, respectively.

ml_${DATA_NAME}.npy has shape [#temporal edges + 1, edge feature dimension]. Similarly, ml_${DATA_NAME}_node.npy has shape [#nodes + 1, node feature dimension].

All node indices start from 1. The zero index is reserved for null during padding operations, so the maximum node index equals the total number of nodes. Similarly, the maximum edge index equals the total number of temporal edges. The padding (null) embedding is a vector of zeros.

We also recommend discretizing the timestamps (ts) into integers for better indexing.
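To make these conventions concrete, here is a hedged sketch (the toy edge list and feature dimensions are made up) that writes the three required files from scratch, with node and edge indices starting at 1 and row 0 of each .npy reserved for the zero padding embedding:

import numpy as np
import pandas as pd

DATA_NAME = 'mydata'
EDGE_FEAT_DIM, NODE_FEAT_DIM = 172, 172   # choose your own dimensions

# Raw edges: (source, target, timestamp); node ids here are already 1-based,
# and timestamps are already discretized into integers.
raw = [(1, 2, 0), (2, 3, 1), (1, 3, 2)]

df = pd.DataFrame(raw, columns=['u', 'i', 'ts'])
df['label'] = 0                           # edge labels (0 if none)
df['idx'] = np.arange(1, len(df) + 1)     # edge indices start from 1
df.to_csv(f'processed/ml_{DATA_NAME}.csv', index=False)  # run from the repo root

n_nodes = int(max(df['u'].max(), df['i'].max()))
# Row 0 of each array is the reserved null/padding embedding, hence the +1.
np.save(f'processed/ml_{DATA_NAME}.npy', np.zeros((len(df) + 1, EDGE_FEAT_DIM)))
np.save(f'processed/ml_{DATA_NAME}_node.npy', np.zeros((n_nodes + 1, NODE_FEAT_DIM)))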

Training Commands

Examples:

  • To train CAW-N-mean on the Wikipedia dataset in inductive mode, sampling 64 length-2 CAWs for every node, with alpha = 1e-5:
python main.py -d wikipedia --pos_dim 108 --bs 32 --n_degree 64 1 --mode i --bias 1e-5 --pos_enc lp --walk_pool sum --seed 0
  • To train CAW-N-attn on the UCI dataset in transductive mode, sampling 32 length-1 CAWs for every node, with alpha = 1e-6 and a different random seed (123):
python main.py -d uci --pos_dim 100 --bs 32 --n_degree 32 --n_layer 1 --mode t --bias 1e-6 --pos_enc lp --walk_pool attn --seed 123

Detailed logs can be found in log/; a one-line summary of the evaluation results is also written to log/oneline_summary.log upon completion.

Usage Summary

usage: Interface for Inductive Dynamic Representation Learning for Link Prediction on Temporal Graphs
       [-h] [-d {wikipedia,reddit,socialevolve,uci,enron,socialevolve_1month,socialevolve_2weeks}] [-m {t,i}]
       [--n_degree [N_DEGREE [N_DEGREE ...]]] [--n_layer N_LAYER] [--bias BIAS] [--agg {tree,walk}] [--pos_enc {spd,lp,saw}]
       [--pos_dim POS_DIM] [--pos_sample {multinomial,binary}] [--walk_pool {attn,sum}] [--walk_n_head WALK_N_HEAD]
       [--walk_mutual] [--walk_linear_out] [--attn_agg_method {attn,lstm,mean}] [--attn_mode {prod,map}]
       [--attn_n_head ATTN_N_HEAD] [--time {time,pos,empty}] [--n_epoch N_EPOCH] [--bs BS] [--lr LR] [--drop_out DROP_OUT]
       [--tolerance TOLERANCE] [--seed SEED] [--ngh_cache] [--gpu GPU] [--cpu_cores CPU_CORES] [--verbosity VERBOSITY]

Optional arguments

  -h, --help            show this help message and exit
  -d {wikipedia,reddit,socialevolve,uci,enron,socialevolve_1month,socialevolve_2weeks}, --data {wikipedia,reddit,socialevolve,uci,enron,socialevolve_1month,socialevolve_2weeks}
                        data sources to use, try wikipedia or reddit
  -m {t,i}, --mode {t,i}
                        transductive (t) or inductive (i)
  --n_degree [N_DEGREE [N_DEGREE ...]]
                        a list of neighbor sampling numbers for different hops; when only a single element is given,
                        n_layer will be activated
  --n_layer N_LAYER     number of network layers
  --bias BIAS           the hyperparameter alpha controlling the sampling preference for temporal closeness; defaults to
                        0, which gives uniform sampling
  --agg {tree,walk}     tree-based hierarchical aggregation or walk-based flat LSTM aggregation
  --pos_enc {spd,lp,saw}
                        way to encode distances, shortest-path distance or landing probabilities, or self-based anonymous
                        walk (baseline)
  --pos_dim POS_DIM     dimension of the positional embedding
  --pos_sample {multinomial,binary}
                        two practically different sampling methods that are equivalent in theory
  --walk_pool {attn,sum}
                        how to pool the encoded walks, using attention or simple sum; if sum is chosen, it overrides all
                        the other walk_* arguments
  --walk_n_head WALK_N_HEAD
                        number of heads to use for walk attention
  --walk_mutual         whether to do mutual query for source and target node random walks
  --walk_linear_out     whether to linearly project each node's embedding
  --attn_agg_method {attn,lstm,mean}
                        local aggregation method, we only use the default here
  --attn_mode {prod,map}
                        use dot product attention or mapping based, we only use the default here
  --attn_n_head ATTN_N_HEAD
                        number of heads used in tree-shaped attention layer, we only use the default here
  --time {time,pos,empty}
                        how to use time information, we only use the default here
  --n_epoch N_EPOCH     number of epochs
  --bs BS               batch_size
  --lr LR               learning rate
  --drop_out DROP_OUT   dropout probability for all dropout layers
  --tolerance TOLERANCE
                        tolerated marginal improvement for the early stopper
  --seed SEED           random seed for all randomized algorithms
  --ngh_cache           (currently not suggested due to overwhelming memory consumption) cache temporal neighbors previously
                        calculated to speed up repeated lookup
  --gpu GPU             which gpu to use
  --cpu_cores CPU_CORES
                        number of cpu_cores used for position encoding
  --verbosity VERBOSITY
                        verbosity of the program output
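To illustrate what --bias controls, the sketch below shows one plausible form of the time-decay-biased neighbor sampling implied by the help text above (the repository's exact weighting may differ): each temporal neighbor is weighted by how close its timestamp is to the query time, and alpha = 0 falls back to uniform sampling.

import numpy as np

def sample_neighbor(ngh_ts, t_query, alpha=0.0, seed=0):
    # ngh_ts: timestamps of a node's temporal neighbors (all earlier than t_query)
    # alpha:  the --bias hyperparameter; alpha = 0 makes all weights equal (uniform)
    rng = np.random.default_rng(seed)
    weights = np.exp(alpha * (np.asarray(ngh_ts, dtype=float) - t_query))
    probs = weights / weights.sum()
    return rng.choice(len(ngh_ts), p=probs)  # index of the sampled neighbor

# Larger alpha prefers more recent interactions:
print(sample_neighbor([10.0, 90.0, 99.0], t_query=100.0, alpha=0.1))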

Acknowledgement

Our implementation uses the code here as its base and extensively adapts it to our purpose. We thank the authors for sharing their code.

Cite us

If you compare with, build on, or use aspects of the paper and/or code, please cite us:

@inproceedings{wang2021inductive,
  title={Inductive Representation Learning in Temporal Networks via Causal Anonymous Walks},
  author={Yanbang Wang and Yen-Yu Chang and Yunyu Liu and Jure Leskovec and Pan Li},
  booktitle={International Conference on Learning Representations},
  year={2021},
  url={https://openreview.net/forum?id=KYPz4YsCPj}
}


caw's Issues

auc and ap swapped

In the log output, the AUC and AP values are swapped:

CAW/main.py, line 142 (commit 0aab882):

logger.info('Test statistics: {} new-new nodes -- acc: {}, auc: {}, ap: {}'.format(args.mode, test_new_new_acc, test_new_new_ap, test_new_new_auc))

CAW/main.py, line 144 (commit 0aab882):

logger.info('Test statistics: {} new-old nodes -- acc: {}, auc: {}, ap: {}'.format(args.mode, test_new_old_acc, test_new_old_ap, test_new_old_auc))
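Assuming the metrics themselves are computed correctly and only the argument order is wrong, a minimal fix would swap the last two arguments so the values line up with the format string:

logger.info('Test statistics: {} new-new nodes -- acc: {}, auc: {}, ap: {}'.format(args.mode, test_new_new_acc, test_new_new_auc, test_new_new_ap))
logger.info('Test statistics: {} new-old nodes -- acc: {}, auc: {}, ap: {}'.format(args.mode, test_new_old_acc, test_new_old_auc, test_new_old_ap))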

The memory cost continuously increases when running.

I tried to run CAWN on the reddit dataset. The command I used is:
python main.py -d reddit --pos_dim 108 --bs 100 --n_degree 32 1 1 --mode t --bias 1e-8 --pos_enc lp --walk_pool sum --gpu 1
I found that the memory cost of the training process continues to increase. Why is that?

CMake: The package name passed to `find_package_handle_standard_args` (OpenMP_CXX) does not match the name of the calling package (OpenMP)

It first throws the following warning when I try to compile PyTorch from source:

CMake Warning (dev) at /usr/share/cmake3/Modules/FindPackageHandleStandardArgs.cmake:272 (message):
The package name passed to find_package_handle_standard_args (OpenMP_CXX)
does not match the name of the calling package (OpenMP). This can lead to
problems in calling code that expects find_package result variables
(e.g., _FOUND) to follow a certain pattern.
Call Stack (most recent call first):
cmake/Modules/FindOpenMP.cmake:565 (find_package_handle_standard_args)
cmake/Modules/FindMKL.cmake:213 (FIND_PACKAGE)
cmake/Modules/FindMKL.cmake:307 (CHECK_ALL_LIBRARIES)
cmake/Dependencies.cmake:140 (find_package)
CMakeLists.txt:564 (include)
This warning is for project developers. Use -Wno-dev to suppress it.

And,

  • MKL OpenMP type: GNU
    -- MKL OpenMP library: -fopenmp
    -- Brace yourself, we are building NNPACK
    -- NNPACK backend is x86-64
    -- Failed to find LLVM FileCheck
    -- git Version: v1.4.0-505be96a
    -- Version: 1.4.0
    -- Performing Test HAVE_STD_REGEX -- compiled but failed to run
    -- Performing Test HAVE_GNU_POSIX_REGEX -- failed to compile
    -- Performing Test HAVE_POSIX_REGEX -- compiled but failed to run
    CMake Error at third_party/benchmark/CMakeLists.txt:231 (message):
    Failed to determine the source files for the regular expression backend

and it exits with: -- Configuring incomplete, errors occurred!

Any help on how to configure OpenMP?

Features of datasets

Hi dear authors,
I have noticed that the processed files contain node features and edge features for all datasets. As far as I know, the original ENRON/UCI/social-evolution datasets do not contain any node/edge features. I checked your paper, but it doesn't mention any details about initializing node or edge features. Can you describe how the node and edge features in these .npy files were initialized? Thanks!

Discrepancy between processed UCI data provided and referenced UCI Forum data (http://konect.cc/networks/opsahl-ucforum/)

Nice work!
I had some questions regarding the datasets used for the experiments.

The UCI dataset mentioned in the paper (Appendix C, page 16) references http://konect.cc/networks/opsahl-ucforum/.
That dataset is a bipartite network between users and forums, with 1,421 total nodes (899 users and 522 forums) and 33,720 interactions.

However, the processed data provided in this repo and probably used to report numbers in the paper has 1,899 total nodes and 59,835 interactions.

On closer inspection of the processed dataset, I find that you have actually used the UCI Messages dataset (http://konect.cc/networks/opsahl-ucsocial/), which is not bipartite; its interactions are messages between user nodes.

It would be great if you could clarify this discrepancy: either the paper's description of the UCI dataset is incorrect, or the processed dataset provided is wrong.

Experiment on UCI

Hello!
I tried your execution command for the UCI dataset in README.md, but I didn't get experimental results as good as those in the paper. I tried adjusting the relevant parameters, but there was no significant improvement. How can I reproduce the experimental results in the paper?
Thank you~
