
bns-gcn's Introduction

BNS-GCN: Efficient Full-Graph Training of Graph Convolutional Networks with Partition-Parallelism and Random Boundary Node Sampling

Cheng Wan* (Rice University), Youjie Li* (UIUC), Ang Li (PNNL), Nam Sung Kim (UIUC), Yingyan Lin (Rice University)

(*Equal contribution)

Accepted at MLSys 2022 [Paper | Video | Slide | Docker | Sibling]

Directory Structure

|-- checkpoint   # model checkpoints
|-- dataset
|-- helper       # auxiliary codes
|   `-- timer
|-- module       # PyTorch modules
|-- partitions   # partitions of input graphs
|-- results      # experiment outputs
`-- scripts      # example scripts

Note that ./checkpoint/, ./dataset/, ./partitions/ and ./results/ are initially empty and will be populated automatically once BNS-GCN is launched.

Setup

Environment

Hardware Dependencies

  • An x86 CPU machine with at least 120 GB of host memory
  • At least five NVIDIA GPUs (at least 11 GB of memory each)

Software Dependencies

Installation

Option 1: Run with Docker

We have prepared a Docker package for BNS-GCN.

docker pull cheng1016/bns-gcn
docker run --gpus all -it cheng1016/bns-gcn

Option 2: Install with Pip

Running the following command installs all prerequisites via pip.

pip install -r requirements.txt

Option 3: Do it Yourself

If the above options fail to run BNS-GCN, please follow the official guides ([1], [2], [3]) to install PyTorch, DGL and OGB.

Datasets

We use Reddit, ogbn-products, Yelp and ogbn-papers100M for evaluating BNS-GCN. All datasets are supposed to be stored in ./dataset/ by default.

Basic Usage

Core Training Options

  • --dataset: the dataset you want to use
  • --model: the GCN model (currently only GCN, GraphSAGE and GAT are supported)
  • --lr: learning rate
  • --sampling-rate: the sampling rate of BNS-GCN
  • --n-epochs: the number of training epochs
  • --n-partitions: the number of partitions
  • --n-hidden: the number of hidden units
  • --n-layers: the number of GCN layers
  • --partition-method: the method for graph partition ('metis' or 'random')
  • --port: the network port for communication
  • --no-eval: disable evaluation process
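To illustrate what --sampling-rate controls, here is a minimal conceptual sketch of random boundary-node sampling. This is an illustration only, not the repository's actual DGL-based implementation; the function name sample_boundary_nodes is hypothetical.

```python
import random

def sample_boundary_nodes(boundary_nodes, sampling_rate, seed=None):
    # Keep a random fraction of the boundary nodes for this epoch;
    # only the kept nodes are communicated across partitions.
    rng = random.Random(seed)
    k = int(len(boundary_nodes) * sampling_rate)
    return rng.sample(boundary_nodes, k)

# With --sampling-rate 0.1, roughly 10% of boundary nodes survive each
# epoch, shrinking both communication volume and memory footprint.
kept = sample_boundary_nodes(list(range(1000)), 0.1, seed=0)
print(len(kept))  # 100
```

A fresh subset is drawn every epoch, so in expectation each boundary node still contributes to training while per-epoch communication stays low.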

Run Example Scripts

Simply running scripts/reddit.sh, scripts/ogbn-products.sh and scripts/yelp.sh reproduces BNS-GCN under the default settings. For example, after running bash scripts/reddit.sh, you will see output like this:

...
Process 000 | Epoch 02999 | Time(s) 0.3578 | Comm(s) 0.2267 | Reduce(s) 0.0108 | Loss 0.0716
Process 001 | Epoch 02999 | Time(s) 0.3600 | Comm(s) 0.2314 | Reduce(s) 0.0136 | Loss 0.0867
(rank 1) memory stats: current 562.96MB, peak 1997.89MB, reserved 2320.00MB
(rank 0) memory stats: current 557.01MB, peak 2087.31MB, reserved 2296.00MB
Epoch 02999 | Accuracy 96.55%
model saved
Max Validation Accuracy 96.68%
Test Result | Accuracy 97.21%

Run Full Experiments

If you want to reproduce the core experiments of our paper (e.g., accuracy in Table 4, throughput in Figure 4, time breakdown in Figure 5, peak memory in Figure 6), please run scripts/reddit_full.sh, scripts/ogbn-products_full.sh or scripts/yelp_full.sh; the outputs will be saved to the ./results/ directory. Note that the throughput in these experiments will be significantly lower than the results in our paper because training is performed alongside validation.

Run Customized Settings

You may adjust --n-partitions and --sampling-rate to reproduce the results of BNS-GCN under other settings. To measure the exact throughput or time breakdown of BNS-GCN, please add the --no-eval argument to skip the evaluation step. You may also use --partition-method=random to explore the performance of BNS-GCN with random partitioning.
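As a concrete sketch, a customized single-machine run might be launched like this. The entry-point name main.py is an assumption; check the example scripts under scripts/ for the exact command the authors use.

```shell
# Hypothetical command line; the flags follow the options listed above.
CMD="python main.py --dataset reddit --n-partitions 8 --sampling-rate 0.1 --partition-method random --no-eval"
echo "$CMD"
```

Lower values of --sampling-rate trade a little accuracy for less communication; --no-eval keeps the timing measurements free of evaluation overhead.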

Run with Multiple Compute Nodes

Our codebase also supports distributed GCN training across multiple compute nodes. To achieve this, you should specify --master-addr, --node-rank and --parts-per-node for each compute node. An example is provided in scripts/reddit_multi_node.sh, where we train the Reddit graph over 4 compute nodes, each containing 10 GPUs, with 40 partitions in total. You should run the command on each node, specifying the corresponding node rank. Please enable the --fix-seed argument so that all nodes initialize the same model weights.

If the compute nodes do not share storage, you should partition the graph on a single machine first and manually distribute the partitions to the other compute nodes. When running the training script, please enable the --skip-partition argument.
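Putting the multi-node flags together, a sketch of the per-node launch for the 4-node, 40-partition setup might look like the following. The entry-point name main.py and the master address are assumptions; in practice, each printed command is run on the node whose rank it names (the loop here only prints them for illustration).

```shell
MASTER_ADDR=10.0.0.1   # hypothetical IP of the rank-0 node
for NODE_RANK in 0 1 2 3; do
  # 40 partitions total, 10 per node; every node gets the same seed so
  # model weights initialize identically (--fix-seed).
  CMD="python main.py --dataset reddit --n-partitions 40 --parts-per-node 10 --node-rank $NODE_RANK --master-addr $MASTER_ADDR --fix-seed"
  echo "$CMD"
done
```

Only --node-rank differs between nodes; every other flag must match exactly across the cluster.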

Citation

@article{wan2022bns,
  title={{BNS-GCN}: Efficient Full-graph Training of Graph Convolutional Networks with Partition-parallelism and Random Boundary Node Sampling},
  author={Wan, Cheng and Li, Youjie and Li, Ang and Kim, Nam Sung and Lin, Yingyan},
  journal={Proceedings of Machine Learning and Systems},
  volume={4},
  pages={673--693},
  year={2022}
}

License

Copyright (c) 2022 GaTech-EIC. All rights reserved.

Licensed under the MIT license.

bns-gcn's People

Contributors

chwan1016

bns-gcn's Issues

A question about experiment results on paper

Dear authors of BNS-GCN:
In Table 9 of the paper, I have questions about the Epoch Time and Epoch Comm numbers.
[screenshot of Table 9 omitted]

  1. How is the Epoch Comm number obtained? The experiment output includes "memory stats"; is it taken from that?
  2. In the output of each epoch, every process reports its own epoch time. Is the final epoch time the average across processes, or the maximum? (Same question for Epoch Comm.)

Thank you very much!

How to use other two models in bns-gcn

Hello, I have encountered some problems when using BNS-GCN. As can be seen from the code you provided, there are three models available. But after I modified the model specified in the script, there seems to be an error. I am not sure what I have to modify if I want to use the other two models.

partition for paper100m

Hi,

thanks for your help. I tried to partition the large dataset papers100M, but it fails with the log below. Could you help resolve that? I am sure there is enough memory (~1 TB). I am not sure whether it is because there is no GPU on this node. (If so, could you please share the partition results for papers100M, say 40 partitions?)

Converting to homogeneous graph takes 7.768s, peak mem: 227.382 GB
Illegal instruction

Thanks.

An inquiry about the derivation in your paper

May I ask how the boxed steps in the formula derivation on pages 14-15 of your article "BNS-GCN: Efficient Full-Graph Training of Graph Convolutional Networks with Partition-Parallelism and Random Boundary Node Sampling" were derived?
[screenshot of the boxed derivation omitted]

A question about epoch time breakdown

Dear authors of BNS-GCN,

I have a question about the epoch time breakdown.
[screenshot of per-epoch time breakdown omitted]
For the first time column in the screenshot, is it the computation time or the total time of one training epoch?
I have already added the --no-eval argument, but the time breakdown results still look strange and do not seem to match the results in the paper. That is why I am asking.
Thanks!
