
graphstorm's Introduction

GraphStorm

| Document and Tutorial Site | GraphStorm Paper |

GraphStorm is a graph machine learning (GML) framework for enterprise use cases. It simplifies the development, training and deployment of GML models for industry-scale graphs by providing scalable training and inference pipelines for extremely large graphs (measured in billions of nodes and edges). GraphStorm provides a collection of built-in GML models, and users can train one with a single command without writing any code. To help develop SOTA models, GraphStorm provides a large collection of configurations for customizing model implementations and training pipelines to improve model performance. GraphStorm also provides a programming interface to train any custom GML model in a distributed manner: users provide their own model implementations and use the GraphStorm training pipeline to scale.

GraphStorm architecture

Get Started

Installation

GraphStorm is compatible with Python 3.7+. It requires PyTorch 1.13+, DGL 1.0 and transformers 4.3.0+.

GraphStorm can be installed with pip and used to train GNN models in standalone mode. To run GraphStorm in a distributed environment, we recommend using a Docker container to reduce environment setup effort. A guide to setting up the GraphStorm running environment can be found here, and full instructions on how to set up distributed training can be found here.

Run GraphStorm with OGB datasets

Note: we assume users have set up a GraphStorm standalone environment following the Setup GraphStorm with pip Packages instructions, and have git cloned the GraphStorm source code into the /graphstorm/ folder to use some complementary tools.

Node classification on OGB arxiv graph

First, use the command below to download the OGB arxiv data and process it into a DGL graph for the node classification task.

python /graphstorm/tools/gen_ogb_dataset.py --savepath /tmp/ogbn-arxiv-nc/ --retain-original-features true

Second, use the below command to partition this arxiv graph into a distributed graph that GraphStorm can use as its input.

python /graphstorm/tools/partition_graph.py --dataset ogbn-arxiv \
                                            --filepath /tmp/ogbn-arxiv-nc/ \
                                            --num-parts 1 \
                                            --num-trainers-per-machine 4 \
                                            --output /tmp/ogbn_arxiv_nc_train_val_1p_4t

GraphStorm training relies on ssh to launch training jobs. The GraphStorm standalone mode uses the ssh service on port 22.

In addition, to run GraphStorm training on a single machine, users need to create an ip_list.txt file that contains one row as below, which facilitates ssh communication with the machine itself.

127.0.0.1

Users can use the following command to create the simple ip_list.txt file.

touch /tmp/ip_list.txt
echo 127.0.0.1 > /tmp/ip_list.txt

Third, run the below command to train an RGCN model to perform node classification on the partitioned arxiv graph.

python -m graphstorm.run.gs_node_classification \
       --workspace /tmp/ogbn-arxiv-nc \
       --num-trainers 1 \
       --num-servers 1 \
       --num-samplers 0 \
       --part-config /tmp/ogbn_arxiv_nc_train_val_1p_4t/ogbn-arxiv.json \
       --ip-config  /tmp/ip_list.txt \
       --ssh-port 22 \
       --cf /graphstorm/training_scripts/gsgnn_np/arxiv_nc.yaml \
       --save-perf-results-path /tmp/ogbn-arxiv-nc/models

Link Prediction on OGB MAG graph

First, use the command below to download the OGB MAG data and process it into a DGL graph for the link prediction task. The edge type for prediction is “author,writes,paper”. The command also sets 80% of the edges of this type for training and validation (default 10%), and the remaining 20% for testing.

python /graphstorm/tools/gen_mag_dataset.py --savepath /tmp/ogbn-mag-lp/ --edge-pct 0.8

Second, use the following command to partition the MAG graph into a distributed format.

python /graphstorm/tools/partition_graph_lp.py --dataset ogbn-mag \
                                               --filepath /tmp/ogbn-mag-lp/ \
                                               --num-parts 1 \
                                               --num-trainers-per-machine 4 \
                                               --target-etypes author,writes,paper \
                                               --output /tmp/ogbn_mag_lp_train_val_1p_4t

Third, run the below command to train an RGCN model to perform link prediction on the partitioned MAG graph.

python -m graphstorm.run.gs_link_prediction \
       --workspace /tmp/ogbn-mag-lp/ \
       --num-trainers 1 \
       --num-servers 1 \
       --num-samplers 0 \
       --part-config /tmp/ogbn_mag_lp_train_val_1p_4t/ogbn-mag.json \
       --ip-config /tmp/ip_list.txt \
       --ssh-port 22 \
       --cf /graphstorm/training_scripts/gsgnn_lp/mag_lp.yaml \
       --node-feat-name paper:feat \
       --save-model-path /tmp/ogbn-mag/models \
       --save-perf-results-path /tmp/ogbn-mag/models

To learn GraphStorm's full capabilities, please refer to our Documentations and Tutorials.

Cite

If you use GraphStorm in a scientific publication, we would appreciate citations to the following paper:

@article{zheng2024graphstorm,
  title={GraphStorm: all-in-one graph machine learning framework for industry applications},
  author={Zheng, Da and Song, Xiang and Zhu, Qi and Zhang, Jian and Vasiloudis, Theodore and Ma, Runjie and Zhang, Houyu and Wang, Zichen and Adeshina, Soji and Nisa, Israt and others},
  journal={arXiv preprint arXiv:2406.06022},
  year={2024}
}

Limitation

GraphStorm currently supports using CPUs or NVIDIA GPUs for model training and inference, but it only works with the PyTorch gloo backend. It has only been tested on AWS CPU instances and on AWS GPU instances equipped with NVIDIA GPUs, including P4, V100, A10 and A100.

Multiple samplers are supported in PyTorch versions <= 1.12 and >= 2.1.0. Please use --num-samplers 0 for other PyTorch versions. More details here.

To use multiple samplers on SageMaker, please use PyTorch versions <= 1.12.

License

This project is licensed under the Apache-2.0 License.

graphstorm's People

Contributors

amazon-auto, chang-l, classicsong, congweilin, dominikajedynak, gentlezhu, houyuzhang1007, isratnisa, jalencato, jcy1rus, kacper-pietkun, luqiy, oceanusity, prateekdesai04, thvasilo, wangz10, zheng-da, zhjwy9343, znyzhouwl

graphstorm's Issues

GraphStorm needs to check the input data and model config.

Traceback (most recent call last):
  File "gsgnn_ep.py", line 141, in <module>
    main(args)
  File "gsgnn_ep.py", line 111, in main
    trainer.fit(train_loader=dataloader, val_loader=val_dataloader,
  File "/graph-storm/python/graphstorm/trainer/ep_trainer.py", line 143, in fit
    loss = model(blocks, batch_graph, input_feats, None, lbl, input_nodes)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/graph-storm/python/graphstorm/model/edge_gnn.py", line 119, in forward
    pred_loss = self.loss_func(logits, labels[target_etype])
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/graph-storm/python/graphstorm/model/loss_func.py", line 51, in forward
    return self.loss_fn(logits, labels)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 3026, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: "nll_loss_forward_reduce_cuda_kernel_2d_index" not implemented for 'Int'

Define minimum requirements in setup.py

Our setup.py currently does not specify minimum library versions, although the code assumes that some dependencies are at least at specific versions. In particular:

ogb >= 1.3.6,
torch >= 1.12
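
As a rough sketch of what such a declaration might look like, the list below writes the two minimums named above as requirement strings of the kind setuptools accepts in install_requires. The MIN_REQUIREMENTS name and the meets_minimum helper are illustrative assumptions, not part of GraphStorm's setup.py, and the comparison is a simplification that only looks at the leading numeric part of a version.

```python
import re

# Minimum versions named in this issue, written as requirement strings
# that could be passed to setuptools' install_requires (assumed usage).
MIN_REQUIREMENTS = ["ogb>=1.3.6", "torch>=1.12"]


def parse_version(version):
    """Turn '1.13.1+cu117' into (1, 13, 1); any non-numeric suffix is ignored."""
    numeric = re.match(r"\d+(\.\d+)*", version)
    if numeric is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return tuple(int(part) for part in numeric.group(0).split("."))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad the shorter tuple with zeros so (1, 12) compares equal to (1, 12, 0).
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

For example, meets_minimum("1.3.5", "1.3.6") is False, so an environment with ogb 1.3.5 would be rejected under these minimums.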

[Bug] new launch API can not exit in other instances after break in the launch instance

issue description:

In the new launch API, when using Ctrl+C to exit the training/inference processes on the launch instance, the GraphStorm processes on the other instances DO NOT exit! The launch script does not kill the processes on the other instances.

Log track:

Step 40 | Validation mrr: 0.0590
Step 40 | Test mrr: 0.0588
Step 40 | Best Validation mrr: 0.0590
Step 40 | Best Test mrr: 0.0588
Step 40 | Best Iteration mrr: 40.0000
Eval time: 61.0621, Evaluation step: 40.0000
evaluate validation/test: elapsed time: 17.759, mem (curr: 14.181, peak: 14.181, shared: 8.483, global curr: 25.959, global shared: 12.100) GB
successfully save the model to /data/ak/models-test/epoch-0-iter-39
Time on save model 0.0036745071411132812
Epoch 00000 | Batch 040 | Train Loss: 0.3333 | Time: 9.5225
^CProcess Process-1:
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.11.170 'cd /data/ak; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=16 DGL_CONF_PATH=/data/ak/ak_lp_4p/ak.json DGL_IP_CONFIG=/data/ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo PYTHONPATH=/graphstorm/python DGL_SERVER_ID=3; /usr/bin/python3 /graphstorm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf ak_lp.yaml --evaluation-frequency 20 --save-model-per-iters 20 --save-model-path /data/ak/models-test --save-embed-path /data/ak/embed-test --ip-config /data/ip_list.txt --part-config /data/ak/ak_lp_4p/ak.json --verbose False)'' died with <Signals.SIGINT: 2>.
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/graphstorm/python/graphstorm/run/launch.py", line 58, in cleanup_proc
remote_pids = get_all_remote_pids_func()
File "/graphstorm/python/graphstorm/run/launch.py", line 279, in get_all_remote_pids
cmds = udf_command.split()
AttributeError: 'list' object has no attribute 'split'
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.2.213 'cd /data/ak; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=16 DGL_CONF_PATH=/data/ak/ak_lp_4p/ak.json DGL_IP_CONFIG=/data/ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo OMP_NUM_THREADS=6 DGL_GROUP_ID=0 PYTHONPATH=/graphstorm/python ; /usr/bin/python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=4 --node_rank=1 --master_addr=172.31.11.145 --master_port=1234 /graphstorm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf ak_lp.yaml --evaluation-frequency 20 --save-model-per-iters 20 --save-model-path /data/ak/models-test --save-embed-path /data/ak/embed-test --ip-config /data/ip_list.txt --part-config /data/ak/ak_lp_4p/ak.json --verbose False)'' died with <Signals.SIGINT: 2>.
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.11.170 'cd /data/ak; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=16 DGL_CONF_PATH=/data/ak/ak_lp_4p/ak.json DGL_IP_CONFIG=/data/ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo OMP_NUM_THREADS=6 DGL_GROUP_ID=0 PYTHONPATH=/graphstorm/python ; /usr/bin/python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=4 --node_rank=3 --master_addr=172.31.11.145 --master_port=1234 /graphstorm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf ak_lp.yaml --evaluation-frequency 20 --save-model-per-iters 20 --save-model-path /data/ak/models-test --save-embed-path /data/ak/embed-test --ip-config /data/ip_list.txt --part-config /data/ak/ak_lp_4p/ak.json --verbose False)'' died with <Signals.SIGINT: 2>.
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.14.145 'cd /data/ak; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=16 DGL_CONF_PATH=/data/ak/ak_lp_4p/ak.json DGL_IP_CONFIG=/data/ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo OMP_NUM_THREADS=6 DGL_GROUP_ID=0 PYTHONPATH=/graphstorm/python ; /usr/bin/python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=4 --node_rank=2 --master_addr=172.31.11.145 --master_port=1234 /graphstorm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf ak_lp.yaml --evaluation-frequency 20 --save-model-per-iters 20 --save-model-path /data/ak/models-test --save-embed-path /data/ak/embed-test --ip-config /data/ip_list.txt --part-config /data/ak/ak_lp_4p/ak.json --verbose False)'' died with <Signals.SIGINT: 2>.
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.11.145 'cd /data/ak; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=16 DGL_CONF_PATH=/data/ak/ak_lp_4p/ak.json DGL_IP_CONFIG=/data/ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo PYTHONPATH=/graphstorm/python DGL_SERVER_ID=0; /usr/bin/python3 /graphstorm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf ak_lp.yaml --evaluation-frequency 20 --save-model-per-iters 20 --save-model-path /data/ak/models-test --save-embed-path /data/ak/embed-test --ip-config /data/ip_list.txt --part-config /data/ak/ak_lp_4p/ak.json --verbose False)'' died with <Signals.SIGINT: 2>.
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.11.145 'cd /data/ak; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=16 DGL_CONF_PATH=/data/ak/ak_lp_4p/ak.json DGL_IP_CONFIG=/data/ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo OMP_NUM_THREADS=6 DGL_GROUP_ID=0 PYTHONPATH=/graphstorm/python ; /usr/bin/python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=4 --node_rank=0 --master_addr=172.31.11.145 --master_port=1234 /graphstorm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf ak_lp.yaml --evaluation-frequency 20 --save-model-per-iters 20 --save-model-path /data/ak/models-test --save-embed-path /data/ak/embed-test --ip-config /data/ip_list.txt --part-config /data/ak/ak_lp_4p/ak.json --verbose False)'' died with <Signals.SIGINT: 2>.
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.14.145 'cd /data/ak; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=16 DGL_CONF_PATH=/data/ak/ak_lp_4p/ak.json DGL_IP_CONFIG=/data/ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo PYTHONPATH=/graphstorm/python DGL_SERVER_ID=2; /usr/bin/python3 /graphstorm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf ak_lp.yaml --evaluation-frequency 20 --save-model-per-iters 20 --save-model-path /data/ak/models-test --save-embed-path /data/ak/embed-test --ip-config /data/ip_list.txt --part-config /data/ak/ak_lp_4p/ak.json --verbose False)'' died with <Signals.SIGINT: 2>.
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.2.213 'cd /data/ak; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=16 DGL_CONF_PATH=/data/ak/ak_lp_4p/ak.json DGL_IP_CONFIG=/data/ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo PYTHONPATH=/graphstorm/python DGL_SERVER_ID=1; /usr/bin/python3 /graphstorm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf ak_lp.yaml --evaluation-frequency 20 --save-model-per-iters 20 --save-model-path /data/ak/models-test --save-embed-path /data/ak/embed-test --ip-config /data/ip_list.txt --part-config /data/ak/ak_lp_4p/ak.json --verbose False)'' died with <Signals.SIGINT: 2>.
^C
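
The AttributeError in the log comes from calling .split() on a command that is already a list. One possible way to make such code tolerant of both forms is sketched below; command_tokens is a hypothetical helper, not the function used in launch.py, and it uses shlex.split rather than plain str.split so that quoted arguments stay together.

```python
import shlex


def command_tokens(udf_command):
    """Return the command as a list of tokens, whether given as str or list."""
    if isinstance(udf_command, str):
        # shlex.split keeps quoted arguments together, unlike str.split.
        return shlex.split(udf_command)
    # Already a sequence of tokens: copy it and coerce each item to str.
    return [str(token) for token in udf_command]
```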

[Feature Request] Weighted edge loss.

When doing link prediction training, different edges may have different weights representing their importance. The link prediction loss therefore needs to support weighted edges.

We are going to add a new argument --lp-edge-weight-for-loss to specify the edge weight used in the loss function.

Related #214
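
To illustrate what a weighted link prediction loss could compute, here is a minimal pure-Python sketch of a binary cross-entropy in which each positive edge carries its own weight. The function name, the choice to give negative edges weight 1, and the normalization by total weight are all assumptions made for illustration; the issue does not specify how GraphStorm implements this.

```python
import math


def weighted_lp_loss(pos_scores, neg_scores, pos_weights):
    """Binary cross-entropy over edge scores with per-positive-edge weights.

    pos_scores / neg_scores are raw (pre-sigmoid) scores; pos_weights holds
    one non-negative weight per positive edge. Negative edges get weight 1.
    """
    if len(pos_scores) != len(pos_weights):
        raise ValueError("need exactly one weight per positive edge")

    def softplus(x):
        # Numerically stable log(1 + exp(x)).
        return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

    # -log(sigmoid(s)) == softplus(-s); -log(1 - sigmoid(s)) == softplus(s).
    pos_loss = sum(w * softplus(-s) for s, w in zip(pos_scores, pos_weights))
    neg_loss = sum(softplus(s) for s in neg_scores)
    total_weight = sum(pos_weights) + len(neg_scores)
    if total_weight == 0:
        raise ValueError("total weight must be positive")
    return (pos_loss + neg_loss) / total_weight
```

With this formulation, raising the weight of a poorly scored positive edge raises the loss, so the optimizer is pushed harder to fix the edges marked as important.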

Verify the correctness of GraphStorm input config.

  • Don't require users to provide graph name.
  • Verify the node/edge feature names and label names.
  • If infer_data.test_idxs is empty, return an error in the inference script.
  • We may want to warn users if they choose mini-batch inference (mini-batch inference on a large graph can take a very long time).
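
A check along these lines might be sketched as follows. The function and the config keys it reads ("node_feat_name", "label_field", "task", "test_idxs") are hypothetical names chosen for illustration and are not taken from GraphStorm's configuration schema; the point is only to collect readable errors before training or inference starts.

```python
def check_input_config(config, node_feat_names, label_names):
    """Collect human-readable problems instead of failing deep in training.

    config: dict of settings; node_feat_names / label_names: the names that
    actually exist in the graph data. All key names here are hypothetical.
    """
    errors = []
    for name in config.get("node_feat_name", []):
        if name not in node_feat_names:
            errors.append(f"unknown node feature: {name!r}")
    label = config.get("label_field")
    if label is not None and label not in label_names:
        errors.append(f"unknown label field: {label!r}")
    # An inference run with no test indices has nothing to predict on.
    if config.get("task") == "inference" and not config.get("test_idxs"):
        errors.append("infer_data.test_idxs is empty")
    return errors
```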

[Doc] Add a tutorial

The tutorial will include:

  1. Setting up the GraphStorm environment.
  2. Preparing graph data.
  3. Training a GNN model using the GraphStorm framework.
  4. Running GNN inference using the GraphStorm framework.
  5. Collecting model artifacts and prediction results.

Have a scalable way of saving learnable embeddings

Currently, trainer 0 saves the learnable embeddings to the disk. If the learnable embedding table is very large, trainer 0 will run out of memory. We need to distribute the saving of learnable embeddings.
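
One simple way to distribute the work, sketched here under the assumption that each trainer can write its own slice of the table, is to give every rank a contiguous row range. The rows_for_rank helper is hypothetical and says nothing about how GraphStorm actually shards the save.

```python
def rows_for_rank(num_rows, world_size, rank):
    """Return the [start, end) row range that one trainer should save."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    base, extra = divmod(num_rows, world_size)
    # The first `extra` ranks take one additional row each.
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end
```

Each trainer then only needs to hold its own slice in memory while writing, rather than trainer 0 holding the whole table.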

[Feature Request] Add SageMaker support.

The Amazon SageMaker support should include:

  • Support launching data processing tasks using Amazon SageMaker.
  • Support launching distributed training tasks using Amazon SageMaker.
  • Support launching distributed inference tasks using Amazon SageMaker.
  • Tutorial.

[Config] Below configuration/arguments are not intuitive or consistent - batch 1

The names of the following configurations/arguments are either unintuitive or inconsistent. We recommend changing them.

  • feat_name: change to "node_feat_name"
    • Use "edge_feat_name" later, when edge features are supported.
  • n_layers: change to "num_layers"
  • n_epochs: change to "num_epochs"
  • sparse_lr: change to "sparse_optimizer_lr"
  • save_predict_path: change to "save_prediction_path"
  • negative_sampler: change to "train_negative_sampler"
  • test_negative_sampler: change to "eval_negative_sampler"
  • enable_early_stop: change to "use_early_stop"
  • lm_infer_batchsize: change to "lm_infer_batch_size"
  • evaluation_frequency: change to "eval_frequency"
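
If backward compatibility with existing configuration files matters, the renames could be applied mechanically. The sketch below copies the old and new names from the list above; the RENAMED_ARGS table and the migrate_config helper are illustrative assumptions rather than GraphStorm code.

```python
# Old name -> proposed new name, copied from the list above.
RENAMED_ARGS = {
    "feat_name": "node_feat_name",
    "n_layers": "num_layers",
    "n_epochs": "num_epochs",
    "sparse_lr": "sparse_optimizer_lr",
    "save_predict_path": "save_prediction_path",
    "negative_sampler": "train_negative_sampler",
    "test_negative_sampler": "eval_negative_sampler",
    "enable_early_stop": "use_early_stop",
    "lm_infer_batchsize": "lm_infer_batch_size",
    "evaluation_frequency": "eval_frequency",
}


def migrate_config(config):
    """Return a copy of config with old argument names replaced by new ones."""
    migrated = {}
    for key, value in config.items():
        new_key = RENAMED_ARGS.get(key, key)
        if new_key in migrated:
            raise ValueError(f"both old and new names given for {new_key!r}")
        migrated[new_key] = value
    return migrated
```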

Support categorical attributes

Some features are categorical values. That is, the original values are strings. We need to convert them to integers.
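
A minimal sketch of such a conversion, assuming integer IDs are assigned in order of first appearance (the issue does not say which ordering GraphStorm should use), could look like this:

```python
def encode_categories(values):
    """Map each distinct string to an integer ID, in order of first appearance."""
    mapping = {}
    encoded = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
        encoded.append(mapping[value])
    return encoded, mapping
```

Returning the mapping alongside the encoded values lets the same string-to-integer assignment be reused at inference time.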

Support in-library launch scripts for built-in model training and inference.

Motivation

To run a GraphStorm training task, a user needs to take the following steps:

  1. Pip install the GraphStorm library
  2. git clone https://github.com/dmlc/dgl.git
  3. git clone https://github.com/awslabs/graphstorm.git
  4. Use the dgl launch.py script to launch a task.

These steps are too complex, as the user needs to download many codebases just to get the entrypoint scripts.

Proposal

We will integrate launch.py, as well as the other entrypoint scripts provided in training_scripts/inference_scripts, into the GraphStorm library. The UX will then be:

  1. Pip install the GraphStorm library
  2. Run a training task using a built-in training script: python3 -m graphstorm.run.gsf_node_classification xxx or python3 -m graphstorm.run.gsf_link_prediction xxx

For a user-defined training script: python3 -m graphstorm.run.launch xxx <PATH_TO_SCRIPT> xxx

[Config] Argument name "num_gpus" default behavior is unknown

In the configuration file, the "num_gpus" argument is confusing. What is its default behavior?

Because this argument only gives a number of GPUs and cannot tell whether some GPUs are shared with other users, its default behavior should be to use ALL GPUs.

Please check the current code to see whether this is the default behavior. If not, please implement it.

[Config] argument "eval_batch_size" may slow down the speed of evaluation

In the configuration, the "eval_batch_size" argument may slow down evaluation because users normally set it to 1k-4k, the same range used for "train_batch_size". In the link prediction task there is an additional computation of test scores, where a smaller batch size may also slow things down.

So,

  • set its default value to a larger one, e.g. 10k or more
  • add another argument, e.g. "lp_score_batch_size", to set the batch size of link prediction score computation.
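
To make the effect of a separate score batch size concrete, the sketch below scores candidate pairs in chunks. The score_in_batches helper and its score_fn callback are hypothetical; the only point is that a larger batch size means fewer calls to the scoring function, at the cost of more memory per call.

```python
def score_in_batches(pairs, score_fn, batch_size):
    """Score all pairs by calling score_fn on chunks of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    scores = []
    for start in range(0, len(pairs), batch_size):
        # Each call handles one chunk; fewer, larger chunks mean fewer calls.
        scores.extend(score_fn(pairs[start:start + batch_size]))
    return scores
```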

Cannot download OGB Datasets

Unable to download OGB datasets required for partitioning and running regression tests.
Affected datasets (arxiv, products, papers100M)
Able to download OGB MAG

[Config] argument "use_dot_product" is not intuitive about its function

In the configuration, the function of the "use_dot_product" argument is not intuitive, and it is easy to trigger an assertion error when it is not set properly.

So

  • it would be better to change it to an intuitive argument name, and,
  • it would be better to determine its value automatically based on other given arguments or graph information.

[Bug] Cannot save sparse embeddings in a distributed setting

Setting:

  • Two nodes, g4dn.12xlarge.
  • On the launch node, create a folder, e.g., "sharedfolder", and use it as an NFS shared folder.
  • On the execution node, create an NFS client and mount the "sharedfolder".
  • Copy the GraphStorm code and data into the sharedfolder.

How to reproduce:

  1. Set the "save-model-path", e.g., "/sharedfolder/model".
  2. Run an LP task with two partitions.
  3. Optionally set "evaluation_frequency" and "save_model_per_iters" to a smaller value, e.g. 10, to reproduce the problem faster.

Problem:

  1. After evaluation and model saving,
  2. the launch node can save sparse embeddings successfully, but
  3. the execution node can NOT save the sparse embeddings.

Root cause:

  1. The launch node's rank 0 process creates "/sharedfolder/model" when it starts to save model artifacts.
  2. The mode of "/sharedfolder/model" is 755 (rwxr-xr-x).
  3. The launch node's rank 0 process creates folders for each save-model-per-iters checkpoint, e.g., "/sharedfolder/model/epoch-0-iter-19" and "/sharedfolder/model/epoch-0-iter-19/item/", whose modes are 755 too.
  4. The launch node's rank 0 process saves "model.bin" and "optimizer.bin" to the "/sharedfolder/model/epoch-0-iter-19" folder.
  5. The launch node's processes can save sparse embeddings to the "/sharedfolder/model/epoch-0-iter-19/item" folder.
  6. BUT, the execution node can NOT save sparse embeddings to the "/sharedfolder/model/epoch-0-iter-19/item" folder, because the 755 mode does not allow processes on the execution node to write to it.

This is caused by the newly added distributed saving of sparse embeddings.
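
One possible workaround, sketched below for a POSIX file system, is to set the directory mode explicitly after creating it, since os.makedirs applies the process umask to the requested mode. The make_shared_dir helper is hypothetical, and whether a world-writable mode such as 777 is acceptable for the shared folder is an assumption that depends on the deployment.

```python
import os
import stat


def make_shared_dir(path, mode=0o777):
    """Create path (and parents) and set the leaf directory's mode explicitly.

    os.makedirs applies the process umask to its mode argument, so a
    follow-up os.chmod is used to guarantee the final permission bits.
    """
    os.makedirs(path, exist_ok=True)
    os.chmod(path, mode)
    # Report the permission bits that are actually in effect.
    return stat.S_IMODE(os.stat(path).st_mode)
```

Only the leaf directory is changed here; parent directories keep whatever mode they were created with, which is enough for writing as long as they remain traversable.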

[Roadmap] V0.1 Release Plan

The V0.1 release will focus mostly on completing and optimizing the existing functionalities and adding data loading support.

  • [Improvement] Improve the user experience of saving/restoring model artifacts in the distributed training process.
  • [Improvement] Refining all configuration arguments.
  • [New Feature] Provide a single-machine data loading pipeline.
  • [Doc] Tutorial of graph construction, GNN model training and GNN model inference.
  • [Improvement] Add support for tuning language model with a separate learning rate.
  • [Improvement] Bug fixes.

Comments and feedback are welcome!

Save embeddings based on the original node IDs

When GNN embeddings are saved, they are saved based on the node IDs in the partitioned graph, which are different from the original node IDs of users' input data. We need to define a simple way for users to identify the right GNN embeddings from the original node IDs.
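
Assuming a mapping from partitioned-graph node IDs back to original node IDs is available (the new_to_orig list below is a hypothetical stand-in for it), the lookup users need could be sketched as:

```python
def embeddings_by_original_id(embeddings, new_to_orig):
    """Re-key embeddings from partitioned-graph node IDs to original node IDs.

    embeddings[i] is the embedding of the node whose partitioned-graph ID is i;
    new_to_orig[i] is that node's ID in the user's input data.
    """
    if len(embeddings) != len(new_to_orig):
        raise ValueError("need one original ID per embedding row")
    return {orig_id: emb for orig_id, emb in zip(new_to_orig, embeddings)}
```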

[Bug] OOM while inferencing on ogbn-papers100M for link prediction

Attempting inference on ogbn-papers100M for link prediction causes an OOM (out-of-memory) issue, as shown in the attached screenshot. As a result, the system becomes extremely unresponsive. The OOM happens in the evaluation call (val_mrr, test_mrr = self.evaluator.evaluate(None, test_scores, 0)) here. When the evaluation call is omitted, the system can successfully save the node embeddings and relational embeddings and exit the program without any issue.

Screenshot 2023-04-10 at 11 47 20 AM

Experiment setup:

Dataset: ogbn-papers100M partitioned into 3
Instance: g4dn.metal
Command to run Inferencing:

python3 -u  ~/dgl/tools/launch.py \
        --workspace /graph-storm/inference_scripts/lp_infer \
        --num_trainers 1 \
        --num_servers 1 \
        --num_samplers 0 \
        --part_config /data/ogbn-papers100M-3p/ogbn-papers100M.json \
        --ip_config  /data/ip_list_p3_metal.txt \
        --ssh_port 2222 \
        "python3 lp_infer_gnn.py --cf  /data/ogbn_papers100M_infer_p3.yaml  --use-node-embeddings false --num-gpus 4 --part-config /data/ogbn-papers100M-3p/ogbn-papers100M.json  --restore-model-path /data/papers100M-lp-p3-model/epoch-0  --feat-name feat --no-validation false"

Reproduced with the following environment:

  • DGL 1.0.2 + GSF github/gitlab version
  • DGL 1.0.0 + GSF github/gitlab version

Smaller datasets like ogbn-mag work fine on a similar setup.
