
graphstorm's Issues

Support categorical attributes

Some features are categorical values, i.e., the original values are strings. We need to convert them to integers.
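
A minimal sketch of the kind of conversion needed, in plain Python (the helper name and the deterministic sort order are illustrative, not GraphStorm's API):

import numpy as np

def encode_categorical(values):
    # Sort the distinct categories so the mapping is deterministic across runs.
    categories = sorted(set(values))
    cat2id = {cat: idx for idx, cat in enumerate(categories)}
    return np.array([cat2id[v] for v in values], dtype=np.int64), cat2id

ids, mapping = encode_categorical(["red", "blue", "red", "green"])
# ids -> [2, 0, 2, 1]; mapping -> {'blue': 0, 'green': 1, 'red': 2}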

Have a scalable way of saving learnable embeddings

Currently, trainer 0 saves the learnable embeddings to disk. If the learnable embedding table is very large, trainer 0 will run out of memory. We need to distribute the saving of learnable embeddings.
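
A minimal sketch of distributed saving, where each trainer writes only its own shard (the local_rows() accessor and file layout are assumptions for illustration, not GraphStorm's actual interfaces):

import os
import torch
import torch.distributed as dist

def save_sparse_emb_distributed(emb_table, save_dir):
    # Each rank saves only the rows it owns, instead of rank 0
    # gathering the entire table into its memory.
    rank = dist.get_rank()
    os.makedirs(save_dir, exist_ok=True)
    local = emb_table.local_rows()  # hypothetical: this rank's slice
    torch.save(local, os.path.join(save_dir, f"sparse_emb.part{rank}.pt"))
    dist.barrier()  # wait until all shards are on disk before proceeding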

Define minimum requirements in setup.py

Our setup.py currently does not specify minimum versions for its dependencies, even though the library assumes at least the versions listed below (a sketch of the corresponding setup.py fragment follows the list). In particular:

ogb >= 1.3.6,
torch >= 1.12
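
A sketch of the relevant install_requires fragment (the package layout is assumed; the full dependency list should be confirmed against the codebase):

# setup.py (sketch of the relevant fragment)
from setuptools import setup, find_packages

setup(
    name="graphstorm",
    package_dir={"": "python"},       # assumes the code lives under python/
    packages=find_packages("python"),
    install_requires=[
        "ogb>=1.3.6",
        "torch>=1.12",
    ],
)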

[Roadmap] V0.1 Release Plan

The V0.1 release will focus mostly on completing and optimizing the existing functionalities and adding data loading support.

  • [Improvement] Improve the user experience of saving/restoring model artifacts in distributed training.
  • [Improvement] Refining all configuration arguments.
  • [New Feature] Provide a single-machine data loading pipeline.
  • [Doc] Tutorial of graph construction, GNN model training and GNN model inference.
  • [Improvement] Add support for tuning language model with a separate learning rate.
  • [Improvement] Bug fixes.

Any comments and feedback are welcome!

[Doc] Add a tutorial

The tutorial will include:

  1. GraphStorm environment setup.
  2. Preparing graph data.
  3. Training a GNN model using the GraphStorm framework.
  4. Performing GNN inference using the GraphStorm framework.
  5. Collecting model artifacts and prediction results.

Save embeddings based on the original node IDs

When GNN embeddings are saved, they are saved based on the node IDs in the partitioned graph, which are different from the original node IDs of users' input data. We need to define a simple way for users to identify the right GNN embeddings from the original node IDs.
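
One possible shape of the remapping, sketched below: persist a partitioned-ID-to-original-ID tensor during graph construction, then reorder the saved embedding rows with it (the mapping tensor is an assumption, not something GraphStorm currently exposes):

import torch

def remap_to_original_ids(emb, part_nid_to_orig):
    # Row j of `emb` belongs to partitioned node j, whose original ID is
    # part_nid_to_orig[j]. Scatter rows so that row i holds original node i.
    out = torch.empty_like(emb)
    out[part_nid_to_orig] = emb
    return out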

Cannot download OGB Datasets

Unable to download the OGB datasets required for partitioning and running regression tests.
Affected datasets: arxiv, products, papers100M.
OGB MAG can still be downloaded.

[Feature Request] Weighted edge loss.

When doing link prediction training, different edges may have different weights representing their importance. Supporting weighted edges in the link prediction loss is required.

We are going to add a new argument --lp-edge-weight-for-loss to specify the edge weight used in the loss function.
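
A hedged sketch of what a weighted link prediction loss could look like, in binary cross-entropy form (the function and argument names are illustrative; the actual implementation may differ):

import torch
import torch.nn.functional as F

def weighted_lp_loss(pos_scores, neg_scores, pos_edge_weight):
    # Weight each positive edge's loss term by the weight supplied via
    # --lp-edge-weight-for-loss; negative samples keep uniform weight.
    pos_loss = F.binary_cross_entropy_with_logits(
        pos_scores, torch.ones_like(pos_scores), reduction="none")
    neg_loss = F.binary_cross_entropy_with_logits(
        neg_scores, torch.zeros_like(neg_scores), reduction="none")
    return (pos_loss * pos_edge_weight).mean() + neg_loss.mean()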

Related #214

[Bug] OOM while inferencing on ogbn-papers100M for link prediction

Attempting inference on ogbn-papers100M for link prediction causes an OOM (out-of-memory) issue, as shown in the attached screenshot, and the system becomes extremely unresponsive. The OOM happens in the evaluation function (val_mrr, test_mrr = self.evaluator.evaluate(None, test_scores, 0)). The system can successfully save the node embeddings and relation embeddings and exit the program without any issue when the evaluation function is omitted.

[Screenshot: Screenshot 2023-04-10 at 11 47 20 AM]

Experiment setup:

Dataset: ogbn-papers100M, partitioned into 3 parts
Instance: g4dn.metal
Command to run inference:

python3 -u  ~/dgl/tools/launch.py \
        --workspace /graph-storm/inference_scripts/lp_infer \
        --num_trainers 1 \
        --num_servers 1 \
        --num_samplers 0 \
        --part_config /data/ogbn-papers100M-3p/ogbn-papers100M.json \
        --ip_config  /data/ip_list_p3_metal.txt \
        --ssh_port 2222 \
        "python3 lp_infer_gnn.py --cf  /data/ogbn_papers100M_infer_p3.yaml  --use-node-embeddings false --num-gpus 4 --part-config /data/ogbn-papers100M-3p/ogbn-papers100M.json  --restore-model-path /data/papers100M-lp-p3-model/epoch-0  --feat-name feat --no-validation false"

Reproduced with the following environment:

  • DGL 1.0.2 + GSF github/gitlab version
  • DGL 1.0.0 + GSF github/gitlab version

Smaller datasets like ogbn-mag work fine on a similar setup.
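
One possible mitigation, sketched below, is to compute MRR edge by edge (or chunk by chunk) so the full score matrix is never materialized at once; the generator interface is an assumption, not the evaluator's actual API:

def streaming_mrr(score_pairs):
    # `score_pairs` yields (pos_score, neg_scores) per test edge, e.g. on
    # CPU, so peak memory is bounded by one edge's negative scores.
    rr_sum, n = 0.0, 0
    for pos, neg in score_pairs:
        rank = 1 + (neg >= pos).sum().item()  # rank among the negatives
        rr_sum += 1.0 / rank
        n += 1
    return rr_sum / max(n, 1)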

[Config] The configurations/arguments below are not intuitive or consistent - batch 1

These configuration/argument names are neither intuitive nor consistent. We recommend the following changes.

  • feat_name: change to “node_feat_name”
    • Use “edge_feat_name” later when edge features are supported.
  • n_layers: change to "num_layers"
  • n_epochs: change to "num_epochs"
  • sparse_lr: change to “sparse_optimizer_lr”
  • save_predict_path: change to "save_prediction_path"
  • negative_sampler: change to "train_negative_sampler"
  • test_negative_sampler: change to "eval_negative_sampler"
  • enable_early_stop: change to "use_early_stop"
  • lm_infer_batchsize: change to "lm_infer_batch_size"
  • evaluation_frequency: change to "eval_frequency"

GraphStorm needs to check the input data and model config.

Traceback (most recent call last):
  File "gsgnn_ep.py", line 141, in <module>
    main(args)
  File "gsgnn_ep.py", line 111, in main
    trainer.fit(train_loader=dataloader, val_loader=val_dataloader,
  File "/graph-storm/python/graphstorm/trainer/ep_trainer.py", line 143, in fit
    loss = model(blocks, batch_graph, input_feats, None, lbl, input_nodes)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/graph-storm/python/graphstorm/model/edge_gnn.py", line 119, in forward
    pred_loss = self.loss_func(logits, labels[target_etype])
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/graph-storm/python/graphstorm/model/loss_func.py", line 51, in forward
    return self.loss_fn(logits, labels)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 3026, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: "nll_loss_forward_reduce_cuda_kernel_2d_index" not implemented for 'Int'

[Bug] Cannot save sparse embeddings in a distributed setting

Setting:

  • Two nodes, g4dn.12xlarge.
  • The launch node creates a folder, e.g., "sharedfolder", and uses it as an NFS shared folder.
  • On the execution node, create an NFS client and mount the "sharedfolder".
  • Copy the graphstorm code and data into the sharedfolder.

How to reproduce:

  1. Set the "save-model-path", e.g., "/sharedfolder/model"
  2. Run an LP task with two partitions.
  3. Optionally set "evaluation_frequency" and "save_model_per_iters" to a smaller value, e.g., 10, to speed up reproduction.

Problem:

  1. After evaluation, the model is saved;
  2. the launch node can save sparse embeddings successfully, but
  3. the execution node can NOT save the sparse embeddings.

Root cause:

  1. The launch node's rank 0 process creates "/sharedfolder/model" when it starts to save model artifacts;
  2. the mode of "/sharedfolder/model" is 755 (rwxr-xr-x);
  3. the launch node's rank 0 process creates folders for each save-model-per-iters checkpoint, e.g., "/sharedfolder/model/epoch-0-iter-19" and "/sharedfolder/model/epoch-0-iter-19/item/", whose modes are 755 too;
  4. the launch node's rank 0 process saves "model.bin" and "optimizer.bin" to the "/sharedfolder/model/epoch-0-iter-19" folder;
  5. the launch node's processes can save sparse embeddings to the "/sharedfolder/model/epoch-0-iter-19/item" folder;
  6. BUT the execution node can NOT save sparse embeddings to the "/sharedfolder/model/epoch-0-iter-19/item" folder, because mode 755 does not allow processes on the execution node to write there.
This is caused by the newly added distributed saving of sparse embeddings.
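
A hedged sketch of one possible fix: create the checkpoint folders with a mode that lets remote trainer processes write into them (whether 777 is acceptable, versus aligning user/group IDs across nodes, is a deployment decision):

import os

def make_shared_ckpt_dir(path, mode=0o777):
    # os.makedirs applies the process umask, so chmod explicitly afterwards
    # to guarantee processes on other nodes can write their shards.
    os.makedirs(path, exist_ok=True)
    os.chmod(path, mode)

make_shared_ckpt_dir("/sharedfolder/model/epoch-0-iter-19/item")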

Support in-library launch scripts for built-in model training and inference.

Motivation

To run a GraphStorm training task, a user needs to take the following steps:

  1. Pip install GraphStorm library
  2. git clone https://github.com/dmlc/dgl.git
  3. git clone https://github.com/awslabs/graphstorm.git
  4. Use dgl launch.py script to launch a task.

These steps are too complex, as the user needs to download multiple codebases just to get the entrypoint scripts.

Proposal

We will integrate launch.py, as well as the other entrypoint scripts provided in training_scripts/inference_scripts, into the GraphStorm library. Then the UX will be:

  1. Pip install GraphStorm library
  2. Run a training task using built-in training script: python3 -m graphstorm.run.gsf_node_classification xxx or python3 -m graphstorm.run.gsf_link_prediction xxx

For a user-defined training script: python3 -m graphstorm.run.launch xxx <PATH_TO_SCRIPT> xxx .

[Config] argument "use_dot_product" is not intuitive about its function

In the configuration, the argument "use_dot_product" is not intuitive about its function, and it is easy to hit an assertion error when setting it.

So

  • it is better to change it to a more intuitive argument name, and
  • it is better to determine its value automatically based on other given arguments or graph information.

[Bug] new launch API cannot exit processes on other instances after a break on the launch instance

Issue description:

In the new launch API, when using Ctrl+C to exit the training/inference processes on the launch instance, the GraphStorm processes on the other instances DO NOT exit! The launch script does not kill the other instances' processes.

Log trace:

Step 40 | Validation mrr: 0.0590
Step 40 | Test mrr: 0.0588
Step 40 | Best Validation mrr: 0.0590
Step 40 | Best Test mrr: 0.0588
Step 40 | Best Iteration mrr: 40.0000
Eval time: 61.0621, Evaluation step: 40.0000
evaluate validation/test: elapsed time: 17.759, mem (curr: 14.181, peak: 14.181, shared: 8.483, global curr: 25.959, global shared: 12.100) GB
successfully save the model to /data/ak/models-test/epoch-0-iter-39
Time on save model 0.0036745071411132812
Epoch 00000 | Batch 040 | Train Loss: 0.3333 | Time: 9.5225
^CProcess Process-1:
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.11.170 'cd /data/ak; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=16 DGL_CONF_PATH=/data/ak/ak_lp_4p/ak.json DGL_IP_CONFIG=/data/ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo PYTHONPATH=/graphstorm/python DGL_SERVER_ID=3; /usr/bin/python3 /graphstorm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf ak_lp.yaml --evaluation-frequency 20 --save-model-per-iters 20 --save-model-path /data/ak/models-test --save-embed-path /data/ak/embed-test --ip-config /data/ip_list.txt --part-config /data/ak/ak_lp_4p/ak.json --verbose False)'' died with <Signals.SIGINT: 2>.
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/graphstorm/python/graphstorm/run/launch.py", line 58, in cleanup_proc
remote_pids = get_all_remote_pids_func()
File "/graphstorm/python/graphstorm/run/launch.py", line 279, in get_all_remote_pids
cmds = udf_command.split()
AttributeError: 'list' object has no attribute 'split'
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.2.213 'cd /data/ak; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=16 DGL_CONF_PATH=/data/ak/ak_lp_4p/ak.json DGL_IP_CONFIG=/data/ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo OMP_NUM_THREADS=6 DGL_GROUP_ID=0 PYTHONPATH=/graphstorm/python ; /usr/bin/python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=4 --node_rank=1 --master_addr=172.31.11.145 --master_port=1234 /graphstorm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf ak_lp.yaml --evaluation-frequency 20 --save-model-per-iters 20 --save-model-path /data/ak/models-test --save-embed-path /data/ak/embed-test --ip-config /data/ip_list.txt --part-config /data/ak/ak_lp_4p/ak.json --verbose False)'' died with <Signals.SIGINT: 2>.
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.11.170 'cd /data/ak; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=16 DGL_CONF_PATH=/data/ak/ak_lp_4p/ak.json DGL_IP_CONFIG=/data/ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo OMP_NUM_THREADS=6 DGL_GROUP_ID=0 PYTHONPATH=/graphstorm/python ; /usr/bin/python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=4 --node_rank=3 --master_addr=172.31.11.145 --master_port=1234 /graphstorm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf ak_lp.yaml --evaluation-frequency 20 --save-model-per-iters 20 --save-model-path /data/ak/models-test --save-embed-path /data/ak/embed-test --ip-config /data/ip_list.txt --part-config /data/ak/ak_lp_4p/ak.json --verbose False)'' died with <Signals.SIGINT: 2>.
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.14.145 'cd /data/ak; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=16 DGL_CONF_PATH=/data/ak/ak_lp_4p/ak.json DGL_IP_CONFIG=/data/ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo OMP_NUM_THREADS=6 DGL_GROUP_ID=0 PYTHONPATH=/graphstorm/python ; /usr/bin/python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=4 --node_rank=2 --master_addr=172.31.11.145 --master_port=1234 /graphstorm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf ak_lp.yaml --evaluation-frequency 20 --save-model-per-iters 20 --save-model-path /data/ak/models-test --save-embed-path /data/ak/embed-test --ip-config /data/ip_list.txt --part-config /data/ak/ak_lp_4p/ak.json --verbose False)'' died with <Signals.SIGINT: 2>.
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.11.145 'cd /data/ak; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=16 DGL_CONF_PATH=/data/ak/ak_lp_4p/ak.json DGL_IP_CONFIG=/data/ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo PYTHONPATH=/graphstorm/python DGL_SERVER_ID=0; /usr/bin/python3 /graphstorm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf ak_lp.yaml --evaluation-frequency 20 --save-model-per-iters 20 --save-model-path /data/ak/models-test --save-embed-path /data/ak/embed-test --ip-config /data/ip_list.txt --part-config /data/ak/ak_lp_4p/ak.json --verbose False)'' died with <Signals.SIGINT: 2>.
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.11.145 'cd /data/ak; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=16 DGL_CONF_PATH=/data/ak/ak_lp_4p/ak.json DGL_IP_CONFIG=/data/ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo OMP_NUM_THREADS=6 DGL_GROUP_ID=0 PYTHONPATH=/graphstorm/python ; /usr/bin/python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=4 --node_rank=0 --master_addr=172.31.11.145 --master_port=1234 /graphstorm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf ak_lp.yaml --evaluation-frequency 20 --save-model-per-iters 20 --save-model-path /data/ak/models-test --save-embed-path /data/ak/embed-test --ip-config /data/ip_list.txt --part-config /data/ak/ak_lp_4p/ak.json --verbose False)'' died with <Signals.SIGINT: 2>.
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.14.145 'cd /data/ak; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=16 DGL_CONF_PATH=/data/ak/ak_lp_4p/ak.json DGL_IP_CONFIG=/data/ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo PYTHONPATH=/graphstorm/python DGL_SERVER_ID=2; /usr/bin/python3 /graphstorm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf ak_lp.yaml --evaluation-frequency 20 --save-model-per-iters 20 --save-model-path /data/ak/models-test --save-embed-path /data/ak/embed-test --ip-config /data/ip_list.txt --part-config /data/ak/ak_lp_4p/ak.json --verbose False)'' died with <Signals.SIGINT: 2>.
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.2.213 'cd /data/ak; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=16 DGL_CONF_PATH=/data/ak/ak_lp_4p/ak.json DGL_IP_CONFIG=/data/ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo PYTHONPATH=/graphstorm/python DGL_SERVER_ID=1; /usr/bin/python3 /graphstorm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf ak_lp.yaml --evaluation-frequency 20 --save-model-per-iters 20 --save-model-path /data/ak/models-test --save-embed-path /data/ak/embed-test --ip-config /data/ip_list.txt --part-config /data/ak/ak_lp_4p/ak.json --verbose False)'' died with <Signals.SIGINT: 2>.
^C

Verify the correctness of GraphStorm input config.

  • Don't require users to provide a graph name.
  • Verify the node/edge feature names and label names.
  • If infer_data.test_idxs is empty, return an error in the inference script (see the sketch after this list).
  • We may want to warn users when they run mini-batch inference (mini-batch inference on a large graph can take a very long time).
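
A minimal sketch of the empty-test-set check (the infer_data/test_idxs layout follows the issue's naming; the exact structure is assumed, not taken from GraphStorm's code):

def check_test_set(infer_data):
    # Fail fast in the inference script instead of crashing mid-run.
    if infer_data.test_idxs is None or len(infer_data.test_idxs) == 0:
        raise ValueError("infer_data.test_idxs is empty; "
                         "inference requires a non-empty test set.")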

[Config] Argument name "num_gpus" default behavior is unknown

In the configuration file, the "num_gpus" argument is confusing: what is its default behavior?

Since this argument only gives a number of GPUs and cannot know whether some GPUs are shared with other users, its default behavior should be to use ALL visible GPUs.

Please check the current code to see whether this is already the default behavior; if not, please implement it.
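
A minimal sketch of the suggested default, assuming an argparse-style entrypoint (the flag spelling follows the issue; this is not GraphStorm's actual parser code):

import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--num-gpus", type=int, default=None,
                    help="GPUs per machine; default: all visible GPUs.")
args = parser.parse_args([])  # [] so the sketch runs without CLI arguments

# Fall back to every GPU visible to this process when the flag is omitted.
num_gpus = args.num_gpus if args.num_gpus is not None else torch.cuda.device_count()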

[Feature Request] Add SageMaker support.

The Amazon SageMaker support should include:

  • Support launching data processing task using Amazon SageMaker.
  • Support launching distributed training task using Amazon SageMaker.
  • Support launching distributed inference task using Amazon SageMaker.
  • Tutorial.

[Config] argument "eval_batch_size" may slow down the speed of evaluation

In the configuration, the "eval_batch_size" argument may slow down evaluation because users normally set it to 1k-4k, the same range used for "train_batch_size". In the link prediction task there is an additional test-score computation, where a small batch size also slows things down.

So,

  • set its default value to a larger one, e.g., 10k or more
  • add another argument, e.g., "lp_score_batch_size", to set the batch size of the link prediction score computation.
