
graphstorm's Issues

Support categorical attributes

Some features are categorical values, i.e., the original values are strings. We need to convert them to integers.
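
A minimal sketch of the kind of conversion needed, in plain Python (the helper name and the deterministic sort order are illustrative, not GraphStorm's API):

import numpy as np

def encode_categorical(values):
    # Sort the distinct categories so the mapping is deterministic across runs.
    categories = sorted(set(values))
    cat2id = {cat: idx for idx, cat in enumerate(categories)}
    return np.array([cat2id[v] for v in values], dtype=np.int64), cat2id

ids, mapping = encode_categorical(["red", "blue", "red", "green"])
# ids -> [2, 0, 2, 1]; mapping -> {'blue': 0, 'green': 1, 'red': 2}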

Have a scalable way of saving learnable embeddings

Currently, trainer 0 saves the learnable embeddings to disk. If the learnable embedding table is very large, trainer 0 will run out of memory. We need to distribute the saving of learnable embeddings.
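
A minimal sketch of distributed saving, where each trainer writes only its own shard (the local_rows() accessor and file layout are assumptions for illustration, not GraphStorm's actual interfaces):

import os
import torch
import torch.distributed as dist

def save_sparse_emb_distributed(emb_table, save_dir):
    # Each rank saves only the rows it owns, instead of rank 0
    # gathering the entire table into its memory.
    rank = dist.get_rank()
    os.makedirs(save_dir, exist_ok=True)
    local = emb_table.local_rows()  # hypothetical: this rank's slice
    torch.save(local, os.path.join(save_dir, f"sparse_emb.part{rank}.pt"))
    dist.barrier()  # wait until all shards are on disk before proceeding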

Define minimum requirements in setup.py

Our setup.py currently does not specify minimum versions for its dependencies, even though the library assumes at least the versions listed below (a sketch of the corresponding setup.py fragment follows the list). In particular:

ogb >= 1.3.6,
torch >= 1.12
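
A sketch of the relevant install_requires fragment (the package layout is assumed; the full dependency list should be confirmed against the codebase):

# setup.py (sketch of the relevant fragment)
from setuptools import setup, find_packages

setup(
    name="graphstorm",
    package_dir={"": "python"},       # assumes the code lives under python/
    packages=find_packages("python"),
    install_requires=[
        "ogb>=1.3.6",
        "torch>=1.12",
    ],
)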

[Roadmap] V0.1 Release Plan

The V0.1 release will focus mostly on completing and optimizing the existing functionalities and adding data loading support.

  • [Improvement] Improve the user experience of saving/restoring model artifacts in distributed training.
  • [Improvement] Refining all configuration arguments.
  • [New Feature] Provide a single-machine data loading pipeline.
  • [Doc] Tutorial of graph construction, GNN model training and GNN model inference.
  • [Improvement] Add support for tuning language model with a separate learning rate.
  • [Improvement] Bug fixes.

Any comments and feedback are welcome!

[Doc] Add a tutorial

The tutorial will include:

  1. GraphStorm environment setup.
  2. Preparing graph data.
  3. Training a GNN model using the GraphStorm framework.
  4. Performing GNN inference using the GraphStorm framework.
  5. Collecting model artifacts and prediction results.

Save embeddings based on the original node IDs

When GNN embeddings are saved, they are saved based on the node IDs in the partitioned graph, which are different from the original node IDs of users' input data. We need to define a simple way for users to identify the right GNN embeddings from the original node IDs.
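
One possible shape of the remapping, sketched below: persist a partitioned-ID-to-original-ID tensor during graph construction, then reorder the saved embedding rows with it (the mapping tensor is an assumption, not something GraphStorm currently exposes):

import torch

def remap_to_original_ids(emb, part_nid_to_orig):
    # Row j of `emb` belongs to partitioned node j, whose original ID is
    # part_nid_to_orig[j]. Scatter rows so that row i holds original node i.
    out = torch.empty_like(emb)
    out[part_nid_to_orig] = emb
    return out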

Cannot download OGB Datasets

Unable to download the OGB datasets required for partitioning and running regression tests.
Affected datasets: arxiv, products, papers100M.
OGB MAG can still be downloaded.

[Feature Request] Weighted edge loss.

When doing link prediction training, different edges may have different weights representing their importance. Supporting weighted edges in the link prediction loss is required.

We are going to add a new argument --lp-edge-weight-for-loss to specify the edge weight used in the loss function.
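
A hedged sketch of what a weighted link prediction loss could look like, in binary cross-entropy form (the function and argument names are illustrative; the actual implementation may differ):

import torch
import torch.nn.functional as F

def weighted_lp_loss(pos_scores, neg_scores, pos_edge_weight):
    # Weight each positive edge's loss term by the weight supplied via
    # --lp-edge-weight-for-loss; negative samples keep uniform weight.
    pos_loss = F.binary_cross_entropy_with_logits(
        pos_scores, torch.ones_like(pos_scores), reduction="none")
    neg_loss = F.binary_cross_entropy_with_logits(
        neg_scores, torch.zeros_like(neg_scores), reduction="none")
    return (pos_loss * pos_edge_weight).mean() + neg_loss.mean()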

Related #214

[Bug] OOM while inferencing on ogbn-papers100M for link prediction

Attempting inference on ogbn-papers100M for link prediction causes an OOM (out-of-memory) issue, as shown in the attached screenshot, and the system becomes extremely unresponsive. The OOM happens in the evaluation function (val_mrr, test_mrr = self.evaluator.evaluate(None, test_scores, 0)). The system can successfully save the node embeddings and relation embeddings and exit the program without any issue when the evaluation function is omitted.

[Screenshot: Screenshot 2023-04-10 at 11 47 20 AM]

Experiment setup:

Dataset: ogbn-papers100M, partitioned into 3 parts
Instance: g4dn.metal
Command to run inference:

python3 -u  ~/dgl/tools/launch.py \
        --workspace /graph-storm/inference_scripts/lp_infer \
        --num_trainers 1 \
        --num_servers 1 \
        --num_samplers 0 \
        --part_config /data/ogbn-papers100M-3p/ogbn-papers100M.json \
        --ip_config  /data/ip_list_p3_metal.txt \
        --ssh_port 2222 \
        "python3 lp_infer_gnn.py --cf  /data/ogbn_papers100M_infer_p3.yaml  --use-node-embeddings false --num-gpus 4 --part-config /data/ogbn-papers100M-3p/ogbn-papers100M.json  --restore-model-path /data/papers100M-lp-p3-model/epoch-0  --feat-name feat --no-validation false"

Reproduced with the following environment:

  • DGL 1.0.2 + GSF github/gitlab version
  • DGL 1.0.0 + GSF github/gitlab version

Smaller datasets like ogbn-mag work fine on a similar setup.
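
One possible mitigation, sketched below, is to compute MRR edge by edge (or chunk by chunk) so the full score matrix is never materialized at once; the generator interface is an assumption, not the evaluator's actual API:

def streaming_mrr(score_pairs):
    # `score_pairs` yields (pos_score, neg_scores) per test edge, e.g. on
    # CPU, so peak memory is bounded by one edge's negative scores.
    rr_sum, n = 0.0, 0
    for pos, neg in score_pairs:
        rank = 1 + (neg >= pos).sum().item()  # rank among the negatives
        rr_sum += 1.0 / rank
        n += 1
    return rr_sum / max(n, 1)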

[Config] The configurations/arguments below are not intuitive or consistent - batch 1

These configuration/argument names are neither intuitive nor consistent. We recommend the following changes.

  • feat_name: change to “node_feat_name”
    • Use “edge_feat_name” later when edge features are supported.
  • n_layers: change to "num_layers"
  • n_epochs: change to "num_epochs"
  • sparse_lr: change to “sparse_optimizer_lr”
  • save_predict_path: change to "save_prediction_path"
  • negative_sampler: change to "train_negative_sampler"
  • test_negative_sampler: change to "eval_negative_sampler"
  • enable_early_stop: change to "use_early_stop"
  • lm_infer_batchsize: change to "lm_infer_batch_size"
  • evaluation_frequency: change to "eval_frequency"

GraphStorm needs to check the input data and model config.

Traceback (most recent call last):
  File "gsgnn_ep.py", line 141, in <module>
    main(args)
  File "gsgnn_ep.py", line 111, in main
    trainer.fit(train_loader=dataloader, val_loader=val_dataloader,
  File "/graph-storm/python/graphstorm/trainer/ep_trainer.py", line 143, in fit
    loss = model(blocks, batch_graph, input_feats, None, lbl, input_nodes)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/graph-storm/python/graphstorm/model/edge_gnn.py", line 119, in forward
    pred_loss = self.loss_func(logits, labels[target_etype])
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/graph-storm/python/graphstorm/model/loss_func.py", line 51, in forward
    return self.loss_fn(logits, labels)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 3026, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: "nll_loss_forward_reduce_cuda_kernel_2d_index" not implemented for 'Int'

[Bug] Cannot save sparse embeddings in a distributed setting

Setting:

  • Two nodes, g4dn.12xlarge.
  • The launch node creates a folder, e.g., "sharedfolder", and uses it as an NFS shared folder.
  • On the execution node, create an NFS client and mount the "sharedfolder".
  • Copy the graphstorm code and data into the sharedfolder.

How to reproduce:

  1. Set the "save-model-path", e.g., "/sharedfolder/model"
  2. Run an LP task with two partitions.
  3. Optionally set "evaluation_frequency" and "save_model_per_iters" to a smaller value, e.g., 10, to speed up reproduction.

Problem:

  1. After evaluation, the model is saved;
  2. the launch node can save sparse embeddings successfully, but
  3. the execution node can NOT save the sparse embeddings.

Root cause:

  1. The launch node's rank 0 process creates "/sharedfolder/model" when it starts to save model artifacts;
  2. the mode of "/sharedfolder/model" is 755 (rwxr-xr-x);
  3. the launch node's rank 0 process creates folders for each save-model-per-iters checkpoint, e.g., "/sharedfolder/model/epoch-0-iter-19" and "/sharedfolder/model/epoch-0-iter-19/item/", whose modes are 755 too;
  4. the launch node's rank 0 process saves "model.bin" and "optimizer.bin" to the "/sharedfolder/model/epoch-0-iter-19" folder;
  5. the launch node's processes can save sparse embeddings to the "/sharedfolder/model/epoch-0-iter-19/item" folder;
  6. BUT the execution node can NOT save sparse embeddings to the "/sharedfolder/model/epoch-0-iter-19/item" folder, because mode 755 does not allow processes on the execution node to write there.
This is caused by the newly added distributed saving of sparse embeddings.
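
A hedged sketch of one possible fix: create the checkpoint folders with a mode that lets remote trainer processes write into them (whether 777 is acceptable, versus aligning user/group IDs across nodes, is a deployment decision):

import os

def make_shared_ckpt_dir(path, mode=0o777):
    # os.makedirs applies the process umask, so chmod explicitly afterwards
    # to guarantee processes on other nodes can write their shards.
    os.makedirs(path, exist_ok=True)
    os.chmod(path, mode)

make_shared_ckpt_dir("/sharedfolder/model/epoch-0-iter-19/item")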

Support in-library launch scripts for built-in model training and inference.

Motivation

To run a GraphStorm training task, a user needs to take the following steps:

  1. Pip install GraphStorm library
  2. git clone https://github.com/dmlc/dgl.git
  3. git clone https://github.com/awslabs/graphstorm.git
  4. Use dgl launch.py script to launch a task.

These steps are too complex, as the user needs to download multiple codebases just to get the entrypoint scripts.

Proposal

We will integrate launch.py, as well as the other entrypoint scripts provided in training_scripts/inference_scripts, into the GraphStorm library. Then the UX will be:

  1. Pip install GraphStorm library
  2. Run a training task using built-in training script: python3 -m graphstorm.run.gsf_node_classification xxx or python3 -m graphstorm.run.gsf_link_prediction xxx

For a user-defined training script: python3 -m graphstorm.run.launch xxx <PATH_TO_SCRIPT> xxx .

[Config] argument "use_dot_product" is not intuitive about its function

In the configuration, the argument "use_dot_product" is not intuitive about its function, and it is easy to hit an assertion error when setting it.

So

  • it is better to change it to a more intuitive argument name, and
  • it is better to determine its value automatically based on other given arguments or graph information.

[Bug] new launch API cannot exit processes on other instances after a break on the launch instance

Issue description:

In the new launch API, when using Ctrl+C to exit the training/inference processes on the launch instance, the GraphStorm processes on the other instances DO NOT exit! The launch script does not kill the other instances' processes.

Log trace:

Step 40 | Validation mrr: 0.0590
Step 40 | Test mrr: 0.0588
Step 40 | Best Validation mrr: 0.0590
Step 40 | Best Test mrr: 0.0588
Step 40 | Best Iteration mrr: 40.0000
Eval time: 61.0621, Evaluation step: 40.0000
evaluate validation/test: elapsed time: 17.759, mem (curr: 14.181, peak: 14.181, shared: 8.483, global curr: 25.959, global shared: 12.100) GB
successfully save the model to /data/ak/models-test/epoch-0-iter-39
Time on save model 0.0036745071411132812
Epoch 00000 | Batch 040 | Train Loss: 0.3333 | Time: 9.5225
^CProcess Process-1:
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.11.170 'cd /data/ak; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=16 DGL_CONF_PATH=/data/ak/ak_lp_4p/ak.json DGL_IP_CONFIG=/data/ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo PYTHONPATH=/graphstorm/python DGL_SERVER_ID=3; /usr/bin/python3 /graphstorm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf ak_lp.yaml --evaluation-frequency 20 --save-model-per-iters 20 --save-model-path /data/ak/models-test --save-embed-path /data/ak/embed-test --ip-config /data/ip_list.txt --part-config /data/ak/ak_lp_4p/ak.json --verbose False)'' died with <Signals.SIGINT: 2>.
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/graphstorm/python/graphstorm/run/launch.py", line 58, in cleanup_proc
remote_pids = get_all_remote_pids_func()
File "/graphstorm/python/graphstorm/run/launch.py", line 279, in get_all_remote_pids
cmds = udf_command.split()
AttributeError: 'list' object has no attribute 'split'
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.2.213 'cd /data/ak; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=16 DGL_CONF_PATH=/data/ak/ak_lp_4p/ak.json DGL_IP_CONFIG=/data/ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo OMP_NUM_THREADS=6 DGL_GROUP_ID=0 PYTHONPATH=/graphstorm/python ; /usr/bin/python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=4 --node_rank=1 --master_addr=172.31.11.145 --master_port=1234 /graphstorm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf ak_lp.yaml --evaluation-frequency 20 --save-model-per-iters 20 --save-model-path /data/ak/models-test --save-embed-path /data/ak/embed-test --ip-config /data/ip_list.txt --part-config /data/ak/ak_lp_4p/ak.json --verbose False)'' died with <Signals.SIGINT: 2>.
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.11.170 'cd /data/ak; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=16 DGL_CONF_PATH=/data/ak/ak_lp_4p/ak.json DGL_IP_CONFIG=/data/ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo OMP_NUM_THREADS=6 DGL_GROUP_ID=0 PYTHONPATH=/graphstorm/python ; /usr/bin/python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=4 --node_rank=3 --master_addr=172.31.11.145 --master_port=1234 /graphstorm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf ak_lp.yaml --evaluation-frequency 20 --save-model-per-iters 20 --save-model-path /data/ak/models-test --save-embed-path /data/ak/embed-test --ip-config /data/ip_list.txt --part-config /data/ak/ak_lp_4p/ak.json --verbose False)'' died with <Signals.SIGINT: 2>.
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.14.145 'cd /data/ak; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=16 DGL_CONF_PATH=/data/ak/ak_lp_4p/ak.json DGL_IP_CONFIG=/data/ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo OMP_NUM_THREADS=6 DGL_GROUP_ID=0 PYTHONPATH=/graphstorm/python ; /usr/bin/python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=4 --node_rank=2 --master_addr=172.31.11.145 --master_port=1234 /graphstorm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf ak_lp.yaml --evaluation-frequency 20 --save-model-per-iters 20 --save-model-path /data/ak/models-test --save-embed-path /data/ak/embed-test --ip-config /data/ip_list.txt --part-config /data/ak/ak_lp_4p/ak.json --verbose False)'' died with <Signals.SIGINT: 2>.
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.11.145 'cd /data/ak; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=16 DGL_CONF_PATH=/data/ak/ak_lp_4p/ak.json DGL_IP_CONFIG=/data/ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo PYTHONPATH=/graphstorm/python DGL_SERVER_ID=0; /usr/bin/python3 /graphstorm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf ak_lp.yaml --evaluation-frequency 20 --save-model-per-iters 20 --save-model-path /data/ak/models-test --save-embed-path /data/ak/embed-test --ip-config /data/ip_list.txt --part-config /data/ak/ak_lp_4p/ak.json --verbose False)'' died with <Signals.SIGINT: 2>.
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.11.145 'cd /data/ak; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=16 DGL_CONF_PATH=/data/ak/ak_lp_4p/ak.json DGL_IP_CONFIG=/data/ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo OMP_NUM_THREADS=6 DGL_GROUP_ID=0 PYTHONPATH=/graphstorm/python ; /usr/bin/python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=4 --node_rank=0 --master_addr=172.31.11.145 --master_port=1234 /graphstorm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf ak_lp.yaml --evaluation-frequency 20 --save-model-per-iters 20 --save-model-path /data/ak/models-test --save-embed-path /data/ak/embed-test --ip-config /data/ip_list.txt --part-config /data/ak/ak_lp_4p/ak.json --verbose False)'' died with <Signals.SIGINT: 2>.
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.14.145 'cd /data/ak; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=16 DGL_CONF_PATH=/data/ak/ak_lp_4p/ak.json DGL_IP_CONFIG=/data/ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo PYTHONPATH=/graphstorm/python DGL_SERVER_ID=2; /usr/bin/python3 /graphstorm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf ak_lp.yaml --evaluation-frequency 20 --save-model-per-iters 20 --save-model-path /data/ak/models-test --save-embed-path /data/ak/embed-test --ip-config /data/ip_list.txt --part-config /data/ak/ak_lp_4p/ak.json --verbose False)'' died with <Signals.SIGINT: 2>.
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 172.31.2.213 'cd /data/ak; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=16 DGL_CONF_PATH=/data/ak/ak_lp_4p/ak.json DGL_IP_CONFIG=/data/ip_list.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo PYTHONPATH=/graphstorm/python DGL_SERVER_ID=1; /usr/bin/python3 /graphstorm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf ak_lp.yaml --evaluation-frequency 20 --save-model-per-iters 20 --save-model-path /data/ak/models-test --save-embed-path /data/ak/embed-test --ip-config /data/ip_list.txt --part-config /data/ak/ak_lp_4p/ak.json --verbose False)'' died with <Signals.SIGINT: 2>.
^C

Verify the correctness of GraphStorm input config.

  • Don't require users to provide a graph name.
  • Verify the node/edge feature names and label names.
  • If infer_data.test_idxs is empty, return an error in the inference script (see the sketch after this list).
  • We may want to warn users when they run mini-batch inference (mini-batch inference on a large graph can take a very long time).
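
A minimal sketch of the empty-test-set check (the infer_data/test_idxs layout follows the issue's naming; the exact structure is assumed, not taken from GraphStorm's code):

def check_test_set(infer_data):
    # Fail fast in the inference script instead of crashing mid-run.
    if infer_data.test_idxs is None or len(infer_data.test_idxs) == 0:
        raise ValueError("infer_data.test_idxs is empty; "
                         "inference requires a non-empty test set.")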

[Config] Argument name "num_gpus" default behavior is unknown

In the configuration file, the "num_gpus" argument is confusing: what is its default behavior?

Since this argument only gives a number of GPUs and cannot know whether some GPUs are shared with other users, its default behavior should be to use ALL visible GPUs.

Please check the current code to see whether this is already the default behavior; if not, please implement it.
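
A minimal sketch of the suggested default, assuming an argparse-style entrypoint (the flag spelling follows the issue; this is not GraphStorm's actual parser code):

import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--num-gpus", type=int, default=None,
                    help="GPUs per machine; default: all visible GPUs.")
args = parser.parse_args([])  # [] so the sketch runs without CLI arguments

# Fall back to every GPU visible to this process when the flag is omitted.
num_gpus = args.num_gpus if args.num_gpus is not None else torch.cuda.device_count()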

[Feature Request] Add SageMaker support.

The Amazon SageMaker support should include:

  • Support launching data processing task using Amazon SageMaker.
  • Support launching distributed training task using Amazon SageMaker.
  • Support launching distributed inference task using Amazon SageMaker.
  • Tutorial.

[Config] argument "eval_batch_size" may slow down the speed of evaluation

In the configuration, the "eval_batch_size" argument may slow down evaluation because users normally set it to 1k-4k, the same range used for "train_batch_size". In the link prediction task there is an additional test-score computation, where a small batch size also slows things down.

So,

  • set its default value to a larger one, e.g., 10k or more
  • add another argument, e.g., "lp_score_batch_size", to set the batch size of the link prediction score computation.
