nccl-rdma-sharp-plugins's Introduction

nccl-rdma-sharp-plugins

The nccl-rdma-sharp plugin enables RDMA and switch-based collectives (SHARP) with NVIDIA's NCCL library.

Overview

Requirements

  • MOFED
  • CUDA
  • SHARP
  • NCCL
  • GPUDirectRDMA plugin

Build Instructions

Build system requirements

  • CUDA
  • SHARP
  • MOFED

The plugin uses GNU Autotools for its build system. You can build it as follows:

$ ./autogen.sh
$ ./configure
$ make
$ make install

The following flags let you build against dependencies installed in non-standard locations:

  --with-verbs=PATH       Path to non-standard libibverbs installation
  --with-sharp=PATH       Path to non-standard SHARP installation
  --with-cuda=PATH        Path to non-standard CUDA installation
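
As a rough sketch of how these flags might be combined, the snippet below assembles a configure command from three install prefixes. The default paths and the variable names (`CUDA_HOME`, `SHARP_HOME`, `VERBS_HOME`, `CONFIGURE_ARGS`) are assumptions for illustration only; substitute the locations used on your own system. The snippet only prints the command so that it can be reviewed before it is run from the source tree.

```shell
# Hypothetical install prefixes; adjust them to match your system.
CUDA_HOME="${CUDA_HOME:-/usr/local/cuda}"
SHARP_HOME="${SHARP_HOME:-/opt/mellanox/sharp}"
VERBS_HOME="${VERBS_HOME:-/usr}"

# Combine the three flags listed above into one argument string.
CONFIGURE_ARGS="--with-verbs=${VERBS_HOME} --with-sharp=${SHARP_HOME} --with-cuda=${CUDA_HOME}"

# Print the full command for review; run it after ./autogen.sh.
echo "./configure ${CONFIGURE_ARGS}"
```

Any flag whose dependency lives in a standard location can simply be left out.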

nccl-rdma-sharp-plugins's People

Contributors

addyladdy, alexey-rivkin, artemry-nv, b-a-s, bureddy, dmitrygx, mike-dubman, sergei-lebedev, tvegas1

nccl-rdma-sharp-plugins's Issues

[BUG REPORT] NCCL_SHARP_DISABLE env variable does not take effect

In the Megatron-LM repo (https://github.com/NVIDIA/Megatron-LM/blob/v3.0.2/megatron/mpu/initialize.py#L62), three places create a process group (pg) through torch.distributed.new_group.

If I set os.environ["NCCL_SHARP_DISABLE"] = "1" after the data parallel pg is created, the expected result is that the data parallel pg allocates SHARP resources, while the model parallel pg and the tensor parallel pg do not.

But according to the plugin source (https://github.com/Mellanox/nccl-rdma-sharp-plugins/blob/master/src/sharp_plugin.c#L252) and my experiment, the debug log reports "SHARP: Set to disable on this communicator" and none of the pgs can allocate SHARP resources, which is not what I expected.

Could you check this problem?

Question about building dynamic library libnccl-net.so file

I am following the build instructions as written, but I only get a libnccl-net.la file in my src/ folder and not libnccl-net.so.

I also tried --enable-shared, but it doesn't seem to work. I have SHARP and CUDA installed and am passing both paths with --with-cuda and --with-sharp.

Any suggestions?
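
One thing that may be worth checking: libtool-based builds normally place the shared objects they produce in a hidden `.libs` subdirectory next to the `.la` file until `make install` copies them to the install prefix. Below is a minimal sketch for locating them; the helper name `find_plugin_libs` and the `src` default are illustrative assumptions, not part of the plugin's build system.

```shell
# List any libnccl-net shared objects under a build directory,
# including libtool's hidden .libs subdirectory.
find_plugin_libs() {
    build_dir="${1:-src}"
    find "$build_dir" -name 'libnccl-net.so*' 2>/dev/null
}

# Usage (from the top of the source tree after running make):
#   find_plugin_libs src
```

If nothing is listed, it may be worth checking the configure output to see whether shared libraries were enabled for the build.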

Sharp in docker?

Hi there,
Can I use the SHARP plugin in a Docker-based environment? Is there a tutorial on how to set up SHARP in Docker? Thanks!

[error] - AM TreeConfig add children MAD response status 0x1c00

Is there any documentation on what the following error codes mean, and what can cause them?

[g996eee][Sep 30 04:35:42 131457][SD][703][error] - AM TreeConfig add children MAD response status 0x1c00
[g996eee:0:600 unique id 35887003958052883] ERROR AN MAD error in sharp_connect_tree.

[g996eee:0:600 - comm.c:33] ERROR sharp_connect_tree failed: AN MAD error(-18)

Environment: CUDA 11.6, HPC-X 2.12, UFM 6.9.0 build 7, Mellanox CX-6.
I tried both the SHARP plugin included in HPC-X and one built from master; there was no difference.

# mpirun -np 16 -map-by ppr:8:node -x UCX_TLS=dc,shm,self  -x LD_LIBRARY_PATH -x NCCL_COLLNET_ENABLE=1 /opt/nccl_tests/build/all_reduce_perf -b 1G -e 2G -f 2 -g 1 -w 5 -n 1 

Warning: Permanently added 'nccl-test-32-worker-0.nccl-test-32-worker.multus.svc,10.139.111.2' (ECDSA) to the list of known hosts.
Warning: Permanently added 'nccl-test-32-worker-3.nccl-test-32-worker.multus.svc,10.139.112.2' (ECDSA) to the list of known hosts.
Warning: Permanently added 'nccl-test-32-worker-1.nccl-test-32-worker.multus.svc,10.139.112.4' (ECDSA) to the list of known hosts.
Warning: Permanently added 'nccl-test-32-worker-2.nccl-test-32-worker.multus.svc,10.139.112.3' (ECDSA) to the list of known hosts.

# nThread 1 nGpus 1 minBytes 1073741824 maxBytes 2147483648 step: 2(factor) warmup iters: 5 iters: 1 validation: 1 
#
# Using devices
#   Rank  0 Pid    598 on    g996eee device  0 [0x27] NVIDIA A100-SXM4-80GB
#   Rank  1 Pid    599 on    g996eee device  1 [0x2a] NVIDIA A100-SXM4-80GB
#   Rank  2 Pid    600 on    g996eee device  2 [0x51] NVIDIA A100-SXM4-80GB
#   Rank  3 Pid    601 on    g996eee device  3 [0x57] NVIDIA A100-SXM4-80GB
#   Rank  4 Pid    602 on    g996eee device  4 [0x9e] NVIDIA A100-SXM4-80GB
#   Rank  5 Pid    603 on    g996eee device  5 [0xa4] NVIDIA A100-SXM4-80GB
#   Rank  6 Pid    604 on    g996eee device  6 [0xc7] NVIDIA A100-SXM4-80GB
#   Rank  7 Pid    605 on    g996eee device  7 [0xca] NVIDIA A100-SXM4-80GB
#   Rank  8 Pid    594 on    g2f4e7c device  0 [0x27] NVIDIA A100-SXM4-80GB
#   Rank  9 Pid    595 on    g2f4e7c device  1 [0x2a] NVIDIA A100-SXM4-80GB
#   Rank 10 Pid    596 on    g2f4e7c device  2 [0x51] NVIDIA A100-SXM4-80GB
#   Rank 11 Pid    597 on    g2f4e7c device  3 [0x57] NVIDIA A100-SXM4-80GB
#   Rank 12 Pid    598 on    g2f4e7c device  4 [0x9e] NVIDIA A100-SXM4-80GB
#   Rank 13 Pid    599 on    g2f4e7c device  5 [0xa4] NVIDIA A100-SXM4-80GB
#   Rank 14 Pid    600 on    g2f4e7c device  6 [0xc7] NVIDIA A100-SXM4-80GB
#   Rank 15 Pid    601 on    g2f4e7c device  7 [0xca] NVIDIA A100-SXM4-80GB

[g996eee:0:598 - context.c:687] INFO job (ID: 35887902139160176) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[g996eee:0:598 - context.c:875] INFO sharp_job_id:114  tree_type:LLT tree_idx:0  treeID:0x0   caps:0x6 quota:(osts:25 user_data_per_ost:1024 max_groups:25 max_qps:1 max_group_channels:1)
[g996eee:0:598 - context.c:886] INFO sharp_job_id:114  tree_type:SAT tree_idx:1  treeID:0x3f  caps:0x16
[g996eee:0:598 - comm.c:392] INFO [group#:0] group id:0 tree idx:0 tree_type:LLT rail_idx:0 group size:2 quota: (osts:8 user_data_per_ost:1024) mgid: (subnet prefix:0xff12a01bfe800000 interface id:0x6f0000000000) mlid:c004
[g996eee:0:598 - comm.c:392] INFO [group#:1] group id:0 tree idx:1 tree_type:SAT rail_idx:0 group size:2 quota: (osts:64 user_data_per_ost:0) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
[g996eee:0:600 - context.c:687] INFO job (ID: 35887003958052883) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[g996eee:0:600 - context.c:875] INFO sharp_job_id:115  tree_type:LLT tree_idx:0  treeID:0x0   caps:0x6 quota:(osts:25 user_data_per_ost:1024 max_groups:25 max_qps:1 max_group_channels:1)
[g996eee:0:600 - context.c:886] INFO sharp_job_id:115  tree_type:SAT tree_idx:1  treeID:0x3f  caps:0x16
[g996eee][Sep 30 04:35:42 131457][SD][703][error] - AM TreeConfig add children MAD response status 0x1c00
[g996eee:0:600 unique id 35887003958052883] ERROR AN MAD error in sharp_connect_tree.

[g996eee:0:600 - comm.c:33] ERROR sharp_connect_tree failed: AN MAD error(-18)
[g2f4e7c][Sep 30 04:35:42 125532][SD][701][error] - AM TreeConfig add children MAD response status 0x1c00
[g2f4e7c:1:596 unique id 35887003958052883] ERROR AN MAD error in sharp_connect_tree.

[g2f4e7c:1:596 - comm.c:33] ERROR sharp_connect_tree failed: AN MAD error(-18)
[g996eee:0:602 - context.c:687] INFO job (ID: 35887114005571810) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[g996eee:0:602 - context.c:875] INFO sharp_job_id:116  tree_type:LLT tree_idx:0  treeID:0x0   caps:0x6 quota:(osts:25 user_data_per_ost:1024 max_groups:25 max_qps:1 max_group_channels:1)
[g996eee:0:602 - context.c:886] INFO sharp_job_id:116  tree_type:SAT tree_idx:1  treeID:0x3f  caps:0x16
[g996eee:0:602 - comm.c:392] INFO [group#:0] group id:0 tree idx:0 tree_type:LLT rail_idx:0 group size:2 quota: (osts:8 user_data_per_ost:1024) mgid: (subnet prefix:0xff12a01bfe800000 interface id:0x710000000000) mlid:c005
[g996eee:0:602 - comm.c:392] INFO [group#:1] group id:0 tree idx:1 tree_type:SAT rail_idx:0 group size:2 quota: (osts:64 user_data_per_ost:0) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
[g996eee:0:604 - context.c:687] INFO job (ID: 35887018084864365) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[g996eee:0:604 - context.c:875] INFO sharp_job_id:117  tree_type:LLT tree_idx:0  treeID:0x0   caps:0x6 quota:(osts:25 user_data_per_ost:1024 max_groups:25 max_qps:1 max_group_channels:1)
[g996eee:0:604 - context.c:886] INFO sharp_job_id:117  tree_type:SAT tree_idx:1  treeID:0x3f  caps:0x16
[g996eee:0:604 - comm.c:392] INFO [group#:0] group id:0 tree idx:0 tree_type:LLT rail_idx:0 group size:2 quota: (osts:8 user_data_per_ost:1024) mgid: (subnet prefix:0xff12a01bfe800000 interface id:0x720000000000) mlid:c006
[g996eee:0:604 - comm.c:392] INFO [group#:1] group id:0 tree idx:1 tree_type:SAT rail_idx:0 group size:2 quota: (osts:64 user_data_per_ost:0) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
[g996eee:0:600 - context.c:687] INFO job (ID: 35887004130201420) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[g996eee:0:600 - context.c:875] INFO sharp_job_id:118  tree_type:LLT tree_idx:0  treeID:0x0   caps:0x6 quota:(osts:25 user_data_per_ost:1024 max_groups:25 max_qps:1 max_group_channels:1)
[g996eee:0:600 - context.c:886] INFO sharp_job_id:118  tree_type:SAT tree_idx:1  treeID:0x3f  caps:0x16
[g996eee][Sep 30 04:35:45 056431][SD][703][error] - AM TreeConfig add children MAD response status 0x1c00
[g996eee:0:600 unique id 35887004130201420] ERROR AN MAD error in sharp_connect_tree.

[g996eee:0:600 - comm.c:33] ERROR sharp_connect_tree failed: AN MAD error(-18)
[g2f4e7c][Sep 30 04:35:45 066747][SD][701][error] - AM TreeConfig add children MAD response status 0x1c00
[g2f4e7c:1:596 unique id 35887004130201420] ERROR AN MAD error in sharp_connect_tree.

[g2f4e7c:1:596 - comm.c:33] ERROR sharp_connect_tree failed: AN MAD error(-18)
[g996eee:0:600 - context.c:687] INFO job (ID: 35887003132968255) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[g996eee:0:600 - context.c:875] INFO sharp_job_id:119  tree_type:LLT tree_idx:0  treeID:0x0   caps:0x6 quota:(osts:25 user_data_per_ost:1024 max_groups:25 max_qps:1 max_group_channels:1)
[g996eee:0:600 - context.c:886] INFO sharp_job_id:119  tree_type:SAT tree_idx:1  treeID:0x3f  caps:0x16
[g996eee][Sep 30 04:35:46 307672][SD][703][error] - AM TreeConfig add children MAD response status 0x1c00
[g996eee:0:600 unique id 35887003132968255] ERROR AN MAD error in sharp_connect_tree.

[g996eee:0:600 - comm.c:33] ERROR sharp_connect_tree failed: AN MAD error(-18)
[g2f4e7c][Sep 30 04:35:46 318712][SD][701][error] - AM TreeConfig add children MAD response status 0x1c00
[g2f4e7c:1:596 unique id 35887003132968255] ERROR AN MAD error in sharp_connect_tree.

[g2f4e7c:1:596 - comm.c:33] ERROR sharp_connect_tree failed: AN MAD error(-18)
[g996eee:0:600 - context.c:687] INFO job (ID: 35887003451118562) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[g996eee:0:600 - context.c:875] INFO sharp_job_id:120  tree_type:LLT tree_idx:0  treeID:0x0   caps:0x6 quota:(osts:25 user_data_per_ost:1024 max_groups:25 max_qps:1 max_group_channels:1)
[g996eee:0:600 - context.c:886] INFO sharp_job_id:120  tree_type:SAT tree_idx:1  treeID:0x3f  caps:0x16
[g996eee][Sep 30 04:35:47 548419][SD][703][error] - AM TreeConfig add children MAD response status 0x1c00
[g996eee:0:600 unique id 35887003451118562] ERROR AN MAD error in sharp_connect_tree.

[g996eee:0:600 - comm.c:33] ERROR sharp_connect_tree failed: AN MAD error(-18)
[g2f4e7c][Sep 30 04:35:47 550832][SD][701][error] - AM TreeConfig add children MAD response status 0x1c00
[g2f4e7c:1:596 unique id 35887003451118562] ERROR AN MAD error in sharp_connect_tree.

[g2f4e7c:1:596 - comm.c:33] ERROR sharp_connect_tree failed: AN MAD error(-18)
[g996eee:0:600 - context.c:687] INFO job (ID: 35887004604703053) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[g996eee:0:600 - context.c:875] INFO sharp_job_id:121  tree_type:LLT tree_idx:0  treeID:0x0   caps:0x6 quota:(osts:25 user_data_per_ost:1024 max_groups:25 max_qps:1 max_group_channels:1)
[g996eee:0:600 - context.c:886] INFO sharp_job_id:121  tree_type:SAT tree_idx:1  treeID:0x3f  caps:0x16
[g2f4e7c][Sep 30 04:35:48 937799][SD][701][error] - AM TreeConfig add children MAD response status 0x1c00
[g2f4e7c:1:596 unique id 35887004604703053] ERROR AN MAD error in sharp_connect_tree.

[g2f4e7c:1:596 - comm.c:33] ERROR sharp_connect_tree failed: AN MAD error(-18)
[g996eee][Sep 30 04:35:48 955482][SD][703][error] - AM TreeConfig add children MAD response status 0x1c00
[g996eee:0:600 unique id 35887004604703053] ERROR AN MAD error in sharp_connect_tree.

[g996eee:0:600 - comm.c:33] ERROR sharp_connect_tree failed: AN MAD error(-18)
[g996eee:0:600 - context.c:687] INFO job (ID: 35887003503035138) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[g996eee:0:600 - context.c:875] INFO sharp_job_id:122  tree_type:LLT tree_idx:0  treeID:0x0   caps:0x6 quota:(osts:25 user_data_per_ost:1024 max_groups:25 max_qps:1 max_group_channels:1)
[g996eee:0:600 - context.c:886] INFO sharp_job_id:122  tree_type:SAT tree_idx:1  treeID:0x3f  caps:0x16
[g996eee][Sep 30 04:35:50 204167][SD][703][error] - AM TreeConfig add children MAD response status 0x1c00
[g996eee:0:600 unique id 35887003503035138] ERROR AN MAD error in sharp_connect_tree.

[g996eee:0:600 - comm.c:33] ERROR sharp_connect_tree failed: AN MAD error(-18)
[g2f4e7c][Sep 30 04:35:50 214760][SD][701][error] - AM TreeConfig add children MAD response status 0x1c00
[g2f4e7c:1:596 unique id 35887003503035138] ERROR AN MAD error in sharp_connect_tree.

[g2f4e7c:1:596 - comm.c:33] ERROR sharp_connect_tree failed: AN MAD error(-18)
[g996eee:0:600 - context.c:687] INFO job (ID: 35887004402930063) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[g996eee:0:600 - context.c:875] INFO sharp_job_id:123  tree_type:LLT tree_idx:0  treeID:0x0   caps:0x6 quota:(osts:25 user_data_per_ost:1024 max_groups:25 max_qps:1 max_group_channels:1)
[g996eee:0:600 - context.c:886] INFO sharp_job_id:123  tree_type:SAT tree_idx:1  treeID:0x3f  caps:0x16
[g996eee][Sep 30 04:35:51 519524][SD][703][error] - AM TreeConfig add children MAD response status 0x1c00
[g996eee:0:600 unique id 35887004402930063] ERROR AN MAD error in sharp_connect_tree.

[g996eee:0:600 - comm.c:33] ERROR sharp_connect_tree failed: AN MAD error(-18)
[g2f4e7c][Sep 30 04:35:51 530746][SD][701][error] - AM TreeConfig add children MAD response status 0x1c00
[g2f4e7c:1:596 unique id 35887004402930063] ERROR AN MAD error in sharp_connect_tree.

[g2f4e7c:1:596 - comm.c:33] ERROR sharp_connect_tree failed: AN MAD error(-18)
[g996eee:0:600 - context.c:687] INFO job (ID: 35887004947744738) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[g996eee:0:600 - context.c:875] INFO sharp_job_id:124  tree_type:LLT tree_idx:0  treeID:0x0   caps:0x6 quota:(osts:25 user_data_per_ost:1024 max_groups:25 max_qps:1 max_group_channels:1)
[g996eee:0:600 - context.c:886] INFO sharp_job_id:124  tree_type:SAT tree_idx:1  treeID:0x3f  caps:0x16
[g996eee][Sep 30 04:35:52 807063][SD][703][error] - AM TreeConfig add children MAD response status 0x1c00
[g996eee:0:600 unique id 35887004947744738] ERROR AN MAD error in sharp_connect_tree.

[g996eee:0:600 - comm.c:33] ERROR sharp_connect_tree failed: AN MAD error(-18)
[g2f4e7c][Sep 30 04:35:52 801270][SD][701][error] - AM TreeConfig add children MAD response status 0x1c00
[g2f4e7c:1:596 unique id 35887004947744738] ERROR AN MAD error in sharp_connect_tree.

[g2f4e7c:1:596 - comm.c:33] ERROR sharp_connect_tree failed: AN MAD error(-18)
[g996eee:0:600 - context.c:687] INFO job (ID: 35887003900347484) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[g996eee:0:600 - context.c:875] INFO sharp_job_id:125  tree_type:LLT tree_idx:0  treeID:0x0   caps:0x6 quota:(osts:25 user_data_per_ost:1024 max_groups:25 max_qps:1 max_group_channels:1)
[g996eee:0:600 - context.c:886] INFO sharp_job_id:125  tree_type:SAT tree_idx:1  treeID:0x3f  caps:0x16
[g996eee][Sep 30 04:35:54 048396][SD][703][error] - AM TreeConfig add children MAD response status 0x1c00
[g2f4e7c][Sep 30 04:35:54 042484][SD][701][error] - AM TreeConfig add children MAD response status 0x1c00
[g2f4e7c:1:596 unique id 35887003900347484] ERROR AN MAD error in sharp_connect_tree.

[g2f4e7c:1:596 - comm.c:33] ERROR sharp_connect_tree failed: AN MAD error(-18)
[g996eee:0:600 unique id 35887003900347484] ERROR AN MAD error in sharp_connect_tree.

[g996eee:0:600 - comm.c:33] ERROR sharp_connect_tree failed: AN MAD error(-18)
[g996eee:0:600 - context.c:687] INFO job (ID: 35887004364391853) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[g996eee:0:600 - context.c:875] INFO sharp_job_id:126  tree_type:LLT tree_idx:0  treeID:0x0   caps:0x6 quota:(osts:25 user_data_per_ost:1024 max_groups:25 max_qps:1 max_group_channels:1)
[g996eee:0:600 - context.c:886] INFO sharp_job_id:126  tree_type:SAT tree_idx:1  treeID:0x3f  caps:0x16
[g996eee][Sep 30 04:35:55 305056][SD][703][error] - AM TreeConfig add children MAD response status 0x1c00
[g996eee:0:600 unique id 35887004364391853] ERROR AN MAD error in sharp_connect_tree.

[g996eee:0:600 - comm.c:33] ERROR sharp_connect_tree failed: AN MAD error(-18)
[g2f4e7c][Sep 30 04:35:55 299238][SD][701][error] - AM TreeConfig add children MAD response status 0x1c00
[g2f4e7c:1:596 unique id 35887004364391853] ERROR AN MAD error in sharp_connect_tree.

[g2f4e7c:1:596 - comm.c:33] ERROR sharp_connect_tree failed: AN MAD error(-18)
[g996eee:0:600 - context.c:687] INFO job (ID: 35887004620413021) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[g996eee:0:600 - context.c:875] INFO sharp_job_id:127  tree_type:LLT tree_idx:0  treeID:0x0   caps:0x6 quota:(osts:25 user_data_per_ost:1024 max_groups:25 max_qps:1 max_group_channels:1)
[g996eee:0:600 - context.c:886] INFO sharp_job_id:127  tree_type:SAT tree_idx:1  treeID:0x3f  caps:0x16
[g996eee][Sep 30 04:35:56 568313][SD][703][error] - AM TreeConfig add children MAD response status 0x1c00
[g996eee:0:600 unique id 35887004620413021] ERROR AN MAD error in sharp_connect_tree.

[g996eee:0:600 - comm.c:33] ERROR sharp_connect_tree failed: AN MAD error(-18)
[g2f4e7c][Sep 30 04:35:56 582732][SD][701][error] - AM TreeConfig add children MAD response status 0x1c00
[g2f4e7c:1:596 unique id 35887004620413021] ERROR AN MAD error in sharp_connect_tree.

[g2f4e7c:1:596 - comm.c:33] ERROR sharp_connect_tree failed: AN MAD error(-18)
[g996eee:0:600 - context.c:687] INFO job (ID: 35887005092490615) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[g996eee:0:600 - context.c:875] INFO sharp_job_id:128  tree_type:LLT tree_idx:0  treeID:0x0   caps:0x6 quota:(osts:25 user_data_per_ost:1024 max_groups:25 max_qps:1 max_group_channels:1)
[g996eee:0:600 - context.c:886] INFO sharp_job_id:128  tree_type:SAT tree_idx:1  treeID:0x3f  caps:0x16
[g996eee][Sep 30 04:35:57 916182][SD][703][error] - AM TreeConfig add children MAD response status 0x1c00
[g996eee:0:600 unique id 35887005092490615] ERROR AN MAD error in sharp_connect_tree.

[g996eee:0:600 - comm.c:33] ERROR sharp_connect_tree failed: AN MAD error(-18)
[g2f4e7c][Sep 30 04:35:57 910354][SD][701][error] - AM TreeConfig add children MAD response status 0x1c00
[g2f4e7c:1:596 unique id 35887005092490615] ERROR AN MAD error in sharp_connect_tree.

[g2f4e7c:1:596 - comm.c:33] ERROR sharp_connect_tree failed: AN MAD error(-18)

[g996eee:600  :0:600] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x44b8)
#
#                                                       out-of-place                       in-place          
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
==== backtrace (tid:    600) ====
0 0x0000000000014420 __funlockfile()  ???:0
1 0x000000000005ebc8 ncclGroupEnd()  ???:0
2 0x000000000005ee95 ncclGroupEnd()  ???:0
3 0x000000000005f33a ncclGroupEnd()  ???:0
4 0x00000000000562e0 ncclGetUniqueId()  ???:0
5 0x0000000000057eea ncclRedOpDestroy()  ???:0
6 0x000000000005769c ncclRedOpDestroy()  ???:0
7 0x000000000007fd93 ncclAllReduce()  ???:0
8 0x00000000000052e6 AllReduceRunColl()  /opt/nccl_tests/src/all_reduce.cu:57
9 0x0000000000007a51 startColl()  /opt/nccl_tests/src/common.cu:563
10 0x0000000000009fb4 TimeTest()  /opt/nccl_tests/src/common.cu:764
11 0x0000000000005174 AllReduceRunTest()  /opt/nccl_tests/src/all_reduce.cu:103
12 0x0000000000005b2a threadRunTests()  /opt/nccl_tests/src/common.cu:792
13 0x000000000000b6d3 run()  /opt/nccl_tests/src/common.cu:1166
14 0x0000000000003fbf main()  /opt/nccl_tests/src/common.cu:1007
15 0x0000000000024083 __libc_start_main()  ???:0
16 0x0000000000004efe _start()  ???:0
=================================
[g996eee:00600] *** Process received signal ***
[g996eee:00600] Signal: Segmentation fault (11)
[g996eee:00600] Signal code:  (-6)
[g996eee:00600] Failing at address: 0x258
[g996eee:00600] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f0c8142f420]
[g996eee:00600] [ 1] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x5ebc8)[0x7f0c8149cbc8]
[g996eee:00600] [ 2] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x5ee95)[0x7f0c8149ce95]
[g996eee:00600] [ 3] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x5f33a)[0x7f0c8149d33a]
[g996eee:00600] [ 4] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x562e0)[0x7f0c814942e0]
[g996eee:00600] [ 5] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x57eea)[0x7f0c81495eea]
[g996eee:00600] [ 6] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x5769c)[0x7f0c8149569c]
[g996eee:00600] [ 7] /usr/lib/x86_64-linux-gnu/libnccl.so.2(ncclAllReduce+0xf3)[0x7f0c814bdd93]
[g996eee:00600] [ 8] /opt/nccl_tests/build/all_reduce_perf(+0x52e6)[0x55d4292c22e6]
[g996eee:00600] [ 9] /opt/nccl_tests/build/all_reduce_perf(+0x7a51)[0x55d4292c4a51]
[g996eee:00600] [10] /opt/nccl_tests/build/all_reduce_perf(+0x9fb4)[0x55d4292c6fb4]
[g996eee:00600] [11] /opt/nccl_tests/build/all_reduce_perf(+0x5174)[0x55d4292c2174]
[g996eee:00600] [12] /opt/nccl_tests/build/all_reduce_perf(+0x5b2a)[0x55d4292c2b2a]
[g996eee:00600] [13] /opt/nccl_tests/build/all_reduce_perf(+0xb6d3)[0x55d4292c86d3]
[g996eee:00600] [14] /opt/nccl_tests/build/all_reduce_perf(+0x3fbf)[0x55d4292c0fbf]
[g996eee:00600] [15] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f0c81069083]
[g996eee:00600] [16] /opt/nccl_tests/build/all_reduce_perf(+0x4efe)[0x55d4292c1efe]
[g996eee:00600] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[g2f4e7c:596  :0:596] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x44b8)
==== backtrace (tid:    596) ====
0 0x0000000000014420 __funlockfile()  ???:0
1 0x000000000005ebc8 ncclGroupEnd()  ???:0
2 0x000000000005ee95 ncclGroupEnd()  ???:0
3 0x000000000005f33a ncclGroupEnd()  ???:0
4 0x00000000000562e0 ncclGetUniqueId()  ???:0
5 0x0000000000057eea ncclRedOpDestroy()  ???:0
6 0x000000000005769c ncclRedOpDestroy()  ???:0
7 0x000000000007fd93 ncclAllReduce()  ???:0
8 0x00000000000052e6 AllReduceRunColl()  /opt/nccl_tests/src/all_reduce.cu:57
9 0x0000000000007a51 startColl()  /opt/nccl_tests/src/common.cu:563
10 0x0000000000009fb4 TimeTest()  /opt/nccl_tests/src/common.cu:764
11 0x0000000000005174 AllReduceRunTest()  /opt/nccl_tests/src/all_reduce.cu:103
12 0x0000000000005b2a threadRunTests()  /opt/nccl_tests/src/common.cu:792
13 0x000000000000b6d3 run()  /opt/nccl_tests/src/common.cu:1166
14 0x0000000000003fbf main()  /opt/nccl_tests/src/common.cu:1007
15 0x0000000000024083 __libc_start_main()  ???:0
16 0x0000000000004efe _start()  ???:0
=================================
[g2f4e7c:00596] *** Process received signal ***
[g2f4e7c:00596] Signal: Segmentation fault (11)
[g2f4e7c:00596] Signal code:  (-6)
[g2f4e7c:00596] Failing at address: 0x254
[g2f4e7c:00596] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7fdf48964420]
[g2f4e7c:00596] [ 1] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x5ebc8)[0x7fdf489d1bc8]
[g2f4e7c:00596] [ 2] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x5ee95)[0x7fdf489d1e95]
[g2f4e7c:00596] [ 3] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x5f33a)[0x7fdf489d233a]
[g2f4e7c:00596] [ 4] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x562e0)[0x7fdf489c92e0]
[g2f4e7c:00596] [ 5] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x57eea)[0x7fdf489caeea]
[g2f4e7c:00596] [ 6] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x5769c)[0x7fdf489ca69c]
[g2f4e7c:00596] [ 7] /usr/lib/x86_64-linux-gnu/libnccl.so.2(ncclAllReduce+0xf3)[0x7fdf489f2d93]
[g2f4e7c:00596] [ 8] /opt/nccl_tests/build/all_reduce_perf(+0x52e6)[0x55bccdcad2e6]
[g2f4e7c:00596] [ 9] /opt/nccl_tests/build/all_reduce_perf(+0x7a51)[0x55bccdcafa51]
[g2f4e7c:00596] [10] /opt/nccl_tests/build/all_reduce_perf(+0x9fb4)[0x55bccdcb1fb4]
[g2f4e7c:00596] [11] /opt/nccl_tests/build/all_reduce_perf(+0x5174)[0x55bccdcad174]
[g2f4e7c:00596] [12] /opt/nccl_tests/build/all_reduce_perf(+0x5b2a)[0x55bccdcadb2a]
[g2f4e7c:00596] [13] /opt/nccl_tests/build/all_reduce_perf(+0xb6d3)[0x55bccdcb36d3]
[g2f4e7c:00596] [14] /opt/nccl_tests/build/all_reduce_perf(+0x3fbf)[0x55bccdcabfbf]
[g2f4e7c:00596] [15] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fdf4859e083]
[g2f4e7c:00596] [16] /opt/nccl_tests/build/all_reduce_perf(+0x4efe)[0x55bccdcacefe]
[g2f4e7c:00596] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 600 on node nccl-test-32-worker-0.nccl-test-32-worker.multus.svc exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Question about minimum required versions

According to the README:

Requirements

  • MOFED
  • CUDA
  • SHARP
  • NCCL
  • GPUDirectRDMA plugin

I am curious which minimum versions this repo requires.
Our HPC cluster is installed with

Is this a suitable configuration for these plugins?
If not, which versions of these components have already been tested? Many thanks!

Tagged releases

Is it possible to get tags for more recent versions / releases? Or is there another place recommended to download a fixed version instead of relying on pinning to a commit hash?

Context: I'm setting up a docker image to use NCCL + SHARP using OFED, which doesn't have the plugin come rolled with it.

As an aside, why does the plugin come packaged with HPC-X but not OFED?

CC: @bureddy who seems to be maintaining this project? Btw, thank you for all your hard work -- this is very impressive :)

make error

I ran into this problem when running make:

make  all-recursive
make[1]: Entering directory `/nccl-rdma-sharp-plugins'
Making all in src
make[2]: Entering directory `/nccl-rdma-sharp-plugins/src'
  CC       libnccl_net_la-ibvwrap.lo
  CC       libnccl_net_la-utils.lo
  CC       libnccl_net_la-param.lo
  CC       libnccl_net_la-socket.lo
  CC       libnccl_net_la-p2p_plugin.lo
  CC       libnccl_net_la-ib_plugin.lo
  CC       libnccl_net_la-ucx_plugin.lo
ucx_plugin.c: In function 'nccl_ucx_irecv':
ucx_plugin.c:714:48: error: 'size' undeclared (first use in this function); did you mean 'sizes'?
  714 |   req = ucp_tag_recv_nb(comm->worker, data[0], size, ucp_dt_make_contig(1),
      |                                                ^~~~
      |                                                sizes
ucx_plugin.c:714:48: note: each undeclared identifier is reported only once for each function it appears in
make[2]: *** [libnccl_net_la-ucx_plugin.lo] Error 1
make[2]: Leaving directory `/nccl-rdma-sharp-plugins/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/nccl-rdma-sharp-plugins'
make: *** [all] Error 2

I am wondering how to fix it. I have met all the requirements in the README.

make error

Hi,
make all-recursive
make[1]: Entering directory `/root/nccl-rdma-sharp-plugins-2.1.0'
Making all in src
make[2]: Entering directory `/root/nccl-rdma-sharp-plugins-2.1.0/src'
CC libnccl_net_la-ibvwrap.lo
CC libnccl_net_la-utils.lo
CC libnccl_net_la-p2p_plugin.lo
CC libnccl_net_la-ib_plugin.lo
CC libnccl_net_la-ucx_plugin.lo
CC libnccl_net_la-ucx_rma_plugin.lo
CC libnccl_net_la-sharp_plugin.lo
CCLD libnccl-net.la
/usr/bin/ld: cannot find -lz
collect2: error: ld returned 1 exit status
make[2]: *** [libnccl-net.la] Error 1
make[2]: Leaving directory `/root/nccl-rdma-sharp-plugins-2.1.0/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/nccl-rdma-sharp-plugins-2.1.0'
make: *** [all] Error 2

[root@gpu1 nccl-rdma-sharp-plugins-2.1.0]# gcc -lz --verbose
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)
COMPILER_PATH=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/:/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/:/usr/libexec/gcc/x86_64-redhat-linux/:/usr/lib/gcc/x86_64-redhat-linux/4.8.5/:/usr/lib/gcc/x86_64-redhat-linux/
LIBRARY_PATH=/usr/lib/gcc/x86_64-redhat-linux/4.8.5/:/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/:/lib/../lib64/:/usr/lib/../lib64/:/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../:/lib/:/usr/lib/
COLLECT_GCC_OPTIONS='-v' '-mtune=generic' '-march=x86-64'
/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/collect2 --build-id --no-add-needed --eh-frame-hdr --hash-style=gnu -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/crt1.o /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/crti.o /usr/lib/gcc/x86_64-redhat-linux/4.8.5/crtbegin.o -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../.. -lz -lgcc --as-needed -lgcc_s --no-as-needed -lc -lgcc --as-needed -lgcc_s --no-as-needed /usr/lib/gcc/x86_64-redhat-linux/4.8.5/crtend.o /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/crtn.o
/usr/bin/ld: cannot find -lz
collect2: error: ld returned 1 exit status

Error when compiling the master branch code

My environment:

  • cuda: 12.2
  • sharp + verbs: the version from ofed 23.07

It seems to come down to the difference between allgather and allreduce; the details are in the attached image.
[image: 20240104-160336]

Is the bf16 datatype supported?

Do I need to set some flags to enable bfloat16?

Currently it does not work. The NCCL test reports the following error:

llm011:2504887:2504887 [7] enqueue.cc:1175 NCCL WARN Error : no algorithm/protocol available
llm011:2504887:2504887 [7] NCCL INFO enqueue.cc:1273 -> 3
llm011:2504887:2504887 [7] NCCL INFO enqueue.cc:567 -> 3
llm011:2504887:2504887 [7] NCCL INFO enqueue.cc:936 -> 3

llm012:1731092:1731092 [7] enqueue.cc:1175 NCCL WARN Error : no algorithm/protocol available
llm012:1731092:1731092 [7] NCCL INFO enqueue.cc:1273 -> 3
llm012:1731092:1731092 [7] NCCL INFO enqueue.cc:567 -> 3
llm012:1731092:1731092 [7] NCCL INFO enqueue.cc:936 -> 3
llm012:1731092:1731092 [7] NCCL INFO group.cc:140 -> 3
llm012:1731092:1731092 [7] NCCL INFO group.cc:341 -> 3
llm012:1731092:1731092 [7] NCCL INFO group.cc:422 -> 3
llm012:1731092:1731092 [7] NCCL INFO group.cc:106 -> 3
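
For readers of the trace above: the `-> 3` suffix is the numeric `ncclResult_t` being propagated up the call stack. The helper below is a minimal sketch for decoding it, assuming the enum values from `nccl.h` as commonly shipped (reproduced from memory here, so check them against your own header). The function name `nccl_result_name` is hypothetical and not part of NCCL, which provides `ncclGetErrorString` for real programs.

```c
/* Hypothetical helper: maps a numeric code from an NCCL log line such as
 * "enqueue.cc:1273 -> 3" to the matching ncclResult_t enumerator name.
 * ASSUMPTION: values mirror the ncclResult_t enum in nccl.h; verify locally. */
static const char *nccl_result_name(int code)
{
    switch (code) {
    case 0:  return "ncclSuccess";
    case 1:  return "ncclUnhandledCudaError";
    case 2:  return "ncclSystemError";
    case 3:  return "ncclInternalError";
    case 4:  return "ncclInvalidArgument";
    case 5:  return "ncclInvalidUsage";
    default: return "unknown";
    }
}
```

Under that assumption the trace reports an internal error raised where NCCL selects an algorithm and protocol, which matches the preceding WARN line rather than pointing at a CUDA or system failure.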

Question using sharp on one A100 device

Hi,
I used a single A100 device to test SHARP performance. When I test RDMA performance, the device-to-device send/recv speed is about 18 GB/s. When I use SHARP to do an allreduce operation (as only one process is used, I think no calculation takes place), the speed is 12 GB/s. So my question is: what may cause the performance to degrade?

Thanks!

make error ncclSharpIallgather

I encountered an error while running the make command. ./autogen.sh and ./configure --with-cuda=/usr/local/cuda both succeeded.

make  all-recursive
make[1]: Entering directory '/root/nccl-rdma-sharp-plugins'
Making all in src
make[2]: Entering directory '/root/nccl-rdma-sharp-plugins/src'
  CC       libnccl_net_la-ibvwrap.lo
  CC       libnccl_net_la-utils.lo
  CC       libnccl_net_la-param.lo
  CC       libnccl_net_la-socket.lo
  CC       libnccl_net_la-p2p_plugin.lo
  CC       libnccl_net_la-ib_plugin.lo
  CC       libnccl_net_la-ucx_plugin.lo
  CC       libnccl_net_la-ucx_rma_plugin.lo
  CC       libnccl_net_la-ucx_uct_lib.lo
  CC       libnccl_net_la-ucx_uct_plugin.lo
  CC       libnccl_net_la-sharp_plugin.lo
sharp_plugin.c: In function ‘ncclSharpIallgather’:
sharp_plugin.c:533:33: error: storage size of ‘gather_spec’ isn’t known
  533 |   struct sharp_coll_gather_spec gather_spec;
      |                                 ^~~~~~~~~~~
sharp_plugin.c:550:29: error: implicit declaration of function ‘sharp_coll_do_allgather_nb’; did you mean ‘sharp_coll_do_allreduce_nb’? [-Werror=implicit-function-declaration]
  550 |   if (SHARP_COLL_SUCCESS != sharp_coll_do_allgather_nb(cComm->sharpCollComm, &gather_spec, &req->sharpRequest)) {
      |                             ^~~~~~~~~~~~~~~~~~~~~~~~~~
      |                             sharp_coll_do_allreduce_nb
sharp_plugin.c:533:33: error: unused variable ‘gather_spec’ [-Werror=unused-variable]
  533 |   struct sharp_coll_gather_spec gather_spec;
      |                                 ^~~~~~~~~~~
sharp_plugin.c: In function ‘ncclSharpIreducescatter’:
sharp_plugin.c:609:14: error: ‘struct sharp_coll_reduce_spec’ has no member named ‘offset’
  609 |   reduce_spec.offset = windowOffset;
      |              ^
sharp_plugin.c:615:29: error: implicit declaration of function ‘sharp_coll_do_reduce_scatter_nb’; did you mean ‘sharp_coll_do_reduce_nb’? [-Werror=implicit-function-declaration]
  615 |   if (SHARP_COLL_SUCCESS != sharp_coll_do_reduce_scatter_nb(cComm->sharpCollComm, &reduce_spec, &req->sharpRequest)) {
      |                             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                             sharp_coll_do_reduce_nb
cc1: all warnings being treated as errors
make[2]: *** [Makefile:589: libnccl_net_la-sharp_plugin.lo] Error 1
make[2]: Leaving directory '/root/nccl-rdma-sharp-plugins/src'
make[1]: *** [Makefile:447: all-recursive] Error 1
make[1]: Leaving directory '/root/nccl-rdma-sharp-plugins'
make: *** [Makefile:365: all] Error 2
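
The errors above indicate that the SHARP headers found at build time do not declare `struct sharp_coll_gather_spec`, `sharp_coll_do_allgather_nb`, or `sharp_coll_do_reduce_scatter_nb`, which would be consistent with a SHARP installation older than the one the plugin source expects. One general way to keep such code building against older headers is a compile-time feature guard. The snippet below is only a sketch of that pattern: the macro `HAVE_SHARP_ALLGATHER_NB`, the return codes, and the function `plugin_iallgather` are hypothetical names for illustration, not something this repository is known to define.

```c
/* HYPOTHETICAL feature macro: a configure-time probe of the installed SHARP
 * headers would define it when sharp_coll_do_allgather_nb is declared.
 * It is left undefined here to model an older SHARP installation. */
/* #define HAVE_SHARP_ALLGATHER_NB 1 */

#define PLUGIN_OK            0
#define PLUGIN_UNSUPPORTED (-1)

/* Stand-in for a collective entry point such as ncclSharpIallgather. */
static int plugin_iallgather(void)
{
#ifdef HAVE_SHARP_ALLGATHER_NB
    /* Real code would fill a struct sharp_coll_gather_spec and call
     * sharp_coll_do_allgather_nb() here. */
    return PLUGIN_OK;
#else
    /* Older headers: the newer API is never referenced, so the file still
     * compiles, and the caller is told the operation is unavailable. */
    return PLUGIN_UNSUPPORTED;
#endif
}
```

With the macro undefined the function returns `PLUGIN_UNSUPPORTED`, so the build succeeds and the missing collective is reported at run time. Pointing `--with-sharp=PATH` at a SHARP installation whose headers declare these symbols is the other obvious thing to try.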

Using SHARP fails: sharp_coll_comm_init returns an error

Hi developers,
I have built the SHARP environment, and the SHARP plugin loads successfully.
When the function sharp_coll_comm_init runs, it returns an error, so NCCL finally falls back to the P2P NET.
Can you help me analyze this issue? Thank you!

The following is the error log:
[C25L18:0:24972 - context.c:702] INFO job (ID: 1201360188720575732) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[C25L18:0:24972 - context.c:895] INFO sharp_job_id:12 resv_key: tree_type:LLT tree_idx:0 treeID:0x1 caps:0x26 quota:(osts:23 user_data_per_ost:1024 max_groups:23 max_qps:1 max_group_channels:1)
[C25L18:0:24972 - context.c:899] INFO sharp_job_id:12 tree_type:SAT tree_idx:1 treeID:0x40 caps:0x36
C25L19:19373:19491 [3] NCCL INFO Sharp rank 1/2 initialized on mlx5_5:1
C25L18:24972:25066 [3] NCCL INFO Sharp rank 0/2 initialized on mlx5_5:1
[C25L18:0:24972 - comm.c:374] ERROR Failed to lock SAT tree(ID:0x40 ret:0x4)
[C25L19:1:19373 - comm.c:370] ERROR Failed to lock SAT tree(ID:0x40 ret:0x4)

C25L19:19373:19491 [3] sharp_plugin.c:302 NCCL WARN SHARP group create: Streaming Tree lock failed (-18)
C25L18:24972:25066 [3] sharp_plugin.c:302 NCCL WARN SHARP group create: Streaming Tree lock failed (-18)

License information

Hello!

Could you please clarify the licence type under which the code in this repo is distributed?
Is it possible to change the plugin code for research purposes and then publish it?

As far as I understand, the existing UCX plugin in this project aims to support only the IB functionality in UCX, since it mixes the Verbs and UCX APIs.

Suppose I would like to develop a more generic p2p network plugin for NCCL that supports TCP, IB and other transports through the UCX library. Could I reuse the code of this project to do this?

The copyright headers in the *.c files say that I should "See LICENSE.txt for license information", but there is no such file in the repo.

Thanks!
