Introduction

This repository holds NVIDIA-maintained utilities to streamline mixed precision and distributed training in PyTorch. Some of the code here will eventually be included in upstream PyTorch. The intent of Apex is to make up-to-date utilities available to users as quickly as possible.

Full API Documentation: https://nvidia.github.io/apex

Contents

1. Amp: Automatic Mixed Precision

Deprecated. Use PyTorch AMP

apex.amp is a tool to enable mixed precision training by changing only 3 lines of your script. Users can easily experiment with different pure and mixed precision training modes by supplying different flags to amp.initialize.
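For orientation, a minimal sketch of what those three lines look like in a training script (the tiny model, optimizer, and loss below are placeholders, not part of the Apex API):

import torch
from apex import amp

model = torch.nn.Linear(10, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Line 1 of 3: let Amp patch the model and optimizer for the chosen opt_level.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

loss = model(torch.randn(4, 10, device="cuda")).sum()

# Lines 2-3 of 3: scale the loss so fp16 gradients don't underflow.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()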

Webinar introducing Amp (The flag cast_batchnorm has been renamed to keep_batchnorm_fp32).

API Documentation

Comprehensive Imagenet example

DCGAN example coming soon...

Moving to the new Amp API (for users of the deprecated "Amp" and "FP16_Optimizer" APIs)

2. Distributed Training

apex.parallel.DistributedDataParallel is deprecated. Use torch.nn.parallel.DistributedDataParallel

apex.parallel.DistributedDataParallel is a module wrapper, similar to torch.nn.parallel.DistributedDataParallel. It enables convenient multiprocess distributed training, optimized for NVIDIA's NCCL communication library.
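A minimal sketch of the wrapper, assuming the usual one-process-per-GPU launch (e.g. via torch.distributed.launch) so that --local_rank and the NCCL environment variables are set up:

import argparse
import torch
from apex.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend="nccl", init_method="env://")

model = torch.nn.Linear(10, 10).cuda()
# Unlike torch.nn.parallel.DistributedDataParallel, no device_ids argument
# is needed: apex's wrapper assumes one GPU per process.
model = DDP(model)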

API Documentation

Python Source

Example/Walkthrough

The Imagenet example shows use of apex.parallel.DistributedDataParallel along with apex.amp.

Synchronized Batch Normalization

Deprecated. Use torch.nn.SyncBatchNorm

apex.parallel.SyncBatchNorm extends torch.nn.modules.batchnorm._BatchNorm to support synchronized BN. It allreduces stats across processes during multiprocess (DistributedDataParallel) training. Synchronous BN has been used in cases where only a small local minibatch can fit on each GPU. Allreduced stats increase the effective batch size for the BN layer to the global batch size across all processes (which, technically, is the correct formulation). Synchronous BN has been observed to improve converged accuracy in some of our research models.
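A minimal sketch of the conversion using apex.parallel.convert_syncbn_model, which recursively swaps BN layers in a module tree (the toy model is a placeholder); apply it before wrapping the model for distributed training:

import torch
import apex

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.BatchNorm2d(8),
).cuda()

# Replace every torch.nn.modules.batchnorm._BatchNorm instance with
# apex.parallel.SyncBatchNorm so stats are allreduced across processes.
model = apex.parallel.convert_syncbn_model(model)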

Checkpointing

To properly save and load your amp training state, we introduce amp.state_dict(), which contains all loss_scalers and their corresponding unskipped steps, and amp.load_state_dict() to restore these attributes.

In order to get bitwise accuracy, we recommend the following workflow:

# Initialization
opt_level = 'O1'
model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)

# Train your model
...
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
...

# Save checkpoint
checkpoint = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'amp': amp.state_dict()
}
torch.save(checkpoint, 'amp_checkpoint.pt')
...

# Restore
model = ...
optimizer = ...
checkpoint = torch.load('amp_checkpoint.pt')

model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
amp.load_state_dict(checkpoint['amp'])

# Continue training
...

Note that we recommend restoring the model using the same opt_level. Also note that we recommend calling the load_state_dict methods after amp.initialize.

Installation

Each apex.contrib module requires one or more install options other than --cpp_ext and --cuda_ext. Note that contrib modules do not necessarily support stable PyTorch releases.
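For example, building the apex.contrib.xentropy module adds its install option (--xentropy, per the table under "Custom C++/CUDA Extensions and Install Options" below) to the usual command. A sketch for pip >= 23.1; substitute the option your module needs:

pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--xentropy" ./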

Containers

NVIDIA PyTorch Containers are available on NGC: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch. The containers come with all the custom extensions available at the moment.

See the NGC documentation for details such as:

  • how to pull a container
  • how to run a pulled container
  • release notes

From Source

To install Apex from source, we recommend using the nightly PyTorch build obtainable from https://github.com/pytorch/pytorch.

The latest stable release obtainable from https://pytorch.org should also work.

We recommend installing Ninja to make compilation faster.
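For example:

pip install ninja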

Linux

For performance and full functionality, we recommend installing Apex with CUDA and C++ extensions via

git clone https://github.com/NVIDIA/apex
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key... 
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Apex also supports a Python-only build via

pip install -v --disable-pip-version-check --no-build-isolation --no-cache-dir ./

A Python-only build omits:

  • Fused kernels required to use apex.optimizers.FusedAdam.
  • Fused kernels required to use apex.normalization.FusedLayerNorm and apex.normalization.FusedRMSNorm.
  • Fused kernels that improve the performance and numerical stability of apex.parallel.SyncBatchNorm.
  • Fused kernels that improve the performance of apex.parallel.DistributedDataParallel and apex.amp. DistributedDataParallel, amp, and SyncBatchNorm will still be usable, but they may be slower.

[Experimental] Windows

pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" . may work if you were able to build PyTorch from source on your system. A Python-only build via pip install -v --no-cache-dir . is more likely to work.
If you installed PyTorch in a conda environment, make sure to install Apex in that same environment.

Custom C++/CUDA Extensions and Install Options

If a requirement of a module is not met, then it will not be built.

Module Name Install Option Misc
apex_C --cpp_ext
amp_C --cuda_ext
syncbn --cuda_ext
fused_layer_norm_cuda --cuda_ext apex.normalization
mlp_cuda --cuda_ext
scaled_upper_triang_masked_softmax_cuda --cuda_ext
generic_scaled_masked_softmax_cuda --cuda_ext
scaled_masked_softmax_cuda --cuda_ext
fused_weight_gradient_mlp_cuda --cuda_ext Requires CUDA>=11
permutation_search_cuda --permutation_search apex.contrib.sparsity
bnp --bnp apex.contrib.groupbn
xentropy --xentropy apex.contrib.xentropy
focal_loss_cuda --focal_loss apex.contrib.focal_loss
fused_index_mul_2d --index_mul_2d apex.contrib.index_mul_2d
fused_adam_cuda --deprecated_fused_adam apex.contrib.optimizers
fused_lamb_cuda --deprecated_fused_lamb apex.contrib.optimizers
fast_layer_norm --fast_layer_norm apex.contrib.layer_norm. different from fused_layer_norm
fmhalib --fmha apex.contrib.fmha
fast_multihead_attn --fast_multihead_attn apex.contrib.multihead_attn
transducer_joint_cuda --transducer apex.contrib.transducer
transducer_loss_cuda --transducer apex.contrib.transducer
cudnn_gbn_lib --cudnn_gbn Requires cuDNN>=8.5, apex.contrib.cudnn_gbn
peer_memory_cuda --peer_memory apex.contrib.peer_memory
nccl_p2p_cuda --nccl_p2p Requires NCCL >= 2.10, apex.contrib.nccl_p2p
fast_bottleneck --fast_bottleneck Requires peer_memory_cuda and nccl_p2p_cuda, apex.contrib.bottleneck
fused_conv_bias_relu --fused_conv_bias_relu Requires cuDNN>=8.4, apex.contrib.conv_bias_relu
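To check after installation whether a particular extension was actually built, one can try importing it by the module name in the first column. A minimal sketch (the three module names below are taken from the table above):

import importlib.util

# Each successfully built extension is importable as a top-level module.
for ext in ("apex_C", "amp_C", "fused_layer_norm_cuda"):
    built = importlib.util.find_spec(ext) is not None
    print(ext, "built" if built else "not built")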

Contributors

a-maci, aidyn-a, alpha0422, carlc-nv, cbcase, crcrpar, csarofeen, definitelynotmcarilli, ekrimer, eqy, erhoo82, fdecayed, fuzzkatt, jjsjann123, jpool-nv, kevinstephano, kexinyu, mcarilli, minitu, mkolod, nweidia, ptrblck, seryilmaz, slayton58, syed-ahmed, thorjohnsen, timmoon10, xwang233, yaox12, yjk21


Issues

FP16 about input and loss?

I have two questions about how to train a network correctly with fp16.

First, in main_fp16_optimizer.py, the input is converted with .half() in data_prefetcher(), and the model with model = network_to_half(model). Is input.half() necessary? #58

train_dataset = datasets.ImageFolder(
        traindir,
        transforms.Compose([
            transforms.RandomResizedCrop(crop_size),
            transforms.RandomHorizontalFlip(),
            # transforms.ToTensor(), Too slow
            # normalize,
        ]))

Second, should we be concerned about the operations in the criterion (loss function), which may be more complicated, such as the loss functions in object detection and segmentation?

if args.fp16:
    optimizer.backward(loss)

FP16_Optimizer has no way to "retain_graph=True"

I need to call "backward" multiple times, but when I do, I get:
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

Of course, when I try to pass

self.optimizer.backward(loss,retain_graph=True)

I get:
TypeError: backward() got an unexpected keyword argument 'retain_graph'

Trouble building with cuda_ext

Getting this error when trying to build with --cuda_ext. I'm on a GTX 1060 with PyTorch 1.0, gcc version 4.9.4 (Ubuntu 4.9.4-2ubuntu1)

torch.__version__  =  1.0.0
running install
running bdist_egg
running egg_info
writing apex.egg-info/PKG-INFO
writing dependency_links to apex.egg-info/dependency_links.txt
writing top-level names to apex.egg-info/top_level.txt
reading manifest file 'apex.egg-info/SOURCES.txt'
writing manifest file 'apex.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
building 'syncbn' extension
gcc -pthread -B /home/chang/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/chang/anaconda3/lib/python3.7/site-packages/torch/lib/include -I/home/chang/anaconda3/lib/python3.7/site-packages/torch/lib/include/torch/csrc/api/include -I/home/chang/anaconda3/lib/python3.7/site-packages/torch/lib/include/TH -I/home/chang/anaconda3/lib/python3.7/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/home/chang/anaconda3/include/python3.7m -c csrc/syncbn.cpp -o build/temp.linux-x86_64-3.7/csrc/syncbn.o -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=syncbn -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
/usr/local/cuda/bin/nvcc -I/home/chang/anaconda3/lib/python3.7/site-packages/torch/lib/include -I/home/chang/anaconda3/lib/python3.7/site-packages/torch/lib/include/torch/csrc/api/include -I/home/chang/anaconda3/lib/python3.7/site-packages/torch/lib/include/TH -I/home/chang/anaconda3/lib/python3.7/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/home/chang/anaconda3/include/python3.7m -c csrc/welford.cu -o build/temp.linux-x86_64-3.7/csrc/welford.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --compiler-options '-fPIC' -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=syncbn -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
csrc/welford.cu(82): error: identifier "__shfl_down_sync" is undefined
          detected during:
            instantiation of "void welford_reduce_mean_m2n(T *, int *, T &, T &, int &, int, int) [with T=at::acc_type<double, true>]" 
(184): here
            instantiation of "void welford_kernel<scalar_t,accscalar_t,outscalar_t>(const scalar_t *, outscalar_t *, outscalar_t *, outscalar_t *, int, int, int) [with scalar_t=double, accscalar_t=at::acc_type<double, true>, outscalar_t=at::acc_type<double, true>]" 
(364): here

csrc/welford.cu(49): error: identifier "__shfl_down_sync" is undefined
          detected during:
            instantiation of "T warp_reduce_sum(T) [with T=at::acc_type<double, true>]" 
(60): here
            instantiation of "T reduce_block(T *, T) [with T=at::acc_type<double, true>]" 
(268): here
            instantiation of "void reduce_bn_kernel(const scalar_t *, const scalar_t *, const accscalar_t *, const accscalar_t *, accscalar_t *, accscalar_t *, layerscalar_t *, layerscalar_t *, int, int, int, float) [with scalar_t=double, accscalar_t=at::acc_type<double, true>, layerscalar_t=at::acc_type<double, true>]" 
(460): here

csrc/welford.cu(49): error: identifier "__shfl_down_sync" is undefined
          detected during:
            instantiation of "T warp_reduce_sum(T) [with T=at::acc_type<float, true>]" 
(60): here
            instantiation of "T reduce_block(T *, T) [with T=at::acc_type<float, true>]" 
(268): here
            instantiation of "void reduce_bn_kernel(const scalar_t *, const scalar_t *, const accscalar_t *, const accscalar_t *, accscalar_t *, accscalar_t *, layerscalar_t *, layerscalar_t *, int, int, int, float) [with scalar_t=float, accscalar_t=at::acc_type<float, true>, layerscalar_t=at::acc_type<float, true>]" 
(460): here

3 errors detected in the compilation of "/tmp/tmpxft_00002f5a_00000000-7_welford.cpp1.ii".
error: command '/usr/local/cuda/bin/nvcc' failed with exit status 2

Would apex still be useful for non-Volta architectures?

I was looking into the library, and it seems that the assumption is that the GPU is a Volta architecture.

This link shows some benchmarks for fp16 training and inference, and the 1080 Ti doesn't gain that much performance from fp16.

Would it be useful to apply this library for GPUs besides Titan V and V100?

Dockerfile doesn't work

Hi,

I noticed that the URL in the Dockerfile is not accessible:

FROM gitlab-dl.nvidia.com:5005/dgx/pytorch:18.04-py3-devel

Could you help to fix it? Thanks

Best,
Vincent

Error with latest pytorch head

I've been using apex for a few months now with PyTorch 0.4.1. Now I'm trying to use the latest apex head together with the latest PyTorch head (2nd November), and I'm getting the following error when trying to run /apex/examples/imagenet/main_fp16_optimizer.py:

RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

This is on line 326 when the backward pass is called:
optimizer.backward(loss)

Standard float training works fine.

I know I should be using a stable PyTorch version, but I saw there was a recently fixed issue with CUDA extension compilation. Any ideas? I also couldn't find any mention of apex being updated for PyTorch 1.0 yet.

Also, I don't see any releases on this Git repository.

I do not think you need to preprocess train images to 256x256

"Train images are expected to be 256x256 jpegs."

I'm pretty sure the validation images are 256x256; that's what the PyTorch examples page says and what the code does. But for PyTorch training you feed the training images raw from the ImageNet download, unzipped into folders with the category as the folder name. Unlike with, say, MXNet and Caffe2, the raw ImageNet images are not preprocessed in any way. If they do need to be 256x256, can you provide the script that handles that?

amp examples

A full amp example would be useful. It would help answer questions like:

  • Do we need to call ".half()" on the model?
  • Do we need to call init() before the model is built?
  • Enabling amp seems to slow training down; why might this be?

Illegal memory access with latest PyTorch/Apex

Hi,

I'm trying to train a model through apex using the latest PyTorch and Apex masters, but every forward call ends up with the following error:

  File "main.py", line 348, in <module>
    metrics = train(batch, args.fp16)
  File "main.py", line 88, in train
    scaled_loss.backward()
  File "/nfs/project/mr/miniconda/envs/machine_reading/lib/python3.7/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/nfs/project/mr/miniconda/envs/machine_reading/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

The full log with CUDNN_LOGINFO_DBG flag set is attached below.
It's using CUDNN 7.3.1 / CUDA 9.2

100936.txt

Errors during compilation

When trying to compile with "python setup.py install --cuda_ext --cpp_ext", the first stage throws errors for missing PyTorch libraries: c10, ATen, torch/extension.h. This part can possibly be fixed by manually adding these files from the PyTorch repo.
The next issue is:

"/.../lib/python3.6/site-packages/torch/lib/include/ATen/Error.h:105:0: note: this is the location of the previous definition
#define AT_CHECK(cond, ...) \
^
error: command '/usr/local/cuda/bin/nvcc' failed with exit status 1

No information about the error is available, even when using strace on the compilation command.
environment details:
gcc/g++ : 4.8.5
cuda 9, pytorch 0.4.1 (tested also with 0.4.0), python3.6.
OS: RHEL7.4 , using conda environment
arch: ppc64 (PowerAI 9)

Same issues confirmed on x86_64 - intel

SyncBN in AMP?

How can I implement SyncBN with AMP? Should I use the following code in main_fp16_optimizer.py?

import apex
model = apex.parallel.convert_syncbn_model(model)

from apex.parallel import DistributedDataParallel as DDP
model = DDP(model, delay_allreduce=True)

If so, can DDP be replaced with the original DataParallel if I train the net on only one machine?

AttributeError: 'tuple' object has no attribute 'log_softmax' when running inception_v3

python main.py -a inception_v3 --epoch 5 -b 64 /workspace/imagenet/ --fp16
=> creating model 'inception_v3'
Traceback (most recent call last):
  File "main.py", line 466, in <module>
    main()
  File "main.py", line 212, in main
    train(train_loader, model, criterion, optimizer, epoch)
  File "main.py", line 293, in train
    loss = criterion(output, target_var)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 779, in forward
    self.ignore_index, self.reduce)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/functional.py", line 1454, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, size_average, ignore_index, reduce)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/functional.py", line 946, in log_softmax
    return input.log_softmax(dim)
AttributeError: 'tuple' object has no attribute 'log_softmax'

Note:

In [2]: print(torch.__version__)
0.5.0a0

Install Error: support PyTorch 0.4.1?

I've recently upgraded to PyTorch 0.4.1. When I try to install Apex, it gives the following error:

$ python setup.py install
Traceback (most recent call last):
  File "setup.py", line 1, in <module>
    import torch
  File "/home/yuduo/anaconda3/lib/python3.6/site-packages/torch/__init__.py", line 80, in <module>
    from torch._C import *
ImportError: /home/yuduo/anaconda3/lib/python3.6/site-packages/torch/lib/libshm.so: undefined symbol: _ZTI24THRefcountedMapAllocator

Get amp handler in a more decent way

Usually, the location of amp.init() is far from loss.backward(). While it is possible to pass the handler as a parameter to the function that calls loss.backward(), it is not very elegant. I wonder if we could do something like:

import torch
import apex
apex.amp.init()

........


def backward(loss):
    with apex.amp.get_default_handler().scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()

Nan when using torch.mean

Hi, I am writing a LayerNorm using torch.mean().
My PyTorch version is 1.0.0a0+505dedf.
This is my code:

class LayerNorm(nn.Module):
    def __init__(self, num_features, eps=1e-5, affine=True, fp16=True):
        super(LayerNorm, self).__init__()
        self.num_features = num_features
        self.affine = affine
        self.eps = eps
        self.fp16 = fp16
        if self.affine:
            self.gamma = nn.Parameter(torch.Tensor(num_features).uniform_())
            self.beta = nn.Parameter(torch.zeros(num_features))
            if self.fp16:
                self.gamma = nn.Parameter(torch.Tensor(num_features).uniform_().half())
                self.beta = nn.Parameter(torch.zeros(num_features).half())
                #self.eps = np.float16(self.eps)
    def forward(self, x):
        shape = [-1] + [1] * (x.dim() - 1)
        print(x.view(-1))
        print(torch.mean(x.view(-1)) )
        mean = x.view(-1).mean().view(*shape)
        std = x.view(-1).std().view(*shape)
        x = (x - mean) / (std + self.eps)
        exit()
        if self.affine:
            shape = [1, -1] + [1] * (x.dim() - 2)
            x = x * self.gamma.view(*shape) + self.beta.view(*shape)
        return x

The output is

tensor([-11.0703,   3.6230,  -0.1460,  ...,   0.7358, -10.4688,  -9.3984],
       device='cuda:0', dtype=torch.float16, grad_fn=<ViewBackward>)
tensor(nan, device='cuda:0', dtype=torch.float16, grad_fn=<MeanBackward1>)

I notice the result turns to NaN when I use torch.mean(). Do you have any suggestions?

Is instance norm supported?

I got the following error when nn.InstanceNorm2d is used in the network:

RuntimeError: Expected object of type torch.cuda.HalfTensor but found type torch.cuda.FloatTensor for argument #4 'running_mean'

Any suggestion on how to get around this?

Error when amp_handle = amp.init()

I used the following code

import torch
from apex import amp
amp_handle = amp.init()

and "Floating point exception (core dumped)" occured

TypeError: OptimWrapper is not an Optimizer

Hello,

I have tried to implement amp by wrapping a pre-existing model:

https://github.com/neptune-ml/open-solution-mapping-challenge

I've scoured the source code and am pretty sure there is only one optimizer, so I first tried enabling amp and wrapping backpropagation as instructed, but after that training ran only about 75-80% as fast as it had before. So I decided to try explicitly wrapping the optimizer:

    self.optimizer = optim.Adam(self.weight_regularization(self.model, **architecture_config['regularizer_params']),
                                **architecture_config['optimizer_params'])
    #Initializing amp
    amp_handle = amp.init()
    #Wrapping self.optimizer
    self.optimizer = amp_handle.wrap_optimizer(self.optimizer)
    self.loss_function = None
    self.callbacks = callbacks_unet(self.callbacks_config)

And then at the backprop:

    with self.optimizer.scale_loss(batch_loss) as scaled_loss:
        scaled_loss.backward()
    self.optimizer.step()

However, I get the following error:

2018-10-10 17-24-18 steps >>> step unet fitting and transforming...

Traceback (most recent call last):
  File "main.py", line 93, in <module>
    main()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "main.py", line 31, in train
    pipeline_manager.train(pipeline_name, dev_mode)
  File "/ebs/osmc/src/pipeline_manager.py", line 32, in train
    train(pipeline_name, dev_mode, self.logger, self.params, self.seed)
  File "/ebs/osmc/src/pipeline_manager.py", line 116, in train
    pipeline.fit_transform(data)
  File "/ebs/osmc/src/steps/base.py", line 106, in fit_transform
    step_inputs[input_step.name] = input_step.fit_transform(data)
  File "/ebs/osmc/src/steps/base.py", line 106, in fit_transform
    step_inputs[input_step.name] = input_step.fit_transform(data)
  File "/ebs/osmc/src/steps/base.py", line 106, in fit_transform
    step_inputs[input_step.name] = input_step.fit_transform(data)
  [Previous line repeated 3 more times]
  File "/ebs/osmc/src/steps/base.py", line 112, in fit_transform
    return self._cached_fit_transform(step_inputs)
  File "/ebs/osmc/src/steps/base.py", line 123, in _cached_fit_transform
    step_output_data = self.transformer.fit_transform(**step_inputs)
  File "/ebs/osmc/src/steps/base.py", line 262, in fit_transform
    self.fit(*args, **kwargs)
  File "/ebs/osmc/src/models.py", line 76, in fit
    self.callbacks.set_params(self, validation_datagen=validation_datagen, meta_valid=meta_valid)
  File "/ebs/osmc/src/steps/pytorch/callbacks.py", line 76, in set_params
    callback.set_params(*args, **kwargs)
  File "/ebs/osmc/src/steps/pytorch/callbacks.py", line 222, in set_params
    self.lr_scheduler = ExponentialLR(self.optimizer, self.gamma, last_epoch=-1)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/optim/lr_scheduler.py", line 178, in __init__
    super(ExponentialLR, self).__init__(optimizer, last_epoch)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/optim/lr_scheduler.py", line 13, in __init__
    type(optimizer).__name__))
TypeError: OptimWrapper is not an Optimizer

Did I make a stupid mistake somewhere? Did I forget to do something? Or is something about this model incompatible with apex?

PyTorch model with multiple inputs

There is a bug that prevents a model from taking multiple inputs through the forward function after applying the network_to_half function.
I made a model with 2 input parameters, and it works fine without network_to_half. After applying it, however, I get the following error:
forward() takes 2 positional arguments but 3 were given

RuntimeError: Tensor: invalid storage offset

This error is raised when the weights of an RNN are not part of a single contiguous chunk of memory. In pure PyTorch this is just a warning, but with apex it fails:

apex/apex/amp/utils.py

Lines 177 to 188 in 4212b3e

def new_synthesize_flattened_rnn_weights(fp32_weights,
                                         fp16_flat_tensor,
                                         rnn_fn='',
                                         verbose=False):
    fp16_weights = []
    fp32_base_ptr = fp32_weights[0].data_ptr()
    for w_fp32 in fp32_weights:
        w_fp16 = w_fp32.new().half()
        offset = (w_fp32.data_ptr() - fp32_base_ptr) // w_fp32.element_size()
        w_fp16.set_(fp16_flat_tensor.storage(),
                    offset,
                    w_fp32.shape)

(the offsets may become negative, which are invalid)

While I am not sure why this (weights not being part of a single contiguous chunk of memory) happens in PyTorch, a simple workaround is to call rnn.flatten_parameters() before each forward call.
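A minimal sketch of that workaround (the LSTM sizes are arbitrary placeholders):

import torch

rnn = torch.nn.LSTM(input_size=16, hidden_size=32, num_layers=2).cuda().half()
x = torch.randn(5, 4, 16, device="cuda", dtype=torch.float16)

# Re-compact the RNN weights into one contiguous chunk before the forward
# call, so the fp16 weight views don't end up with invalid storage offsets.
rnn.flatten_parameters()
out, _ = rnn(x)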

resnet50 doesn't converge when running example/imagenet/main.py on imagenet dataset with fp16

I want to use example/imagenet/main.py to train a resnet50 model on the ImageNet dataset with fp16, but the accuracy doesn't converge. BTW, without fp16 I get the correct top-1 accuracy of 76%.

my command is:

python -m torch.distributed.launch --nproc_per_node=8 main.py --fp16 --arch resnet50 --epochs 90 --workers 6 --batch-size=256 /imagenet

  • Python version: 3.6.2

  • PyTorch version: 0.4.1

  • torchvision version: 0.2.1

  • OS: Ubuntu 16.04.3 LTS

  • Nvidia driver version: 390.46

  • CUDA runtime version: 9.0

  • GPU number: 8

  • GPU model: Tesla P100-PCIE

The validation accuracy suddenly falls to 0 after about 7 epochs.
The training accuracy suddenly falls to 0 after about 17 epochs.

I saved the model's gradients (para.grad) each epoch. I found that at epoch 17 the distribution of the model's parameters (para.data) is normal, but 84.5% of the gradient values (para.grad) are NaN.

The accuracy results are as follows:

train validate
epoch Top1 Top5 Loss epoch Top1 Top5 Loss
0 3.166 9.401 6.0748 0 3.054 9.308 12.1711
1 15.428 34.406 4.4438 1 18.206 39.778 4.1692
2 26.628 50.356 3.6012 2 29.108 54.764 3.3741
3 34.069 59.227 3.1205 3 30.938 56.796 3.2538
4 37.787 63.202 2.8991 4 29.46 55.652 3.3615
5 40.33 65.834 2.7536 5 12.982 30.574 5.1418
6 42.476 67.836 2.6325 6 0.428 1.608 8.4904
7 43.851 69.086 2.5574 7 0.1 0.502 8.2962
8 44.888 70.058 2.5005 8 0.1 0.49 15.8809
9 45.692 70.684 2.4588 9 0.1 0.5 83.5319
10 46.378 71.274 2.4261 10 0.104 0.496 184.0083
11 46.66 71.618 2.4065 11 0.1 0.504 210.9373
12 46.938 71.805 2.3928 12 0.1 0.5 585.1285
13 47.039 71.931 2.3873 13 0.1 0.5 2283.96
14 46.974 71.87 2.393 14 0 0.006 1612.295
15 46.667 71.499 2.4104 15 0.002 0.006 7.0508
16 46.273 71.155 2.4337 16 0.002 0.006 1554.635
17 16.3 25.251 5.3414 17 0.1 0.5 8.9353
18 0.096 0.482 6.9067 18 0.1 0.5 7.0235
19 0.095 0.485 6.9068 19 0.1 0.5 6.911
20 0.097 0.488 6.9068 20 0.1 0.5 6.9091
21 0.094 0.491 6.9067 21 0.1 0.5 6.9086
22 0.094 0.487 6.9066 22 0.1 0.5 6.9085
23 0.095 0.478 6.9066 23 0.1 0.5 6.9085
24 0.101 0.491 6.9067 24 0.1 0.5 6.9082
25 0.098 0.487 6.9067 25 0.1 0.5 6.9083
26 0.097 0.483 6.9068 26 0.1 0.5 6.908
27 0.099 0.485 6.9067 27 0.1 0.5 6.9082
28 0.091 0.489 6.9067 28 0.1 0.5 6.9085
29 0.097 0.489 6.9067 29 0.1 0.5 6.9083
30 0.1 0.503 6.9065 30 0.1 0.5 6.908
31 0.1 0.496 6.9063 31 0.1 0.5 6.9078
32 0.098 0.487 6.9063 32 0.1 0.5 6.908
33 0.092 0.472 6.9063 33 0.1 0.5 6.9078
34 0.092 0.469 6.9063 34 0.1 0.5 6.9078
35 0.095 0.461 6.9063 35 0.1 0.5 6.9078
36 0.093 0.463 6.9063 36 0.1 0.5 6.9078
37 0.086 0.459 6.9062 37 0.1 0.5 6.908
38 0.089 0.467 6.9063 38 0.1 0.5 6.9078
39 0.092 0.469 6.9063 39 0.1 0.5 6.908
40 0.092 0.461 6.9063 40 0.1 0.5 6.908
41 0.095 0.459 6.9063 41 0.1 0.5 6.9078
42 0.09 0.46 6.9063 42 0.1 0.5 6.9078
43 0.09 0.461 6.9063 43 0.1 0.5 6.908
44 0.093 0.463 6.9063 44 0.1 0.5 6.908
45 0.094 0.464 6.9063 45 0.1 0.5 6.9078
46 0.09 0.457 6.9063 46 0.1 0.5 6.9078
47 0.091 0.466 6.9063 47 0.1 0.5 6.908
48 0.089 0.465 6.9063 48 0.1 0.5 6.9078
49 0.09 0.449 6.9063 49 0.1 0.5 6.9078
50 0.094 0.46 6.9063 50 0.1 0.5 6.9078
51 0.094 0.464 6.9063 51 0.1 0.5 6.908
52 0.092 0.473 6.9063 52 0.1 0.5 6.9078
53 0.094 0.462 6.9063 53 0.1 0.5 6.9078
54 0.088 0.468 6.9063 54 0.1 0.5 6.9078
55 0.091 0.453 6.9063 55 0.1 0.5 6.9078
56 0.091 0.45 6.9064 56 0.1 0.5 6.9078
57 0.093 0.472 6.9063 57 0.1 0.5 6.9077
58 0.09 0.455 6.9063 58 0.1 0.5 6.9077
59 0.091 0.464 6.9063 59 0.1 0.5 6.9078
60 0.099 0.491 6.9063 60 0.1 0.5 6.9078
61 0.101 0.499 6.9063 61 0.1 0.5 6.908
62 0.1 0.496 6.9063 62 0.1 0.5 6.908
63 0.1 0.487 6.9063 63 0.1 0.5 6.9082
64 0.098 0.483 6.9063 64 0.1 0.5 6.9082
65 0.094 0.461 6.9064 65 0.1 0.5 6.9082
66 0.091 0.467 6.9063 66 0.1 0.5 6.9082
67 0.093 0.466 6.9064 67 0.1 0.5 6.9082
68 0.097 0.471 6.9063 68 0.1 0.5 6.9082
69 0.088 0.461 6.9064 69 0.1 0.5 6.9082
70 0.093 0.459 6.9063 70 0.1 0.5 6.9082
71 0.096 0.473 6.9064 71 0.1 0.5 6.9083
72 0.092 0.471 6.9064 72 0.1 0.5 6.9082
73 0.095 0.464 6.9064 73 0.1 0.5 6.9083
74 0.092 0.464 6.9063 74 0.1 0.5 6.9083
75 0.09 0.462 6.9064 75 0.1 0.5 6.9083
76 0.093 0.467 6.9064 76 0.1 0.5 6.9083
77 0.091 0.467 6.9064 77 0.1 0.5 6.9083
78 0.092 0.455 6.9064 78 0.1 0.5 6.9083
79 0.09 0.459 6.9064 79 0.1 0.5 6.9082
80 0.095 0.493 6.9064 80 0.1 0.5 6.9082
81 0.094 0.486 6.9064 81 0.1 0.5 6.9082
82 0.099 0.487 6.9064 82 0.1 0.5 6.9082
83 0.094 0.498 6.9064 83 0.1 0.5 6.9082
84 0.096 0.492 6.9064 84 0.1 0.5 6.9082
85 0.097 0.487 6.9064 85 0.1 0.5 6.9083
86 0.096 0.492 6.9064 86 0.1 0.5 6.9082
87 0.1 0.493 6.9065 87 0.1 0.5 6.9083
88 0.099 0.482 6.9064 88 0.1 0.5 6.9083
89 0.097 0.498 6.9064 89 0.1 0.5 6.9082

When I got the wrong result, I ran main.py on another server with two V100s. I only used one, because using two hit the problem described in PyTorch issue 11327. But that does not affect my training...

python main.py --fp16 --arch resnet50 --epochs 90 --workers 6 --batch-size=256 /imagenet

Then I got the same wrong result, very similar to the results above.

I want to know whether you have tested main.py on the ImageNet dataset with fp16 and got good accuracy, like the top-1 = 76% described in the paper MIXED PRECISION TRAINING.

PyTorch DistributedDataParallel incompatibility

Is it known/intentional that the FP16 Optimizer does not work with PyTorch's built-in DistributedDataParallel? Or is there some subtlety to getting it to work? I need to use the new DistributedDataParallel with the new c10d backend for the work I'm doing, and I would like to be able to use Apex with that.

loss.backward() in apex.amp ?

Following the example for handling multiple backward passes:

amp_handle = amp.init()
optimizer = amp_handle.wrap_optimizer(optimizer, num_loss=2)
# ...
optimizer.zero_grad()
loss1 = ComputeLoss1(model)
with optimizer.scale_loss(loss1) as scaled_loss:
    scaled_loss.backward()
# ...
loss2 = ComputeLoss2(model)
with optimizer.scale_loss(loss2) as scaled_loss:
    scaled_loss.backward()
# ...
optimizer.step()

Can we first add the losses together and then apply the wrapper?

loss = loss1 + loss2
with optimizer.scale_loss(loss) as scaled_loss:
    scaled_loss.backward()

AMP supported hardware

Hello!

It's not clear which hardware benefits most from apex. I'm using AMP on a Tesla K80 to train my model, and the training actually slowed down by ~1.3x. The mixed precision user guide says the framework should support Volta Tensor Core math and only mentions the Tesla V100, so is the Tesla V100 the best hardware for mixed precision training?

Learning Scheduler

Essentially, I want to use a learning rate scheduler. Typically the syntax for that is:

scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda_rule)

where I am using the LambdaLR rule. However, when the optimizer is an FP16_Optimizer, this throws an error:

TypeError: FP16_Optimizer is not an Optimizer

This makes total sense. If you look at the scheduler source, the base class contains this piece of code:

        if not isinstance(optimizer, Optimizer):
            raise TypeError('{} is not an Optimizer'.format(
                type(optimizer).__name__))

Now, my questions are:

  1. Is there already a way of dealing with this? I am probably not the first one to have this problem.
  2. If not, what would be the best way to implement schedulers for FP16_Optimizer? Copy the code from torch.optim and change it to work with FP16_Optimizer? (One workaround is sketched below.)
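One workaround that has been used, sketched here under the assumption that FP16_Optimizer exposes the wrapped torch.optim optimizer as its .optimizer attribute (worth verifying against your apex version):

import torch
from torch.optim import lr_scheduler
from apex.fp16_utils import FP16_Optimizer  # deprecated API

model = torch.nn.Linear(10, 10).cuda().half()
inner = torch.optim.SGD(model.parameters(), lr=0.1)
optimizer = FP16_Optimizer(inner, dynamic_loss_scale=True)

# Schedule the wrapped optimizer, which passes the isinstance check,
# instead of the FP16_Optimizer itself (assumes .optimizer exists).
scheduler = lr_scheduler.LambdaLR(optimizer.optimizer,
                                  lr_lambda=lambda epoch: 0.95 ** epoch)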

Training deadlock

There is an issue in the implementation of DistributedDataParallel that triggers a deadlock of processes.
Specifically, in the method flat_dist_call, there is a for loop over a dictionary with calls to collective operations (like broadcasting) in the body. Since the ordering of the dictionary's keys is random, we obtain non-matching calls to the collective operations, which induce a deadlock of the processes.
I have fixed this issue and created a pull request.

position of amp.init()

Hi, thanks for sharing this great project!

I'd like to ask about the position of amp.init(). Does it matter whether it is called at the start of the script? For example, can I call amp.init() after all nn.Module instances have been created?
Since there is no relation between the amp_handle and the model, I'm a little confused. In the case of apex.fp16_utils, it looks like I have to call model = network_to_half(model) before the training loop, but there is no such step in the case of amp.

Thank you again!
Jin

Windows --cuda_ext build fails due to missing canUse32BitIndexMath

Latest MSVC 2017 update, CUDA 10.0.130, PyTorch 1.0 release with python 3.6, apex from master branch.

   Creating library build\temp.win-amd64-3.6\Release\apex/optimizers/csrc\fused_adam_cuda.cp36-win_amd64.lib and object build\temp.win-amd64-3.6\Release\apex/optimizers/csrc\fused_adam_cuda.cp36-win_amd64.exp
fused_adam_cuda_kernel.obj : error LNK2001: unresolved external symbol "bool __cdecl at::cuda::detail::canUse32BitIndexMath(class at::Tensor const &,__int64)" (?canUse32BitIndexMath@detail@cuda@at@@YA_NAEBVTensor@3@_J@Z)
build\lib.win-amd64-3.6\fused_adam_cuda.cp36-win_amd64.pyd : fatal error LNK1120: 1 unresolved externals
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2017\\Community\\VC\\Tools\\MSVC\\14.16.27023\\bin\\HostX86\\x64\\link.exe' failed with exit status 1120

Warning spam when extensions are missing is excessive

Issue #96 does not concern me all that much, but during training I can hardly see the loss numbers in the repeated spam of things like

Warning:  apex was installed without --cuda_ext.  FusedAdam will be unavailable.
Warning:  apex was installed without --cuda_ext.  FusedLayerNorm will be unavailable.

I'm not sure how useful this warning is. But for sure, giving it more than once serves no purpose.

ZeroDivisionError in backward

Hi, I am having an error when I implement the amp procedure on a working CNN like this:

self.optimizer.zero_grad()

outputs = self.model(maps)

loss = self.criterion(outputs, labels.float())

# add automatic mixed precision support from apex
with self.amp_handle.scale_loss(loss, self.optimizer) as scaled_loss:
    scaled_loss.backward()

self.optimizer.step()
And here is the error I get:

    scaled_loss.backward()
  File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.5/dist-packages/apex-0.1-py3.5-linux-x86_64.egg/apex/amp/handle.py", line 53, in scale_loss
    optimizer.param_groups, loss_scale)
  File "/usr/local/lib/python3.5/dist-packages/apex-0.1-py3.5-linux-x86_64.egg/apex/amp/scaler.py", line 21, in unscale_and_update
    1. / scale,
ZeroDivisionError: float division by zero

Any suggestion would be appreciated.

AMP Checkpointing

Assuming it differs from normal pytorch usage, would it be possible to provide an example of the steps required to save and load model checkpoints with amp? Are there any specific considerations to take into account (especially when using DDP)?

Warning: apex was installed without --cuda_ext.

I installed apex with:

python setup.py install --cuda_ext --cpp_ext

After that, I ran

import apex

to test it, but it reports the following warnings:
Warning: apex was installed without --cuda_ext. Fused syncbn kernels will be unavailable. Python fallbacks will be used instead.
Warning: apex was installed without --cuda_ext. FusedAdam will be unavailable.
Warning: apex was installed without --cuda_ext. FusedLayerNorm will be unavailable.

Is there a problem?

installation issue

I am really excited about trying this, but every time I try installing, I get the following error:

torch.__version__ = 0.5.0a0+03e7953
Found CUDA_HOME = C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.2
Traceback (most recent call last):
  File "setup.py", line 105, in <module>
    CUDA_MAJOR = get_cuda_version()
  File "setup.py", line 85, in get_cuda_version
    re.compile('nvcc$').search)
  File "setup.py", line 38, in find
    return list(set(collection))
TypeError: 'NoneType' object is not iterable

atex/amex?

Will there be a TensorFlow/MXNet extension of all this awesome work?

pip uninstall apex: FileNotFoundError: [Errno 2] No such file or directory: '/.../apex-0.1-py3.6.egg

I used conda to create a Python 3.6 env and installed the latest master apex with python setup.py install.
When I run pip uninstall apex, I get this error:

Uninstalling apex-0.1:
  /xxx/apex-0.1-py3.6.egg
Proceed (y/n)? y
  Successfully uninstalled apex-0.1
Traceback (most recent call last):
  File "/xxx/bin/pip", line 6, in <module>
    sys.exit(pip.main())
  File "/xxx/pip/__init__.py", line 249, in main
    return command.main(cmd_args)
  File "/xxx/pip/basecommand.py", line 252, in main
    pip_version_check(session)
  File "/xxx/pip/utils/outdated.py", line 102, in pip_version_check
    installed_version = get_installed_version("pip")
  File "/xxx/pip/utils/__init__.py", line 838, in get_installed_version
    working_set = pkg_resources.WorkingSet()
  File "/xxx/pip/_vendor/pkg_resources/__init__.py", line 644, in __init__
    self.add_entry(entry)
  File "/xxx/pip/_vendor/pkg_resources/__init__.py", line 700, in add_entry
    for dist in find_distributions(entry, True):
  File "/xxx/pip/_vendor/pkg_resources/__init__.py", line 1949, in find_eggs_in_zip
    if metadata.has_metadata('PKG-INFO'):
  File "/xxx/pip/_vendor/pkg_resources/__init__.py", line 1463, in has_metadata
    return self.egg_info and self._has(self._fn(self.egg_info, name))
  File "/xxx/pip/_vendor/pkg_resources/__init__.py", line 1823, in _has
    return zip_path in self.zipinfo or zip_path in self._index()
  File "/xxx/pip/_vendor/pkg_resources/__init__.py", line 1703, in zipinfo
    return self._zip_manifests.load(self.loader.archive)
  File "/xxx/pip/_vendor/pkg_resources/__init__.py", line 1643, in load
    mtime = os.stat(path).st_mtime
FileNotFoundError: [Errno 2] No such file or directory: '/xxx/apex-0.1-py3.6.egg'

negligible performance gains and non-convergence on DCGAN using apex (what to change?)

I bought an RTX 2070 with the goal of training my DCGAN in fp16 for bigger and faster models. After carefully adjusting my models and trying vanilla model.half() without apex, AMP, and FP16_Optimizer, I'm not too convinced by the results. Maybe I did something wrong?

The architecture:

        # Loss function:
        criterion = nn.BCELoss()


       # Generator
       "512px output": (
        nn.Sequential(
        # Input Z (100x1x1)
        nn.ConvTranspose2d(nz, ngf * 64, 4, 1, 0, bias=False),
        nn.BatchNorm2d(ngf * 64),
        nn.LeakyReLU(negative_slope=0.2, inplace=True),
        # 4x4x(ngf*64)

        nn.ConvTranspose2d(ngf * 64, ngf * 32, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ngf * 32),
        nn.LeakyReLU(negative_slope=0.2, inplace=True),
        # 8x8x(ngf*32)

        nn.ConvTranspose2d(ngf * 32, ngf * 16, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ngf * 16),
        nn.LeakyReLU(negative_slope=0.2, inplace=True),
        # 16x16x(ngf*16)

        nn.ConvTranspose2d(ngf * 16, ngf * 8, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ngf * 8),
        nn.LeakyReLU(negative_slope=0.2, inplace=True),
        # 32x32x(ngf*8)

        nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ngf * 4),
        nn.LeakyReLU(negative_slope=0.2, inplace=True),
        # 64x64x(ngf*4)

        nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ngf * 2),
        nn.LeakyReLU(negative_slope=0.2, inplace=True),
        # 128x128x(ngf * 2)
            
        nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ngf),
        nn.LeakyReLU(negative_slope=0.2, inplace=True),
        # 256x256x(ngf)

        nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False),
        nn.Tanh()
        # 512x512x3 Output
    ),

    # Discriminator
    nn.Sequential(
        # Input 512x512x3
        nn.Conv2d(nc, ndf, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ndf),
        nn.LeakyReLU(0.2, inplace=True),
        # 256x256xndf

        nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ndf * 2),
        nn.LeakyReLU(0.2, inplace=True),
        # 128x128x(ndf * 2)

        nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ndf * 4),
        nn.LeakyReLU(0.2, inplace=True),
        # 64x64x(ndf * 4)

        nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ndf * 8),
        nn.LeakyReLU(0.2, inplace=True),
        # 32x32x(ndf * 8)

        nn.Conv2d(ndf * 8, ndf * 16, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ndf * 16),
        nn.LeakyReLU(0.2, inplace=True),
        # 16x16x(ndf * 16)

        nn.Conv2d(ndf * 16, ndf * 32, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ndf * 32),
        nn.LeakyReLU(0.2, inplace=True),
        # 8x8x(ndf * 32)
        
        nn.Conv2d(ndf * 32, ndf * 64, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ndf * 64),
        nn.LeakyReLU(0.2, inplace=True),
        # 4x4x(ndf * 64)

        nn.Conv2d(ndf * 64, 1, 4, 1, 0, bias=False),
        nn.Sigmoid()
        # 1x1x1
    )),

I changed the following parts in my code to accommodate FP16:

network_to_half(netG)
network_to_half(netD)
optimizerD = FP16_Optimizer(optimizerD, dynamic_loss_scale=True, verbose=False)
optimizerG = FP16_Optimizer(optimizerG, dynamic_loss_scale=True, verbose=False)

in the training loop:

for i, data in enumerate(dataloader, 0):
    # making the input fp16
    input_batch = data[0].cuda().half()
     ....
    # collect gradients for real batch in discriminator
    optimizerD.backward(errD_real, update_master_grads=False)
     ....
    # collect gradients for fake batch in discriminator
    optimizerD.backward(errD_fake, update_master_grads=False)
     ....
    # backprop discriminator
     optimizerD.update_master_grads()
     optimizerD.step()
    ....
    # collect gradients for generated batch in generator and backprop generator
     optimizerG.backward(errG)
     optimizerG.step()
    ....

Results:

  • using stock model.half() without apex: the model is 2x slower and not converging after 1 epoch
  • using AMP: the model is 1.5x slower and not converging after 1 epoch
  • using FP16_Optimizer: the model is 1.2x slower and converging if dynamic_loss_scale is used

Basically the model only somewhat behaves if I'm using dynamic_loss_scale in FP16_Optimizer, although it produces garbage outputs even though the architecture didn't change from the FP32 model that worked.

AMP should use dynamic_loss_scale automatically but it always collapses after 1 iteration and is very slow.

I expected the model to be faster and at least converge like the FP32 model did. The only benefit is that the model occupies around 51% less space on the GPU, so bigger models can be trained.

Questions:

What do I need to change in my architecture and training setup to make FP16 work with this DCGAN?

System information

PyTorch version: 0.4.1
Is debug build: No
CUDA used to build PyTorch: 9.2

OS: Microsoft Windows 10 Home
GCC version: Could not collect
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 9.2.148
GPU models and configuration: GPU 0: GeForce RTX 2070
Nvidia driver version: 416.81
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] Could not collect
[conda] cuda92 1.0 0 pytorch
[conda] pytorch 0.4.1 py37_cuda92_cudnn7he774522_1 [cuda92] pytorch
[conda] torchvision 0.2.1

Feature request: FusedAdamW

So far I'm really liking apex - no hassle fp16 training. I've noticed in my experiments that the optimizer does take a not inconsiderable time to execute, so I'm quite interested to try out the new FusedAdam optimizer (once issue 74 is sorted out, that is).

The thing is, I'm normally using AdamW. It's a small variation of Adam that improves weight decay behaviour. I understand it's gotten quite popular; for instance, fast.ai is using it in all of their work. Would it also be possible to get a FusedAdamW implementation, please?

https://arxiv.org/pdf/1711.05101.pdf
https://www.fast.ai/2018/07/02/adam-weight-decay/

The overlap of communication with computation does not seem to be realized, according to the GPU log

We use the apex extension with PyTorch 0.4.0. The system information is:
system: ubuntu 16.04.4
pytorch version: 0.4.0 with CUDA 9.1 and CUDNN 7.0.5
python version: 3.5.2
GPU: Tesla P100 *8
NVIDIA driver: 390.46
Model: ResNet 50

We set shared_parameter=False to enable overlapping communication with computation (we have read the source code and found that if the value is True, communication happens after all computation). The message_size is reduced to 10^6. We ran 6 iterations and recorded the GPU log with the NVIDIA profiler tool.

However, we found from the GPU log that the overlap is not realized. The log of the 6th iteration is shown below. The first "AllReduceKernel" call comes after the call to 'MaxPoolBackward', which is the end of the backward computation. We checked the other iterations and found the same thing.

28.478387,0.475834,49,8,64,256,1,1,32,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void MaxPoolBackward<float, float>(int, float const *, long const *, int, int, int, int, int, int, int, int, int, int, int, int, int, int, float*)",291627

28.478871,0.138303,12544,1,1,512,1,1,10,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void kernelPointwiseApply3<ThresholdUpdateGradInput<float>, float, float, float, unsigned int, int=-2, int=-2, int=-2>(OffsetInfo<ThresholdUpdateGradInput<float>, float, unsigned int>, OffsetInfo<float, float, int=-2>, OffsetInfo<float, float, int=-2>, float, float)",291643

28.479021,0.007424,,,,,,,,,,0.001343,0.176630,"Device",,"Tesla P100-PCIE-16GB (0)","1","24","[CUDA memset]",291667

28.479047,0.197309,110,1,1,512,1,1,64,0.265625,24.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","24","void cudnn::detail::bn_bw_1C11_singleread<float, int=512, bool=1, int=1, int=2, int=14>(float, float, float, float, cudnnTensorStruct, float const *, cudnn::detail::bn_bw_1C11_singleread<float, int=512, bool=1, int=1, int=2, int=14>, float const , cudnn::detail::bn_bw_1C11_singleread<float, int=512, bool=1, int=1, int=2, int=14>, cudnnTensorStruct*, float const *, float*, float const *, float const , float const , float, cudnn::reduced_divisor, int, float*, cudnn::detail::bnBwPersistentState*, int, float, float, float, int, float, cudnnStatus_t*, bool)",291697

28.479262,0.002912,1,112,1,128,1,1,14,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","cudnn::maxwell::gemm::computeWgradOffsetsKernel(cudnn::maxwell::gemm::ComputeOffsetsParams)",291715
28.479274,0.008000,37,1,1,256,1,1,8,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void scalePackedTensor_kernel<float, float>(cudnnTensor4dStruct, float*, float)",291721
28.479295,0.004831,1,1,1,256,1,1,12,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","cudnn::maxwell::gemm::computeBOffsetsKernel(cudnn::maxwell::gemm::ComputeBOffsetsParams)",291726

28.479312,0.466554,2,1,112,128,1,1,128,10.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","maxwell_scudnn_128x64_stridedB_splitK_large_nn",291730

28.479791,0.007104,2,1,1,512,1,1,10,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void kernelPointwiseApply2<TensorAddOp<float>, float, float, unsigned int, int=-2, int=-2>(OffsetInfo<TensorAddOp<float>, float, unsigned int>, OffsetInfo<float, float, int=-2>, float, float)",291741

28.479806,0.043616,4000,1,1,512,1,1,10,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void kernelPointwiseApply2<TensorAddOp<float>, float, float, unsigned int, int=-2, int=-2>(OffsetInfo<TensorAddOp<float>, float, unsigned int>, OffsetInfo<float, float, int=-2>, float, float)",291754

28.479852,0.007136,,,,,,,,,,0.000046,0.006265,"Pinned","Device","Tesla P100-PCIE-16GB (0)","1","14","[CUDA memcpy HtoD]",291792

28.479865,0.005311,4,1,1,512,1,1,10,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void kernelPointwiseApply2<TensorAddOp<float>, float, float, unsigned int, int=-2, int=-2>(OffsetInfo<TensorAddOp<float>, float, unsigned int>, OffsetInfo<float, float, int=-2>, float, float)",291874

28.479879,0.049184,112,2,1,512,1,1,13,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","14","void CatArrayBatchedCopy<float, unsigned int, int=1>(float*, CatArrInputTensor<float, unsigned int>*, OutputTensorSizeStride<unsigned int, unsigned int=4>, int, unsigned int)",291807

28.479886,0.007520,4,1,1,512,1,1,10,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void kernelPointwiseApply2<TensorAddOp<float>, float, float, unsigned int, int=-2, int=-2>(OffsetInfo<TensorAddOp<float>, float, unsigned int>, OffsetInfo<float, float, int=-2>, float, float)",291889

28.479906,0.032896,2048,1,1,512,1,1,10,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void kernelPointwiseApply2<TensorAddOp<float>, float, float, unsigned int, int=-2, int=-2>(OffsetInfo<TensorAddOp<float>, float, unsigned int>, OffsetInfo<float, float, int=-2>, float, float)",291908

28.479940,4.409442,1,1,1,257,1,1,128,0.007812,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","14","void AllReduceKernel<int=256, int=8, FuncSum<float>, float>(KernelArgs<FuncSum<float>>)",291821

Could you please tell us the reason, or point out our mistakes in using the apex extensions? Thanks for your help.

Segmentation fault...

Ubuntu 18.04.1
Cuda 9.2
C++ 7.3.0
Python 3.6.5

nvidia/apex$ python setup.py install
Segmentation fault (core dumped)

Cheers
Pei

Script crashes when doing multi-process training (using all visible GPUs on the node)

python -m apex.parallel.multiproc main.py -a resnet50 --fp16 --b 128 --workers 4 /workspace/imagenet/
Traceback (most recent call last):
  File "main.py", line 466, in <module>
    main()
  File "main.py", line 117, in main
    rank=args.rank)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/distributed/__init__.py", line 94, in init_process_group
    group_name, rank)
RuntimeError: more than one node have assigned same rank at /opt/pytorch/pytorch/torch/lib/THD/process_group/General.cpp:17

How to use fp16 training with masked operations

Hello !

I'm working on sequence training with CNNs, and for this I have to apply some masked_fill operations over the padding, before the softmax for example.

In float32 training I mask with the value -1e20, and it seems to train fine. Unfortunately, when training in float16 and masking with -1e15, amp loss scaling always returns NaN gradients.

Do you have any idea how to combine masked_fill with amp?

Thanks,
Morgan
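For reference, one likely culprit: float16 can only represent magnitudes up to about 65504, so a fill value of -1e15 overflows to -inf and poisons the softmax backward. A minimal sketch of a dtype-safe fill (tensor shapes are placeholders):

import torch

scores = torch.randn(2, 4, device="cuda", dtype=torch.float16)
pad_mask = torch.tensor([[False, False, True, True],
                         [False, True, True, True]], device="cuda")

# Use the most negative finite value the dtype can hold instead of -1e15,
# which would overflow float16 to -inf.
fill_value = torch.finfo(scores.dtype).min
probs = scores.masked_fill(pad_mask, fill_value).softmax(dim=-1)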

'RNN' KeyError

Note that this is with the latest commit 12dce88

In [9]: from apex import amp
   ...: amp_handle = amp.init()
   ...:                
   ...:                   
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-9-ef5bcdad1b52> in <module>()
      1 from apex import amp    
----> 2 amp_handle = amp.init()                                                   
                      
/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6.egg/apex/amp/amp.py in init(enabled, enable_caching, verbose, allow_banned)
    144                                         
    145     # 5.5) Extra-special handling of RNN backend              
--> 146     wrap.rnn_cast(torch.nn.backends.thnn.backend, 'RNN', verbose)
    147                                
    148     # And even more special handling of `backward` for fused gru / lstm
                                      
/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6.egg/apex/amp/wrap.py in rnn_cast(backend, fn, verbose)
    142 #   2) Insert an fp16 `flat_weight` if necessary
    143 def rnn_cast(backend, fn, verbose=False):        
--> 144     orig_rnn = utils.get_func(backend, fn)    
    145     @functools.wraps(orig_rnn)                                                                                                                                                                               
    146     def rnn_wrapper(*args, **kwargs):               

/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6.egg/apex/amp/utils.py in get_func(mod, fn)
    117 def get_func(mod, fn):       
    118     if isinstance(mod, torch.nn.backends.backend.FunctionBackend):
--> 119         return mod.function_classes[fn]
    120     else:             
    121         return getattr(mod, fn)             
                                     
KeyError: 'RNN'
