Introduction

This repository holds NVIDIA-maintained utilities to streamline mixed precision and distributed training in PyTorch. Some of the code here will eventually be included in upstream PyTorch. The intent of Apex is to make up-to-date utilities available to users as quickly as possible.

Full API Documentation: https://nvidia.github.io/apex

Contents

1. Amp: Automatic Mixed Precision

Deprecated. Use PyTorch AMP

apex.amp is a tool to enable mixed precision training by changing only 3 lines of your script. Users can easily experiment with different pure and mixed precision training modes by supplying different flags to amp.initialize.
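For orientation, a minimal sketch of what those three lines look like in a training script (the tiny model, optimizer, and loss below are placeholders, not part of the Apex API):

import torch
from apex import amp

model = torch.nn.Linear(10, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Line 1 of 3: let Amp patch the model and optimizer for the chosen opt_level.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

loss = model(torch.randn(4, 10, device="cuda")).sum()

# Lines 2-3 of 3: scale the loss so fp16 gradients don't underflow.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()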

Webinar introducing Amp (The flag cast_batchnorm has been renamed to keep_batchnorm_fp32).

API Documentation

Comprehensive Imagenet example

DCGAN example coming soon...

Moving to the new Amp API (for users of the deprecated "Amp" and "FP16_Optimizer" APIs)

2. Distributed Training

apex.parallel.DistributedDataParallel is deprecated. Use torch.nn.parallel.DistributedDataParallel

apex.parallel.DistributedDataParallel is a module wrapper, similar to torch.nn.parallel.DistributedDataParallel. It enables convenient multiprocess distributed training, optimized for NVIDIA's NCCL communication library.
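A minimal sketch of the wrapper, assuming the usual one-process-per-GPU launch (e.g. via torch.distributed.launch) so that --local_rank and the NCCL environment variables are set up:

import argparse
import torch
from apex.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend="nccl", init_method="env://")

model = torch.nn.Linear(10, 10).cuda()
# Unlike torch.nn.parallel.DistributedDataParallel, no device_ids argument
# is needed: apex's wrapper assumes one GPU per process.
model = DDP(model)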

API Documentation

Python Source

Example/Walkthrough

The Imagenet example shows use of apex.parallel.DistributedDataParallel along with apex.amp.

Synchronized Batch Normalization

Deprecated. Use torch.nn.SyncBatchNorm

apex.parallel.SyncBatchNorm extends torch.nn.modules.batchnorm._BatchNorm to support synchronized BN. It allreduces stats across processes during multiprocess (DistributedDataParallel) training. Synchronous BN has been used in cases where only a small local minibatch can fit on each GPU. Allreduced stats increase the effective batch size for the BN layer to the global batch size across all processes (which, technically, is the correct formulation). Synchronous BN has been observed to improve converged accuracy in some of our research models.
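A minimal sketch of the conversion using apex.parallel.convert_syncbn_model, which recursively swaps BN layers in a module tree (the toy model is a placeholder); apply it before wrapping the model for distributed training:

import torch
import apex

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.BatchNorm2d(8),
).cuda()

# Replace every torch.nn.modules.batchnorm._BatchNorm instance with
# apex.parallel.SyncBatchNorm so stats are allreduced across processes.
model = apex.parallel.convert_syncbn_model(model)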

Checkpointing

To properly save and load your amp training state, we introduce amp.state_dict(), which contains all loss_scalers and their corresponding unskipped steps, and amp.load_state_dict() to restore these attributes.

In order to get bitwise accuracy, we recommend the following workflow:

# Initialization
opt_level = 'O1'
model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)

# Train your model
...
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
...

# Save checkpoint
checkpoint = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'amp': amp.state_dict()
}
torch.save(checkpoint, 'amp_checkpoint.pt')
...

# Restore
model = ...
optimizer = ...
checkpoint = torch.load('amp_checkpoint.pt')

model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
amp.load_state_dict(checkpoint['amp'])

# Continue training
...

Note that we recommend restoring the model using the same opt_level. Also note that we recommend calling the load_state_dict methods after amp.initialize.

Installation

Each apex.contrib module requires one or more install options other than --cpp_ext and --cuda_ext. Note that contrib modules do not necessarily support stable PyTorch releases.
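For example, building the apex.contrib.xentropy module adds its install option (--xentropy, per the table under "Custom C++/CUDA Extensions and Install Options" below) to the usual command. A sketch for pip >= 23.1; substitute the option your module needs:

pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--xentropy" ./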

Containers

NVIDIA PyTorch Containers are available on NGC: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch. The containers come with all the custom extensions available at the moment.

See the NGC documentation for details such as:

  • how to pull a container
  • how to run a pulled container
  • release notes

From Source

To install Apex from source, we recommend using the nightly PyTorch build obtainable from https://github.com/pytorch/pytorch.

The latest stable release obtainable from https://pytorch.org should also work.

We recommend installing Ninja to make compilation faster.
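For example:

pip install ninja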

Linux

For performance and full functionality, we recommend installing Apex with CUDA and C++ extensions via

git clone https://github.com/NVIDIA/apex
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key... 
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Apex also supports a Python-only build via

pip install -v --disable-pip-version-check --no-build-isolation --no-cache-dir ./

A Python-only build omits:

  • Fused kernels required to use apex.optimizers.FusedAdam.
  • Fused kernels required to use apex.normalization.FusedLayerNorm and apex.normalization.FusedRMSNorm.
  • Fused kernels that improve the performance and numerical stability of apex.parallel.SyncBatchNorm.
  • Fused kernels that improve the performance of apex.parallel.DistributedDataParallel and apex.amp. DistributedDataParallel, amp, and SyncBatchNorm will still be usable, but they may be slower.

[Experimental] Windows

pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" . may work if you were able to build PyTorch from source on your system. A Python-only build via pip install -v --no-cache-dir . is more likely to work.
If you installed PyTorch in a conda environment, make sure to install Apex in that same environment.

Custom C++/CUDA Extensions and Install Options

If a requirement of a module is not met, then it will not be built.

Module Name Install Option Misc
apex_C --cpp_ext
amp_C --cuda_ext
syncbn --cuda_ext
fused_layer_norm_cuda --cuda_ext apex.normalization
mlp_cuda --cuda_ext
scaled_upper_triang_masked_softmax_cuda --cuda_ext
generic_scaled_masked_softmax_cuda --cuda_ext
scaled_masked_softmax_cuda --cuda_ext
fused_weight_gradient_mlp_cuda --cuda_ext Requires CUDA>=11
permutation_search_cuda --permutation_search apex.contrib.sparsity
bnp --bnp apex.contrib.groupbn
xentropy --xentropy apex.contrib.xentropy
focal_loss_cuda --focal_loss apex.contrib.focal_loss
fused_index_mul_2d --index_mul_2d apex.contrib.index_mul_2d
fused_adam_cuda --deprecated_fused_adam apex.contrib.optimizers
fused_lamb_cuda --deprecated_fused_lamb apex.contrib.optimizers
fast_layer_norm --fast_layer_norm apex.contrib.layer_norm. different from fused_layer_norm
fmhalib --fmha apex.contrib.fmha
fast_multihead_attn --fast_multihead_attn apex.contrib.multihead_attn
transducer_joint_cuda --transducer apex.contrib.transducer
transducer_loss_cuda --transducer apex.contrib.transducer
cudnn_gbn_lib --cudnn_gbn Requires cuDNN>=8.5, apex.contrib.cudnn_gbn
peer_memory_cuda --peer_memory apex.contrib.peer_memory
nccl_p2p_cuda --nccl_p2p Requires NCCL >= 2.10, apex.contrib.nccl_p2p
fast_bottleneck --fast_bottleneck Requires peer_memory_cuda and nccl_p2p_cuda, apex.contrib.bottleneck
fused_conv_bias_relu --fused_conv_bias_relu Requires cuDNN>=8.4, apex.contrib.conv_bias_relu
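To check after installation whether a particular extension was actually built, one can try importing it by the module name in the first column. A minimal sketch (the three module names below are taken from the table above):

import importlib.util

# Each successfully built extension is importable as a top-level module.
for ext in ("apex_C", "amp_C", "fused_layer_norm_cuda"):
    built = importlib.util.find_spec(ext) is not None
    print(ext, "built" if built else "not built")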

Contributors

a-maci, aidyn-a, alpha0422, carlc-nv, cbcase, crcrpar, csarofeen, definitelynotmcarilli, ekrimer, eqy, erhoo82, fdecayed, fuzzkatt, jjsjann123, jpool-nv, kevinstephano, kexinyu, mcarilli, minitu, mkolod, nweidia, ptrblck, seryilmaz, slayton58, syed-ahmed, thorjohnsen, timmoon10, xwang233, yaox12, yjk21


Issues

FP16 about input and loss?

I have two questions about how to train a network correctly with fp16.

First, in main_fp16_optimizer.py, the input is converted with .half() in data_prefetcher(), and the model with model = network_to_half(model). Is input.half() necessary? #58

train_dataset = datasets.ImageFolder(
        traindir,
        transforms.Compose([
            transforms.RandomResizedCrop(crop_size),
            transforms.RandomHorizontalFlip(),
            # transforms.ToTensor(), Too slow
            # normalize,
        ]))

Second, should we be concerned about the operations in the criterion (loss function), which may be more complicated, such as the loss functions in object detection and segmentation?

if args.fp16:
    optimizer.backward(loss)

FP16_Optimizer has no way to "retain_graph=True"

I need to call "backward" multiple times, but when I do, I get:
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

Of course, when I try to pass

self.optimizer.backward(loss,retain_graph=True)

I get:
TypeError: backward() got an unexpected keyword argument 'retain_graph'

Trouble building with cuda_ext

Getting this error when trying to build with --cuda_ext. I'm on a GTX 1060 with PyTorch 1.0, gcc version 4.9.4 (Ubuntu 4.9.4-2ubuntu1)

torch.__version__  =  1.0.0
running install
running bdist_egg
running egg_info
writing apex.egg-info/PKG-INFO
writing dependency_links to apex.egg-info/dependency_links.txt
writing top-level names to apex.egg-info/top_level.txt
reading manifest file 'apex.egg-info/SOURCES.txt'
writing manifest file 'apex.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
building 'syncbn' extension
gcc -pthread -B /home/chang/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/chang/anaconda3/lib/python3.7/site-packages/torch/lib/include -I/home/chang/anaconda3/lib/python3.7/site-packages/torch/lib/include/torch/csrc/api/include -I/home/chang/anaconda3/lib/python3.7/site-packages/torch/lib/include/TH -I/home/chang/anaconda3/lib/python3.7/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/home/chang/anaconda3/include/python3.7m -c csrc/syncbn.cpp -o build/temp.linux-x86_64-3.7/csrc/syncbn.o -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=syncbn -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
/usr/local/cuda/bin/nvcc -I/home/chang/anaconda3/lib/python3.7/site-packages/torch/lib/include -I/home/chang/anaconda3/lib/python3.7/site-packages/torch/lib/include/torch/csrc/api/include -I/home/chang/anaconda3/lib/python3.7/site-packages/torch/lib/include/TH -I/home/chang/anaconda3/lib/python3.7/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/home/chang/anaconda3/include/python3.7m -c csrc/welford.cu -o build/temp.linux-x86_64-3.7/csrc/welford.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --compiler-options '-fPIC' -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=syncbn -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
csrc/welford.cu(82): error: identifier "__shfl_down_sync" is undefined
          detected during:
            instantiation of "void welford_reduce_mean_m2n(T *, int *, T &, T &, int &, int, int) [with T=at::acc_type<double, true>]" 
(184): here
            instantiation of "void welford_kernel<scalar_t,accscalar_t,outscalar_t>(const scalar_t *, outscalar_t *, outscalar_t *, outscalar_t *, int, int, int) [with scalar_t=double, accscalar_t=at::acc_type<double, true>, outscalar_t=at::acc_type<double, true>]" 
(364): here

csrc/welford.cu(49): error: identifier "__shfl_down_sync" is undefined
          detected during:
            instantiation of "T warp_reduce_sum(T) [with T=at::acc_type<double, true>]" 
(60): here
            instantiation of "T reduce_block(T *, T) [with T=at::acc_type<double, true>]" 
(268): here
            instantiation of "void reduce_bn_kernel(const scalar_t *, const scalar_t *, const accscalar_t *, const accscalar_t *, accscalar_t *, accscalar_t *, layerscalar_t *, layerscalar_t *, int, int, int, float) [with scalar_t=double, accscalar_t=at::acc_type<double, true>, layerscalar_t=at::acc_type<double, true>]" 
(460): here

csrc/welford.cu(49): error: identifier "__shfl_down_sync" is undefined
          detected during:
            instantiation of "T warp_reduce_sum(T) [with T=at::acc_type<float, true>]" 
(60): here
            instantiation of "T reduce_block(T *, T) [with T=at::acc_type<float, true>]" 
(268): here
            instantiation of "void reduce_bn_kernel(const scalar_t *, const scalar_t *, const accscalar_t *, const accscalar_t *, accscalar_t *, accscalar_t *, layerscalar_t *, layerscalar_t *, int, int, int, float) [with scalar_t=float, accscalar_t=at::acc_type<float, true>, layerscalar_t=at::acc_type<float, true>]" 
(460): here

3 errors detected in the compilation of "/tmp/tmpxft_00002f5a_00000000-7_welford.cpp1.ii".
error: command '/usr/local/cuda/bin/nvcc' failed with exit status 2

Would apex still be useful for non-Volta architectures?

I was looking into the library, and it seems that the assumption is that the GPU is a Volta architecture.

This link shows some benchmarks for fp16 training and inference, and the 1080 Ti doesn't gain that much performance from fp16.

Would it be useful to apply this library for GPUs besides Titan V and V100?

Dockerfile doesn't work

Hi,

I noticed that the URL in the Dockerfile is not accessible:

FROM gitlab-dl.nvidia.com:5005/dgx/pytorch:18.04-py3-devel

Could you help to fix it? Thanks

Best,
Vincent

Error with latest pytorch head

I've been using apex for a few months now with PyTorch 0.4.1. Now I'm trying to use the latest apex head together with the latest PyTorch head (2nd November), and I'm getting the following error when trying to run /apex/examples/imagenet/main_fp16_optimizer.py:

RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

This is on line 326 when the backward pass is called:
optimizer.backward(loss)

Standard float training works fine.

I know I should be using a stable PyTorch version, but I saw there was a recently fixed issue with CUDA extension compilation. Any ideas? I also couldn't find any mention of apex being updated for PyTorch 1.0 yet.

Also, I don't see any releases on this Git repository.

I do not think you need to preprocess train images to 256x256

"Train images are expected to be 256x256 jpegs."

I'm pretty sure the validation images are 256x256; that's what the PyTorch examples page says and what the code does. But for PyTorch training you feed the training images raw from the ImageNet download, unzipped into folders with the category as the folder name. Unlike with, say, MXNet and Caffe2, the raw ImageNet images are not preprocessed in any way. If they do need to be 256x256, can you provide the script that handles that?

amp examples

A full amp example would be useful. It would help answer questions like:

  • Do we need to call ".half()" on the model?
  • Do we need to call init() before the model is built?
  • Enabling amp seems to slow training down; why might this be?

Illegal memory access with latest PyTorch/Apex

Hi,

I'm trying to train a model through apex using the latest PyTorch and Apex masters, but every forward call ends up with the following error:

  File "main.py", line 348, in <module>
    metrics = train(batch, args.fp16)
  File "main.py", line 88, in train
    scaled_loss.backward()
  File "/nfs/project/mr/miniconda/envs/machine_reading/lib/python3.7/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/nfs/project/mr/miniconda/envs/machine_reading/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

The full log with CUDNN_LOGINFO_DBG flag set is attached below.
It's using CUDNN 7.3.1 / CUDA 9.2

100936.txt

Errors during compilation

When trying to compile with "python setup.py install --cuda_ext --cpp_ext", the first stage throws errors for missing PyTorch libraries: c10, ATen, torch/extension.h. This part can possibly be fixed by manually adding these files from the PyTorch repo.
The next issue is:

"/.../lib/python3.6/site-packages/torch/lib/include/ATen/Error.h:105:0: note: this is the location of the previous definition
#define AT_CHECK(cond, ...) \
^
error: command '/usr/local/cuda/bin/nvcc' failed with exit status 1

No information about the error is available, even when using strace on the compilation command.
environment details:
gcc/g++ : 4.8.5
cuda 9, pytorch 0.4.1 (tested also with 0.4.0), python3.6.
OS: RHEL7.4 , using conda environment
arch: ppc64 (PowerAI 9)

Same issues confirmed on x86_64 - intel

SyncBN in AMP?

How can I implement SyncBN with AMP? Should I use the following code in main_fp16_optimizer.py?

import apex
model = apex.parallel.convert_syncbn_model(model)

from apex.parallel import DistributedDataParallel as DDP
model = DDP(model, delay_allreduce=True)

If so, can DDP be replaced with the original DataParallel if I train the net on only one machine?

AttributeError: 'tuple' object has no attribute 'log_softmax' when running inception_v3

python main.py -a inception_v3 --epoch 5 -b 64 /workspace/imagenet/ --fp16
=> creating model 'inception_v3'
Traceback (most recent call last):
  File "main.py", line 466, in <module>
    main()
  File "main.py", line 212, in main
    train(train_loader, model, criterion, optimizer, epoch)
  File "main.py", line 293, in train
    loss = criterion(output, target_var)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 779, in forward
    self.ignore_index, self.reduce)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/functional.py", line 1454, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, size_average, ignore_index, reduce)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/functional.py", line 946, in log_softmax
    return input.log_softmax(dim)
AttributeError: 'tuple' object has no attribute 'log_softmax'

Note:

In [2]: print(torch.__version__)
0.5.0a0

Install Error: support PyTorch 0.4.1?

I've recently upgraded to PyTorch 0.4.1. When I try to install Apex, it gives the following error:

$ python setup.py install
Traceback (most recent call last):
  File "setup.py", line 1, in <module>
    import torch
  File "/home/yuduo/anaconda3/lib/python3.6/site-packages/torch/__init__.py", line 80, in <module>
    from torch._C import *
ImportError: /home/yuduo/anaconda3/lib/python3.6/site-packages/torch/lib/libshm.so: undefined symbol: _ZTI24THRefcountedMapAllocator

Get amp handler in a more decent way

Usually, the location of amp.init() is far from loss.backward(). While it is possible to pass the handler as a parameter to the function that calls loss.backward(), it is not very elegant. I wonder if we could do something like:

import torch
import apex
apex.amp.init()

........


def backward(loss):
    with apex.amp.get_default_handler().scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()

Nan when using torch.mean

Hi, I am writing a LayerNorm using torch.mean().
My PyTorch version is 1.0.0a0+505dedf.
This is my code:

class LayerNorm(nn.Module):
    def __init__(self, num_features, eps=1e-5, affine=True, fp16=True):
        super(LayerNorm, self).__init__()
        self.num_features = num_features
        self.affine = affine
        self.eps = eps
        self.fp16 = fp16
        if self.affine:
            self.gamma = nn.Parameter(torch.Tensor(num_features).uniform_())
            self.beta = nn.Parameter(torch.zeros(num_features))
            if self.fp16:
                self.gamma = nn.Parameter(torch.Tensor(num_features).uniform_().half())
                self.beta = nn.Parameter(torch.zeros(num_features).half())
                #self.eps = np.float16(self.eps)
    def forward(self, x):
        shape = [-1] + [1] * (x.dim() - 1)
        print(x.view(-1))
        print(torch.mean(x.view(-1)) )
        mean = x.view(-1).mean().view(*shape)
        std = x.view(-1).std().view(*shape)
        x = (x - mean) / (std + self.eps)
        exit()
        if self.affine:
            shape = [1, -1] + [1] * (x.dim() - 2)
            x = x * self.gamma.view(*shape) + self.beta.view(*shape)
        return x

The output is

tensor([-11.0703,   3.6230,  -0.1460,  ...,   0.7358, -10.4688,  -9.3984],
       device='cuda:0', dtype=torch.float16, grad_fn=<ViewBackward>)
tensor(nan, device='cuda:0', dtype=torch.float16, grad_fn=<MeanBackward1>)

I notice the result turns to NaN when I use torch.mean(). Do you have any suggestions?

Is instance norm supported?

I got the following error when nn.InstanceNorm2d is used in the network:

RuntimeError: Expected object of type torch.cuda.HalfTensor but found type torch.cuda.FloatTensor for argument #4 'running_mean'

Any suggestion on how to get around this?

Error when amp_handle = amp.init()

I used the following code

import torch
from apex import amp
amp_handle = amp.init()

and "Floating point exception (core dumped)" occured

TypeError: OptimWrapper is not an Optimizer

Hello,

I have tried to implement amp by wrapping a pre-existing model:

https://github.com/neptune-ml/open-solution-mapping-challenge

I've scoured the source code and am pretty sure there is only one optimizer, so I first tried enabling amp and wrapping backpropagation as instructed, but after that training ran only about 75-80% as fast as it had before. So I decided to try explicitly wrapping the optimizer:

    self.optimizer = optim.Adam(self.weight_regularization(self.model, **architecture_config['regularizer_params']),
                                **architecture_config['optimizer_params'])
    #Initializing amp
    amp_handle = amp.init()
    #Wrapping self.optimizer
    self.optimizer = amp_handle.wrap_optimizer(self.optimizer)
    self.loss_function = None
    self.callbacks = callbacks_unet(self.callbacks_config)

And then at the backprop:

    with self.optimizer.scale_loss(batch_loss) as scaled_loss:
        scaled_loss.backward()
    self.optimizer.step()

However, I get the following error:

2018-10-10 17-24-18 steps >>> step unet fitting and transforming...

Traceback (most recent call last):
  File "main.py", line 93, in <module>
    main()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "main.py", line 31, in train
    pipeline_manager.train(pipeline_name, dev_mode)
  File "/ebs/osmc/src/pipeline_manager.py", line 32, in train
    train(pipeline_name, dev_mode, self.logger, self.params, self.seed)
  File "/ebs/osmc/src/pipeline_manager.py", line 116, in train
    pipeline.fit_transform(data)
  File "/ebs/osmc/src/steps/base.py", line 106, in fit_transform
    step_inputs[input_step.name] = input_step.fit_transform(data)
  File "/ebs/osmc/src/steps/base.py", line 106, in fit_transform
    step_inputs[input_step.name] = input_step.fit_transform(data)
  File "/ebs/osmc/src/steps/base.py", line 106, in fit_transform
    step_inputs[input_step.name] = input_step.fit_transform(data)
  [Previous line repeated 3 more times]
  File "/ebs/osmc/src/steps/base.py", line 112, in fit_transform
    return self._cached_fit_transform(step_inputs)
  File "/ebs/osmc/src/steps/base.py", line 123, in _cached_fit_transform
    step_output_data = self.transformer.fit_transform(**step_inputs)
  File "/ebs/osmc/src/steps/base.py", line 262, in fit_transform
    self.fit(*args, **kwargs)
  File "/ebs/osmc/src/models.py", line 76, in fit
    self.callbacks.set_params(self, validation_datagen=validation_datagen, meta_valid=meta_valid)
  File "/ebs/osmc/src/steps/pytorch/callbacks.py", line 76, in set_params
    callback.set_params(*args, **kwargs)
  File "/ebs/osmc/src/steps/pytorch/callbacks.py", line 222, in set_params
    self.lr_scheduler = ExponentialLR(self.optimizer, self.gamma, last_epoch=-1)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/optim/lr_scheduler.py", line 178, in __init__
    super(ExponentialLR, self).__init__(optimizer, last_epoch)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/optim/lr_scheduler.py", line 13, in __init__
    type(optimizer).__name__))
TypeError: OptimWrapper is not an Optimizer

Did I make a stupid mistake somewhere? Did I forget to do something? Or is something about this model incompatible with apex?

PyTorch model with multiple inputs

There is a bug that prevents a model from taking multiple inputs through the forward function after applying the network_to_half function.
I made a model with 2 input parameters, and it works fine without network_to_half. After applying it, however, I get the following error:
forward() takes 2 positional arguments but 3 were given

RuntimeError: Tensor: invalid storage offset

This error is raised when the weights of an RNN are not part of a single contiguous chunk of memory. In pure PyTorch this is just a warning, but with apex it fails:

apex/apex/amp/utils.py

Lines 177 to 188 in 4212b3e

def new_synthesize_flattened_rnn_weights(fp32_weights,
                                         fp16_flat_tensor,
                                         rnn_fn='',
                                         verbose=False):
    fp16_weights = []
    fp32_base_ptr = fp32_weights[0].data_ptr()
    for w_fp32 in fp32_weights:
        w_fp16 = w_fp32.new().half()
        offset = (w_fp32.data_ptr() - fp32_base_ptr) // w_fp32.element_size()
        w_fp16.set_(fp16_flat_tensor.storage(),
                    offset,
                    w_fp32.shape)

(the offsets may become negative, which are invalid)

While I am not sure why this (weights not being part of a single contiguous chunk of memory) happens in PyTorch, a simple workaround is to call rnn.flatten_parameters() before each forward call.
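A minimal sketch of that workaround (the LSTM sizes are arbitrary placeholders):

import torch

rnn = torch.nn.LSTM(input_size=16, hidden_size=32, num_layers=2).cuda().half()
x = torch.randn(5, 4, 16, device="cuda", dtype=torch.float16)

# Re-compact the RNN weights into one contiguous chunk before the forward
# call, so the fp16 weight views don't end up with invalid storage offsets.
rnn.flatten_parameters()
out, _ = rnn(x)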

resnet50 doesn't converge when running example/imagenet/main.py on imagenet dataset with fp16

I want to use example/imagenet/main.py to train a resnet50 model on the ImageNet dataset with fp16, but the accuracy doesn't converge. BTW, without fp16 I get the correct top-1 accuracy of 76%.

my command is:

python -m torch.distributed.launch --nproc_per_node=8 main.py --fp16 --arch resnet50 --epochs 90 --workers 6 --batch-size=256 /imagenet

  • Python version: 3.6.2

  • PyTorch version: 0.4.1

  • torchvision version: 0.2.1

  • OS: Ubuntu 16.04.3 LTS

  • Nvidia driver version: 390.46

  • CUDA runtime version: 9.0

  • GPU number: 8

  • GPU model: Tesla P100-PCIE

The validation accuracy suddenly falls to 0 after about 7 epochs.
The training accuracy suddenly falls to 0 after about 17 epochs.

I saved the model's gradients (para.grad) each epoch. I found that at epoch 17 the distribution of the model's parameters (para.data) is normal, but 84.5% of the gradient values (para.grad) are NaN.

The accuracy results are as follows:

train validate
epoch Top1 Top5 Loss epoch Top1 Top5 Loss
0 3.166 9.401 6.0748 0 3.054 9.308 12.1711
1 15.428 34.406 4.4438 1 18.206 39.778 4.1692
2 26.628 50.356 3.6012 2 29.108 54.764 3.3741
3 34.069 59.227 3.1205 3 30.938 56.796 3.2538
4 37.787 63.202 2.8991 4 29.46 55.652 3.3615
5 40.33 65.834 2.7536 5 12.982 30.574 5.1418
6 42.476 67.836 2.6325 6 0.428 1.608 8.4904
7 43.851 69.086 2.5574 7 0.1 0.502 8.2962
8 44.888 70.058 2.5005 8 0.1 0.49 15.8809
9 45.692 70.684 2.4588 9 0.1 0.5 83.5319
10 46.378 71.274 2.4261 10 0.104 0.496 184.0083
11 46.66 71.618 2.4065 11 0.1 0.504 210.9373
12 46.938 71.805 2.3928 12 0.1 0.5 585.1285
13 47.039 71.931 2.3873 13 0.1 0.5 2283.96
14 46.974 71.87 2.393 14 0 0.006 1612.295
15 46.667 71.499 2.4104 15 0.002 0.006 7.0508
16 46.273 71.155 2.4337 16 0.002 0.006 1554.635
17 16.3 25.251 5.3414 17 0.1 0.5 8.9353
18 0.096 0.482 6.9067 18 0.1 0.5 7.0235
19 0.095 0.485 6.9068 19 0.1 0.5 6.911
20 0.097 0.488 6.9068 20 0.1 0.5 6.9091
21 0.094 0.491 6.9067 21 0.1 0.5 6.9086
22 0.094 0.487 6.9066 22 0.1 0.5 6.9085
23 0.095 0.478 6.9066 23 0.1 0.5 6.9085
24 0.101 0.491 6.9067 24 0.1 0.5 6.9082
25 0.098 0.487 6.9067 25 0.1 0.5 6.9083
26 0.097 0.483 6.9068 26 0.1 0.5 6.908
27 0.099 0.485 6.9067 27 0.1 0.5 6.9082
28 0.091 0.489 6.9067 28 0.1 0.5 6.9085
29 0.097 0.489 6.9067 29 0.1 0.5 6.9083
30 0.1 0.503 6.9065 30 0.1 0.5 6.908
31 0.1 0.496 6.9063 31 0.1 0.5 6.9078
32 0.098 0.487 6.9063 32 0.1 0.5 6.908
33 0.092 0.472 6.9063 33 0.1 0.5 6.9078
34 0.092 0.469 6.9063 34 0.1 0.5 6.9078
35 0.095 0.461 6.9063 35 0.1 0.5 6.9078
36 0.093 0.463 6.9063 36 0.1 0.5 6.9078
37 0.086 0.459 6.9062 37 0.1 0.5 6.908
38 0.089 0.467 6.9063 38 0.1 0.5 6.9078
39 0.092 0.469 6.9063 39 0.1 0.5 6.908
40 0.092 0.461 6.9063 40 0.1 0.5 6.908
41 0.095 0.459 6.9063 41 0.1 0.5 6.9078
42 0.09 0.46 6.9063 42 0.1 0.5 6.9078
43 0.09 0.461 6.9063 43 0.1 0.5 6.908
44 0.093 0.463 6.9063 44 0.1 0.5 6.908
45 0.094 0.464 6.9063 45 0.1 0.5 6.9078
46 0.09 0.457 6.9063 46 0.1 0.5 6.9078
47 0.091 0.466 6.9063 47 0.1 0.5 6.908
48 0.089 0.465 6.9063 48 0.1 0.5 6.9078
49 0.09 0.449 6.9063 49 0.1 0.5 6.9078
50 0.094 0.46 6.9063 50 0.1 0.5 6.9078
51 0.094 0.464 6.9063 51 0.1 0.5 6.908
52 0.092 0.473 6.9063 52 0.1 0.5 6.9078
53 0.094 0.462 6.9063 53 0.1 0.5 6.9078
54 0.088 0.468 6.9063 54 0.1 0.5 6.9078
55 0.091 0.453 6.9063 55 0.1 0.5 6.9078
56 0.091 0.45 6.9064 56 0.1 0.5 6.9078
57 0.093 0.472 6.9063 57 0.1 0.5 6.9077
58 0.09 0.455 6.9063 58 0.1 0.5 6.9077
59 0.091 0.464 6.9063 59 0.1 0.5 6.9078
60 0.099 0.491 6.9063 60 0.1 0.5 6.9078
61 0.101 0.499 6.9063 61 0.1 0.5 6.908
62 0.1 0.496 6.9063 62 0.1 0.5 6.908
63 0.1 0.487 6.9063 63 0.1 0.5 6.9082
64 0.098 0.483 6.9063 64 0.1 0.5 6.9082
65 0.094 0.461 6.9064 65 0.1 0.5 6.9082
66 0.091 0.467 6.9063 66 0.1 0.5 6.9082
67 0.093 0.466 6.9064 67 0.1 0.5 6.9082
68 0.097 0.471 6.9063 68 0.1 0.5 6.9082
69 0.088 0.461 6.9064 69 0.1 0.5 6.9082
70 0.093 0.459 6.9063 70 0.1 0.5 6.9082
71 0.096 0.473 6.9064 71 0.1 0.5 6.9083
72 0.092 0.471 6.9064 72 0.1 0.5 6.9082
73 0.095 0.464 6.9064 73 0.1 0.5 6.9083
74 0.092 0.464 6.9063 74 0.1 0.5 6.9083
75 0.09 0.462 6.9064 75 0.1 0.5 6.9083
76 0.093 0.467 6.9064 76 0.1 0.5 6.9083
77 0.091 0.467 6.9064 77 0.1 0.5 6.9083
78 0.092 0.455 6.9064 78 0.1 0.5 6.9083
79 0.09 0.459 6.9064 79 0.1 0.5 6.9082
80 0.095 0.493 6.9064 80 0.1 0.5 6.9082
81 0.094 0.486 6.9064 81 0.1 0.5 6.9082
82 0.099 0.487 6.9064 82 0.1 0.5 6.9082
83 0.094 0.498 6.9064 83 0.1 0.5 6.9082
84 0.096 0.492 6.9064 84 0.1 0.5 6.9082
85 0.097 0.487 6.9064 85 0.1 0.5 6.9083
86 0.096 0.492 6.9064 86 0.1 0.5 6.9082
87 0.1 0.493 6.9065 87 0.1 0.5 6.9083
88 0.099 0.482 6.9064 88 0.1 0.5 6.9083
89 0.097 0.498 6.9064 89 0.1 0.5 6.9082

When I got the wrong result, I ran main.py on another server with two V100s. I only used one, because using two hit the problem described in PyTorch issue 11327. But that does not affect my training...

python main.py --fp16 --arch resnet50 --epochs 90 --workers 6 --batch-size=256 /imagenet

Then I got the same wrong result, very similar to the results above.

I want to know whether you have tested main.py on the ImageNet dataset with fp16 and got good accuracy, like the top-1 = 76% described in the paper MIXED PRECISION TRAINING.

PyTorch DistributedDataParallel incompatibility

Is it known/intentional that the FP16 Optimizer does not work with PyTorch's built-in DistributedDataParallel? Or is there some subtlety to getting it to work? I need to use the new DistributedDataParallel with the new c10d backend for the work I'm doing, and I would like to be able to use Apex with that.

loss.backward() in apex.amp ?

Following the example for handling multiple backward passes:

amp_handle = amp.init()
optimizer = amp_handle.wrap_optimizer(optimizer, num_loss=2)
# ...
optimizer.zero_grad()
loss1 = ComputeLoss1(model)
with optimizer.scale_loss(loss1) as scaled_loss:
    scaled_loss.backward()
# ...
loss2 = ComputeLoss2(model)
with optimizer.scale_loss(loss2) as scaled_loss:
    scaled_loss.backward()
# ...
optimizer.step()

Can we first add the losses together and then apply the wrapper?

loss = loss1 + loss2
with optimizer.scale_loss(loss) as scaled_loss:
    scaled_loss.backward()

AMP supported hardware

Hello!

It's not clear which hardware benefits most from apex. I'm using AMP on a Tesla K80 to train my model, and the training actually slowed down by ~1.3x. The mixed precision user guide says the framework should support Volta Tensor Core math and only mentions the Tesla V100, so is the Tesla V100 the best hardware for mixed precision training?

Learning Scheduler

Essentially, I want to use a learning rate scheduler. Typically the syntax for that is:

scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda_rule)

where I am using the LambdaLR rule. However, when the optimizer is an FP16_Optimizer, this throws an error:

TypeError: FP16_Optimizer is not an Optimizer

This makes total sense. If you look at the scheduler source, the base class contains this piece of code:

        if not isinstance(optimizer, Optimizer):
            raise TypeError('{} is not an Optimizer'.format(
                type(optimizer).__name__))

Now, my questions are:

  1. Is there already a way of dealing with this? I am probably not the first one to have this problem.
  2. If not, what would be the best way to implement schedulers for FP16_Optimizer? Copy the code from torch.optim and change it to work with FP16_Optimizer? (One workaround is sketched below.)
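One workaround that has been used, sketched here under the assumption that FP16_Optimizer exposes the wrapped torch.optim optimizer as its .optimizer attribute (worth verifying against your apex version):

import torch
from torch.optim import lr_scheduler
from apex.fp16_utils import FP16_Optimizer  # deprecated API

model = torch.nn.Linear(10, 10).cuda().half()
inner = torch.optim.SGD(model.parameters(), lr=0.1)
optimizer = FP16_Optimizer(inner, dynamic_loss_scale=True)

# Schedule the wrapped optimizer, which passes the isinstance check,
# instead of the FP16_Optimizer itself (assumes .optimizer exists).
scheduler = lr_scheduler.LambdaLR(optimizer.optimizer,
                                  lr_lambda=lambda epoch: 0.95 ** epoch)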

Training deadlock

There is an issue in the implementation of DistributedDataParallel that triggers a deadlock of processes.
Specifically, in the method flat_dist_call, there is a for loop over a dictionary with calls to collective operations (like broadcasting) in the body. Since the ordering of the dictionary's keys is random, we obtain non-matching calls to the collective operations, which induce a deadlock of the processes.
I have fixed this issue and created a pull request.

position of amp.init()

Hi, thanks for sharing this great project!

I'd like to ask about the position of amp.init(). Does it matter whether it is called at the start of the script? For example, can I call amp.init() after all nn.Module instances have been created?
Since there is no relation between the amp_handle and the model, I'm a little confused. In the case of apex.fp16_utils, it looks like I have to call model = network_to_half(model) before the training loop, but there is no such step in the case of amp.

Thank you again!
Jin

Windows --cuda_ext build fails due to missing canUse32BitIndexMath

Latest MSVC 2017 update, CUDA 10.0.130, PyTorch 1.0 release with python 3.6, apex from master branch.

   Creating library build\temp.win-amd64-3.6\Release\apex/optimizers/csrc\fused_adam_cuda.cp36-win_amd64.lib and object build\temp.win-amd64-3.6\Release\apex/optimizers/csrc\fused_adam_cuda.cp36-win_amd64.exp
fused_adam_cuda_kernel.obj : error LNK2001: unresolved external symbol "bool __cdecl at::cuda::detail::canUse32BitIndexMath(class at::Tensor const &,__int64)" (?canUse32BitIndexMath@detail@cuda@at@@YA_NAEBVTensor@3@_J@Z)
build\lib.win-amd64-3.6\fused_adam_cuda.cp36-win_amd64.pyd : fatal error LNK1120: 1 unresolved externals
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2017\\Community\\VC\\Tools\\MSVC\\14.16.27023\\bin\\HostX86\\x64\\link.exe' failed with exit status 1120

Warning spam when extensions are missing is excessive

Issue #96 does not concern me all that much, but during training I can hardly see the loss numbers in the repeated spam of things like

Warning:  apex was installed without --cuda_ext.  FusedAdam will be unavailable.
Warning:  apex was installed without --cuda_ext.  FusedLayerNorm will be unavailable.

I'm not sure how useful this warning is. But for sure, giving it more than once serves no purpose.

ZeroDivisionError in backward

Hi, I am having an error when I implement the amp procedure on a working CNN like this:

self.optimizer.zero_grad()

outputs = self.model(maps)

loss = self.criterion(outputs, labels.float())

# add automatic mixed precision support from apex
with self.amp_handle.scale_loss(loss, self.optimizer) as scaled_loss:
    scaled_loss.backward()

self.optimizer.step()
And here is the error I get:

    scaled_loss.backward()
  File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.5/dist-packages/apex-0.1-py3.5-linux-x86_64.egg/apex/amp/handle.py", line 53, in scale_loss
    optimizer.param_groups, loss_scale)
  File "/usr/local/lib/python3.5/dist-packages/apex-0.1-py3.5-linux-x86_64.egg/apex/amp/scaler.py", line 21, in unscale_and_update
    1. / scale,
ZeroDivisionError: float division by zero

Any suggestion would be appreciated.

AMP Checkpointing

Assuming it differs from normal pytorch usage, would it be possible to provide an example of the steps required to save and load model checkpoints with amp? Are there any specific considerations to take into account (especially when using DDP)?

Warning: apex was installed without --cuda_ext.

I installed apex with:

python setup.py install --cuda_ext --cpp_ext

After that, I ran

import apex

to test it, but it reports the following warnings:
Warning: apex was installed without --cuda_ext. Fused syncbn kernels will be unavailable. Python fallbacks will be used instead.
Warning: apex was installed without --cuda_ext. FusedAdam will be unavailable.
Warning: apex was installed without --cuda_ext. FusedLayerNorm will be unavailable.

Is there a problem?

installation issue

I am really excited about trying this, but every time I try installing, I get the following error:

torch.__version__ = 0.5.0a0+03e7953
Found CUDA_HOME = C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.2
Traceback (most recent call last):
  File "setup.py", line 105, in <module>
    CUDA_MAJOR = get_cuda_version()
  File "setup.py", line 85, in get_cuda_version
    re.compile('nvcc$').search)
  File "setup.py", line 38, in find
    return list(set(collection))
TypeError: 'NoneType' object is not iterable

atex/amex?

Will there be a TensorFlow/MXNet extension of all this awesome work?

pip uninstall apex: FileNotFoundError: [Errno 2] No such file or directory: '/.../apex-0.1-py3.6.egg

I used conda to create a Python 3.6 env and installed the latest master apex with python setup.py install.
When I run pip uninstall apex, I get this error:

Uninstalling apex-0.1:
  /xxx/apex-0.1-py3.6.egg
Proceed (y/n)? y
  Successfully uninstalled apex-0.1
Traceback (most recent call last):
  File "/xxx/bin/pip", line 6, in <module>
    sys.exit(pip.main())
  File "/xxx/pip/__init__.py", line 249, in main
    return command.main(cmd_args)
  File "/xxx/pip/basecommand.py", line 252, in main
    pip_version_check(session)
  File "/xxx/pip/utils/outdated.py", line 102, in pip_version_check
    installed_version = get_installed_version("pip")
  File "/xxx/pip/utils/__init__.py", line 838, in get_installed_version
    working_set = pkg_resources.WorkingSet()
  File "/xxx/pip/_vendor/pkg_resources/__init__.py", line 644, in __init__
    self.add_entry(entry)
  File "/xxx/pip/_vendor/pkg_resources/__init__.py", line 700, in add_entry
    for dist in find_distributions(entry, True):
  File "/xxx/pip/_vendor/pkg_resources/__init__.py", line 1949, in find_eggs_in_zip
    if metadata.has_metadata('PKG-INFO'):
  File "/xxx/pip/_vendor/pkg_resources/__init__.py", line 1463, in has_metadata
    return self.egg_info and self._has(self._fn(self.egg_info, name))
  File "/xxx/pip/_vendor/pkg_resources/__init__.py", line 1823, in _has
    return zip_path in self.zipinfo or zip_path in self._index()
  File "/xxx/pip/_vendor/pkg_resources/__init__.py", line 1703, in zipinfo
    return self._zip_manifests.load(self.loader.archive)
  File "/xxx/pip/_vendor/pkg_resources/__init__.py", line 1643, in load
    mtime = os.stat(path).st_mtime
FileNotFoundError: [Errno 2] No such file or directory: '/xxx/apex-0.1-py3.6.egg'

negligible performance gains and non-convergence on DCGAN using apex (what to change?)

I bought an RTX 2070 with the goal of training my DCGAN in fp16 for bigger and faster models. After carefully adjusting my models and trying vanilla model.half() without apex, AMP, and FP16_Optimizer, I'm not too convinced by the results. Maybe I did something wrong?

The architecture:

        # Loss function:
        criterion = nn.BCELoss()


       # Generator
       "512px output": (
        nn.Sequential(
        # Input Z (100x1x1)
        nn.ConvTranspose2d(nz, ngf * 64, 4, 1, 0, bias=False),
        nn.BatchNorm2d(ngf * 64),
        nn.LeakyReLU(negative_slope=0.2, inplace=True),
        # 4x4x(ngf*64)

        nn.ConvTranspose2d(ngf * 64, ngf * 32, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ngf * 32),
        nn.LeakyReLU(negative_slope=0.2, inplace=True),
        # 8x8x(ngf*32)

        nn.ConvTranspose2d(ngf * 32, ngf * 16, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ngf * 16),
        nn.LeakyReLU(negative_slope=0.2, inplace=True),
        # 16x16x(ngf*16)

        nn.ConvTranspose2d(ngf * 16, ngf * 8, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ngf * 8),
        nn.LeakyReLU(negative_slope=0.2, inplace=True),
        # 32x32x(ngf*8)

        nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ngf * 4),
        nn.LeakyReLU(negative_slope=0.2, inplace=True),
        # 64x64x(ngf*4)

        nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ngf * 2),
        nn.LeakyReLU(negative_slope=0.2, inplace=True),
        # 128x128x(ngf * 2)
            
        nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ngf),
        nn.LeakyReLU(negative_slope=0.2, inplace=True),
        # 256x256x(ngf)

        nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False),
        nn.Tanh()
        # 512x512x3 Output
    ),

    # Discriminator
    nn.Sequential(
        # Input 512x512x3
        nn.Conv2d(nc, ndf, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ndf),
        nn.LeakyReLU(0.2, inplace=True),
        # 256x256xndf

        nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ndf * 2),
        nn.LeakyReLU(0.2, inplace=True),
        # 128x128x(ndf * 2)

        nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ndf * 4),
        nn.LeakyReLU(0.2, inplace=True),
        # 64x64x(ndf * 4)

        nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ndf * 8),
        nn.LeakyReLU(0.2, inplace=True),
        # 32x32x(ndf * 8)

        nn.Conv2d(ndf * 8, ndf * 16, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ndf * 16),
        nn.LeakyReLU(0.2, inplace=True),
        # 16x16x(ndf * 16)

        nn.Conv2d(ndf * 16, ndf * 32, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ndf * 32),
        nn.LeakyReLU(0.2, inplace=True),
        # 8x8x(ndf * 32)
        
        nn.Conv2d(ndf * 32, ndf * 64, 4, 2, 1, bias=False),
        nn.BatchNorm2d(ndf * 64),
        nn.LeakyReLU(0.2, inplace=True),
        # 4x4x(ndf * 64)

        nn.Conv2d(ndf * 64, 1, 4, 1, 0, bias=False),
        nn.Sigmoid()
        # 1x1x1
    )),

I changed the following parts in my code to accommodate FP16:

network_to_half(netG)
network_to_half(netD)
optimizerD = FP16_Optimizer(optimizerD, dynamic_loss_scale=True, verbose=False)
optimizerG = FP16_Optimizer(optimizerG, dynamic_loss_scale=True, verbose=False)

in the training loop:

for i, data in enumerate(dataloader, 0):
    # making the input fp16
    input_batch = data[0].cuda().half()
     ....
    # collect gradients for real batch in discriminator
    optimizerD.backward(errD_real, update_master_grads=False)
     ....
    # collect gradients for fake batch in discriminator
    optimizerD.backward(errD_fake, update_master_grads=False)
     ....
    # backprop discriminator
     optimizerD.update_master_grads()
     optimizerD.step()
    ....
    # collect gradients for generated batch in generator and backprop generator
     optimizerG.backward(errG)
     optimizerG.step()
    ....

Results:

  • using stock model.half() without apex: the model is 2x slower and not converging after 1 epoch
  • using AMP: the model is 1.5x slower and not converging after 1 epoch
  • using FP16_Optimizer: the model is 1.2x slower and converging if dynamic_loss_scale is used

Basically the model only somewhat behaves if I'm using dynamic_loss_scale in FP16_Optimizer, although it produces garbage outputs even though the architecture didn't change from the FP32 model that worked.

AMP should use dynamic_loss_scale automatically but it always collapses after 1 iteration and is very slow.

I expected the model to be faster and at least converge like the FP32 model did. The only benefit is that the model occupies around 51% less space on the GPU, so bigger models can be trained.

Questions:

What do I need to change in my architecture and training setup to make FP16 work with this DCGAN?

System information

PyTorch version: 0.4.1
Is debug build: No
CUDA used to build PyTorch: 9.2

OS: Microsoft Windows 10 Home
GCC version: Could not collect
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 9.2.148
GPU models and configuration: GPU 0: GeForce RTX 2070
Nvidia driver version: 416.81
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] Could not collect
[conda] cuda92 1.0 0 pytorch
[conda] pytorch 0.4.1 py37_cuda92_cudnn7he774522_1 [cuda92] pytorch
[conda] torchvision 0.2.1

Feature request: FusedAdamW

So far I'm really liking apex - no hassle fp16 training. I've noticed in my experiments that the optimizer does take a not inconsiderable time to execute, so I'm quite interested to try out the new FusedAdam optimizer (once issue 74 is sorted out, that is).

The thing is, I'm normally using AdamW. It's a small variation of Adam that improves weight decay behaviour. I understand it's gotten quite popular; for instance, fast.ai is using it in all of their work. Would it also be possible to get a FusedAdamW implementation, please?

https://arxiv.org/pdf/1711.05101.pdf
https://www.fast.ai/2018/07/02/adam-weight-decay/

The overlap of communication with computation does not seem to be realized, according to the GPU log

We use the apex extension with PyTorch 0.4.0. The system information is:
system: ubuntu 16.04.4
pytorch version: 0.4.0 with CUDA 9.1 and CUDNN 7.0.5
python version: 3.5.2
GPU: Tesla P100 *8
NVIDIA driver: 390.46
Model: ResNet 50

We set shared_parameter=False to enable overlapping communication with computation (we have read the source code and found that if the value is True, communication happens after all computation). The message_size is reduced to 10^6. We ran 6 iterations and recorded the GPU log with the NVIDIA profiler tool.

However, we found from the GPU log that the overlap is not realized. The log of the 6th iteration is shown below. The first "AllReduceKernel" call comes after the call to 'MaxPoolBackward', which is the end of the backward computation. We checked the other iterations and found the same thing.

28.478387,0.475834,49,8,64,256,1,1,32,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void MaxPoolBackward<float, float>(int, float const *, long const *, int, int, int, int, int, int, int, int, int, int, int, int, int, int, float*)",291627

28.478871,0.138303,12544,1,1,512,1,1,10,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void kernelPointwiseApply3<ThresholdUpdateGradInput<float>, float, float, float, unsigned int, int=-2, int=-2, int=-2>(OffsetInfo<ThresholdUpdateGradInput<float>, float, unsigned int>, OffsetInfo<float, float, int=-2>, OffsetInfo<float, float, int=-2>, float, float)",291643

28.479021,0.007424,,,,,,,,,,0.001343,0.176630,"Device",,"Tesla P100-PCIE-16GB (0)","1","24","[CUDA memset]",291667

28.479047,0.197309,110,1,1,512,1,1,64,0.265625,24.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","24","void cudnn::detail::bn_bw_1C11_singleread<float, int=512, bool=1, int=1, int=2, int=14>(float, float, float, float, cudnnTensorStruct, float const *, cudnn::detail::bn_bw_1C11_singleread<float, int=512, bool=1, int=1, int=2, int=14>, float const , cudnn::detail::bn_bw_1C11_singleread<float, int=512, bool=1, int=1, int=2, int=14>, cudnnTensorStruct*, float const *, float*, float const *, float const , float const , float, cudnn::reduced_divisor, int, float*, cudnn::detail::bnBwPersistentState*, int, float, float, float, int, float, cudnnStatus_t*, bool)",291697

28.479262,0.002912,1,112,1,128,1,1,14,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","cudnn::maxwell::gemm::computeWgradOffsetsKernel(cudnn::maxwell::gemm::ComputeOffsetsParams)",291715
28.479274,0.008000,37,1,1,256,1,1,8,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void scalePackedTensor_kernel<float, float>(cudnnTensor4dStruct, float*, float)",291721
28.479295,0.004831,1,1,1,256,1,1,12,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","cudnn::maxwell::gemm::computeBOffsetsKernel(cudnn::maxwell::gemm::ComputeBOffsetsParams)",291726

28.479312,0.466554,2,1,112,128,1,1,128,10.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","maxwell_scudnn_128x64_stridedB_splitK_large_nn",291730

28.479791,0.007104,2,1,1,512,1,1,10,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void kernelPointwiseApply2<TensorAddOp<float>, float, float, unsigned int, int=-2, int=-2>(OffsetInfo<TensorAddOp<float>, float, unsigned int>, OffsetInfo<float, float, int=-2>, float, float)",291741

28.479806,0.043616,4000,1,1,512,1,1,10,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void kernelPointwiseApply2<TensorAddOp<float>, float, float, unsigned int, int=-2, int=-2>(OffsetInfo<TensorAddOp<float>, float, unsigned int>, OffsetInfo<float, float, int=-2>, float, float)",291754

28.479852,0.007136,,,,,,,,,,0.000046,0.006265,"Pinned","Device","Tesla P100-PCIE-16GB (0)","1","14","[CUDA memcpy HtoD]",291792

28.479865,0.005311,4,1,1,512,1,1,10,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void kernelPointwiseApply2<TensorAddOp<float>, float, float, unsigned int, int=-2, int=-2>(OffsetInfo<TensorAddOp<float>, float, unsigned int>, OffsetInfo<float, float, int=-2>, float, float)",291874

28.479879,0.049184,112,2,1,512,1,1,13,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","14","void CatArrayBatchedCopy<float, unsigned int, int=1>(float*, CatArrInputTensor<float, unsigned int>*, OutputTensorSizeStride<unsigned int, unsigned int=4>, int, unsigned int)",291807

28.479886,0.007520,4,1,1,512,1,1,10,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void kernelPointwiseApply2<TensorAddOp<float>, float, float, unsigned int, int=-2, int=-2>(OffsetInfo<TensorAddOp<float>, float, unsigned int>, OffsetInfo<float, float, int=-2>, float, float)",291889

28.479906,0.032896,2048,1,1,512,1,1,10,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void kernelPointwiseApply2<TensorAddOp<float>, float, float, unsigned int, int=-2, int=-2>(OffsetInfo<TensorAddOp<float>, float, unsigned int>, OffsetInfo<float, float, int=-2>, float, float)",291908

28.479940,4.409442,1,1,1,257,1,1,128,0.007812,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","14","void AllReduceKernel<int=256, int=8, FuncSum<float>, float>(KernelArgs<FuncSum<float>>)",291821

Could you please tell us the reason, or point out our mistakes in using the apex extensions? Thanks for your help.

Segmentation fault...

Ubuntu 18.04.1
Cuda 9.2
C++ 7.3.0
Python 3.6.5

nvidia/apex$ python setup.py install
Segmentation fault (core dumped)

Cheers
Pei

Script crashes when doing multi-process training (using all visible GPUs on the node)

python -m apex.parallel.multiproc main.py -a resnet50 --fp16 --b 128 --workers 4 /workspace/imagenet/
Traceback (most recent call last):
  File "main.py", line 466, in <module>
    main()
  File "main.py", line 117, in main
    rank=args.rank)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/distributed/__init__.py", line 94, in init_process_group
    group_name, rank)
RuntimeError: more than one node have assigned same rank at /opt/pytorch/pytorch/torch/lib/THD/process_group/General.cpp:17

How to use fp16 training with masked operations

Hello !

I'm working on sequence training with CNNs, and for this I have to apply some masked_fill operations over the padding, before the softmax for example.

In float32 training I mask with the value -1e20, and it seems to train fine. Unfortunately, when training in float16 and masking with -1e15, amp loss scaling always returns NaN gradients.

Do you have any idea how to combine masked_fill with amp?

Thanks,
Morgan
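For reference, one likely culprit: float16 can only represent magnitudes up to about 65504, so a fill value of -1e15 overflows to -inf and poisons the softmax backward. A minimal sketch of a dtype-safe fill (tensor shapes are placeholders):

import torch

scores = torch.randn(2, 4, device="cuda", dtype=torch.float16)
pad_mask = torch.tensor([[False, False, True, True],
                         [False, True, True, True]], device="cuda")

# Use the most negative finite value the dtype can hold instead of -1e15,
# which would overflow float16 to -inf.
fill_value = torch.finfo(scores.dtype).min
probs = scores.masked_fill(pad_mask, fill_value).softmax(dim=-1)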

'RNN' KeyError

Note that this is with the latest commit 12dce88

In [9]: from apex import amp
   ...: amp_handle = amp.init()
   ...:                
   ...:                   
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-9-ef5bcdad1b52> in <module>()
      1 from apex import amp    
----> 2 amp_handle = amp.init()                                                   
                      
/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6.egg/apex/amp/amp.py in init(enabled, enable_caching, verbose, allow_banned)
    144                                         
    145     # 5.5) Extra-special handling of RNN backend              
--> 146     wrap.rnn_cast(torch.nn.backends.thnn.backend, 'RNN', verbose)
    147                                
    148     # And even more special handling of `backward` for fused gru / lstm
                                      
/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6.egg/apex/amp/wrap.py in rnn_cast(backend, fn, verbose)
    142 #   2) Insert an fp16 `flat_weight` if necessary
    143 def rnn_cast(backend, fn, verbose=False):        
--> 144     orig_rnn = utils.get_func(backend, fn)    
    145     @functools.wraps(orig_rnn)                                                                                                                                                                               
    146     def rnn_wrapper(*args, **kwargs):               

/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6.egg/apex/amp/utils.py in get_func(mod, fn)
    117 def get_func(mod, fn):       
    118     if isinstance(mod, torch.nn.backends.backend.FunctionBackend):
--> 119         return mod.function_classes[fn]
    120     else:             
    121         return getattr(mod, fn)             
                                     
KeyError: 'RNN'
