Out of memory during searching & training. (pc-darts issue, closed)

yuhuixu1993 commented on July 18, 2024

Out of memory during searching & training.

from pc-darts.

Comments (9)

yuhuixu1993 commented on July 18, 2024

@ganji15, which GPU did you use in your experiment: a 1080 Ti or a V100? A 1080 Ti may hit an OOM error on newer PyTorch versions, just as with the original DARTS. Also, I note that the OOM occurs while testing the validation accuracy; you can simply comment out that validation part, which will not affect the searched result.

ganji15 commented on July 18, 2024

@yuhuixu1993 Thanks. My GPU is a Quadro P5000, and I also hit the OOM error earlier on a Titan X. I think you are right, and I will modify the code according to your suggestion.

ganji15 commented on July 18, 2024

@yuhuixu1993 The problem is not solved. I found that the OOM error occurs in forward propagation, so I cannot even evaluate the performance of the searched model.

Specifically, I deleted the training part of ``train.py'' and kept the inference part as follows:

  for epoch in range(args.epochs):
    scheduler.step()
    logging.info('epoch %d lr %e', epoch, scheduler.get_lr()[0])
    model.drop_path_prob = args.drop_path_prob * epoch / args.epochs

    # train_acc, train_obj = train(train_queue, dp_model, model, criterion, optimizer)
    # logging.info('train_acc %f', train_acc)
    logging.info('enter infer')
    valid_acc, valid_obj = infer(valid_queue, dp_model, model, criterion)
    logging.info('exit infer')
    if valid_acc > best_acc:
      best_acc = valid_acc

    logging.info('valid_acc %f, best_acc %f', valid_acc, best_acc)
    # logging.info('valid_acc %f', valid_acc)

I also added logging to the ``infer'' function as follows:

def infer(valid_queue, model, criterion):
   ...
  for step, (input, target) in enumerate(valid_queue):
    logging.info('step %d'%step)  ## debug information
    input = input.cuda()
    target = target.cuda(non_blocking=True)
    ...

Then I ran ``train.py'' and got the following errors:

➜  pc-darts python train.py --auxiliary --cutout --gpu 1                                              
Experiment dir : eval-EXP-20190902-110522
09/02 11:05:22 AM gpu device = 1
09/02 11:05:22 AM args = Namespace(arch='PCDARTS', auxiliary=True, auxiliary_weight=0.4, batch_size=96, cutout=True, cutout_length=16, data='../data', drop_path_prob=0.3, epochs=600, gpu='1', grad_clip=5, init_channels=36, layers=20, learning_rate=0.025, model_path='saved_models', momentum=0.9, report_freq=50, save='eval-EXP-20190902-110522', seed=0, set='cifar10', weight_decay=0.0003)
108 108 36
108 144 36
144 144 36
144 144 36
144 144 36
144 144 36
144 144 72
144 288 72
288 288 72
288 288 72
288 288 72
288 288 72
288 288 72
288 288 144
288 576 144
576 576 144
576 576 144
576 576 144
576 576 144
576 576 144
09/02 11:05:24 AM param size = 3.634678MB
Files already downloaded and verified
Files already downloaded and verified
09/02 11:05:26 AM epoch 0 lr 2.500000e-02
09/02 11:05:26 AM enter infer
09/02 11:05:26 AM step 0
09/02 11:05:26 AM valid 000 2.301318e+00 8.333333
09/02 11:05:26 AM step 1
Traceback (most recent call last):
  File "train.py", line 193, in <module>
    main() 
  File "train.py", line 125, in main
    valid_acc, valid_obj = infer(valid_queue, dp_model, model, criterion)
  File "train.py", line 178, in infer
    logits, _ = dp_model(input)
  File "/home/cvmt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ganji/Documents/work/pc-darts/model.py", line 150, in forward
    s0, s1 = s1, cell(s0, s1, self.drop_path_prob)
  File "/home/cvmt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ganji/Documents/work/pc-darts/model.py", line 51, in forward
    h1 = op1(h1)
  File "/home/cvmt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ganji/Documents/work/pc-darts/operations.py", line 66, in forward
    return self.op(x)
  File "/home/cvmt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cvmt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/cvmt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cvmt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 320, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 6.75 MiB (GPU 0; 15.90 GiB total capacity; 14.32 GiB already allocated; 3.56 MiB free; 3.09 MiB cached)

Therefore, I guess that something is wrong with the searched model (too large? too deep? a circular route?). How can I visualize the searched model? How can I fix this error? Thanks!

P.S. I downgraded PyTorch from 1.2 to 1.0, and the problem remains.

yuhuixu1993 commented on July 18, 2024

@ganji15, yes. Could you reduce the batch size and try again? PyTorch version 0.3 is suggested.

ganji15 commented on July 18, 2024

@yuhuixu1993 It works when I change the batch size from 96 to 48. The model is so large that forward propagation alone consumes over 16 GB of GPU memory.
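
The manual step from 96 to 48 can also be automated. The following is a minimal sketch, not code from pc-darts: `largest_fitting_batch_size` and `fake_step` are hypothetical names, and `fake_step` only simulates a forward pass that fails above 48. It assumes the out-of-memory failure surfaces as a `RuntimeError` whose message contains "out of memory", as in the traceback above.

```python
def largest_fitting_batch_size(run_step, start, minimum=1):
    """Halve the batch size until run_step(batch_size) stops raising an OOM error."""
    batch_size = start
    while batch_size >= minimum:
        try:
            run_step(batch_size)
            return batch_size
        except RuntimeError as err:
            # Re-raise anything that is not an out-of-memory failure.
            if "out of memory" not in str(err):
                raise
            batch_size //= 2
    raise RuntimeError("no batch size >= %d fits in memory" % minimum)


def fake_step(batch_size):
    # Hypothetical stand-in for one forward pass; pretends anything above 48 does not fit.
    if batch_size > 48:
        raise RuntimeError("CUDA out of memory. (simulated)")


print(largest_fitting_batch_size(fake_step, start=96))
```

In a real run, `run_step` would build a loader with the given batch size and push one batch through the model; calling `torch.cuda.empty_cache()` between attempts may help release cached blocks before retrying.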

yuhuixu1993 commented on July 18, 2024

Can you evaluate the DARTS architecture as well? Trying it would tell us whether the problem is the PyTorch version or the model size, since I can evaluate DARTS on a 1080 Ti. If the model size is the cause, you can search for a smaller architecture by adding a FLOPs constraint as in SNAS, use a smaller batch size, or use a bigger GPU.

ganji15 commented on July 18, 2024

@yuhuixu1993 I think the problem is the PyTorch version. The original code of the ``infer'' function is as follows:

def infer(valid_queue, model, criterion):
  objs = utils.AvgrageMeter()
  top1 = utils.AvgrageMeter()
  top5 = utils.AvgrageMeter()
  model.eval()

  for step, (input, target) in enumerate(valid_queue):
    #input = input.cuda()
    #target = target.cuda(non_blocking=True)
    input = input.cuda()
    target = target.cuda(non_blocking=True)
    logits = model(input)
    loss = criterion(logits, target)

    prec1, prec5 = utils.accuracy(logits, target, topk=(1, 5))
    n = input.size(0)
    objs.update(loss.data.item(), n)   ## this may lead to GPU memory leak in pytorch 1.0+
    top1.update(prec1.data.item(), n)  ## this may lead to GPU memory leak in pytorch 1.0+
    top5.update(prec5.data.item(), n)  ## this may lead to GPU memory leak in pytorch 1.0+

    if step % args.report_freq == 0:
      logging.info('valid %03d %e %f %f', step, objs.avg, top1.avg, top5.avg)

  return top1.avg, objs.avg

So I changed the code as follows:

def infer(valid_queue, model, criterion):
  objs = utils.AvgrageMeter()
  top1 = utils.AvgrageMeter()
  top5 = utils.AvgrageMeter()
  model.eval()

  with torch.no_grad():  # no grad for inference
    for step, (input, target) in enumerate(valid_queue):
      input = input.cuda()
      target = target.cuda(non_blocking=True)
      logits = model(input)
      loss = criterion(logits, target)

      prec1, prec5 = utils.accuracy(logits, target, topk=(1, 5))
      n = input.size(0)
      objs.update(loss.item(), n) # deleting .data to fix OOM
      top1.update(prec1.item(), n)  # deleting .data to fix OOM
      top5.update(prec5.item(), n)  # deleting .data to fix OOM

      if step % args.report_freq == 0:
        logging.info('valid %03d %e %f %f', step, objs.avg, top1.avg, top5.avg)

  return top1.avg, objs.avg

After the above modification, I can evaluate the model with a larger batch size with PyTorch 1.2.
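
A likely reason the change helps is the `with torch.no_grad()` block: `model.eval()` on its own does not stop autograd from recording a graph, so activations stay referenced during the forward pass. The sketch below illustrates this on the CPU with a tiny `nn.Linear` layer, which is a hypothetical stand-in and not the PC-DARTS network.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the evaluated network; any nn.Module behaves the same way.
model = nn.Linear(8, 4)
model.eval()  # changes layer behaviour (dropout, batch norm) but does NOT disable autograd

x = torch.randn(2, 8)

# Without no_grad, the output is attached to an autograd graph that keeps activations alive.
out_tracked = model(x)

# Inside no_grad, no graph is recorded, so intermediate activations can be freed right away.
with torch.no_grad():
    out_untracked = model(x)

print(out_tracked.requires_grad, out_tracked.grad_fn is not None)
print(out_untracked.requires_grad, out_untracked.grad_fn is None)
```

The first output carries a `grad_fn` and the second does not, which is why the wrapped `infer` loop can run with a larger batch size.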

By the way, how can I visualize the searched model and see its detailed configurations? Thanks.

yuhuixu1993 commented on July 18, 2024

Yes, I have already updated my code too. The visualization code is inherited from DARTS. First, copy the searched architecture into genotypes.py, and then run python visualize.py PC-DARTS.
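
For a quick text view of a cell without drawing a graph, a genotype can also be printed edge by edge. This is a rough sketch under assumptions: it follows the DARTS convention that each intermediate node has two incoming edges and that indices 0 and 1 refer to the two previous cells. `EXAMPLE` is a made-up genotype for illustration, not the architecture found by PC-DARTS, and `describe_cell` is a hypothetical helper.

```python
from collections import namedtuple

# Same field layout as the Genotype tuple in the DARTS code base.
Genotype = namedtuple("Genotype", "normal normal_concat reduce reduce_concat")

# Hypothetical genotype for illustration only; NOT the architecture found by PC-DARTS.
EXAMPLE = Genotype(
    normal=[("sep_conv_3x3", 0), ("skip_connect", 1),
            ("sep_conv_3x3", 0), ("dil_conv_3x3", 2)],
    normal_concat=[2, 3],
    reduce=[("max_pool_3x3", 0), ("sep_conv_5x5", 1),
            ("skip_connect", 2), ("max_pool_3x3", 0)],
    reduce_concat=[2, 3],
)


def describe_cell(edges):
    """Return one line per edge: source --op--> intermediate node."""
    lines = []
    for k, (op, src) in enumerate(edges):
        node = k // 2  # two incoming edges per intermediate node
        if src == 0:
            source = "c_{k-2}"
        elif src == 1:
            source = "c_{k-1}"
        else:
            source = "node %d" % (src - 2)
        lines.append("%s --%s--> node %d" % (source, op, node))
    return lines


for line in describe_cell(EXAMPLE.normal):
    print(line)
```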

ganji15 commented on July 18, 2024

@yuhuixu1993 Thank you for your kind help. Since the problem has been solved, I will close this issue.
