
Comments (9)

yuhuixu1993 commented on July 18, 2024

@ganji15, which GPU did you use in your experiment: a 1080 Ti or a V100? A 1080 Ti may hit an OOM error on newer PyTorch versions, just as with the original DARTS. Besides, I note that the OOM occurs when testing the validation accuracy. You can simply comment out that validation part; this will not affect the search result.

from pc-darts.

ganji15 commented on July 18, 2024

@yuhuixu1993 Thanks. My GPU is a Quadro P5000, and I also hit the OOM error earlier when using a Titan X. I think you are right, and I will modify the code according to your good suggestion.


ganji15 commented on July 18, 2024

@yuhuixu1993 The problem is not solved. I found that the OOM error occurs in the forward propagation, so I cannot even evaluate the performance of the searched model.

Specifically, I deleted the training part of "train.py" and kept the inference part as follows:

  for epoch in range(args.epochs):
    scheduler.step()
    logging.info('epoch %d lr %e', epoch, scheduler.get_lr()[0])
    model.drop_path_prob = args.drop_path_prob * epoch / args.epochs

    # train_acc, train_obj = train(train_queue, dp_model, model, criterion, optimizer)
    # logging.info('train_acc %f', train_acc)
    logging.info('enter infer')
    valid_acc, valid_obj = infer(valid_queue, dp_model, model, criterion)
    logging.info('exit infer')
    if valid_acc > best_acc:
      best_acc = valid_acc

    logging.info('valid_acc %f, best_acc %f', valid_acc, best_acc)
    # logging.info('valid_acc %f', valid_acc)

I also added logging to the "infer" function as follows:

def infer(valid_queue, model, criterion):
  ...
  for step, (input, target) in enumerate(valid_queue):
    logging.info('step %d'%step)  ## debug information
    input = input.cuda()
    target = target.cuda(non_blocking=True)
    ...

Then I ran "train.py" and got the following error:

➜  pc-darts python train.py --auxiliary --cutout --gpu 1                                              
Experiment dir : eval-EXP-20190902-110522
09/02 11:05:22 AM gpu device = 1
09/02 11:05:22 AM args = Namespace(arch='PCDARTS', auxiliary=True, auxiliary_weight=0.4, batch_size=96, cutout=True, cutout_length=16, data='../data', drop_path_prob=0.3, epochs=600, gpu='1', grad_clip=5, init_channels=36, layers=20, learning_rate=0.025, model_path='saved_models', momentum=0.9, report_freq=50, save='eval-EXP-20190902-110522', seed=0, set='cifar10', weight_decay=0.0003)
108 108 36
108 144 36
144 144 36
144 144 36
144 144 36
144 144 36
144 144 72
144 288 72
288 288 72
288 288 72
288 288 72
288 288 72
288 288 72
288 288 144
288 576 144
576 576 144
576 576 144
576 576 144
576 576 144
576 576 144
09/02 11:05:24 AM param size = 3.634678MB
Files already downloaded and verified
Files already downloaded and verified
09/02 11:05:26 AM epoch 0 lr 2.500000e-02
09/02 11:05:26 AM enter infer
09/02 11:05:26 AM step 0
09/02 11:05:26 AM valid 000 2.301318e+00 8.333333
09/02 11:05:26 AM step 1
Traceback (most recent call last):
  File "train.py", line 193, in <module>
    main() 
  File "train.py", line 125, in main
    valid_acc, valid_obj = infer(valid_queue, dp_model, model, criterion)
  File "train.py", line 178, in infer
    logits, _ = dp_model(input)
  File "/home/cvmt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ganji/Documents/work/pc-darts/model.py", line 150, in forward
    s0, s1 = s1, cell(s0, s1, self.drop_path_prob)
  File "/home/cvmt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ganji/Documents/work/pc-darts/model.py", line 51, in forward
    h1 = op1(h1)
  File "/home/cvmt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ganji/Documents/work/pc-darts/operations.py", line 66, in forward
    return self.op(x)
  File "/home/cvmt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cvmt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/cvmt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cvmt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 320, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 6.75 MiB (GPU 0; 15.90 GiB total capacity; 14.32 GiB already allocated; 3.56 MiB free; 3.09 MiB cached)

Therefore, I guess that something is wrong with the searched model (too large? too deep? a cyclic route?). How can I visualize the searched model, and how can I fix this error? Thanks!

P.S. I downgraded PyTorch from 1.2 to 1.0, and the problem remains.


yuhuixu1993 commented on July 18, 2024

@ganji15, yes. Could you reduce the batch size and try again? PyTorch version 0.3 is suggested.


ganji15 commented on July 18, 2024

@yuhuixu1993 It works when I change the batch size from 96 to 48. The model is so large that it consumes over 16 GB of GPU memory in the forward propagation alone.
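
To see how much of that memory depends on the batch size and how much on autograd bookkeeping, one option is to read the peak allocation around a single forward pass. The sketch below is only an illustration: `peak_forward_memory_mib` and the toy network are hypothetical and not part of pc-darts, and it assumes a PyTorch version that provides `torch.cuda.reset_peak_memory_stats`. On a machine without CUDA the helper returns `None`.

```python
import torch
import torch.nn as nn


def peak_forward_memory_mib(model, batch, use_no_grad):
    """Peak CUDA memory (MiB) for one forward pass, or None without a GPU."""
    if not torch.cuda.is_available():
        return None  # the memory statistics used below are CUDA-only
    device = torch.device('cuda')
    model = model.to(device).eval()
    batch = batch.to(device)
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(device)  # start measuring from here
    if use_no_grad:
        with torch.no_grad():  # no autograd graph is recorded
            model(batch)
    else:
        model(batch)  # activations are kept for a possible backward pass
    torch.cuda.synchronize(device)
    # Peak bytes occupied by tensors since the reset, converted to MiB.
    return torch.cuda.max_memory_allocated(device) / 2 ** 20


# Toy stand-in network, NOT the searched PC-DARTS model.
toy = nn.Sequential(nn.Conv2d(3, 36, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(36, 36, 3, padding=1))
for batch_size in (96, 48):
    for use_no_grad in (False, True):
        mib = peak_forward_memory_mib(
            toy, torch.randn(batch_size, 3, 32, 32), use_no_grad)
        print('batch %d, no_grad=%s, peak MiB: %s'
              % (batch_size, use_no_grad, mib))
```

Running the same comparison on the real model would show whether halving the batch size or disabling gradient tracking saves more memory.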


yuhuixu1993 commented on July 18, 2024

Can you evaluate the DARTS architecture? Please try it, so that we can tell whether the problem is the PyTorch version or the model size, since I can evaluate DARTS on a 1080 Ti. If model size is the cause, you may search for a smaller architecture by adding a FLOPs constraint as in SNAS, or use a smaller batch size or a bigger GPU.


ganji15 commented on July 18, 2024

@yuhuixu1993 I think the problem is the PyTorch version. The original code of the "infer" function is as follows:

def infer(valid_queue, model, criterion):
  objs = utils.AvgrageMeter()
  top1 = utils.AvgrageMeter()
  top5 = utils.AvgrageMeter()
  model.eval()

  for step, (input, target) in enumerate(valid_queue):
    #input = input.cuda()
    #target = target.cuda(non_blocking=True)
    input = input.cuda()
    target = target.cuda(non_blocking=True)
    logits = model(input)
    loss = criterion(logits, target)

    prec1, prec5 = utils.accuracy(logits, target, topk=(1, 5))
    n = input.size(0)
    objs.update(loss.data.item(), n)   ## this may lead to GPU memory leak in pytorch 1.0+
    top1.update(prec1.data.item(), n)  ## this may lead to GPU memory leak in pytorch 1.0+
    top5.update(prec5.data.item(), n)  ## this may lead to GPU memory leak in pytorch 1.0+

    if step % args.report_freq == 0:
      logging.info('valid %03d %e %f %f', step, objs.avg, top1.avg, top5.avg)

  return top1.avg, objs.avg

So I changed the code as follows:

def infer(valid_queue, model, criterion):
  objs = utils.AvgrageMeter()
  top1 = utils.AvgrageMeter()
  top5 = utils.AvgrageMeter()
  model.eval()

  with torch.no_grad():  # no grad for inference
    for step, (input, target) in enumerate(valid_queue):
      input = input.cuda()
      target = target.cuda(non_blocking=True)
      logits = model(input)
      loss = criterion(logits, target)

      prec1, prec5 = utils.accuracy(logits, target, topk=(1, 5))
      n = input.size(0)
      objs.update(loss.item(), n) # deleting .data to fix OOM
      top1.update(prec1.item(), n)  # deleting .data to fix OOM
      top5.update(prec5.item(), n)  # deleting .data to fix OOM

      if step % args.report_freq == 0:
        logging.info('valid %03d %e %f %f', step, objs.avg, top1.avg, top5.avg)

  return top1.avg, objs.avg

After the above modification, I can evaluate the model with a larger batch size with PyTorch 1.2.
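
A small CPU-only sketch of why the change above likely helps. It uses a toy network (an assumption, not the PC-DARTS model) to show that a forward pass outside `torch.no_grad()` produces a loss with a `grad_fn`, meaning autograd keeps intermediate activations alive, while the same pass inside `torch.no_grad()` does not. It also suggests that the saving comes from `torch.no_grad()` rather than from dropping `.data`, since both `.item()` and `.data.item()` return a plain Python float.

```python
import torch
import torch.nn as nn

# Toy stand-in for the evaluated network; NOT the PC-DARTS model.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
)
model.eval()
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(4, 3, 32, 32)      # CIFAR-sized dummy batch
targets = torch.randint(0, 10, (4,))    # dummy class labels

# Outside no_grad: the loss is attached to an autograd graph, so the
# intermediate activations stay in memory until the loss is released.
loss_tracked = criterion(model(inputs), targets)

# Inside no_grad: no graph is recorded, so activations can be freed
# as soon as the next layer has consumed them.
with torch.no_grad():
    loss_untracked = criterion(model(inputs), targets)

print('tracked:  ', loss_tracked.requires_grad, loss_tracked.grad_fn)
print('untracked:', loss_untracked.requires_grad, loss_untracked.grad_fn)

# Both spellings give a Python float that holds no reference to the graph.
print(type(loss_tracked.item()), type(loss_tracked.data.item()))
```

In the original loop, the `logits` and `loss` from the previous step are still referenced while the next forward pass runs, which may explain why the traceback above fails at step 1 rather than step 0.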

By the way, how can I visualize the searched model and see its detailed configurations? Thanks.


yuhuixu1993 commented on July 18, 2024

Yes, I have already updated my code too. The visualization code is inherited from DARTS. First, copy the searched architecture into genotypes.py, and then run python visualize.py PC-DARTS.
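
For readers unfamiliar with the DARTS layout, a rough sketch of what such an entry looks like. In DARTS-style code a genotype is a namedtuple with the fields `normal`, `normal_concat`, `reduce` and `reduce_concat`, where each cell is a list of `(operation, input_index)` pairs, two per intermediate node. The entry `MY_SEARCHED` and the helper `describe_cell` below are hypothetical, and the operations are placeholders rather than a searched result; the real values come from the search log.

```python
from collections import namedtuple

# Container used by DARTS-style genotypes.py files.
Genotype = namedtuple('Genotype', 'normal normal_concat reduce reduce_concat')

# HYPOTHETICAL entry: placeholder operations, not an actual search result.
PLACEHOLDER_EDGES = [
    ('sep_conv_3x3', 0), ('skip_connect', 1),   # inputs of state 2
    ('sep_conv_3x3', 0), ('dil_conv_3x3', 2),   # inputs of state 3
    ('sep_conv_5x5', 1), ('avg_pool_3x3', 3),   # inputs of state 4
    ('sep_conv_3x3', 0), ('max_pool_3x3', 4),   # inputs of state 5
]
MY_SEARCHED = Genotype(
    normal=PLACEHOLDER_EDGES, normal_concat=range(2, 6),
    reduce=PLACEHOLDER_EDGES, reduce_concat=range(2, 6),
)


def describe_cell(edges, steps=4):
    """Text summary of a cell: two (op, input) pairs per intermediate node.

    States 0 and 1 are the outputs of the two previous cells; states 2 and
    up are the intermediate nodes of the current cell.
    """
    lines = []
    for node in range(steps):
        pair = edges[2 * node: 2 * node + 2]
        ops = ', '.join('%s(%d)' % (op, src) for op, src in pair)
        lines.append('state %d <- %s' % (node + 2, ops))
    return lines


for line in describe_cell(MY_SEARCHED.normal):
    print(line)
```

If the visualization script resolves its argument as an attribute of the genotypes module (an assumption about the script), the name passed on the command line has to match the variable name in genotypes.py exactly and be a valid Python identifier.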


ganji15 commented on July 18, 2024

@yuhuixu1993 Thank you for your kind help. Since the problem has been solved, I will close this issue.

