Comments (9)
@ganji15, which GPU did you use in your experiment, a 1080 Ti or a V100? A 1080 Ti may hit an OOM error on newer PyTorch versions, just like the original DARTS. Also, I notice that the OOM occurs while computing the validation accuracy. You can simply comment out that validation part; this will not affect the search result.
from pc-darts.
@yuhuixu1993 Thanks. My GPU is a Quadro P5000, and I also hit the OOM error earlier on a Titan X. I think you are right, and I will modify the code as you suggest.
@yuhuixu1993 The problem is not solved. I found that the OOM error occurs during forward propagation, so I cannot even evaluate the performance of the searched model.
Specifically, I commented out the training part of ``train.py'' and kept only the inference part, as follows:
for epoch in range(args.epochs):
    scheduler.step()
    logging.info('epoch %d lr %e', epoch, scheduler.get_lr()[0])
    model.drop_path_prob = args.drop_path_prob * epoch / args.epochs
    # train_acc, train_obj = train(train_queue, dp_model, model, criterion, optimizer)
    # logging.info('train_acc %f', train_acc)
    logging.info('enter infer')
    valid_acc, valid_obj = infer(valid_queue, dp_model, model, criterion)
    logging.info('exit infer')
    if valid_acc > best_acc:
        best_acc = valid_acc
    logging.info('valid_acc %f, best_acc %f', valid_acc, best_acc)
    # logging.info('valid_acc %f', valid_acc)
I also added logging to the ``infer'' function, as follows:
def infer(valid_queue, model, criterion):
    ...
    for step, (input, target) in enumerate(valid_queue):
        logging.info('step %d'%step)  ## debug information
        input = input.cuda()
        target = target.cuda(non_blocking=True)
        ...
Then I ran ``train.py'' and got the following error:
➜ pc-darts python train.py --auxiliary --cutout --gpu 1
Experiment dir : eval-EXP-20190902-110522
09/02 11:05:22 AM gpu device = 1
09/02 11:05:22 AM args = Namespace(arch='PCDARTS', auxiliary=True, auxiliary_weight=0.4, batch_size=96, cutout=True, cutout_length=16, data='../data', drop_path_prob=0.3, epochs=600, gpu='1', grad_clip=5, init_channels=36, layers=20, learning_rate=0.025, model_path='saved_models', momentum=0.9, report_freq=50, save='eval-EXP-20190902-110522', seed=0, set='cifar10', weight_decay=0.0003)
108 108 36
108 144 36
144 144 36
144 144 36
144 144 36
144 144 36
144 144 72
144 288 72
288 288 72
288 288 72
288 288 72
288 288 72
288 288 72
288 288 144
288 576 144
576 576 144
576 576 144
576 576 144
576 576 144
576 576 144
09/02 11:05:24 AM param size = 3.634678MB
Files already downloaded and verified
Files already downloaded and verified
09/02 11:05:26 AM epoch 0 lr 2.500000e-02
09/02 11:05:26 AM enter infer
09/02 11:05:26 AM step 0
09/02 11:05:26 AM valid 000 2.301318e+00 8.333333
09/02 11:05:26 AM step 1
Traceback (most recent call last):
  File "train.py", line 193, in <module>
    main()
  File "train.py", line 125, in main
    valid_acc, valid_obj = infer(valid_queue, dp_model, model, criterion)
  File "train.py", line 178, in infer
    logits, _ = dp_model(input)
  File "/home/cvmt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ganji/Documents/work/pc-darts/model.py", line 150, in forward
    s0, s1 = s1, cell(s0, s1, self.drop_path_prob)
  File "/home/cvmt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ganji/Documents/work/pc-darts/model.py", line 51, in forward
    h1 = op1(h1)
  File "/home/cvmt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ganji/Documents/work/pc-darts/operations.py", line 66, in forward
    return self.op(x)
  File "/home/cvmt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cvmt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/cvmt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cvmt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 320, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 6.75 MiB (GPU 0; 15.90 GiB total capacity; 14.32 GiB already allocated; 3.56 MiB free; 3.09 MiB cached)
Therefore, I guess that something is wrong with the searched model (too large? too deep? a cyclic route?). How can I visualize the searched model, and how can I fix this error? Thanks!
P.S. I downgraded PyTorch from 1.2 to 1.0, and the problem remains.
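The numbers in the CUDA error message above are easier to compare once they share a unit. Below is a minimal sketch, assuming the message format stays as quoted in the traceback; `parse_oom` and `UNIT_TO_MIB` are hypothetical helper names, not part of PyTorch or pc-darts.

```python
import re

# Message copied verbatim from the traceback above.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 6.75 MiB (GPU 0; 15.90 GiB total "
    "capacity; 14.32 GiB already allocated; 3.56 MiB free; 3.09 MiB cached)"
)

# Conversion factors to MiB for the units that appear in such messages.
UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}


def parse_oom(message):
    """Return the labelled sizes in a CUDA OOM message, converted to MiB."""
    pattern = r"([\d.]+) (KiB|MiB|GiB) (total capacity|already allocated|free|cached)"
    sizes = {}
    for value, unit, label in re.findall(pattern, message):
        sizes[label] = float(value) * UNIT_TO_MIB[unit]
    return sizes


sizes = parse_oom(OOM_MESSAGE)
fraction = sizes["already allocated"] / sizes["total capacity"]
print("allocated: %.0f MiB (%.0f%% of the device)"
      % (sizes["already allocated"], 100 * fraction))
```

The failed request is only a few MiB while most of the device is already allocated, which suggests that memory accumulated across steps rather than one layer asking for a huge buffer.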
@ganji15, yes. Could you reduce the batch size and try again? PyTorch version 0.3 is suggested.
@yuhuixu1993 It works when I change the batch size from 96 to 48. The model is so large that forward propagation alone consumes over 16 GB of GPU memory.
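Halving the batch size until a forward pass fits can be automated. Below is a minimal sketch, assuming the forward pass is wrapped in a callable that raises `RuntimeError` on a CUDA out-of-memory error, as in the traceback above; `largest_fitting_batch_size`, `run_step` and `fake_step` are hypothetical names used only for illustration.

```python
def largest_fitting_batch_size(run_step, start, minimum=1):
    """Halve the batch size until run_step(batch_size) stops raising OOM.

    run_step is a hypothetical callable that runs one forward pass at the
    given batch size and raises RuntimeError on a CUDA out-of-memory error.
    """
    batch_size = start
    while batch_size >= minimum:
        try:
            run_step(batch_size)
            return batch_size
        except RuntimeError as err:
            # Re-raise anything that is not an out-of-memory error.
            if "out of memory" not in str(err):
                raise
            batch_size //= 2
    raise RuntimeError("no batch size >= %d fits in memory" % minimum)


# Stand-in for a real forward pass: pretend anything above 48 samples fails.
def fake_step(batch_size):
    if batch_size > 48:
        raise RuntimeError("CUDA out of memory.")


print(largest_fitting_batch_size(fake_step, start=96))
```

With a real model, `run_step` would build one batch and call the network on it. Note that a smaller batch size only works around the memory use; it does not explain why inference alone needed so much memory.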
Can you evaluate the DARTS architecture? Trying that would tell us whether the problem is the PyTorch version or the model size, since I can evaluate DARTS on a 1080 Ti. If the model size turns out to be the cause, you could search for a smaller architecture by adding a FLOPs constraint as in SNAS, use a smaller batch size, or use a GPU with more memory.
@yuhuixu1993 I think the problem is the PyTorch version. The original code of the ``infer'' function is as follows:
def infer(valid_queue, model, criterion):
    objs = utils.AvgrageMeter()
    top1 = utils.AvgrageMeter()
    top5 = utils.AvgrageMeter()
    model.eval()
    for step, (input, target) in enumerate(valid_queue):
        #input = input.cuda()
        #target = target.cuda(non_blocking=True)
        input = input.cuda()
        target = target.cuda(non_blocking=True)
        logits = model(input)
        loss = criterion(logits, target)
        prec1, prec5 = utils.accuracy(logits, target, topk=(1, 5))
        n = input.size(0)
        objs.update(loss.data.item(), n)  ## this may lead to GPU memory leak in pytorch 1.0+
        top1.update(prec1.data.item(), n)  ## this may lead to GPU memory leak in pytorch 1.0+
        top5.update(prec5.data.item(), n)  ## this may lead to GPU memory leak in pytorch 1.0+
        if step % args.report_freq == 0:
            logging.info('valid %03d %e %f %f', step, objs.avg, top1.avg, top5.avg)
    return top1.avg, objs.avg
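For context, the `objs.update(value, n)` calls above feed a running average weighted by batch size. Below is a minimal sketch of such a meter, assuming it behaves like a plain weighted mean; `AverageMeter` is a hypothetical stand-in and not necessarily identical to `utils.AvgrageMeter`.

```python
class AverageMeter(object):
    """Running average of a scalar, weighted by the number of samples."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.sum = 0.0
        self.cnt = 0
        self.avg = 0.0

    def update(self, val, n=1):
        # val is expected to be a plain Python number, e.g. from tensor.item().
        self.sum += val * n
        self.cnt += n
        self.avg = self.sum / self.cnt


meter = AverageMeter()
meter.update(2.0, n=96)  # a full batch with mean loss 2.0
meter.update(1.0, n=48)  # a smaller final batch with mean loss 1.0
print(meter.avg)
```

Because `.item()` returns a Python number, a meter like this stores only floats and keeps no reference to any tensor.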
So I changed the code as follows:
def infer(valid_queue, model, criterion):
    objs = utils.AvgrageMeter()
    top1 = utils.AvgrageMeter()
    top5 = utils.AvgrageMeter()
    model.eval()
    with torch.no_grad():  # no grad for inference
        for step, (input, target) in enumerate(valid_queue):
            input = input.cuda()
            target = target.cuda(non_blocking=True)
            logits = model(input)
            loss = criterion(logits, target)
            prec1, prec5 = utils.accuracy(logits, target, topk=(1, 5))
            n = input.size(0)
            objs.update(loss.item(), n)  # deleting .data to fix OOM
            top1.update(prec1.item(), n)  # deleting .data to fix OOM
            top5.update(prec5.item(), n)  # deleting .data to fix OOM
            if step % args.report_freq == 0:
                logging.info('valid %03d %e %f %f', step, objs.avg, top1.avg, top5.avg)
    return top1.avg, objs.avg
With this modification, I can evaluate the model at a larger batch size on PyTorch 1.2.
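A likely reason the rewritten ``infer'' helps is the `torch.no_grad()` block: without it, each forward pass records an autograd graph and keeps the intermediate activations alive until its outputs are released, even in `model.eval()` mode. Below is a minimal sketch of the difference, assuming a tiny stand-in network on the CPU rather than the searched model.

```python
import torch
import torch.nn as nn

# Tiny stand-in network; the searched PC-DARTS model is not needed for this.
model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU())
model.eval()  # switches layer behaviour only; gradients are still tracked
batch = torch.randn(4, 3, 32, 32)

# Default mode: the output is attached to an autograd graph, so the
# activations needed for a backward pass stay alive as long as it does.
tracked = model(batch)

# Under no_grad: no graph is recorded, so nothing extra is retained.
with torch.no_grad():
    untracked = model(batch)

print(tracked.requires_grad, tracked.grad_fn is not None)
print(untracked.requires_grad, untracked.grad_fn is None)
```

Both outputs hold the same values; only the bookkeeping differs. Since `.item()` yields a Python float with or without `.data`, the `no_grad` block is probably what saves the memory here, though the thread does not confirm this.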
By the way, how can I visualize the searched model and see its detailed configurations? Thanks.
Yes, I have already updated my code too. The visualization code is inherited from DARTS. First, copy the searched architecture into genotypes.py, then run python visualize.py PC-DARTS.
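To read a searched cell's configuration as text rather than as a rendered graph, the genotype can be walked edge by edge. Below is a minimal sketch, assuming the DARTS-style layout in which each intermediate node has two incoming `(operation, input index)` pairs and indices 0 and 1 refer to the two previous cells. `EXAMPLE`, `source_name` and `describe` are hypothetical names, and the values are illustrative and shortened to two nodes; they are not the architecture found in this thread.

```python
from collections import namedtuple

# Field layout of a DARTS-style genotype (assumed to match genotypes.py).
Genotype = namedtuple("Genotype", "normal normal_concat reduce reduce_concat")

# Illustrative values only: this is NOT the architecture found in this thread.
EXAMPLE = Genotype(
    normal=[("sep_conv_3x3", 0), ("skip_connect", 1),
            ("sep_conv_3x3", 1), ("dil_conv_3x3", 2)],
    normal_concat=range(2, 4),
    reduce=[("max_pool_3x3", 0), ("sep_conv_3x3", 1),
            ("max_pool_3x3", 2), ("skip_connect", 0)],
    reduce_concat=range(2, 4),
)


def source_name(index):
    """Map an input index to a readable name; 0 and 1 are the previous cells."""
    if index == 0:
        return "c_{k-2}"
    if index == 1:
        return "c_{k-1}"
    return "node %d" % (index - 2)


def describe(edges):
    """List the two incoming edges of each intermediate node in a cell."""
    lines = []
    for position, (op, index) in enumerate(edges):
        lines.append("node %d <- %s(%s)" % (position // 2, op, source_name(index)))
    return lines


for line in describe(EXAMPLE.normal):
    print(line)
```

Calling `describe` on the `reduce` field lists the reduction cell in the same way.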
@yuhuixu1993 Thank you for your kind help. Since the problem has been solved, I will close this issue.