chenyuntc / pytorch-best-practice

A Guidance on PyTorch Coding Style Based on Kaggle Dogs vs. Cats

Python 100.00%

pytorch image-classification visdom

pytorch-best-practice's Introduction

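PyTorch Practice Guide

This repository contains the companion code for the article "PyTorch Practice Guide". See the original article in the Zhihu column, or the corresponding markdown file, for a better understanding of the file organization and code details.

This material is part of the book 《深度学习框架PyTorch：入门与实践》 (Deep Learning Framework PyTorch: Introduction and Practice). For the book's source code and more examples, see GitHub.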

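Downloading the data

Download the required data from the official Kaggle competition site.
Unzip it and place the training set and the test set in separate folders.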

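Installation

PyTorch: follow the guide on the official PyTorch website to install the version appropriate for your platform.
Install the required dependencies: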

pip install -r requirements.txt

Training

visdom must be started first:

python -m visdom.server

Then start training with the following command:

# Train on GPU 0 and save the visualization results to the classifier env in visdom
python main.py train --data-root=./data/train --use-gpu=True --env=classifier

For detailed usage, run:

python main.py help

Testing

python main.py test --data-root=./data/test --use-gpu=False --batch-size=256

pytorch-best-practice's Issues

NameError

python main.py train --data-root=./data/train --use-gpu=True --env=classifier

Traceback (most recent call last):
  File "main.py", line 170, in <module>
    import fire
  File "C:\Users---\Anaconda2\envs\py36\lib\site-packages\fire\core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "C:\Users---\Anaconda2\envs\py36\lib\site-packages\fire\core.py", line 366, in _Fire
    component, remaining_args)
  File "C:\Users---\Anaconda2\envs\py36\lib\site-packages\fire\core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "main.py", line 48, in train
    def train(**kwargs):
NameError: name 'opt' is not defined

Why is model.train() still called inside val?
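
The usual reason for this pattern is that validation switches the model to evaluation mode with model.eval(), which changes how layers such as dropout and batch normalization behave, so the model has to be put back into training mode before the next training step. Below is a minimal sketch of that save-and-restore idea written as a context manager. `evaluation_mode` is a hypothetical helper, not part of this repository; it assumes only that the model exposes `training`, `eval()` and `train(mode)` the way `torch.nn.Module` does, and `_FakeModel` is a stand-in so the snippet runs without PyTorch.

```python
from contextlib import contextmanager


@contextmanager
def evaluation_mode(model):
    """Temporarily put `model` in eval mode, then restore its previous mode."""
    was_training = model.training  # remember the mode we came from
    model.eval()
    try:
        yield model
    finally:
        model.train(was_training)  # back to training mode if that is where we started


class _FakeModel:
    """Stand-in with the same mode-switching interface as torch.nn.Module."""

    def __init__(self):
        self.training = True

    def train(self, mode=True):
        self.training = mode
        return self

    def eval(self):
        return self.train(False)


model = _FakeModel()
with evaluation_mode(model):
    mode_during_validation = model.training  # validation code would run here
mode_after_validation = model.training
```

Calling model.eval() at the start of val and model.train() at the end achieves the same effect by hand; the context manager only makes the restore step harder to forget.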

RuntimeError:cuda runtime error(2):out of memory

I have two GPUs with 8 GB of memory each and still get this error. What could be the cause? Thanks.

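Why does val_accuracy stay around 50%, with the validation confusion matrix showing values for essentially one class only?

@chenyuntc Hello. I followed the tutorial code and tried it myself. During training, val_accuracy in visdom stays around 50%, and the validation confusion matrix has values in essentially only one class. I thought I had made a mistake somewhere, so I ran the original code again and saw the same behavior in the training visualization. In principle val_accuracy should keep increasing as training proceeds, so I do not know what is wrong. If anyone has run into a similar problem, please advise. Thanks in advance!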
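
The two symptoms are consistent with each other: if the validation set is roughly balanced between the two classes and the model predicts the same class for every image, the accuracy computed from the confusion matrix comes out at about 50%. A small sketch of that arithmetic follows; the counts are made up for illustration and are not taken from a real run.

```python
def accuracy_from_confusion(matrix):
    """Accuracy = correct predictions (the diagonal) / all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Rows are true classes, columns are predicted classes.
# Hypothetical balanced validation set where everything is predicted as class 0.
collapsed = [
    [2500, 0],  # true class 0: all predicted as class 0
    [2500, 0],  # true class 1: all predicted as class 0 as well
]
collapsed_accuracy = accuracy_from_confusion(collapsed)
print(collapsed_accuracy)  # 0.5
```

So a flat 50% is a sign that the model has collapsed to a single prediction rather than a sign that the accuracy computation is broken.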

RuntimeError: cuDNN error: CUDNN_STATUS_ARCH_MISMATCH

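The following problem occurs when running python main.py train. The system environment is Ubuntu 16.04 + CUDA 9.0 + cuDNN 7.0.5. After searching on Baidu, I found that the problem may be caused by insufficient CUDA compute capability: cuDNN requires a compute capability of at least 3.0, but the compute capability of CUDA 9.0 is 2.1, which is not enough. However, many online tutorials for setting up the environment use exactly Ubuntu 16.04 + CUDA 9.0 + cuDNN 7.0.5. Is this really a compute capability problem, or is it something else?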
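
Compute capability is a property of the GPU hardware rather than of the CUDA toolkit version, so the thing to check is which card is installed. With PyTorch, `torch.cuda.get_device_capability()` returns it as a `(major, minor)` tuple. The sketch below only shows the comparison against the minimum of 3.0 quoted in the question; `meets_minimum` is a hypothetical helper and the sample tuples are illustrative.

```python
def meets_minimum(capability, minimum=(3, 0)):
    """Compare a (major, minor) compute capability against a required minimum."""
    return tuple(capability) >= tuple(minimum)


# On a machine with PyTorch and a GPU, the tuple would come from:
#   import torch
#   capability = torch.cuda.get_device_capability(0)
print(meets_minimum((2, 1)))  # False
print(meets_minimum((6, 1)))  # True
```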

Error in `python': munmap_chunk(): invalid pointer: 0x0000000002a22030

The following appears while the program is running:
"please use transforms.Resize instead.")
/usr/local/lib/python2.7/dist-packages/torchvision/transforms/transforms.py:563: UserWarning: The use of the transforms.RandomSizedCrop transform is deprecated, please use transforms.RandomResizedCrop instead.
"please use transforms.RandomResizedCrop instead.")
1%| | 137/17500 [01:50<3:33:34, 1.35it/s]
1%| | 137/17500 [01:49<3:34:13, 1.35it/s]
1%| | 137/17500 [01:49<3:33:45, 1.35it/s]
1%| | 137/17500 [01:49<3:34:31, 1.35it/s]
1%| | 137/17500 [01:49<3:33:46, 1.35it/s]
1%| | 137/17500 [01:49<3:33:40, 1.35it/s]
1%| | 137/17500 [01:49<3:33:45, 1.35it/s]
1%| | 137/17500 [01:49<3:32:45, 1.36it/s]
1%| | 137/17500 [01:49<3:32:46, 1.36it/s]
1%| | 137/17500 [01:49<3:32:01, 1.36it/s]
*** Error in `python': munmap_chunk(): invalid pointer: 0x0000000002a22030 ***
======= Backtrace: =========
(a lot more output follows)
7f17a776c000-7f17a796b000 ---p 0021b000 08:06 92012725 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
7f17a796b000-7f17a7987000 r--p 0021a000 08:06 92012725 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
Aborted (core dumped)
How can this problem be solved?

[WinError 3] The system cannot find the path specified: './data/train'
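
This error means that the relative path './data/train' does not exist when resolved from the directory the script was started in. A minimal sketch of checking the data directory up front, so that the failure comes with a clearer message; `require_dir` is a hypothetical helper, not part of the repository.

```python
import os


def require_dir(path):
    """Return the absolute path of `path`, or raise if it is not a directory."""
    absolute = os.path.abspath(path)  # resolved relative to the current working directory
    if not os.path.isdir(absolute):
        raise FileNotFoundError(
            "data directory not found: {} (cwd: {})".format(absolute, os.getcwd())
        )
    return absolute


train_root = require_dir(".")  # "." always exists; pass e.g. "./data/train" in practice
```

Printing the working directory alongside the resolved path makes it easy to see whether the data is missing or the script was simply launched from a different folder.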

Floating-point overflow

A numeric overflow occurred during execution. The output of the run is shown below:

python main.py train --train-data-root=/home/linux_fhb/data/cat_vs_dog/train --use-gpu --env=classifier
user config:
env classifier
model ResNet34
train_data_root /home/linux_fhb/data/cat_vs_dog/train
test_data_root ./data/test1
load_model_path None
batch_size 32
use_gpu True
num_workers 4
print_freq 20
debug_file /tmp/debug
result_file result.csv
max_epoch 10
lr 0.1
lr_decay 0.95
weight_decay 0.0001
parse <bound method parse of <config.DefaultConfig object at 0x7f3e4a85b400>>
/home/linux_fhb/anaconda3/lib/python3.6/site-packages/torchvision/transforms/transforms.py:188: UserWarning: The use of the transforms.Scale transform is deprecated, please use transforms.Resize instead.
  "please use transforms.Resize instead.")
/home/linux_fhb/anaconda3/lib/python3.6/site-packages/torchvision/transforms/transforms.py:563: UserWarning: The use of the transforms.RandomSizedCrop transform is deprecated, please use transforms.RandomResizedCrop instead.
  "please use transforms.RandomResizedCrop instead.")
  0%|                                                 | 0/17500 [00:00<?, ?it/s]main.py:99: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  loss_meter.add(loss.data[0])
  3%|█▏                                   | 547/17500 [02:09<1:05:07,  4.34it/s]
main.py:138: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  val_input = Variable(input, volatile=True)
main.py:139: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  val_label = Variable(label.type(t.LongTensor), volatile=True)
Traceback (most recent call last):
  File "main.py", line 171, in <module>
    fire.Fire()
  File "/home/linux_fhb/anaconda3/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/linux_fhb/anaconda3/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/linux_fhb/anaconda3/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "main.py", line 121, in train
    if loss_meter.value()[0] > previous_loss:          
RuntimeError: value cannot be converted to type float without overflow: 10000000000000000159028911097599180468360808563945281389781327557747838772170381060813469985856815104.000000

The environment versions are:

Python 3.6.5 :: Anaconda, Inc.
fire                               0.1.3    
numpy                              1.14.3   
numpydoc                           0.8.0    
torch                              0.4.1    
torchfile                          0.1.0    
torchnet                           0.0.4    
torchvision                        0.2.1    
visdom                             0.1.8.5

GPU: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1), 11 GB of memory.

Has anyone run into the same problem? How did you solve it?
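
A loss of this magnitude suggests that training has diverged (the config dump shows lr 0.1), and the traceback shows the error surfacing only at the comparison on main.py line 121. One way to catch the problem earlier is to check each loss value before adding it to the meter. This is a sketch under that assumption; `checked_loss` and its threshold are hypothetical, and in PyTorch 0.4+ the plain float would come from `loss.item()`, as the warning in the log recommends.

```python
import math


def checked_loss(value, limit=1e6):
    """Return `value` as a float, raising early if it is NaN, infinite or implausibly large."""
    value = float(value)
    if not math.isfinite(value) or abs(value) > limit:
        raise ValueError(
            "loss diverged ({!r}); consider a smaller learning rate".format(value)
        )
    return value


print(checked_loss(0.6931))  # 0.6931
```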

No such file or directory: 'checkpoints/model.pth'
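
This typically happens when `load_model_path` points to a checkpoint that has not been created yet. A sketch of a guard that uses a checkpoint only when the path is set and the file exists; `usable_checkpoint` is a hypothetical helper, not part of the repository.

```python
import os


def usable_checkpoint(path):
    """Return `path` if it names an existing file, otherwise None."""
    if path and os.path.isfile(path):
        return path
    return None


checkpoint = usable_checkpoint("checkpoints/model.pth")
if checkpoint is None:
    print("no checkpoint found, starting from scratch")
```

Setting `load_model_path` to None for a first training run has the same effect without any extra code.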

--max-epoch = 20 TypeError: ''str' object cannot be interpreted as an integer'

$ CUDA_VISIBLE_DEVICES='2,3' python main.py train --train-data-root=data/train/ --lr=0.005 --batch-size=32 --model='ResNet34' --max-epoch = 20 --use-gpu --env=classifier

TypeError: 'str' object cannot be interpreted as an integer

user config:
env classifier
vis_port 8097
model ResNet34
train_data_root data/train/
test_data_root ./data/test1
load_model_path None
batch_size 32
use_gpu True
num_workers 4
print_freq 20
debug_file /tmp/debug
result_file result.csv
max_epoch =
lr 0.005
lr_decay 0.5
weight_decay 0.0
WARNING:root:Setting up a new session...
WARNING:visdom:Without the incoming socket you cannot receive events from the server or register event handlers to your Visdom client.
Traceback (most recent call last):
  File "main.py", line 168, in <module>
    fire.Fire()
  File "/home/deepliver4/.conda/envs/py36/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/deepliver4/.conda/envs/py36/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/deepliver4/.conda/envs/py36/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "main.py", line 79, in train
    for epoch in range(opt.max_epoch):
TypeError: 'str' object cannot be interpreted as an integer
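
The config dump above shows `max_epoch =`: because of the spaces in `--max-epoch = 20`, the option received the string `=` instead of the integer 20, and `range()` rejects it. Writing `--max-epoch=20` without spaces avoids this. The sketch below validates an override against the type of its default so that such mistakes fail early; `coerce_like` is a hypothetical helper, not part of the repository.

```python
def coerce_like(default, value):
    """Convert `value` to the type of `default`, raising ValueError if it does not fit."""
    if default is None or isinstance(value, type(default)):
        return value
    if isinstance(default, bool):
        # bool("False") would be True, so parse booleans explicitly.
        text = str(value).strip().lower()
        if text in ("true", "1"):
            return True
        if text in ("false", "0"):
            return False
        raise ValueError("expected a boolean, got {!r}".format(value))
    return type(default)(value)  # e.g. int("20") -> 20, while int("=") raises ValueError


print(coerce_like(10, "20"))  # 20
```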

iteritems error

  File "main.py", line 171, in <module>
    fire.Fire()
  File "/home/thinkjoy/anaconda3/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/thinkjoy/anaconda3/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/thinkjoy/anaconda3/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "main.py", line 49, in train
    opt.parse(kwargs)
  File "/home/thinkjoy/PycharmProjects/pytorch-best-practice/config.py", line 30, in parse
    for k,v in kwargs.iteritems():
AttributeError: 'dict' object has no attribute 'iteritems'
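
`dict.iteritems()` exists only in Python 2; in Python 3 the equivalent is `dict.items()`, which also works in Python 2. Here is a sketch of a `parse` method written that way. The class is a simplified stand-in for the repository's config, with only two illustrative fields.

```python
import warnings


class DefaultConfig(object):
    # Illustrative subset of fields; the real config has more.
    batch_size = 32
    lr = 0.1

    def parse(self, kwargs):
        """Update config attributes from a dict of overrides."""
        for k, v in kwargs.items():  # items() works on both Python 2 and 3
            if not hasattr(self, k):
                warnings.warn("opt has no attribute %s" % k)
            setattr(self, k, v)


opt = DefaultConfig()
opt.parse({"lr": 0.001})
print(opt.lr)  # 0.001
```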

Training loss does not decrease on Windows

Because I am running on Python 3, a few small changes were needed.
Environment: Windows 10 64-bit, CPU only.
1. utils/visualize.py line 44: win=unicode(name) --> win=str(name)
2. main.py line 22: add import config
3. main.py line 108: loss_meter.add(loss.data[0]) --> loss_meter.add(loss.item())
4. config.py line 10: load_model_path = 'checkpoints/model.pth' --> load_model_path = None
5. config.py line 12: batch_size = 128 --> batch_size = 8
6. config.py line 21: lr = 0.1 --> lr = 0.001
7. config.py line 31: for k,v in kwargs.iteritems() --> for k,v in kwargs.items()
8. I did not run python -m visdom.server; after configuring the paths I ran python main.py train directly.
The loss is printed in the format below, and it keeps fluctuating between 0.6 and 1.5:
loss: tensor(0.7035, grad_fn=)
I also see what other users have reported: accuracy stays around 50%, which means the model has learned nothing.
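
A loss around 0.7 together with 50% accuracy is what chance-level predictions look like, assuming the loss is cross-entropy over the two classes: a model that assigns probability 0.5 to each class has a loss of ln 2, roughly 0.693. A quick check of that arithmetic:

```python
import math


def cross_entropy(prob_of_true_class):
    """Cross-entropy loss for one sample, given the probability assigned to its true class."""
    return -math.log(prob_of_true_class)


chance_loss = cross_entropy(0.5)
print(round(chance_loss, 4))  # 0.6931
```

A printed loss of 0.7035 is therefore almost exactly the value expected from a model that has not yet separated the two classes.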