pytorch-best-practice's Introduction

PyTorch Practice Guide

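This repository contains the companion code for the article "PyTorch Practice Guide" (PyTorch 实践指南). See the original article in the Zhihu column, or the corresponding markdown file, for a fuller explanation of the file organization and code details.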

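This material is part of the book 《深度学习框架PyTorch:入门与实践》 (Deep Learning Framework PyTorch: Introduction and Practice). For the book's source code and more examples, see GitHub.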

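Data download

  • Download the required data from the official Kaggle competition site
  • Unzip it and put the training set and the test set in separate folders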

Installation

  • PyTorch: follow the guide on the PyTorch website and install the version that matches your platform
  • Install the required dependencies:
pip install -r requirements.txt

Training

visdom must be started first:

python -m visdom.server

Then start training with the following command:

# Train on GPU 0 and save the visualization results to visdom's classifier env
python main.py train --data-root=./data/train --use-gpu=True --env=classifier

For detailed usage, run

python main.py help

Testing

python main.py test --data-root=./data/test --use-gpu=False --batch-size=256

pytorch-best-practice's People

Contributors

chenyuntc

pytorch-best-practice's Issues

NameError

python main.py train --data-root=./data/train --use-gpu=True --env=classifier

Traceback (most recent call last):
File "main.py", line 170, in <module>
import fire
File "C:\Users---\Anaconda2\envs\py36\lib\site-packages\fire\core.py", line 127, in Fire
component_trace = _Fire(component, args, context, name)
File "C:\Users---\Anaconda2\envs\py36\lib\site-packages\fire\core.py", line 366, in _Fire
component, remaining_args)
File "C:\Users---\Anaconda2\envs\py36\lib\site-packages\fire\core.py", line 542, in _CallCallable
result = fn(*varargs, **kwargs)
File "main.py", line 48, in train
def train(**kwargs):
NameError: name 'opt' is not defined

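Why does val_accuracy stay at around 50%, with the validation confusion matrix showing values for essentially only one class?

@chenyuntc Hello, I worked through the tutorial code on my own. During training, the val_accuracy shown in visdom stays at around 50%, and the validation confusion matrix has values for essentially only one class. I assumed I had made a mistake somewhere, so I ran the original code as well and saw the same behavior. The visualization during training is shown in the image below. val_accuracy should keep increasing as training proceeds, so I do not know what is wrong. If anyone has run into a similar problem, I would appreciate your advice. Thanks in advance!
[image]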

RuntimeError: cuDNN error: CUDNN_STATUS_ARCH_MISMATCH

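The error above occurs when running python main.py train. The environment is Ubuntu 16.04 + CUDA 9.0 + cuDNN 7.0.5. After searching on Baidu, I found that the problem may be insufficient CUDA compute capability: cuDNN requires a compute capability of 3.0, but the compute capability of CUDA 9.0 is 2.1, which is not enough. However, many online tutorials configure exactly Ubuntu 16.04 + CUDA 9.0 + cuDNN 7.0.5. Is this really a compute capability problem, or is it something else?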

Error in `python': munmap_chunk(): invalid pointer: 0x0000000002a22030

The following appears while the program is running:
"please use transforms.Resize instead.")
/usr/local/lib/python2.7/dist-packages/torchvision/transforms/transforms.py:563: UserWarning: The use of the transforms.RandomSizedCrop transform is deprecated, please use transforms.RandomResizedCrop instead.
"please use transforms.RandomResizedCrop instead.")
1%| | 137/17500 [01:50<3:33:34, 1.35it/s]
1%| | 137/17500 [01:49<3:34:13, 1.35it/s]
1%| | 137/17500 [01:49<3:33:45, 1.35it/s]
1%| | 137/17500 [01:49<3:34:31, 1.35it/s]
1%| | 137/17500 [01:49<3:33:46, 1.35it/s]
1%| | 137/17500 [01:49<3:33:40, 1.35it/s]
1%| | 137/17500 [01:49<3:33:45, 1.35it/s]
1%| | 137/17500 [01:49<3:32:45, 1.36it/s]
1%| | 137/17500 [01:49<3:32:46, 1.36it/s]
1%| | 137/17500 [01:49<3:32:01, 1.36it/s]
*** Error in `python': munmap_chunk(): invalid pointer: 0x0000000002a22030 ***
======= Backtrace: =========
(a long backtrace and memory map follow)
7f17a776c000-7f17a796b000 ---p 0021b000 08:06 92012725 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
7f17a796b000-7f17a7987000 r--p 0021a000 08:06 92012725 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
Aborted (core dumped)
How can this problem be solved?

Floating-point overflow

A numeric overflow occurred during execution. The output of the run is below:

python main.py train --train-data-root=/home/linux_fhb/data/cat_vs_dog/train --use-gpu --env=classifier
user config:
env classifier
model ResNet34
train_data_root /home/linux_fhb/data/cat_vs_dog/train
test_data_root ./data/test1
load_model_path None
batch_size 32
use_gpu True
num_workers 4
print_freq 20
debug_file /tmp/debug
result_file result.csv
max_epoch 10
lr 0.1
lr_decay 0.95
weight_decay 0.0001
parse <bound method parse of <config.DefaultConfig object at 0x7f3e4a85b400>>
/home/linux_fhb/anaconda3/lib/python3.6/site-packages/torchvision/transforms/transforms.py:188: UserWarning: The use of the transforms.Scale transform is deprecated, please use transforms.Resize instead.
  "please use transforms.Resize instead.")
/home/linux_fhb/anaconda3/lib/python3.6/site-packages/torchvision/transforms/transforms.py:563: UserWarning: The use of the transforms.RandomSizedCrop transform is deprecated, please use transforms.RandomResizedCrop instead.
  "please use transforms.RandomResizedCrop instead.")
  0%|                                                 | 0/17500 [00:00<?, ?it/s]main.py:99: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  loss_meter.add(loss.data[0])
  3%|█▏                                   | 547/17500 [02:09<1:05:07,  4.34it/s]
main.py:138: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  val_input = Variable(input, volatile=True)
main.py:139: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  val_label = Variable(label.type(t.LongTensor), volatile=True)
Traceback (most recent call last):
  File "main.py", line 171, in <module>
    fire.Fire()
  File "/home/linux_fhb/anaconda3/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/linux_fhb/anaconda3/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/linux_fhb/anaconda3/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "main.py", line 121, in train
    if loss_meter.value()[0] > previous_loss:          
RuntimeError: value cannot be converted to type float without overflow: 10000000000000000159028911097599180468360808563945281389781327557747838772170381060813469985856815104.000000

The versions in the environment are:

Python 3.6.5 :: Anaconda, Inc.
fire                               0.1.3    
numpy                              1.14.3   
numpydoc                           0.8.0    
torch                              0.4.1    
torchfile                          0.1.0    
torchnet                           0.0.4    
torchvision                        0.2.1    
visdom                             0.1.8.5  

GPU: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1), 11 GB of memory.

Has anyone run into the same problem? How did you solve it?
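
The value in the error message is far outside the range of a 32-bit float, which is consistent with the loss having diverged before line 121 compares it with previous_loss (the log shows lr 0.1; another report in this list lowered lr to 0.001). One way to surface this earlier is to reject implausible loss values before they enter the running average. The following is a minimal sketch of that idea, not the repository's code: RunningAverage and its limit parameter are hypothetical stand-ins for the meter used in main.py.

```python
import math


class RunningAverage:
    """Hypothetical stand-in for the loss meter; tracks the mean of added values."""

    def __init__(self, limit=1e4):
        # limit is an assumed threshold above which a loss counts as diverged.
        self.limit = limit
        self.total = 0.0
        self.count = 0

    def add(self, value):
        value = float(value)
        # Fail fast instead of letting inf, nan, or huge values reach later comparisons.
        if not math.isfinite(value) or abs(value) > self.limit:
            raise FloatingPointError("loss looks diverged: %r" % value)
        self.total += value
        self.count += 1

    def value(self):
        # Mean of the accepted values; nan if nothing has been added yet.
        return self.total / self.count if self.count else float("nan")


meter = RunningAverage()
for loss in (0.70, 0.68, 0.65):
    meter.add(loss)
print(round(meter.value(), 4))
```

With a guard like this, a diverging run stops at the first bad batch with a readable message, which makes it easier to tell a learning-rate problem from a bug elsewhere.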

--max-epoch = 20 TypeError: ''str' object cannot be interpreted as an integer'

$ CUDA_VISIBLE_DEVICES='2,3' python main.py train --train-data-root=data/train/ --lr=0.005 --batch-size=32 --model='ResNet34' --max-epoch = 20 --use-gpu --env=classifier

TypeError: 'str' object cannot be interpreted as an integer

user config:
env classifier
vis_port 8097
model ResNet34
train_data_root data/train/
test_data_root ./data/test1
load_model_path None
batch_size 32
use_gpu True
num_workers 4
print_freq 20
debug_file /tmp/debug
result_file result.csv
max_epoch =
lr 0.005
lr_decay 0.5
weight_decay 0.0
WARNING:root:Setting up a new session...
WARNING:visdom:Without the incoming socket you cannot receive events from the server or register event handlers to your Visdom client.
Traceback (most recent call last):
  File "main.py", line 168, in <module>
    fire.Fire()
  File "/home/deepliver4/.conda/envs/py36/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/deepliver4/.conda/envs/py36/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/deepliver4/.conda/envs/py36/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "main.py", line 79, in train
    for epoch in range(opt.max_epoch):
TypeError: 'str' object cannot be interpreted as an integer
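
The printed config shows max_epoch =, which suggests that the spaces in --max-epoch = 20 caused = to be taken as the value, so range() later received a string. Writing the flag as --max-epoch=20 should avoid this. As a defensive measure, a config parser could also coerce each numeric override to the type of its default and fail early with a clearer message. A minimal sketch under that assumption follows; coerce_like is a hypothetical helper and is not part of the repository.

```python
def coerce_like(default, value):
    """Convert value to the numeric type of default, or raise a clear error."""
    # Only int/float defaults are coerced; bools, strings and None pass through.
    if isinstance(default, bool) or not isinstance(default, (int, float)):
        return value
    try:
        return type(default)(value)
    except (TypeError, ValueError):
        raise ValueError("expected %s, got %r" % (type(default).__name__, value))


print(coerce_like(10, "20"))        # the string "20" becomes an integer
print(coerce_like(0.1, "0.005"))    # the string "0.005" becomes a float
try:
    coerce_like(10, "=")            # the value seen in the config dump above
except ValueError as err:
    print("rejected:", err)
```

Calling such a helper inside the config's parse step would report the bad max_epoch value before training starts, instead of failing inside range().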

iteritems error

File "main.py", line 171, in <module>
fire.Fire()
File "/home/thinkjoy/anaconda3/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
component_trace = _Fire(component, args, context, name)
File "/home/thinkjoy/anaconda3/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
component, remaining_args)
File "/home/thinkjoy/anaconda3/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
result = fn(*varargs, **kwargs)
File "main.py", line 49, in train
opt.parse(kwargs)
File "/home/thinkjoy/PycharmProjects/pytorch-best-practice/config.py", line 30, in parse
for k,v in kwargs.iteritems():
AttributeError: 'dict' object has no attribute 'iteritems'
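
dict.iteritems() exists only in Python 2; on Python 3 the equivalent call is dict.items(), which also works on Python 2. A minimal sketch of a parse method that runs on Python 3 is shown below. It assumes a config class shaped like the one in the traceback; the fields of DefaultConfig here are illustrative and are not copied from the repository's config.py.

```python
import warnings


class DefaultConfig:
    # Illustrative defaults; the real config.py may define different fields.
    env = "default"
    batch_size = 32
    lr = 0.1

    def parse(self, kwargs):
        """Override the defaults with the values given in kwargs."""
        # items() replaces the Python 2-only iteritems().
        for k, v in kwargs.items():
            if not hasattr(self, k):
                warnings.warn("opt has no attribute %s" % k)
            setattr(self, k, v)


opt = DefaultConfig()
opt.parse({"batch_size": 8, "lr": 0.001})
print(opt.batch_size, opt.lr, opt.env)
```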

Training loss does not decrease on Windows

I am running on Python 3, so a few small changes were needed.
Environment: Win10 64-bit, CPU only.
1. utils/visualize.py line 44: win=unicode(name) --> win=str(name)
2. main.py line 22: add import config
3. main.py line 108: loss_meter.add(loss.data[0]) --> loss_meter.add(loss.item())
4. config.py line 10: load_model_path = 'checkpoints/model.pth' --> load_model_path = None
5. config.py line 12: batch_size = 128 --> batch_size = 8
6. config.py line 21: lr = 0.1 --> lr = 0.001
7. config.py line 31: for k,v in kwargs.iteritems() --> for k,v in kwargs.items()
8. I did not run python -m visdom.server; after setting the paths I ran python main.py train directly
The loss is printed in the format below, and it keeps fluctuating between 0.6 and 1.5:
loss: tensor(0.7035, grad_fn=)
I also see what other users have reported: accuracy stays at around 50%, which means the model is effectively learning nothing.
