
pytorch-0.4-yolov3's Introduction

pytorch-0.4-yolov3 : Yet Another Implementation of YOLOv3 for PyTorch 0.4.1 and above

This repository was created to implement yolov3 with pytorch 0.4, starting from marvis/pytorch-yolo2.

This repository is forked from the great pytorch-yolo2 work of @github/marvis, but I could not upload or modify the marvis source files directly because many files were changed, even their filenames.

Differences between this repository and the original marvis version:

  • some programs are restructured for Windows environments (for example, __name__ == '__main__' is checked before spawning multiple threads).
  • loading and saving weights are modified to be compatible with both yolov2 and yolov3 (this repository works with yolov2 and yolov3 configurations without source modification).
  • yolov3 detection and training are fully supported
    • region_loss.py is renamed to region_layer.py.
    • outputs of region_layer.py and yolo_layer.py are enclosed in dictionary variables.
  • code is modified to work on pytorch 0.4 and python3
  • some code is modified for speed and easier reading. (I'm not sure.. T_T)
  • in training mode, NaN values are checked and gradient clipping is used.

If you want to know the training and detection procedures, please refer to https://github.com/marvis/pytorch-yolo2 for detailed information.

Train on your own data, or on the COCO or VOC data, as follows:

python train.py -d cfg/coco.data -c cfg/yolo_v3.cfg -w yolov3.weights
  • new weights are saved in the backup directory, named by epoch number (the last 5 weight files are kept; you can control the number of backups in train.py)

  • The above command shows an example of the training process. I did not run this exact command, but I did successfully train my own data starting from the pretrained yolov3.weights.

  • Note that the anchor information differs depending on whether it is used in a yolov2 or a yolov3 model.

  • If you want to use pretrained weights as the initial weights, add the -r option to the training command:

python train.py -d cfg/my.data -c cfg/my.cfg -w yolov3.weights -r
  • the maximum-epochs option, which is calculated automatically, is sometimes too small; in that case you can set max_epochs in your configuration (see the sketch below).
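A minimal sketch of what that could look like; this assumes max_epochs is read from the [net] block of your cfg, so adjust the location if your setup reads it elsewhere (for example from the .data file):

[net]
batch=64
subdivisions=16
# force a longer run than the automatically calculated maximum
max_epochs=300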

Recorded yolov2 and yolov3 training for my own data

  • Click the images to play the videos on youtube.com.

  • yolov2 training recorded :
    yolov2 training

  • yolov3 training recorded :
    yolov3 training

  • In the recorded videos above, if you start from pretrained weights you will see a large number of proposals for roughly the first 10 to 20 epochs. As training progresses, nPP drops toward zero and then increases again as the model updates.

  • yolov2 and yolov3 converge differently because yolov2 updates all boxes during the first 12800 exposures.

Detect the objects in dog image using pretrained weights

yolov2 models

wget http://pjreddie.com/media/files/yolo.weights
python detect.py cfg/yolo.cfg yolo.weights data/dog.jpg data/coco.names 

predictions

Loading weights from yolo.weights... Done!
data\dog.jpg: Predicted in 0.832918 seconds.
3 box(es) is(are) found
truck: 0.934710
bicycle: 0.998012
dog: 0.990524
save plot results to predictions.jpg

yolov3 models

wget https://pjreddie.com/media/files/yolov3.weights
python detect.py cfg/yolo_v3.cfg yolov3.weights data/dog.jpg data/coco.names  

predictions

Loading weights from yolov3.weights... Done!

data\dog.jpg: Predicted in 0.837523 seconds.
3 box(es) is(are) found
dog: 0.999996
truck: 0.995232
bicycle: 0.999973
save plot results to predictions.jpg

Validate and get evaluation results

python valid.py data/yourown.data cfg/yourown.cfg yourown_weights

Performance on the VOC dataset using yolov2 (with 100 epochs of training)

  • CrossEntropyLoss is used to compare classes
  • Performance varies with the weighting factors, for example:
coord_scale=1, object_scale=5, class_scale=1 mAP = 73.1  
coord_scale=1, object_scale=5, class_scale=2 mAP = 72.7  
coord_scale=1, object_scale=3, class_scale=1 mAP = 73.4  
coord_scale=1, object_scale=3, class_scale=2 mAP = 72.8  
coord_scale=1, object_scale=1, class_scale=1 mAP = 50.4  
  • After modifying the anchor information in yolo-voc.cfg and applying the new coord_mask, I finally got
anchors = 1.1468, 1.5021, 2.7780, 3.4751, 4.3845, 7.0162, 8.2523, 4.2100, 9.7340, 8.682
coord_scale=1, object_scale=3, class_scale=1 mAP = 74.4  
  • using yolov3 with self.rescore = 1 and the latest code, mAP = 74.9 (with 170 epochs of training)

Therefore, you may need to run many experiments to get the best performance.
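For reference, the weighting factors and anchors above are set in the [region] block of yolo-voc.cfg; a hedged excerpt with the best-performing combination reported above (the block's other keys are omitted here):

[region]
anchors = 1.1468, 1.5021, 2.7780, 3.4751, 4.3845, 7.0162, 8.2523, 4.2100, 9.7340, 8.682
classes=20
num=5
coord_scale=1
object_scale=3
noobject_scale=1
class_scale=1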

License

MIT License (see LICENSE file).

pytorch-0.4-yolov3's People

Contributors

andy-yun, carry-xz, mickolka


pytorch-0.4-yolov3's Issues

some simple problems

Sir, thank you for your excellent project on YOLO; I got great benefit from studying it. I have some simple questions.
1. Does this project need negative samples?
2. detect.py resizes the input picture with cv2.resize to get the 416 scale, which slightly distorts the input picture; maybe a data-augmentation-style resize (filling with (127,127,127) to maintain the original aspect ratio) would perform better.
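For reference, a minimal letterbox-style resize sketch with PIL that keeps the aspect ratio and pads with (127,127,127) instead of stretching (a hypothetical helper, not code from this repository):

from PIL import Image

def letterbox_image(img, target=416, fill=(127, 127, 127)):
    # scale the longer side down to `target`, keep the aspect ratio, pad the rest with gray
    w, h = img.size
    scale = min(target / w, target / h)
    nw, nh = int(w * scale), int(h * scale)
    resized = img.resize((nw, nh), Image.BILINEAR)
    canvas = Image.new('RGB', (target, target), fill)
    canvas.paste(resized, ((target - nw) // 2, (target - nh) // 2))
    return canvas

# usage: boxed = letterbox_image(Image.open('data/dog.jpg').convert('RGB'))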

Saving model weights

Hi,

I have a couple of questions about the variables keep_backup and old_wgts in

image

I needed to change the default save_interval = 5 to 1 epoch, so I wasn't sure how to update the default keep_backup = 5. Do I change it to 1 as well, and what is the purpose of old_wgts?
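Not an answer from the author, but a minimal sketch of how a save_interval / keep_backup rotation is commonly implemented; the variable names here are illustrative and not necessarily the exact ones used in train.py:

import os

save_interval = 1   # save every epoch instead of every 5
keep_backup = 5     # keep only the 5 most recent weight files
old_wgts = []       # paths of previously saved weights, oldest first

def save_checkpoint(weight_path):
    # model.save_weights(weight_path) would go here
    old_wgts.append(weight_path)
    while len(old_wgts) > keep_backup:
        stale = old_wgts.pop(0)
        if os.path.exists(stale):
            os.remove(stale)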

problem with cam detection

when I try cam detection I get "Unable to open camera".
I tried to dig into the code and found:
cap = cv2.VideoCapture(1)
if not cap.isOpened():
    print("Unable to open camera")
    exit(-1)
I don't understand where the problem is.
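Not a fix from the author, but a common cause is that index 1 refers to a second camera that does not exist; a hedged sketch that falls back across a few device indices:

import cv2

cap = None
for idx in (0, 1, 2):              # try the default camera (0) first
    cap = cv2.VideoCapture(idx)
    if cap.isOpened():
        print("Opened camera index", idx)
        break
    cap.release()
else:
    print("Unable to open any camera")
    exit(-1)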

One more thing: I want to train your implementation to detect melanoma (a small object).
I have a good dataset, but I am a beginner and I need a step-by-step tutorial to train this implementation.

yolov3 on coco

Hi
Thank you for the code.
I am just wondering if you could share the results of your implementation on COCO using yolov3.
Thank you

Inference result and FPS issue

Thanks for your kindness. :-)

I have two questions.

I have found that there is a difference between the C-based code and the pytorch inference result.
Please see the attached picture.

result

As a result, objects are detected well, but the box around the bike in particular is not the same as the C-based result.
Do you know where this problem occurs in the code?
Solving this difference may help to converge performance to original YOLOv3.
I am trying to resolve this issue.

FPS performance is lower than original one.
The pytorch version seems to be difficult to process in real time.
Is there any way to improve FPS performance?
I would expect there to be little time difference between PyTorch and C for the GPU processing...

Thanks for your support.

embedded null byte

Hi
I've been struggling with this problem for a while now. When I run the train.py file, I get the error "embedded null byte". It seems to happen when the code tries to read images using the filenames. Here's the traceback I get.

Traceback (most recent call last):
File "train.py", line 351, in
main()
File "train.py", line 144, in main
nsamples = train(epoch)
File "train.py", line 204, in train
for batch_idx, (data, target) in enumerate(train_loader):
File "/home/amirking/anaconda3/envs/kerasenv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 264, in next
batch = self.collate_fn([self.dataset[i] for i in indices])
File "/home/amirking/anaconda3/envs/kerasenv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 264, in
batch = self.collate_fn([self.dataset[i] for i in indices])
File "/home/amirking/yolo3/andy/dataset.py", line 62, in getitem
img, label = load_data_detection(imgpath, self.shape, jitter, hue, saturation, exposure)
File "/home/amirking/yolo3/andy/image.py", line 122, in load_data_detection
img = Image.open(imgpath).convert('RGB')
File "/home/amirking/anaconda3/envs/kerasenv/lib/python3.6/site-packages/PIL/Image.py", line 2548, in open
fp = builtins.open(filename, "rb")
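The traceback is cut off above, but one way to locate the offending entry is to scan the train list for paths containing a null byte (a minimal sketch, assuming one image path per line as in the marvis-style list files):

with open('voc_train.txt', 'rb') as f:      # path to your own train list
    for lineno, raw in enumerate(f, 1):
        if b'\x00' in raw:
            print('null byte in line', lineno, ':', raw)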

Problem in train.py

Hi, while I was running train.py, I noticed that my test() isn't being called. I've changed max_epochs to just 2, just to try it out.
Moreover, it's taking a lot of time to run even 1 epoch. Can you help me?

Training does not converge compared with Darknet Yolo

Hi, I'm using the same configuration files (standard yolov3.cfg file) and data to train using this code, and with darknet. With darknet I get to a loss of <0.5 after a few hours, whereas in Pytorch the loss is at ~100. Moreover, in darknet I get good detection results and in Pytorch the detector does not seem to converge - it has many false positives and many misses.
What am I missing? I thought the two implementations should be compatible.
Thanks very much!

perfect and excellent codes

This is a perfect and excellent codebase. I trained on the COCO dataset for 200 epochs with this repo, used the resulting weights, and got the same performance as the official yolov3 weights on some videos. Thank you for your help, sir.

Training Problem

Hello sir,

First of all, thank you for your efforts. I'd like to know whether I've been doing something wrong while training yolo on a simple small dataset (I will try to train it on a bigger dataset later). Following the tutorial at https://medium.com/@manivannan_data/how-to-train-yolov2-to-detect-custom-objects-9010df784f36 , I decided to get the hang of training yolo with a dataset that has only 1 class and 232 images. My computer does not have a GPU that can use CUDA; therefore, I wanted to see how training goes on the CPU, and to do that I changed the curmodel function in train.py to return the global model only (I tried to print the number of GPUs the code detects; even though I do not own an nvidia GPU, it says I have 4, which is why I changed that function). Then I adjusted the required .data and .names files, set tiny_yolo.cfg to a batch size of 8 so that I could see a few training steps more quickly, and, since there is only 1 class, set the number of filters at the last convolutional layer to (1 + 5)*5 = 30 with 1 class to detect at the region layer.
As for my problem, the image I added to this issue represents it: the loss does not change at all (or I might be misinterpreting it). The situation is similar with pretrained weights as well, with only a slight change in the loss values.
yolo_train_problem
I have a couple questions regarding my situation:

  1. What might I be doing wrong?
  2. Is it because I do not use any GPU (when I train with my actual dataset, the machines I'll use will have GPUs)?
  3. Or am I impatient?

Thanks in advance,

Burak

EDIT: When I set the learning rate to small numbers like 1e-5 or 1e-7, the situation does not change; however, when I set it to 1, the loss started around 0.035 and almost always decreased, and I stopped the run with Ctrl-C when the loss was around 0.009.

Multi-GPUs setting

Hello, I'm testing out the VOC example. In the voc.data config file, the last line specifies the GPUs to use. I used the default setting, but upon starting training I noticed that all my GPU cards were in use (I have 8 cards). I want to understand why the config file is not controlling the training setting. I tried to use 1 GPU, but training still launches on all available cards.

I cannot use your code to train the VOC dataset

I used the https://github.com/marvis/pytorch-yolo2 method to get voc_train.txt, and I use voc.data.
The command is like: python train.py -d cfg/voc.data -c cfg/yolo_v3.cfg -w yolov3.weights
Then I modified some values to adapt the yolo layers to the original VOC:
[yolo]
mask = 0,1,2
anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
classes=20
1. I modified the class number to 20
[convolutional]
size=1
stride=1
pad=1
filters=75
activation=linear
2. I changed the three conv layers' filter numbers to 3*(20+5) = 75
In addition to these operations,
I changed the batch size and subdivisions:
batch=64
subdivisions=16

Training

;batch=64
;subdivisions=16
BUT! The strange thing happened:
image
The training stopped
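For reference, the filter count of the convolutional layer right before each [yolo] block is anchors_per_scale * (classes + 5); a quick sanity check for the VOC change above:

def yolo_filters(num_classes, anchors_per_scale=3):
    # x, y, w, h, objectness plus one score per class, for each anchor of the scale
    return anchors_per_scale * (num_classes + 5)

print(yolo_filters(20))   # 75, matching the cfg change above
print(yolo_filters(80))   # 255, the COCO default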

yolov3.weights cannot be found and training format

I was planning to train my own data, but it says yolov3.weights was not found. I checked your master repo and I didn't find it there either. Was it accidentally removed, or is this intended?

I had the weights earlier, but wondered what happened.

Besides, what is the format for training our custom data?

I have

train/class0

train/class1

and

val/

that has the image#.jpg

edit: I love this code and thanks for everything!

Running voc_eval.py gives an error

Hi @andy-yun ,

Thanks for the cool repository.

I am facing an issue while validating voc dataset.

What changes do I need to make to run voc_eval.py with Python 3.5?

I saw that the code in the file uses Python 2.7 syntax.

Regards,
Mahavir

Getting training for (390407, 6101)

I am trying to train yolo on my own dataset of 90 images of 2 classes.
I have the .data file with the train, valid, and names entries all set to the absolute paths of the text files and the .names file.
My train and test .txt files contain the absolute paths to the images.
My .names file contains 2 classes,
and I am using the original yolov3.weights file.
I also have configured the .cfg to fit 2 classes.

However when I train, with the code
python train.py -d {.data file} -c {.cfg file} -w {yolov3.weights}

Only one line is displayed
Training for (390407, 6101)

And then the run ends. I do not see any training being done. Is there something wrong with what I did?

param about initial weights

sir ,I want to know:
python train.py -d cfg/coco.data -c cfg/yolo_v3.cfg -w yolov3.weights

Does -w mean the initial weights?

How can I train model without initial weights?
Thank you sir~

Problem with performance on voc dataset

I trained yolov2 on the VOC dataset and got the following performance. Such high precision with such low recall does not seem correct. I also tried the marvis repo https://github.com/marvis/pytorch-yolo2 and its result is reasonable.

python train.py -d cfg/voc.data -c cfg/yolo-voc.cfg -w darknet19_448.conv.23 -l
2018-09-13 16:29:17 [020] correct: 1045, precision: 0.914261, recall: 0.086866, fscore: 0.158656
2018-09-13 16:55:40 [030] correct: 2319, precision: 0.938866, recall: 0.192768, fscore: 0.319859
2018-09-13 17:22:24 [040] correct: 2666, precision: 0.906494, recall: 0.221613, fscore: 0.356152
2018-09-13 17:48:53 [050] correct: 3214, precision: 0.913068, recall: 0.267165, fscore: 0.413373
2018-09-13 18:15:42 [060] correct: 3986, precision: 0.893922, recall: 0.331338, fscore: 0.483470
2018-09-13 18:42:38 [070] correct: 3731, precision: 0.916257, recall: 0.310141, fscore: 0.463417
2018-09-13 19:10:27 [080] correct: 4692, precision: 0.918379, recall: 0.390025, fscore: 0.547519
2018-09-13 19:38:13 [090] correct: 3848, precision: 0.910123, recall: 0.319867, fscore: 0.473363
2018-09-13 20:05:11 [100] correct: 5322, precision: 0.904640, recall: 0.442394, fscore: 0.594201
2018-09-13 20:32:11 [110] correct: 4921, precision: 0.891809, recall: 0.409061, fscore: 0.560857
2018-09-13 20:59:11 [120] correct: 4925, precision: 0.904666, recall: 0.409393, fscore: 0.563690
2018-09-13 21:26:28 [130] correct: 5215, precision: 0.876765, recall: 0.433500, fscore: 0.580149
2018-09-13 21:53:38 [140] correct: 4876, precision: 0.899631, recall: 0.405320, fscore: 0.558850
2018-09-13 22:21:21 [150] correct: 5108, precision: 0.904551, recall: 0.424605, fscore: 0.577922
2018-09-13 22:48:52 [160] correct: 5569, precision: 0.913250, recall: 0.462926, fscore: 0.614404
2018-09-13 23:15:48 [170] correct: 5630, precision: 0.914406, recall: 0.467997, fscore: 0.619119
2018-09-13 23:42:43 [180] correct: 5571, precision: 0.915982, recall: 0.463092, fscore: 0.615168
2018-09-14 00:09:45 [190] correct: 5675, precision: 0.911940, recall: 0.471737, fscore: 0.621811
2018-09-14 00:37:38 [200] correct: 5744, precision: 0.912325, recall: 0.477473, fscore: 0.626864
2018-09-14 01:04:43 [210] correct: 5623, precision: 0.921803, recall: 0.467415, fscore: 0.620293
2018-09-14 01:31:52 [220] correct: 5607, precision: 0.917976, recall: 0.466085, fscore: 0.618256
2018-09-14 01:59:03 [230] correct: 5739, precision: 0.917066, recall: 0.477057, fscore: 0.627620
2018-09-14 02:26:00 [240] correct: 5806, precision: 0.913468, recall: 0.482627, fscore: 0.631563
2018-09-14 02:53:08 [250] correct: 5703, precision: 0.919839, recall: 0.474065, fscore: 0.625667
2018-09-14 03:20:08 [260] correct: 5728, precision: 0.918243, recall: 0.476143, fscore: 0.627103
2018-09-14 03:47:23 [270] correct: 5675, precision: 0.921865, recall: 0.471737, fscore: 0.624102
2018-09-14 04:14:34 [280] correct: 5668, precision: 0.924331, recall: 0.471155, fscore: 0.624156
2018-09-14 04:41:38 [290] correct: 5762, precision: 0.919713, recall: 0.478969, fscore: 0.629894
2018-09-14 05:08:53 [300] correct: 5717, precision: 0.918689, recall: 0.475229, fscore: 0.626413
2018-09-14 05:35:59 [310] correct: 5802, precision: 0.914277, recall: 0.482294, fscore: 0.631471
2018-09-14 06:45:14 [020] correct: 0, precision: 0.000000, recall: 0.000000, fscore: 0.000000

About the trained result

@andy-yun, thank you for providing such a good option for using YOLOv3 in PyTorch.

I have tested many other PyTorch YOLOv3 training solutions; they usually show similar problems when predicting an image.
So could you share a weight file or detection results from your trained model?
Thank you~!

False positive issue

Hi, thanks for providing the code.
Your implementation seems to be the best among the YOLOv3 PyTorch versions on GitHub.
I experimented with the code, and I have a few questions.

After training with my data, there is a false positive issue at inference time (I only changed the dataset).
For example, boxes overlap on small objects and some boxes aren't accurate.
Since inference with the author's weights shows no big problems, the issue presumably lies somewhere in the training process.

I'm trying to solve this problem, but it's not easy.
Do you know where this problem occurs in the code?
If you can afford it, I would like you to consider this problem together.
Thank you.

About Conf_mask

Hi, thank you for your efforts on this code. I was confused about the conf_mask in yolo_layer.py.

Before calculating the true conf_mask, it is initialized as conf_mask = torch.ones(nB, nA, nH, nW). # L47
However, it is then set to zero according to ignore_ix (# L76), and assigned back to 1 based on the selected anchor box (# L99).
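Not the repository's exact code, but a runnable sketch of the sequence the question describes, which may make the intent clearer: every cell is penalized as background by default, cells whose predictions overlap some ground truth above the ignore threshold are excluded entirely, and the anchor responsible for a ground truth is switched back on so it is trained toward 1:

import torch

nB, nA, nH, nW = 2, 3, 13, 13
conf_mask = torch.ones(nB, nA, nH, nW)         # L47: every location contributes a no-object loss
ignore_ix = torch.rand(nB, nA, nH, nW) > 0.9   # stand-in for "IoU with some gt above the ignore threshold"
conf_mask[ignore_ix] = 0                       # L76: such predictions are ignored entirely
b, a, gj, gi = 0, 1, 6, 6                      # stand-in for the anchor matched to a ground truth
conf_mask[b, a, gj, gi] = 1                    # L99: the responsible anchor is trained toward 1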

Many thanks.

About the net

Sir, I want to know about the output of the last layers: 13x13x255, 26x26x255, 52x52x255 (80 classes). Do the 255 channels mean Xx3|Yx3|Wx3|Hx3|confidencex3|classesx3 or (X|Y|W|H|confidence|classes)x3? Which order is right?
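For what it's worth, the usual darknet layout (which most PyTorch ports, presumably including this one, follow) is the second one: the 255 channels are anchor-major, i.e. (X|Y|W|H|confidence|classes)x3, so each anchor owns a contiguous block of 5 + 80 channels. A hedged sketch of the corresponding reshape:

import torch

nB, nH, nW = 1, 13, 13
num_anchors, num_classes = 3, 80
out = torch.randn(nB, num_anchors * (5 + num_classes), nH, nW)   # 255 channels
pred = out.view(nB, num_anchors, 5 + num_classes, nH, nW)
xywh   = pred[:, :, 0:4]   # box offsets, per anchor
conf   = pred[:, :, 4]     # objectness, per anchor
scores = pred[:, :, 5:]    # class scores, per anchor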

ValueError: only one element tensors can be converted to Python scalars

Hi,

First of all, great job!!

Second, I am trying to train on voc data (or even my own data).

  1. In the test function, when the batch size > 1 and len(gpus) > 1, I receive the error:

Traceback (most recent call last):
File "/home/o/workspace/pytorch-0.4-yolov3/train.py", line 350, in
main()
File "/home/o/workspace/pytorch-0.4-yolov3/train.py", line 144, in main
fscore = test(epoch)
File "/home/o/workspace/pytorch-0.4-yolov3/train.py", line 307, in test
all_boxes = get_all_boxes(output, conf_thresh, num_classes, use_cuda=use_cuda)
File "/home/o/workspace/pytorch-0.4-yolov3/utils.py", line 111, in get_all_boxes
pred, anchors, num_anchors = output[i]['x'].data, output[i]['a'], output[i]['n'].item()
ValueError: only one element tensors can be converted to Python scalars

The main problem is here: output[i]['n'].item(),
because output[i]['n'] contains two values and not only one...

How can I fix it?
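Not a confirmed fix, but when nn.DataParallel gathers the per-GPU output dictionaries, 'n' can come back as a tensor with one entry per GPU, and .item() then fails; since num_anchors is identical on every replica, taking the first entry is a possible workaround:

import torch

# stand-in for output[i]['n'] after DataParallel gathers the results of 2 GPUs
n = torch.tensor([3, 3])
# .item() raises on multi-element tensors; take the first entry instead
num_anchors = int(n.view(-1)[0].item())
print(num_anchors)   # 3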

  2. When I only train without testing, the loss is not reduced over epochs.
    Did anyone succeed in training on the VOC data?
    I want to succeed in training on the VOC data before starting with my own data.

Thank You!!!

Olgit

Augmentation

It seems like your code is hard-coded to perform augmentation (crop, scaling, shift, flip, etc.) on images by default. I am interested in an update to make that an option rather than a default feature, because some people might want to use a different set of augmentations.

How to specify labels path for training?

I'm trying to train the model on the COCO dataset. I've realized I can set the train/valid targets by modifying the parameters in "coco.data". However, I don't see a place where I can set the labels target.
My labels are stored in txt files, where each txt file contains the label for a single image, as follows:

Image: /data/coco/images/train2014/COCO_train2014_000004.jpg
label: /data/coco/labels/train2014/COCO_train2014_000004.txt

How should I proceed?
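Not documented in the README, but marvis-style dataset code typically derives the label path from the image path by swapping 'images' for 'labels' and the image extension for '.txt', so no separate labels entry is needed in coco.data as long as your tree mirrors that layout. A hedged sketch of the convention:

def label_path_for(img_path):
    # marvis-style convention: labels mirror the images tree
    return (img_path.replace('images', 'labels')
                    .replace('JPEGImages', 'labels')
                    .replace('.jpg', '.txt')
                    .replace('.png', '.txt'))

print(label_path_for('/data/coco/images/train2014/COCO_train2014_000004.jpg'))
# -> /data/coco/labels/train2014/COCO_train2014_000004.txt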

images do not match error

hello, I was testing the code and used a random picture (sized 400x544 px) and got this error. It was working on the previous images. 5 of the labels are correct, but it couldn't write anything.

5 box(es) is(are) found
bus: 1.000000
person: 1.000000
car: 0.999882
car: 0.999940
car: 0.988874
Traceback (most recent call last):
  File "detect.py", line 113, in <module>
    detect(cfgfile, weightfile, imgfile)
  File "detect.py", line 39, in detect
    plot_boxes(img, boxes, 'data/predictions1.jpg', class_names)
  File "D:\xx\pytorch-0.4-yolov3\utils.py", line 296, in plot_boxes
    drawtext(img, (x1, y1), text, bgcolor=rgb, font=font)
  File "D:\xx\pytorch-0.4-yolov3\utils.py", line 259, in drawtext
    img.paste(box_img, (pos[0], pos[1]-th-2))
  File "D:\xxx\Anaconda3\lib\site-packages\PIL\Image.py", line 1408, in paste
    self.im.paste(im, box)
ValueError: images do not match

Performance with gtx1080ti

Hi, I am training my custom dataset with this yolov3 implementation on a GTX 1080 Ti.

And I have 3 questions:

  1. What is the biggest batch size for this network? I use a batch size of 16, otherwise the VRAM isn't enough.

  2. When I train the network, the workload of the graphics card is only at 20%. Is that normal?

  3. How many iterations does it normally take until the model converges?

thank you in advance
Stefan

How to use multi GPUs?

I successfully started training, but it is only using one GPU. Is there anything except the data file that I need to change?

VOC on yolov2

Hi,

Thank you for the code!
Without changing the code for yolov2, and using the parameters below, the highest mAP I got is 0.7222 on VOC at epoch 90. I am not sure if I have missed anything?
anchors = 1.1468, 1.5021, 2.7780, 3.4751, 4.3845, 7.0162, 8.2523, 4.2100, 9.7340, 8.682
object_scale=3
noobject_scale=1
class_scale=1
coord_scale=1

Also, for training and testing, is the image size only 416*416, or is it trained and tested with multiple scales?

Thank you for the help!

training error

I trained on VOC2007 for 105 epochs, then an error occurred. I don't know why; can anyone help me? Thanks.
The error is as follows:

Traceback (most recent call last):
File "train.py", line 372, in
main()
File "train.py", line 154, in main
nsamples = train(epoch)
File "train.py", line 216, in train
for batch_idx, (data, target) in enumerate(train_loader):
File "f:\Anaconda3\envs\spurs\lib\site-packages\torch\utils\data\dataloader.py", line 336, in next
return self._process_next_batch(batch)
File "f:\Anaconda3\envs\spurs\lib\site-packages\torch\utils\data\dataloader.py", line 357, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
File "f:\Anaconda3\envs\spurs\lib\site-packages\torch\utils\data\dataloader.py", line 106, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "f:\Anaconda3\envs\spurs\lib\site-packages\torch\utils\data\dataloader.py", line 187, in default_collate
return [default_collate(samples) for samples in transposed]
File "f:\Anaconda3\envs\spurs\lib\site-packages\torch\utils\data\dataloader.py", line 187, in
return [default_collate(samples) for samples in transposed]
File "f:\Anaconda3\envs\spurs\lib\site-packages\torch\utils\data\dataloader.py", line 164, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 416 and 480 in dimension 2 at c:\programdata\miniconda3\conda-bld\pytorch_1533090623466\work\aten\src\th\generic/THTensorMath.cpp:3616

training on your own data

So I want to train the network on my own data. I have images and the .xml files for each image. If I understood correctly, I have to create a my.data file, but I'm not quite sure what I should write in it. In coco.data it says:
train = coco_train.txt
valid = coco_test.txt
names = data/names
backup = backup
gpus = 0,1,2,3
What should I write in the train and valid sections? I'm not sure what the backup part does, and I'm not going to use any GPU. Any suggestions?
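Not an authoritative answer, but based on the coco.data layout quoted above, a my.data for a custom set would look roughly like this; train and valid point to text files listing one absolute image path per line, names points to your class-name file, backup is the directory where checkpoint weights are written, and gpus can be left at 0 (or ignored for CPU-only runs):

train = /path/to/my_train.txt
valid = /path/to/my_valid.txt
names = data/my.names
backup = backup
gpus = 0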

If I want to add a new class to the Yolov3 model with trained coco weights

@andy-yun Thank you for so much effort in the repo.
I have a question,
If I want to train a model that can detect a class not in COCO, and I have the images and labels, could I modify the coco cfg file, transfer the lower layers' weights and freeze them, and then train only the top layers related to the class number?
Thank you ~
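Not something this repository exposes as a flag, but in PyTorch freezing the transferred lower layers usually comes down to setting requires_grad = False before building the optimizer; a minimal sketch with illustrative module names (not the real Darknet class):

import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(               # stand-in for the Darknet backbone + detection head
    nn.Conv2d(3, 32, 3, padding=1),  # "lower" layers whose weights were transferred
    nn.Conv2d(32, 255, 1),           # "top" layer to fine-tune for the new class count
)

for p in model[0].parameters():      # freeze the transferred part
    p.requires_grad = False

optimizer = optim.SGD((p for p in model.parameters() if p.requires_grad), lr=1e-3)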

All values of tensor become Zero and training stops

Hi @andy-yun ,

Thanks for your amazing work.

But I am facing an issue while using your code to train a model with the pre-trained yolov3.weights and the VOC dataset.

The model loads successfully, but training stops abruptly with all the tensor values zero.

Can you please help me debug the cause?

I have attached images of the same below.

train_1
screen_2
screen_4

With Regards,
Mahavir

Possibly Incorrect Loss Terms

Hello, thank you for your YOLOv3 repo. I noticed your loss term is different than the official YOLOv3 loss in at least two ways:

  1. I think you should use BCELoss for loss_cls, as the YOLOv3 paper section 2.2 clearly states "During training we use binary cross-entropy loss for the class predictions."

  2. Why is MSELoss used in place of BCELoss for loss_conf? Did you make this choice yourself or did you see this in darknet?

  3. Why divide loss_coord by 2?

https://github.com/andy-yun/pytorch-0.4-yolov3/blob/master/yolo_layer.py#L161-L164

loss_coord = nn.MSELoss(size_average=False)(coord*coord_mask, tcoord*coord_mask)/2
loss_conf = nn.MSELoss(size_average=False)(conf*conf_mask, tconf*conf_mask)
loss_cls = nn.CrossEntropyLoss(size_average=False)(cls, tcls) if cls.size(0) > 0 else 0
loss = loss_coord + loss_conf + loss_cls
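For comparison, a sketch of the BCE-based variant suggested in points 1 and 2 (not the repository's code; conf and cls are assumed to already be sigmoid outputs in [0, 1], and tcls would have to be a multi-hot float target rather than the class indices CrossEntropyLoss expects):

import torch.nn as nn

def paper_style_loss(coord, tcoord, coord_mask, conf, tconf, conf_mask, cls, tcls):
    # binary cross-entropy on objectness and on each class independently,
    # as described in YOLOv3 section 2.2; summed to mirror size_average=False above
    mse = nn.MSELoss(size_average=False)
    bce = nn.BCELoss(size_average=False)
    loss_coord = mse(coord * coord_mask, tcoord * coord_mask)
    loss_conf = bce(conf * conf_mask, tconf * conf_mask)
    loss_cls = bce(cls, tcls) if cls.numel() > 0 else 0.0
    return loss_coord + loss_conf + loss_cls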

about training

hi, @andy-yun , thank you for your wonderful yolo-v3 pytorch code.
I saw your mAP result on VOC is 74.9%; have you obtained a better result recently?
And are there any training tricks in your code?

YOLO v3 for UCF101

Hi,

I want to train YOLO v3 on the UCF101 video dataset; how should I start and do this?

CUDA out of memory error during training

Hi,

I'm training a model on my own data, on a dual-GPU system with Python 3.6 and PyTorch 0.4.1, both from Anaconda. Everything is fine for a while: the config was adapted to a single-class model, both GPUs are used properly, a batch size of 48 with 2 subdivisions results in 12 GB of memory usage out of 16 GB on each GPU, and the model slowly converges over iterations. However, after a number of iterations (different each time, but somewhere between 1300 and 1500), training crashes with a CUDA out of memory error and cannot resume as is.

Traceback (most recent call last):
File "train.py", line 362, in
main()
File "train.py", line 152, in main
nsamples = train(epoch)
File "train.py", line 242, in train
sum(org_loss).backward()
File "/home/bruno/anaconda/envs/pytorch/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/bruno/anaconda/envs/pytorch/lib/python3.6/site-packages/torch/autograd/init.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: out of memory

From there, my only option to resume training from checkpoint is to reduce the batch size to about a single image per GPU per iteration, anything else saturates at least one GPU (the second one seems to increase its memory usage and crash first, but maybe that'd happen too on the first one given time).

I thought increasing subdivisions would help, and ultimately that having subdivisions == batch size would have as low a memory footprint as possible, but that still produces the error, which only goes away with a tiny batch size.

I tried setting "random=0" in the config as if I understand correctly, that disables the random resizing of the net, which could make the memory usage vary, but the error happens again around the same number of iterations when I expected it to remain constant over time.

I assume there's something I'm doing wrong or missing in the YOLO configuration but not sure what. Any ideas?

About the predicted result

Hi, Andy. I have two questions about the output.

  1. I found that there is something wrong with the detected boxes: after using the function correct_yolo_boxes( ), the left and top of a box can be negative.

e.g. -3.15759015083 26.1544475555 92.0868988037 106.778274536

  2. The conf_threshold used in testing is 0.5 while it is set to 0.25 in training. I think they should be the same.

Another question is about non-maximum suppression in training. I didn't find it in the training procedure.
Do you use nms in the training procedure?

Many thanks.
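For the negative left/top values in the first question, clamping the corners to the image bounds after correct_yolo_boxes( ) is a common post-processing step (a sketch, not the repository's code):

def clamp_box(x1, y1, x2, y2, img_w, img_h):
    # keep both corners inside the image after the letterbox correction
    x1 = min(max(x1, 0), img_w - 1)
    y1 = min(max(y1, 0), img_h - 1)
    x2 = min(max(x2, 0), img_w - 1)
    y2 = min(max(y2, 0), img_h - 1)
    return x1, y1, x2, y2

print(clamp_box(-3.16, 26.15, 92.09, 106.78, 640, 480))   # the negative left becomes 0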

learning rate/batch_size and loss

I appreciate this great work!!

I have some questions about loss and learning rate.

In the region_layer.py of marvis/yolov2,
all losses [cls, conf, etc.] are summed, and the learning rate in train.py is divided by the batch size.
Just my guess: this is equivalent to the following back propagation, where
w` = updated weight
w = current weight
lr = learning rate

origin : w` = w - lr * dw (average over the batch)
marvis : w` = w - lr/batch * dw (sum over the batch)
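In other words, the two updates coincide; with $\eta$ the learning rate and $B$ the batch size,

$$ w' = w - \frac{\eta}{B} \sum_{i=1}^{B} \nabla_w \ell_i = w - \eta \cdot \frac{1}{B} \sum_{i=1}^{B} \nabla_w \ell_i , $$

so dividing the learning rate by the batch size while summing the per-sample losses gives the same step as keeping the plain learning rate and averaging them.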

In your region_layer.py,
the losses are divided by nB, and the lr of the optimizer is also divided by batch_size in train.py.

If possible, would you tell me about the intention behind this?

Error when validating model in VOC

Hello, thanks for the script.
The train.py code is working great. However, when I run the following commands to evaluate a model on VOC:
python valid.py cfg/voc.data cfg/yolo-voc.cfg backup/000002.weights
python scripts/voc_eval.py results/comp4_det_test_
I get an error:
Traceback (most recent call last):
File "scripts/voc_eval.py", line 265, in
_do_python_eval(res_prefix, output_dir = 'output')
File "scripts/voc_eval.py", line 241, in do_python_eval
use_07_metric=use_07_metric)
File "scripts/voc_eval.py", line 149, in voc_eval
BB = BB[sorted_ind, :]
IndexError: too many indices for array
"save_interval" is set to 2, and 000002.weights is a trained model weight file.
I find that all the generated comp4_det_test_* files in the /results directory are 0 bytes.
By looking at the valid.py and utils.py files and printing the values of the variables, I guess that because the conf values are all less than conf_thresh, the following code is never executed:
if conf > conf_thresh:   (line 174, in utils.py)
    bcx = xs[ind]
    bcy = ys[ind]
    bw = ws[ind]
    bh = hs[ind]
    ...
I want to know what went wrong and what I should do...
Any suggestion will be appreciated! Thanks.
