
Comments (17)

glenn-jocher commented on July 27, 2024

@libzzluo of course. My update to @eriklindernoren's YOLOv3 is available here:

https://github.com/ultralytics/xview-yolov3

I'm not sure how usable it is for you out of the box, however, as I've modified it significantly, not just to fix the convergence issue but also to adapt it to the xView 2018 Object Detection Challenge (rather than COCO).

This repo converges when training on the xView dataset and produced a 0.16 mAP in the challenge (the highest mAP was 0.27). I've attached a picture of the losses, precision and recall during training. I'll be branching this repo into a new COCO-specific repo in the coming days and will post it here when complete.

[Image: xView training losses, precision and recall]

ludovic-carre commented on July 27, 2024

Can you please give more details about what still needs to be done for training to work? I am currently working on your training code and would like to implement what is missing.

LalitPradhan commented on July 27, 2024

Hi @eriklindernoren, I'm trying to train YOLOv3 on a small dataset of 1.3k images that is significantly different from COCO. I intend to use the darknet53.conv.74 pretrained weights provided by the pjreddie repo.

Since I have a small dataset, I figured training from the darknet53.conv.74 weights is the best approach. darknet53.conv.74 has weights up to line 549 in yolo-obj.cfg (I followed the instructions for training on custom images).

I did the following (a sketch of this scheme is shown after the list):

  1. Initialized all weights at random using model.apply(weights_init_normal).
  2. Loaded and overrode the weights up to conv_73 using model.load_weights(opt.weights_path), where weights_path pointed to darknet53.conv.74. This way I had pretrained weights up to conv_73 and randomly initialized weights for the remaining layers (the 3 YOLO layers).
  3. Trained all layers, keeping the lr at the default given for training in your code.
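For concreteness, a minimal sketch of that loading scheme, assuming this repo's Darknet model and weights_init_normal helper (the config and weights paths are placeholders):

```python
from models import Darknet
from utils.utils import weights_init_normal

# Hypothetical config path; substitute your own custom .cfg.
model = Darknet("config/yolov3-custom.cfg")

# Step 1: initialize every layer at random.
model.apply(weights_init_normal)

# Step 2: override the backbone layers with the pretrained darknet53.conv.74
# weights. Layers past conv_73 (the three YOLO heads) keep their random init,
# since the weights file only covers the first 74 convolutional layers.
model.load_weights("weights/darknet53.conv.74")
```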

My training stdout shows the conf dropping to 0.01 by the end of 30 epochs, and detect.py doesn't detect any object even with a low conf threshold.

Am I training it the correct way? Or do I need to keep lr = 0 for the pretrained weights up to conv_73 and train only the remaining layers? Also, do I need to change the method of saving the weights to use state_dict(), as suggested in some of the issues (#45)?

eriklindernoren commented on July 27, 2024

Hi,

How long did you train the model? There is still work to be done on the training support. Data augmentation as well as weight decay and learning rate decay need to be added, along with some other things.
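As a rough illustration of the latter two items, adding weight decay and a learning-rate decay in PyTorch could look like this; the optimizer choice and hyperparameter values are assumptions, not the repo's settings:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # stand-in for the Darknet model

# SGD with momentum and L2 weight decay (values are illustrative assumptions).
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=5e-4)

# Decay the learning rate 10x every 30 epochs (also an assumption).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(160):
    # ... run one training epoch here ...
    scheduler.step()  # apply the learning-rate decay on a fixed schedule
```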

eriklindernoren commented on July 27, 2024

The confidence mask needs to be fixed. I will probably get around to fixing that pretty soon. The issue I keep running into is that recall and precision seem to increase as I train the model, but during inference (using model.eval()) the model outputs junk. Maybe this can be attributed to the way the network is trained w.r.t. the confidence loss at the moment (or to the fact that I simply don't train it long enough), but I'm not sure. I'll keep experimenting. Feel free to do the same if you want. :)

glenn-jocher commented on July 27, 2024

I see a similar dissociation: good training results but poor inference later on.

I don't see an obvious culprit yet. One small change I noticed is that the loss criterion for the xy coordinates should be MSE rather than BCE.

I think the selection of the best anchor in build_targets has a bug in it as well, as it relies on IoUs computed with a common top-left corner. I fixed this and vectorized these operations in my cloned repo, and I see slightly different results. I could submit a pull request if you'd like to take a look; both points are sketched below.
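These are hedged illustrations of the two points, not the repo's exact code:

```python
import torch
import torch.nn as nn

# (1) MSE instead of BCE on the xy offsets. Both predictions and targets are
# offsets in [0, 1] within a grid cell.
x, tx = torch.rand(8), torch.rand(8)
y, ty = torch.rand(8), torch.rand(8)
mse = nn.MSELoss()
loss_xy = mse(x, tx) + mse(y, ty)                       # suggested criterion
# loss_xy = nn.BCELoss()(x, tx) + nn.BCELoss()(y, ty)   # original criterion

# (2) The "common top-left corner" IoU referred to above: when boxes are
# anchored at the same corner, only widths and heights enter the IoU.
def wh_iou(wh1, wh2):
    """IoU matrix (N, M) for width-height pairs wh1 (N, 2) and wh2 (M, 2)."""
    wh1, wh2 = wh1[:, None], wh2[None]        # broadcast to (N, M, 2)
    inter = torch.min(wh1, wh2).prod(dim=2)   # overlap area
    union = wh1.prod(dim=2) + wh2.prod(dim=2) - inter
    return inter / union
```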

The code is very clean and concise though, nice work.

libzzluo commented on July 27, 2024

@glenn-jocher Hi, would you mind sharing your code? I ran into the same problem but can't fix it. Thanks!

feidongxi commented on July 27, 2024

Thanks @glenn-jocher, that's amazing! I will try the code in my repo. 😊

glenn-jocher commented on July 27, 2024

Ah @feidongxi @libzzluo, yes, you can try and fork this right now, but it might be easier if you wait a day or two, as I'm going to roll back all of the COCO -> xView specific changes I made into a new repo, so that it's back to the COCO-facing implementation Erik originally created. In any case, I'll post it when it's done.

@eriklindernoren I could submit a pull request at that point if you'd like to try to merge these changes, but first of course I need to verify that COCO trains to a proper mAP, which will take more time, perhaps a week.

eriklindernoren commented on July 27, 2024

@glenn-jocher That sounds great. Haven't had much time to work on this lately. Appreciate it!

glenn-jocher commented on July 27, 2024

@eriklindernoren got it. What happened when you tried to train originally?

I've validated your 0.58 mAP in my forked repo after realigning it to COCO (using Redmon's weights), and I've tentatively trained a few epochs from scratch with good convergence. I'm realizing, though, that replicating Redmon's mAP after 160 epochs will be a significant challenge, simply due to a few details missing from his paper, such as his polynomial learning rate scheduler, his multi-scale training, and his augmentation strategies, which are touched on but not explicitly described anywhere I know of.

Some or all of these may be in the darknet repo; do you have any info on them? If not, I can take my best stab at it and see where the mAP lands. Theoretically my update is capable of all of these things, including full augmentation, so it's just a matter of figuring out what settings to use.

It's also going to take longer than I thought. My 1080 Ti appears good for about 16 epochs per day (120k images per epoch), so I imagine ~10 days to get to 160 epochs. Is this close to what you saw on your end?

libzzluo commented on July 27, 2024

@glenn-jocher
I checked the output log produced by the original darknet framework: the network randomly changes its input size between 320 and 608 (in steps of 32) every 10 epochs. What's more, if I set random = 1 in the original framework's yolov3.cfg, the GPU (a 1080 with 8 GB of memory) sometimes runs out of memory.

I will also try to analyze the details in the source code and read the paper. If I discover anything, I will add it here. But the C source code is so complicated...

glenn-jocher commented on July 27, 2024

@libzzluo hmm, ok, thanks for the info. So we have the multi-scale information now; that was one of my missing links. I'm surprised it's every 10 epochs, as that only leaves room for 16 different input sizes over 160 epochs. Is random = 1 set by default? Yes, the C is hard to decipher; I have not looked at it yet.

The full list of unknowns I had is below (a sketch of the first two items follows the list):

  • multi-scale training, i.e. img_size = random.choice(range(10, 20)) * 32
  • polynomial learning rate scheduler (for use with SGD)
  • image colorspace augmentation (currently I have +/- 50% on the S and V channels of HSV, which seems a bit excessive to my eye...)
  • image spatial augmentation (currently I have +/- 20% translation and zoom, random left-right flips, and 10 deg rotation; rotation is my own addition and can be set to 0; labels are augmented along with the image)
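As a sketch of the first two items (the polynomial decay formula here is an assumption, since the exact darknet schedule is one of the stated unknowns):

```python
import random

def random_img_size(low=10, high=20, stride=32):
    # Multi-scale training: pick an input size from 320 to 608 in steps of 32.
    return random.choice(range(low, high)) * stride

def poly_lr(lr0, step, max_steps, power=4.0):
    # One common polynomial decay form: lr falls from lr0 to 0 over max_steps.
    # The power value is an assumption, not a confirmed darknet setting.
    return lr0 * (1.0 - step / max_steps) ** power
```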

glenn-jocher commented on July 27, 2024

I created examples with and without augmentation to illustrate the results; both spatial and colorspace augmentation are active here. I toned down the rotation to +/- 5 deg and added a random shear of +/- 3 deg (both can be disabled). A sketch of the colorspace part follows the images.

Augmented: [image of augmented training samples]

Standard: [image of default training samples]
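A minimal sketch of the colorspace augmentation, assuming OpenCV and the +/- 50% S and V fractions mentioned above (the spatial transforms are omitted):

```python
import cv2
import numpy as np

def augment_hsv(img_bgr, fraction=0.5):
    # Scale S and V by a random factor in [1 - fraction, 1 + fraction].
    img_hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    for ch in (1, 2):  # channel 1 = saturation, channel 2 = value
        img_hsv[..., ch] *= 1.0 + fraction * np.random.uniform(-1.0, 1.0)
    img_hsv[..., 1:] = np.clip(img_hsv[..., 1:], 0, 255)
    return cv2.cvtColor(img_hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
```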

glenn-jocher commented on July 27, 2024

I've finished creating my yolov3 repository:
https://github.com/ultralytics/yolov3

I've started training COCO using this repo, including the augmentation shown above. I'm running about 15 epochs per day so it will take a while to get to 160.

One concern I have, besides the unknowns stated above, is that I see a significant improvement in precision and recall using CELoss in place of BCELoss for the classification term. This is contrary to the original YOLOv3 loss function, which uses BCE for both classification and objectness, and MSE on the bounding boxes, so it could be an indicator of underlying problems elsewhere. Unfortunately, I suppose I have to wait another week until I reach epoch 160 to see the final effect of this change (the two criteria are sketched below the plot). For now, current progress is shown here:

[Image: COCO training loss curves]
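For reference, the two classification criteria being compared look like this (shapes and values are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(16, 80)              # 16 boxes, 80 COCO classes
target_idx = torch.randint(0, 80, (16,))  # one class index per box
target_onehot = F.one_hot(target_idx, 80).float()

# BCE treats each class as an independent binary problem (original YOLOv3).
bce_loss = nn.BCEWithLogitsLoss()(logits, target_onehot)

# CE assumes exactly one class per box (the substitution described above).
ce_loss = nn.CrossEntropyLoss()(logits, target_idx)
```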

eriklindernoren commented on July 27, 2024

@glenn-jocher That's great. Could you make a PR with your additions?

eriklindernoren commented on July 27, 2024

Yes, I believe there is an issue with Darknet.save_weights (see #89). I have changed to saving and loading state dicts in master, and now I get the model to converge, as well as preserved performance when saving and loading the model.
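In PyTorch terms, the state-dict approach looks roughly like this (config and checkpoint paths are placeholders):

```python
import torch
from models import Darknet  # this repo's model class

model = Darknet("config/yolov3.cfg")

# Save only the parameters (a state dict), not the raw Darknet weights format.
torch.save(model.state_dict(), "checkpoints/yolov3_ckpt.pth")

# Later: rebuild the model from the config and restore the parameters.
model = Darknet("config/yolov3.cfg")
model.load_state_dict(torch.load("checkpoints/yolov3_ckpt.pth"))
model.eval()  # performance is preserved across save/load
```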
