
Comments (83)

glenn-jocher avatar glenn-jocher commented on April 30, 2024 18

This is too funny!!
https://www.kaggle.com/c/global-wheat-detection/discussion/167375#931494
image

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024 7

@WongKinYiu excellent!! Good, I'm glad the results are reproducible. And yes, I have also trained a set of yolov5 models using the csp bottlenecks. The csp models are faster and more accurate, except for yolov5x, which is much faster, but may drop a bit in mAP. The updated models swap Bottleneck() for BottleneckCSP(), and increase the P5 bottlenecks from 3 to 6. yolov5s.yaml looks like this for example:

# parameters
nc: 80  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple

# anchors
anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32

# yolov5 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Focus, [64, 3]],  # 1-P1/2
   [-1, 1, Conv, [128, 3, 2]],  # 2-P2/4
   [-1, 3, Bottleneck, [128]],
   [-1, 1, Conv, [256, 3, 2]],  # 4-P3/8
   [-1, 9, BottleneckCSP, [256]],
   [-1, 1, Conv, [512, 3, 2]],  # 6-P4/16
   [-1, 9, BottleneckCSP, [512]],
   [-1, 1, Conv, [1024, 3, 2]], # 8-P5/32
   [-1, 1, SPP, [1024, [5, 9, 13]]],
   [-1, 6, BottleneckCSP, [1024]],  # 10
  ]

# yolov5 head
head:
  [[-1, 3, BottleneckCSP, [1024, False]],  # 11
   [-1, 1, nn.Conv2d, [na * (nc + 5), 1, 1, 0]],  # 12 (P5/32-large)

   [-2, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4
   [-1, 1, Conv, [512, 1, 1]],
   [-1, 3, BottleneckCSP, [512, False]],
   [-1, 1, nn.Conv2d, [na * (nc + 5), 1, 1, 0]],  # 17 (P4/16-medium)

   [-2, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3
   [-1, 1, Conv, [256, 1, 1]],
   [-1, 3, BottleneckCSP, [256, False]],
   [-1, 1, nn.Conv2d, [na * (nc + 5), 1, 1, 0]],  # 22 (P3/8-small)

   [[], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]

The new table looks like this.

| Model | APval | APtest | AP50 | Latency | FPS | params | FLOPs |
|---|---|---|---|---|---|---|---|
| YOLOv5-s (ckpt) | 35.5 | 35.5 | 55.0 | 2.5ms | 400 | 7.1M | 12.6B |
| YOLOv5-m (ckpt) | 42.4 | 42.4 | 61.8 | 4.4ms | 227 | 22.0M | 39.0B |
| YOLOv5-l (ckpt) | 45.7 | 45.9 | 65.1 | 6.8ms | 147 | 50.3M | 89.0B |
| YOLOv5-x (ckpt) | - | - | - | 11.7ms | 85 | 95.9M | 170.3B |
| YOLOv3-SPP (ckpt) | 45.6 | 45.5 | 65.2 | 7.9ms | 127 | 63.0M | 118.0B |

Now I am waiting for yolov5x to finish training. Probably another 1-2 days, and then I will push a commit with the updates. Note most of the speed gains you see here (compared to current readme table) are due to the reduced flops from the csp bottlenecks, though a bit of the speed improvement (maybe 0.1-0.4ms) is due to increased batch-size from 16 to 32.

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024 7

@glenn-jocher

I finished testing the performance of channels_last: it increases training speed by about 10~15% and batch-8 inference speed by ~3%.
And the most interesting thing is that it reduces training GPU RAM by ~30%.

https://github.com/ultralytics/yolov5/blob/master/train.py#L77
from

        model = Model(opt.cfg, ch=3, nc=nc).to(device)  # create

to

        model = Model(opt.cfg, ch=3, nc=nc).to(device).to(memory_format=torch.channels_last)  # create

https://github.com/ultralytics/yolov5/blob/master/train.py#L271
from

                pred = model(imgs)

to

                pred = model(imgs.to(memory_format=torch.channels_last))

https://github.com/ultralytics/yolov5/blob/master/test.py#L91
add

    model = model.to(memory_format=torch.channels_last)

https://github.com/ultralytics/yolov5/blob/master/test.py#L104
from

            inf_out, train_out = model(img, augment=augment)  # inference and training outputs

to

            inf_out, train_out = model(img.to(memory_format=torch.channels_last), augment=augment)  # inference and training outputs
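
For anyone who wants to try this outside the yolov5 codebase, here is a minimal self-contained sketch of the same idea (the toy model and tensor names are illustrative, not from the repo): both the model and the input batch are converted to channels_last, which lets cuDNN pick NHWC kernels on Tensor Core GPUs.

import torch
import torch.nn as nn

# Minimal channels_last sketch (illustrative, not the yolov5 code).
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).to(device)
model = model.to(memory_format=torch.channels_last)  # conv weights converted to NHWC layout

imgs = torch.randn(8, 3, 640, 640, device=device)
imgs = imgs.to(memory_format=torch.channels_last)    # strides change, logical shape stays NCHW

with torch.no_grad():
    out = model(imgs)
print(out.is_contiguous(memory_format=torch.channels_last))  # True: the layout propagates through convs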

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024 5

@WongKinYiu thanks for the comparison! Yes it is good to do an apples to apples comparison. YOLOv5 is just getting started, so I hope there is a lot of improvement we can look forward to.

One point though is that all 3 of these models were trained and tested with ultralytics repos, so the yolov4 here is not the same as the paper results, with the bag of specials, etc. It is the result of extensive and very difficult ultralytics development over the last year that allows ultralytics yolov4 to exceed the published yolov4 paper results. The yolov4 paper on arxiv shows 43.5 AP. No one will realize this, since ultralytics training is not mentioned in the yolov4 paper, only darknet training.

When yolov4 was published, the ultralytics/yolov3 43.1 AP was ignored; only the very old pjreddie AP of 33 was shown. So if we follow this example, then we should only include published benchmarks for comparison, such as efficientdet, FCOS, etc., and 43.5 AP from darknet yolov4.

Also for a full comparison, you may want to include yolov5m, as it will be better than all of these 3 models for all latencies < 5 ms.

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024 5

image

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024 4

@AlexeyAB yes I think you and @WongKinYiu may be right about the batch size. This is the first time I've made a latency plot actually, so I'm still learning. I can remake the plot with batch-size 1.

One detail that seems to be missed also is that EfficientDet speeds are cited using FP16 on V100, whereas the pytorch plots I made are for full FP32 inference (which is a bit slower, since it will not use the tensor cores available on the gpu). You can see this comment in their latency section, which is where I obtained their batch-8 speeds (you are right, they are not in the paper).
https://github.com/google/automl/tree/master/efficientdet#4-benchmark-model-latency

About the improvements, I usually think of yolov3-spp as a good example. When it was released it was benchmarked at about 36 AP. My best efforts over 2 years of hard work have brought it up to 45.6 AP using the same architecture (same anchors too), no changes.

So when considering yolo architecture updates past yolov3-spp, I would say extra AP above 45-46 should definitely be credited to architecture improvements, but I have a feeling this 'extra' AP we see may be much smaller than the +10 AP I've achieved from better training and loss. I think you can get a sense of this from @WongKinYiu's plot in #6 (comment), which really shows just how similar all the yolo models are these days when trained comparably. I suppose an alternative way of thinking about it is that the AP has been latent in the architecture all along, it just takes the right training to realize its potential.

Which brings me back to v5, which unfortunately has been causing so much controversy. I apologize to anyone that's been offended. The actual purpose of the models is not to achieve the best mAP, my goal was to produce the best compromise against a variety of competing interests, which I'm still working on, and hope to keep improving as the year goes on. The main goals are:

  • ease of use (user friendliness, 'out of the box' performance)
  • exportability (to onnx and then onward to coreml and tflite)
  • memory requirements (minimized for largest batch sizes for faster training and inference)
  • speed (both training speed, part of user friendliness, and inference speed, mobile too!)
  • mAP (improved mainly via augmentation and loss, i.e. mosaic, grid_scale overlaps, etc.)
  • market size - increased range of performance (like efficientdet) to increase practical user base

Because of this I've not used some of the advancements in yolov4, like mish, which produces better mAP but slows down training; and others, like PANet heads, I simply have not had time to implement yet.

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024 4

@glenn-jocher @AlexeyAB

Hello,
I tried channels_last in PyTorch.
It can make training and testing faster on GPUs with Tensor Cores.
And the AP drops only about 0.002% when evaluating a channels_first-trained model.

Just add code something like:

model = model.to(memory_format=torch.channels_last)
inf_out, train_out = model(img.to(memory_format=torch.channels_last), augment=augment)

reference

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024 3

@jinfagang thanks for the data point! The funny thing from @WongKinYiu's plots is really just how similar all the versions of yolo are (at least when they are trained on the same repo here).

Perhaps it means training methods and loss functions are becoming more important these days than architecture, since after all yolov3 used to be near 33 AP, and we've pulled it up to 45.6 now with no changes at all to the architecture.

CORRECTION: SPP architecture change is +3, as @AlexeyAB mentions below, so yolov3-spp started at around 36, not 33.

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024 3

@glenn-jocher @AlexeyAB

The new scaling method performs well in FP32 mode but not in FP16 mode,
while scaling depth and width performs well in both FP32 and FP16 modes.
image
For easier reading of the figure, I keep only the Pareto-optimal curve.
image
I will think about how to design a scaling method that suits both traditional and modern GPUs.

  • YOLOv4(CSP,Leaky) models are trained on ultralytics/yolov3, other models are trained on ultralytics/yolov5.
  • All models are trained with 640x640 input resolution.

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024 3

@glenn-jocher Hello,

Would you like to integrate mish_cuda into the repo?
It takes almost the same training time as the LeakyReLU model when the batch size is the same,
and brings a large benefit in AP. The following results are from training with the default settings.

| Model | Test Size | APval | AP50val | AP75val | APSval | APMval | APLval |
|---|---|---|---|---|---|---|---|
| YOLOv4(CSP,Leaky) | 672 | 47.2% | 65.9% | 51.7% | 29.8% | 52.3% | 61.5% |
| YOLOv4(CSP,Mish) | 672 | 48.1% | 66.8% | 52.6% | 31.9% | 53.3% | 61.0% |
| YOLOv4(CSP,Mish) | 544 | 47.1% | 65.6% | 51.0% | 29.0% | 52.2% | 63.6% |

By the way, MixUp can benefit AP by ~0.3%.
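
For reference, Mish itself is easy to express in plain PyTorch; the sketch below is a naive reference version, not the fused mish_cuda kernel under discussion (which exists precisely because this naive form is slower and more memory-hungry during training):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    # Naive Mish: x * tanh(softplus(x)).  Reference only; mish_cuda fuses this into one kernel.
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))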

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024 2

@WongKinYiu ah, yes you found a bug in test.py! I've pushed a fix for this in ee8988b, along with solving the datasets.py mystery.

It turns out the one line of code that letterboxed images to 64-multiples in datasets.py was helping the mAP by reducing edge effects on certain images. I've modified this line to pad to 32-multiples now, and forced a minimum of 0.5 grid cells (16 pixels) around each image by increasing the grid count by 1 (0.5 grid cells on each side). This allows testing yolov3-spp at --img 640, for example, with 0.455 mAP now!

The 16 pixel padding looks like this. test.py --img 640 will actually test at 672 now, and --img 672 will actually test at 704, etc., with 16 pixels of letterbox on each side. This allows a much better balance of speed vs. mAP than before.
image
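
A rough sketch of the arithmetic described above (an illustration, not necessarily the exact datasets.py code): pad each batch shape up to a 32-multiple and then add one extra grid cell, i.e. 16 pixels of letterbox per side.

import numpy as np

stride = 32
img_size = 640
shapes = np.array([[0.75, 1.0]])  # example normalized (h, w) aspect ratio, hypothetical value

# round up to a 32-multiple, plus one full grid cell (16 px of padding per side)
batch_shapes = np.ceil(np.array(shapes) * img_size / stride + 1).astype(int) * stride
print(batch_shapes)  # [[512 672]] -> a nominal --img 640 test actually runs at 672 on the long side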

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024 2

@glenn-jocher Thanks

I just tested the YOLOv4(CSP,Leaky) model with the same testing protocol; the results look like this.
image
I will update the information after my training finishes.

Also, did you test EfficientDet with batch size 32? Could you also help by providing the (GPU_latency, AP_val) data for EfficientDet?

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024 2

@WongKinYiu yes, I know the efficientdet values are at batch-size 8. They do not show information for larger batch sizes, probably because their models run out of GPU ram at larger batch-sizes. In the figure caption I clearly state our data conditions, batch-size 32.

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024 2

@glenn-jocher Thanks for updating many information.

#6 (comment)

I think it is better to remove EfficientDet from the figure if the models are run under different conditions.

#6 (comment)

I agree that large-batch inference is very important for cloud streaming inference, since model inference latency can usually be ignored compared with internet latency.
But batch-size 1 inference is still very important on GPU; a good example is the self-driving scenario, where the inference latency of the main stream should be less than 1/200 second.

#6 (comment)

I think EfficientDet can run at larger batch sizes because depth-wise convolution usually needs less memory for inference (but it needs huge memory for training).

#6 (comment)

Thanks for the update. It seems YOLOv4-608 (45.9%) currently gets comparable AP to YOLOv5l-736 (45.7%).

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024 2

@glenn-jocher @AlexeyAB Hello,

contribution of training and architecture is 50% / 50%

Yes, I think it is correct.
image

I changed the training scale of YOLOv4 from 512x512 to 640x640; the following are preliminary results.
So when YOLOv4 and YOLOv5 use the same training strategy, YOLOv4 can currently achieve the same AP with ~28% higher batch-32 FPS compared to YOLOv5.
image
But unfortunately my server crashed again, so this preliminary result is from epoch 220/300; I will restart training soon.

I suppose an alternative way of thinking about it is that the AP has been latent in the architecture all along, it just takes the right training to realize its potential.

I think this is also correct. CSP is designed to optimize the learning process for back-propagation training. We can also directly improve back-propagation to make models learn better.

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024 2

@glenn-jocher @AlexeyAB

ultralytics/yolov3 trained models get higher FPS, ultralytics/yolov5 trained models get higher AP.
image

  • YOLOv4(CSP,Leaky) models are trained on ultralytics/yolov3, other models are trained on ultralytics/yolov5.

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024 2

@Ammad53 and report your comparison results here! :)

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024 1

@WongKinYiu you may be referencing the in-house repo mAP calculation, which is always a little lower than the official pycocotools results, but is used for mAP plots during training because it is much faster than pycocotools.

The correct pycocotools val and test-dev server results are shown in the readme table. You can use the notebook to reproduce easily. yolov3-spp for example:
https://github.com/ultralytics/yolov5/blob/master/tutorial.ipynb

Screen Shot 2020-06-01 at 11 18 41 AM

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024 1

@WongKinYiu ah yes. I moved pycocotools into a try except clause, to avoid the problem where for example training completes but a pycocotools error would prevent the final model from being saved at the very end of the process etc.

So now if pycocotools fails for some reason it will only print a warning to screen and continue with the process. A 672 image size produces results almost as good as 736, by the way.

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024 1

@WongKinYiu yes, here are the results in the attached study.zip. The results are reproduced by this code. The entire study takes about 1-2 hours to run. study.zip

python test.py --task study

yolov5/test.py

Lines 264 to 275 in 3a5c532

elif opt.task == 'study':  # run over a range of settings and save/plot
    for weights in ['yolov5s.pt', 'yolov5m.pt', 'yolov5l.pt', 'yolov5x.pt']:
        f = 'study_%s_%s.txt' % (Path(opt.data).stem, Path(weights).stem)  # filename to save to
        x = list(range(288, 896, 64))  # x axis
        y = []  # y axis
        for i in x:  # img-size
            print('\nRunning %s point %s...' % (f, i))
            r, _, t = test(opt.data, weights, opt.batch_size, i, opt.conf_thres, opt.iou_thres, opt.save_json)
            y.append(r + t)  # results and times
        np.savetxt(f, y, fmt='%10.4g')  # save
    os.system('zip -r study.zip study_*.txt')
    # plot_study_txt(f, x)  # plot

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024 1

@WongKinYiu you might ask the efficientdet guys to provide speeds at higher batch-sizes if they can. I'm not sure it's possible though, i.e. google/automl#85

One of the main benefits of yolov5 are the large batch sizes that can be used, since the GPU memory requirements are very low. For example yolov5x can run at --batch-size 64 --img 640 with no problems, and yolov5s can run at up to --batch-size 256 --img 640 on certain rectangular images like HD video.

So yes, I can artificially limit ourselves to batch size 8 to compare to efficientdet, but I'm not sure if this makes sense to do when one of the primary benefits of yolov5 is larger batch sizes and lower gpu memory.

from yolov5.

lucasjinreal avatar lucasjinreal commented on April 30, 2024 1

I have trained yolov5 on my own dataset and got 87.6 AP@50, while the darknet version of yolov4 got 84 AP@50, so I think yolov5 surpassed the darknet version of yolov4 (both with 800x800 input, v5 in large mode).

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024 1

@glenn-jocher @AlexeyAB

I updated all models to use the same NMS as the latest version of yolov5.
This helps in understanding the performance of the architectures.
image
image
For easier reading of the figure, I keep only the latest version of yolov5 for comparison.
image
image

  • YOLOv4(CSP,Leaky) models are trained on ultralytics/yolov3, other models are trained on ultralytics/yolov5.

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024 1

@AlexeyAB

  • Are these 2 yellow curves: YOLOv4(CSP,Leaky) and YOLOv4s(CSP,Leaky)?

Yes, they are YOLOv4(CSP,Leaky) and YOLOv4s(CSP,Leaky), which are models shown in #6 (comment).

  • does model with new NMS work +10% faster with the same accuracy?

No, new NMS reduces 0.1~0.2% AP and reduces 0.3~0.5 ms NMS time.

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024 1

Of course, I just replaced the function without changing the rest of the code.

from yolov5.

BipinJG avatar BipinJG commented on April 30, 2024 1

@WongKinYiu
Can you please also include speed vs. mAP results and compare YOLOv4_pacsp with the original YOLOv4 implementation?

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024 1

@WongKinYiu I see. I've tested 20.06 and 20.07 but for some reason they train about 10% slower than 20.03 with pip install pytorch 1.6, so I've left the Dockerfile at 20.03 for now.

Yes, I don't actually know if the overhead is small. In theory torch.interpolate should be able to create the output, and the gradient could be defined by another torch.interpolate; autograd seems to operate with very little overhead when the upsample op is set to linear instead of nearest. Maybe I'll try this if I find some time.

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024 1

@WongKinYiu I tried to start this and realized that there really is no 1d interpolation function in pytorch! I had assumed there was something like https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.interp1d.html that would work on GPU with autograd, but there is nothing like that surprisingly.

This means I'd have to write my own interpolation function, which isn't too difficult, but may not be well optimized either, so could be slow in practice. The function I had in mind is basically this. The interpolate and diff functions need defining.

class LMish(nn.Module):
    def __init__(self):
        super().__init__()
        m = Mish()
        self.x = torch.linspace(-10, 10, 100)  # lookup-table bin centers
        self.y = m(self.x)                     # Mish values at the bins
        self.dy = diff(self.x, self.y)         # slope between bins, for the backward pass
        # import matplotlib.pyplot as plt
        # plt.plot(self.x, self.y, '.')

    def forward(self, x):
        return interpolate(self.x, self.y, x, mode='nearest')

    def backward(self, x):
        return interpolate(self.x, self.dy, x, mode='nearest')
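
As a rough standalone illustration of the lookup-table idea (my own sketch with hypothetical names and bin counts; it covers only the forward pass, and a trainable version would still need a custom autograd.Function that uses the slope table for backward):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TableMish(nn.Module):
    # Nearest-bin lookup-table approximation of Mish (forward only).
    def __init__(self, lo=-10.0, hi=10.0, bins=1000):
        super().__init__()
        self.lo, self.hi = lo, hi
        xs = torch.linspace(lo, hi, bins)
        self.register_buffer('xs', xs)
        self.register_buffer('ys', xs * torch.tanh(F.softplus(xs)))  # exact Mish at the bin centers

    def forward(self, x):
        idx = torch.bucketize(x.clamp(self.lo, self.hi), self.xs)  # nearest bin index
        idx = idx.clamp(max=self.ys.numel() - 1)
        return self.ys[idx]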

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024

Thanks.

Yes, I got the results from the in-house repo mAP calculation.

I just checked the code and found the reason: my coco path is different from the default path in test.py, cocoGt = COCO(glob.glob('../coco/annotations/instances_val*.json')[0]).
https://github.com/ultralytics/yolov5/blob/master/test.py#L207
So there were no results calculated by pycocotools in my testing.

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024

@glenn-jocher

Due to the setting in datasets.py, 672/736 will actually use 704/768 for testing.
Why don't you directly use 704/768 for producing the results?
self.batch_shapes = np.ceil(np.array(shapes) * img_size / 64.).astype(np.int) * 64
https://github.com/ultralytics/yolov5/blob/master/utils/datasets.py#L325

By the way, there seems to be no corresponding processing for the test task in test.py.
https://github.com/ultralytics/yolov5/blob/master/test.py#L246-L268

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024

@glenn-jocher great!

And one more question: I would like to know how much AP improvement you got from changing
wh = torch.exp(p[:, 2:4]) * anchor_wh
to
y[..., 2:4] = (y[..., 2:4].sigmoid() * 2) ** 2 * self.anchor_grid[i]?

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024

@WongKinYiu ah yes. This change is not about mAP improvement; it is about training stability on custom datasets. Too many users were reporting unstable/NaN width-height loss, I suspect due to exp(), so I removed it and replaced it with a sigmoid-based term that ranges from 0 to 4x the anchor instead. This also simplifies inference, as the entire output passes through sigmoid() now, removing a bit of tensor slicing.

But to answer your question, I don't have an exact number attributable to this change, as it was one of many changes that combined to produce the overall improvement.
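
To make the difference concrete, here is a small standalone illustration of the two decodings (p and anchor_wh here are hypothetical stand-ins for the raw wh outputs and one anchor, not the repo's variables): the exp() form is unbounded, while the sigmoid form is bounded to 0-4x the anchor size, which is what stabilizes the width-height loss.

import torch

p = torch.randn(4, 2)                        # raw network outputs for the wh channels (hypothetical)
anchor_wh = torch.tensor([10.0, 13.0])       # one anchor's width/height (hypothetical)

wh_exp = torch.exp(p) * anchor_wh            # unbounded: a large p overflows and produces NaN losses
wh_sig = (p.sigmoid() * 2) ** 2 * anchor_wh  # bounded to the range 0..4x the anchor size
print(wh_exp.max(), wh_sig.max())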

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024

@glenn-jocher OK, thank you very much.

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024

@glenn-jocher

I just finished training yolov3-spp_csp; the following are results comparing yolov3-spp and yolov3-spp_csp using python test.py --img-size 736 --conf 0.001.

yolov3-spp: 45.5% AP, 65.1% AP50, 49.6% AP75.
Model Summary: 225 layers, 6.29987e+07 parameters, 6.29987e+07 gradients
Speed: 10.4/2.1/12.6 ms inference/NMS/total per 736x736 image at batch-size 16
yolov3-spp_csp: 45.6% AP, 65.4% AP50, 49.7% AP75
Model Summary: 275 layers, 4.90092e+07 parameters, 4.90092e+07 gradients
Speed: 9.1/2.0/11.1 ms inference/NMS/total per 736x736 image at batch-size 16

It seems YOLOv4-based models outperform YOLOv3-based and YOLOv5-based models: yolov3-spp_csp gets 12.5% faster model inference and 0.1/0.3/0.1% higher AP/AP50/AP75 than yolov3-spp. Do you plan to implement other YOLOv4-based models in this repository?

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024

@WongKinYiu also note I did not replace the first bottleneck with csp, because this slowed down training and inference substantially (about 10%) and increased CUDA memory due to the large P2 grid size, while resulting in negligible params/FLOPs savings. There are 3 problems that need resolving now after this upcoming update:

  • Architecture: We should try PANet or BiFPN heads. I have not had time to do this.
  • Training. Overfitting is a serious issue, especially for objectness. Overfitting starts earlier, and is more severe, for the larger models.
  • Scaling. I've been scaling depth by 1/3 and width by 1/4 for each new model (see the sketch after this list). I'm not sure if there is a better way. The gains seem to be diminishing past yolov5l: the gains from s to x are +6.9, +3.3, +1.3 (maybe). This is a bad trend. Ideally we want a scaling strategy that keeps increasing mAP by +3 going from m to l to x and higher.
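
A rough sketch of what that per-model scaling looks like in code (my own illustration using the depth_multiple/width_multiple convention from the yaml above; the m/l/x multiples shown are the assumed 1/3 depth and 1/4 width increments, not values confirmed in this thread):

import math

base_depth, base_width = 9, 512  # repeats and channels of a reference stage (from the yaml above)
for name, dm, wm in [('s', 0.33, 0.50), ('m', 0.67, 0.75), ('l', 1.00, 1.00), ('x', 1.33, 1.25)]:
    n = max(round(base_depth * dm), 1)       # scaled repeat count
    c = math.ceil(base_width * wm / 8) * 8   # scaled channel count, kept divisible by 8
    print(f'yolov5{name}: {n} repeats, {c} channels')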

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024

@glenn-jocher

If there are no accidents, I will get results for the ALL-CSP model (CSP applied to both backbone and neck) next week. It already shows that it is more efficient in both inference speed and parameter count.

yolov3-spp 43.6%
Model Summary: 152 layers, 6.29719e+07 parameters, 6.29719e+07 gradients
Speed: 6.8/1.6/8.3 ms inference/NMS/total per 608x608 image at batch-size 16
cd53s-yocsp 43.9%
Model Summary: 194 layers, 4.3134e+07 parameters, 4.3134e+07 gradients
Speed: 6.0/1.6/7.6 ms inference/NMS/total per 608x608 image at batch-size 16
cd53s-pacsp 45.0%
Model Summary: 230 layers, 5.28891e+07 parameters, 5.28891e+07 gradients
Speed: 6.5/1.7/8.2 ms inference/NMS/total per 608x608 image at batch-size 16
  • Architecture: We should try PANet or BiFPN heads. I have not had time to do this.

I will upload the ALL-CSP models to GitHub after training finishes. Here is the cfg file of cd53s-pacsp; you can take a look at it if you would like to implement it.

  • Training. Overfitting is a serious issue, especially for objectness. Overfitting starts earlier, and is more severe, for the larger models.

Yes, I also find my model starts overfitting at ~230 epochs.

  • Scaling. I've been scaling depth by 1/3 and width by 1/4 for each new model. I'm not sure if there is a better way. The gains seem to be diminishing past yolov5l. The gains from s to x are: +6.9, +3.3, +1.3 (maybe). This is a bad trend. Ideally we want a scaling strategy to keep increasing mAP by +3 going from m to l to x and higher.

I am designing a new scaling strategy based on CSPNet and will deliver it if it succeeds.

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024

That's great news! Yes, looking at your results there is a big jump with the PANet head. I will try to use cd53s-pacsp to implement the head this week.

A good example of overfitting is my latest yolov5l run, plotted here from epoch 50 to 280. yolov5s, in comparison, does not really overfit until right before 300, so it is affecting larger models more than smaller models.
results

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024

@WongKinYiu yolov5x topped out at 47.2 and started overfitting, so I will update the readme shortly, after I recreate the plots. I've put this for the update. Are there any changes you'd like me to make?

  • June 9, 2020: CSP updates to all YOLOv5 models. New models are faster, smaller and more accurate. Credit to @WongKinYiu for his excellent work with CSP.

Here is a closeup of the results. You can see the overtraining progression from s to x:
results

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024

@glenn-jocher Hello,

Could you provide (GPU_latency, AP_val) data of points in the figure?
image
I would like to add YOLOv4 to make a comparison after I finish training my models.

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024

@WongKinYiu ah, very interesting. Are these new models trained with ultralytics repository or a different one?

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024

@glenn-jocher

Yes, I just replaced coco14.data with coco17.data to train the model from scratch in #6 (comment).
The performance improves by ~0.5% after removing iscrowd=1 bounding boxes.

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024

@WongKinYiu ah ok. You should probably plot over the full range of image sizes to get the same comparison:

list(range(288, 896, 64))
Out[5]: [288, 352, 416, 480, 544, 608, 672, 736, 800, 864]

Efficientdet values from their tables here:
https://github.com/google/automl/tree/master/efficientdet

Hmm, I will look at iscrowd again, not sure if I am using them or not.

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024

@WongKinYiu it may make sense to label ultralytics-trained yolov4 models to differentiate them from darknet-trained yolov4 models also, because I believe there is a performance difference.

There is also an inference difference, most notably in that the ultralytics repos do not use any of the special tricks mentioned in the yolov4 paper, which supposedly make yolov4 what it is, so I suspect that the training and inference both taking place on ultralytics repos is important to the story.

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024

@glenn-jocher OK,

So in the figure, EfficientDet models use batch size 8, and YOLOv5 models use batch size 32.

I download coco17 using https://github.com/ultralytics/yolov5/blob/master/data/get_coco2017.sh.

yes, darknet-trained yolov4 gets better AP50 while ultralytics-trained yolov4 gets better AP.

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024

@WongKinYiu batch-size 32 throughput is important for cloud video streaming inference. We have customers that use one 2080ti GPU with 16 simultaneous RTSP feeds running inference in parallel for example, so this number is a direct value of the throughput capability a model can provide them with a single GPU.

Batch-size 1 FPS is not really a useful value for a GPU, as it will underutilize the resource tremendously. It is more informative for example to display this in iDetection, which we plan on doing when we have some time.

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024

@WongKinYiu, I've updated the study results to include yolov3-spp, and also to update yolov5m, which was reporting results from an older training before (new results are +0.3mAP to match correct table).
study_mAP_latency

The updated study results are here: study2.zip

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024

@glenn-jocher

image

  • YOLOv3-SPP & YOLOv5l: trained with resolution 640 +50% -50%
  • YOLOv4: trained with resolution 512 +50% -33%

from yolov5.

lucasjinreal avatar lucasjinreal commented on April 30, 2024

@glenn-jocher Even so, I am still wondering how many parts differ between v4 and v5, including model architecture, head design, loss, training hyperparameters, and data augmentation?

from yolov5.

AlexeyAB avatar AlexeyAB commented on April 30, 2024

@glenn-jocher Hi,

Perhaps it means training methods and loss functions are becoming more important these days than architecture, since after all yolov3 used to be near 33 AP, and we've pulled it up to 45.6 now with no changes at all to the architecture.

Yes, BoF improves accuracy for free, while BoS/Architecture improves accuracy but usually slightly decreases speed.
In our tests, a very approximate contribution of each feature to accuracy AP (it depends on the order in which the features are added):

  • SPP: +3%
  • CSP+PAN: +2%
  • SAM: +0.3%
  • CIoU+S: +1.5%
  • Mosaic+Hyperparam: +2%
  • Scaled Anchors: +1%
  • In total: +~10% = 5% Architecture + 5% BoF(Loss/HyperParams/data augmentation/Anchors)

So it is 50% / 50%


What do you think, maybe it's worth comparing only networks with the same batch size, instead of comparing networks with batch = 1, 8 and 32 on the same chart?

Why do you think it is worth measuring the latency with batch = 32 and then dividing it by 32, when the latency of one sample in the batch cannot be less than the latency of the whole batch?

Such a comparison may do more harm than good for your repository. The fact that the repository supports ONNX and can process 140 FPS on a smartphone is already great, even without unfair comparisons.

then we should only include published benchmarks for comparison, such as efficientdet, FCOS etc. and 43.5 AP from darknet yolov4.

But even in the latest revision of the EfficientDet paper, accuracy/speed results are published only for batch = 1; there is no batch = 8 result in the paper: https://arxiv.org/abs/1911.09070

Table 2: EfficientDet performance on COCO [23] – Results are for single-model single-scale. test-dev is the COCO test set and val is the validation set. Params and FLOPs denote the number of parameters and multiply-adds. Latency denotes inference latency with batch size 1

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024

@WongKinYiu I don't believe your overlay is correct. You cannot show the same model trained at a higher resolution and inferenced at a higher resolution (your new point), and then show the same model trained at a lower resolution in the same color.

Higher resolution training will lead to worse results at lower resolutions, so your simple overlay of a new point is deceptive.

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024

You are also still not noting that all of the models shown are trained with ultralytics repos.

People who just glance at your image will think they can head over to darknet and train yolov4 to these results because you have not labelled them as yolov4 trained with ultralytics/yolov3.

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024

@WongKinYiu lastly, I should note that as I've said before, yolov5 development is just getting started, we aim to further evolve, and we will update the readme throughout the year as new results come in, and publish at least a short synopsis by the end of the year. Thank you.

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024

@glenn-jocher thanks for pointing out these concerns.

I don't believe your overlay is correct. You can not show the same model trained at a higher resolution inferenced at a higher resolution (your new point), and then show the same model trained at a lower resolution in the same color.

I only tested one resolution because it is just a preliminary result while the model has not finished training.

You are also still not noting that all of the models shown are trained with ultralytics repos.

In all of my reported results, I mentioned that the results are produced by the pytorch/ultralytics version, and also marked that ultralytics-trained models can get higher AP.
image
image
image

lastly, I should note that as I've said before, yolov5 development is just getting started, we aim to further evolve, and we will update the readme throughout the year as new results come in, and publish at least a short synopsis by the end of the year. Thank you.

Yes, your goal in #6 (comment) is very clear and inspires people doing object detection research. I posted these results in the hope that your repository can become even stronger (#6 (comment)).
image

To avoid misunderstanding, I remade the figure following your suggestion and added the note directly on the figure instead of in the text.
image

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024

@glenn-jocher @AlexeyAB

Sorry for the late update; I just finished training the YOLOv4(CSP,Leaky) model.
image
image

  • YOLOv4(CSP,Leaky) is trained on ultralytics/yolov3, other models are trained on ultralytics/yolov5.
  • All models are trained with 640x640 input resolution.
  • Small models run faster than big models in FP16 mode at low input resolutions because Tensor Core utilization is limited.
  • The Focus layer benefits FPS at large input resolutions.

Because I have only a few GPUs, updates will be slow; the following is my schedule.

  1. Now training the newly designed scaling method of YOLOv4(CSP,Leaky), based on ultralytics/yolov3.
  2. Starting training of YOLOv4(CSP,Leaky) based on ultralytics/yolov5 today.

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024

@WongKinYiu thanks for the updates!

Yes, I designed the Focus() module specifically for v5 for this :)

It's working well in pytorch, though it is making coreml export more difficult for me presently.

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024

@WongKinYiu if you are starting to try training with the v5 repo, everything should be fine as long as you can write a good yolov4.yaml file. There's an issue #144 open on the topic, and a PR as well #153, though the PR does not accurately recreate the correct yolov4 architecture I believe.

You should also be careful with your anchor orders. If you see high GIoU losses initially this could be an indicator that your anchor order is backwards. I've written a simple function that should check the anchor orders and update them if necessary in fc171e2.

def check_anchor_order(m):
    # Check anchor order against stride order for YOLOv5 Detect() module m, and correct if necessary
    a = m.anchor_grid.prod(-1).view(-1)  # anchor area
    da = a[-1] - a[0]  # delta a
    ds = m.stride[-1] - m.stride[0]  # delta s
    if da.sign() != ds.sign():  # order mismatch: flip anchors to match stride order
        m.anchors[:] = m.anchors.flip(0)
        m.anchor_grid[:] = m.anchor_grid.flip(0)

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024

@glenn-jocher Hello,

Thanks for your advice. I created three kinds of new layers based on ultralytics/yolov5. I think only some BN and activation layers differ from the model implemented on ultralytics/yolov3, and I used routing indices to avoid the anchor-order problem.
Currently the learning curve is better than yolov5l's. If there is no problem with the final results, I will share the code.

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024

@glenn-jocher Hello,

I saw your PANet update; could you provide your study.txt files?
Thanks.

from yolov5.

AlexeyAB avatar AlexeyAB commented on April 30, 2024

@WongKinYiu

  • Are these 2 yellow curves: YOLOv4(CSP,Leaky) and YOLOv4s(CSP,Leaky)?
  • does model with new NMS work +10% faster with the same accuracy?

from yolov5.

AlexeyAB avatar AlexeyAB commented on April 30, 2024

@WongKinYiu @glenn-jocher What slows down the speed in ultralytics/yolov5, and can it be solved? Or is it a feature that improves accuracy?

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024

@AlexeyAB

Models trained by ultralytics/yolov5 generate more bboxes, so NMS usually needs more processing time (larger input sizes need longer NMS time).
Also, yes, this is the reason ultralytics/yolov5-trained models can get higher AP.

from yolov5.

jundengdeng avatar jundengdeng commented on April 30, 2024

@glenn-jocher @AlexeyAB

ultralytics/yolov3 trained models get higher FPS, ultralytics/yolov5 trained models get higher AP.
image

  • YOLOv4(CSP,Leaky) models are trained on ultralytics/yolov3, other models are trained on ultralytics/yolov5.

Thanks for sharing the comparison figure! Can you please share the full commands to reproduce the comparison?

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024

For ultralytics/yolov5-trained models, use python test.py --batch-size 8 --data coco.yaml --task study.
For ultralytics/yolov3-trained models, you should modify utils.py to use the new NMS, modify test.py to enable FP16, and test each input resolution following the readme.

from yolov5.

jundengdeng avatar jundengdeng commented on April 30, 2024


Thanks for the response. It's mentioned that the new NMS should be used with ultralytics/yolov3. Could you please confirm whether it is correct to use the NMS from yolov5 with ultralytics/yolov3-trained models? Thanks again!

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024

@AlexeyAB @WongKinYiu I thought you two might be interested in this. I've never done a Kaggle competition myself, but I was recently made aware that nearly all of the highest ranking teams in a recent competition are using YOLOv5 to top the leaderboard :)
https://www.kaggle.com/c/global-wheat-detection

@glenn-jocher : thanks for your quick feedback.
To be fully clear, your last version of yolo is performing better than any other model from frcnn, efficientdet, etc in the wheat competition on Kaggle but the organizers are likely to disregard any solution involving a GPL licence like using yolov5. This is a real shame given the performance and how fast it is to train/inference, how easy it is to use it and adapt it to custom cases.

Originally posted by @chabir in #317 (comment)

from yolov5.

AlexeyAB avatar AlexeyAB commented on April 30, 2024

@glenn-jocher Yes, I got issues about this Kaggle challenge (global-wheat-detection), they can't compile Darknet-YOLO due to broken libcuda.so in their global-wheat-detection-Kaggle-Notebook, they can't upload compiled Darknet-YOLO due to Permission denied for all uploaded files, and can't use Ultralytics-YOLO due to GPL-license. So it looks like an extremely poorly organized kaggle challenge for students.

But I'm more interested in real projects of large companies:

...

And much larger private projects.


Also interesting YOLOv4 extension for RGB+Lidar -> 3D-rotated-bboxes: https://github.com/maudzung/Complex-YOLOv4-Pytorch

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024

@glenn-jocher Hello,

Thanks for the information. If you are interested in comparing results with yolov4, you could integrate the code into your repo.
I added some new layers, but I ignored some input arguments of the functions, so you may need to clean up some of the code.

| Model | Test Size | APval | AP50val | AP75val | APSval | APMval | APLval |
|---|---|---|---|---|---|---|---|
| YOLOv4pacsp-s | 736 | 38.9% | 58.0% | 42.1% | 22.3% | 44.0% | 49.3% |
| YOLOv4pacsp | 736 | 46.9% | 66.0% | 51.2% | 29.7% | 52.7% | 59.6% |
| YOLOv4pacsp-x | 736 | 48.6% | 67.3% | 53.2% | 32.1% | 54.0% | 62.2% |

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024

@WongKinYiu yes, all PRs are welcome!! :)

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024

@WongKinYiu just checked this out again; I see this is the repo that requires building custom CUDA code from source. Unfortunately I don't think that's an option for most people, as we have a lot of users training on Colab, Kaggle, etc. I think the only way we might be able to use it is if the author pushed a docker image that includes his custom build that we could start 'FROM', but this would also be a serious maintenance problem, as base torch images are pushed every month by nvidia.

FROM nvcr.io/nvidia/pytorch:20.03-py3

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024

@WongKinYiu I had an idea here. When I used to do physics simulations I used a package from CERN called GEANT. GEANT managed to recreate life-like behaviors of particles from quite simple lookup tables. In some cases I was surprised to see that complex transfer functions were represented by a simple 10 or 100 bin lookup table. Swish and Mish are continuous, but I think they might be easily represented in the same way by a lookup table, say 10, 100 or even 1000 bins from -10 to 10, extrapolating values past those limits and incorporating the gradient information a priori as the simple slope of the data between bins.

I think this might achieve perhaps 90% of the gains of Swish and Mish with very little overhead, and may still be executable as an in-place operation. Do you know if the author has tried something like this? If you have time it might be worth a shot.

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024

@glenn-jocher Yes, it works directly with nvcr.io/nvidia/pytorch:20.03-py3 but not with nvcr.io/nvidia/pytorch:20.06-py3 when using pip install.

from yolov5.

WongKinYiu avatar WongKinYiu commented on April 30, 2024

@glenn-jocher I do not know whether the author has tried something like this. I think using a line search to find the gradient is a similar idea; it can always be an in-place operation, but I cannot guarantee the overhead is small.

from yolov5.

Ammad53 avatar Ammad53 commented on April 30, 2024

I am so confused: which version should I use for object detection (insulators) in a power grid for my project?
YOLOv4 vs. YOLOv5?

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024

@Ammad53 try both, see which works better for you.

from yolov5.

bonlime avatar bonlime commented on April 30, 2024

@glenn-jocher I saw your discussion about building faster swish/mish and want to comment.

  1. There is no need to build custom CUDA kernels for mish/swish. They can be implemented in pure PyTorch, using a custom autograd Function to be memory efficient and jit.script to be faster, like this (a sketch of the memory-efficient version follows after this list). This version is not traceable/exportable, so you need a naive swish/mish implementation to be able to switch.
  2. MobileNetV3 introduced the Hard Swish approximation of Swish, and it has already made it into PyTorch master (in version 1.6); check F.hardswish. It is almost identical to ReLU in terms of speed but still gives a boost in performance.
  3. An efficient native implementation of Swish is also coming to PyTorch (it is already available in the nightly build, by the way). They call it SiLU because it appeared under that name first, not Swish.
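
A minimal sketch of the memory-efficient pure-PyTorch approach from point 1 (my own version assuming the standard Swish/SiLU definition x * sigmoid(x), not necessarily the implementation bonlime links to): the backward pass recomputes sigmoid(x) instead of keeping the activation output alive.

import torch
import torch.nn as nn

class SwishFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * torch.sigmoid(x)

    @staticmethod
    def backward(ctx, grad_out):
        x, = ctx.saved_tensors
        sx = torch.sigmoid(x)  # recomputed here rather than stored during forward
        return grad_out * (sx * (1 + x * (1 - sx)))

class MemoryEfficientSwish(nn.Module):
    def forward(self, x):
        return SwishFn.apply(x)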

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024

@bonlime thanks for the tips! In the course of my research above I experimented a bit with all 3: swish/SiLU, hardswish and Mish. hardswish from MobileNetV3 seems to be a good balance of improved performance with minimal penalties, though it may actually be less exportable to coreml than normal swish.

I'm training a few models with updated activations now. I believe we should be able to release a new set of v3.0 models around mid-late August with updated activations.

I don't like the Swish branding either by the way. It's typical corporate modus operandi to take previous research and brand and resell it, api it, or cloud service it.

One odd item is that I see you have a hardswish inplace option. I thought it was not possible for any x * f(x) operation to be executed in place. The hardsigmoid, yes, but if x is modified in place by the hardsigmoid op, how would one still have the original x to multiply it against?

from yolov5.

bonlime avatar bonlime commented on April 30, 2024

The inplace version is only for compatibility with the rest of the codebase. It's not really in place, it only accepts this arg.

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024

@bonlime ah I see, ok.

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024

@WongKinYiu wow, that's a super speedup!! Even more impressive if it can reduce CUDA memory. That's an amazing discovery you made there!!

I didn't know about this option. Would it make sense to convert the imgs to channels_last at the same time they are converted to pytorch tensors (L93 here for example)?

yolov5/test.py

Lines 92 to 93 in 4fb8cb3

for batch_i, (img, targets, paths, shapes) in enumerate(tqdm(dataloader, desc=s)):
img = img.to(device, non_blocking=True)

EDIT: hmm here they are already tensors, must have been converted in the dataloader here:

return torch.from_numpy(img), labels_out, self.img_files[index], shapes

So checkpointing the channels_last model and then loading it for inference is OK?

from yolov5.

glenn-jocher avatar glenn-jocher commented on April 30, 2024

Looking at the pytorch channels_last examples, it seems .contiguous() plays a role in reversing the format. I wonder what this means for the .contiguous() op in the yolo forward method. I have this in place because it speeds up training a bit vs not having it.

yolov5/models/yolo.py

Lines 35 to 42 in 4fb8cb3

def forward(self, x):
    # x = x.copy()  # for profiling
    z = []  # inference output
    self.training |= self.export
    for i in range(self.nl):
        x[i] = self.m[i](x[i])  # conv
        bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
        x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

The train and test dataloader also passes the images through a numpy contiguous op. I don't remember if removing this causes an error or if I put it there simply for speed:

yolov5/utils/datasets.py

Lines 557 to 559 in 4fb8cb3

# Convert
img = img[:, :, ::-1].transpose(2, 0, 1) # BGR to RGB, to 3x416x416
img = np.ascontiguousarray(img)
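
A quick standalone check of the interaction in question (a sketch, not the repo code): a plain .contiguous() call converts a channels_last tensor back to the default contiguous (NCHW) layout, whereas .contiguous(memory_format=torch.channels_last) would preserve it.

import torch

x = torch.randn(1, 3, 8, 8).to(memory_format=torch.channels_last)
print(x.is_contiguous(memory_format=torch.channels_last))   # True
y = x.contiguous()                                           # defaults to torch.contiguous_format
print(y.is_contiguous(memory_format=torch.channels_last))   # False: back to the NCHW layout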

from yolov5.

rafale77 avatar rafale77 commented on April 30, 2024

@glenn-jocher Hello,

Would you like to integrate mish_cuda into the repo?
It takes almost the same training time as the LeakyReLU model when the batch size is the same,
and brings a large benefit in AP. The following results are from training with the default settings.

| Model | Test Size | APval | AP50val | AP75val | APSval | APMval | APLval |
|---|---|---|---|---|---|---|---|
| YOLOv4(CSP,Leaky) | 672 | 47.2% | 65.9% | 51.7% | 29.8% | 52.3% | 61.5% |
| YOLOv4(CSP,Mish) | 672 | 48.1% | 66.8% | 52.6% | 31.9% | 53.3% | 61.0% |
| YOLOv4(CSP,Mish) | 544 | 47.1% | 65.6% | 51.0% | 29.0% | 52.2% | 63.6% |
By the way, MixUp can benefit AP by ~0.3%.

@WongKinYiu, just wondering how your x and l mish models compare on the graph with yolo V5?

I have re-implemented this within a Home Assistant project, inferring at a fixed framerate across multiple (6) camera streams, and was able to compare against the OpenCV implementation of the YOLOv4 darknet model:

Just comparing the resources consumed on my setup with an i7 6700K and RTX 2070:

opencv/darknet yoloV4 608: CPU ~30% -- RAM ~ 7GB -- GPU 8% -- VRAM 3GB
pytorch yoloV4l-mish 672: CPU ~ 53% -- RAM <0.5GB -- GPU 9% -- VRAM 2GB
pytorch yolov5l 672: CPU ~50% -- RAM < 0.5GB -- GPU 7% -- VRAM 1.8GB
pytorch yolov4x-mish 672: CPU ~60% -- RAM <0.5GB -- GPU 15% -- VRAM 2.5GB

from yolov5.

github-actions avatar github-actions commented on April 30, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

from yolov5.

RostamDin avatar RostamDin commented on April 30, 2024

How can I avoid the OpenCV DNN speed drop while using yolov5 at the same time?

I want to detect people in images with YOLOv5 and their age with OpenCV DNN. Age detection speed is fine before I load YOLOv5 in my code; loading the YOLOv5 model causes the processing speed to drop to less than half, even though I have not started using the YOLOv5 model to detect anything yet.

import numpy as np
import time
import torch
import os
import cv2  # missing in the original snippet; needed for cv2.dnn and cv2.imshow

# below line drops the process speed
detectObjectsModel = torch.hub.load('ultralytics/yolov5', 'yolov5s')

age_proto = os.path.join(".\\models", 'age_deploy.prototxt')
age_model = os.path.join(".\\models", 'age_net.caffemodel')
net = cv2.dnn.readNet(age_model, age_proto)
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

frame = np.random.randint(255, size=(416, 416, 3), dtype=np.uint8)  # put your image here!
cv2.imshow("frame", frame)
blob = cv2.dnn.blobFromImage(frame, 0.00392, (227, 227), [0, 0, 0], True, False)

start = time.time()
for i in range(100):
    net.setInput(blob)
    detections = net.forward(net.getUnconnectedOutLayersNames())
end = time.time()

ms_per_image = (end - start) * 1000 / 100

from yolov5.
