Hello, I am working on my senior design project that involves object detection. I'm ha

Custom object detection model training keeps failing about jetson-inference HOT 11 OPEN

olutsiv commented on May 28, 2024

Custom object detection model training keeps failing

from jetson-inference.

Comments (11)

olutsiv commented on May 28, 2024

I get this error when I include only 60 pictures in the data. For testing, I put the same 60 picture IDs in train.txt, val.txt, test.txt, and trainval.txt; I know the data should be split roughly 80/10/10. I'm not sure why I get a different error when there are fewer pictures, but my final project will have 4 classes and roughly 6-8k images. It's worrying me that I can't even get 1 class with 1300 images to work.

python3 train_ssd.py --dataset-type=voc --data=data/ambulance1 --model-dir=models/ambulance8 --batch-size=4 --workers=2 --epochs=1
2024-03-09 00:54:19 - Using CUDA...
2024-03-09 00:54:19 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=4, checkpoint_folder='models/ambulance8', dataset_type='voc', datasets=['data/ambulance1'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, log_level='info', lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resolution=300, resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, validation_mean_ap=False, weight_decay=0.0005)
2024-03-09 00:54:31 - model resolution 300x300
2024-03-09 00:54:31 - SSDSpec(feature_map_size=19, shrinkage=16, box_sizes=SSDBoxSizes(min=60, max=105), aspect_ratios=[2, 3])
2024-03-09 00:54:31 - SSDSpec(feature_map_size=10, shrinkage=32, box_sizes=SSDBoxSizes(min=105, max=150), aspect_ratios=[2, 3])
2024-03-09 00:54:31 - SSDSpec(feature_map_size=5, shrinkage=64, box_sizes=SSDBoxSizes(min=150, max=195), aspect_ratios=[2, 3])
2024-03-09 00:54:31 - SSDSpec(feature_map_size=3, shrinkage=100, box_sizes=SSDBoxSizes(min=195, max=240), aspect_ratios=[2, 3])
2024-03-09 00:54:31 - SSDSpec(feature_map_size=2, shrinkage=150, box_sizes=SSDBoxSizes(min=240, max=285), aspect_ratios=[2, 3])
2024-03-09 00:54:31 - SSDSpec(feature_map_size=1, shrinkage=300, box_sizes=SSDBoxSizes(min=285, max=330), aspect_ratios=[2, 3])
2024-03-09 00:54:31 - Prepare training datasets.
2024-03-09 00:54:31 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-09 00:54:31 - Stored labels into file models/ambulance8/labels.txt.
2024-03-09 00:54:31 - Train dataset size: 60
2024-03-09 00:54:31 - Prepare Validation datasets.
2024-03-09 00:54:31 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-09 00:54:31 - Validation dataset size: 60
2024-03-09 00:54:31 - Build network.
2024-03-09 00:54:31 - Init from pretrained SSD models/mobilenet-v1-ssd-mp-0_675.pth
2024-03-09 00:54:32 - Took 0.69 seconds to load the model.
2024-03-09 00:54:32 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2024-03-09 00:54:32 - Uses CosineAnnealingLR scheduler.
2024-03-09 00:54:32 - Start training from epoch 0.
/usr/local/lib/python3.8/dist-packages/Pillow-9.5.0-py3.8-linux-aarch64.egg/PIL/Image.py:992: UserWarning: Palette images with Transparency expressed in bytes should be converted to RGBA images
warnings.warn(
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/usr/lib/python3.8/multiprocessing/queues.py", line 107, in get
if not self._poll(timeout):
File "/usr/lib/python3.8/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/usr/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
r = wait([self], timeout)
File "/usr/lib/python3.8/multiprocessing/connection.py", line 931, in wait
ready = selector.select(timeout)
File "/usr/lib/python3.8/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 252) is killed by signal: Killed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "train_ssd.py", line 406, in <module>
train(train_loader, net, criterion, optimizer, device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
File "train_ssd.py", line 139, in train
for i, data in enumerate(loader):
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 634, in __next__
data = self._next_data()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
idx, data = self._get_data()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1295, in _get_data
success, data = self._try_get_data()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1146, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 252) exited unexpectedly
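
On the 80/10/10 split mentioned at the top of this comment, a minimal sketch of generating the four ID lists could look like the following. It assumes the usual Pascal VOC layout (`Annotations/<id>.xml` and `ImageSets/Main/<split>.txt` under the dataset root); the function names and the fixed seed are illustrative and are not part of jetson-inference.

```python
import random
from pathlib import Path


def split_ids(ids, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle image IDs and split them into train/val/test lists."""
    ids = sorted(ids)                      # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]           # whatever remains (about 10%)
    return train, val, test


def write_image_sets(dataset_root, seed=0):
    """Write train/val/trainval/test ID lists, one ID per line.

    Assumes a Pascal VOC style layout: <root>/Annotations/<id>.xml
    and <root>/ImageSets/Main/<split>.txt.
    """
    root = Path(dataset_root)
    ids = [p.stem for p in (root / "Annotations").glob("*.xml")]
    train, val, test = split_ids(ids, seed=seed)
    out_dir = root / "ImageSets" / "Main"
    out_dir.mkdir(parents=True, exist_ok=True)
    splits = {"train": train, "val": val, "trainval": train + val, "test": test}
    for name, members in splits.items():
        (out_dir / f"{name}.txt").write_text("".join(f"{i}\n" for i in members))
    return {name: len(members) for name, members in splits.items()}
```

Calling `write_image_sets("data/ambulance1")` would overwrite any existing split files in that folder, so it is worth backing them up first.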

dusty-nv commented on May 28, 2024

@olutsiv "Killed" means that the board ran out of memory. Try decreasing to --batch-size 1 and --num-workers 1, mounting swap, etc.

olutsiv commented on May 28, 2024

I have tried doing all of that with no luck. Could the annotation XML files be wrong? I'm using my own pictures labeled in XML format. At first the format was a little different from the XML files that the webcam labeling software produces, but I wrote a script to edit them and now they are exactly the same. Is it possible for the Nano to run out of memory with only 30 pictures and --batch-size 1, --num-workers 1, and --epochs 1?

I uploaded the XML files that I was using. Maybe you can take a look at them and see if you can spot anything. I would really appreciate any help I can get; my group is kinda stuck right now and we need to get this working in order to finish our senior project.
https://drive.google.com/drive/folders/1PQCeKoK-mGdD49nlpCE4eyMbp6KsWg8e?usp=sharing
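
One way to rule the annotation files in or out is a quick sanity check over each XML file. The sketch below uses only the standard library and assumes the common Pascal VOC fields (`size/width`, `size/height`, `object/name`, `object/bndbox`); `check_voc_annotation` and its messages are made up for illustration, and passing this check does not guarantee the training script will accept the file.

```python
import xml.etree.ElementTree as ET


def check_voc_annotation(xml_text, known_labels):
    """Return a list of problems found in one Pascal VOC annotation."""
    problems = []
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    size = root.find("size")
    try:
        width = int(size.findtext("width"))
        height = int(size.findtext("height"))
    except (AttributeError, TypeError, ValueError):
        return ["missing or non-integer <size>/<width>/<height>"]

    objects = root.findall("object")
    if not objects:
        problems.append("no <object> entries")
    for obj in objects:
        name = obj.findtext("name")
        if name not in known_labels:
            problems.append(f"unknown label {name!r}")
        box = obj.find("bndbox")
        try:
            xmin, ymin, xmax, ymax = (
                int(float(box.findtext(tag)))
                for tag in ("xmin", "ymin", "xmax", "ymax")
            )
        except (AttributeError, TypeError, ValueError):
            problems.append(f"bad <bndbox> for {name!r}")
            continue
        # A usable box is non-empty and lies inside the declared image size.
        if not (0 <= xmin < xmax <= width and 0 <= ymin < ymax <= height):
            problems.append(
                f"box ({xmin},{ymin},{xmax},{ymax}) outside {width}x{height}"
            )
    return problems
```

For this dataset the call would be something like `check_voc_annotation(open(path).read(), {"ambulance"})`, run over every file in the annotations folder.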

This is the error I'm getting with the updated XML files: only 28 pictures, and it says "Killed".

python3 train_ssd.py --dataset-type=voc --data=data/EMSdetect --model-dir=models/EMSdetect --batch-size=1 --workers=1 --epochs=1
2024-03-28 17:02:52 - Using CUDA...
2024-03-28 17:02:52 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=1, checkpoint_folder='models/EMSdetect', dataset_type='voc', datasets=['data/EMSdetect'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, log_level='info', lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=1, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resolution=300, resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, validation_mean_ap=False, weight_decay=0.0005)
2024-03-28 17:03:00 - model resolution 300x300
2024-03-28 17:03:00 - SSDSpec(feature_map_size=19, shrinkage=16, box_sizes=SSDBoxSizes(min=60, max=105), aspect_ratios=[2, 3])
2024-03-28 17:03:00 - SSDSpec(feature_map_size=10, shrinkage=32, box_sizes=SSDBoxSizes(min=105, max=150), aspect_ratios=[2, 3])
2024-03-28 17:03:00 - SSDSpec(feature_map_size=5, shrinkage=64, box_sizes=SSDBoxSizes(min=150, max=195), aspect_ratios=[2, 3])
2024-03-28 17:03:00 - SSDSpec(feature_map_size=3, shrinkage=100, box_sizes=SSDBoxSizes(min=195, max=240), aspect_ratios=[2, 3])
2024-03-28 17:03:00 - SSDSpec(feature_map_size=2, shrinkage=150, box_sizes=SSDBoxSizes(min=240, max=285), aspect_ratios=[2, 3])
2024-03-28 17:03:00 - SSDSpec(feature_map_size=1, shrinkage=300, box_sizes=SSDBoxSizes(min=285, max=330), aspect_ratios=[2, 3])
2024-03-28 17:03:00 - Prepare training datasets.
2024-03-28 17:03:00 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-28 17:03:00 - Stored labels into file models/EMSdetect/labels.txt.
2024-03-28 17:03:00 - Train dataset size: 28
2024-03-28 17:03:00 - Prepare Validation datasets.
2024-03-28 17:03:00 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-28 17:03:00 - Validation dataset size: 28
2024-03-28 17:03:00 - Build network.
2024-03-28 17:03:00 - Init from pretrained SSD models/mobilenet-v1-ssd-mp-0_675.pth
2024-03-28 17:03:01 - Took 0.68 seconds to load the model.
2024-03-28 17:03:01 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2024-03-28 17:03:01 - Uses CosineAnnealingLR scheduler.
2024-03-28 17:03:01 - Start training from epoch 0.
/usr/local/lib/python3.8/dist-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
2024-03-28 17:03:22 - Epoch: 0, Step: 10/28, Avg Loss: 8.0487, Avg Regression Los
root@oleg:/jetson-inference/python/training/detection/ssd# python3 train_ssd.py --dataset-type=voc --data=data/EMSdetect --model-dir=models/EMSdetect3 --batch-size=1 --workers=1 --epochs=1
2024-03-28 17:12:04 - Using CUDA...
2024-03-28 17:12:04 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=1, checkpoint_folder='models/EMSdetect3', dataset_type='voc', datasets=['data/EMSdetect'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, log_level='info', lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=1, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resolution=300, resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, validation_mean_ap=False, weight_decay=0.0005)
2024-03-28 17:12:15 - model resolution 300x300
2024-03-28 17:12:15 - SSDSpec(feature_map_size=19, shrinkage=16, box_sizes=SSDBoxSizes(min=60, max=105), aspect_ratios=[2, 3])
2024-03-28 17:12:15 - SSDSpec(feature_map_size=10, shrinkage=32, box_sizes=SSDBoxSizes(min=105, max=150), aspect_ratios=[2, 3])
2024-03-28 17:12:15 - SSDSpec(feature_map_size=5, shrinkage=64, box_sizes=SSDBoxSizes(min=150, max=195), aspect_ratios=[2, 3])
2024-03-28 17:12:15 - SSDSpec(feature_map_size=3, shrinkage=100, box_sizes=SSDBoxSizes(min=195, max=240), aspect_ratios=[2, 3])
2024-03-28 17:12:15 - SSDSpec(feature_map_size=2, shrinkage=150, box_sizes=SSDBoxSizes(min=240, max=285), aspect_ratios=[2, 3])
2024-03-28 17:12:15 - SSDSpec(feature_map_size=1, shrinkage=300, box_sizes=SSDBoxSizes(min=285, max=330), aspect_ratios=[2, 3])
2024-03-28 17:12:15 - Prepare training datasets.
2024-03-28 17:12:15 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-28 17:12:15 - Stored labels into file models/EMSdetect3/labels.txt.
2024-03-28 17:12:15 - Train dataset size: 28
2024-03-28 17:12:15 - Prepare Validation datasets.
2024-03-28 17:12:15 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-28 17:12:15 - Validation dataset size: 28
2024-03-28 17:12:15 - Build network.
2024-03-28 17:12:15 - Init from pretrained SSD models/mobilenet-v1-ssd-mp-0_675.pth
2024-03-28 17:12:16 - Took 0.68 seconds to load the model.
2024-03-28 17:12:16 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2024-03-28 17:12:16 - Uses CosineAnnealingLR scheduler.
2024-03-28 17:12:16 - Start training from epoch 0.
/usr/local/lib/python3.8/dist-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
Killed

olutsiv commented on May 28, 2024

I tried to train it with 1 image and it completed successfully, then with 2 and it also completed successfully, and 3 completed successfully as well. But once I got to 4, it either gave me the "Killed" error or just froze the Jetson completely. It has to have something to do with the pictures or labeling, because I did a test with the webcam labeling software where I labeled and saved around 50 pictures, and it trained on those with no problems. But when I use my own pictures and data it doesn't work.
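
The 1, 2, 3, 4 experiment above can be automated as a binary search over dataset prefixes. This is only a sketch: `trains_ok` is a hypothetical callable you would have to supply (for example, one that writes the given IDs to the split files, launches a one-epoch run, and reports whether it exited cleanly), and the search assumes that once a prefix fails, every longer prefix fails too.

```python
def smallest_failing_prefix(ids, trains_ok):
    """Return the smallest n for which trains_ok(ids[:n]) is False, or None.

    Assumes failures are monotonic: if the first n IDs fail, so does any
    longer prefix. `trains_ok` is a caller-supplied (hypothetical) check.
    """
    if trains_ok(ids):
        return None                 # the full set trains fine
    lo, hi = 0, len(ids)            # prefix of length lo passes, length hi fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if trains_ok(ids[:mid]):
            lo = mid
        else:
            hi = mid
    return hi
```

If the result is `n`, then `ids[n - 1]` is the first sample worth inspecting. If memory use is cumulative rather than tied to one file, `n` marks a threshold instead of a bad image.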

dusty-nv commented on May 28, 2024

What is the resolution of your own pictures? Maybe they are really large and it is keeping them in memory? Did you mount enough swap? You can also run these PyTorch training scripts on another Linux/GPU machine with more memory, or in Google Colab, I think.
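
To answer the resolution question without opening every picture, one option is to read the size each annotation declares. The sketch below assumes Pascal VOC XML files with a `<size>` element; the helper names and the default limits are illustrative. The declared size can differ from the real file, so Pillow's `Image.open(path).size` is the more reliable check when in doubt.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def declared_size(xml_path):
    """Return (width, height) as declared in a VOC annotation's <size> element."""
    size = ET.parse(xml_path).getroot().find("size")
    return int(size.findtext("width")), int(size.findtext("height"))


def oversized_images(annotations_dir, max_width=1920, max_height=1080):
    """Map annotation file name -> (width, height) for images above the limits."""
    large = {}
    for xml_path in sorted(Path(annotations_dir).glob("*.xml")):
        width, height = declared_size(xml_path)
        if width > max_width or height > max_height:
            large[xml_path.name] = (width, height)
    return large
```

An empty result from `oversized_images("data/EMSdetect/Annotations")` would suggest the declared resolutions are not the problem.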

olutsiv commented on May 28, 2024

Hmm, yeah, maybe it is the pictures. The resolutions vary; they are not all the same. What's the recommended or maximum resolution for the pictures I should be using?

I wasn't quite sure how much swap I can mount. What would you recommend for an Orin Nano?

dusty-nv commented on May 28, 2024

Swap, I typically mount the same amount as the board has RAM, so 8GB swap for Orin Nano.
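
To check whether the mounted swap matches that rule of thumb, a small sketch can compare `MemTotal` and `SwapTotal` from `/proc/meminfo` (a standard Linux interface; values there are reported in kB). The function names are illustrative.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo style text into a dict of integer kB values."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def swap_report(info):
    """Summarise RAM and swap totals in GiB and whether swap is at least RAM."""
    kib_per_gib = 1024 * 1024
    ram = info.get("MemTotal", 0) / kib_per_gib
    swap = info.get("SwapTotal", 0) / kib_per_gib
    return {
        "ram_gib": round(ram, 2),
        "swap_gib": round(swap, 2),
        "swap_covers_ram": swap >= ram,
    }
```

On the board, `swap_report(parse_meminfo(open("/proc/meminfo").read()))` gives the summary; watching `MemAvailable` from the same file during training shows how close the run gets to the limit.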

I would probably keep the pictures to 1920x1080 resolution or similar...the camera-capture program captures them at 1280x720. The model downsamples them to 300x300 anyways
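
If some pictures are larger than that, they can be downscaled before training, but the bounding boxes have to be scaled by the same factor. The sketch below shows only that arithmetic; the function names are illustrative, and the actual pixel resize (for example with Pillow's `Image.resize`) and the update of the XML `<size>` fields are left out.

```python
def fit_scale(width, height, max_width=1920, max_height=1080):
    """Scale factor (at most 1.0) that fits width x height inside the limits."""
    return min(1.0, max_width / width, max_height / height)


def scale_box(box, scale):
    """Scale an (xmin, ymin, xmax, ymax) pixel box by the same factor."""
    return tuple(round(v * scale) for v in box)
```

For a 3840x2160 picture, `fit_scale` gives 0.5, so a box of `(100, 200, 300, 400)` becomes `(50, 100, 150, 200)`; pictures already within the limits get a factor of 1.0 and are left unchanged.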

olutsiv commented on May 28, 2024

Ok thank you so much for that information. I believe the pictures we are using are all around that size or even smaller. I was mounting 4GB but I will try to mount 8 and see what happens. I’m just confused why it wasn’t even able to train 4 pictures.

dusty-nv commented on May 28, 2024

I'm not sure either since you said it trained fine on what you captured with camera-capture, which would lead one to believe it is related to the dataset

olutsiv commented on May 28, 2024

Yeah that’s the conclusion I came to too. I think we will try and relabel some of our images with CVAT and run a test model with those and see if it trains properly. If it does then we will just have to relabel all of our images with CVAT.

dusty-nv commented on May 28, 2024

OK gotcha - I have used CVAT in the past for this. If you have another machine with more memory capable of running PyTorch, you can do the training there too.
