Comments (11)

olutsiv commented on May 28, 2024

I get this error when I include only 60 pictures in the dataset. For testing, I put the same 60 picture IDs in train.txt, val.txt, test.txt, and trainval.txt; I know the data should be split roughly 80/10/10. I'm not sure why I get a different error when there are fewer pictures, but my final project will have 4 classes and roughly 6-8k images. It worries me that I can't even get 1 class with 1300 images to work.
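
For reference, the rough 80/10/10 split mentioned above could be sketched in Python like this. It is only an illustration: split_ids and write_image_sets are hypothetical helper names, not part of jetson-inference, and the output directory is assumed to be the dataset's ImageSets/Main folder as in the usual Pascal VOC layout.

```python
import random
from pathlib import Path


def split_ids(image_ids, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle image IDs and split them into train/val/test lists."""
    ids = sorted(image_ids)            # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]       # whatever is left goes to test
    return train, val, test


def write_image_sets(out_dir, train, val, test):
    """Write train.txt, val.txt, test.txt and trainval.txt, one ID per line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subsets = {"train": train, "val": val, "test": test, "trainval": train + val}
    for name, ids in subsets.items():
        (out / f"{name}.txt").write_text("\n".join(ids) + "\n")
```

With 60 image IDs this gives 48 for training, 6 for validation, and 6 for testing, and trainval.txt holds the 54 training and validation IDs together.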

python3 train_ssd.py --dataset-type=voc --data=data/ambulance1 --model-dir=models/ambulance8 --batch-size=4 --workers=2 --epochs=1
2024-03-09 00:54:19 - Using CUDA...
2024-03-09 00:54:19 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=4, checkpoint_folder='models/ambulance8', dataset_type='voc', datasets=['data/ambulance1'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, log_level='info', lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resolution=300, resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, validation_mean_ap=False, weight_decay=0.0005)
2024-03-09 00:54:31 - model resolution 300x300
2024-03-09 00:54:31 - SSDSpec(feature_map_size=19, shrinkage=16, box_sizes=SSDBoxSizes(min=60, max=105), aspect_ratios=[2, 3])
2024-03-09 00:54:31 - SSDSpec(feature_map_size=10, shrinkage=32, box_sizes=SSDBoxSizes(min=105, max=150), aspect_ratios=[2, 3])
2024-03-09 00:54:31 - SSDSpec(feature_map_size=5, shrinkage=64, box_sizes=SSDBoxSizes(min=150, max=195), aspect_ratios=[2, 3])
2024-03-09 00:54:31 - SSDSpec(feature_map_size=3, shrinkage=100, box_sizes=SSDBoxSizes(min=195, max=240), aspect_ratios=[2, 3])
2024-03-09 00:54:31 - SSDSpec(feature_map_size=2, shrinkage=150, box_sizes=SSDBoxSizes(min=240, max=285), aspect_ratios=[2, 3])
2024-03-09 00:54:31 - SSDSpec(feature_map_size=1, shrinkage=300, box_sizes=SSDBoxSizes(min=285, max=330), aspect_ratios=[2, 3])
2024-03-09 00:54:31 - Prepare training datasets.
2024-03-09 00:54:31 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-09 00:54:31 - Stored labels into file models/ambulance8/labels.txt.
2024-03-09 00:54:31 - Train dataset size: 60
2024-03-09 00:54:31 - Prepare Validation datasets.
2024-03-09 00:54:31 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-09 00:54:31 - Validation dataset size: 60
2024-03-09 00:54:31 - Build network.
2024-03-09 00:54:31 - Init from pretrained SSD models/mobilenet-v1-ssd-mp-0_675.pth
2024-03-09 00:54:32 - Took 0.69 seconds to load the model.
2024-03-09 00:54:32 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2024-03-09 00:54:32 - Uses CosineAnnealingLR scheduler.
2024-03-09 00:54:32 - Start training from epoch 0.
/usr/local/lib/python3.8/dist-packages/Pillow-9.5.0-py3.8-linux-aarch64.egg/PIL/Image.py:992: UserWarning: Palette images with Transparency expressed in bytes should be converted to RGBA images
warnings.warn(
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/usr/lib/python3.8/multiprocessing/queues.py", line 107, in get
if not self._poll(timeout):
File "/usr/lib/python3.8/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/usr/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
r = wait([self], timeout)
File "/usr/lib/python3.8/multiprocessing/connection.py", line 931, in wait
ready = selector.select(timeout)
File "/usr/lib/python3.8/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 252) is killed by signal: Killed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "train_ssd.py", line 406, in <module>
train(train_loader, net, criterion, optimizer, device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
File "train_ssd.py", line 139, in train
for i, data in enumerate(loader):
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 634, in __next__
data = self._next_data()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
idx, data = self._get_data()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1295, in _get_data
success, data = self._try_get_data()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1146, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 252) exited unexpectedly

from jetson-inference.

dusty-nv commented on May 28, 2024

@olutsiv "Killed" means that the board ran out of memory. Try decreasing to --batch-size 1 and --num-workers 1, mounting swap, etc.
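
One way to see how much RAM and swap the board has before starting a training run is to read /proc/meminfo. The Python sketch below is a minimal example under that assumption; parse_meminfo and memory_report are hypothetical helper names, and the script only reports numbers, it does not change any settings.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value}.

    Most fields are reported in kB; a few (such as HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue                     # skip lines without a "name: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def memory_report(path="/proc/meminfo"):
    """Return total/available RAM and total/free swap in GiB."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    fields = ("MemTotal", "MemAvailable", "SwapTotal", "SwapFree")
    return {name: round(info.get(name, 0) / (1024 * 1024), 2) for name in fields}


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    for name, gib in memory_report().items():
        print(f"{name}: {gib} GiB")
```

If SwapTotal comes back as 0, no swap is mounted, and a low MemAvailable before training even starts suggests that other processes are already using most of the memory.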

olutsiv commented on May 28, 2024

I have tried all of that with no luck. Could the annotation XML files be wrong? I'm using my own pictures, labeled in XML format. At first the format was a little different from the XML files that the webcam labeling software produces, but I wrote a script to edit them and now they match exactly. Is it possible for the Nano to run out of memory with only 30 pictures and --batch-size 1, --num-workers 1, and --epochs 1?

I uploaded the XML files that I was using. Maybe you can take a look at them and see if you can spot anything. I would really appreciate any help I can get; my group is kind of stuck right now, and we need to get this working in order to finish our senior project.
https://drive.google.com/drive/folders/1PQCeKoK-mGdD49nlpCE4eyMbp6KsWg8e?usp=sharing
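
A quick sanity check of Pascal VOC-style annotation files can rule out the most common labeling problems. The sketch below is a hypothetical checker, not something from jetson-inference; it assumes each XML file has a size element with width and height, and object entries with a name and a bndbox (xmin, ymin, xmax, ymax). It would not detect memory problems, only malformed labels.

```python
import xml.etree.ElementTree as ET


def _to_int(text):
    """Convert text to int, returning 0 for missing or non-numeric values."""
    try:
        return int(float(text))
    except (TypeError, ValueError):
        return 0


def check_voc_annotation(xml_text):
    """Return a list of problems found in one VOC annotation (empty if none)."""
    problems = []
    root = ET.fromstring(xml_text)
    size = root.find("size")
    width = _to_int(size.findtext("width")) if size is not None else 0
    height = _to_int(size.findtext("height")) if size is not None else 0
    size_ok = width > 0 and height > 0
    if not size_ok:
        problems.append("missing or invalid <size> (width/height)")
    objects = root.findall("object")
    if not objects:
        problems.append("no <object> entries")
    for i, obj in enumerate(objects):
        if not (obj.findtext("name") or "").strip():
            problems.append(f"object {i}: empty <name>")
        box = obj.find("bndbox")
        if box is None:
            problems.append(f"object {i}: missing <bndbox>")
            continue
        try:
            xmin, ymin, xmax, ymax = (
                float(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
            )
        except (TypeError, ValueError):
            problems.append(f"object {i}: non-numeric box coordinate")
            continue
        if xmin >= xmax or ymin >= ymax:
            problems.append(f"object {i}: box has zero or negative area")
        if size_ok and (xmin < 0 or ymin < 0 or xmax > width or ymax > height):
            problems.append(f"object {i}: box extends outside the image")
    return problems
```

Running check_voc_annotation over the text of every file in the Annotations folder and printing the non-empty results would show which images, if any, have suspicious labels.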

This is the error I'm getting with the updated XML files. There are only 28 pictures, and it says "Killed".

python3 train_ssd.py --dataset-type=voc --data=data/EMSdetect --model-dir=models/EMSdetect --batch-size=1 --workers=1 --epochs=1
2024-03-28 17:02:52 - Using CUDA...
2024-03-28 17:02:52 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=1, checkpoint_folder='models/EMSdetect', dataset_type='voc', datasets=['data/EMSdetect'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, log_level='info', lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=1, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resolution=300, resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, validation_mean_ap=False, weight_decay=0.0005)
2024-03-28 17:03:00 - model resolution 300x300
2024-03-28 17:03:00 - SSDSpec(feature_map_size=19, shrinkage=16, box_sizes=SSDBoxSizes(min=60, max=105), aspect_ratios=[2, 3])
2024-03-28 17:03:00 - SSDSpec(feature_map_size=10, shrinkage=32, box_sizes=SSDBoxSizes(min=105, max=150), aspect_ratios=[2, 3])
2024-03-28 17:03:00 - SSDSpec(feature_map_size=5, shrinkage=64, box_sizes=SSDBoxSizes(min=150, max=195), aspect_ratios=[2, 3])
2024-03-28 17:03:00 - SSDSpec(feature_map_size=3, shrinkage=100, box_sizes=SSDBoxSizes(min=195, max=240), aspect_ratios=[2, 3])
2024-03-28 17:03:00 - SSDSpec(feature_map_size=2, shrinkage=150, box_sizes=SSDBoxSizes(min=240, max=285), aspect_ratios=[2, 3])
2024-03-28 17:03:00 - SSDSpec(feature_map_size=1, shrinkage=300, box_sizes=SSDBoxSizes(min=285, max=330), aspect_ratios=[2, 3])
2024-03-28 17:03:00 - Prepare training datasets.
2024-03-28 17:03:00 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-28 17:03:00 - Stored labels into file models/EMSdetect/labels.txt.
2024-03-28 17:03:00 - Train dataset size: 28
2024-03-28 17:03:00 - Prepare Validation datasets.
2024-03-28 17:03:00 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-28 17:03:00 - Validation dataset size: 28
2024-03-28 17:03:00 - Build network.
2024-03-28 17:03:00 - Init from pretrained SSD models/mobilenet-v1-ssd-mp-0_675.pth
2024-03-28 17:03:01 - Took 0.68 seconds to load the model.
2024-03-28 17:03:01 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2024-03-28 17:03:01 - Uses CosineAnnealingLR scheduler.
2024-03-28 17:03:01 - Start training from epoch 0.
/usr/local/lib/python3.8/dist-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
2024-03-28 17:03:22 - Epoch: 0, Step: 10/28, Avg Loss: 8.0487, Avg Regression Los
root@oleg:/jetson-inference/python/training/detection/ssd# python3 train_ssd.py --dataset-type=voc --data=data/EMSdetect --model-dir=models/EMSdetect3 --batch-size=1 --workers=1 --epochs=1
2024-03-28 17:12:04 - Using CUDA...
2024-03-28 17:12:04 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=1, checkpoint_folder='models/EMSdetect3', dataset_type='voc', datasets=['data/EMSdetect'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, log_level='info', lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=1, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resolution=300, resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, validation_mean_ap=False, weight_decay=0.0005)
2024-03-28 17:12:15 - model resolution 300x300
2024-03-28 17:12:15 - SSDSpec(feature_map_size=19, shrinkage=16, box_sizes=SSDBoxSizes(min=60, max=105), aspect_ratios=[2, 3])
2024-03-28 17:12:15 - SSDSpec(feature_map_size=10, shrinkage=32, box_sizes=SSDBoxSizes(min=105, max=150), aspect_ratios=[2, 3])
2024-03-28 17:12:15 - SSDSpec(feature_map_size=5, shrinkage=64, box_sizes=SSDBoxSizes(min=150, max=195), aspect_ratios=[2, 3])
2024-03-28 17:12:15 - SSDSpec(feature_map_size=3, shrinkage=100, box_sizes=SSDBoxSizes(min=195, max=240), aspect_ratios=[2, 3])
2024-03-28 17:12:15 - SSDSpec(feature_map_size=2, shrinkage=150, box_sizes=SSDBoxSizes(min=240, max=285), aspect_ratios=[2, 3])
2024-03-28 17:12:15 - SSDSpec(feature_map_size=1, shrinkage=300, box_sizes=SSDBoxSizes(min=285, max=330), aspect_ratios=[2, 3])
2024-03-28 17:12:15 - Prepare training datasets.
2024-03-28 17:12:15 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-28 17:12:15 - Stored labels into file models/EMSdetect3/labels.txt.
2024-03-28 17:12:15 - Train dataset size: 28
2024-03-28 17:12:15 - Prepare Validation datasets.
2024-03-28 17:12:15 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-28 17:12:15 - Validation dataset size: 28
2024-03-28 17:12:15 - Build network.
2024-03-28 17:12:15 - Init from pretrained SSD models/mobilenet-v1-ssd-mp-0_675.pth
2024-03-28 17:12:16 - Took 0.68 seconds to load the model.
2024-03-28 17:12:16 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2024-03-28 17:12:16 - Uses CosineAnnealingLR scheduler.
2024-03-28 17:12:16 - Start training from epoch 0.
/usr/local/lib/python3.8/dist-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
Killed

olutsiv commented on May 28, 2024

I tried to train it with 1 image and it completed successfully, then with 2 and 3, which also completed successfully. But once I got to 4, it either gave me the "Killed" error or froze the Jetson completely. It must have something to do with the pictures or the labeling, because I ran a test with the webcam labeling software: I labeled and saved around 50 pictures, and it trained on those with no problems. When I use my own pictures and data, it doesn't work.

dusty-nv commented on May 28, 2024

olutsiv commented on May 28, 2024

Hmm, yeah, maybe it is the pictures. The resolutions vary; they are not all the same. What is the recommended or maximum resolution for the pictures I should be using?

I wasn't quite sure how much swap I can mount. What would you recommend for an Orin Nano?

dusty-nv commented on May 28, 2024

For swap, I typically mount the same amount as the board has RAM, so 8GB of swap for the Orin Nano.

I would probably keep the pictures to 1920x1080 resolution or similar. The camera-capture program captures them at 1280x720, and the model downsamples them to 300x300 anyway.
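
To illustrate the resolution advice, here is a small Python sketch that works out what size an image would need to be reduced to in order to fit within 1920x1080, and roughly how much memory it takes once decoded. The function names are hypothetical, and the memory figure assumes plain 8-bit RGB (3 bytes per pixel) with no extra copies, so real usage in a data loader can be higher.

```python
def fit_within(width, height, max_width=1920, max_height=1080):
    """Return (new_width, new_height) scaled down to fit the limits.

    The aspect ratio is kept and images that already fit are left unchanged.
    """
    scale = min(max_width / width, max_height / height, 1.0)  # never upscale
    return max(1, round(width * scale)), max(1, round(height * scale))


def decoded_megabytes(width, height, channels=3):
    """Approximate size in MiB of an uncompressed 8-bit image."""
    return width * height * channels / (1024 * 1024)


if __name__ == "__main__":
    for w, h in [(1280, 720), (3840, 2160), (4000, 3000)]:
        new_w, new_h = fit_within(w, h)
        print(f"{w}x{h}: {decoded_megabytes(w, h):.1f} MiB decoded, "
              f"fits as {new_w}x{new_h} ({decoded_megabytes(new_w, new_h):.1f} MiB)")
```

A 1920x1080 RGB image is just under 6 MiB when decoded, while a 4000x3000 photo is over 34 MiB, so a handful of very large source images can use far more memory than their file sizes suggest.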

olutsiv commented on May 28, 2024

OK, thank you so much for that information. I believe the pictures we are using are all around that size or even smaller. I was mounting 4GB of swap, but I will try mounting 8GB and see what happens. I'm just confused about why it couldn't even train on 4 pictures.

dusty-nv commented on May 28, 2024

I'm not sure either. Since you said it trained fine on what you captured with camera-capture, that would lead one to believe it is related to the dataset.

olutsiv commented on May 28, 2024

Yeah, that's the conclusion I came to as well. I think we will try relabeling some of our images with CVAT, run a test model with those, and see if it trains properly. If it does, then we will just have to relabel all of our images with CVAT.

dusty-nv commented on May 28, 2024

OK gotcha - I have used CVAT in the past for this. If you have another machine with more memory capable of running PyTorch, you can do the training there too.
