Comments (11)
I get this error when I include only 60 pictures in the data. The same 60 picture IDs are in train.txt, val.txt, test.txt, and trainval.txt just for testing; I know they should be split roughly 80/10/10. I'm not sure why I get a different error when there are fewer pictures, but my final project will have 4 classes and roughly 6-8k images. It's worrying me that I can't even get 1 class with 1300 images to work.
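For reference, a rough 80/10/10 split of the image IDs could be generated with something like the sketch below. It assumes the standard Pascal VOC layout (an ImageSets/Main folder inside the dataset directory), which is an assumption and not something confirmed in this thread; the function names are hypothetical.

```python
import random
from pathlib import Path


def split_ids(image_ids, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle the IDs deterministically and split them into train/val/test."""
    ids = sorted(image_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # whatever is left over
    return train, val, test


def write_image_sets(dataset_dir, train, val, test):
    """Write train.txt, val.txt, test.txt and trainval.txt (one ID per line).

    Assumes the Pascal VOC convention of <dataset>/ImageSets/Main.
    """
    out_dir = Path(dataset_dir) / "ImageSets" / "Main"
    out_dir.mkdir(parents=True, exist_ok=True)
    splits = {"train": train, "val": val, "test": test, "trainval": train + val}
    for name, ids in splits.items():
        (out_dir / f"{name}.txt").write_text("\n".join(ids) + "\n")
```

Calling split_ids on the file stems of the annotation XMLs and passing the result to write_image_sets with the dataset path (data/ambulance1 in the command below) would produce the four files with no ID shared between train, val, and test.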
python3 train_ssd.py --dataset-type=voc --data=data/ambulance1 --model-dir=models/ambulance8 --batch-size=4 --workers=2 --epochs=1
2024-03-09 00:54:19 - Using CUDA...
2024-03-09 00:54:19 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=4, checkpoint_folder='models/ambulance8', dataset_type='voc', datasets=['data/ambulance1'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, log_level='info', lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resolution=300, resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, validation_mean_ap=False, weight_decay=0.0005)
2024-03-09 00:54:31 - model resolution 300x300
2024-03-09 00:54:31 - SSDSpec(feature_map_size=19, shrinkage=16, box_sizes=SSDBoxSizes(min=60, max=105), aspect_ratios=[2, 3])
2024-03-09 00:54:31 - SSDSpec(feature_map_size=10, shrinkage=32, box_sizes=SSDBoxSizes(min=105, max=150), aspect_ratios=[2, 3])
2024-03-09 00:54:31 - SSDSpec(feature_map_size=5, shrinkage=64, box_sizes=SSDBoxSizes(min=150, max=195), aspect_ratios=[2, 3])
2024-03-09 00:54:31 - SSDSpec(feature_map_size=3, shrinkage=100, box_sizes=SSDBoxSizes(min=195, max=240), aspect_ratios=[2, 3])
2024-03-09 00:54:31 - SSDSpec(feature_map_size=2, shrinkage=150, box_sizes=SSDBoxSizes(min=240, max=285), aspect_ratios=[2, 3])
2024-03-09 00:54:31 - SSDSpec(feature_map_size=1, shrinkage=300, box_sizes=SSDBoxSizes(min=285, max=330), aspect_ratios=[2, 3])
2024-03-09 00:54:31 - Prepare training datasets.
2024-03-09 00:54:31 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-09 00:54:31 - Stored labels into file models/ambulance8/labels.txt.
2024-03-09 00:54:31 - Train dataset size: 60
2024-03-09 00:54:31 - Prepare Validation datasets.
2024-03-09 00:54:31 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-09 00:54:31 - Validation dataset size: 60
2024-03-09 00:54:31 - Build network.
2024-03-09 00:54:31 - Init from pretrained SSD models/mobilenet-v1-ssd-mp-0_675.pth
2024-03-09 00:54:32 - Took 0.69 seconds to load the model.
2024-03-09 00:54:32 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2024-03-09 00:54:32 - Uses CosineAnnealingLR scheduler.
2024-03-09 00:54:32 - Start training from epoch 0.
/usr/local/lib/python3.8/dist-packages/Pillow-9.5.0-py3.8-linux-aarch64.egg/PIL/Image.py:992: UserWarning: Palette images with Transparency expressed in bytes should be converted to RGBA images
warnings.warn(
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/usr/lib/python3.8/multiprocessing/queues.py", line 107, in get
if not self._poll(timeout):
File "/usr/lib/python3.8/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/usr/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
r = wait([self], timeout)
File "/usr/lib/python3.8/multiprocessing/connection.py", line 931, in wait
ready = selector.select(timeout)
File "/usr/lib/python3.8/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 252) is killed by signal: Killed.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "train_ssd.py", line 406, in <module>
train(train_loader, net, criterion, optimizer, device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
File "train_ssd.py", line 139, in train
for i, data in enumerate(loader):
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 634, in __next__
data = self._next_data()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
idx, data = self._get_data()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1295, in _get_data
success, data = self._try_get_data()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1146, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 252) exited unexpectedly
from jetson-inference.
@olutsiv "Killed" means that the board ran out of memory. Try decreasing to --batch-size 1 and --num-workers 1, and mounting swap, etc.
I have tried doing all of that with no luck. Could it be the annotation XML file that's wrong? I'm using my own pictures labeled in XML format. At first the format was a little different from the XML files that the webcam labeling software produces, but I wrote a script to edit them and now they are exactly the same. Is it possible for the Nano to run out of memory with only 30 pictures and --batch-size 1, --num-workers 1, and --epochs 1?
I uploaded the XML files that I was using. Maybe you can take a look at them and see if you can spot anything. I would really appreciate any help I can get; my group is kinda stuck right now and we need to get this working in order to finish our senior project.
https://drive.google.com/drive/folders/1PQCeKoK-mGdD49nlpCE4eyMbp6KsWg8e?usp=sharing
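One way to rule the annotations in or out is to sanity-check each XML file before training. The sketch below assumes the usual Pascal VOC fields (size/width, size/height, object/name, object/bndbox); the function name and the specific checks are illustrative, not something train_ssd.py does itself.

```python
import xml.etree.ElementTree as ET


def check_voc_annotation(xml_text, allowed_labels):
    """Return a list of problems found in one Pascal VOC annotation string."""
    problems = []
    root = ET.fromstring(xml_text)

    size = root.find("size")
    if size is None:
        return ["missing <size> element"]
    try:
        width = int(size.findtext("width", default="0"))
        height = int(size.findtext("height", default="0"))
    except ValueError:
        return ["non-numeric <size> values"]
    if width <= 0 or height <= 0:
        problems.append(f"invalid image size {width}x{height}")

    objects = root.findall("object")
    if not objects:
        problems.append("no <object> entries")

    for obj in objects:
        name = (obj.findtext("name") or "").strip()
        if name not in allowed_labels:
            problems.append(f"unknown label {name!r}")
        box = obj.find("bndbox")
        if box is None:
            problems.append(f"{name!r}: missing <bndbox>")
            continue
        try:
            xmin, ymin, xmax, ymax = (
                float(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
            )
        except (TypeError, ValueError):
            # TypeError: tag missing (None); ValueError: text is not a number
            problems.append(f"{name!r}: missing or non-numeric box coordinates")
            continue
        if xmin >= xmax or ymin >= ymax:
            problems.append(f"{name!r}: empty or inverted box")
        if xmin < 0 or ymin < 0 or xmax > width or ymax > height:
            problems.append(f"{name!r}: box outside {width}x{height} image")
    return problems
```

Running this over every file in the annotations folder with allowed_labels={"ambulance"} and printing any non-empty result would show whether a particular file stands out.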
This is the error I'm getting with the updated XML files: only 28 pictures, and it says "Killed".
python3 train_ssd.py --dataset-type=voc --data=data/EMSdetect --model-dir=models/EMSdetect --batch-size=1 --workers=1 --epochs=1
2024-03-28 17:02:52 - Using CUDA...
2024-03-28 17:02:52 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=1, checkpoint_folder='models/EMSdetect', dataset_type='voc', datasets=['data/EMSdetect'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, log_level='info', lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=1, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resolution=300, resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, validation_mean_ap=False, weight_decay=0.0005)
2024-03-28 17:03:00 - model resolution 300x300
2024-03-28 17:03:00 - SSDSpec(feature_map_size=19, shrinkage=16, box_sizes=SSDBoxSizes(min=60, max=105), aspect_ratios=[2, 3])
2024-03-28 17:03:00 - SSDSpec(feature_map_size=10, shrinkage=32, box_sizes=SSDBoxSizes(min=105, max=150), aspect_ratios=[2, 3])
2024-03-28 17:03:00 - SSDSpec(feature_map_size=5, shrinkage=64, box_sizes=SSDBoxSizes(min=150, max=195), aspect_ratios=[2, 3])
2024-03-28 17:03:00 - SSDSpec(feature_map_size=3, shrinkage=100, box_sizes=SSDBoxSizes(min=195, max=240), aspect_ratios=[2, 3])
2024-03-28 17:03:00 - SSDSpec(feature_map_size=2, shrinkage=150, box_sizes=SSDBoxSizes(min=240, max=285), aspect_ratios=[2, 3])
2024-03-28 17:03:00 - SSDSpec(feature_map_size=1, shrinkage=300, box_sizes=SSDBoxSizes(min=285, max=330), aspect_ratios=[2, 3])
2024-03-28 17:03:00 - Prepare training datasets.
2024-03-28 17:03:00 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-28 17:03:00 - Stored labels into file models/EMSdetect/labels.txt.
2024-03-28 17:03:00 - Train dataset size: 28
2024-03-28 17:03:00 - Prepare Validation datasets.
2024-03-28 17:03:00 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-28 17:03:00 - Validation dataset size: 28
2024-03-28 17:03:00 - Build network.
2024-03-28 17:03:00 - Init from pretrained SSD models/mobilenet-v1-ssd-mp-0_675.pth
2024-03-28 17:03:01 - Took 0.68 seconds to load the model.
2024-03-28 17:03:01 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2024-03-28 17:03:01 - Uses CosineAnnealingLR scheduler.
2024-03-28 17:03:01 - Start training from epoch 0.
/usr/local/lib/python3.8/dist-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
2024-03-28 17:03:22 - Epoch: 0, Step: 10/28, Avg Loss: 8.0487, Avg Regression Los
root@oleg:/jetson-inference/python/training/detection/ssd# python3 train_ssd.py --dataset-type=voc --data=data/EMSdetect --model-dir=models/EMSdetect3 --batch-size=1 --workers=1 --epochs=1
2024-03-28 17:12:04 - Using CUDA...
2024-03-28 17:12:04 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=1, checkpoint_folder='models/EMSdetect3', dataset_type='voc', datasets=['data/EMSdetect'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, log_level='info', lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=1, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resolution=300, resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, validation_mean_ap=False, weight_decay=0.0005)
2024-03-28 17:12:15 - model resolution 300x300
2024-03-28 17:12:15 - SSDSpec(feature_map_size=19, shrinkage=16, box_sizes=SSDBoxSizes(min=60, max=105), aspect_ratios=[2, 3])
2024-03-28 17:12:15 - SSDSpec(feature_map_size=10, shrinkage=32, box_sizes=SSDBoxSizes(min=105, max=150), aspect_ratios=[2, 3])
2024-03-28 17:12:15 - SSDSpec(feature_map_size=5, shrinkage=64, box_sizes=SSDBoxSizes(min=150, max=195), aspect_ratios=[2, 3])
2024-03-28 17:12:15 - SSDSpec(feature_map_size=3, shrinkage=100, box_sizes=SSDBoxSizes(min=195, max=240), aspect_ratios=[2, 3])
2024-03-28 17:12:15 - SSDSpec(feature_map_size=2, shrinkage=150, box_sizes=SSDBoxSizes(min=240, max=285), aspect_ratios=[2, 3])
2024-03-28 17:12:15 - SSDSpec(feature_map_size=1, shrinkage=300, box_sizes=SSDBoxSizes(min=285, max=330), aspect_ratios=[2, 3])
2024-03-28 17:12:15 - Prepare training datasets.
2024-03-28 17:12:15 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-28 17:12:15 - Stored labels into file models/EMSdetect3/labels.txt.
2024-03-28 17:12:15 - Train dataset size: 28
2024-03-28 17:12:15 - Prepare Validation datasets.
2024-03-28 17:12:15 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-28 17:12:15 - Validation dataset size: 28
2024-03-28 17:12:15 - Build network.
2024-03-28 17:12:15 - Init from pretrained SSD models/mobilenet-v1-ssd-mp-0_675.pth
2024-03-28 17:12:16 - Took 0.68 seconds to load the model.
2024-03-28 17:12:16 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2024-03-28 17:12:16 - Uses CosineAnnealingLR scheduler.
2024-03-28 17:12:16 - Start training from epoch 0.
/usr/local/lib/python3.8/dist-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
Killed
I tried to train it with 1 image and it completed successfully, then 2 and it also completed successfully, and 3 completed successfully as well, but once I got to 4 it either gave me the "Killed" error or just froze the Jetson completely. It has to have something to do with the pictures or labeling, because I did a test with the webcam labeling software where I labeled and saved around 50 pictures, and it trained on those with no problems. But when I use my own pictures and data it doesn't work.
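The Pillow warning in the first log ("Palette images with Transparency expressed in bytes should be converted to RGBA images") suggests that at least one of the custom pictures is a palette-mode image, which the camera-captured ones presumably are not. Whether that is related to the process being killed is not established here. A minimal sketch for spotting such files, assuming they are PNGs and using only the fixed PNG header layout (the function name is hypothetical):

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
COLOR_TYPES = {0: "grayscale", 2: "RGB", 3: "palette", 4: "grayscale+alpha", 6: "RGBA"}


def png_info(data):
    """Return (width, height, color_type_name) from the first bytes of a PNG.

    Returns None if the data is not a PNG (for example a real JPEG).
    """
    if len(data) < 26 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        return None
    # IHDR payload starts at byte 16: width, height, bit depth, color type
    width, height, _bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return width, height, COLOR_TYPES.get(color_type, f"unknown({color_type})")
```

Reading the first 26 bytes of each image file and passing them to png_info would list any palette-mode PNGs (including PNGs saved with a .jpg extension) together with their actual pixel dimensions.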
Hmm, yeah, maybe it is the pictures. The resolutions vary; they are not all the same. What's the recommended or maximum resolution of pictures I should be using?
I wasn't quite sure how much swap I can mount. What would you recommend for an Orin Nano?
For swap, I typically mount the same amount as the board has RAM, so 8GB of swap for the Orin Nano.
I would probably keep the pictures to 1920x1080 resolution or similar. The camera-capture program captures them at 1280x720, and the model downsamples them to 300x300 anyway.
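To find pictures above that guideline, the size recorded in each annotation can be compared against 1920x1080. This is only a sketch: it trusts the size/width and size/height fields in the XML, which may not match the actual image file if the annotations were edited by a script, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

MAX_WIDTH, MAX_HEIGHT = 1920, 1080  # guideline from the comment above


def declared_size(xml_text):
    """Return (width, height) as declared in a Pascal VOC annotation string.

    Raises TypeError or ValueError if the fields are missing or not numeric.
    """
    root = ET.fromstring(xml_text)
    return int(root.findtext("size/width")), int(root.findtext("size/height"))


def is_oversized(xml_text, max_width=MAX_WIDTH, max_height=MAX_HEIGHT):
    """True if the declared image size exceeds the limit in either orientation."""
    width, height = declared_size(xml_text)
    # Compare long side to long side so portrait images are treated like landscape.
    long_side, short_side = max(width, height), min(width, height)
    return long_side > max(max_width, max_height) or short_side > min(max_width, max_height)
```

Any file flagged this way could be resized (with its box coordinates scaled to match) before training.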
Ok thank you so much for that information. I believe the pictures we are using are all around that size or even smaller. I was mounting 4GB but I will try to mount 8 and see what happens. I’m just confused why it wasn’t even able to train 4 pictures.
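To confirm how much swap is actually mounted relative to RAM, the values in /proc/meminfo can be compared. A small sketch, assuming a Linux system where /proc/meminfo reports MemTotal and SwapTotal in kB (the function names are hypothetical):

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_report(text):
    """Return (ram_gib, swap_gib, swap_at_least_ram) from meminfo text."""
    info = parse_meminfo(text)
    ram_gib = info["MemTotal"] / 1024 ** 2   # kB -> GiB
    swap_gib = info["SwapTotal"] / 1024 ** 2
    return ram_gib, swap_gib, swap_gib >= ram_gib
```

Passing the contents of /proc/meminfo to swap_report on the Jetson shows whether the mounted swap matches the "same amount as RAM" rule of thumb mentioned above.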
I'm not sure either since you said it trained fine on what you captured with camera-capture, which would lead one to believe it is related to the dataset
Yeah that’s the conclusion I came to too. I think we will try and relabel some of our images with CVAT and run a test model with those and see if it trains properly. If it does then we will just have to relabel all of our images with CVAT.
OK gotcha - I have used CVAT in the past for this. If you have another machine with more memory capable of running PyTorch, you can do the training there too.