Comments (24)
Thank you very much.
from asl.
three observations to start with:
for testing convergence, use smaller resolution (224) and larger batch size (128)
i don't see any augmentations in your training code.
use RandAugment or AutoAugment at least + cutout
also, something weird with your scheduler:
scheduler = lr_scheduler.OneCycleLR(optimizer, max_lr = 0.0002, total_steps = total_step, epochs = 25)
it's hardcoded to 25 epochs, yet you loop over only 5 epochs
add epochs as a hyperparameter to the arg list, and use it everywhere instead of hard-coded numbers. search for other hyperparameters that belong in the arg list as well
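The fix being described could be sketched like this (argument names and the steps-per-epoch value are illustrative, not taken from the repo):

```python
import argparse
import torch
from torch import nn, optim
from torch.optim import lr_scheduler

parser = argparse.ArgumentParser()
parser.add_argument('--epochs', type=int, default=25)      # was hard-coded in two places
parser.add_argument('--batch-size', type=int, default=128)
parser.add_argument('--lr', type=float, default=2e-4)
args = parser.parse_args([])  # empty list only for demonstration

model = nn.Linear(10, 80)     # stand-in for the real model
optimizer = optim.Adam(model.parameters(), lr=args.lr)

steps_per_epoch = 100         # len(train_loader) in real code
scheduler = lr_scheduler.OneCycleLR(
    optimizer, max_lr=args.lr,
    epochs=args.epochs, steps_per_epoch=steps_per_epoch)
```

The training loop should then iterate `for epoch in range(args.epochs)`, so the loop and the scheduler always agree on the schedule length.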
from asl.
p.s. 1
also for testing and prototyping, use tresnet_m
p.s. 2
also you need to implement "true weight decay" (i.e., not applying weight decay to biases and batch-norm parameters)
p.s. 3
i will probably notice other problems in the future, but we need to start from somewhere :-)
from asl.
Thanks for your comment. I have tried to follow your instructions, and the new train.py is available at https://github.com/GhostWnd/reproducingASL; the newest one is train_ver3.py. I will try to run it and report later.
I have run train_ver3.py for around 600 iterations, and training appears to be much slower than at the beginning: at first each iteration took 3 seconds, but after 600 iterations each iteration takes 9 seconds. It puzzles me; I doubt whether I have implemented the code correctly.
from asl.
i will take a look at the code and try to run it when i have the time.
good work so far. i think with joint forces we are on our way to finally having a modern multi-label codebase for the community to use; the vast majority of repos that exist are way outdated.
several more corrections and suggestions:
args.do_bottleneck_head = False (not True)
one more correction: you are using the 2017 split. while this is not a "mistake" (and your results will be a little higher), in papers people use the 2014 split.
what about mixed precision? with modern pytorch it is a few lines of code ("with autocast():" ...)
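A minimal sketch of the autocast pattern (model, data, and loss are stand-ins; GradScaler is only enabled on CUDA, so the same code runs unchanged on CPU):

```python
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

model = nn.Linear(10, 80)                       # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
criterion = nn.BCEWithLogitsLoss()
scaler = GradScaler(enabled=torch.cuda.is_available())

inputs = torch.randn(4, 10)                     # stand-in batch
targets = torch.randint(0, 2, (4, 80)).float()

optimizer.zero_grad()
with autocast(enabled=torch.cuda.is_available()):
    loss = criterion(model(inputs), targets)    # forward runs in fp16 on GPU
scaler.scale(loss).backward()                   # scale to avoid fp16 gradient underflow
scaler.step(optimizer)
scaler.update()
```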
to improve speed, you don't have to update the EMA every iteration. you can update it every ~5 iterations with an adjusted decay rate, and still get similar results.
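The idea can be sketched like this (a minimal illustration, not the repo's EMA class; the decay adjustment uses the decay**k heuristic, since k per-step updates with decay d give the old state the same geometric weight as one update with d**k):

```python
import copy
import torch
from torch import nn

class ModelEma:
    """Exponential moving average of model weights, updated every `every` steps."""
    def __init__(self, model, decay=0.999, every=5):
        self.ema = copy.deepcopy(model).eval()
        for p in self.ema.parameters():
            p.requires_grad_(False)
        # one update per `every` steps with decay**every keeps roughly the
        # same averaging horizon as updating each step with `decay`
        self.decay = decay ** every
        self.every = every
        self.step = 0

    @torch.no_grad()
    def update(self, model):
        self.step += 1
        if self.step % self.every != 0:
            return  # skip most iterations to save time
        for e, m in zip(self.ema.state_dict().values(),
                        model.state_dict().values()):
            if e.dtype.is_floating_point:
                e.mul_(self.decay).add_(m, alpha=1.0 - self.decay)
            else:
                e.copy_(m)  # integer buffers (e.g. num_batches_tracked)
```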
load a pretrained model, run only inference, and make sure you reproduce the article results (after switching to 2014 split)
make sure, especially in validation, that you are not building enormous vectors over the course of training that clog RAM. sometimes it's better to pre-allocate memory if you need to store large vectors
you have not implemented true WD correctly. this is not AdamW.
see example for true WD in:
https://github.com/rwightman/pytorch-image-models/blob/198f6ea0f3dae13f041f3ea5880dd79089b60d61/timm/optim/optim_factory.py
(def add_weight_decay...)
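A sketch in the spirit of that timm helper (not a verbatim copy): split the parameters into two optimizer groups, so that 1-D parameters (biases, BatchNorm/LayerNorm weights) get zero weight decay.

```python
from torch import nn

def add_weight_decay(model, weight_decay=1e-4, skip_list=()):
    """Split parameters so biases and norm weights get no weight decay."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # 1-D params (biases, BatchNorm/LayerNorm weights) are excluded
        if param.ndim <= 1 or name.endswith(".bias") or name in skip_list:
            no_decay.append(param)
        else:
            decay.append(param)
    return [{'params': no_decay, 'weight_decay': 0.0},
            {'params': decay, 'weight_decay': weight_decay}]
```

Usage would be along the lines of `optimizer = torch.optim.Adam(add_weight_decay(model, 1e-4), lr=args.lr)` — note that plain Adam with a `weight_decay` argument (and even AdamW) decays every parameter, which is exactly what this avoids.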
from asl.
Thank you for your comment, I will try to correct it.
And if possible, could you please share the loss curve from when you run my code? Raw data is best.
Thank you very much.
I have tried to correct the true WD; it's now train_ver4.py, available at https://github.com/GhostWnd/reproducingASL
from asl.
Hi GhostWnd
I took a deeper look at the code. there are several major problems there.
make sure you understand what the problem is in each and every one of them, and apply the proper corrections. don't skip a single one.
most of these problems are "deal-breakers".
after correcting all of them, repeat your runs, and we can compare results.
I hope i will have some results to compare by then (if i don't find more bugs)
don't get discouraged, we are making progress, and sometimes the journey is more educational than the destination.
problems:
- currently not using randaugment (commented out in train_loader)
- using an uninitialized model (for training and comparison to the article, you should initialize the model from the relevant imagenet model: https://github.com/Alibaba-MIIL/TResNet/blob/master/MODEL_ZOO.md)
- using the 2017 coco split is wrong (use the 2014 coco split instead; only the json files differ)
- Cutout(n_holes = 1, length = 16) -> Cutout(n_holes = 1, length = args.image_size/2)
- validation should be done once per epoch, no more and no less
- preds.append(output.cpu()) targets.append(target.cpu()) ->
preds.append(output.cpu().detach())
targets.append(target.cpu().detach())
- mAP_score = validate_multi(val_loader, model, args, ema)
->
model.eval()
mAP_score = validate_multi(val_loader, model, args, ema)
model.train()
- calculate only the mAP metric. remove the other metrics; they are only confusing during training
from asl.
just to give you motivation, i got a good score last night when running a corrected code...
from asl.
Thank you for your comment and effort, I will try to correct the code and run it.
Thank you very much.
I have tried to fix the problems you mentioned; my code is train_ver5.py, available at https://github.com/GhostWnd/reproducingASL
Other than train_ver5.py, I also edited helper_functions.py to allow me to use the 2014 json to train on the 2017 data.
Here is the change:
path = coco.loadImgs(img_id)[0]['file_name']
img = Image.open(os.path.join(self.root, path)).convert('RGB')
->
path = coco.loadImgs(img_id)[0]['file_name']
path = path.split('_')[-1] # strip the 'COCO_<split>2014_' prefix so names match the 2017 files
img = Image.open(os.path.join(self.root, path)).convert('RGB')
When I try to use the 2014 json to train on the 2017 data, it seems that during validation there are some images in the 2014 validation set that are not in the 2017 validation set. I would like to know: does the difference between the 2014 and 2017 splits affect the result much?
Thank you very much.
from asl.
Sorry to bother you again. I know that due to commercial issues you can't release your training code, but could you release the code you corrected based on my train.py?
If that is not possible, could you please release the loss record of your corrected code based on my training code, so that I can compare the result myself?
Thank you very much.
from asl.
Hi GhostWnd
there were other problems in the code.
The two major ones:
- sigmoid was applied twice (!) - once in the direct prediction, and a second time inside the loss.
- EMA was not performed correctly (it's a separate model that needs its own separate validation)
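The double-sigmoid bug is worth spelling out: ASL, like BCEWithLogitsLoss, applies sigmoid internally, so the model must output raw logits. A minimal demonstration, using BCEWithLogitsLoss as a stand-in for the ASL loss:

```python
import torch
from torch import nn

criterion = nn.BCEWithLogitsLoss()  # applies sigmoid internally, as ASL does
logits = torch.randn(4, 80)         # stand-in raw model outputs
targets = torch.randint(0, 2, (4, 80)).float()

correct_loss = criterion(logits, targets)                # pass raw logits
buggy_loss = criterion(torch.sigmoid(logits), targets)   # sigmoid applied twice
```

The buggy version squashes every input into [0, 1] before the loss applies its own sigmoid, so the effective probabilities are compressed toward 0.5 and the gradients are flattened.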
anyway, this code fully reproduces the article results (i think it even surpasses them):
train_asl_reproduce.zip
i will attach logs for 224 and 448 trainings later
you are welcome to test it yourself and give me feedback.
thanks for the collaboration, together we will release the first publicly available modern multi-label code
:-)
from asl.
Thank you so much! :-)
I will upload the train file to make it publicly available, test it myself, and give feedback to you as soon as possible.
from asl.
this is an example log file (notice - resolution 224, mtresnet)
mtresnet_224.txt
from asl.
do you have any objection to me also adding the code to
https://github.com/Alibaba-MIIL/ASL ?
i think it will help it gain more traction. there are very few (zero) modern multi-label code-bases like this, with top results.
i will of course share credit with you; i have made a lot of changes and enhancements to the code, but you provided the base implementation
from asl.
No objection, it's my pleasure, thank you very much.
from asl.
And I wonder whether you could put the model based on tresnet_m and input size 224 into your pretrained models at https://github.com/Alibaba-MIIL/ASL/blob/main/MODEL_ZOO.md?
I would like to adjust some hyperparameters to test their influence.
And apply it to other datasets as a pretrained model.
Thank you very much.
from asl.
i am not sure i fully understand your question.
models in
https://github.com/Alibaba-MIIL/ASL/blob/main/MODEL_ZOO.md
are standard imagenet models for downstream tasks. these are the models you should use to initialize training on COCO.
from asl.
Well, if I'm not mistaken:
models in ASL/blob/main/MODEL_ZOO.md are models trained on MSCOCO, at https://github.com/Alibaba-MIIL/ASL/blob/main/MODEL_ZOO.md
while models in TResNet/blob/master/MODEL_ZOO.md are standard imagenet models, at https://github.com/Alibaba-MIIL/TResNet/blob/master/MODEL_ZOO.md, right?
I just wonder whether you could upload the model you trained with tresnet_m and input size 224 to https://github.com/Alibaba-MIIL/ASL/blob/main/MODEL_ZOO.md, the ASL one
from asl.
Or could you please share the model you trained with tresnet_m and input size 224 with me?
I would like to adjust some hyperparameters to test their influence.
And apply it to other datasets as a pretrained model.
Thank you very much.
from asl.
just to be clear:
tresnet_m 224 model trained on MS-COCO ?
from asl.
yes, the one that produces the log file mtresnet_224.txt
from asl.
added to
https://github.com/Alibaba-MIIL/ASL/blob/main/MODEL_ZOO.md
from asl.
this is an example log file (notice - resolution 224, mtresnet)
mtresnet_224.txt
Can you attach logs for 448 resolution with tresnet_l using this training code? I found it hard to reproduce the 86.8 mAP result from the paper.
from asl.
@LOOKCC
run
https://github.com/Alibaba-MIIL/ASL/blob/main/train.py
from asl.