
Comments (8)

cgq5 commented:

Hi, everyone

I need to combine the MSCOCO, Flickr8k, and Flickr30k images for training my captioning model. Can someone help me out with a script?
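A minimal sketch of that kind of merge, assuming each dataset has already been dumped to a coco_raw.json-style list of {file_path, captions} entries (the source file names below are placeholders for your own paths):

# Merge several caption datasets into one coco_raw.json-style list
# for neuraltalk2's prepro.py. Assumption: each source JSON is already
# a list of {"file_path": ..., "captions": [...]} entries.
import json

sources = ["coco_raw.json", "flickr8k_raw.json", "flickr30k_raw.json"]

merged, next_id = [], 0
for path in sources:
    with open(path) as f:
        for entry in json.load(f):
            merged.append({
                "id": next_id,  # re-number so ids stay unique across datasets
                "file_path": entry["file_path"],
                "captions": entry["captions"],
            })
            next_id += 1

with open("merged_raw.json", "w") as f:
    json.dump(merged, f)
print("wrote %d images" % len(merged))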


SaMnCo commented:

@cgq5, I am interested in this. I have been looking into it and started coding some stuff. I think I'll have some results by Christmas.

Steps I am doing:

  • From your dataset, create the equivalent of val.json or train.json.
  • From there, update the IPython notebook to convert it into an sbu_raw.json (see the sketch below).
  • Run the training model from there as well.
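A rough sketch of that conversion step, assuming the standard SBU release layout of two parallel line files (one caption and one Flickr URL per line); the output mirrors the coco_raw.json shape, and the file names are the usual ones but treat them as placeholders:

# Turn the SBU caption/URL line files into an sbu_raw.json shaped
# like coco_raw.json. Assumption: the two files are line-aligned.
import json, os

with open("SBU_captioned_photo_dataset.urls") as f:
    urls = [line.strip() for line in f]
with open("SBU_captioned_photo_dataset.captions") as f:
    captions = [line.strip() for line in f]

out = []
for i, (url, cap) in enumerate(zip(urls, captions)):
    out.append({
        "id": i,
        "url": url,
        "file_path": os.path.join("sbu", "%08d.jpg" % i),  # where the download step will put it
        "captions": [cap],  # SBU has exactly one caption per image
    })

with open("sbu_raw.json", "w") as f:
    json.dump(out, f)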

I think what I will do is actually merge the MS Coco dataset and the SBU one (both are taken from Flickr), so that the dataset is even bigger.

Training a first model on MS COCO took me a little over 30 hours on a g2.2xlarge instance on AWS, for ~125k images. As this new one has ~1M of them, I guess the process will take about 10 to 15 days.

I am upgrading my docker containers to be able to do training as well. If you have a powerful setup (some nvidia boards in a rack), I'd be happy to collaborate with you to make this happen off cloud (AWS is kind of expensive...).

Question: what are the images about? Are they random images or carefully selected?


SaMnCo commented:

Maybe at some point we can do http://yahoolabs.tumblr.com/post/89783581601/one-hundred-million-creative-commons-flickr-images

:D


SaMnCo commented:

@cgq5 check out the attached script. It will download the first 100 images of the SBU dataset and create a file similar to coco_raw.json (of course you can change that to fetch as many images as you want).
It also checks whether images are missing and, if so, skips them. This is a little slow because Flickr doesn't return a 404; it just sends you a placeholder PNG saying the original image doesn't exist anymore (see the check sketched below).
Anyway, once you have that, you can essentially run prepro.py and then train.lua on it. I'll probably give it a go for testing once I find a machine with enough storage: if it all scales linearly, the .h5 files will be about 300 GB, pretty big, and 1M images at ~200 KB each is ~200 GB of images, not that small either, and expensive on the cloud.
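The missing-image check could look roughly like this, assuming you have saved one copy of Flickr's placeholder image yourself (e.g. by fetching a known-dead URL) as placeholder.png; identical bytes mean the original photo is gone:

# Detect Flickr's "photo unavailable" placeholder by comparing hashes.
# Assumption: placeholder.png is a locally saved copy of the placeholder.
import hashlib

def md5_of(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

PLACEHOLDER_MD5 = md5_of("placeholder.png")

def looks_missing(image_path):
    # identical bytes to the placeholder -> the original photo is gone
    return md5_of(image_path) == PLACEHOLDER_MD5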

build-dataset.txt


SaMnCo commented:

@cgq5 I moved forward a little with this. You can find an improved script in my fork: https://github.com/SaMnCo/neuraltalk2/tree/master/im2text

Use it with: ./build-dataset.sh /path/to/destination FIRST_PIC NB_PIC

Example: ./build-dataset.sh ~/Pictures/test 1 1000000
This will download ALL images, using 10 concurrent threads (you can change that in the script). If it fails at some point, you can restart it; it will not attempt to re-download existing images (see the sketch below).
I am using a weak test to check whether images are OK, but if you look in the code, you can enable the MD5 check pretty easily. It will slow things down a LOT, but will guarantee consistency.
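For reference, a rough Python equivalent of that download loop, with the same skip-existing behaviour and 10 worker threads; the function names here are illustrative, not taken from build-dataset.sh:

# Resumable concurrent downloader: skip files that already exist so a
# rerun picks up where it left off.
import os, urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch(url, out_dir):
    dest = os.path.join(out_dir, os.path.basename(url))
    if os.path.exists(dest):  # restartable: never re-download
        return dest
    try:
        urllib.request.urlretrieve(url, dest)
    except Exception:
        return None  # dead link; caller drops it
    return dest

def fetch_all(urls, out_dir, workers=10):  # 10 threads, as in the script
    os.makedirs(out_dir, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: fetch(u, out_dir), urls))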

The output will be a subfolder of the target dir with all the images, along with a JSON file compatible with eval.lua. You can then run prepro.py on it and you should be all right!
On my end, I am currently at around 500k images downloaded, hoping to get to the million by Christmas.

One key difference I noticed is that SBU has a single caption for each image, whereas MSCOCO has several. I'm not sure how that impacts quality, but it probably makes MSCOCO much more relevant, at least for its subset of the data.

I hope this helps,


gqcao commented:

@SaMnCo Good to hear that you've kicked off the work. I have trained on 300k SBU images since I put up this post, but the result is not satisfying. I think the main reason is that the SBU set has one text annotation per image, while COCO has around 5 annotations per image. Richer text surely helps in understanding images in this case.

Btw, around 10% of the total images are missing from the Flickr website. I contacted the author of the im2text project and got an alternative place to get the images. You can download them following this URL format:

http://tlberg.cs.unc.edu/vicente/im2text/0001/001.jpg
...
http://tlberg.cs.unc.edu/vicente/im2text/0001/999.jpg
...
http://tlberg.cs.unc.edu/vicente/im2text/0999/999.jpg
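For example, the full list of mirror URLs in that pattern can be generated like this (the exact folder and image ranges are an assumption based on the examples above; adjust them if the mirror layout differs):

# Generate the im2text mirror URLs: folders 0001..0999, each assumed
# to hold images 001.jpg..999.jpg (~1M URLs total).
BASE = "http://tlberg.cs.unc.edu/vicente/im2text"

def mirror_urls():
    for folder in range(1, 1000):
        for img in range(1, 1000):
            yield "%s/%04d/%03d.jpg" % (BASE, folder, img)

urls = list(mirror_urls())  # feed these to your downloader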

Lastly, I tried to fine-tune the pretrained COCO model online, and got some interesting results. All of this was trained on an NVIDIA K40 card from csc.fi. See this video: https://youtu.be/FmSsek5luHk

Don't hesitate to let me know if you need any of the related code.


SaMnCo commented:

Great, that is awesome! Thanks for sharing. I actually happen to have some code around it, see https://github.com/SaMnCo/dl-training-datasets

  • im2text: I have a builder and a downloader, as I store the images on my S3 account. However, as you mention, I am missing some of them, so I will definitely update the code with this information. For now, this has only 919,987 images in it. I also had a bug in my first run of the captioning script; it's running again as I write, and I will update the repo when it's done.
  • imagenet: same issue, many images are missing from Flickr and other services. I am downloading the whole thing, but it takes ages. I currently have about 4M images and their descriptions, and my script has run for about 10 days (I can't run at full speed as I do this from home and need bandwidth for other things :))
  • ms coco: is complete, and also provides ways to recreate source training sets for neuraltalk.

Note that my building scripts create a JSON that is formatted like:

{
  "captions": [
    "natalie listens for the ocean in a small seashell. this is at the beach on the north side of camano island."
  ],
  "id": "000000028267",
  "file_path": "/home/scozannet/Pictures/sbu-dataset/im2text/im2text_000000028267.jpg",
  "url": "http://static.flickr.com/3311/3229411919_eaf0ae261a.jpg",
  "image_id": "3229411919"
},

==> with the Flickr ID, you can query their API and collect more information about the picture (see the sketch below). I hope to enrich the dataset over time with additional comments, to improve the quality of training. I agree with your comment about the number of captions per training image: only 1 is poor, we need more data. One day I'll look at the 100M Flickr image dataset, but I don't have enough storage right now. Hopefully one day...
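A minimal sketch of that enrichment step using Flickr's flickr.photos.getInfo REST method; YOUR_API_KEY is a placeholder, and the response fields follow Flickr's documented JSON format but are worth double-checking:

# Look up extra metadata for an image given its Flickr photo id.
import json, urllib.parse, urllib.request

def flickr_info(image_id, api_key):
    params = urllib.parse.urlencode({
        "method": "flickr.photos.getInfo",
        "api_key": api_key,
        "photo_id": image_id,
        "format": "json",
        "nojsoncallback": 1,
    })
    with urllib.request.urlopen("https://api.flickr.com/services/rest/?" + params) as r:
        return json.load(r)

info = flickr_info("3229411919", "YOUR_API_KEY")  # id from the sample entry above
print(info["photo"]["title"]["_content"])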

Regarding your code, I'd be interested in knowing more about the settings you used to fine-tune. I am new to all this, and the amount of knowledge to ingest to understand what's going on is really big... I also trained and fine-tuned a model, but after a while it stopped improving. I assume that means it reached a maximum in quality with my settings, but I can't tell for sure.



gqcao commented:

Hi @SaMnCo, your code looks good! I also started a project page, which you can find here: https://github.com/cgq5/Video-Caption. I basically provide a piece of code to extract VGG-16 features and use them to select key frames; the captions of the key frames are then attached to the video (a sketch of the idea is below).
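A hypothetical sketch of the key-frame idea: given one VGG-16 feature vector per frame (e.g. fc7, 4096-d), keep a frame whenever it drifts far enough from the last kept one. The 0.3 threshold is an arbitrary illustration, not a value from the project:

# Select key frames by cosine distance between per-frame CNN features.
import numpy as np

def key_frames(features, threshold=0.3):
    """features: (n_frames, feat_dim) array; returns indices of kept frames."""
    keep = [0]
    for i in range(1, len(features)):
        a, b = features[keep[-1]], features[i]
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        if 1.0 - cos > threshold:  # far enough from the last key frame
            keep.append(i)
    return keep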

Regarding how to set the training parameters, it is quite empirical and usually guided by experience. If the loss seems to have reached a local optimum, you may want to wait a bit, as it can wobble and then continue to descend; but you may also want to stop early sometimes to avoid overfitting (see the sketch below). Simonyan et al. gave a talk recently about how to configure the network, which you can find here: http://image-net.org/tutorials/cvpr2015/recent.pdf. Also, an older paper by Bengio which I find quite interesting is here: http://arxiv.org/abs/1206.5533
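A toy illustration of that early-stopping advice: tolerate a short plateau ("wait a bit") before giving up, but stop once the validation score has not matched its best for a while. The patience value is an arbitrary example:

# Patience-based early stopping on a list of validation scores.
def should_stop(val_scores, patience=5):
    """val_scores: validation scores so far, higher is better."""
    if len(val_scores) <= patience:
        return False
    best = max(val_scores)
    # stop if no score in the last `patience` evaluations matched the best
    return best not in val_scores[-patience:]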

For the data, besides COCO, SBU, and ImageNet, there is a new multimodal dataset to be released by the end of the month at http://www.statmt.org/wmt16/multimodal-task.html. They promise more than 5 captions per image, plus an additional German translation. It could be interesting to use.

It seems we are the only ones interested in this topic, so I will close the thread here. Feel free to @ me in any of your repositories later!

