Comments (8)
Hi, everyone
I need to combine the MSCOCO, Flickr8k and Flickr30k images for training my captioning model. Can someone help me out with a script?
from neuraltalk2.
@cgq5, I am interested in this. I have been looking into it and started coding some stuff. I think I'll have some results by Christmas.
Steps I am doing:
- From your dataset, create the equivalent of val.json or train.json.
- From there, update the IPython notebook to convert it into an sbu_raw.json
- Run the training model from there as well.
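For reference, the first two steps above could be sketched roughly like this (field names `file_path`, `id`, `captions` are inferred from neuraltalk2's coco_raw.json; verify against your copy before running prepro.py):

```python
import json

def build_raw_json(entries, out_path):
    # One record per image: 'file_path', 'id', 'captions', matching
    # the coco_raw.json layout that prepro.py expects (field names
    # inferred from neuraltalk2's coco_raw.json -- verify locally).
    out = []
    for i, (path, captions) in enumerate(entries):
        out.append({"file_path": path, "id": i, "captions": list(captions)})
    with open(out_path, "w") as f:
        json.dump(out, f)
    return out

# Tiny demo: two images, one caption each (SBU-style)
raw = build_raw_json(
    [("imgs/0001.jpg", ["a dog on the beach"]),
     ("imgs/0002.jpg", ["a red bicycle"])],
    "sbu_raw_demo.json")
```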
I think what I will do is actually merge the MS Coco dataset and the SBU one (both are taken from Flickr), so that the dataset is even bigger.
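The merge itself could be as simple as concatenating the two raw JSON lists and reassigning the ids so they stay unique (a sketch, assuming the coco_raw.json-style list-of-dicts format; file names here are made up for the demo):

```python
import json

def merge_raw(paths, out_path):
    # Concatenate several *_raw.json files (COCO-style lists of image
    # records) and reassign ids so they stay unique across the merge.
    merged = []
    for p in paths:
        with open(p) as f:
            merged.extend(json.load(f))
    for i, rec in enumerate(merged):
        rec["id"] = i
    with open(out_path, "w") as f:
        json.dump(merged, f)
    return merged

# Tiny demo with two made-up raw files:
with open("coco_demo.json", "w") as f:
    json.dump([{"id": 0, "captions": ["a cat"]}], f)
with open("sbu_demo.json", "w") as f:
    json.dump([{"id": 0, "captions": ["a boat"]}], f)
merged = merge_raw(["coco_demo.json", "sbu_demo.json"], "merged_raw_demo.json")
```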
Training MS Coco for a first model just took me a little bit over 30 hours on a g2.2xlarge instance on AWS, for ~125k images. As this new one has 1M of them, I guess the process will take about 10 to 15 days.
I am upgrading my docker containers to be able to do training as well. If you have a powerful setup (some nvidia boards in a rack), I'd be happy to collaborate with you to make this happen off cloud (AWS is kind of expensive...).
Question: what are the images about? Are they random images, or carefully selected?
Maybe at some point we can do http://yahoolabs.tumblr.com/post/89783581601/one-hundred-million-creative-commons-flickr-images
:D
@cgq5 check out the attached script. It will download the first 100 images of the SBU dataset and create a file similar to coco_raw.json (of course you can update that to run on as many images as you want).
It also checks whether images are missing, and skips them if so. It's a little slow because Flickr doesn't give you a 404; it just sends you a PNG to download saying the original image doesn't exist anymore.
Anyway, once you have that, you can essentially run prepro.py then train.lua on it. I'll probably give it a go for testing once I find a machine with enough storage. If it all scales linearly, the .h5 files will be about 300GB in size, pretty big... and 1M images at ~200KB each is about 200GB of images, not that small either, and expensive on the cloud.
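The missing-image check could look roughly like this (a heuristic sketch, not the attached script itself: it assumes a dead Flickr .jpg link still answers HTTP 200 but with a PNG body, as described above):

```python
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"

def looks_like_placeholder(data, url):
    # Weak check: a dead Flickr link still returns HTTP 200, but the
    # body is a PNG "photo unavailable" image, so a .jpg URL whose
    # bytes carry the PNG magic number is treated as missing.
    # Heuristic only -- an MD5 match against the known placeholder
    # bytes would be stricter.
    return url.endswith(".jpg") and data.startswith(PNG_MAGIC)

print(looks_like_placeholder(PNG_MAGIC + b"rest", "http://static.flickr.com/1/2_x.jpg"))   # True
print(looks_like_placeholder(b"\xff\xd8\xff\xe0", "http://static.flickr.com/1/2_x.jpg"))   # False
```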
@cgq5 I did move forward a little bit with this. You can find an improved script in my fork https://github.com/SaMnCo/neuraltalk2/tree/master/im2text
use it with ./build-dataset.sh /path/to/destination FIRST_PIC NB_PIC
example: ./build-dataset.sh ~/Picture/test 1 1000000
This will download ALL images, using 10 concurrent threads (you can change that in the script). If it fails at some point, you can restart it; it will not attempt to re-download existing images.
I am using a weak test for checking if images are OK or not, but if you look in the code, you can enable the MD5 check pretty easily. This will slow down the code a LOT, but will guarantee consistency.
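The MD5 check mentioned above boils down to something like this (a sketch; the placeholder digest is hypothetical and would need to be computed once from a known-dead image):

```python
import hashlib

# Hypothetical digest -- compute it once from a known-dead image's
# placeholder PNG, then hard-code the real value here.
PLACEHOLDER_MD5 = "<md5-of-placeholder-png>"

def is_placeholder(data, placeholder_md5=PLACEHOLDER_MD5):
    # Strict consistency check: hash the downloaded bytes and compare
    # with the known placeholder digest. Much slower at scale than the
    # weak header test, but exact.
    return hashlib.md5(data).hexdigest() == placeholder_md5

# Demo with a made-up "placeholder" payload:
demo = b"placeholder-bytes"
print(is_placeholder(demo, hashlib.md5(demo).hexdigest()))  # True
```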
The output will be a subfolder of the target dir with all images, along with a json file compliant with the eval.lua. You can then do the prepro.py on it and you should be alright!
On my end, I am currently in the range of 500k images downloaded, hoping to get to the million by Christmas.
One key difference I noticed between SBU and MSCOCO is that SBU has a single caption for each image, whereas MSCOCO has several. I'm not sure how that impacts quality, but it probably makes MSCOCO much more relevant, at least for its subset of the data.
I hope this helps,
@SaMnCo Good to hear that you kicked off the work. I have trained on 300K SBU images since I put up this post, but the results are not satisfying. I think the main reason is that the SBU set has one text annotation per image, while COCO has around 5 annotations per image. Rich text surely helps in understanding images in this case.
Btw, around 10% of the total images are missing from the Flickr website. I contacted the author of the im2text project and got an alternative place to get the images. You can download them using the following URL format:
http://tlberg.cs.unc.edu/vicente/im2text/0001/001.jpg
...
http://tlberg.cs.unc.edu/vicente/im2text/0001/999.jpg
...
http://tlberg.cs.unc.edu/vicente/im2text/0999/999.jpg
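Mapping an image index to that mirror layout could be sketched like this (assumptions: 999 files per folder and 1-based numbering on both parts, inferred only from the example URLs above; verify against the actual listing):

```python
BASE = "http://tlberg.cs.unc.edu/vicente/im2text"

def mirror_url(index):
    # Map a 1-based image index to the mirror's folder/file pattern.
    # Assumes 999 files per folder, both parts 1-based (inferred from
    # the example URLs -- verify against the actual listing).
    folder, name = divmod(index - 1, 999)
    return "%s/%04d/%03d.jpg" % (BASE, folder + 1, name + 1)

print(mirror_url(1))     # .../0001/001.jpg
print(mirror_url(1000))  # .../0002/001.jpg
```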
Lastly, I tried online fine-tuning of the pretrained COCO model and got some interesting results. All of this was trained on an NVIDIA K40 card from csc.fi. See this video: https://youtu.be/FmSsek5luHk
Don't hesitate to let me know if you need any related code.
Great, that is awesome! Thanks for sharing. I actually happen to have some code around it, see https://github.com/SaMnCo/dl-training-datasets
- im2text: I have a builder and a downloader, as I store the images on my S3 account. However, as you mention, I am missing some of them; I will definitely update the code with this information. For now it holds only 919,987 images. I also had a bug in my first run of the captioning script; it's running again as I write, and I will update the repo when it's done.
- imagenet: same issue, many images are missing from Flickr and other services. I am downloading the whole thing, but it takes ages. I currently have about 4M images and their descriptions, and my script has run for about 10 days (I can't run at full speed as I do this from home and need bandwidth for other things :)).
- ms coco: is complete, and also provides ways to recreate source training sets for neuraltalk.
Note that my building scripts create a JSON that is formatted like:
{
  "captions": [
    " natalie listens for the ocean in a small seashell. this is at the beach on the north side of camano island."
  ],
  "id": "000000028267",
  "file_path": "/home/scozannet/Pictures/sbu-dataset/im2text/im2text_000000028267.jpg",
  "url": "http://static.flickr.com/3311/3229411919_eaf0ae261a.jpg",
  "image_id": "3229411919"
},
==> with the flickr ID, you can query their API and collect more information about the picture. I hope to enrich the dataset over time with additional comments, to improve the quality of training. I agree with your comment about the number of captions per training image. Only 1 is poor; we need more data. One day I'll look at the 100M Flickr image dataset, but I don't have enough storage right now. Hopefully one day...
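Recovering that flickr ID from a static URL is straightforward (a sketch, based on the one example URL stored in the JSON; the id is the filename part before the first underscore):

```python
from urllib.parse import urlparse

def flickr_image_id(url):
    # The numeric photo id is the part of the filename before the
    # first underscore in a static.flickr.com URL; it matches the
    # "image_id" field stored alongside "url" in the JSON.
    name = urlparse(url).path.rsplit("/", 1)[-1]
    return name.split("_", 1)[0]

print(flickr_image_id("http://static.flickr.com/3311/3229411919_eaf0ae261a.jpg"))  # 3229411919
```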
Regarding your code, I'd be interested in knowing more about the settings you used to fine-tune. I am new to all this, and the amount of knowledge to ingest to understand what's going on is really big... I also trained and fine-tuned a model, but after a while its score stopped increasing. I assume that means it reached a maximum in quality with my settings, but I can't tell for sure.
Hi @SaMnCo, your code looks good! I also started a project page, which you can find here: https://github.com/cgq5/Video-Caption. I basically provide a piece of code to extract VGG-16 features and use them to select key frames. The captions of the key frames are then attached to the video.
Regarding how to set the training parameters, it is quite empirical and usually comes down to experience. If the gradient reaches a local optimum, you may want to wait a bit, as it can escape and then continue to descend. But you may also want to stop early sometimes to avoid overfitting. Simonyan et al. gave a talk recently about how to configure the network; you can find it here: http://image-net.org/tutorials/cvpr2015/recent.pdf. Also, an older paper by Bengio which I find quite interesting is here: http://arxiv.org/abs/1206.5533
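To make the "stop early" idea concrete, here is one minimal patience-based check (a sketch, not code from either repo; the window size is arbitrary):

```python
def should_stop(val_scores, patience=3):
    # Stop when the last `patience` validation scores never beat the
    # best score seen before them -- a simple early-stopping rule.
    if len(val_scores) <= patience:
        return False
    best_before = max(val_scores[:-patience])
    return max(val_scores[-patience:]) <= best_before

print(should_stop([0.61, 0.64, 0.66, 0.66, 0.65, 0.66]))  # True: no gain in last 3 evals
print(should_stop([0.61, 0.64, 0.66, 0.67]))              # False: still improving
```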
For the data, besides COCO, SBU, and ImageNet, there is a new multimodal dataset to be released by the end of the month at http://www.statmt.org/wmt16/multimodal-task.html. They promise more than 5 captions per image, plus an additional German translation. It could be interesting to use.
It seems only the two of us are interested in this topic, so I will close the thread here. Feel free to @ me in any of your repositories later!