
sceneparsing's Introduction

Development Kit for MIT Scene Parsing Benchmark

[NEW!] Our PyTorch implementation is released in the following repository:

https://github.com/hangzhaomit/semantic-segmentation-pytorch

Introduction

Table of contents:

  • Overview of scene parsing benchmark
  • Benchmark details
    1. Image list and annotations
    2. Submission format
    3. Evaluation routines
  • Pretrained models

Please open an issue for questions, comments, and bug reports.

Overview of Scene Parsing Benchmark

The goal of this benchmark is to segment and parse an image into different image regions associated with semantic categories, such as sky, road, person, and bed. It is similar to the semantic segmentation tasks in the COCO and Pascal datasets, but the data is more scene-centric and covers a more diverse range of object categories. The data for this benchmark comes from the ADE20K Dataset (the full dataset will be released after the benchmark), which contains more than 20K scene-centric images exhaustively annotated with objects and object parts. Specifically, the benchmark data is divided into 20K images for training, 2K images for validation, and another batch of held-out images for testing. In total, 150 semantic categories are included in the benchmark for evaluation, covering stuff categories such as sky, road, and grass, as well as discrete objects such as person, car, and bed. Note that the distribution of objects across images is non-uniform, mimicking natural object occurrence in daily scenes.

The webpage of the benchmark is at http://sceneparsing.csail.mit.edu, where you can download the data.

Benchmark details

Data

There are three splits of data: training, validation, and testing. The training split contains 20,210 images and the validation split contains 2,000 images. The testing split contains 2,000 images and will be released in mid-August. Each image in the training and validation splits has an annotation mask indicating the label of each pixel in the image.

After untarring the data file (please download it from http://sceneparsing.csail.mit.edu), the directory structure should be similar to the following:

the training images:

images/training/ADE_train_00000001.jpg
images/training/ADE_train_00000002.jpg
    ...
images/training/ADE_train_00020210.jpg

the corresponding annotation masks for the training images:

annotations/training/ADE_train_00000001.png
annotations/training/ADE_train_00000002.png
    ...
annotations/training/ADE_train_00020210.png

the validation images:

images/validation/ADE_val_00000001.jpg
images/validation/ADE_val_00000002.jpg
    ...
images/validation/ADE_val_00002000.jpg

the corresponding annotation masks for the validation images:

annotations/validation/ADE_val_00000001.png
annotations/validation/ADE_val_00000002.png
    ...
annotations/validation/ADE_val_00002000.png

the testing images will be released in a separate file in mid-August. The directory structure will be: images/testing/ADE_test_00000001.jpg ...

Note: annotation masks contain labels ranging from 0 to 150, where 0 refers to "other objects". Those pixels are not considered in the evaluation.

objectInfo150.txt contains information about the 150 semantic categories, including their indices, pixel ratios, and names.
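For illustration, here is a minimal Python sketch (not part of the devkit) that reads one training annotation mask and prints the names of the categories it contains. It assumes objectInfo150.txt is tab-separated with a header row and the category name in the last column; adjust the parsing if your copy differs.

import csv
import numpy as np
from PIL import Image

# Build an index -> name lookup from objectInfo150.txt
# (assumed tab-separated columns, with the name last)
names = {}
with open('objectInfo150.txt') as f:
    reader = csv.reader(f, delimiter='\t')
    next(reader)                               # skip the header row
    for row in reader:
        names[int(row[0])] = row[-1]

# Read one annotation mask; label 0 ("other objects") is not evaluated
mask = np.array(Image.open('annotations/training/ADE_train_00000001.png'))
labels = np.unique(mask)
labels = labels[labels != 0]
print([names[int(l)] for l in labels])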

Submission format to the evaluation server

To evaluate an algorithm on the test set of the benchmark (link: http://sceneparsing.csail.mit.edu/eval/), participants are required to upload to the evaluation server a zip file containing the predicted annotation masks for the given testing images. The name of each predicted annotation mask should match the name of the corresponding testing image, with the filename extension png instead of jpg. For example, the predicted annotation mask for file ADE_test_00000001.jpg should be ADE_test_00000001.png.

Participants should check the zip file to make sure it can be decompressed correctly.
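As a rough illustration of the expected packaging (an assumed workflow, not an official script), a submission could be assembled as follows; predict() is a hypothetical placeholder for your own model and should return an HxW array of labels in 1..150.

import os
import zipfile
import numpy as np
from PIL import Image

os.makedirs('predictions', exist_ok=True)
with zipfile.ZipFile('submission.zip', 'w') as zf:
    for fname in sorted(os.listdir('images/testing')):
        if not fname.endswith('.jpg'):
            continue
        pred = predict(os.path.join('images/testing', fname))  # hypothetical model call
        out_name = fname[:-4] + '.png'           # same name, png extension instead of jpg
        out_path = os.path.join('predictions', out_name)
        Image.fromarray(pred.astype(np.uint8)).save(out_path)  # single-channel label mask
        zf.write(out_path, arcname=out_name)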

Interclass similarity

Some of the semantic classes in this dataset are visually and semantically similar to each other. To quantify these similarities, we provide a matrix of human-perceived similarities between the 150 categories in human_semantic_similarity.mat, which can be used when training segmentation models. demoSimilarity.m shows how to use that file.
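The matrix can also be inspected from Python; this is a minimal sketch (demoSimilarity.m is the official demo), and the variable name inside the .mat file is deliberately not assumed here.

import scipy.io as sio

data = sio.loadmat('human_semantic_similarity.mat')
# List the variables stored in the file (skipping MATLAB metadata keys)
print([k for k in data if not k.startswith('__')])
# sim = data['<name printed above>']   # the 150 x 150 human-perceived similarity matrix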

Evaluation routines

The performance of the segmentation algorithms will be evaluated by the mean of (1) pixel-wise accuracy over all labeled pixels and (2) IoU (intersection over union) averaged over all 150 semantic categories.

Intersection over Union = (true positives) / (true positives + false positives + false negatives)
Pixel-wise Accuracy = (correctly classified pixels) / (labeled pixels)
Final score = (Pixel-wise Accuracy + mean(Intersection over Union)) / 2
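A minimal NumPy sketch of these metrics (the reference implementation is demoEvaluation.m; this version assumes integer label maps with values 1..150, ignores label 0, and operates on a single prediction/ground-truth pair):

import numpy as np

def evaluate(pred, gt, num_classes=150):
    valid = gt > 0                                  # pixels labeled 0 are ignored
    pixel_acc = np.sum((pred == gt) & valid) / np.sum(valid)
    ious = []
    for c in range(1, num_classes + 1):
        p = (pred == c) & valid
        g = (gt == c) & valid
        union = np.sum(p | g)
        if union > 0:                               # skip classes absent from both masks
            ious.append(np.sum(p & g) / union)
    mean_iou = float(np.mean(ious))
    final_score = (pixel_acc + mean_iou) / 2
    return pixel_acc, mean_iou, final_score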

Demo code

In demoEvaluation.m, we have included our implementation of the standard evaluation metrics (pixel-wise accuracy and IoU) for the benchmark. As mentioned before, we ignore pixels labeled with 0's.

Please change the paths at the beginning of the code to evaluate your own results. If it runs correctly, you should see output similar to:

Mean IoU over 150 classes: 0.1000
Pixel-wise Accuracy: 100.00%

In this case, we will take (0.1+1.0)/2=0.55 as your final score.

We have also provided demoVisualization.m, which helps you to visualize individual image results.

Training code

We provide training code for three popular frameworks: Caffe, Torch7, and PyTorch (https://github.com/CSAILVision/sceneparsing/tree/master/trainingCode). You might need to modify the paths and the data loader code to get everything running on your own machine.

Pre-trained models

We release pre-trained models for scene parsing at http://sceneparsing.csail.mit.edu/model/. The demo code, along with the model download links, is at https://github.com/CSAILVision/sceneparsing/blob/master/demoSegmentation.m. The models may be used for research only. Details of how the models were trained are given in the reference below. The performance of the models on the validation set of MIT SceneParse150 is as follows:

[Table image: performance of the pre-trained models on the validation set]

The qualitative results of the models are below:

[Figure: example predictions from the pre-trained models]

Reference

If you find this scene parsing benchmark, the data, or the pre-trained models useful, please cite the following papers:

Scene Parsing through ADE20K Dataset. B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso and A. Torralba. Computer Vision and Pattern Recognition (CVPR), 2017. (http://people.csail.mit.edu/bzhou/publication/scene-parse-camera-ready.pdf)

@inproceedings{zhou2017scene,
    title={Scene Parsing through ADE20K Dataset},
    author={Zhou, Bolei and Zhao, Hang and Puig, Xavier and Fidler, Sanja and Barriuso, Adela and Torralba, Antonio},
    booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
    year={2017}
}

Semantic Understanding of Scenes through ADE20K Dataset. B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso and A. Torralba. arXiv:1608.05442. (https://arxiv.org/pdf/1608.05442.pdf)

@article{zhou2016semantic,
  title={Semantic understanding of scenes through the ade20k dataset},
  author={Zhou, Bolei and Zhao, Hang and Puig, Xavier and Fidler, Sanja and Barriuso, Adela and Torralba, Antonio},
  journal={arXiv preprint arXiv:1608.05442},
  year={2016}
}

sceneparsing's People

Contributors

hangzhaomit, xavierpuigf, zhoubolei


sceneparsing's Issues

Class that corresponds to each color

For starters, great work!
I want to find out which classes are present in an image, but I am new to both neural networks and Matlab code, and I have a hard time understanding how to use sceneparsing/visualizationCode/.
How can I do that?
Thank you

How to create more training data with some extra classes

Hi,
I have been trying to create more training data for objects that are not accurately detected, in order to improve the segmentation, but I can't figure out which color encoding to use to get a single-channel mask like yours. I have tried passing annotations colored with the encoding given in color150.mat, but the color encoding used in the original ADE20K dataset annotations is different.
In the images below, the floor color encodings differ. The green floor gives the single-channel input as required, but the brown floor gives only a black mask. Can anyone tell me how to get the correct color encoding to pass through https://github.com/CSAILVision/sceneparsing/blob/master/convertFromADE/convertFromADE.m to get annotations?

[Attached images: 6, ADE_train_00000196_seg]

Failed to reproduce DilatedNet performance

Hi,

I am trying to reproduce DilatedNet.

However, my training results show that
pixel acc : 72.4%
mean acc: 38.6%
mean iou: 28.7%.

Further training does not show improvement.

I am using a pre-trained net and multiple GPUs with a mini-batch size of 8. I did not use augmentation, as the paper does not explain what augmentations are used. I expect augmentation only affects the results by a small amount; otherwise you would probably have described it in the paper.

(1) Could you explain what augmentations are used and how much does it improve results?

(2) Could you provide training and validation log files?

Thank you so much.

Class to Color Correspondence

Hi
I wish to relabel the indoor images of the ADE20K dataset, with its 150 class labels, into a smaller number of categories, e.g. floor, wall, furniture, person, stairs, etc. I am having a hard time finding a file that provides the class-to-color correspondence. I would be very grateful if anyone could help me with that so I can proceed with relabelling. Thanks.

Mirrored Data File Download?

Hi,

This isn't exactly related to the code, but I cannot download the train/val data from the website. I'm getting interruptions and corrupted zip files. Is there a mirror for the data?

Thanks

Model does not learn on training(very high CE)

Hi, I trained the model using the given code twice (once with images re-scaled to 384 by 384, bicubic for images and nearest-neighbor for annotations, and once without scaling). I trained for around 150,000 iterations. But in both cases, when I run inference on the validation images with the trained snapshot weights, I get blank images. Also, during training the cross entropy is very high the whole time (~600000) and does not seem to come down at all. Did you use the same settings given in solver_FCN, specifically base_lr: 1e-10? Are there any other tricks needed to train the model? Right now the predictions are completely blank, and with such a high CE it is obvious that the model is not learning anything.

Note: with the pre-trained weights you provided, I can reproduce your results and get 71.95% pixel accuracy using the FCN model. Only the training part does not seem to work. I also tried initializing all layers before fc6 with the pre-trained VGG-16 weights. Any pointers highly appreciated.
Thanks!

ADE20k classes

So I have a question about the ADE20k itself.

I read all the seg-masks from the training set (~15k files) and counted the number of unique class values. I got 2231 unique values, where the highest value is 3144. This makes no sense as the number of classes is supposed to be 150.

I'm using this code to load the *_seg.png files in Python (adapted from the Matlab code on the dataset site):

import numpy as np
from PIL import Image

# mask_path points to an ADE20K *_seg.png file
mask = np.array(Image.open(mask_path), dtype=np.uint16)
R, G, B = mask[:, :, 0], mask[:, :, 1], mask[:, :, 2]
# ADE20K full-dataset encoding: class index = R/10 * 256 + G
class_mask = R // 10 * 256 + G

An error in demoSegmentation.m

When I execute this code, I get an error:

[libprotobuf ERROR google/protobuf/text_format.cc:274] Error parsing text-format caffe.NetParameter: 5: 15 Message type "caffe.LayerParameter" has no field named "input_param".

I know this looks like a Caffe error, but when I change the deploy prototxt as below, the error goes away.
input: "data"
input_dim: 1
input_dim: 3
input_dim: 513
input_dim: 513

But I have another question. If I want to use the prototxt with CRF, how should I set up the prototxt? There is a special parameter "data_dim":
layer {
name: "data"
type: "ImageSegData"
top: "data"
top: "label"
top: "data_dim"
......................

I hope you can help, thank you!

Reference for Cascade-SegNet?

As we all know, SegNet is publicly available. However, Cascade-SegNet and Cascade-DilatedNet are both reported as state-of-the-art. Can someone please explain the difference between SegNet and Cascade-SegNet?

Preprocessing script for torch

Is there a pre-processing script for the Torch training code that you can make publicly available, i.e. for generating the h5 file and the json files?

Failing to train models

I've been having a bit more trouble than I bargained for with these models that are intended to work out of the box, specifically with the AdeSegDataLayer. I think I almost have it, but I am getting the following error:

I0623 00:19:42.193922 27604 layer_factory.hpp:77] Creating layer data
I0623 00:19:42.639711 27604 net.cpp:100] Creating Layer data
I0623 00:19:42.639730 27604 net.cpp:408] data -> data
I0623 00:19:42.639760 27604 net.cpp:408] data -> label
I0623 00:19:43.050173 27604 net.cpp:150] Setting up data
I0623 00:19:43.050217 27604 net.cpp:157] Top shape: 1 3 1944 2592 (15116544)
I0623 00:19:43.050227 27604 net.cpp:157] Top shape: 1 1 1944 2592 3 (15116544)
I0623 00:19:43.050235 27604 net.cpp:165] Memory required for data: 120932352
I0623 00:19:43.050258 27604 layer_factory.hpp:77] Creating layer data_data_0_split
I0623 00:19:43.050281 27604 net.cpp:100] Creating Layer data_data_0_split
I0623 00:19:43.050292 27604 net.cpp:434] data_data_0_split <- data
I0623 00:19:43.050312 27604 net.cpp:408] data_data_0_split -> data_data_0_split_0
I0623 00:19:43.050330 27604 net.cpp:408] data_data_0_split -> data_data_0_split_1
I0623 00:19:43.050793 27604 net.cpp:150] Setting up data_data_0_split
I0623 00:19:43.050809 27604 net.cpp:157] Top shape: 1 3 1944 2592 (15116544)
I0623 00:19:43.050817 27604 net.cpp:157] Top shape: 1 3 1944 2592 (15116544)
I0623 00:19:43.050822 27604 net.cpp:165] Memory required for data: 241864704
I0623 00:19:43.050829 27604 layer_factory.hpp:77] Creating layer conv1_1
I0623 00:19:43.050853 27604 net.cpp:100] Creating Layer conv1_1
I0623 00:19:43.050859 27604 net.cpp:434] conv1_1 <- data_data_0_split_0
I0623 00:19:43.050871 27604 net.cpp:408] conv1_1 -> conv1_1
I0623 00:19:43.700464 27604 net.cpp:150] Setting up conv1_1
I0623 00:19:43.700536 27604 net.cpp:157] Top shape: 1 64 2142 2790 (382475520)
I0623 00:19:43.700549 27604 net.cpp:165] Memory required for data: 1771766784
I0623 00:19:43.700593 27604 layer_factory.hpp:77] Creating layer relu1_1
I0623 00:19:43.700616 27604 net.cpp:100] Creating Layer relu1_1
I0623 00:19:43.700634 27604 net.cpp:434] relu1_1 <- conv1_1
I0623 00:19:43.700644 27604 net.cpp:395] relu1_1 -> conv1_1 (in-place)
I0623 00:19:43.701800 27604 net.cpp:150] Setting up relu1_1
I0623 00:19:43.701817 27604 net.cpp:157] Top shape: 1 64 2142 2790 (382475520)
I0623 00:19:43.701825 27604 net.cpp:165] Memory required for data: 3301668864
I0623 00:19:43.701958 27604 layer_factory.hpp:77] Creating layer conv1_2
I0623 00:19:43.701982 27604 net.cpp:100] Creating Layer conv1_2
I0623 00:19:43.701988 27604 net.cpp:434] conv1_2 <- conv1_1
I0623 00:19:43.702000 27604 net.cpp:408] conv1_2 -> conv1_2
F0623 00:19:43.704733 27604 blob.cpp:34] Check failed: shape[i] <= 2147483647 / count_ (2790 vs. 1740) blob size exceeds INT_MAX
*** Check failure stack trace: ***
    @     0x7f0da310bdaa  (unknown)
    @     0x7f0da310bce4  (unknown)
    @     0x7f0da310b6e6  (unknown)
    @     0x7f0da310e687  (unknown)
    @     0x7f0da3794b5e  caffe::Blob<>::Reshape()
    @     0x7f0da37e81d6  caffe::BaseConvolutionLayer<>::Reshape()
    @     0x7f0da37b618f  caffe::CuDNNConvolutionLayer<>::Reshape()
    @     0x7f0da375ec7c  caffe::Net<>::Init()
    @     0x7f0da375faf5  caffe::Net<>::Net()
    @     0x7f0da379bb9a  caffe::Solver<>::InitTrainNet()
    @     0x7f0da379cc9c  caffe::Solver<>::Init()
    @     0x7f0da379cfca  caffe::Solver<>::Solver()
    @     0x7f0da377d2b3  caffe::Creator_AdamSolver<>()
    @           0x40f4ae  caffe::SolverRegistry<>::CreateSolver()
    @           0x408504  train()
    @           0x405e6c  main
    @     0x7f0da1966f45  (unknown)
    @           0x406773  (unknown)
    @              (nil)  (unknown)
Aborted (core dumped)


I noticed that in the AdeSegDataLayer there doesn't appear to be any place to resize the data, but everywhere on the project page and in the evaluation scripts it looks as though the data is supposed to be 384x384. Could that be the cause? If so, why isn't the resize in the data layer, and more importantly, can you suggest a change to my data layer [attached] to do that resize properly? (I could shrink the smaller height dimension to 384 and then crop the width, or shrink the width to 384 and pad the height... which did you do?)

ade_layers.py.zip

How were the numbers under 'Ratio', 'Train', and 'Val' in objectInfo150.txt calculated?

I am trying to understand where 'Ratio', 'Train', and 'Val' come from in the objectInfo150.txt file.

Presumably 'Ratio' is the pixel ratio of each category over all the images. I tried to reproduce the number for the 'wall' category by 1) counting the number of pixels labelled '1' in each image, dividing by the total number of pixels in the image, and then averaging over the total number of images in the training/validation set separately/altogether; 2) similar to 1) but dividing by the sum of the total number of pixels in all images. Neither approach reproduces the number (around 0.1 off from 0.1576).

I guess the numbers under 'Train' and 'Val' are the instance counts for each category? For this I simply counted whether the category 'wall' is present in each image in the training and validation sets. Since 'wall' is a stuff category, I guess it is sufficient to just check existence. But the numbers also don't match (11588 vs. 11664, 1167 vs. 1172).

Where does my understanding go wrong? Thanks a lot!

'Out of Memory' When Training Own Model

Hi,

Recently I tried to use the Caffe training code to train my own FCN model, but an odd problem stopped me.

When I call the caffe binary like:

caffe train -gpu 0 -solver solver_FCN.prototxt

And then, I get the following error:

[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.

[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 566256237

I0620 14:34:02.241405 19481 net.cpp:744] Ignoring source layer fc6

I0620 14:34:02.241425 19481 net.cpp:744] Ignoring source layer fc7

I0620 14:34:03.245385 19481 solver.cpp:218] Iteration 0 (6.64999e-08 iter/s, 0.997409s/40 iters), loss = 1.27303e+06

I0620 14:34:03.245477 19481 solver.cpp:237] Train net output #0: loss = 1.27303e+06 (* 1 = 1.27303e+06 loss)

I0620 14:34:03.245496 19481 sgd_solver.cpp:105] Iteration 0, lr = 1e-10

I0620 14:34:20.222455 19481 solver.cpp:218] Iteration 40 (2.35623 iter/s, 16.9763s/40 iters), loss = 1.06719e+06

I0620 14:34:20.222501 19481 solver.cpp:237] Train net output #0: loss = 1.30907e+06 (* 1 = 1.30907e+06 loss)

I0620 14:34:20.222533 19481 sgd_solver.cpp:105] Iteration 40, lr = 1e-10

F0620 14:34:24.133394 19481 syncedmem.cpp:71] Check failed: error == cudaSuccess (2 vs. 0) out of memory

*** Check failure stack trace: ***

Aborted (core dumped)

I have checked my GPU and found no problem.

Has anyone else had the same problem?

Regards,

Inconsistent space in the file "convertFromADE/mapFromADE.txt"

In the file "convertFromADE/mapFromADE.txt", the characters between the first column and the second column vary at different rows. At some lines they are tabs and at some lines they are spaces. Thus, it would meet problems when users want to use split string function.
I think you'd better replace this file with consistent split characters. It wont be too much work.

How to calculate the probability that each pixel belongs to each class?

After I ran the code, I printed the array "imPred" after line 66 in "demoSegmentation.m":

% imPred = net.forward({im_inp});

The imPred{1} array is 384 * 384 * 151, so I expected to get the probability that each pixel belongs to each class, for instance 0.8, 0.53, 0.01, etc., i.e. values between 0 and 1.

However, the numbers I got from imPred{1} were values like -1.2331, 3.0104, -0.7758, 10.1961, etc., so I was wondering whether these numbers can be converted to probabilities, and if so, how to convert them.

Thank you.
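One possible conversion (an assumption about the network output, not confirmed by the authors): if the 151 values per pixel are raw class scores, a softmax over the class dimension turns them into values between 0 and 1 that sum to 1.

import numpy as np

def softmax(scores, axis=-1):
    # Subtract the per-pixel maximum for numerical stability
    e = np.exp(scores - scores.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# scores: the 384 x 384 x 151 array taken from imPred{1}
# probs = softmax(scores, axis=-1)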

Cannot reproduce the DilatedNet result using the provided solver, data layer, and training network

I tried to reproduce the result with the provided data layer, training network, and solver parameters, but failed to get results even close to the provided DilatedNet model [http://sceneparsing.csail.mit.edu/model/DilatedNet_iter_120000.caffemodel].

A test run on the validation images with the model at iteration 120000 gives me the following stats:

  • iteration 120000 overall accuracy 0.71458910259
  • iteration 120000 mean accuracy 0.321233999994
  • iteration 120000 mean IU 0.243954227299
  • iteration 120000 fwavacc 0.567641521075

However, the reported baseline performance is (73.6, 44.6, 32.3, 60.1)

I wonder what is going wrong and what I should do to obtain a matching result?
[The training images are resized to 384x384 and mirrored, to match the authors' setting.]

ADE20K testset

Hi! I didn't find the test set download link on the official website. Where can I download the test set? Looking forward to your reply. Thanks!

Scene names as classification labels

Hi,

I'd like to know if you have the scene names for the test data, so that I can evaluate my trained model with them. As far as I know, the original dataset (released here) has scene names for each training/validation image, but the test data that can be downloaded from the above site doesn't contain scene names.

I think the original dataset is novel because it has a wide variety of segmentation labels and each image also belongs to a scene. Therefore, I believe it would be great if you could provide us with the scene names as classification labels for each test image. Of course, it would be enough if we could evaluate classification performance against the test data. (That means you don't need to make the scene names of the test data public.)

If you could kindly consider it, I would be grateful. Also, if this post is not appropriate here, please close it.
Thanks,

Licencing for ADE20K Dataset

Hello!
Great work and thanks for releasing the dataset! I was curious: what is the license for the dataset?
Thanks!
