
deepfashion2's Introduction

DeepFashion2 Dataset


DeepFashion2 is a comprehensive fashion dataset. It contains 491K diverse images of 13 popular clothing categories from both commercial shopping stores and consumers. It has 801K clothing items in total, where each item in an image is labeled with scale, occlusion, zoom-in, viewpoint, category, style, bounding box, dense landmarks and per-pixel mask. There are also 873K Commercial-Consumer clothes pairs.
The dataset is split into a training set (391K images), a validation set (34k images), and a test set (67k images).
Examples of DeepFashion2 are shown in Figure 1.

Figure 1: Examples of DeepFashion2.

From (1) to (4), each row represents clothes images with different variations. In each row, we partition the images into two groups: the left three columns show clothes from commercial stores, while the right three columns are from customers. In each group, the three images indicate three levels of difficulty with respect to the corresponding variation. Furthermore, in each row, the items in the two groups are from the same clothing identity but from two different domains, that is, commercial and customer. Items of the same identity may have different styles such as color and printing. Each item is annotated with landmarks and masks.

Announcements

Download the Data

The DeepFashion2 dataset is available via the DeepFashion2 dataset link. You need to fill in the form to get the password for unzipping the files. Please refer to Data Description below for detailed information about the dataset.

Data Organization

Each image in each separate image set has a unique six-digit name such as 000001.jpg. A corresponding annotation file in JSON format is provided in the annotation set, such as 000001.json.
Each annotation file is organized as below:

  • source: a string, where 'shop' indicates that the image is from commercial store while 'user' indicates that the image is taken by users.
  • pair_id: a number. Images from the same shop and their corresponding consumer-taken images have the same pair id.
    • item 1
      • category_name: a string which indicates the category of the item.
      • category_id: a number which corresponds to the category name. In category_id, 1 represents short sleeve top, 2 represents long sleeve top, 3 represents short sleeve outwear, 4 represents long sleeve outwear, 5 represents vest, 6 represents sling, 7 represents shorts, 8 represents trousers, 9 represents skirt, 10 represents short sleeve dress, 11 represents long sleeve dress, 12 represents vest dress and 13 represents sling dress.
      • style: a number to distinguish between clothing items from images with the same pair id. Clothing items with different style numbers from images with the same pair id have different styles such as color, printing, and logo. In this way, a clothing item from shop images and a clothing item from user image are positive commercial-consumer pair if they have the same style number greater than 0 and they are from images with the same pair id.(If you are confused with style, please refer to issue#10.)
      • bounding_box: [x1,y1,x2,y2], where x1 and y1 are the coordinates of the upper-left corner of the bounding box, and x2 and y2 are the coordinates of the lower-right corner. (width=x2-x1; height=y2-y1)
      • landmarks: [x1,y1,v1,...,xn,yn,vn], where v represents the visibility: v=2 visible; v=1 occlusion; v=0 not labeled. We have different definitions of landmarks for different categories. The orders of landmark annotations are listed in figure 2.
      • segmentation: [[x1,y1,...,xn,yn],[...]], where [x1,y1,...,xn,yn] represents a polygon; a single clothing item may contain more than one polygon.
      • scale: a number, where 1 represents small scale, 2 represents modest scale and 3 represents large scale.
      • occlusion: a number, where 1 represents slight occlusion(including no occlusion), 2 represents medium occlusion and 3 represents heavy occlusion.
      • zoom_in: a number, where 1 represents no zoom-in, 2 represents medium zoom-in and 3 represents large zoom-in.
      • viewpoint: a number, where 1 represents no wear, 2 represents frontal viewpoint and 3 represents side or back viewpoint.
    • item 2
      ...
    • item n

Please note that 'pair_id' and 'source' are image-level labels. All clothing items in an image share the same 'pair_id' and 'source'.
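For quick orientation, here is a minimal Python sketch (the file path is just an example) that loads one annotation file and walks over its clothing items:

import json

anno_path = 'train/annos/000001.json'   # example path; adjust to your unzipped dataset

with open(anno_path, 'r') as f:
    anno = json.load(f)

source = anno['source']        # 'shop' or 'user' (image-level label)
pair_id = anno['pair_id']      # image-level label shared by all items in this image

# Every other key ('item1', 'item2', ...) describes one clothing item.
for key, item in anno.items():
    if key in ('source', 'pair_id'):
        continue
    x1, y1, x2, y2 = item['bounding_box']
    print(key, item['category_name'], item['category_id'], item['style'],
          'w=%d h=%d' % (x2 - x1, y2 - y1),
          '%d polygon(s)' % len(item['segmentation']))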

The definitions of landmarks and skeletons of the 13 categories are shown below. The numbers in the figure represent the order of landmark annotations of each category in the annotation file. A total of 294 landmarks covering 13 categories are defined.

Figure 2: Definitions of landmarks and skeletons.


We do not provide data in pairs. In the training set, images are organized with consecutive 'pair_id', mixing images from consumers and images from shops. (For example: 000001.jpg (pair_id: 1; from consumer), 000002.jpg (pair_id: 1; from shop), 000003.jpg (pair_id: 2; from consumer), 000004.jpg (pair_id: 2; from consumer), 000005.jpg (pair_id: 2; from consumer), 000006.jpg (pair_id: 2; from consumer), 000007.jpg (pair_id: 2; from shop), 000008.jpg (pair_id: 2; from shop), ...) A clothing item from a shop image and a clothing item from a consumer image form a positive commercial-consumer pair if they have the same style number, which is greater than 0, and they are from images with the same pair id; otherwise they are a negative pair. In this way, you can construct training positive pairs and negative pairs at instance level.
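A minimal sketch of this pair construction, assuming the annotations have already been flattened into per-item records (the record layout and function name are illustrative, not the official Match R-CNN sampling code):

import itertools
from collections import defaultdict

def build_pairs(records):
    """records: iterable of dicts with keys 'image_id', 'source' ('user'/'shop'),
    'pair_id', 'item', 'style', one dict per clothing item."""
    by_pair = defaultdict(lambda: {'user': [], 'shop': []})
    for r in records:
        if r['style'] > 0:                      # style 0 never forms a positive pair
            by_pair[r['pair_id']][r['source']].append(r)

    positives, negatives = [], []
    for pid, group in by_pair.items():
        for u, s in itertools.product(group['user'], group['shop']):
            if u['style'] == s['style']:
                positives.append((u, s))        # same pair_id, same style > 0
            else:
                negatives.append((u, s))        # same pair_id, different style
    # Items from different pair_ids are also negatives; sample them as needed.
    return positives, negatives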

As shown in the figure below, the first three images are from consumers and the last two images are from shops. These five images have the same 'pair_id'. Clothing items in orange bounding boxes have the same 'style': 1. Clothing items in green bounding boxes have the same 'style': 2. The 'style' of the other clothing items, whose bounding boxes are not drawn in the figure, is 0, and they cannot form positive commercial-consumer pairs. One positive commercial-consumer pair is the annotated short sleeve top in the first image and the annotated short sleeve top in the last image. Our dataset makes it possible to construct instance-level pairs in a flexible way.


Data Description

Training images: train/image
Training annotations: train/annos

Validation images: validation/image
Validation annotations: validation/annos

Test images: test/image

Each image in each separate image set has a unique six-digit name such as 000001.jpg, and a corresponding annotation file in JSON format is provided in the annotation set, such as 000001.json. We provide code to generate COCO-type annotations from our dataset in deepfashion2_to_coco.py. Please note that during evaluation, image_id is the integer value of the image name (for example, the image_id of image 000001.jpg is 1). The JSON files in json_for_validation and json_for_test are generated based on the above rule using deepfashion2_to_coco.py. In this way, you can generate ground-truth JSON files for evaluation of the clothes detection task and the clothes segmentation task, which are not listed in the DeepFashion2 Challenge.
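For example, the image_id used in a result file can be derived from the file name like this (a trivial sketch):

import os

def image_id_from_name(file_name):
    # '000001.jpg' -> 1
    return int(os.path.splitext(os.path.basename(file_name))[0])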

In the validation set, we provide image-level information in keypoints_val_information.json, retrieval_val_consumer_information.json and retrieval_val_shop_information.json. (In the validation set, the first 10,844 images are from consumers and the last 20,681 images are from shops.) For the clothes detection task and the clothes segmentation task, which are not listed in the DeepFashion2 Challenge, keypoints_val_information.json can also be used.

We provide keypoints_val_vis.json, keypoints_val_vis_and_occ.json, val_query.json and val_gallery.json for evaluation on the validation set. You can compute validation scores locally using the Evaluation Code and the above JSON files. You can also submit your results to the evaluation server in our DeepFashion2 Challenge.
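The official Evaluation Code is authoritative; as a rough sketch, COCO-style keypoint scoring with pycocotools would look like the following (the result file name is a placeholder, and for landmarks you would also need to supply the 294 per-landmark OKS sigmas, which pycocotools does not ship with):

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO('keypoints_val_vis.json')                 # ground truth (visible landmarks only)
coco_dt = coco_gt.loadRes('my_keypoint_results.json')    # your predictions in COCO result format

ev = COCOeval(coco_gt, coco_dt, iouType='keypoints')
# ev.params.kpt_oks_sigmas = ...  # 294 values; see the official evaluation code
ev.evaluate()
ev.accumulate()
ev.summarize()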

In the test set, we provide image-level information in keypoints_test_information.json, retrieval_test_consumer_information.json and retrieval_test_shop_information.json. (In the test set, the first 20,681 images are from consumers and the last 41,948 images are from shops.) You need to submit your results to the evaluation server in our DeepFashion2 Challenge.

Dataset Statistics

Table 1 shows the statistics of images and annotations in DeepFashion2. (For statistics of the released images and annotations, please refer to the DeepFashion2 Challenge.)

Table 1: Statistics of DeepFashion2.

            Train      Validation         Test               Overall
images      390,884    33,669             67,342             491,895
bboxes      636,624    54,910             109,198            800,732
landmarks   636,624    54,910             109,198            800,732
masks       636,624    54,910             109,198            800,732
pairs       685,584    query: 12,550      query: 24,402      873,234
                       gallery: 37,183    gallery: 75,347

Figure 3 shows the statistics of different variations and the numbers of items of the 13 categories in DeepFashion2.

Figure 3: Statistics of DeepFashion2.


Benchmarks

Clothes Detection

This task detects clothes in an image by predicting bounding boxes and category labels for each detected clothing item. The evaluation metrics are the bounding box's average precision AP, AP50 and AP75.

Table 2: Clothes detection trained on the released DeepFashion2 dataset and evaluated on the validation set.

AP AP50 AP75
0.638 0.789 0.745

Table 3: Clothes detection on different validation subsets, including scale, occlusion, zoom-in, and viewpoint.

Scale Occlusion Zoom_in Viewpoint Overall
small moderate large slight medium heavy no medium large no wear frontal side or back
AP 0.604 0.700 0.660 0.712 0.654 0.372 0.695 0.629 0.466 0.624 0.681 0.641 0.667
AP50 0.780 0.851 0.768 0.844 0.810 0.531 0.848 0.755 0.563 0.713 0.832 0.796 0.814
AP75 0.717 0.809 0.744 0.812 0.768 0.433 0.806 0.718 0.525 0.688 0.791 0.744 0.773

Landmark and Pose Estimation

This task aims to predict landmarks for each detected clothing item in an image. Similarly, we employ the evaluation metrics used by COCO for human pose estimation, calculating the average precision for keypoints (AP, AP50 and AP75), where OKS indicates the object landmark similarity.

Table 4: Landmark estimation trained on the released DeepFashion2 dataset and evaluated on the validation set.

AP AP50 AP75
vis 0.605 0.790 0.684
vis && hide 0.529 0.775 0.596

Table 5: Landmark estimation on different validation subsets, including scale, occlusion, zoom-in, and viewpoint. Results of evaluation on visible landmarks only and on both visible and occluded landmarks are shown in separate rows.

                    Scale                       Occlusion                   Zoom_in                     Viewpoint                              Overall
                    small   moderate  large     slight  medium  heavy       no      medium  large       no wear  frontal  side or back
AP (vis)            0.587   0.687     0.599     0.669   0.631   0.398       0.688   0.559   0.375       0.527    0.677    0.536               0.641
AP (vis & occ)      0.497   0.607     0.555     0.643   0.530   0.248       0.616   0.489   0.319       0.510    0.596    0.456               0.563
AP50 (vis)          0.780   0.854     0.782     0.851   0.813   0.534       0.855   0.757   0.571       0.724    0.846    0.748               0.820
AP50 (vis & occ)    0.764   0.839     0.774     0.847   0.799   0.479       0.848   0.744   0.549       0.716    0.832    0.727               0.805
AP75 (vis)          0.671   0.779     0.678     0.760   0.718   0.440       0.786   0.633   0.390       0.571    0.771    0.610               0.728
AP75 (vis & occ)    0.551   0.703     0.625     0.739   0.600   0.236       0.714   0.537   0.307       0.550    0.684    0.506               0.641

Figure 4 shows the results of landmark and pose estimation.

Figure 4: Results of landmark and pose estimation.


Clothes Segmentation

This task assigns a category label (including a background label) to each pixel in an item. The evaluation metric is the average precision, including AP, AP50 and AP75, computed over masks.

Table 6: Clothes segmentation trained on the released DeepFashion2 dataset and evaluated on the validation set.

AP AP50 AP75
0.640 0.797 0.754

Table 7: Clothes Segmentation on different validation subsets, including scale, occlusion, zoom-in, and viewpoint.

Scale Occlusion Zoom_in Viewpoint Overall
small moderate large slight medium heavy no medium large no wear frontal side or back
AP 0.634 0.703 0.666 0.720 0.656 0.381 0.701 0.637 0.478 0.664 0.689 0.635 0.674
AP50 0.811 0.865 0.798 0.863 0.824 0.543 0.861 0.791 0.591 0.757 0.849 0.811 0.834
AP75 0.752 0.826 0.773 0.836 0.780 0.444 0.823 0.751 0.559 0.737 0.810 0.755 0.793

Figure 5 shows the results of clothes segmentation.

Figure 5: Results of clothes segmentation.


Consumer-to-Shop Clothes Retrieval

Given a detected item from a consumer-taken photo, this task aims to search the commercial images in the gallery for the items corresponding to this detected item. In this task, top-k retrieval accuracy is employed as the evaluation metric. We emphasize retrieval performance while still considering the influence of the detector: if a clothing item fails to be detected, this query item is counted as missed.
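As an illustration of the metric (not the official evaluation code), top-k retrieval accuracy can be computed from per-query similarity scores roughly as follows; the data layout is assumed:

import numpy as np

def top_k_accuracy(similarities, gt_matches, k=20):
    """similarities: list of 1-D numpy arrays, one per query, scores over the gallery.
    gt_matches: list of sets of gallery indices that are correct for each query.
    Queries whose item was missed by the detector are passed with an empty score
    array and count as failures."""
    hits = 0
    for scores, gt in zip(similarities, gt_matches):
        if len(scores) == 0:              # detection missed -> counted as missed
            continue
        top_k = np.argsort(-scores)[:k]   # indices of the k highest-scoring gallery items
        if gt.intersection(top_k.tolist()):
            hits += 1
    return hits / len(similarities)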

Table 8: Consumer-to-Shop Clothes Retrieval trained on the released DeepFashion2 dataset, using the detected box, evaluated on the validation set.

Top-1 Top-5 Top-10 Top-15 Top-20
class 0.079 0.198 0.273 0.329 0.366
keypoints 0.182 0.326 0.416 0.469 0.510
segmentation 0.135 0.271 0.350 0.407 0.447
class+keys 0.192 0.345 0.435 0.488 0.524
class+seg 0.152 0.295 0.379 0.435 0.477

Table 9: Consumer-to-Shop Clothes Retrieval on different subsets of some validation consumer-taken images. Each query item in these images has over 5 identical clothing items in the validation commercial images. Results of evaluation on the ground truth box and on the detected box are shown in separate rows. The evaluation metric for the subset columns is top-20 accuracy.

                        Scale                       Occlusion                   Zoom_in                     Viewpoint                          Overall
                        small   moderate  large     slight  medium  heavy       no      medium  large       no wear  frontal  side or back    top-1   top-10  top-20
class (gt box)          0.520   0.630     0.540     0.572   0.563   0.558       0.618   0.547   0.444       0.546    0.584    0.533           0.102   0.361   0.470
class (det box)         0.485   0.537     0.502     0.527   0.508   0.383       0.553   0.496   0.405       0.499    0.523    0.487           0.091   0.312   0.415
pose (gt box)           0.721   0.778     0.735     0.756   0.737   0.728       0.775   0.751   0.621       0.731    0.763    0.711           0.264   0.562   0.654
pose (det box)          0.637   0.702     0.691     0.710   0.670   0.580       0.710   0.701   0.560       0.690    0.700    0.645           0.243   0.497   0.588
mask (gt box)           0.624   0.714     0.646     0.675   0.651   0.632       0.711   0.655   0.526       0.644    0.682    0.637           0.193   0.474   0.571
mask (det box)          0.552   0.657     0.608     0.639   0.593   0.555       0.654   0.613   0.495       0.615    0.630    0.565           0.186   0.422   0.520
pose+class (gt box)     0.752   0.786     0.733     0.754   0.750   0.728       0.789   0.750   0.620       0.726    0.771    0.719           0.268   0.574   0.665
pose+class (det box)    0.691   0.730     0.705     0.725   0.706   0.605       0.746   0.709   0.582       0.699    0.723    0.684           0.244   0.522   0.617
mask+class (gt box)     0.656   0.728     0.687     0.714   0.676   0.654       0.725   0.702   0.565       0.684    0.712    0.658           0.212   0.496   0.595
mask+class (det box)    0.610   0.666     0.649     0.676   0.623   0.549       0.674   0.655   0.536       0.648    0.661    0.604           0.208   0.451   0.542

Figure 6 shows queries with the top-5 retrieved clothing items. The first and seventh columns are images from customers with bounding boxes predicted by the detection module; the second to sixth columns and the eighth to twelfth columns show the retrieval results from the store.

Figure 6: Results of clothes retrieval.


Citation

If you use the DeepFashion2 dataset in your work, please cite it as:

@article{DeepFashion2,
  author = {Yuying Ge and Ruimao Zhang and Lingyun Wu and Xiaogang Wang and Xiaoou Tang and Ping Luo},
  title={DeepFashion2: A Versatile Benchmark for Detection, Pose Estimation, Segmentation and Re-Identification of Clothing Images},
  journal={CVPR},
  year={2019}
}

deepfashion2's People

Contributors

geyuying, switchablenorms


deepfashion2's Issues

What does key+class mean?


In consumer-to-shop retrieval, does key + class mean adding landmark skeletons and classes?
I don't know what 'key' refers to.

The number of pairs does not match.

When I tried to extract the pairs from the training set (using annotations with non-zero style and creating pairs with the same pair_id & style but different source), I got 394,661 pairs. However, the link indicates that there are 337,293 pairs, which is different from the number I got. I wonder if there is an issue in my pair-generation procedure.

Match R-CNN?

Is there any chance to get the Match R-CNN?
Or how difficult would it be to rebuild the net?

Retrieval evaluation

How do you evaluate the model on the retrieval task? Comparing each item of a query image with all the items in the gallery images means making 10,844 x 21,309 comparisons. Do you consider any pre-processing to reduce the number of inferences?

annFile not found

Hi,

I want to run the deepfashion2_test.py file but I cannot find the annFile. Which file should I use, please?

Where are 'val_query.json' and 'val_gallery.json'?

"We provide keypoints_val_vis.json, keypoints_val_vis_and_occ.json, val_query.json and val_gallery.json for evaluation of validation set. You can get validation score locally using ..."

In the README.md, it is mentioned that the 'val_query.json' and 'val_gallery.json' files are provided.
But I cannot find those two files either on GitHub or in the downloaded dataset.
Can you tell me how I can get those files?

Step-by-step procedure for running DeepFashion2 API

Hi All,

I am new to fashion (clothes detection and segmentation) datasets.

Can anyone help me with a step-by-step procedure for making predictions on new images using the DeepFashion2 API in Python?

Thank you for your support

Didn't get password for datasets

I filled in the form 2 days ago but didn't get the password for the datasets. Is there any problem with my answers to the form or with my email?

What is the pair_id for each item?

I'm trying to work with the dataset, let's say I have this annotation file

{
  'item2': {
    'segmentation': [
      [1, 2, 1, 17, 94, 58, 128, 2, 163, 2, 180, 86, 203, 173, 370, 149, 490, 81, 463, 1, 1, 2],
      [1, 2, 1, 17, 94, 58, 128, 2, 1, 2]
    ],
    'scale': 2,
    'viewpoint': 2,
    'zoom_in': 3,
    'landmarks': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 94, 58, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 180, 86, 2, 203, 173, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'style': 0,
    'bounding_box': [0, 0, 495, 179],
    'category_id': 1,
    'occlusion': 1,
    'category_name': 'short sleeve top'
  },
  'source': 'shop',
  'pair_id': 1811,
  'item1': {
    'segmentation': [
      [237, 160, 378, 153, 461, 92, 519, 214, 535, 348, 440, 428, 292, 420, 247, 309, 237, 160]
    ],
    'scale': 2,
    'viewpoint': 2,
    'zoom_in': 1,
    'landmarks': [237, 160, 1, 378, 153, 2, 461, 92, 1, 247, 309, 2, 292, 420, 2, 440, 428, 2, 535, 348, 2, 519, 214, 2],
    'style': 1,
    'bounding_box': [227, 86, 543, 455],
    'category_id': 9,
    'occlusion': 2,
    'category_name': 'skirt'
  }
}

Which pair_id belongs to which item? There are many cases where I see 5 or 6 items in an image, and it's really confusing to match the correct pair_id with the correct item.

Hope you can clarify this for me. Thanks

The annotations do not correspond to the paper

I have noticed, as well as read here, that the full dataset has not been released yet and the version currently available contains only half of all samples. That is why I tried to plot the graphs mentioned in Figure 3 myself. I observed a misalignment when it comes to scale; I have not been able to check the other ones. The pie charts for the different measurements are as follows:
[pie charts: scale]
The first one is made from the annotation files, so it just uses the pre-made scale labels. The second uses the bounding boxes/areas from the annotation files, and the third one uses the area of bounding boxes derived from the segmentation masks created by the pycocotools package. The second and third compare the area of the bounding box to the area of the image and split the images into 3 buckets according to the ratio: <10% area, 10%-40% area and >40% area.

Which of them is the correct approach? Why do the details in the annotations differ from the calculated ones?
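For reference, a sketch of the bounding-box-area heuristic described for the second pie chart (the thresholds are the ones quoted in this question; whether this matches the official 'scale' labels is exactly what is being asked):

def scale_bucket(bbox, img_w, img_h):
    """bbox: [x1, y1, x2, y2]. Returns 1 (small), 2 (modest) or 3 (large)
    based on the bbox-area / image-area ratio used in the question."""
    x1, y1, x2, y2 = bbox
    ratio = ((x2 - x1) * (y2 - y1)) / float(img_w * img_h)
    if ratio < 0.10:
        return 1
    elif ratio <= 0.40:
        return 2
    return 3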

Question about retrieval of commercial-consumer clothing items

I've seen a detailed description of how the consumer clothing item is detected in issue #14, but it's not entirely clear to me how the chosen detected consumer clothing item is used to retrieve the commercial clothing item from the gallery set. I hope I'm not misunderstanding something here, but to my understanding the process of choosing the detected consumer clothing item can be done in this way according to you:

First, a ground truth label will be assigned to each detected query clothing item according to its IoU with all the ground truth items. Then find out all detected items which are assigned the given ground truth label and are classified correctly. Finally select the detected item with the highest score among these detected items. The retrieved results of this selected query item will be evaluated.

However, later on you provide details merely about how you compare a retrieved commercial clothing gallery item and a ground truth corresponding gallery item:

If IoU between retrieved item from shop images and one of the ground truth corresponding gallery item is over the thresh(we set thresh as 0.5), the retrieved result is positive.

This does not describe how you used the chosen detected clothing item to retrieve the commercial clothing gallery item. If this is true, it would be much appreciated if you could provide such details.

Thanks in advance!

Would you please upload the dataset to a platform other than google drive?

Thank you for your work!

There are a lot of issues with having datasets on Google Drive in terms of getting the data into Colab, mostly due to Google Drive having download limits on files within the drive.

Would it be possible to upload the data to a platform that does not have as many restrictions as google drive?

For example, you can upload the dataset to Kaggle and set it to private. Then, share the link with the individuals who have submitted the form to strictly use the data for research purposes. It's essentially the same thing as the current method of sending the password to the zipped folders, but this way people won't have to deal with Google Drive restrictions.

deepfashion2_to_coco occlusion attribute

Sorry for the misunderstanding, I was not used to the coco dataset format. I thought the "iscrowd" attribute referred to overlapping objects, but apparently there is no such information in the coco dataset.

Some questions about the Match-Net.

  1. In the retrieval task, the paper mentions that 12 epochs are used for training. I wonder what the definition of 1 epoch is. Does it mean all image pairs? (i.e. 337,293 image pairs in #30)

  2. In #31 (also in #17), the details of the network show that a fixed number (i.e. 8) of proposals is used for each image in match net during training. However, sometimes the number of possible proposals is less than 8. In #31, the number of proposals is always 8. I wonder if some augmented proposals are used.

  3. If we use the mask features (after RoIAlign) for match net, the spatial resolution is 14x14 right? How to combine the bbox (spatial resolution 7x7) and mask (spatial resolution 14x14) RoI features?

  4. Is it possible to get the coefficients for the loss terms?

API error

When I run the deepfashion2_test.py file with Python 2.7, it says NameError: name 'annFile' is not defined.

How to visualize? (final)

Sorry for opening issues so frequently...

I had a similar issue before, here: #20

I thought that I had to train and visualize with Detectron.

I finally trained and visualized with Detectron and got accurate bounding boxes, but the categories are wrong,

e.g. skirt -> boat, shirt -> person.

I realized that using Detectron was the wrong approach,

so I read your repo carefully.

I think the steps are:

  1. make a COCO-type json with tools/deepfashion2coco.py
  2. train with main.py and make a color-splashed image (I think this is optional and its purpose is segmentation, right?)
  3. visualize with lib/visualize.py, but I cannot find how to use visualize.py

There are no arguments, no init, and no main ...

To visualize with Detectron it needs a pkl-type weight and yaml-type configs,

but Match R-CNN produces an h5-type weight and its own configs...

Would you please explain how to visualize the keypoints, segmentation, and bounding boxes?
e.g. which code can execute visualize.py, or how to transform the h5 weight and config into pkl and yaml format ...

Segmentation mask overlapping

Hi,

Greatly appreciate this work! I am trying to use this dataset for semantic segmentation, and I found that the segmentation annotations overlap across different categories, so the overlapping part does not belong to only one class, which may lead to multi-label classification.

Could you please share how you use the dataset to train a segmentation model and which model you have used?

Thanks!

How to visualize with result json file?

Hi. Thanks for sharing such nice code.

I'm a student studying CV and I found this repo on paperswithcode.

I want to visualize the results but I have no idea how.

I understand that the result json files contain the bbox, segm information, etc...

But I don't know how to visualize with them....

If you've already uploaded the visualization code, then I'm sorry that I haven't read it in detail.

Evaluation of retrieval task

Hi,

I have read the questions and answers in #14, but I still have another question regarding the evaluation of top-k accuracy. When we retrieve user garments from the shop items, do the shop items contain all the items from the shop images (including style 0), or do they just contain the items that appear over 5 times?

Some questions about the Match R-CNN in the paper

Hi! I have some questions about the design of Match R-CNN:

  1. What are the coefficient lambda_1 ~ lambda_5 in the loss function L?
  2. I would like to know the details to create pairs in training stage:
  • My understanding: Assume we have batch_size=16. Thus, we will have 8 instance pairs where 4 are positive pairs and 4 are negative pairs:
    • To form the positive pairs, we sample one user instance (source = 'user') with tuple (pair_id, style_id) from all images (style_id != 0) and then sample one commercial instance (source = 'shop') with the same tuple (pair_id, style_id).
    • To form the negative pairs, we sample one another user instance (source = 'user') with tuple (pair_id, style_id) from all images (style_id != 0) and then sample one commercial instance (source = 'shop') with the different tuple (pair_id', style_id').
    • We will use the 16 images which contain the 16 specified instances in training.
  • Afterwards, we train the Match R-CNN with bbox / mask / keypoint on all proposals and use the best one proposal (i.e. the proposal has the highest IoU with the ground-truth bbox of the instance in the pair) assigned to each instance in the pairs to obtain the feature and feed it into matching network.

Q1: Is my understanding correct? Do we use the same number of positive pairs and negative pairs?
Q2: In bbox / mask / keypoint training of Mask R-CNN/Match R-CNN, we optimize the loss on all selected proposals. However, for the matching network, do we use one proposal of each instance in the pairs or all proposals of each instance in the pairs? If we use all proposals, there will be many combinations of proposals for matching network (count = #proposals of user instance * #proposals of commercial instance).

Thank you for your great help!

How to determine the Threshold of Detector

The number of samples in the results of my detector is much larger than the number of pictures. How do you set the threshold of the detector so that samples are neither lost nor detected redundantly?

Image number error?

My unzipped image counts for the train/validation/test sets are:

(191961, 32153, 62629)

which is smaller than what is described in the paper. Is that my fault? I wonder why.

Discrepancy in the number of images in the zip vs. the README

Just downloaded and unzipped the dataset zip files. The training set has 191,961 images and the validation set has 32,153. The description in the README.md says: "The dataset is split into a training set (391K images), a validation set (34k images), and a test set (67k images)."

Is there going to be a second release with the remainder of the images?

How to select objects in consumer to shop?

If two clothing items are detected in an image, which item is the similarity search based on?
For example, if a consumer wears shorts with a short sleeve top, do you find similar images for the short sleeve top or for the shorts?

Question about pre-trained model?

Hi, nice work, I have to say. I have a question: will you make the pre-trained network accessible? Have a nice day :-).

How to train with entire dataset? Anyone mind sharing their trained model?

Running on Google Colab and training using Mask R-CNN. I converted the dataset into the COCO format using the given script, but it seems that it can only load about 32k images into Mask R-CNN before I run out of memory (I mainly suspect pycoco's loadAnns()).

I've been training with 32k images at a time (complete training with 32k images, retrain with 32k images part 2, retrain with other 32k images... etc). So far I've only done the first three 32k image batches, and it is obviously very time consuming as I have to keep retraining. Is there a way to circumvent this? Anyone mind sharing their trained model?

Question about the design of Match-Net and the features fed in.

  1. According to the paper, the feature extractor of match-net has 4-conv layers, one pooling layer and one fc layer. Are these layers:
    -- Conv1: 3x3 conv - 256 channels -> ReLU
    -- Conv2: 3x3 conv - 256 channels -> ReLU
    -- Conv3: 3x3 conv - 1024 channels -> ReLU
    -- Conv4: 3x3 conv - 1024 channels -> ReLU
    -- Pooling: GlobalAvgPool
    -- FC: 1024 to 256 channels (No ReLU)
    Besides, the similarity learning net have:
    -- Subtraction (output 256 channels)
    -- Element-wise square (output 256 channels)
    -- FC: 256 to 1 channels (No ReLU)
    -- Sigmoid function.
    Am I correct?

  2. In the mask head, it has the procedure:
    backbone -> RoI Pooling -> 4x conv (feature extractor) -> 1x deconv + 1 conv (predictor)
    So in the paper, for the experiments using mask features, the RoI features fed into the match net should be the features after RoI Pooling. Am I correct? Do we have individual RoI Pooling for match net or just re-use the RoI Pooled features from mask head?

ResFile not defined

Hello.
I have a problem with the ResFile path. Could you help me resolve it?
How can I find or create the ResFile for the eval task? Please help me.
Best regards.

Viewpoint missing in the train data

The training data annotations contain only 3 classes for viewpoint (1, 2, 3); however, there are 4 classes described in the paper. I assume that the labels should be as follows:

  • 1 - frontal
  • 2 - side
  • 3 - back
  • 4 - no wear

If so, even though there are some images in which the clothes are not worn by any person, they still belong to one of the first 3 classes.

Code to reproduce the issue:

Landmark prediction?

I have trained the model with main.py and checked the results with the Matterport Mask R-CNN demo.
My question is: I am getting the segmentation mask, but not the landmarks.
How do I get landmarks on the segmentation or object?

Different number of keypoints

Hi! A keypoint's location is modeled as a one-hot mask, and this is done with a Conv or ConvTranspose layer where output_channel_size = num_keypoints. I see it here: 1 2
But the number of keypoints in DeepFashion2 is different for different classes (whereas in the COCO dataset num_keypoints=17). How is this problem solved, @geyuying?

Coefficient lambda values not provided anywhere in the research paper or issues

I can't seem to find the values of the coefficient lambdas of the loss functions. Could you please provide these values? They don't appear in the research paper or any of the issues as far as I know, I've even looked at issue #14, but it's not there.

Also I got some follow-up questions from my issue #33 regarding the consumer-commercial clothing item retrieval:

  1. Specifically what is the evaluation of the chosen detected consumer clothing item based on? Is it based on the quote below? And so if the result of the proposed method in the below quote is positive, does that mean that the evaluation of the chosen detected consumer clothing item will be positive?

If IoU between retrieved item from shop images and one of the ground truth corresponding gallery item is over the thresh(we set thresh as 0.5), the retrieved result is positive.

  2. I still don't fully understand how the gallery item is retrieved. The matching network only outputs a similarity score; it doesn't retrieve anything. So do you compare the similarity scores of ALL gallery items against the chosen detected consumer clothing item, and choose the gallery item with the highest similarity score?

  3. Also, where are the gallery items in the validation dataset? Do you have to construct them? And does the gallery set contain ALL the commercial clothing items?

  4. Since you used the top-k accuracy metric for clothing retrieval, does that mean that you have to choose the k gallery items with the highest similarity score for a given chosen detected consumer clothing item?

which features are fed into the matching network?

Hey guys, fantastic work.

I have a question about the paper. You feed the output of ROIAlign into the matching network. I'm having trouble understanding figure 4. How is the input for the matching network of a single image an NxNx256 tensor? N is the number of garment classes, correct? The output of ROIAlign is either 7x7x256 or 14x14x256 (depending on if you take the bbox stream or mask stream). How are you getting NxN?

Thanks!

How do you form the positive or negative pairs?

Hi,
I have read your paper but I couldn't find how you form the positive/negative training data pairs. Is there any strategy for selecting pairs for training, such as a positive/negative pair quantity ratio?

Training code?

Can anyone provide training code, please?
Thanks in advance.

Commercial Usage To Build Models

Hi,

In the Google Form received via email, it is mentioned that the dataset is available only for non-commercial research purposes. Is there any commercial license available for the dataset if we need to use the data to create models for a commercial product?

Some questions on the paper

Thanks for this great work, seems like a valuable contribution to the computer vision community!

I have a few detailed questions on the paper, which I hope you could clarify (apologies if I missed something in the paper):

  1. There are 491K images. How many are consumer, and how many are commercial?
  2. The paper states that there are 801K items and 43.8K identities (ie, 18.3 items per identity on average). But the paper also states that "each identity has 12.7 items". I got confused here, shouldn't this be 18.3 instead?
  3. For the retrieval task, the metric is "top-K accuracy", but no exact definition is given. Is this the exact same definition as ImageNet case? Example: if a correct item is retrieved in position 9, my guess is that for this query the top-1/.../top-5/.../top-8 accuracies are zero and top-9/top-10/.../top-20 accuracies are 1. Is this correct?
  4. For the experiments in the paper, is the network trained from scratch, or is training started from an ImageNet/COCO checkpoint?
  5. "For the retrieval task, each unique detected clothing item in consumer-taken image with highest confidence is selected as query". What happens if the detector fails and selects some non-clothing region (false positive detection)? Is this false positive box query simply ignored in retrieval scoring?
  6. I am a little confused by Table 5 ("Consumer-to-Shop Clothes Retrieval"). Do the different variants/rows ("class", "pose", "mask", "pose+class", "mask+class") correspond to models where only some losses are active? For example: for row corresponding to "pose", does this mean that only Lpose and Lpair are used during training? If yes, then how are boxes detected for the retrieval experiment in this case?
  7. Other questions on Table 5: a few combinations seem to be missing: would you have "pose+mask" and "pose+mask+class" results?
  8. Would you have results where the detector and match networks are trained separately? Ie, a case where first a model detects the boxes, crops the original image, then a separate model extracts features from the cropped boxes and does the matching. I am wondering if this would work better given that small objects could be captured better with a resized crop being fed via a different network.

Thanks in advance for the clarifications!

Update deepfashion2_to_coco

import json

from PIL import Image
import numpy as np

dataset = {
    "info": {},
    "licenses": [],
    "images": [],
    "annotations": [],
    "categories": []
}

lst_name = ['short_sleeved_shirt', 'long_sleeved_shirt', 'short_sleeved_outwear', 'long_sleeved_outwear',
            'vest', 'sling', 'shorts', 'trousers', 'skirt', 'short_sleeved_dress',
            'long_sleeved_dress', 'vest_dress', 'sling_dress']

for idx, e  in enumerate(lst_name):
    dataset['categories'].append({
        'id': idx + 1,
        'name': e,
        'supercategory': "clothes",
        'keypoints': ['%i' % (i) for i in range(1, 295)],
        'skeleton': []
    })

num_images = 191961
sub_index = 0  # the index of ground truth instance
for num in range(1, num_images + 1):
    json_name = './train/annos/' + str(num).zfill(6) + '.json'
    image_name = './train/image/' + str(num).zfill(6) + '.jpg'

    if (num >= 0):
        imag = Image.open(image_name)
        width, height = imag.size
        with open(json_name, 'r') as f:
            temp = json.loads(f.read())
            pair_id = temp['pair_id']

            dataset['images'].append({
                'coco_url': '',
                'date_captured': '',
                'file_name': str(num).zfill(6) + '.jpg',
                'flickr_url': '',
                'id': num,
                'license': 0,
                'width': width,
                'height': height
            })
            for i in temp:
                if i == 'source' or i == 'pair_id':
                    continue
                else:
                    points = np.zeros(294 * 3)
                    sub_index = sub_index + 1
                    box = temp[i]['bounding_box']
                    w = box[2] - box[0]
                    h = box[3] - box[1]
                    x_1 = box[0]
                    y_1 = box[1]
                    bbox = [x_1, y_1, w, h]
                    cat = temp[i]['category_id']
                    style = temp[i]['style']
                    seg = temp[i]['segmentation']
                    landmarks = temp[i]['landmarks']

                    points_x = landmarks[0::3]
                    points_y = landmarks[1::3]
                    points_v = landmarks[2::3]
                    points_x = np.array(points_x)
                    points_y = np.array(points_y)
                    points_v = np.array(points_v)
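                    # 'case' holds cumulative landmark counts per category: category c
                    # owns keypoint slots case[c-1] .. case[c]-1 of the 294 total landmarks.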
                    case = [0, 25, 58, 89, 128, 143, 158, 168, 182, 190, 219, 256, 275, 294]
                    idx_i, idx_j = case[cat - 1], case[cat]

                    for n in range(idx_i, idx_j):
                        points[3 * n] = points_x[n - idx_i]
                        points[3 * n + 1] = points_y[n - idx_i]
                        points[3 * n + 2] = points_v[n - idx_i]

                    num_points = len(np.where(points_v > 0)[0])

                    dataset['annotations'].append({
                        'area': w * h,
                        'bbox': bbox,
                        'category_id': cat,
                        'id': sub_index,
                        'pair_id': pair_id,
                        'image_id': num,
                        'iscrowd': 0,
                        'style': style,
                        'num_keypoints': num_points,
                        'keypoints': points.tolist(),
                        'segmentation': seg,
                    })

json_name = '../deepfashion2.json'
with open(json_name, 'w') as f:
    json.dump(dataset, f)
