shuyu-xjtu / aptm

The official code of "Towards Unified Text-based Person Retrieval: A Large-scale Multi-Attribute and Language Search Benchmark"

Home Page: https://arxiv.org/abs/2306.02898

License: MIT License

Python 100.00%

aptm's Introduction

APTM

APTM (ACM MM 2023) is a new joint Attribute Prompt Learning and Text Matching Learning framework that exploits the knowledge shared between attributes and text. As the name implies, APTM contains an attribute prompt learning stream and a text matching learning stream.

We also present MALS, a large Multi-Attribute and Language Search dataset for text-based person retrieval, and explore the feasibility of pre-training on attribute recognition and image-text matching at the same time. In particular, MALS contains 1,510,330 image-text pairs, about 37.5× more than the prevailing CUHK-PEDES, and all images are annotated with 27 attributes.

Extensive experiments validate the effectiveness of pre-training on MALS: APTM achieves state-of-the-art retrieval performance on three challenging real-world benchmarks. In particular, APTM improves Recall@1 accuracy by +6.60%, +7.39%, and +15.90% on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets, respectively. More details can be found in our paper: Towards Unified Text-based Person Retrieval: A Large-scale Multi-Attribute and Language Search Benchmark

News

The OneDrive link for the MALS dataset has been released!
APTM and the MALS dataset have been released. Feel free to get in touch!

MALS

MALS leverages generative models to build a large-scale dataset of 1.5M image-text pairs. Each image-text pair in MALS is annotated with one corresponding description and several appropriate attribute labels, so MALS not only supports text-image matching and attribute prompt learning, but also makes it possible to explore pre-training for attribute recognition and image-text matching at the same time. The dataset is released at Baidu Yun [4kq0] and OneDrive [mals].

Note that MALS can only be used for research; any commercial usage is forbidden.

This is the comparison between MALS and other text-based person retrieval datasets.

These are examples of our MALS dataset and CUHK-PEDES.

Annotation format:

[{"image": "gene_crop/c_g_a_0/0.jpg",
"caption": "a young boy wearing a black hoodie leaning against a wall with his hands on his hips and his hands on his hips wearing jeans and a baseball cap",
"image_id": "c_g_a_0_0",
"label": [1, 0, ..., 1, 1]},
...
{"image": "gene_crop/c_g_a_0/20217.jpg",
"caption": "a woman in a white top and black pants posing for a picture in front of a brick wall with a pink carpet in front of her",
"image_id": "c_g_a_0_20217",
"label": [0, 1, ..., -1, -1]}]

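As a minimal sketch of how annotations in this format might be consumed, the snippet below parses a list of entries and counts, for each attribute position, how often each label value occurs. The helper name `summarize_labels`, the shortened captions, and the three-value label vectors are illustrative assumptions; real MALS entries carry one value per annotated attribute. The README does not say what a label of -1 denotes, so the code only counts it.

```python
import json
from collections import Counter


def summarize_labels(annotations):
    """Count label values per attribute position.

    `annotations` is a list of dicts with the keys shown above
    ("image", "caption", "image_id", "label").
    Returns a list of Counters, one per attribute index.
    """
    if not annotations:
        return []
    num_attrs = len(annotations[0]["label"])
    counts = [Counter() for _ in range(num_attrs)]
    for entry in annotations:
        for idx, value in enumerate(entry["label"]):
            counts[idx][value] += 1
    return counts


# Toy input: captions and label vectors are shortened, hypothetical stand-ins.
sample = json.loads("""
[{"image": "gene_crop/c_g_a_0/0.jpg",
  "caption": "a young boy wearing a black hoodie",
  "image_id": "c_g_a_0_0",
  "label": [1, 0, 1]},
 {"image": "gene_crop/c_g_a_0/20217.jpg",
  "caption": "a woman in a white top and black pants",
  "image_id": "c_g_a_0_20217",
  "label": [0, 1, -1]}]
""")

for idx, counter in enumerate(summarize_labels(sample)):
    print(idx, dict(counter))
```
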
Models and Weights

The checkpoints have been released at Baidu Yun [b2l8] and Google Drive.

Usage

Install Requirements

We use 4 A100 80G GPUs for training and evaluation.

Create conda environment.

conda create -n aptm python=3.8
conda activate aptm
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
pip3 install -r requirements.txt

Dataset Preparation

Download the CUHK-PEDES dataset from here, the PA-100K dataset from here, the RSTPReid dataset from here, and the ICFG-PEDES dataset from here. Download the processed JSON files of the above four datasets from here [b2l8].

Download pre-trained models for parameter initialization:

image encoder: swin-transformer-base

text encoder: bert-base

Organize data folder as follows:

|-- data/
|    |-- bert-base-uncased/
|    |-- finetune/
|        |-- gene_attrs/
|            |-- g_4x_attrs.json
|            |-- g_c_g_a_0_attrs.json
|            |-- ...
|        |-- cuhk_train.json
|        |-- ...
|        |-- icfg_train.json
|        |-- ...
|        |-- rstp_train.json
|        |-- ...
|        |-- PA100K_train.json
|        |-- ...
|    |-- swin_base_patch4_window7_224_22k.pth
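Before launching a run, it can help to confirm that the data folder matches the layout above. The following is a minimal sketch, not part of the repository: the function name `find_missing_entries` is hypothetical, and it checks only the entries that the tree names explicitly (the elided `...` items are not guessed).

```python
import os

# Only the entries named explicitly in the tree above; "..." items are skipped.
EXPECTED_DATA_ENTRIES = [
    "bert-base-uncased",
    os.path.join("finetune", "gene_attrs", "g_4x_attrs.json"),
    os.path.join("finetune", "gene_attrs", "g_c_g_a_0_attrs.json"),
    os.path.join("finetune", "cuhk_train.json"),
    os.path.join("finetune", "icfg_train.json"),
    os.path.join("finetune", "rstp_train.json"),
    os.path.join("finetune", "PA100K_train.json"),
    "swin_base_patch4_window7_224_22k.pth",
]


def find_missing_entries(data_root, expected=EXPECTED_DATA_ENTRIES):
    """Return the expected relative paths that do not exist under data_root."""
    return [
        rel for rel in expected
        if not os.path.exists(os.path.join(data_root, rel))
    ]


# Check the default "data" folder relative to the current directory.
missing = find_missing_entries("data")
if missing:
    print("Missing entries:")
    for rel in missing:
        print("  ", rel)
else:
    print("All listed entries are present.")
```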

Organize the datasets in the images folder as follows:

|-- images/
|    |-- <CUHK-PEDES>/
|        |-- imgs/
|            |-- cam_a/
|            |-- cam_b/
|            |-- ...
|            |-- train_query/
|            |-- gene_crop/
|                |-- 4x/
|                |-- c_g_a/
|                |-- ...
|                |-- i_g_a_43/
|
|    |-- <ICFG-PEDES>/
|        |-- test/
|        |-- train/
|
|    |-- <pa100k>/
|        |-- release_data/
|
|    |-- <RSTPReid>/

Pretraining

We pretrain APTM on MALS as follows:

python3 run.py --task "itr_gene" --dist "f4" --output_dir "output/pretrained"

Fine-tuning

We fine-tune APTM on existing text-based person re-identification datasets. Performance can be improved by replacing the backbone with our pre-trained model. Taking CUHK-PEDES as an example:

python3 run.py --task "itr_cuhk" --dist "f4" --output_dir "output/ft_cuhk" --checkpoint "output/pretrained/checkpoint_31.pth"

Evaluation

python3 run.py --task "itr_cuhk" --evaluate --dist "f4" --output_dir "output/ft_cuhk/test" --checkpoint "output/ft_cuhk/checkpoint_best.pth"
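The three commands above share one shape, so they can be generated from a small helper when scripting several runs. This is only a sketch: `build_command` is a hypothetical name, and it uses only the flags that appear in the commands above (`--task`, `--dist`, `--output_dir`, `--checkpoint`, `--evaluate`). It prints the command lines rather than executing them.

```python
import shlex


def build_command(task, output_dir, dist="f4", checkpoint=None, evaluate=False):
    """Assemble a run.py invocation as an argument list."""
    cmd = ["python3", "run.py", "--task", task]
    if evaluate:
        cmd.append("--evaluate")
    cmd += ["--dist", dist, "--output_dir", output_dir]
    if checkpoint is not None:
        cmd += ["--checkpoint", checkpoint]
    return cmd


# Pretraining, fine-tuning on CUHK-PEDES, then evaluation, as in the README.
pipeline = [
    build_command("itr_gene", "output/pretrained"),
    build_command("itr_cuhk", "output/ft_cuhk",
                  checkpoint="output/pretrained/checkpoint_31.pth"),
    build_command("itr_cuhk", "output/ft_cuhk/test",
                  checkpoint="output/ft_cuhk/checkpoint_best.pth",
                  evaluate=True),
]

for cmd in pipeline:
    print(shlex.join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` to execute the steps in order.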

Reference

If you use APTM in your research, please cite it using the following BibTeX entry:

@inproceedings{yang2023towards,
  title={Towards Unified Text-based Person Retrieval: A Large-scale Multi-Attribute and Language Search Benchmark},
  author={Yang, Shuyu and Zhou, Yinan and Wang, Yaxiong and Wu, Yujiao and Zhu, Li and Zheng, Zhedong},
  booktitle = {Proceedings of the 2023 {ACM} on Multimedia Conference},
  year={2023}
}

aptm's Issues

PA100K Result

Could you explain why the comparison here is described as unfair? Don't both methods use Swin Base?
SOLIDER: https://github.com/tinyvision/SOLIDER-PersonAttributeRecognition

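APL loss settings

Hello authors,
Thank you very much for the outstanding work on APTM. While trying to pretrain with the APTM framework on the MALS dataset, I found that the default configuration file Retrieval_gene.yaml comments out the APL loss and its weight β, that is, the two variables attr and t in the configuration file. In other words, your default configuration does not enable the APL loss. So far, my training runs show the following:
(1) With your default configuration (APL loss disabled), training runs normally, but I cannot compare my pretraining results with yours because the shared link to your pretrained model weights has expired, so I cannot test the performance of the APTM framework pretrained on MALS. Could you update the shared link to the model weights? Thank you very much.
(2) If I enable the APL loss in the default configuration, following the settings in your paper, the model fails to train normally and the APL loss becomes NaN. Does computing the APL loss require some other setting that I have missed, or does my usage need to change? I am reproducing the results strictly with your code and the configuration files on GitHub. Alternatively, is the NaN behavior expected for the APTM framework?

Thank you very much for taking the time to read this issue.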

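Multiple images are missing from the gene_crop/c_g_a_1/ folder

Hello, thank you very much for open-sourcing the code and dataset! During pretraining, I found that multiple images are missing under the gene_crop/c_g_a_1/ path. Is it possible that an unorganized version of the JSON was uploaded?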
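A report like this is easier to act on with an exact list of the affected files. The snippet below is a minimal sketch of such a check, assuming annotation entries carry an "image" key holding a path relative to some image root, as in the annotation format shown in the README. The function name `find_missing_images`, the sample entries, and the root path are hypothetical.

```python
import os


def find_missing_images(annotations, image_root):
    """Return the "image" paths from the annotations that are not files on disk."""
    missing = []
    for entry in annotations:
        path = os.path.join(image_root, entry["image"])
        if not os.path.isfile(path):
            missing.append(entry["image"])
    return missing


# Hypothetical entries; in practice they would come from json.load() on the
# annotation file for the affected folder.
sample = [
    {"image": "gene_crop/c_g_a_1/0.jpg"},
    {"image": "gene_crop/c_g_a_1/1.jpg"},
]

# The root below is a placeholder that is unlikely to exist, so both sample
# entries are reported as missing.
missing = find_missing_images(sample, "path/to/image/root")
print(len(missing), "of", len(sample), "images missing")
```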

ModuleNotFoundError: from reTools import evaluation, mAP

No module named 'gnn_reranking'

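Where in the code does the text encoder use the first six BERT layers, and where does the cross encoder use the last six BERT layers?

How were the different normalize values obtained?

I noticed that the normalize values you use for CUHK, ICFG, and RSTP are different. How were these values obtained?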

Dataset Drive Link

Hello,
Thanks for your awesome study!
I was wondering if there is any chance you could share a Google Drive link for the dataset for the community of ReID?

itr_pa100k task Test Result

I modified the 'batch_size_train' parameter in the 'Retrieval_pa100k.yaml' file to 12. After running the script, I obtained the following test results. However, the result of mAP in the paper is 82.58.

CUHK-PEDES

How do I download the CUHK-PEDES dataset? The linked page shows nothing.

Zero-shot evaluation for downstream datasets?

Firstly, I would like to express my appreciation for the insightful research presented in the paper.

The paper shows the results:

baseline: training the model without pretraining on MALS
APTM: training the model with pretraining on MALS

I am interested in understanding the zero-shot performance, where no further training is performed after pretraining on MALS. Specifically, I would like to see evaluation results on the downstream datasets (CUHK-PEDES, ICFG-PEDES, and RSTPReid):

zero-shot: no training, just the model pretraining on MALS, then test on downstream datasets

Hello authors, how can I obtain good feature vectors from the model?

I tested with my own Evaluator:

import torch


class Evaluator():
    def __init__(self, img_loader, txt_loader):
        self.img_loader = img_loader  # gallery
        self.txt_loader = txt_loader  # query

    def _compute_embedding(self, model):
        model = model.eval()
        device = next(model.parameters()).device

        qids, gids, qfeats, gfeats = [], [], [], []
        # text
        for pid, caption in self.txt_loader:

            for k, v in caption.items():
                caption[k] = v.to(device)
            with torch.no_grad():
                caption["input_ids"] = caption["input_ids"].squeeze(1)
                text_embeds = model.get_text_embeds(caption["input_ids"], caption["attention_mask"])
                text_feat =  model.text_proj(text_embeds[:, 0, :])
            qids.append(pid.view(-1))  # flatten
            qfeats.append(text_feat)
        qids = torch.cat(qids, 0)
        qfeats = torch.cat(qfeats, 0)
        # image
        for pid, img in self.img_loader:
            img = img.to(device)
            with torch.no_grad():
                image_embeds, image_atts = model.get_vision_embeds(img)
                img_feat = model.vision_proj(image_embeds[:, 0, :])
            gids.append(pid.view(-1))  # flatten
            gfeats.append(img_feat)
        gids = torch.cat(gids, 0)
        gfeats = torch.cat(gfeats, 0)
        return qfeats, gfeats, qids, gids

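The resulting Rank-1 on CUHK-PEDES is only 55. I used the BERT tokenizer.
The code does not evaluate directly on these features; instead it performs some kind of iteration over the similarity matrix. How should I understand this process? How can I obtain good text and image vectors?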
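For reference, Rank-1 over a pair of feature sets is commonly computed from cosine similarity between L2-normalized vectors. The sketch below shows that computation in plain Python on toy vectors. The helper names are hypothetical, and it does not reproduce the repository's own evaluation, which further processes the similarity matrix. Whether normalization or that extra processing accounts for the gap reported here is not established in this thread.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def rank1_accuracy(qfeats, qids, gfeats, gids):
    """Fraction of queries whose most similar gallery item shares their id."""
    gallery = [l2_normalize(g) for g in gfeats]
    hits = 0
    for feat, qid in zip(qfeats, qids):
        q = l2_normalize(feat)
        # Cosine similarity equals the dot product of unit-length vectors.
        sims = [sum(a * b for a, b in zip(q, g)) for g in gallery]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += int(gids[best] == qid)
    return hits / len(qids)


# Toy text (query) and image (gallery) features with person ids.
qfeats = [[1.0, 0.0], [0.0, 2.0]]
qids = [1, 2]
gfeats = [[10.0, 1.0], [0.1, 5.0]]
gids = [1, 2]
print(rank1_accuracy(qfeats, qids, gfeats, gids))
```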

`c_g_a_9.zip` is not downloadable

I keep failing to download one file, c_g_a_9.zip.

incorrect predictions in attribute recognition for pa100k images

Thanks for your great job.
I used the "checkpoint_best.pth" model released in Google Drive/checkpoints/ft_pa100k.zip to recognize attributes of PA100K images. I changed three keys in the Retrieval_pa100k.yaml file as below:

pa100k: False
pa100k_only_img_classifier: True
dop: 0.1

Everything seems to be correct, but I don't know why I don't get the correct result in the output. The model returns the same probability (equal to 0.5) for all attributes:

[0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5]

Additionally, when I loaded the model I got these logs:

missing_keys: ['img_cls.1.weight', 'img_cls.1.bias', 'img_cls.2.weight', 'img_cls.2.bias', 'img_cls.2.running_mean', 'img_cls.2.running_var', 'img_cls.4.weight', 'img_cls.4.bias']
vision_encoder missing_keys: []
unexpected_keys: ['temp', 'text_encoder.bert.embeddings.position_ids', 'text_encoder.bert.embeddings.word_embeddings.weight', 'text_encoder.bert.embeddings.position_embeddings.weight', 'text_encoder.bert.embeddings.token_type_embeddings.weight', 'text_encoder.bert.embeddings.LayerNorm.weight', 'text_encoder.bert.embeddings.LayerNorm.bias', 'text_encoder.bert.encoder.layer.0.attention.self.query.weight', 'text_encoder.bert.encoder.layer.0.attention.self.query.bias', 'text_encoder.bert.encoder.layer.0.attention.self.key.weight', 'text_encoder.bert.encoder.layer.0.attention.self.key.bias', 'text_encoder.bert.encoder.layer.0.attention.self.value.weight', 'text_encoder.bert.encoder.layer.0.attention.self.value.bias', 'text_encoder.bert.encoder.layer.0.attention.output.dense.weight', 'text_encoder.bert.encoder.layer.0.attention.output.dense.bias', 'text_encoder.bert.encoder.layer.0.attention.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.0.attention.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.0.intermediate.dense.weight', 'text_encoder.bert.encoder.layer.0.intermediate.dense.bias', 'text_encoder.bert.encoder.layer.0.output.dense.weight', 'text_encoder.bert.encoder.layer.0.output.dense.bias', 'text_encoder.bert.encoder.layer.0.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.0.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.1.attention.self.query.weight', 'text_encoder.bert.encoder.layer.1.attention.self.query.bias', 'text_encoder.bert.encoder.layer.1.attention.self.key.weight', 'text_encoder.bert.encoder.layer.1.attention.self.key.bias', 'text_encoder.bert.encoder.layer.1.attention.self.value.weight', 'text_encoder.bert.encoder.layer.1.attention.self.value.bias', 'text_encoder.bert.encoder.layer.1.attention.output.dense.weight', 'text_encoder.bert.encoder.layer.1.attention.output.dense.bias', 'text_encoder.bert.encoder.layer.1.attention.output.LayerNorm.weight', 
'text_encoder.bert.encoder.layer.1.attention.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.1.intermediate.dense.weight', 'text_encoder.bert.encoder.layer.1.intermediate.dense.bias', 'text_encoder.bert.encoder.layer.1.output.dense.weight', 'text_encoder.bert.encoder.layer.1.output.dense.bias', 'text_encoder.bert.encoder.layer.1.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.1.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.2.attention.self.query.weight', 'text_encoder.bert.encoder.layer.2.attention.self.query.bias', 'text_encoder.bert.encoder.layer.2.attention.self.key.weight', 'text_encoder.bert.encoder.layer.2.attention.self.key.bias', 'text_encoder.bert.encoder.layer.2.attention.self.value.weight', 'text_encoder.bert.encoder.layer.2.attention.self.value.bias', 'text_encoder.bert.encoder.layer.2.attention.output.dense.weight', 'text_encoder.bert.encoder.layer.2.attention.output.dense.bias', 'text_encoder.bert.encoder.layer.2.attention.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.2.attention.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.2.intermediate.dense.weight', 'text_encoder.bert.encoder.layer.2.intermediate.dense.bias', 'text_encoder.bert.encoder.layer.2.output.dense.weight', 'text_encoder.bert.encoder.layer.2.output.dense.bias', 'text_encoder.bert.encoder.layer.2.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.2.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.3.attention.self.query.weight', 'text_encoder.bert.encoder.layer.3.attention.self.query.bias', 'text_encoder.bert.encoder.layer.3.attention.self.key.weight', 'text_encoder.bert.encoder.layer.3.attention.self.key.bias', 'text_encoder.bert.encoder.layer.3.attention.self.value.weight', 'text_encoder.bert.encoder.layer.3.attention.self.value.bias', 'text_encoder.bert.encoder.layer.3.attention.output.dense.weight', 'text_encoder.bert.encoder.layer.3.attention.output.dense.bias', 
'text_encoder.bert.encoder.layer.3.attention.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.3.attention.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.3.intermediate.dense.weight', 'text_encoder.bert.encoder.layer.3.intermediate.dense.bias', 'text_encoder.bert.encoder.layer.3.output.dense.weight', 'text_encoder.bert.encoder.layer.3.output.dense.bias', 'text_encoder.bert.encoder.layer.3.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.3.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.4.attention.self.query.weight', 'text_encoder.bert.encoder.layer.4.attention.self.query.bias', 'text_encoder.bert.encoder.layer.4.attention.self.key.weight', 'text_encoder.bert.encoder.layer.4.attention.self.key.bias', 'text_encoder.bert.encoder.layer.4.attention.self.value.weight', 'text_encoder.bert.encoder.layer.4.attention.self.value.bias', 'text_encoder.bert.encoder.layer.4.attention.output.dense.weight', 'text_encoder.bert.encoder.layer.4.attention.output.dense.bias', 'text_encoder.bert.encoder.layer.4.attention.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.4.attention.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.4.intermediate.dense.weight', 'text_encoder.bert.encoder.layer.4.intermediate.dense.bias', 'text_encoder.bert.encoder.layer.4.output.dense.weight', 'text_encoder.bert.encoder.layer.4.output.dense.bias', 'text_encoder.bert.encoder.layer.4.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.4.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.5.attention.self.query.weight', 'text_encoder.bert.encoder.layer.5.attention.self.query.bias', 'text_encoder.bert.encoder.layer.5.attention.self.key.weight', 'text_encoder.bert.encoder.layer.5.attention.self.key.bias', 'text_encoder.bert.encoder.layer.5.attention.self.value.weight', 'text_encoder.bert.encoder.layer.5.attention.self.value.bias', 'text_encoder.bert.encoder.layer.5.attention.output.dense.weight', 
'text_encoder.bert.encoder.layer.5.attention.output.dense.bias', 'text_encoder.bert.encoder.layer.5.attention.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.5.attention.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.5.intermediate.dense.weight', 'text_encoder.bert.encoder.layer.5.intermediate.dense.bias', 'text_encoder.bert.encoder.layer.5.output.dense.weight', 'text_encoder.bert.encoder.layer.5.output.dense.bias', 'text_encoder.bert.encoder.layer.5.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.5.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.6.attention.self.query.weight', 'text_encoder.bert.encoder.layer.6.attention.self.query.bias', 'text_encoder.bert.encoder.layer.6.attention.self.key.weight', 'text_encoder.bert.encoder.layer.6.attention.self.key.bias', 'text_encoder.bert.encoder.layer.6.attention.self.value.weight', 'text_encoder.bert.encoder.layer.6.attention.self.value.bias', 'text_encoder.bert.encoder.layer.6.attention.output.dense.weight', 'text_encoder.bert.encoder.layer.6.attention.output.dense.bias', 'text_encoder.bert.encoder.layer.6.attention.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.6.attention.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.6.crossattention.self.query.weight', 'text_encoder.bert.encoder.layer.6.crossattention.self.query.bias', 'text_encoder.bert.encoder.layer.6.crossattention.self.key.weight', 'text_encoder.bert.encoder.layer.6.crossattention.self.key.bias', 'text_encoder.bert.encoder.layer.6.crossattention.self.value.weight', 'text_encoder.bert.encoder.layer.6.crossattention.self.value.bias', 'text_encoder.bert.encoder.layer.6.crossattention.output.dense.weight', 'text_encoder.bert.encoder.layer.6.crossattention.output.dense.bias', 'text_encoder.bert.encoder.layer.6.crossattention.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.6.crossattention.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.6.intermediate.dense.weight', 
'text_encoder.bert.encoder.layer.6.intermediate.dense.bias', 'text_encoder.bert.encoder.layer.6.output.dense.weight', 'text_encoder.bert.encoder.layer.6.output.dense.bias', 'text_encoder.bert.encoder.layer.6.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.6.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.7.attention.self.query.weight', 'text_encoder.bert.encoder.layer.7.attention.self.query.bias', 'text_encoder.bert.encoder.layer.7.attention.self.key.weight', 'text_encoder.bert.encoder.layer.7.attention.self.key.bias', 'text_encoder.bert.encoder.layer.7.attention.self.value.weight', 'text_encoder.bert.encoder.layer.7.attention.self.value.bias', 'text_encoder.bert.encoder.layer.7.attention.output.dense.weight', 'text_encoder.bert.encoder.layer.7.attention.output.dense.bias', 'text_encoder.bert.encoder.layer.7.attention.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.7.attention.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.7.crossattention.self.query.weight', 'text_encoder.bert.encoder.layer.7.crossattention.self.query.bias', 'text_encoder.bert.encoder.layer.7.crossattention.self.key.weight', 'text_encoder.bert.encoder.layer.7.crossattention.self.key.bias', 'text_encoder.bert.encoder.layer.7.crossattention.self.value.weight', 'text_encoder.bert.encoder.layer.7.crossattention.self.value.bias', 'text_encoder.bert.encoder.layer.7.crossattention.output.dense.weight', 'text_encoder.bert.encoder.layer.7.crossattention.output.dense.bias', 'text_encoder.bert.encoder.layer.7.crossattention.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.7.crossattention.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.7.intermediate.dense.weight', 'text_encoder.bert.encoder.layer.7.intermediate.dense.bias', 'text_encoder.bert.encoder.layer.7.output.dense.weight', 'text_encoder.bert.encoder.layer.7.output.dense.bias', 'text_encoder.bert.encoder.layer.7.output.LayerNorm.weight', 
'text_encoder.bert.encoder.layer.7.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.8.attention.self.query.weight', 'text_encoder.bert.encoder.layer.8.attention.self.query.bias', 'text_encoder.bert.encoder.layer.8.attention.self.key.weight', 'text_encoder.bert.encoder.layer.8.attention.self.key.bias', 'text_encoder.bert.encoder.layer.8.attention.self.value.weight', 'text_encoder.bert.encoder.layer.8.attention.self.value.bias', 'text_encoder.bert.encoder.layer.8.attention.output.dense.weight', 'text_encoder.bert.encoder.layer.8.attention.output.dense.bias', 'text_encoder.bert.encoder.layer.8.attention.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.8.attention.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.8.crossattention.self.query.weight', 'text_encoder.bert.encoder.layer.8.crossattention.self.query.bias', 'text_encoder.bert.encoder.layer.8.crossattention.self.key.weight', 'text_encoder.bert.encoder.layer.8.crossattention.self.key.bias', 'text_encoder.bert.encoder.layer.8.crossattention.self.value.weight', 'text_encoder.bert.encoder.layer.8.crossattention.self.value.bias', 'text_encoder.bert.encoder.layer.8.crossattention.output.dense.weight', 'text_encoder.bert.encoder.layer.8.crossattention.output.dense.bias', 'text_encoder.bert.encoder.layer.8.crossattention.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.8.crossattention.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.8.intermediate.dense.weight', 'text_encoder.bert.encoder.layer.8.intermediate.dense.bias', 'text_encoder.bert.encoder.layer.8.output.dense.weight', 'text_encoder.bert.encoder.layer.8.output.dense.bias', 'text_encoder.bert.encoder.layer.8.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.8.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.9.attention.self.query.weight', 'text_encoder.bert.encoder.layer.9.attention.self.query.bias', 'text_encoder.bert.encoder.layer.9.attention.self.key.weight', 
'text_encoder.bert.encoder.layer.9.attention.self.key.bias', 'text_encoder.bert.encoder.layer.9.attention.self.value.weight', 'text_encoder.bert.encoder.layer.9.attention.self.value.bias', 'text_encoder.bert.encoder.layer.9.attention.output.dense.weight', 'text_encoder.bert.encoder.layer.9.attention.output.dense.bias', 'text_encoder.bert.encoder.layer.9.attention.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.9.attention.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.9.crossattention.self.query.weight', 'text_encoder.bert.encoder.layer.9.crossattention.self.query.bias', 'text_encoder.bert.encoder.layer.9.crossattention.self.key.weight', 'text_encoder.bert.encoder.layer.9.crossattention.self.key.bias', 'text_encoder.bert.encoder.layer.9.crossattention.self.value.weight', 'text_encoder.bert.encoder.layer.9.crossattention.self.value.bias', 'text_encoder.bert.encoder.layer.9.crossattention.output.dense.weight', 'text_encoder.bert.encoder.layer.9.crossattention.output.dense.bias', 'text_encoder.bert.encoder.layer.9.crossattention.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.9.crossattention.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.9.intermediate.dense.weight', 'text_encoder.bert.encoder.layer.9.intermediate.dense.bias', 'text_encoder.bert.encoder.layer.9.output.dense.weight', 'text_encoder.bert.encoder.layer.9.output.dense.bias', 'text_encoder.bert.encoder.layer.9.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.9.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.10.attention.self.query.weight', 'text_encoder.bert.encoder.layer.10.attention.self.query.bias', 'text_encoder.bert.encoder.layer.10.attention.self.key.weight', 'text_encoder.bert.encoder.layer.10.attention.self.key.bias', 'text_encoder.bert.encoder.layer.10.attention.self.value.weight', 'text_encoder.bert.encoder.layer.10.attention.self.value.bias', 'text_encoder.bert.encoder.layer.10.attention.output.dense.weight', 
'text_encoder.bert.encoder.layer.10.attention.output.dense.bias', 'text_encoder.bert.encoder.layer.10.attention.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.10.attention.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.10.crossattention.self.query.weight', 'text_encoder.bert.encoder.layer.10.crossattention.self.query.bias', 'text_encoder.bert.encoder.layer.10.crossattention.self.key.weight', 'text_encoder.bert.encoder.layer.10.crossattention.self.key.bias', 'text_encoder.bert.encoder.layer.10.crossattention.self.value.weight', 'text_encoder.bert.encoder.layer.10.crossattention.self.value.bias', 'text_encoder.bert.encoder.layer.10.crossattention.output.dense.weight', 'text_encoder.bert.encoder.layer.10.crossattention.output.dense.bias', 'text_encoder.bert.encoder.layer.10.crossattention.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.10.crossattention.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.10.intermediate.dense.weight', 'text_encoder.bert.encoder.layer.10.intermediate.dense.bias', 'text_encoder.bert.encoder.layer.10.output.dense.weight', 'text_encoder.bert.encoder.layer.10.output.dense.bias', 'text_encoder.bert.encoder.layer.10.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.10.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.11.attention.self.query.weight', 'text_encoder.bert.encoder.layer.11.attention.self.query.bias', 'text_encoder.bert.encoder.layer.11.attention.self.key.weight', 'text_encoder.bert.encoder.layer.11.attention.self.key.bias', 'text_encoder.bert.encoder.layer.11.attention.self.value.weight', 'text_encoder.bert.encoder.layer.11.attention.self.value.bias', 'text_encoder.bert.encoder.layer.11.attention.output.dense.weight', 'text_encoder.bert.encoder.layer.11.attention.output.dense.bias', 'text_encoder.bert.encoder.layer.11.attention.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.11.attention.output.LayerNorm.bias', 
'text_encoder.bert.encoder.layer.11.crossattention.self.query.weight', 'text_encoder.bert.encoder.layer.11.crossattention.self.query.bias', 'text_encoder.bert.encoder.layer.11.crossattention.self.key.weight', 'text_encoder.bert.encoder.layer.11.crossattention.self.key.bias', 'text_encoder.bert.encoder.layer.11.crossattention.self.value.weight', 'text_encoder.bert.encoder.layer.11.crossattention.self.value.bias', 'text_encoder.bert.encoder.layer.11.crossattention.output.dense.weight', 'text_encoder.bert.encoder.layer.11.crossattention.output.dense.bias', 'text_encoder.bert.encoder.layer.11.crossattention.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.11.crossattention.output.LayerNorm.bias', 'text_encoder.bert.encoder.layer.11.intermediate.dense.weight', 'text_encoder.bert.encoder.layer.11.intermediate.dense.bias', 'text_encoder.bert.encoder.layer.11.output.dense.weight', 'text_encoder.bert.encoder.layer.11.output.dense.bias', 'text_encoder.bert.encoder.layer.11.output.LayerNorm.weight', 'text_encoder.bert.encoder.layer.11.output.LayerNorm.bias', 'text_encoder.cls.predictions.bias', 'text_encoder.cls.predictions.transform.dense.weight', 'text_encoder.cls.predictions.transform.dense.bias', 'text_encoder.cls.predictions.transform.LayerNorm.weight', 'text_encoder.cls.predictions.transform.LayerNorm.bias', 'text_encoder.cls.predictions.decoder.weight', 'text_encoder.cls.predictions.decoder.bias', 'vision_proj.weight', 'vision_proj.bias', 'text_proj.weight', 'text_proj.bias', 'itm_head.0.weight', 'itm_head.0.bias', 'itm_head.1.weight', 'itm_head.1.bias', 'itm_head.3.weight', 'itm_head.3.bias']

Total Params: 87022610

Would you mind helping me with this?
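A long key list like the one above is easier to read when grouped by top-level module. The sketch below is a hypothetical helper, not repository code: it groups state-dict key names by their first dot-separated component. The missing keys are copied from the log; the unexpected keys are a small sample of it. Grouped this way, the log shows that every reported missing key belongs to `img_cls`, which suggests the classifier head is not restored from this checkpoint. Whether that explains the constant 0.5 outputs is not confirmed in this thread.

```python
from collections import Counter


def summarize_keys(keys, depth=1):
    """Group state-dict key names by their first `depth` dot-separated parts."""
    return Counter(".".join(key.split(".")[:depth]) for key in keys)


# Copied from the missing_keys line of the log above.
missing_keys = [
    "img_cls.1.weight", "img_cls.1.bias",
    "img_cls.2.weight", "img_cls.2.bias",
    "img_cls.2.running_mean", "img_cls.2.running_var",
    "img_cls.4.weight", "img_cls.4.bias",
]

# A small sample of the unexpected_keys from the log, for illustration only.
unexpected_sample = [
    "temp",
    "text_encoder.bert.embeddings.position_ids",
    "vision_proj.weight", "vision_proj.bias",
    "text_proj.weight", "text_proj.bias",
    "itm_head.0.weight", "itm_head.0.bias",
]

print("missing:", dict(summarize_keys(missing_keys)))
print("unexpected (sample):", dict(summarize_keys(unexpected_sample)))
```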

Issues with Running Evaluation Run.py

I was trying to run evaluation only, after downloading the required datasets and checkpoints. However, I am not able to run the evaluation Python script. Has anyone had the same problem, and is there a solution?
To reproduce the error:
python3 run.py --task "itr_cuhk" --evaluate --dist "f4" --output_dir "output/ft_cuhk/test" --checkpoint "output/ft_cuhk/checkpoint_best.pth"

I am wondering whether this is a version issue. I am using PyYAML 6.0.1, PyTorch 2.2.1, ruamel.yaml 0.18.6, and ruamel.yaml.clib 0.2.8.

Error

NNODES, 1
NPROC_PER_NODE, 4
MASTER_ADDR, 127.0.0.1
MASTER_PORT, 3000
NODE_RANK, 0
/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
[2024-03-25 15:15:00,947] torch.distributed.run: [WARNING]
[2024-03-25 15:15:00,947] torch.distributed.run: [WARNING] *****************************************
[2024-03-25 15:15:00,947] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-03-25 15:15:00,947] torch.distributed.run: [WARNING] *****************************************
Traceback (most recent call last):
File "Retrieval.py", line 296, in
config = yaml.load(open(args.config, 'r'), Loader=yaml.Loader)
File "/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/ruamel/yaml/main.py", line 1085, in load
error_deprecation('load', 'load', arg=_error_dep_arg, comment=_error_dep_comment)
File "/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/ruamel/yaml/main.py", line 1037, in error_deprecation
raise AttributeError(s)
AttributeError:
"load()" has been removed, use

yaml = YAML(typ='rt')
yaml.load(...)

and register any classes that you use, or check the tag attribute on the loaded data,
instead of file "Retrieval.py", line 296

config = yaml.load(open(args.config, 'r'), Loader=yaml.Loader)

[2024-03-25 15:15:11,033] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 142176) of binary: /home/default/miniconda3/envs/aptm/bin/python3
Traceback (most recent call last):
File "/home/default/miniconda3/envs/aptm/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/default/miniconda3/envs/aptm/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/torch/distributed/launch.py", line 198, in
main()
File "/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/torch/distributed/launch.py", line 194, in main
launch(args)
File "/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/torch/distributed/launch.py", line 179, in launch
run(args)
File "/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/default/miniconda3/envs/aptm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Retrieval.py FAILED

Failures:
[1]:
time : 2024-03-25_15:15:11
host : default-Pulse-15-B13VFK
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 142177)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-03-25_15:15:11
host : default-Pulse-15-B13VFK
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 142178)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-03-25_15:15:11
host : default-Pulse-15-B13VFK
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 142179)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-03-25_15:15:11
host : default-Pulse-15-B13VFK
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 142176)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
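
The AttributeError in the log above is raised by ruamel.yaml, which rejects the legacy module-level `yaml.load(..., Loader=yaml.Loader)` call on line 296 of Retrieval.py. A minimal sketch of the replacement that the error message itself suggests is shown below. The helper name `load_config` is hypothetical, and the sketch assumes that ruamel.yaml is installed and that the rest of the script only needs dict-style access to the loaded config.

```python
def load_config(path):
    """Load a YAML config file with the instance-based ruamel.yaml API.

    Hypothetical helper: it stands in for the removed call
    yaml.load(open(args.config, 'r'), Loader=yaml.Loader).
    """
    # Imported here so the sketch only requires ruamel.yaml when the
    # helper is actually called.
    from ruamel.yaml import YAML

    yaml = YAML(typ="rt")  # round-trip loader, as named in the error message
    with open(path, "r") as f:
        # Returns a dict-like mapping, so config["key"] lookups keep working.
        return yaml.load(f)
```

In Retrieval.py this would presumably be used as `config = load_config(args.config)`; whether other scripts in the repository contain the same call is not shown in the log.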

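Hello authors, could you release the training log for MALS?

I would like to compare it against my own run to check whether my training process has any problems.

Zero-shot capability of the pre-trained model

How well does the APTM pre-trained model perform zero-shot when tested directly on downstream domain-generalization ReID tasks? Is there a large domain gap between the generated pedestrian data used for pre-training and the real downstream pedestrian data?

Fine-tuning with the pre_trained model

Why do I keep getting out-of-memory errors when fine-tuning the pre_trained model on datasets such as CUHK-PEDES? My setup is as follows: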

Adding License file

Could you please add a license file?
Thanks

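Thank you! I have a question about generating captions for images

You replied earlier that you used a pre-trained version of BLIP directly to generate the captions. Which pre-trained BLIP version did you use exactly? The captions I generate with BLIP all seem very brief and lack fine-grained information such as clothing colors. Could you explain how you did it, and whether you fine-tuned BLIP to generate fine-grained information?

Question about the prompt used to generate captions

Could you tell me how you set the prompt when using BLIP to generate captions for the images in your paper?

[Some questions about the evaluation code]

1. What is the difference between the three evaluation functions evaluation_attr_only_img_classifier(), evaluation_attr(), and evaluation() in the evaluation code?
2. In the evaluation() function, what is the difference between score_matrix_t2i and score_sim_t2i?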

pre-trained models

Is the pre-trained model open source?

Comparison with BLIP and importance of Attribute Learning

Hi, I have tried APTM and it is awesome!

Now, I want to finetune APTM for a text-based person retrieval task on custom data that contains more person attributes, like different poses and more objects. I'm trying to understand how important attribute learning is, and whether I can finetune BLIP instead of APTM.

In the paper, it is mentioned that "BLIP is used to produce more fitting captions for every synthetic image and form the final image-text pairs" in the MALS dataset. I have the following questions regarding the approach:

Did you use the pretrained version of BLIP, or did you finetune it on some person image-text pairs before labelling the MALS dataset?
If you're using BLIP-generated captions as ground truth and then pretraining APTM, doesn't that mean pretrained APTM is trying to match BLIP's performance in the pretraining phase? Perhaps in the finetuning phase, APTM may be better than BLIP. Is this thought process correct?
Do you have APTM comparison results with BLIP (either pretrained or finetuned on some person image-text pairs)?

What approach would you suggest if I want APTM to understand more person attributes, like different poses and more objects?

Finetune for ITC + ITM + MLM. This would require only image-text pairs. (This can be done by BLIP too.)
Finetune for IAC + IAM + MAM + ITC + ITM + MLM. This would require attribute labels to be prepared.

Thanks! Please correct me if I understood anything incorrectly.

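Fine-tuning on a mixture of multiple datasets

Hello! If I fine-tune on a mixture of datasets, for example CUHK + ICFG, I believe the following steps are needed:

Convert the ICFG captions offline to the CUHK format (all lowercase, with punctuation replaced by spaces).
Offset the ICFG image_id values by the maximum CUHK id.
(Do the mean and variance need to be recomputed?)

Are any other changes needed? Would a model fine-tuned on multiple datasets generalize better? Looking forward to your reply, thanks!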
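
The two preprocessing steps proposed in this question (caption normalization and image_id offsetting) could be sketched roughly as follows. This is only an illustration under stated assumptions: the record layout (a list of dicts with a string `caption` and an integer `image_id`) and the function names are hypothetical and are not taken from the APTM code. The sketch also collapses the repeated spaces left behind by removed punctuation, and offsets by the maximum CUHK id plus one so that an ICFG id of 0 cannot collide with the largest CUHK id.

```python
import string

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def normalize_caption(caption):
    """Lowercase a caption and replace punctuation with spaces.

    Runs of whitespace are collapsed afterwards (an extra step that is
    not mentioned in the question).
    """
    return " ".join(caption.lower().translate(_PUNCT_TO_SPACE).split())


def merge_annotations(cuhk, icfg):
    """Append ICFG records to CUHK records with non-overlapping ids.

    Both inputs are assumed (hypothetically) to be lists of dicts with
    a string "caption" and an integer "image_id".
    """
    # Max id + 1, so an ICFG id of 0 does not reuse the largest CUHK id.
    offset = max(rec["image_id"] for rec in cuhk) + 1
    merged = [dict(rec) for rec in cuhk]  # copy; leave the inputs untouched
    for rec in icfg:
        new_rec = dict(rec)
        new_rec["caption"] = normalize_caption(rec["caption"])
        new_rec["image_id"] = rec["image_id"] + offset
        merged.append(new_rec)
    return merged


if __name__ == "__main__":
    cuhk = [{"image_id": 0, "caption": "a man in a black coat"}]
    icfg = [{"image_id": 0, "caption": "A Woman, wearing a Red-shirt."}]
    for rec in merge_annotations(cuhk, icfg):
        print(rec)
```

Whether the image normalization statistics need to be recomputed, and whether the mixed model generalizes better, are left open by this sketch.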

Issue with Missing JSON Files in gene_attrs Folder after Extracting finetune.zip

I have organized the files according to "Organize data folder as follows," but when I extracted the finetune.zip file, there were no JSON files inside the gene_attrs folder (such as g_4x_attrs.json). How can I resolve this issue?
