
Comments (10)

gaoyixuan111 commented on July 16, 2024

Thank you very much for your reply. I am new to the field of image editing with diffusion models and hope to learn from and improve upon your work. Your e4e encoder uses the entire facial region as input without segmenting specific facial areas such as eyes, nose, and mouth. If I directly segment these facial areas and then input them into the e4e encoder, will the editing effect improve?

I mean that you can post all of your questions in one issue; it is not necessary to open a new issue every time, as the earlier issues are not closed. Now let's move on to your question. Directly feeding these segmented facial areas into the e4e encoder will likely degrade the performance, because they are not complete face images and therefore fall outside the face distribution. The e4e encoder was trained on complete face images, and its parameters are fixed in this work. So this idea may not work. I have not checked it myself; you can try it by changing the color_parse_map function to obtain these specific facial areas.

I just tested generation using segmentation maps with InterFaceGAN, and the result is significantly worse.

from w-plus-adapter.

csxmli2016 commented on July 16, 2024

"Thank you for your work and response.

If I input the face segmentation mask obtained from a wild image processed by a segmentation model into e4e to get W+, it might improve the editing effect. However, does W+ itself carry other image information? For example, does the W+ vector obtained from the mask affect the image reconstruction quality? Does the reconstruction effect of diffusion models rely solely on VAE?"

Hi Yixuan, thanks for your interest. This repository has 8 issues, among which 6 were opened by you. You can post all of your questions in one issue; I will receive the notification and reply as soon as possible.
As for your question, the original W+ contains the non-face texture. We think this is not beneficial for the final reconstruction when the text description specifies a background that is inconsistent with this region in W+, so we remove the background using Step 2 (Step 2: Remove Background). The removed background in the face image can be regarded as white. We have checked that this operation has a negligible impact on editing. You can further compare the performance when using W+ with or without the background (this can easily be done by removing the segmentation operation in Step 2: Remove Background).
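A minimal sketch of this background-removal step, assuming a single-channel parsing mask whose nonzero pixels mark the face region (a hypothetical convention; the repository's actual Step 2 script may encode the mask differently):

```python
import numpy as np
from PIL import Image

def remove_background(face_img_path, mask_path):
    """Set non-face pixels to white before passing the image to the e4e encoder.

    Assumes the parsing mask is a single-channel image whose nonzero pixels mark
    the face region (a hypothetical convention; the repository's Step 2 script
    may encode the mask differently).
    """
    img = np.array(Image.open(face_img_path).convert("RGB"))
    mask = np.array(Image.open(mask_path).convert("L")) > 0  # True on the face region
    img[~mask] = 255  # the removed background is regarded as white
    return Image.fromarray(img)
```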

from w-plus-adapter.

gaoyixuan111 commented on July 16, 2024

Thank you very much for your reply. I am new to the field of image editing with diffusion models and hope to learn from and improve upon your work. Your e4e encoder uses the entire facial region as input without segmenting specific facial areas such as eyes, nose, and mouth. If I directly segment these facial areas and then input them into the e4e encoder, will the editing effect improve?

from w-plus-adapter.

csxmli2016 commented on July 16, 2024

Thank you very much for your reply. I am new to the field of image editing with diffusion models and hope to learn from and improve upon your work. Your e4e encoder uses the entire facial region as input without segmenting specific facial areas such as eyes, nose, and mouth. If I directly segment these facial areas and then input them into the e4e encoder, will the editing effect improve?

I mean that you can post all of your questions in one issue; it is not necessary to open a new issue every time, as the earlier issues are not closed.
Now let's move on to your question. Directly feeding these segmented facial areas into the e4e encoder will likely degrade the performance, because they are not complete face images and therefore fall outside the face distribution. The e4e encoder was trained on complete face images, and its parameters are fixed in this work. So this idea may not work. I have not checked it myself; you can try it by changing the color_parse_map function to obtain these specific facial areas.
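If you still want to try it, a minimal sketch of keeping only selected facial components from a parsing map might look like the following. The label indices and the helper itself are hypothetical illustrations, not the repository's actual color_parse_map:

```python
import torch

# Hypothetical parsing-label indices; the real indices depend on the
# face-parsing model used by the repository's color_parse_map function.
KEEP_LABELS = (2, 3, 4)  # e.g. eyes, nose, mouth

def keep_facial_components(image, parse_map, keep=KEEP_LABELS):
    """Zero out everything except the selected components before e4e.

    image:     (3, H, W) tensor in [-1, 1], the range e4e expects
    parse_map: (H, W) long tensor of per-pixel parsing labels
    """
    keep_mask = torch.zeros_like(parse_map, dtype=torch.bool)
    for label in keep:
        keep_mask |= parse_map == label
    return image * keep_mask.unsqueeze(0)
```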

from w-plus-adapter.

gaoyixuan111 commented on July 16, 2024

@csxmli2016 Thank you very much.
① Can you explain why the training time in the second stage is three times longer than in the first stage? Is it mainly due to the dataset or the textual descriptions?
② Do wild_image and wild_mask need to undergo the same aug_self data augmentation operation? Will not applying the same data augmentation operation to wild_image affect the calculation of the loss?
③ Given the superior performance of RCA, I am considering applying the W+ adapter to my "photobooth" project. However, I have embedded relevant facial attribute features for the text. Do you recommend that I freeze the weights of RCA and only train the text encoder?

from w-plus-adapter.

gaoyixuan111 commented on July 16, 2024

@csxmli2016 In the calculation of Loss_disen, you are using 1-M, that is, executing mask_region = 1 - F.interpolate(batch['wild_masks'], (64, 64), mode='bilinear').repeat(1,4,1,1). However, the experimental section of the paper indicates that M represents the facial region. In Equation (6), M is used, so why is 1-M used in the code for calculating Loss_disen instead of directly using wild_mask? Is there an inconsistency?
When running processWildimage.py, the black area represents the facial region and the white area represents the background. In aug_self, nonzero_indices = np.argwhere(mask == 255) is used.

from w-plus-adapter.

csxmli2016 commented on July 16, 2024

@csxmli2016 Thank you very much. ① Can you explain why the training time in the second stage is three times longer than in the first stage? Is it mainly due to the dataset or the textual descriptions? ② Do wild_image and wild_mask need to undergo the same aug_self data augmentation operation? Will not applying the same data augmentation operation to wild_image affect the calculation of the loss? ③ Given the superior performance of RCA, I am considering applying the W+ adapter to my "photobooth" project. However, I have embedded relevant facial attribute features for the text. Do you recommend that I freeze the weights of RCA and only train the text encoder?

  • Can you explain why the training time in the second stage is three times longer than in the first stage? Is it mainly due to the dataset or the textual descriptions?

In-the-wild generation is more difficult than Stage I, as the model must generalize to different text descriptions, face locations, face sizes, etc., so it needs more training time.

  • Do wild_image and wild_mask need to undergo the same aug_self data augmentation operation? Will not applying the same data augmentation operation to wild_image affect the calculation of the loss?

Adopting aug_self is mainly a data augmentation step for improving data diversity.

  • Do you recommend that I freeze the weights of RCA and only train the text encoder?

Fine-tuning the RCA together with the text encoder may be better.

  • Mask region M

In Eqn. 6, 0 in M represents the face region, while 1 in M represents the background region. In our code, 0 in batch['wild_masks'] represents the background while 1 in batch['wild_masks'] is the face region. So I use M = 1 - batch['wild_masks']. You can save each variant to check them.
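For reference, this is how the convention described above maps onto the quoted line; the snippet assumes the surrounding training loop (batch, 64x64 latents with 4 channels) and simply annotates the existing logic:

```python
import torch.nn.functional as F

# batch['wild_masks']: (B, 1, H, W) with 1 on the face region and 0 on the
# background, as stated above. Eqn. 6 expects M with 1 on the background, so
# the mask is resized to the 64x64 latent resolution, repeated over the 4
# latent channels, and then inverted.
mask_region = 1 - F.interpolate(batch['wild_masks'], (64, 64), mode='bilinear').repeat(1, 4, 1, 1)
```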

from w-plus-adapter.

gaoyixuan111 commented on July 16, 2024

@csxmli2016 Thank you for your continued responses. After reviewing the information and scripts you provided regarding checkpoints, I did not find where you explicitly save global_step in the checkpoint. Is the global_step automatically saved in the checkpoint? This would help me resume training from where it left off in case of an interruption.
Here is my idea for resuming training after an interruption:
accelerator.load_state(checkpoint_path)
global_step = accelerator.state_dict()["global_step"]
print(f"Resuming training from checkpoint at global step: {global_step}")

from w-plus-adapter.

csxmli2016 commented on July 16, 2024

@csxmli2016 Thank you for your continued responses. After reviewing the information and scripts you provided regarding checkpoints, I did not find where you explicitly save global_step in the checkpoint. Is the global_step automatically saved in the checkpoint? This would help me resume training from where it left off in case of an interruption. Here is my idea for resuming training after an interruption: accelerator.load_state(checkpoint_path) global_step = accelerator.state_dict()["global_step"] print(f"Resuming training from checkpoint at global step: {global_step}")

See Line 739

accelerator.save_state(save_path)
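A common resume pattern with accelerate is to encode the step in the checkpoint directory name, since accelerator.save_state stores model/optimizer/RNG state but not the step counter itself. The helpers below are a generic sketch of that pattern, not the repository's code:

```python
import os
from accelerate import Accelerator

def save_checkpoint(accelerator: Accelerator, output_dir: str, global_step: int) -> str:
    # Encode the step in the directory name so it can be recovered on resume.
    save_path = os.path.join(output_dir, f"checkpoint-{global_step}")
    accelerator.save_state(save_path)
    return save_path

def resume_checkpoint(accelerator: Accelerator, checkpoint_path: str) -> int:
    # Restore model/optimizer/RNG state, then parse the step from the dir name.
    accelerator.load_state(checkpoint_path)
    global_step = int(os.path.basename(checkpoint_path.rstrip("/")).split("-")[-1])
    print(f"Resuming training from checkpoint at global step: {global_step}")
    return global_step
```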

from w-plus-adapter.

gaoyixuan111 commented on July 16, 2024

@csxmli2016 I'm confused about why you didn't put the unet on the accelerator while training the model. Are the adapter_modules in the trained checkpoints all in torch.float32 precision? I'm currently using LoRA to fine-tune your weights. When I add the statement unet.to(accelerator.device, dtype=torch.float16), I encounter the error "ValueError: Attempting to unscale FP16 gradients." When I add unet.to(accelerator.device, dtype=torch.float32), there is no error, but I'm not sure if this will affect the model's performance.
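For reference, this error usually means some trainable parameters are stored in fp16, which the AMP gradient scaler cannot unscale. A common workaround (a generic mixed-precision pattern, not this repository's code) is to cast the frozen UNet to fp16 and then upcast only the trainable LoRA/adapter parameters back to fp32:

```python
import torch

def cast_for_mixed_precision(unet, device, frozen_dtype=torch.float16):
    # Keep frozen weights in fp16 to save memory, but leave every trainable
    # parameter in fp32 so the AMP gradient scaler can unscale its gradients.
    unet.to(device, dtype=frozen_dtype)
    for param in unet.parameters():
        if param.requires_grad:
            param.data = param.data.to(torch.float32)
```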

from w-plus-adapter.
