Comments (10)
Thank you very much for your reply. I am new to the field of image editing with diffusion models and hope to learn from and improve upon your work. Your e4e encoder uses the entire facial region as input without segmenting specific facial areas such as eyes, nose, and mouth. If I directly segment these facial areas and then input them into the e4e encoder, will the editing effect improve?
I mean you can propose your questions in one issue; it is not necessary to open a new issue every time, as the former issues are not closed. Now let's move on to your question. Directly feeding these segmented facial areas into the e4e encoder will likely degrade the performance, as they are not complete face images and thus fall outside the face distribution the encoder was trained on. The e4e encoder was trained on complete face images, and its parameters are fixed in this work, so your idea may not work. I have not checked this; you can try it by changing the color_parse_map function to obtain these specific facial areas.
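As a rough illustration of the idea above (not the repository's actual color_parse_map implementation), one could filter a face-parsing label map so that only chosen regions survive. The label ids below are hypothetical; the real ids depend on the parsing model used in the repo.

```python
import numpy as np

# Hypothetical label ids for a face-parsing map; the real ids depend on the
# parsing network used by color_parse_map in the repository.
REGION_IDS = {"eyes": [4, 5], "nose": [10], "mouth": [11, 12, 13]}

def extract_regions(image, parse_map, region_ids):
    """Zero out every pixel whose parsing label is not in region_ids.

    image:     (H, W, 3) uint8 face image
    parse_map: (H, W) int label map from a face-parsing network
    """
    keep = np.isin(parse_map, region_ids)              # (H, W) boolean mask
    return image * keep[..., None].astype(image.dtype)

# Tiny toy example: a 2x2 "image" with a 2x2 label map.
img = np.full((2, 2, 3), 200, dtype=np.uint8)
labels = np.array([[4, 0], [10, 1]])                   # eyes / bg / nose / skin
out = extract_regions(img, labels, REGION_IDS["eyes"] + REGION_IDS["nose"])
```

Note that the result is exactly the kind of incomplete face image discussed above: everything outside the kept regions becomes black, which is out of the distribution e4e was trained on.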
I just tested the generation effect using segmentation maps with InterFaceGAN, and the result is significantly worse.
from w-plus-adapter.
Thank you for your work and response.
If I input the face segmentation mask obtained from a wild image (processed by a segmentation model) into e4e to get W+, it might improve the editing effect. However, does W+ itself carry other image information? For example, does the W+ vector obtained from the mask affect the image reconstruction quality? Does the reconstruction quality of the diffusion model rely solely on the VAE?
Hi Yixuan, thanks for your interest. This repository has 8 issues, of which 6 were opened by you. You can propose all your questions in one issue; I will receive the notification and reply as soon as possible.
As for your question, the original W+ contains the non-face texture. We think this is not beneficial for the final reconstruction when the text description is inconsistent with the background captured in W+. So we remove the background in Step 2 (see
w-plus-adapter/script/ProcessWildImage.py
Line 138 in b88bc0a
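The background-removal step referenced above can be sketched as simple mask-based compositing: keep the face pixels and replace everything else with a constant fill. This is only an illustrative sketch of the masking idea, not the code at the referenced line in script/ProcessWildImage.py.

```python
import numpy as np

def remove_background(image, face_mask, fill=0.0):
    """Keep only the face region of `image` before passing it to e4e (sketch).

    image:     (H, W, 3) float array
    face_mask: (H, W) array, 1 for face pixels, 0 for background
    """
    mask = face_mask[..., None].astype(image.dtype)   # (H, W, 1), broadcastable
    return image * mask + fill * (1.0 - mask)

# Toy example: a 2x2 gray "image" with a diagonal face mask.
img = np.full((2, 2, 3), 100.0, dtype=np.float32)
mask = np.array([[1, 0], [0, 1]], dtype=np.float32)
out = remove_background(img, mask)
```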
@csxmli2016 Thank you very much.
①Can you explain why the training time in the second stage is three times longer than in the first stage? Is it mainly due to the dataset or the textual descriptions?
②Do wild_image and wild_mask need to undergo the same aug_self data augmentation operation? Will not applying the same data augmentation operation to wild_image affect the calculation of the loss?
③Given the superior performance of RCA, I am considering applying the W+ adapter to my "photobooth" project. However, I have embedded relevant facial attribute features in the text. Do you recommend that I freeze the weights of RCA and only train the text encoder?
@csxmli2016 In the calculation of Loss_disen, you use 1-M, i.e., you execute mask_region = 1 - F.interpolate(batch['wild_masks'], (64, 64), mode='bilinear').repeat(1, 4, 1, 1). However, the experimental section of the paper indicates that M represents the facial region, and Equation (6) uses M. So why does the code use 1-M to calculate Loss_disen instead of directly using wild_mask? Is there an inconsistency?
When running ProcessWildImage.py, the black area represents the facial region and the white area represents the background. In aug_self, nonzero_indices = np.argwhere(mask == 255) is used.
- Can you explain why the training time in the second stage is three times longer than in the first stage? Is it mainly due to the dataset or the textual descriptions?
In-the-wild generation is more difficult than Stage I, as the model must generalize to different text descriptions, face locations, face sizes, etc., so it needs more training time.
- Do wild_image and wild_mask need to undergo the same aug_self data augmentation operation? Will not applying the same data augmentation operation to wild_image affect the calculation of the loss?
Adopting aug_self is mainly a data augmentation step for improving data diversity.
- For your third question: fine-tuning the RCA together with the text encoder may be better.
- Mask region M
In Eqn. 6, 0 in M represents the face region, while 1 in M represents the background region. In our code, 0 in batch['wild_masks'] represents the background while 1 in batch['wild_masks'] is the face region. So I use M = 1 - batch['wild_masks']. You can save each variable to check this.
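The convention flip described above can be verified on a toy batch. The sketch below mirrors the mask_region line quoted in the question (resize to the latent resolution, invert, and repeat over the 4 latent channels), using an 8x8 toy mask and a 4x4 target size instead of the real 64x64:

```python
import torch
import torch.nn.functional as F

# Toy wild_masks batch: 1 = face region, 0 = background (shape B x 1 x H x W),
# mirroring the convention of batch['wild_masks'] in the training code.
wild_masks = torch.zeros(1, 1, 8, 8)
wild_masks[:, :, 2:6, 2:6] = 1.0  # face in the center

# Resize to the latent resolution and invert, so that, as in Eqn. 6,
# 0 marks the face and 1 marks the background; repeat over 4 latent channels.
mask_region = 1 - F.interpolate(wild_masks, (4, 4), mode='bilinear').repeat(1, 4, 1, 1)

# Corners (background) become 1, the center (face) becomes 0.
```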
@csxmli2016 Thank you for your continued responses. After reviewing the information and scripts you provided regarding checkpoints, I did not find where you explicitly save global_step in the checkpoint. Is global_step automatically saved in the checkpoint? This would help me resume training from where it left off in case of an interruption.
Here is my idea for resuming training after an interruption:
accelerator.load_state(checkpoint_path)
global_step = accelerator.state_dict()["global_step"]
print(f"Resuming training from checkpoint at global step: {global_step}")
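One caveat with the idea above: accelerate's save_state() does not store global_step by itself unless the object is registered for checkpointing. A common alternative pattern (used by many diffusers training scripts, though I have not checked this repository's exact setup) is to encode the step in the checkpoint directory name and parse it back:

```python
import os

def resume_step_from_checkpoint(checkpoint_path):
    """Recover the global step from a directory named like 'checkpoint-500'.

    This assumes the training loop saved with
    accelerator.save_state(os.path.join(output_dir, f"checkpoint-{global_step}")),
    which is a common convention, not something confirmed for this repo.
    """
    dirname = os.path.basename(os.path.normpath(checkpoint_path))
    return int(dirname.split("-")[-1])

# After accelerator.load_state(checkpoint_path):
step = resume_step_from_checkpoint("output/checkpoint-500")  # -> 500
```

Another option is accelerator.register_for_checkpointing(obj) on a small stateful object holding the step, so save_state/load_state handle it automatically.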
See Line 739 in b88bc0a.
@csxmli2016 I'm confused about why you didn't put the unet on the accelerator while training the model. Are the adapter_modules in the trained checkpoints all in torch.float32 precision? I'm currently using LoRA to fine-tune your weights. When I add the statement unet.to(accelerator.device, dtype=torch.float16), I encounter the error "ValueError: Attempting to unscale FP16 gradients." When I add unet.to(accelerator.device, dtype=torch.float32), there is no error, but I'm not sure whether this affects the model's performance.
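For context on the error above: torch's AMP GradScaler refuses to unscale gradients that are themselves fp16, so with mixed-precision training the trainable parameters (here the LoRA/adapter weights) must stay in fp32; only frozen modules can safely be cast to fp16. A minimal sketch of that split (an illustration of the general AMP rule, not this repository's actual training code):

```python
import torch

def cast_frozen_to_half(module):
    """Cast a module to fp16 but keep trainable parameters in fp32 (sketch).

    Calling unet.to(dtype=torch.float16) casts *everything*, including the
    trainable LoRA params, which is what triggers
    "ValueError: Attempting to unscale FP16 gradients" inside GradScaler.
    """
    module.half()
    for p in module.parameters():
        if p.requires_grad:
            p.data = p.data.float()  # restore trainable params to fp32
    return module

# Toy module standing in for a unet with one frozen and one trainable tensor.
lin = torch.nn.Linear(4, 4)
lin.weight.requires_grad_(False)  # pretend this weight is frozen
lin.bias.requires_grad_(True)     # pretend this is a trainable adapter param
cast_frozen_to_half(lin)
```

Keeping the whole unet in fp32, as in your second attempt, is also valid; it mainly costs memory and speed rather than accuracy.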