lucidrains / gigagan-pytorch
Implementation of GigaGAN, new SOTA GAN out of Adobe. Culmination of nearly a decade of research into GANs
License: MIT License
Hi, I ran into some problems while trying to launch training on multiple GPUs (one machine) with DistributedDataParallel:
1. Some fields are missing in gigagan_pytorch.py: self.D.multiscale_input_resolutions on line 2159 and self.G.input_image_size on line 2095. For a quick fix I added module before referencing the field and moved on (a sketch of the workaround is below).
2. RuntimeError: Inplace update to inference tensor outside InferenceMode is not allowed. You can make a clone to get a normal tensor before doing inplace update, originating from gigagan_pytorch.py line 2514 (2479). I tried cloning the generator output (as well as the output of the sample function) as a quick fix, but it didn't help.
This is my first time using accelerate, so I have no prior experience with it - when running on a single GPU everything works fine.
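For reference, the .module workaround I mean looks roughly like this (just a sketch, not the repo's exact lines - DistributedDataParallel wraps the model, so its attributes live on .module):
# hypothetical illustration of the quick fix, not the actual repo code
discriminator = self.D.module if hasattr(self.D, 'module') else self.D
resolutions = discriminator.multiscale_input_resolutions
generator = self.G.module if hasattr(self.G, 'module') else self.G
input_size = generator.input_image_size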
Since multi-GPU training is marked as done in your todo list, I suppose the issue is on my side - but maybe you have some idea what causes these problems? I use torch 2.0.1+cu117 and accelerate 0.22.0.
Anyway, thanks for the great work!
I find the results of the GAN upsampler quite impressive compared to the SD upscaler and Real-ESRGAN. However, in my experience, training an SR model on paired LR-HR data in a supervised manner can make it challenging to generate realistic textures on the HR image, especially when compared to pure generative models, and often results in Real-ESRGAN-like outputs. What, in your opinion, is the main concept that makes the GAN upsampler so effective?
Thank you for your great project! I have a small question: can the discriminator accept multi-scale input? I notice that the discriminator's forward() only accepts a single input, not multi-scale inputs.
In your tests, have you tried using this on video? The area of face upscaling, especially in videos, is lacking, and I am wondering whether you have tested it and whether you see much temporal inconsistency.
I'm sorry for opening such an annoying issue, but how can I use this as a package for txt2img/upscaling?
Thanks a lot for starting this effort. I had a quick look at the code, and it seems that although the GAN network structure is done, there is no training code or reference to data. Can you please explain? Many thanks.
Is your project a functional GigaGAN model that any user can use? If so, I'm confused about the steps for how I would even use the model, and about what it generates, or even the upscaling - because if it's an actual working implementation of GigaGAN, that would be dope.
Hello, nice work!~
It looks like the code does not read the text information together with the image yet, right?
Is there any pre-trained model?
Not sure how to run this model for inference to sharpen an image. Is there an example or instructions on how to use the source code here?
Hi!
Does anyone know the parameters to reproduce the paper closely in terms of model size?
Thanks!
Hello Phil,
Thanks for the work you do. Someone else has been copying your code without respecting your MIT license or including your name, or any reference to you, whatsoever. Please see here:
https://github.com/jianzhnie/GigaGAN
His code is updated to be nearly identical to yours within a day of any update you make. As you can see, I opened an issue on his project page to address the licensing problem, which he closed without updating the license agreement or making any mention of you.
Here is how to request a DMCA page takedown in the case of license violations:
https://docs.github.com/en/site-policy/content-removal-policies/guide-to-submitting-a-dmca-takedown-notice
Just wanted to bring this to your attention.
Cheers
How do I implement text-to-image training and generation on the LAION dataset?
The order of resolutions is from high to low in the generator:
https://github.com/lucidrains/gigagan-pytorch/blob/9a364dd33cf0ff053ea01041e02bc41949e53609/gigagan_pytorch/gigagan_pytorch.py#L919C9-L919C20
I suppose it should be reversed?
Hi @lucidrains, are the weights of GigaGAN released, especially for the upscaler?
If the weights are not available, on which dataset can I train the upscaler, since this is not mentioned in the README? And what image format should the data be in for the dataloader?
Hello lucidrains,
Thanks for your code!
I was wondering, is it possible to upscale an image from 1024px to 4K with GigaGAN? (Even if it takes a lot of memory and GPU/CPU.)
In your code it seems to be a 256px to 512px upscale?
If upscaling from 1024px is possible, what values should I change in your code?
gan = GigaGAN(
    train_upsampler = True,     # set this to True
    generator = dict(
        style_network = dict(
            dim = 64,
            depth = 4
        ),
        dim = 32,
        image_size = 256,
        input_image_size = 64,
        unconditional = True
    ),
    discriminator = dict(
        dim_capacity = 16,
        dim_max = 512,
        image_size = 256,
        num_skip_layers_excite = 4,
        multiscale_input_resolutions = (128,),
        unconditional = True
    ),
    amp = True
).cuda()

dataset = ImageDataset(
    folder = '/path/to/your/data',
    image_size = 256
)
Should I change the dim, multiscale_input_resolutions? I'm a bit lost. :-)
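My naive guess would be something like the following - just a sketch of the size parameters I would expect to change, assuming the same config structure works at larger resolutions (I have not verified this):

from gigagan_pytorch import GigaGAN

# guessed values only - same structure as above, with the sizes swapped for a 1024 -> 4096 upsampler
gan = GigaGAN(
    train_upsampler = True,
    generator = dict(
        style_network = dict(
            dim = 64,
            depth = 4
        ),
        dim = 32,
        image_size = 4096,          # target resolution (4K)
        input_image_size = 1024,    # source resolution
        unconditional = True
    ),
    discriminator = dict(
        dim_capacity = 16,
        dim_max = 512,
        image_size = 4096,
        num_skip_layers_excite = 4,
        multiscale_input_resolutions = (2048,),  # guessed: an intermediate scale below the output size
        unconditional = True
    ),
    amp = True
).cuda()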
Regards,
Sparfell
Hi! Thanks for the implementation, it is great! One possible issue I noticed is that the rgb images in each generator block are not collected correctly. If I understand correctly, in this line, rgbs should collect rgb rather than layer_rgb?
Hi! I was reviewing some parts of your implementation (great as usual!) and noticed that you don't alter gradient computation for the models when it is not required. For example, when computing losses for the discriminator we don't need gradients in the generator model, and vice versa - I believe this should save some memory? A sketch of what I mean is below.
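Something along these lines (placeholder names G, D, z, real, d_loss_fn and g_loss_fn, not your API - just illustrating the general pattern):

import torch

def set_requires_grad(module: torch.nn.Module, flag: bool):
    # toggle requires_grad so autograd skips the frozen network's parameters
    for p in module.parameters():
        p.requires_grad_(flag)

# discriminator step: freeze G and detach the fake batch so no graph is kept through the generator
set_requires_grad(G, False)
set_requires_grad(D, True)
fake = G(z).detach()
d_loss = d_loss_fn(D(real), D(fake))
d_loss.backward()

# generator step: freeze D so its parameters accumulate no gradients
set_requires_grad(D, False)
set_requires_grad(G, True)
g_loss = g_loss_fn(D(G(z)))
g_loss.backward()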
I've got a bunch of compute for the next couple of weeks and am thinking of training this on LAION.
Wondering if there is any other training going on right now. Would hate to duplicate efforts too much.
From the paper
https://arxiv.org/pdf/2006.04710.pdf
First, each token needs to attend to itself to ensure the Lipschitz property.
Second, torch.cdist is not the correct way to do it - follow the original paper.
For tied qk:
AB = torch.matmul(qk, qk.transpose(-1, -2))   # pairwise dot products
AA = torch.sum(qk ** 2, -1, keepdim=True)     # squared norms of each token
BB = AA.transpose(-1, -2)                     # since query and key are tied
attn = -(AA - 2 * AB + BB)                    # negative squared L2 distances
attn = attn.mul(self.scale).softmax(-1)
For separate qk:
AB = torch.matmul(q, k.transpose(-1, -2))                    # pairwise dot products
AA = torch.sum(q ** 2, -1, keepdim=True)                     # squared norms of queries
BB = torch.sum(k ** 2, -1, keepdim=True).transpose(-1, -2)   # squared norms of keys
attn = -(AA - 2 * AB + BB)                                   # negative squared L2 distances
attn = attn.mul(self.scale).softmax(-1)
This is basically torch.cdist().square(), but more efficient, and it supports double backward for R1 regularization.
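A quick, self-contained sanity check of that equivalence (my own snippet, tied-qk case):

import torch

qk = torch.randn(2, 5, 8)                       # (batch, tokens, dim)
AB = torch.matmul(qk, qk.transpose(-1, -2))
AA = torch.sum(qk ** 2, -1, keepdim=True)
manual = AA - 2 * AB + AA.transpose(-1, -2)     # squared pairwise L2 distances
reference = torch.cdist(qk, qk).square()
assert torch.allclose(manual, reference, atol=1e-4)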
Lastly, I believe the paper only used L2 self-attention in the discriminator; the generator should still use dot-product attention.
Hi!
I was running a few experiments and noticed that GP is extremely high in the first few hundred steps:
GP > 60000, and then it gradually goes down to around GP = 20.
Is this normal behaviour? In my previous experience with StyleGAN, GP was small from the beginning.
Is this ready to start training?
I am very curious about this project - how can you implement things without having weights to check the results against? How can you make sure the implementation is correct?
I am only beginning my studies, but I have access to great hardware for training. I don't know if it will be enough, but if it is, I may be able to help with the first weights.
Hi,
Would you mind providing an example of how to use this model for SR? What would the text input be? I assume, since CLIP is used, that rather than a text input we use the image and its embedding (since they are in a joint space)?
I'd like to suggest adding information to the README about the minimum hardware requirements for training.
I am new to this scene - I have been studying all of these technologies recently - and I would also like to know about these requirements, because I recently purchased a fairly simple machine to start my studies.
Is this project able to run on my modest machine?
Core i9 12900HK @ 5.0 GHz
DDR4 64GB RAM
1 x RTX 4090 24GB GDDR6X
I know some generative models will run fine on this hardware, but some text models like LLaMA require much more RAM/VRAM to run.
I'm planning on acquiring more GPUs in the future, but I think this modest hardware will work well to start my study journey.
So, what can you share about which hardware works well for running training and inference?
What is the purpose of line 420 in unet_upsampler.py?
It does not seem to be used anywhere and has no effect on anything.
Hi! I know this is a work in progress, but I'm wondering if you could create a basic installation/run section in the readme?
Has anyone trained a sample model of this? I realize full-scale training on LAION will take quite a bit of resources (SD v2 trained for 200k GPU hours), but I'm wondering (1) are there any publicly-available sample trained models yet, and (2) any estimate on the resources required for a full-scale training? I'm guessing GigaGan will require less training than Stable Diffusion did.
Thanks for the great work!
May I know when will the pretrained models be available?
Thanks.
Thanks for updating the implementation frequently! The codebase looks much nicer than when I first looked at it. I took a closer look at the details and would like to ask some questions about specific design choices. Of course I understand that not every design decision comes with a reason, but I would like to know if you have references/intuition for these things:
Starting a new issue for better visibility to other people. I have some quick questions regarding the implementation of the discriminator. The arguments to its forward function (images, rgb, real_images) seem a bit confusing.
I am not sure my understanding is correct. In a training iteration of a typical GAN, the discriminator is called twice: once on a batch of real images and once on a batch of generated images. While we could concatenate real and fake images to call the network once, I think that is not done here, as the output logit has only one scalar rather than two.
Then I would like to know what the expected arguments are for the real pass and the fake pass. My guess: for the fake pass, rgbs is the collection of different-resolution rgbs output by the generator, and images should be the final output of the generator (the generated images at the highest resolution); for the real pass, images is the real image and rgbs should be None. The real_images argument can always be None.
I may be very wrong, so really appreciate if you can correct and explain.
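In other words, my assumption is something like the pseudo-usage below (argument names taken from the forward signature as I understand it, calling pattern only guessed, variable names are placeholders):

# fake pass: final generator output plus the intermediate multi-resolution rgbs
fake_logits = discriminator(images = generated_images, rgbs = generated_rgbs, real_images = None)

# real pass: just the real batch, no intermediate rgbs
real_logits = discriminator(images = real_batch, rgbs = None, real_images = None)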
Two additional questions: doesn't self.resize_image_to(rgb, rgb.shape[-1]) just return rgb, so it is basically concatenating two copies of rgb together?
I keep getting NaN losses like this after a few hours of training:
| 1360/1000000 [7:25:23<5127:52:45, 18.49s/it]G: 0.75 | MSG: -0.17 | VG: 0.00 | D: nan | MSD: 1.97 | VD: 0.00 | GP: nan | SSL: nan | CL: 0.00 | MAL: 0.00
0%| | 1380/1000000 [7:31:46<5217:53:00, 18.81s/it]G: 0.76 | MSG: -0.17 | VG: 0.00 | D: nan | MSD: 1.97 | VD: 0.00 | GP: nan | SSL: nan | CL: 0.00 | MAL: 0.00
I'm training on about 200k images with the settings from the README.
gan = GigaGAN(
    train_upsampler = True,
    generator = dict(
        style_network = dict(
            dim = 64,
            depth = 4
        ),
        dim = 32,
        image_size = 256,
        input_image_size = 64,
        unconditional = True
    ),
    discriminator = dict(
        dim_capacity = 16,
        dim_max = 512,
        image_size = 256,
        num_skip_layers_excite = 4,
        multiscale_input_resolutions = (128,),
        unconditional = True
    ),
    amp = True
).cuda()
Here's an image before NaN loss:
Here's an image of NaN loss:
My current batch size is 20 btw.
What should I do? Did you ever manage to train successfully with the provided settings?
Hello, thanks for your code. While trying it, I found that the resolutions of "multiscale_input_resolutions" and "rgbs" were mismatched. My understanding of the paper is that the multi-scale inputs are smaller than the original image, but in your code you keep sizes larger than the original image in order to preserve more features. I'm confused about this and hope for your reply.
How long would it take to do a full replication?
And how close will it be to what is presented in the official article?
As I understand it, you are using existing technology and adapting it as needed?
I am overwhelmed by your enthusiasm and desire to make available to everyone what is not publicly available.
It deserves a lot of respect - you are making a huge contribution to the evolution of new technologies by making them openly available, overcoming many questions and challenges!
Hello, first of all, thank you for your great work.
I inserted self-attention and cross-attention modules into stylegan2-ada, right after the conv1 SynthesisLayer at the 4, 8, 16 and 32 resolutions, and I use CLIP to extract the text features along with a contrastive loss. However, after a few iterations the resulting pictures became completely a single color. Could you please give me some advice on how to improve this?
Hi! While training on multiple GPUs with gradient accumulation steps > 1, there is no substantial speedup relative to a single GPU (there is a speedup when the value is equal to 1). I found the following threads on huggingface, here and here, that seem to provide a solution. I even ran a dummy test by just adding the proper argument to Accelerator, and the training was indeed much faster (in your class I set the gradient accumulation steps to 1, but for Accelerator to 8, without making other changes to account for this modification, so the results weren't particularly useful). If you have time to check whether this is interesting for you, I'd be grateful.
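For reference, the pattern from those threads looks roughly like this (a sketch with placeholder model/optimizer/dataloader/loss names, not your trainer code); Accelerator then only synchronizes gradients across GPUs on the last micro-batch of each accumulation window:

from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps = 8)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    # gradient sync across processes is skipped until the final micro-batch in the window
    with accelerator.accumulate(model):
        loss = compute_loss(model, batch)   # placeholder loss function
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()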
Hi Phil,
I noted a few possible discrepancies, and I have some suggestions as well.
You may want to consider:
Turning off demodulation for ToRGB
Demodulation is turned on in the ToRGB: in the current implementation, ToRGB uses demod=True. This makes it a lot harder for the generator to get the correct feature magnitude at each level of the progressive growing pathway.
Removing sample-adaptive kernels in the ToRGB component
I don't believe that SAKS is used by GigaGAN in the ToRGB section. The caption of Figure 4 states: "The style code modulates the main generator using our style-adaptive kernel selection, shown on the right". The main generator is the upsampling pathway consisting of style-based feature modulation, not the progressive growing section. The output of the ToRGB section is three channels, so adding more kernels probably won't improve representational capacity but will lead to oscillations.
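For anyone reading along, here is a minimal, self-contained sketch of a StyleGAN2-style modulated convolution (not the repo's code) in which demodulation is a flag; the first suggestion amounts to running the ToRGB conv with demod=False:

import torch
import torch.nn.functional as F

def modulated_conv2d(x, weight, style, demod = True, eps = 1e-8):
    # x: (b, in_ch, h, w), weight: (out_ch, in_ch, k, k), style: (b, in_ch)
    b, in_ch, h, w = x.shape
    out_ch, _, k, _ = weight.shape
    # modulate: scale the kernel's input channels by a per-sample style vector
    w = weight[None, ...] * style[:, None, :, None, None]
    if demod:
        # demodulate: rescale each output filter to unit norm (StyleGAN2)
        w = w * torch.rsqrt(w.pow(2).sum(dim = (2, 3, 4), keepdim = True) + eps)
    # grouped-conv trick: apply a different kernel to each sample in the batch
    x = x.reshape(1, b * in_ch, h, w)
    w = w.reshape(b * out_ch, in_ch, k, k)
    out = F.conv2d(x, w, padding = k // 2, groups = b)
    return out.reshape(b, out_ch, h, w)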
Hey there, check out
https://github.com/Keep-up-sharma/Dynamic-Layers
It seems like dynamic convolutions lead to better results... this model should benefit from it.
It seems that the current implementation of the Generator/Discriminator does not make any use of the primary dim argument. It looks like capacity is the main "go to" for altering the overall model size on the z dimension.
Perhaps dim could be removed, or the term capacity could be replaced with dim.
Hi! I noticed that in a recent commit you changed this line, but isn't the original version with >= correct? It returned the rgbs with size greater than the input size, while now it returns those less than or equal to the input size.
The to_rgb Conv has only one learnable kernel, but more are added here:
https://github.com/lucidrains/gigagan-pytorch/blob/9a364dd33cf0ff053ea01041e02bc41949e53609/gigagan_pytorch/gigagan_pytorch.py#L995C20-L995C20
I'd love to try this out, especially the upscaling, and it would be great if you could get it running on replicate.com - thank you for your great work!