
gigagan-pytorch's Issues

Multi GPU training

Hi, I ran into some problems while trying to launch training on multiple GPUs (one machine):

  • First, there were problems with DistributedDataParallel missing some fields in gigagan_pytorch.py: self.D.multiscale_input_resolutions at line 2159 and self.G.input_image_size at line 2095. As a quick fix I added module before the referenced field and moved on.
  • Then I got the following error: RuntimeError: Inplace update to inference tensor outside InferenceMode is not allowed. You can make a clone to get a normal tensor before doing inplace update, originating from gigagan_pytorch.py line 2514 (2479). As a quick fix I tried cloning the generator output (as well as the sample function output), but it didn't help.
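For the first problem, a small unwrap helper (hypothetical, not from the repo) avoids sprinkling .module through the training code:

```python
from torch import nn

# Hypothetical helper: DistributedDataParallel stores the wrapped model on
# `.module`, so custom attributes must be read through it. This unwraps
# when needed and is a no-op for plain modules.
def unwrap(model: nn.Module) -> nn.Module:
    return model.module if hasattr(model, 'module') else model

# usage sketch: unwrap(self.D).multiscale_input_resolutions
```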

I'm using accelerate for the first time, so I have no prior experience; everything works fine when running on a single GPU.
Since multi-GPU training is marked as done in your todo list, I suppose the issue is on my side, but maybe you have some idea of what causes these problems? I'm using torch 2.0.1+cu117 and accelerate 0.22.0.

Anyway, thanks for the great work!

What makes the gan upsampler so effective

I find the results of the GAN upsampler quite impressive compared to the SD upscaler and Real-ESRGAN. However, in my personal experience, training an SR model on paired LR-HR data in a supervised manner can make it challenging to generate realistic textures in the HR image, especially compared to pure generative models; this often results in Real-ESRGAN-like outcomes. What, in your opinion, is the main concept that makes the GAN upsampler so effective?

Question about discriminator

Thank you for your great project! I have a small question: can the discriminator accept multi-scale input? I notice that the discriminator's forward() only accepts a single input, not multi-scale inputs.

fidelity

In your tests, have you tried using this on video? The area of face upscaling, especially in videos, is lacking, and I'm wondering whether you have tested it and whether you see much temporal inconsistency.

How can I use this?

I'm sorry for leaving such an annoying issue, but how can I use this as a package for txt2img/upscaling?

How far is this codebase?

Thanks a lot for starting this effort. I had a quick look at the code, and it seems that although the GAN network structure is done, there is no training code or reference to data. Can you please explain? Many thanks.

Confused about this project?

Is your project a functional GigaGAN model that any user can use? If so, I'm confused about the steps for how I would even use the model, and what it generates, or even the upscaling. Because if it's an actual working implementation of GigaGAN, that would be dope.

Config to reproduce paper

Hi!
Does anyone know the parameters to reproduce the paper closely in terms of model size?

Thanks!

License Violation Issue

Hello Phil,

Thanks for the work you do. Someone else has been copying your code without respecting your MIT license or including your name, or any reference to you, whatsoever. Please see here:

https://github.com/jianzhnie/GigaGAN

His code is updated to be nearly identical to yours within a day of you making any updates. As you can see, I opened an issue on his project page to address the licensing problem, which he closed without updating the license agreement or making any mention of you.

Here is how to request a DMCA takedown in the case of license violations:
https://docs.github.com/en/site-policy/content-removal-policies/guide-to-submitting-a-dmca-takedown-notice

Just wanted to bring this to your attention.

Cheers

Weights of Gigagan Upscaler

Hi @lucidrains, have the weights of GigaGAN been released, especially for the upscaler?

If the weights are not available, which dataset can I train the upscaler on? It isn't mentioned in the README. Also, what format should the images be in for the dataloader?

the loss became nan after a few train steps

My dataset is about 10,000 images of someone wearing different clothes, and the text consists of the images' captions.
[screenshots: training setup and samples]
But the loss became very strange:
[screenshot: loss curve]
Does anyone know how this happens?

[Question] About the upscaler

Hello lucidrains,

Thanks for your code!

I was wondering: is it possible to upscale an image from 1024px to 4K with GigaGAN (even if it takes a lot of memory and GPU/CPU)?

In your code it seems to be an upscale from 256px to 512px?

If it is possible to upscale to 1024px, what values should I change in your code?

gan = GigaGAN(
    train_upsampler = True,     # set this to True
    generator = dict(
        style_network = dict(
            dim = 64,
            depth = 4
        ),
        dim = 32,
        image_size = 256,
        input_image_size = 64,
        unconditional = True
    ),
    discriminator = dict(
        dim_capacity = 16,
        dim_max = 512,
        image_size = 256,
        num_skip_layers_excite = 4,
        multiscale_input_resolutions = (128,),
        unconditional = True
    ),
    amp = True
).cuda()

dataset = ImageDataset(
    folder = '/path/to/your/data',
    image_size = 256
)

Should I change dim or multiscale_input_resolutions? I'm a bit lost. :-)
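Not an authoritative answer, but if the architecture scales, the README config would change roughly as in this sketch. All values here are my guesses, unverified against the codebase; note that activation memory grows roughly with output pixel count, so a 4096px output needs on the order of 256x the memory of the 256px run.

```python
# Hypothetical 1024px -> 4096px upsampler config (unverified guesses).
generator = dict(
    style_network = dict(dim = 64, depth = 4),
    dim = 32,
    image_size = 4096,        # target resolution
    input_image_size = 1024,  # source resolution
    unconditional = True
)

discriminator = dict(
    dim_capacity = 16,
    dim_max = 512,
    image_size = 4096,
    num_skip_layers_excite = 4,
    multiscale_input_resolutions = (2048,),  # scaled with the output, as 128 was to 256
    unconditional = True
)
```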

Regards,

Sparfell

Turn on/off gradients computation between generator/discriminator

Hi! I was reviewing some parts of your implementation (great as usual!) and noticed that you don't disable gradient computation for models when it's not required. For example, when computing losses for the discriminator we don't need gradients in the generator model, and vice versa. I believe disabling them should save some memory?
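For illustration, the usual pattern looks like the sketch below (generic stand-in modules, not the repo's classes):

```python
import torch
from torch import nn

# Stand-in generator/discriminator for illustration only.
G = nn.Linear(8, 8)
D = nn.Linear(8, 1)

def set_requires_grad(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad_(flag)

# Discriminator step: freeze G, so autograd stores no gradients for it.
set_requires_grad(G, False)
set_requires_grad(D, True)
fake = G(torch.randn(4, 8))      # fake.requires_grad is False here
D(fake).mean().backward()        # only D's parameters receive gradients
```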

Training plans?

I've got a bunch of compute for the next couple of weeks and am thinking of training this on LAION.

Wondering if there is any other training going on right now. Would hate to duplicate efforts too much.

L2 attention is implemented wrong!

From paper
https://arxiv.org/pdf/2006.04710.pdf

First, each token needs to attend to itself to ensure the Lipschitz property!

[figure: equation from the paper]

Second, torch.cdist is not the correct way to do it. Follow the original paper.

[figure: equation from the paper]

For tied qk:

AB = torch.matmul(qk, qk.transpose(-1, -2))
AA = torch.sum(qk ** 2, -1, keepdim=True)
BB = AA.transpose(-1, -2)    # Since query and key are tied.
attn = -(AA - 2 * AB + BB)
attn = attn.mul(self.scale).softmax(-1)

For separate qk:

AB = torch.matmul(q, k.transpose(-1, -2))
AA = torch.sum(q ** 2, -1, keepdim=True)
BB = torch.sum(k ** 2, -1, keepdim=True).transpose(-1, -2)
attn = -(AA - 2 * AB + BB)
attn = attn.mul(self.scale).softmax(-1)

This is basically the negation of torch.cdist().square(), but more efficient, and it supports double backward for R1 regularization.
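A quick numerical check (mine, not from the issue) that the expanded form above matches the cdist-based squared distance:

```python
import torch

torch.manual_seed(0)
q = torch.randn(2, 5, 16)
k = torch.randn(2, 7, 16)

# Expanded form: -||q_i - k_j||^2 = -(||q_i||^2 - 2 q_i.k_j + ||k_j||^2)
AB = torch.matmul(q, k.transpose(-1, -2))
AA = torch.sum(q ** 2, -1, keepdim=True)
BB = torch.sum(k ** 2, -1, keepdim=True).transpose(-1, -2)
attn_logits = -(AA - 2 * AB + BB)

# Reference: negated squared pairwise Euclidean distance.
reference = -torch.cdist(q, k).square()
```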

Last, I believe the paper only used L2 self-attention in the discriminator. The generator should still use dot-product attention.

Gradient Penalty is very high in the start

Hi!

I was running a few experiments and noticed that the gradient penalty (GP) is extremely high in the first few hundred steps:
GP > 60000, then gradually going down to around GP = 20.

Is this normal behaviour? In my previous experience with StyleGAN, the GP was small from the beginning.
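For context, the penalty reported here is typically the R1 term; a toy sketch of how it is computed (hypothetical discriminator, not the repo's). An untrained discriminator can easily have large input gradients, which would inflate this term early in training:

```python
import torch
from torch import nn

# Toy discriminator for illustration only.
D = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))

real = torch.randn(4, 8, requires_grad=True)
logits = D(real)

# R1 penalty: E[ ||grad_x D(x)||^2 ] on real images.
grad, = torch.autograd.grad(logits.sum(), real, create_graph=True)
gp = grad.square().sum(dim=1).mean()
```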

Is this project ready to train?

Is this ready to start training?

I am very curious about this project: how can you implement things without having weights to check the results against? How can you make sure the implementation is correct?

I am still at the beginning of my studies, but I have access to great hardware for training. I don't know if it will be enough, but if it is, maybe I can help with the first weights.

How to use this model for SR ?

Hi,

Would you mind providing an example of how to use this model for SR? What would the text input be? I assume, since CLIP is used, that rather than text input we use the image and its embedding (since they are in a joint space)?

Hardware Requirements

I'd like to suggest adding the minimum hardware requirements for training to the README.

I am new to this scene and have recently started studying all of these technologies, so I would also like to know about these requirements, because I recently purchased a fairly simple machine to start my studies.

Is this project able to run on my modest machine?

Core i9 12900HK @ 5.0GHz
64GB DDR4 RAM
1x RTX 4090 24GB GDDR6X

I know some generative models will run fine on this hardware, but some text models like Llama require much more RAM/VRAM.

I'm planning to acquire more GPUs in the future, but I think this modest hardware will work well to start my study journey.

So, what can you share about which hardware will work well for training and inference?

Has Anyone Trained This Model Yet?

Has anyone trained a sample model of this? I realize full-scale training on LAION will take quite a bit of resources (SD v2 trained for 200k GPU hours), but I'm wondering: (1) are there any publicly available sample trained models yet, and (2) is there any estimate of the resources required for full-scale training? I'm guessing GigaGAN will require less training than Stable Diffusion did.

Pretrained models

Thanks for the great work!
May I know when will the pretrained models be available?
Thanks.

Some minor questions regarding the design

Thanks for updating the implementation frequently! The codebase looks much nicer than when I first looked at it. I took a closer look at the details and would like to ask some questions about specific design choices. Of course I understand that not every design decision comes with a reason, but I would like to know if you have references/intuition on these things:

  1. It seems that in the main generator, self-attention comes before cross-attention, while in the upsampler, cross-attention comes before self-attention.
  2. This line has a residual connection, but one is already applied inside the Transformer class. Same here. Is this something new in the transformer literature?
  3. Here in the discriminator, the residual is added after the attention block. Does it make more sense to add it right after the two conv blocks, since the attention block has its own residual connection?
  4. Very tiny issue: the definition in this is unused.

Questions regarding the discriminator

Starting a new issue for better visibility to other people. I have some quick questions regarding the implementation of the discriminator. The arguments to the forward function (images, rgb, real_images) seem a bit confusing.

I'm not sure my understanding is correct. In a training iteration of a typical GAN, the discriminator should be called twice: once for a batch of real images and once for a batch of generated images. While we could concat real and fake images to call the network once, I think that is not done here, as the output logit has only 1 scalar rather than 2.

Then I would like to know what the expected arguments for the real pass and the fake pass are. I guess for the fake pass, rgbs is the collection of different-resolution rgbs output by the generator, and images should be the final output of the generator (generated images at the highest resolution); for the real pass, images is the real image, and rgbs should be None. The real_images argument can always be None.

I may be very wrong, so really appreciate if you can correct and explain.

Two additional questions:

  1. This line looks tricky; isn't self.resize_image_to(rgb, rgb.shape[-1]) just returning rgb, so it basically concats two copies of rgb together?
  2. I just want to confirm that the current implementation does not support the multi-scale loss yet, as that requires modules that process (H, W, 3) images at multiple resolutions.
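For comparison, the generic two-pass structure described above can be sketched as follows (stand-in modules and a non-saturating loss for illustration; this is not the gigagan-pytorch forward signature):

```python
import torch
from torch import nn
import torch.nn.functional as F

# Stand-in generator/discriminator; only the call structure matters here.
G = nn.Linear(8, 8)
D = nn.Linear(8, 1)

real = torch.randn(4, 8)
fake = G(torch.randn(4, 8)).detach()  # detached for the discriminator step

# Two separate discriminator passes: one on real images, one on fakes.
d_loss = F.softplus(-D(real)).mean() + F.softplus(D(fake)).mean()
```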

NaN losses after hours of training (UPSAMPLER)

I keep getting NaN losses like this after a few hours of training:
| 1360/1000000 [7:25:23<5127:52:45, 18.49s/it]G: 0.75 | MSG: -0.17 | VG: 0.00 | D: nan | MSD: 1.97 | VD: 0.00 | GP: nan | SSL: nan | CL: 0.00 | MAL: 0.00
0%| | 1380/1000000 [7:31:46<5217:53:00, 18.81s/it]G: 0.76 | MSG: -0.17 | VG: 0.00 | D: nan | MSD: 1.97 | VD: 0.00 | GP: nan | SSL: nan | CL: 0.00 | MAL: 0.00
I'm training on about 200k images with the settings from the README:
gan = GigaGAN(
    train_upsampler = True,
    generator = dict(
        style_network = dict(
            dim = 64,
            depth = 4
        ),
        dim = 32,
        image_size = 256,
        input_image_size = 64,
        unconditional = True
    ),
    discriminator = dict(
        dim_capacity = 16,
        dim_max = 512,
        image_size = 256,
        num_skip_layers_excite = 4,
        multiscale_input_resolutions = (128,),
        unconditional = True
    ),
    amp = True
).cuda()

Here's an image before the NaN loss:
[sample-0]
Here's an image after the NaN loss:
[sample-1]

My current batch size is 20, by the way.
What should I do? Did you ever manage to train successfully with the provided settings?

Some doubts about the multi-scale input and output of the super-resolution part

Hello, thanks for your code. While trying it, I found that the resolutions of "multiscale_input_resolutions" and "rgbs" were mismatched. My understanding of the paper is that the multi-scale inputs are smaller than the original image, whereas in your code you keep sizes larger than the original image in order to preserve more features. I'm confused about this and hope for your reply.

Question about the timing and complexity of GigaGAN replication

How long would a full replication take?
And how workable will it be in terms of what is presented in the official article?
As I understand it, you are using existing technology and adapting it as needed?

I am overwhelmed by your enthusiasm and your desire to make available to everyone what is not publicly accessible.

It deserves a lot of respect; you are making a huge contribution to the evolution of new technologies by making them openly available, overcoming many questions and challenges!

Insert the attention mechanism in the codes to the stylegan2-ada but fail

Hello, first of all, thank you for your great work.
I inserted the self-attention and cross-attention modules into stylegan2-ada, placing them after the conv1 SynthesisLayer at the 4, 8, 16, and 32 resolutions, and I use CLIP to extract text features along with the contrastive loss. However, after a few iterations, the resulting pictures became completely uniform in color. Could you please give me some advice on how to improve this?

Multi GPU with gradient accumulation

Hi! While training on multiple GPUs with gradient accumulation steps > 1, there's no substantial speedup relative to a single GPU (there is a speedup if the value is equal to 1). I found the following threads on Hugging Face (here and here) that seem to provide a solution. I even ran a dummy test by just passing the proper argument to Accelerator, and the training was indeed much faster (in your class I set the gradient accumulation steps to 1 but for Accelerator to 8, and I didn't make other changes to account for this modification, so the results weren't particularly useful 😉). If you have time to check whether this is interesting for you, I'd be grateful.
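For anyone following along, the plain-PyTorch accumulation pattern looks like the toy sketch below; under accelerate, wrapping the step in the accelerator's accumulation context additionally skips the DDP gradient all-reduce on the micro-batches that don't call optimizer.step(), which is where the multi-GPU speedup reported in the linked threads comes from (toy model, not the repo's trainer):

```python
import torch
from torch import nn

# Toy model and loop, for illustrating gradient accumulation only.
model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
accum_steps = 8
steps_taken = 0

for i in range(16):
    x = torch.randn(2, 4)
    loss = model(x).pow(2).mean() / accum_steps  # scale loss by accum steps
    loss.backward()                              # gradients accumulate in .grad
    if (i + 1) % accum_steps == 0:               # step only every accum_steps
        opt.step()
        opt.zero_grad()
        steps_taken += 1
```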

Possible Discrepancies

Hi Phil,

I noted a few possible discrepancies, and I have made some suggestions as well.

You may want to consider;

  1. Turning off Demodulation for ToRGB
The demodulation is turned on in ToRGB: in the current implementation, ToRGB uses demod=True. This makes it a lot harder for the generator to get the correct feature magnitude at each level of the progressive growing pathway.

  2. Removing sample adaptive kernels in the ToRGB component
I don't believe SAKS is used by GigaGAN in the ToRGB section. The caption of Figure 4 states: "The style code modulates the main generator using our style-adaptive kernel selection, shown on the right". The main generator is the upsampling pathway consisting of style-based feature modulation, not the progressive growing section. The output of the ToRGB section is three channels, so adding more kernels probably won't improve representational capacity but will lead to oscillations.

Generator/Discriminator dim Argument

It seems that the current implementation of the Generator/Discriminator does not make any use of the primary dim argument. It looks like capacity is the main "go-to" for altering the overall model size along the z dimension.

Perhaps dim could be removed, or the term capacity could be replaced with dim.
