muslll / neosr

neosr is a framework for training real-world single-image super-resolution networks.

Home Page: https://github.com/muslll/neosr

License: Apache License 2.0

Python 100.00%

image-restoration machine-learning super-resolution

neosr's People

phhofm the-database terrainer pkh31337 kim2091 upscale-community sanbuphy nakedlitttlezombie consceieratus

neosr's Issues

Validation doesn't appear to be working.

As the title says, even with validation on in the config, it doesn't seem to do anything.

YML config for reference: https://pastebin.com/y6ZKPqta.

Wiki feedback

Hi! I read through the Configuration Walkthrough wiki page and have a few questions. Could you answer these questions and maybe add the answers to the wiki?

The compile option is experimental. This option enables pytorch's new torch.compile(), which can speed up training. As of this writing, pytorch 2.0.0 does not have support for compile using Python 3.11. Pytorch version 2.1 is expected to fix this, as well as better Windows support.

NeoSR now requires 2.1, so is the last part still relevant?

The manual_seed option enables deterministic training. If you're not doing precise comparisons, this option should be commented out, otherwise training performance will decrease.

What are "precise comparisons"? Or rather, when should this be used?

The gt_size is one of the most important options you have to change. It sets the size that each image will be cropped before being sent to the network.

How will it be cropped/which portion will be used? Will it use random offsets?
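For readers with the same question: a common approach in paired super-resolution pipelines is to pick one random offset on the low-resolution image and crop the matching, scaled-up region from the ground truth. The sketch below illustrates that idea only. Whether neosr crops exactly this way is an assumption, and `paired_crop_boxes` is a hypothetical helper, not part of neosr.

```python
import random


def paired_crop_boxes(lq_h, lq_w, gt_size, scale, rng=random):
    """Return (lq_box, gt_box), each as (top, left, height, width).

    Hypothetical helper: the offset is drawn in low-resolution (LQ) space so
    that the ground-truth (GT) crop is always aligned to the scale factor.
    """
    if gt_size % scale != 0:
        raise ValueError("gt_size must be divisible by scale")
    lq_size = gt_size // scale
    if lq_h < lq_size or lq_w < lq_size:
        raise ValueError(f"LQ image {lq_h}x{lq_w} is smaller than crop {lq_size}")
    # randint is inclusive on both ends, so the crop can touch the border.
    top = rng.randint(0, lq_h - lq_size)
    left = rng.randint(0, lq_w - lq_size)
    lq_box = (top, left, lq_size, lq_size)
    gt_box = (top * scale, left * scale, gt_size, gt_size)
    return lq_box, gt_box


lq_box, gt_box = paired_crop_boxes(lq_h=270, lq_w=360, gt_size=128, scale=4)
print("LQ crop:", lq_box)
print("GT crop:", gt_box)
```

With `gt_size: 128` at 4x, each training sample would be a 32x32 low-resolution patch paired with a 128x128 ground-truth patch, taken from a different random position each time the image is drawn.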

Real-CUGAN License

It appears to be listed here? https://github.com/bilibili/ailab/blob/main/Real-CUGAN/LICENSE

You have it marked as "Unknown" in your LICENSE file.

I'm also somewhat concerned about having code merged in that has no known license, like DCTLSA. Wouldn't it be preferable to wait until you know what the license is?

SPAN incompatible with official arch code

Hi. I just added SPAN to spandrel (library we use in chainner now) and was unpleasantly surprised that Kim's model didn't work. When you removed eval_conv, you made the models produced by neosr incompatible.

Here's the diff of what you changed (shortened):

 class Conv3XC(nn.Module):
     def __init__(
         self,
         c_in: int,
         c_out: int,
         gain1=1,
         gain2=0,
         s=1,
         bias=True,
         relu=False,
     ):
         super().__init__()
         self.weight_concat = None
         self.bias_concat = None
         self.update_params_flag = False
         self.stride = s
         self.has_relu = relu
         gain = gain1
 
         self.sk = nn.Conv2d(...)
         self.conv = nn.Sequential(...)

-        self.eval_conv = nn.Conv2d(...)
-        self.eval_conv.weight.requires_grad = False
-        self.eval_conv.bias.requires_grad = False  # type: ignore
-        self.update_params()
-
-    def update_params(self):
-        ...
-
     def forward(self, x):
-        if self.training:
-            pad = 1
-            x_pad = F.pad(x, (pad, pad, pad, pad), "constant", 0)
-            out = self.conv(x_pad) + self.sk(x)
-        else:
-            self.update_params()
-            out = self.eval_conv(x)
+        pad = 1
+        x_pad = F.pad(x, (pad, pad, pad, pad), "constant", 0)
+        out = self.conv(x_pad) + self.sk(x)

         if self.has_relu:
             out = F.leaky_relu(out, negative_slope=0.05)
         return out

The issue is self.eval_conv. Its weights and biases are saved in the .pth file. So when I use official arch code to load your model, those weights and biases are missing.

So I suggest re-adding self.eval_conv in __init__ and not using it. This will restore compatibility.

That all said, self.eval_conv should have been non-persistent in the first place. The SPAN authors messed this one up.

Cannot resume training ESRGAN

There is a chance I am doing something wrong, but I have tried everything to resume training of an ESRGAN model and it doesn't seem to work properly. I've tried starting from a pretrained model, and also tried without a pretrained model.

Whenever I try to resume (by either setting the resume state and commenting out the pretrain or by using auto-resume) it gives me a lot of warnings like this:

------------------------ neosr ------------------------
Pytorch Version: 2.1.1+cu118
2023-12-15 17:59:51,497 INFO: Dataset [paired] is built.
2023-12-15 17:59:51,497 INFO: Training statistics:
        Starting model: 2x_hdphoto
        Number of train images: 1000
        Dataset enlarge ratio: 5
        Batch size per gpu: 8
        World size (gpu number): 1
        Require iter number per epoch: 625
        Total epochs: 256; iters: 160000.
2023-12-15 17:59:51,497 INFO: Dataset [paired] is built.
2023-12-15 17:59:51,497 INFO: Number of val images/folders in val_1: 10
2023-12-15 17:59:51,716 INFO: Network [esrgan] is created.
2023-12-15 17:59:52,044 INFO: Network [unet] is created.
2023-12-15 17:59:52,263 INFO: Loading esrgan model from C:\neosr\experiments\2x_hdphoto\models\net_g_26000.pth, with param key: [None].
2023-12-15 17:59:52,310 WARNING: Current net - loaded net:
2023-12-15 17:59:52,310 WARNING:   body.0.rdb1.conv1.bias
2023-12-15 17:59:52,310 WARNING:   body.0.rdb1.conv1.weight
2023-12-15 17:59:52,310 WARNING:   body.0.rdb1.conv2.bias
2023-12-15 17:59:52,310 WARNING:   body.0.rdb1.conv2.weight
.....
2023-12-15 17:59:52,357 WARNING:   conv_up2.bias
2023-12-15 17:59:52,357 WARNING:   conv_up2.weight
2023-12-15 17:59:52,357 WARNING: Loaded net - current net:
2023-12-15 17:59:52,357 WARNING:   params
2023-12-15 17:59:52,419 INFO: Loading unet model from C:\neosr\experiments\2x_hdphoto\models\net_d_26000.pth, with param key: [params].
2023-12-15 17:59:52,419 INFO: Loss [HuberLoss] is created.
2023-12-15 17:59:53,857 INFO: Loss [PerceptualLoss] is created.
2023-12-15 17:59:53,857 INFO: Loss [GANLoss] is created.
2023-12-15 17:59:53,857 INFO: Loss [colorloss] is created.
2023-12-15 17:59:53,873 INFO: Model [default] is created.
2023-12-15 17:59:53,888 INFO: Resuming training from epoch: 41, iter: 26000.
2023-12-15 18:00:10,845 INFO: Using CUDA prefetch dataloader.
2023-12-15 18:00:10,845 INFO: AMP enabled.
2023-12-15 18:00:10,845 INFO: Start training from epoch: 41, iter: 26000

And it SEEMS like it has resumed, as it continues on from there in iterations. But looking at the visualization, I can see that it has started from scratch. All of the prior learning is lost and the generations look like garbage again.

I am pretty new to training upscale models, so there is a good possibility that I am doing something wrong. But I feel like I have followed the instructions in the "configuration walkthrough" carefully.

BTW, I am training on Windows with CUDA, Python 3.11, and an RTX 4080.
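The log above reports a param key of `[None]` for the generator, while the only key found in the loaded file but not in the network is `params`. That pattern suggests the weights sit one level down inside the checkpoint, so nothing is actually loaded. The sketch below shows how such a wrapped checkpoint could be unwrapped before loading. Plain dictionaries stand in for real tensors, and `unwrap_state_dict` and `key_diff` are hypothetical helpers, not neosr functions.

```python
def unwrap_state_dict(loaded, param_key=None, candidates=("params",)):
    """Return the flat name -> weight mapping from a possibly wrapped checkpoint."""
    if param_key is not None:
        if param_key not in loaded:
            raise KeyError(f"param key '{param_key}' not found in checkpoint")
        return loaded[param_key]
    # No key requested: fall back to a known wrapper key if one is present.
    for key in candidates:
        if key in loaded and isinstance(loaded[key], dict):
            return loaded[key]
    return loaded


def key_diff(current_keys, loaded_keys):
    """Mirror the two warning lists: (current - loaded, loaded - current)."""
    current, loaded = set(current_keys), set(loaded_keys)
    return sorted(current - loaded), sorted(loaded - current)


# Stand-in data: floats replace tensors, key names are taken from the log.
model_keys = ["conv_up2.weight", "conv_up2.bias"]
checkpoint = {"params": {"conv_up2.weight": 0.0, "conv_up2.bias": 0.0}}

wrapped_diff = key_diff(model_keys, checkpoint)
flat_diff = key_diff(model_keys, unwrap_state_dict(checkpoint))
print("without unwrapping:", wrapped_diff)
print("after unwrapping:  ", flat_diff)
```

If the two lists are empty after unwrapping, the checkpoint and the network agree on parameter names, and the garbage output seen after resuming is more likely explained by the wrapper key than by the training state itself.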

just wanted to express my respect

Thank you for the wonderful dataset and training code. I am a member of the open-source community interested in super-resolution training. However, I cannot buy you coffee in China, so I can only express my gratitude through words. Thank you once again!

Print file path that causes read error upon dataloader crash

This would be extremely helpful for troubleshooting a large dataset. I just had an issue where I was getting errors in my dataset about a corrupted file, and it turns out it was trying to read my meta_info file as an image all along. I was only able to diagnose this after hacking together a modified dataloader script.

I will share my code if desired, but it is not pretty.
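One way to get this behaviour is to wrap the read-and-decode step so that any failure is re-raised with the file path attached. The following is a minimal sketch of that idea, not the reporter's script. `read_image`, `ImageReadError`, and `fake_decode` are hypothetical names, and `fake_decode` only stands in for a real decoder such as the `imfrombytes` call seen in the tracebacks on this page.

```python
import tempfile
from pathlib import Path


class ImageReadError(RuntimeError):
    """Raised when a dataset file cannot be read or decoded."""


def read_image(path, decode):
    """Read `path` and decode it, attaching the path to any failure."""
    path = Path(path)
    try:
        data = path.read_bytes()
        if not data:
            raise ValueError("file is empty")
        return decode(data)
    except Exception as exc:
        # Chain the original exception so the full traceback is preserved.
        raise ImageReadError(f"Failed to read image '{path}': {exc}") from exc


def fake_decode(data):
    # Stand-in decoder: accepts only data that starts with the PNG signature.
    if not data.startswith(b"\x89PNG"):
        raise ValueError("not a PNG")
    return data


message = None
with tempfile.TemporaryDirectory() as tmp:
    bad = Path(tmp) / "meta_info.txt"
    bad.write_text("0001.png\n")
    try:
        read_image(bad, fake_decode)
    except ImageReadError as err:
        message = str(err)
print(message)
```

Because the exception is raised inside the dataset's item loader, the path would also appear in the "Original Traceback" section that PyTorch prints when a DataLoader worker crashes.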

Perceptual Loss and AMP - discussion

Currently, Perceptual Loss and AMP are causing issues. Everything points towards GradScaler zeroing values.
Read more in the PyTorch forums thread.
A temporary solution was committed, moving perceptual loss completely outside of autocast and doing backward() without GradScaler. This, however, is not an optimal solution.

Validation fails for HAT models

I was trying to train a 4x HAT model. I have some 1440x1080 images I use for validation (lr: 360x270) and validation fails with a shape error. Other architectures (at least esrgan, omnisr, and swinir) work with the same config.

2023-10-20 14:09:08,553 INFO: Saving models and training states.
  File "neosr/train.py", line 242, in <module>
    train_pipeline(root_path)
  File "neosr/train.py", line 211, in train_pipeline
    model.validation(val_loader, current_iter,
  File "neosr/neosr/models/default.py", line 462, in validation
    self.nondist_validation(
  File "neosr/neosr/models/otf.py", line 198, in nondist_validation
    super(otf, self).nondist_validation(
  File "neosr/neosr/models/default.py", line 377, in nondist_validation
    self.test()
  File "neosr/neosr/models/default.py", line 338, in test
    self.output = self.net_g(img)
                  ^^^^^^^^^^^^^^^
  File "neosr/venv/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "neosr/venv/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "neosr/neosr/archs/hat_arch.py", line 952, in forward
    x = self.conv_first(x)
                           
  File "neosr/neosr/archs/hat_arch.py", line 929, in forward_features
    attn_mask = self.calculate_mask(x_size).to(x.device)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "neosr/neosr/archs/hat_arch.py", line 909, in calculate_mask
    mask_windows = window_partition(img_mask, self.window_size)  # nw, window_size, window_size, 1
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "neosr/neosr/archs/hat_arch.py", line 91, in window_partition
    x = x.view(b, h // window_size, window_size, w // window_size, window_size, c)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: shape '[1, 17, 16, 22, 16, 1]' is invalid for input of size 97920
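The numbers in the error are consistent with a window size of 16 and a 272x360 mask: 272 x 360 = 97920 elements, while the requested view needs 17 x 16 x 22 x 16 = 95744. In other words, the height is a multiple of 16 but the width is not. This reading of the traceback is an inference, not something confirmed by the maintainers. The sketch below works out how much padding a 270x360 input would need so that both sides divide evenly by the window size. `pad_to_window` is a hypothetical helper.

```python
def pad_to_window(height, width, window_size):
    """Return (pad_h, pad_w) needed to reach the next multiple of window_size."""
    pad_h = (window_size - height % window_size) % window_size
    pad_w = (window_size - width % window_size) % window_size
    return pad_h, pad_w


h, w, ws = 270, 360, 16
pad_h, pad_w = pad_to_window(h, w, ws)
print("padding:", pad_h, pad_w)             # 2 rows, 8 columns
print("padded size:", h + pad_h, w + pad_w)  # 272 x 368

# Arithmetic behind the error message.
view_elements = 17 * 16 * 22 * 16
mask_elements = 272 * 360
print(view_elements, mask_elements)          # 95744 vs 97920
```

After padding both dimensions, the output would typically be cropped back to the original size multiplied by the scale factor.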

AMP enabled info spamming the console

Not sure why this is happening, but wanted to report as a potential bug. (It's driving me nuts). I tried figuring out where the AMP printing code is located but no such luck. =/

cv2.error: Caught error in DataLoader worker process 2.

Shortly after starting the training following error occurs:

2023-09-06 18:24:25,932 INFO: Start training from epoch: 0, iter: 0
Traceback (most recent call last):
  File "G:\_AI\UPSCALE\neosr\train.py", line 241, in <module>
    train_pipeline(root_path)
  File "G:\_AI\UPSCALE\neosr\train.py", line 215, in train_pipeline
    train_data = prefetcher.next()
                 ^^^^^^^^^^^^^^^^^
  File "G:\_AI\UPSCALE\neosr\neosr\data\prefetch_dataloader.py", line 97, in next
    self.preload()
  File "G:\_AI\UPSCALE\neosr\neosr\data\prefetch_dataloader.py", line 83, in preload
    self.batch = next(self.loader)  # self.batch is a dict
                 ^^^^^^^^^^^^^^^^^
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 633, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 1345, in _next_data
    return self._process_data(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 1371, in _process_data
    data.reraise()
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\_utils.py", line 644, in reraise
    raise exception
cv2.error: Caught error in DataLoader worker process 2.
Original Traceback (most recent call last):
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\_utils\worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\_utils\fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\_utils\fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "G:\_AI\UPSCALE\neosr\neosr\data\paired_dataset.py", line 87, in __getitem__
    img_gt = imfrombytes(img_bytes, float32=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "G:\_AI\UPSCALE\neosr\neosr\utils\img_util.py", line 133, in imfrombytes
    img = cv2.imdecode(img_np, imread_flags[flag])
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
cv2.error: OpenCV(4.8.0) D:\a\opencv-python\opencv-python\opencv\modules\imgcodecs\src\loadsave.cpp:802: error: (-215:Assertion failed) !buf.empty() in function 'cv::imdecode_'

The strange thing is, seeing the last line, I don't even have that path or drive on my PC??

I followed the installation instructions, so all dependencies should be installed.

Windows 10
Python 3.11.5
Torch 2
CUDA 11.8

Config File attached.
train_realesrgan.txt
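The `D:\a\opencv-python\...` path in the last line is most likely the source location on the machine where the OpenCV package was built, which would explain why it does not exist on the reporter's PC. The `!buf.empty()` assertion means `cv2.imdecode` received an empty buffer, and one likely cause is a zero-byte file in the dataset. As a rough diagnostic, the sketch below scans a folder for empty image files. `find_empty_files` and the extension list are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def find_empty_files(root, extensions=(".png", ".jpg", ".jpeg", ".webp")):
    """Return image files under `root` whose size on disk is zero bytes."""
    root = Path(root)
    return sorted(
        p
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in extensions and p.stat().st_size == 0
    )


# Demo on a temporary folder with one valid-looking and one empty file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "good.png").write_bytes(b"\x89PNG")
    (Path(tmp) / "broken.png").write_bytes(b"")
    empty = [p.name for p in find_empty_files(tmp)]
print(empty)
```

Pointing the function at the dataset folders named in the config would list candidates to re-export or delete before training again.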

OmniSR variable passing bug.

I was training a 2x OmniSR model, and encountered the following error:

Traceback (most recent call last):
  File "D:\Users\Sirosky\Jottacloud\Media\Upscaling\Trainers\neosr\train.py", line 238, in <module>
    train_pipeline(root_path)
  File "D:\Users\Sirosky\Jottacloud\Media\Upscaling\Trainers\neosr\train.py", line 139, in train_pipeline
    model = build_model(opt)
  File "D:\Users\Sirosky\Jottacloud\Media\Upscaling\Trainers\neosr\neosr\models\__init__.py", line 28, in build_model
    model = MODEL_REGISTRY.get(opt['model_type'])(opt)
  File "D:\Users\Sirosky\Jottacloud\Media\Upscaling\Trainers\neosr\neosr\models\default.py", line 35, in __init__
    self.net_g = build_network(opt['network_g'])
  File "D:\Users\Sirosky\Jottacloud\Media\Upscaling\Trainers\neosr\neosr\archs\__init__.py", line 21, in build_network
    net = ARCH_REGISTRY.get(network_type)(**opt)
  File "D:\Users\Sirosky\Jottacloud\Media\Upscaling\Trainers\neosr\neosr\archs\omnisr_arch.py", line 883, in omnisr
    return omnisr_net(
  File "D:\Users\Sirosky\Jottacloud\Media\Upscaling\Trainers\neosr\neosr\archs\omnisr_arch.py", line 834, in __init__
    up_scale    = kwargs["upsampling"]
KeyError: 'upsampling'

These were my network_g settings:

network_g:
  type: omnisr
  upsampling: 2
  window_size: 16

It seems like the fix is to revise line 830 of omnisr_arch.py from:

def __init__(self,num_in_ch=3,num_out_ch=3,num_feat=64, window_size=8, upsampling=4, **kwargs):

to:

def __init__(self,num_in_ch=3,num_out_ch=3,num_feat=64, **kwargs):

Might be easiest to require window_size and upsampling to be specified in the config.
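The `KeyError` follows from how Python binds keyword arguments: a name that matches an explicit parameter is bound to that parameter and never reaches `**kwargs`. The two toy classes below reproduce this outside of neosr. `NetBefore` and `NetAfter` are hypothetical stand-ins with simplified signatures, not the real `omnisr_net`.

```python
class NetBefore:
    # upsampling and window_size are named parameters, so Python binds the
    # config values here and kwargs ends up without those keys.
    def __init__(self, num_feat=64, window_size=8, upsampling=4, **kwargs):
        self.up_scale = kwargs["upsampling"]  # raises KeyError


class NetAfter:
    # With the named parameters removed, both values arrive through kwargs.
    def __init__(self, num_feat=64, **kwargs):
        self.up_scale = kwargs["upsampling"]
        self.window_size = kwargs["window_size"]


opt = {"upsampling": 2, "window_size": 16}

before_error = None
try:
    NetBefore(**opt)
except KeyError as err:
    before_error = err
print("before:", repr(before_error))

net = NetAfter(**opt)
print("after:", net.up_scale, net.window_size)
```

An alternative with the same effect would be to keep the named parameters and read `upsampling` and `window_size` from them directly instead of from `kwargs`, which also preserves the defaults.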

Exiting early doesn't save latest training state

If you exit early it saves the latest model (net_g_latest and net_d_latest) but not the latest training state, so when you resume training it restarts from the latest regular checkpoint.
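One possible shape for the requested behaviour is to save the model and the training state together whenever the loop is interrupted. The sketch below illustrates that with plain callbacks. `run_training` and its parameters are hypothetical and do not reflect how neosr's `train_pipeline` is actually structured.

```python
def run_training(total_iters, train_step, save_model, save_state, save_every=1000):
    """Run the loop; on Ctrl+C save both weights and state for the last finished iter."""
    current_iter = 0
    try:
        while current_iter < total_iters:
            train_step(current_iter + 1)
            current_iter += 1  # only count iterations that completed
            if current_iter % save_every == 0:
                save_model(current_iter)
                save_state(current_iter)
    except KeyboardInterrupt:
        # Save the pair together so a resume starts from the same iteration.
        save_model(current_iter)
        save_state(current_iter)
    return current_iter


saved = {"model": [], "state": []}


def step(i):
    # Simulate the user pressing Ctrl+C during iteration 5.
    if i == 5:
        raise KeyboardInterrupt


last = run_training(10, step, saved["model"].append, saved["state"].append, save_every=3)
print(last, saved)
```

In the demo, the regular checkpoint lands at iteration 3 and the interrupt adds a matching model and state pair at iteration 4, so resuming would not fall back to iteration 3.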

Feature request: allow toggling attention implementation via options

Some networks have been modified from their official versions to use scaled_dot_product_attention. I understand this has some benefits, but for the time being it also prevents ONNX exporting. There's been some discussion on Discord about adding the ability to toggle it (I think initially to turn it on for the archs which have it off at the moment). The primary concern about adding the option, as I understand it, is breaking compatibility with e.g. ChaiNNer. It seems like with an agreed-upon convention it should be possible to support both. I'm just opening this issue to track the progress of the topic more easily than finding Discord links every time.