muslll / neosr
neosr is a framework for training real-world single-image super-resolution networks.
Home Page: https://github.com/muslll/neosr
License: Apache License 2.0
As the title says, even with validation on in the config, it doesn't seem to do anything.
YML config for reference: https://pastebin.com/y6ZKPqta.
Hi! I read through the Configuration Walkthrough wiki page and have a few questions. Could you answer these questions and maybe add the answers to the wiki?
The compile option is experimental. This option enables pytorch's new torch.compile(), which can speed up training. As of this writing, pytorch 2.0.0 does not have support for compile using Python 3.11. Pytorch version 2.1 is expected to fix this, as well as better Windows support.
neosr now requires PyTorch 2.1, so is the last part still relevant?
The manual_seed option enables deterministic training. If you're not doing precise comparisons, this option should be commented out, otherwise training performance will decrease.
What are "precise comparisons"? Or rather, when should this be used?
The gt_size is one of the most important options you have to change. It sets the size that each image will be cropped before being sent to the network.
How will it be cropped/which portion will be used? Will it use random offsets?
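To make the gt_size question concrete, here is a minimal sketch of how a paired random crop is commonly done in BasicSR-style trainers. It is an assumption that neosr behaves this way; the function name paired_crop_boxes and its parameters are hypothetical and are not taken from the neosr code. The idea is that one random offset is drawn on the low-resolution image and the ground-truth offset is that value multiplied by the scale, so both patches show the same region.

```python
import random


def paired_crop_boxes(lq_h, lq_w, gt_size, scale, rng=random):
    """Pick one random LQ patch and the matching GT patch.

    Hypothetical helper, not neosr's API.
    Returns ((lq_top, lq_left, lq_size), (gt_top, gt_left, gt_size)).
    """
    lq_size = gt_size // scale
    if lq_h < lq_size or lq_w < lq_size:
        raise ValueError("LQ image is smaller than the requested patch")
    # Random offsets are drawn on the LQ grid so the GT patch stays aligned.
    top = rng.randint(0, lq_h - lq_size)
    left = rng.randint(0, lq_w - lq_size)
    return (top, left, lq_size), (top * scale, left * scale, gt_size)


rng = random.Random(0)  # seeded only so the demo is repeatable
lq_box, gt_box = paired_crop_boxes(270, 360, gt_size=128, scale=4, rng=rng)
print(lq_box, gt_box)
```

Under this assumption, a gt_size of 128 at 4x means the network sees 32x32 low-resolution patches.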
It appears to be listed here? https://github.com/bilibili/ailab/blob/main/Real-CUGAN/LICENSE
You have it marked as "Unknown" in your LICENSE file.
I'm also somewhat concerned about having code merged in that has no known license, like DCTLSA. Wouldn't it be preferable to wait until you know what the license is?
Hi. I just added SPAN to spandrel (the library we now use in chaiNNer) and was unpleasantly surprised that Kim's model didn't work. When you removed eval_conv, you made the models produced by neosr incompatible.
Here's the diff of what you changed (shortened):
  class Conv3XC(nn.Module):
      def __init__(
          self,
          c_in: int,
          c_out: int,
          gain1=1,
          gain2=0,
          s=1,
          bias=True,
          relu=False,
      ):
          super().__init__()
          self.weight_concat = None
          self.bias_concat = None
          self.update_params_flag = False
          self.stride = s
          self.has_relu = relu
          gain = gain1
          self.sk = nn.Conv2d(...)
          self.conv = nn.Sequential(...)
-         self.eval_conv = nn.Conv2d(...)
-         self.eval_conv.weight.requires_grad = False
-         self.eval_conv.bias.requires_grad = False  # type: ignore
-         self.update_params()
-
-     def update_params(self):
-         ...
-
      def forward(self, x):
-         if self.training:
-             pad = 1
-             x_pad = F.pad(x, (pad, pad, pad, pad), "constant", 0)
-             out = self.conv(x_pad) + self.sk(x)
-         else:
-             self.update_params()
-             out = self.eval_conv(x)
+         pad = 1
+         x_pad = F.pad(x, (pad, pad, pad, pad), "constant", 0)
+         out = self.conv(x_pad) + self.sk(x)
          if self.has_relu:
              out = F.leaky_relu(out, negative_slope=0.05)
          return out
The issue is self.eval_conv. Its weights and biases are saved in the .pth file. So when I use official arch code to load your model, those weights and biases are missing.
So I suggest re-adding self.eval_conv in __init__ and not using it. This will restore compatibility.
That all said, self.eval_conv should have been non-persistent in the first place. The SPAN authors messed this one up.
There is a chance I am doing something wrong, but I have tried everything to resume training of an ESRGAN model and it doesn't seem to work properly. I've tried starting from a pretrained model, and also without one.
Whenever I try to resume (by either setting the resume state and commenting out the pretrain or by using auto-resume) it gives me a lot of warnings like this:
------------------------ neosr ------------------------
Pytorch Version: 2.1.1+cu118
2023-12-15 17:59:51,497 INFO: Dataset [paired] is built.
2023-12-15 17:59:51,497 INFO: Training statistics:
Starting model: 2x_hdphoto
Number of train images: 1000
Dataset enlarge ratio: 5
Batch size per gpu: 8
World size (gpu number): 1
Require iter number per epoch: 625
Total epochs: 256; iters: 160000.
2023-12-15 17:59:51,497 INFO: Dataset [paired] is built.
2023-12-15 17:59:51,497 INFO: Number of val images/folders in val_1: 10
2023-12-15 17:59:51,716 INFO: Network [esrgan] is created.
2023-12-15 17:59:52,044 INFO: Network [unet] is created.
2023-12-15 17:59:52,263 INFO: Loading esrgan model from C:\neosr\experiments\2x_hdphoto\models\net_g_26000.pth, with param key: [None].
2023-12-15 17:59:52,310 WARNING: Current net - loaded net:
2023-12-15 17:59:52,310 WARNING: body.0.rdb1.conv1.bias
2023-12-15 17:59:52,310 WARNING: body.0.rdb1.conv1.weight
2023-12-15 17:59:52,310 WARNING: body.0.rdb1.conv2.bias
2023-12-15 17:59:52,310 WARNING: body.0.rdb1.conv2.weight
.....
2023-12-15 17:59:52,357 WARNING: conv_up2.bias
2023-12-15 17:59:52,357 WARNING: conv_up2.weight
2023-12-15 17:59:52,357 WARNING: Loaded net - current net:
2023-12-15 17:59:52,357 WARNING: params
2023-12-15 17:59:52,419 INFO: Loading unet model from C:\neosr\experiments\2x_hdphoto\models\net_d_26000.pth, with param key: [params].
2023-12-15 17:59:52,419 INFO: Loss [HuberLoss] is created.
2023-12-15 17:59:53,857 INFO: Loss [PerceptualLoss] is created.
2023-12-15 17:59:53,857 INFO: Loss [GANLoss] is created.
2023-12-15 17:59:53,857 INFO: Loss [colorloss] is created.
2023-12-15 17:59:53,873 INFO: Model [default] is created.
2023-12-15 17:59:53,888 INFO: Resuming training from epoch: 41, iter: 26000.
2023-12-15 18:00:10,845 INFO: Using CUDA prefetch dataloader.
2023-12-15 18:00:10,845 INFO: AMP enabled.
2023-12-15 18:00:10,845 INFO: Start training from epoch: 41, iter: 26000
And it SEEMS like it has resumed, as it continues on from there in iterations. But looking at the visualization, I can see that it has started from scratch. All of the prior learning is lost and the generations look like garbage again.
I am pretty new to training upscale models, so there is a good possibility that I am doing something wrong. But I feel like I have followed the instructions in the "configuration walkthrough" carefully.
BTW, I am training on Windows, CUDA, Python 3.11, with an RTX 4080.
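One reading of the warnings above, offered as a guess rather than a confirmed diagnosis: the generator checkpoint was loaded with param key [None], every layer name shows up under "Current net - loaded net", and the only name under "Loaded net - current net" is params. That pattern is what you would see if the weights sit one level down, inside a "params" entry, while the loader compared against the top-level dictionary. The sketch below reproduces the pattern with plain dictionaries; unwrap_state_dict and key_diff are hypothetical helpers, not neosr functions.

```python
def unwrap_state_dict(checkpoint, param_key=None, fallbacks=("params",)):
    """Return the dict that actually holds layer weights (hypothetical helper)."""
    if param_key is not None:
        return checkpoint[param_key]
    for key in fallbacks:
        # A nested dict under a known wrapper key is treated as the real weights.
        if key in checkpoint and isinstance(checkpoint[key], dict):
            return checkpoint[key]
    return checkpoint


def key_diff(current_keys, loaded_keys):
    """Mimic the two warning lists: (current - loaded, loaded - current)."""
    current, loaded = set(current_keys), set(loaded_keys)
    return sorted(current - loaded), sorted(loaded - current)


# Stand-in for the object returned by torch.load(path, map_location="cpu").
checkpoint = {"params": {"conv_up2.weight": 0, "conv_up2.bias": 0}}
net_keys = ["conv_up2.weight", "conv_up2.bias"]

before = key_diff(net_keys, checkpoint)
after = key_diff(net_keys, unwrap_state_dict(checkpoint))
print(before)
print(after)
```

If a real checkpoint shows the same shape when inspected, the weights were saved correctly and only the lookup key differs between saving and resuming.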
Thank you for the wonderful dataset and training code. I am a member of the open-source community interested in super-resolution training. However, I cannot buy you coffee in China, so I can only express my gratitude through words. Thank you once again!
This would be extremely helpful for troubleshooting a large dataset. I just had an issue where I was getting errors about a corrupted file in my dataset, and it turned out the loader had been trying to read my meta_info file as an image all along. I was only able to diagnose this after hacking together a modified dataloader script.
I will share my code if desired, but it is not pretty.
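As an illustration of the kind of check being requested, here is a small standalone sketch that walks a dataset folder and reports files that are empty or do not start with a PNG, JPEG or WebP signature. It is not the script mentioned above and not part of neosr; the names sniff and find_suspect_files are made up for this example, and it only inspects the first bytes, so a truncated image with a valid header would still pass.

```python
import tempfile
from pathlib import Path

PNG = b"\x89PNG\r\n\x1a\n"
JPEG = b"\xff\xd8\xff"


def sniff(path):
    """Return a reason string if the file does not look like an image, else None."""
    with open(path, "rb") as f:
        head = f.read(12)
    if not head:
        return "empty file"
    if head.startswith(PNG) or head.startswith(JPEG):
        return None
    if head[:4] == b"RIFF" and head[8:12] == b"WEBP":
        return None
    return "unrecognized header"


def find_suspect_files(root):
    """List (path, reason) for every file under root that fails the header check."""
    suspects = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            reason = sniff(path)
            if reason:
                suspects.append((path, reason))
    return suspects


# Demo on a throwaway folder: one valid header, one stray text file, one empty file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "0001.png").write_bytes(PNG + b"rest of file")
    (root / "meta_info.txt").write_text("0001.png (480,480,3)\n")
    (root / "0002.png").write_bytes(b"")
    report = [(p.name, reason) for p, reason in find_suspect_files(root)]
print(report)
```

Running something like this before training would point at a stray meta_info file by name instead of surfacing as a decode error inside a dataloader worker.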
Currently, Perceptual Loss and AMP are causing issues. Everything points towards GradScaler zeroing values.
Read more in the PyTorch forums thread.
A temporary solution was committed, moving perceptual loss completely outside of autocast and calling backward() without GradScaler. However, this is not an optimal solution.
I was trying to train a 4x HAT model. I have some 1440x1080 images I use for validation (lr: 360x270) and validation fails with a shape error. Other architectures (at least esrgan, omnisr, and swinir) work with the same config.
2023-10-20 14:09:08,553 INFO: Saving models and training states.
File "neosr/train.py", line 242, in <module>
train_pipeline(root_path)
File "neosr/train.py", line 211, in train_pipeline
model.validation(val_loader, current_iter,
File "neosr/neosr/models/default.py", line 462, in validation
self.nondist_validation(
File "neosr/neosr/models/otf.py", line 198, in nondist_validation
super(otf, self).nondist_validation(
File "neosr/neosr/models/default.py", line 377, in nondist_validation
self.test()
File "neosr/neosr/models/default.py", line 338, in test
self.output = self.net_g(img)
^^^^^^^^^^^^^^^
File "neosr/venv/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "neosr/venv/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "neosr/neosr/archs/hat_arch.py", line 952, in forward
x = self.conv_first(x)
File "neosr/neosr/archs/hat_arch.py", line 929, in forward_features
attn_mask = self.calculate_mask(x_size).to(x.device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "neosr/neosr/archs/hat_arch.py", line 909, in calculate_mask
mask_windows = window_partition(img_mask, self.window_size) # nw, window_size, window_size, 1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "neosr/neosr/archs/hat_arch.py", line 91, in window_partition
x = x.view(b, h // window_size, window_size, w // window_size, window_size, c)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: shape '[1, 17, 16, 22, 16, 1]' is invalid for input of size 97920
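The numbers in the error can be checked by hand. With a window size of 16, the requested view needs 17 x 16 x 22 x 16 = 95744 elements, but the mask has 97920, which equals 272 x 360. That is consistent with the 270-pixel side being padded up to 272 while the 360-pixel side was left unpadded (360 is not a multiple of 16). This is an inference from the arithmetic, not from reading hat_arch.py, and the helper names below are hypothetical.

```python
def pad_to_multiple(size: int, multiple: int) -> int:
    """Round size up to the next multiple."""
    return -(-size // multiple) * multiple


def fits_whole_windows(h: int, w: int, window_size: int) -> bool:
    """True if an h x w mask can be viewed as a grid of full windows."""
    rows, cols = h // window_size, w // window_size
    return rows * window_size * cols * window_size == h * w


window_size = 16
lr_h, lr_w = 270, 360  # LR validation image from the report

padded_h = pad_to_multiple(lr_h, window_size)
padded_w = pad_to_multiple(lr_w, window_size)

print(padded_h, padded_w)
print(fits_whole_windows(padded_h, lr_w, window_size))      # only one side padded
print(fits_whole_windows(padded_h, padded_w, window_size))  # both sides padded
```

Padding both sides to a multiple of the window size before the forward pass, then cropping the output back to the original size times the scale, is the usual way window-attention networks handle arbitrary input sizes.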
Shortly after starting training, the following error occurs:
2023-09-06 18:24:25,932 INFO: Start training from epoch: 0, iter: 0
Traceback (most recent call last):
File "G:\_AI\UPSCALE\neosr\train.py", line 241, in <module>
train_pipeline(root_path)
File "G:\_AI\UPSCALE\neosr\train.py", line 215, in train_pipeline
train_data = prefetcher.next()
^^^^^^^^^^^^^^^^^
File "G:\_AI\UPSCALE\neosr\neosr\data\prefetch_dataloader.py", line 97, in next
self.preload()
File "G:\_AI\UPSCALE\neosr\neosr\data\prefetch_dataloader.py", line 83, in preload
self.batch = next(self.loader) # self.batch is a dict
^^^^^^^^^^^^^^^^^
File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 633, in __next__
data = self._next_data()
^^^^^^^^^^^^^^^^^
File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 1345, in _next_data
return self._process_data(data)
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 1371, in _process_data
data.reraise()
File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\_utils.py", line 644, in reraise
raise exception
cv2.error: Caught error in DataLoader worker process 2.
Original Traceback (most recent call last):
File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\_utils\worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\_utils\fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\_utils\fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
~~~~~~~~~~~~^^^^^
File "G:\_AI\UPSCALE\neosr\neosr\data\paired_dataset.py", line 87, in __getitem__
img_gt = imfrombytes(img_bytes, float32=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\_AI\UPSCALE\neosr\neosr\utils\img_util.py", line 133, in imfrombytes
img = cv2.imdecode(img_np, imread_flags[flag])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
cv2.error: OpenCV(4.8.0) D:\a\opencv-python\opencv-python\opencv\modules\imgcodecs\src\loadsave.cpp:802: error: (-215:Assertion failed) !buf.empty() in function 'cv::imdecode_'
The strange thing is, seeing the last line, I don't even have that path or drive on my PC??
I followed the installation instructions, so all dependencies should be installed.
Windows 10
Python 3.11.5
Torch 2
CUDA 11.8
Config File attached.
train_realesrgan.txt
I was training a 2x OmniSR model, and encountered the following error:
Traceback (most recent call last):
File "D:\Users\Sirosky\Jottacloud\Media\Upscaling\Trainers\neosr\train.py", line 238, in <module>
train_pipeline(root_path)
File "D:\Users\Sirosky\Jottacloud\Media\Upscaling\Trainers\neosr\train.py", line 139, in train_pipeline
model = build_model(opt)
File "D:\Users\Sirosky\Jottacloud\Media\Upscaling\Trainers\neosr\neosr\models\__init__.py", line 28, in build_model
model = MODEL_REGISTRY.get(opt['model_type'])(opt)
File "D:\Users\Sirosky\Jottacloud\Media\Upscaling\Trainers\neosr\neosr\models\default.py", line 35, in __init__
self.net_g = build_network(opt['network_g'])
File "D:\Users\Sirosky\Jottacloud\Media\Upscaling\Trainers\neosr\neosr\archs\__init__.py", line 21, in build_network
net = ARCH_REGISTRY.get(network_type)(**opt)
File "D:\Users\Sirosky\Jottacloud\Media\Upscaling\Trainers\neosr\neosr\archs\omnisr_arch.py", line 883, in omnisr
return omnisr_net(
File "D:\Users\Sirosky\Jottacloud\Media\Upscaling\Trainers\neosr\neosr\archs\omnisr_arch.py", line 834, in __init__
up_scale = kwargs["upsampling"]
KeyError: 'upsampling'
These were my network_g settings:
network_g:
  type: omnisr
  upsampling: 2
  window_size: 16
It seems like the fix is to revise line 830 of omnisr_arch.py from:
def __init__(self,num_in_ch=3,num_out_ch=3,num_feat=64, window_size=8, upsampling=4, **kwargs):
to:
def __init__(self,num_in_ch=3,num_out_ch=3,num_feat=64, **kwargs):
Might be easiest to require window_size and upsampling to be specified in the config.
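The KeyError follows from how Python binds keyword arguments: any key that matches a named parameter is consumed by that parameter and never reaches **kwargs. The stripped-down sketch below shows both signatures side by side; the function names are placeholders and the bodies only mimic the kwargs["upsampling"] lookup from the traceback.

```python
def init_with_named_params(num_feat=64, window_size=8, upsampling=4, **kwargs):
    # "upsampling" was bound to the named parameter, so kwargs does not contain it.
    return kwargs["upsampling"]


def init_with_kwargs_only(num_feat=64, **kwargs):
    # Nothing else claims "upsampling", so it lands in kwargs.
    return kwargs["upsampling"]


opt = {"upsampling": 2, "window_size": 16}

try:
    init_with_named_params(**opt)
    named_result = "ok"
except KeyError as err:
    named_result = f"KeyError: {err}"

kwargs_result = init_with_kwargs_only(**opt)
print(named_result)
print(kwargs_result)
```

An alternative that keeps the defaults would be to leave the named parameters in the signature and read upsampling and window_size directly instead of looking them up in kwargs.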
If you exit early it saves the latest model (net_g_latest and net_d_latest) but not the latest training state, so when you resume training it restarts from the latest regular checkpoint.
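One way the behaviour described here could be avoided is to write the model weights and the training state in the same step when an interrupt arrives. The sketch below shows that control flow with placeholder callbacks; run_training, train_step and save_checkpoint are hypothetical names, and nothing here reflects how neosr's train.py is actually organised.

```python
def run_training(total_iters, save_every, train_step, save_checkpoint):
    """Toy loop: save on a schedule, and once more if interrupted."""
    current_iter = 0
    try:
        while current_iter < total_iters:
            train_step(current_iter)
            current_iter += 1
            if current_iter % save_every == 0:
                save_checkpoint(current_iter)
    except KeyboardInterrupt:
        # Save weights AND training state for the same iteration,
        # so a later resume starts from what was last written.
        save_checkpoint(current_iter)
    return current_iter


saved = []


def fake_step(i):
    # Simulate the user pressing Ctrl+C during iteration index 7.
    if i == 7:
        raise KeyboardInterrupt


stopped_at = run_training(20, 5, fake_step, saved.append)
print(saved, stopped_at)
```

In the demo, the scheduled save happens at iteration 5 and the interrupt save at iteration 7, so resuming would continue from 7 rather than falling back to 5.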
Some networks have been modified from their official versions to use scaled_dot_product_attention. I understand this has some benefits, but for the time being it also prevents ONNX exporting. There's been some discussion on Discord about adding the ability to toggle it (I think initially to turn it on for the archs which have it off at the moment), and the primary concern about adding the option, as I understand it, is breaking compatibility with e.g. chaiNNer. It seems like with an agreed-upon convention it should be possible to support both. I'm just opening this issue to track the progress of the topic more easily than finding Discord links every time.