Comments (23)

sparkdoc commented on May 22, 2024

When I first ran this with about 550 128x128 grayscale images using a Quadro P4000 with 8 GB of memory, it immediately crashed due to insufficient memory. I adjusted the constant.MAX_BATCH_SIZE parameter from the default of 128 down to 32, and then it worked for about an hour until crashing again. The error message was:
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58

I was watching the GPU memory usage before it crashed, and it fluctuated in cycles, as expected for a "grid search" sort of activity. Unfortunately, it looks like the peak memory usage corresponding to the more memory-intensive models progressively increases until it overwhelms the GPU memory.

Maybe it would be good, upon initialization of the program, to quantify the available memory and then cap the model search to models that fit within that limit. If the program determines that it cannot identify an optimal model within that constraint and may require more memory, it could output such a message along with hints on how to proceed (e.g., smaller batches, smaller images, a GPU with more memory). It might also help to offer a grayscale option in the load_image_dataset method that reduces a color image from three color channels to one grayscale channel.
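A minimal sketch of what that initialization check could look like, assuming a reasonably recent PyTorch (get_device_properties and memory_reserved are standard torch.cuda calls; the helper names and the headroom heuristic are made up for illustration):

import torch

def free_gpu_memory_bytes(device=0):
    # total device memory minus what PyTorch has already reserved
    total = torch.cuda.get_device_properties(device).total_memory
    reserved = torch.cuda.memory_reserved(device)
    return total - reserved

def cap_batch_size(sample_nbytes, device=0, headroom=0.5):
    # crude heuristic: spend at most half the free memory on input batches,
    # leaving the rest for activations, gradients, and optimizer state
    budget = free_gpu_memory_bytes(device) * headroom
    return max(1, int(budget // max(sample_nbytes, 1)))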

Also, what is the LIMIT_MEMORY parameter?

aa18514 commented on May 22, 2024

AutoKeras is poorly maintained at the moment; I had a similar issue.

In "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/torch/nn/modules/conv.py" line 301 explicitly cast values in the tuple with the name 'self.padding' as int (before calling the function F.conv2d with appropiate parameters), one way you could do this by is adding the line:

self.padding = (int(self.padding[0]), int(self.padding[1]))

before line 301:

return F.conv2d(input, self.weight, self.bias, self.stride,
                           self.padding, self.dilation, self.groups)
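Put together, the patched section of conv.py would look roughly like this (a sketch of the edit described above, not the exact upstream source):

# inside Conv2d.forward, just before the conv call around line 301
self.padding = (int(self.padding[0]), int(self.padding[1]))  # force plain ints
return F.conv2d(input, self.weight, self.bias, self.stride,
                self.padding, self.dilation, self.groups)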

jageshmaharjan commented on May 22, 2024

Sorry for my late reply.
This is x_train.shape:
(1348, 480, 640, 4)
and x_test.shape:
(1348, 480, 640, 4)

haifeng-jin commented on May 22, 2024

This issue is fixed in the new release.
Thank you all for the contribution.

Jangol commented on May 22, 2024

Hi:
You can try adjusting the constant.MAX_BATCH_SIZE parameter; the default value is 32.
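For instance (assuming the Constant class is importable from autokeras.constant, as another comment in this thread suggests):

from autokeras.constant import Constant

Constant.MAX_BATCH_SIZE = 16  # shrink further if 32 still overflows your GPU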

Zvezdin commented on May 22, 2024

I also keep getting these errors on a GTX 1060. I managed to fix it the following way: I updated the train function so that, if training a network crashes, it returns 0.0 as the accuracy and the maximum possible loss as the loss. This way the Bayesian optimization algorithm understands that this path is unreliable (for the current hardware) and tries to find an alternative. If @jhfjhfj1 wants, I can make a pull request with the change.
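A minimal sketch of that idea (train_fn, the sentinel loss, and the return shape are illustrative, not AutoKeras API; empty_cache is a standard torch.cuda call):

import torch

def train_with_oom_fallback(train_fn, *args, worst_loss=1e9, **kwargs):
    # run one training attempt; on CUDA OOM, report a maximally bad result
    # so the searcher steers away from this region instead of crashing
    try:
        return train_fn(*args, **kwargs)  # expected to return (accuracy, loss, graph)
    except RuntimeError as e:
        if 'out of memory' not in str(e):
            raise
        torch.cuda.empty_cache()  # release cached blocks left by the failed attempt
        return 0.0, worst_loss, None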

haifeng-jin commented on May 22, 2024

@Jangol @Zvezdin
Thank you for your help.

I think the memory is usually big enough to handle most datasets.
I am not sure how to clean up the GPU memory used by a model in PyTorch.
I think we should clean the GPU memory in the main process,
then see whether it still crashes or not.

If it still crashes, we will try to actually feed the data in batches.

Zvezdin commented on May 22, 2024

@jhfjhfj1 Thanks for the reply.

I start getting out-of-memory errors after many successfully trained models. I don't think it's a memory leak or a dataset error. I'm using a custom dataset, and one input sample (without the batch dimension) is an 80x90x24 matrix. In my case, I think AutoKeras decides to create models that are too large for my 6 GB GPU after many successful iterations. In such a case, would you consider my proposed solution optimal (giving negative or zero feedback when a model fails), or how else would you tackle such an issue?
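For scale (a back-of-the-envelope check, assuming float32 inputs): one 80x90x24 sample is 80 * 90 * 24 * 4 bytes ≈ 0.7 MB, so even a batch of 32 is only about 22 MB. The input data alone cannot exhaust 6 GB; the pressure has to come from the parameters and activations of the progressively larger generated models.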

haifeng-jin commented on May 22, 2024

@Zvezdin Returning 0 is not a good solution. It will seriously impact the performance of the GaussianProcessRegressor.
It would be better to return some special value and not update the GaussianProcessRegressor with it.
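A sketch of that alternative (the sentinel, record_result, and the optimizer's update method are illustrative names, not AutoKeras API):

OOM_SENTINEL = object()  # marker for a trial that died with out-of-memory

def record_result(optimizer, params, accuracy):
    # feed only real observations to the Gaussian process; an OOM failure
    # is remembered as infeasible but never used to fit the surrogate
    if accuracy is OOM_SENTINEL:
        return
    optimizer.update(params, accuracy)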

aa18514 commented on May 22, 2024

Yes, this is a bug. You can get around it by setting time_limit to a reasonably small value, such as 10 seconds, to ensure the run_searcher_once method runs only once.

If you check the following:

user@ubuntu:~$ vim C:\Users\user\AppData\Local\Programs\Python36\lib\site-packages\autokeras\classifier.py

you will find the following piece of code at line 223:


if time_limit is None:
    time_limit = 24 * 60 * 60

start_time = time.time()
while time.time() - start_time <= time_limit:
    run_searcher_once(train_data, test_data, self.path)
    if len(self.load_searcher().history) >= Constant.MAX_MODEL_NUM:
        break

You can see the flaw in this piece of code: if the time_limit parameter is not specified, it defaults to 24 * 60 * 60 seconds (24 hours), and the default value of Constant.MAX_MODEL_NUM is 1000, so you keep looping until len(self.load_searcher().history) >= Constant.MAX_MODEL_NUM. Moreover, after each training run completes, self.load_searcher().history stores the newly trained model, which means its length only increases by one per iteration. You could get around this by replacing Constant.MAX_MODEL_NUM with a sane value like 1 (or choosing a low time limit, such as 10 seconds). I hope this helps.
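In code, the workaround amounts to something like this before calling fit (assuming Constant is importable from autokeras.constant, as another comment in this thread suggests):

from autokeras.constant import Constant

Constant.MAX_MODEL_NUM = 1  # stop the search loop after a single trained model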

I banged my head over this for a few hours; there are a number of other problematic things in the code, I think.

jageshmaharjan commented on May 22, 2024

I came across the same issue:

Traceback (most recent call last):
  File "trainalutokeras_raw.py", line 25, in <module>
    clf.fit(x_train, y_train, time_limit=5 *60 *60)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/classifier.py", line 225, in fit
    run_searcher_once(train_data, test_data, self.path)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/classifier.py", line 40, in run_searcher_once
    searcher.search(train_data, test_data)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/search.py", line 190, in search
    accuracy, loss, graph = train_results.get()[0]
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/multiprocessing/pool.py", line 608, in get
    raise self._value
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
/home/maybe/anaconda3/envs/asr/lib/python3.6/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))

Also, I tried to reduce the time limit with
clf.fit(x_train, y_train, time_limit=10)
but that doesn't solve the problem. I bet it's a bug.

aa18514 commented on May 22, 2024

Yes, this is another bug.
Try replacing line 178 in /home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/search.py:

"train_results = pool.map_async(train, [(graph, train_data, test_data, self.trainer_args, <br>
                                                os.path.join(self.path, str(model_id) + '.png'), self.verbose)])" <br>

with:

train_results = train((graph, train_data, test_data, self.trainer_args, os.path.join(self.path, str(model_id) + '.png'), self.verbose))

and replace line 190

accuracy, loss, graph = train_results.get()[0]

with:

accuracy, loss, graph = train_results

If you are on Windows, torch with CUDA and multiprocessing do not seem to work well together.

Also, please try to wrap your code in trainalutokeras_raw.py in:

if __name__ == "__main__":

jageshmaharjan commented on May 22, 2024

Yup, it's wrapped in an if __name__ == "__main__" block.

Yet there's another, similar error. The stack trace is shown below:

Using TensorFlow backend.
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "trainalutokeras_raw.py", line 26, in <module>
    clf.fit(x_train, y_train, time_limit=10)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/classifier.py", line 225, in fit
    run_searcher_once(train_data, test_data, self.path)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/classifier.py", line 40, in run_searcher_once
    searcher.search(train_data, test_data)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/search.py", line 178, in search
    train_results = train((graph, train_data, test_data, self.trainer_args, os.path.join(self.path, str(model_id) + '.png'), self.verbose))
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/search.py", line 326, in train
    verbose).train_model(**trainer_args)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/utils.py", line 122, in train_model
    self._train(train_loader, epoch)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/utils.py", line 143, in _train
    outputs = self.model(inputs)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/graph.py", line 603, in forward
    temp_tensor = torch_layer(edge_input_tensor)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
/home/maybe/anaconda3/envs/asr/lib/python3.6/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))

Thanks for the quick response.

jageshmaharjan commented on May 22, 2024

Yet the same error, despite the changes.

aa18514 commented on May 22, 2024

What dataset are you using? Have you tried MNIST?

jageshmaharjan commented on May 22, 2024

MNIST works perfectly. The data I am trying is my own dataset.

Here's my code:

from autokeras.classifier import ImageClassifier
from autokeras.classifier import load_image_dataset
import argparse

if __name__ == '__main__':
  parser = argparse.ArgumentParser(description="parameters for the input program")
  parser.add_argument('--train_csv', type=str, help="training csv data directory")
  parser.add_argument('--train_images', type=str, help="training images directory")
  parser.add_argument('--test_csv', type=str, help="test csv directory")
  parser.add_argument('--test_images', type=str, help="test images directory")
  #parser.add_argument('--dev', type=str, help="dev directory")

  args = parser.parse_args()

  x_train, y_train = load_image_dataset(csv_file_path=args.train_csv, images_path=args.train_images)
  print(x_train.shape)
  print(y_train.shape)

  x_test, y_test = load_image_dataset(csv_file_path=args.test_csv, images_path=args.test_images)
  print(x_test.shape)
  print(y_test.shape)

  clf = ImageClassifier(verbose=True)
  clf.fit(x_train, y_train, time_limit=10)
  clf.final_fit(x_train, y_train, x_test, y_test, retrain=True)
  y = clf.evaluate(x_test, y_test)
  print(y)

and my script:
python trainalutokeras_raw.py --train_csv ./train.csv --train_images ./images/train --test_csv ./test.csv --test_images ./images/test

aa18514 commented on May 22, 2024

Could you please report x_train.shape and x_test.shape?

haifeng-jin commented on May 22, 2024

@tl-yang
Add some code at the second-to-last line of the train() function in search.py
to remove the model from GPU memory.

Be careful with the imported weights in layers.py; also, the loss returned by the train_model() function in utils.py should be a float instead of a tensor.
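A minimal sketch of that cleanup at the end of train() (assuming the model variable is named model; .item() and empty_cache are standard PyTorch calls):

import torch

# at the end of train() in search.py, before returning:
loss_value = loss.item()    # convert the loss tensor to a plain Python float
del model                   # drop the last reference to the trained model
torch.cuda.empty_cache()    # hand cached GPU blocks back to the driver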

haifeng-jin commented on May 22, 2024

@aa18514 Thank you so much for your help!
We are trying hard to fix all the issues.
Do you think replacing multiprocessing with torch.multiprocessing would solve the problem of not working well on Windows?
I mean, setting aside the problem of training too-large models.

Thanks.

haifeng-jin commented on May 22, 2024

@tl-yang
Please see the file net_transformer.py. It is where we can add an if clause to limit the depth and width of the model.
For now, we will put two more constants, MAX_MODEL_WIDTH and MAX_MODEL_DEPTH, in the Constant class instead of passing them through parameters.

We will change it later if necessary.
Let me know if you have any questions.

Remember to branch out from develop branch.

Thanks.

haifeng-jin commented on May 22, 2024

@tl-yang You can try torch.multiprocessing.
I will be setting the search space.

Thanks.

rafaelmarconiramos commented on May 22, 2024

Hello, I have the same problem.
Was this problem fixed in the repository?
I am trying with a small dataset, but I get the same error: out of memory.

What is the recommendation? Update autokeras to a specific branch?

Thanks

MartinThoma commented on May 22, 2024

How is one supposed to use autokeras.constant.Constant? Is it enough to set

import autokeras
autokeras.constant.Constant.MAX_BATCH_SIZE = 64
autokeras.constant.Constant.MAX_LAYERS = 5

before fitting?
