Comments (23)
When I first ran this with about 550 128x128 grayscale images on a Quadro P4000 with 8 GB of memory, it immediately crashed due to insufficient memory. I adjusted the constant.MAX_BATCH_SIZE parameter from the default of 128 down to 32, and then it worked for about an hour before crashing again. The error message was:
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
I was watching the GPU memory usage before it crashed, and it fluctuated in cycles, as expected for a "grid search" sort of activity. Unfortunately, the peak memory usage of the more memory-intensive models progressively increases until it overwhelms the GPU memory.
Maybe it would be good, upon initialization of the program, to quantify the available memory and then cap the model search to models that fit within that limit. If the program determines that it cannot identify an optimal model within that constraint and may require more memory, it could output such a message along with hints on how to proceed (e.g., smaller batches, smaller images, a larger GPU, etc.). It might also help to offer a grayscale option in the load_image_dataset method that reduces a color image from three color channels to one grayscale channel.
Also, what is the LIMIT_MEMORY parameter?
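The grayscale idea above could be sketched as a small post-processing step after loading (the helper name and the BT.601 luminance weights are my own, not part of the autokeras API):

```python
import numpy as np

def to_grayscale(images):
    """Collapse an (N, H, W, 3) RGB batch to (N, H, W, 1).

    Hypothetical helper, not an autokeras function; uses the standard
    ITU-R BT.601 luminance weights."""
    weights = np.array([0.299, 0.587, 0.114])
    gray = images[..., :3] @ weights      # weighted sum over the channel axis
    return gray[..., np.newaxis]          # keep a singleton channel dimension
```

Cutting the channels from three to one shrinks the input tensors (and the first convolutional layer's activations) by roughly a factor of three.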
from autokeras.
AutoKeras is poorly maintained at the minute, I had a similar issue
In "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/torch/nn/modules/conv.py", at line 301, explicitly cast the values in the self.padding tuple to int before F.conv2d is called with the appropriate parameters. One way to do this is to add the line:
self.padding = (int(self.padding[0]), int(self.padding[1]))
before line 301:
return F.conv2d(input, self.weight, self.bias, self.stride,
                self.padding, self.dilation, self.groups)
from autokeras.
Sorry for my late reply.
This is x_train.shape:
(1348, 480, 640, 4)
and x_test.shape:
(1348, 480, 640, 4)
from autokeras.
This issue is fixed in the new release.
Thank you all for the contribution.
from autokeras.
Hi,
You can try adjusting the constant.MAX_BATCH_SIZE parameter; the default value is 32.
from autokeras.
I also keep getting these errors on a GTX 1060. I managed to fix it the following way: I updated the train function so that, if training a network crashes, it returns 0.0 as the accuracy and the maximum possible loss as the loss. This way the Bayesian optimization algorithm understands that this path is unreliable (for the current hardware) and tries to find an alternative. If @jhfjhfj1 wants, I can make a pull request with the change.
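A minimal sketch of that approach, assuming (as in autokeras' search code) that the train function returns an (accuracy, loss, graph) tuple; the wrapper and its return convention are illustrative, not the actual patch:

```python
def safe_train(train_fn, *args, **kwargs):
    """Run one training attempt; on CUDA OOM, report the worst possible
    score instead of crashing the whole search."""
    try:
        return train_fn(*args, **kwargs)
    except RuntimeError as e:
        if "out of memory" in str(e):
            # Worst-case feedback steers the Bayesian optimizer away
            # from this region of the search space.
            return 0.0, float("inf"), None
        raise  # unrelated errors should still surface
```

Wrapping the call site this way keeps the search loop alive while still penalizing models that do not fit on the card.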
from autokeras.
@Jangol @Zvezdin
Thank you for your help.
I think the memory is usually big enough to handle most of the datasets.
I am not sure how to clean up the GPU memory usage of a model in pytorch.
I think we should clean up the GPU memory in the main process.
Then see if it still crashes or not.
If it still crashes, we will try actually feeding the data in batches.
from autokeras.
@jhfjhfj1 Thanks for the reply.
I start getting out-of-memory errors after many successfully trained models. I don't think that it's a memory leak or a dataset error. I'm using a custom dataset, and one input sample (without the batch dimension) is an 80x90x24 matrix. In my case, I think AutoKeras decides to create models that are too large for my 6 GB GPU after many successful iterations. In such a case, would you consider my proposed solution (giving negative or zero feedback on failure) optimal, or how else would you tackle such an issue?
from autokeras.
@Zvezdin Returning 0 is not a good solution. It will seriously impact the performance of the GaussianProcessRegressor.
If it could return some special value and not update the GaussianProcessRegressor with such value, it would be better.
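A sketch of that alternative, using a sentinel value to keep failed runs out of the surrogate model's training data (the function and the observation-list shape are hypothetical, not autokeras internals):

```python
OOM_SENTINEL = None  # marks "this model did not fit in GPU memory"

def record_observation(observations, model_descriptor, accuracy):
    """Append a (model, accuracy) pair destined for the
    GaussianProcessRegressor, but skip OOM failures entirely so they
    never distort its fit."""
    if accuracy is OOM_SENTINEL:
        return observations  # leave the surrogate's data untouched
    observations.append((model_descriptor, accuracy))
    return observations
```

The searcher would still need a separate record of failed architectures so it does not propose them again.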
from autokeras.
Yes, this is a bug. You could get around it by setting time_limit to a reasonably small value, such as 10 seconds, to ensure the run_searcher_once method runs only once.
If you check the following:
user@ubuntu:~$ vim C:\Users\user\AppData\Local\Programs\Python36\lib\site-packages\autokeras\classifier.py
you will find the following piece of code on line 223:
if time_limit is None:
    time_limit = 24 * 60 * 60
start_time = time.time()
while time.time() - start_time <= time_limit:
    run_searcher_once(train_data, test_data, self.path)
    if len(self.load_searcher().history) >= Constant.MAX_MODEL_NUM:
        break
You can see the flaw in this piece of code. If the time_limit parameter is not specified, it defaults to 24 hours (24 * 60 * 60 seconds), and the default value of Constant.MAX_MODEL_NUM is 1000, so you keep looping until len(self.load_searcher().history) >= Constant.MAX_MODEL_NUM. After each training pass completes, self.load_searcher().history stores the newly trained model, which means its length only increases by one per iteration. You could get around this by replacing Constant.MAX_MODEL_NUM with a sane value like 1 (or choosing a low time limit, like 10 seconds). I hope this helps.
I banged my head against this for a few hours; there are a number of other problematic things in the code, I think.
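The loop's behavior is easy to reproduce in isolation. In this standalone mock, run_once and history stand in for run_searcher_once and the searcher's history; it shows why a small MAX_MODEL_NUM (or a short time_limit) is the only early exit:

```python
import time

def fit_loop(run_once, history, time_limit=None, max_model_num=1000):
    """Mirrors the classifier.fit loop quoted above: stop only when the
    time budget runs out or enough models have been trained."""
    if time_limit is None:
        time_limit = 24 * 60 * 60  # the 24-hour default
    start_time = time.time()
    while time.time() - start_time <= time_limit:
        run_once()  # trains one model and appends it to history
        if len(history) >= max_model_num:
            break
```

With max_model_num=1 the loop exits after a single search iteration, which is the workaround described above.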
from autokeras.
I came across the same issue:
Traceback (most recent call last):
  File "trainalutokeras_raw.py", line 25, in <module>
    clf.fit(x_train, y_train, time_limit=5 * 60 * 60)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/classifier.py", line 225, in fit
    run_searcher_once(train_data, test_data, self.path)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/classifier.py", line 40, in run_searcher_once
    searcher.search(train_data, test_data)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/search.py", line 190, in search
    accuracy, loss, graph = train_results.get()[0]
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/multiprocessing/pool.py", line 608, in get
    raise self._value
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
/home/maybe/anaconda3/envs/asr/lib/python3.6/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
Also, I tried reducing the time limit with
clf.fit(x_train, y_train, time_limit=10)
but that doesn't solve the problem. I bet it's a bug.
from autokeras.
Yes, this is another bug.
Try to replace line 178 in /home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/search.py:
train_results = pool.map_async(train, [(graph, train_data, test_data, self.trainer_args, os.path.join(self.path, str(model_id) + '.png'), self.verbose)])
with:
train_results = train((graph, train_data, test_data, self.trainer_args, os.path.join(self.path, str(model_id) + '.png'), self.verbose))
and replace line 190:
accuracy, loss, graph = train_results.get()[0]
with:
accuracy, loss, graph = train_results
If you are on Windows: torch with CUDA and multiprocessing do not seem to work well together.
Also, please try to wrap your code in trainalutokeras_raw.py in:
if __name__ == "__main__":
from autokeras.
Yup, it's wrapped in an if __name__ == "__main__" block.
Yet there's another, similar error. The stack trace is shown below:
Using TensorFlow backend.
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "trainalutokeras_raw.py", line 26, in <module>
    clf.fit(x_train, y_train, time_limit=10)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/classifier.py", line 225, in fit
    run_searcher_once(train_data, test_data, self.path)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/classifier.py", line 40, in run_searcher_once
    searcher.search(train_data, test_data)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/search.py", line 178, in search
    train_results = train((graph, train_data, test_data, self.trainer_args, os.path.join(self.path, str(model_id) + '.png'), self.verbose))
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/search.py", line 326, in train
    verbose).train_model(**trainer_args)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/utils.py", line 122, in train_model
    self._train(train_loader, epoch)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/utils.py", line 143, in _train
    outputs = self.model(inputs)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/graph.py", line 603, in forward
    temp_tensor = torch_layer(edge_input_tensor)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
/home/maybe/anaconda3/envs/asr/lib/python3.6/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
Thanks for the quick response.
from autokeras.
Yet the same error, despite the changes.
from autokeras.
What dataset are you using? Have you tried MNIST?
from autokeras.
MNIST works perfectly. The data I am trying is my own dataset.
Here's my code:
from autokeras.classifier import ImageClassifier
from autokeras.classifier import load_image_dataset
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="parameters for the input program")
    parser.add_argument('--train_csv', type=str, help="training csv data directory")
    parser.add_argument('--train_images', type=str, help="training images directory")
    parser.add_argument('--test_csv', type=str, help="test csv directory")
    parser.add_argument('--test_images', type=str, help="test images directory")
    # parser.add_argument('--dev', type=str, help="dev directory")
    args = parser.parse_args()
    x_train, y_train = load_image_dataset(csv_file_path=args.train_csv, images_path=args.train_images)
    print(x_train.shape)
    print(y_train.shape)
    x_test, y_test = load_image_dataset(csv_file_path=args.test_csv, images_path=args.test_images)
    print(x_test.shape)
    print(y_test.shape)
    clf = ImageClassifier(verbose=True)
    clf.fit(x_train, y_train, time_limit=10)
    clf.final_fit(x_train, y_train, x_test, y_test, retrain=True)
    y = clf.evaluate(x_test, y_test)
    print(y)
and my command:
python trainalutokeras_raw.py --train_csv ./train.csv --train_images ./images/train --test_csv ./test.csv --test_images ./images/test
from autokeras.
could you please report x_train.shape and x_test.shape?
from autokeras.
@tl-yang
Add some code at the second-to-last line of the train() function in search.py
to remove the model from GPU memory.
Be careful about the weights import in layers.py, and note that the loss returned by the train_model() function in utils.py should be a float instead of a tensor.
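A sketch of that cleanup step (the function name is mine, not the actual patch; in the real train() you would also call torch.cuda.empty_cache() after dropping the model reference):

```python
def finalize_train_result(model, loss, accuracy):
    """Detach the result from the GPU: convert a tensor loss to a plain
    float and drop the model reference so its memory can be reclaimed."""
    if hasattr(loss, "item"):   # torch tensors expose .item()
        loss = loss.item()
    del model                   # release the last reference held here
    return accuracy, float(loss)
```

Returning a float also prevents the searcher's history from keeping a tensor (and thus its GPU storage) alive between iterations.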
from autokeras.
@aa18514 Thank you so much for your help!
We are trying hard to fix all the issues.
Do you think replacing multiprocessing with torch.multiprocessing would solve the problem of not working well on Windows?
I mean, setting aside the problem of training models that are too large.
Thanks.
from autokeras.
@tl-yang
Please see the file net_transformer.py. That is where we can add an if clause to limit the depth and width of the model.
For now, we will put two more constants, MAX_MODEL_WIDTH and MAX_MODEL_DEPTH, in the Constant class instead of passing them through parameters.
We will change this later if necessary.
Let me know if you have any questions.
Remember to branch out from develop branch.
Thanks.
from autokeras.
@tl-yang you can try torch.multiprocessing.
I will set the search space.
Thanks.
from autokeras.
Hello, I have the same problem.
Was this problem fixed in the repository?
I am using a small dataset to try it, but I get the same out-of-memory error.
What is the recommendation? Update autokeras to a specific branch?
Thanks
from autokeras.
How is one supposed to use autokeras.constant.Constant? Is it enough to do

import autokeras
autokeras.constant.Constant.MAX_BATCH_SIZE = 64
autokeras.constant.Constant.MAX_LAYERS = 5

before fitting?
from autokeras.