Dr. Wang, thank you so much for your wonderful work. When I run last step : python

Resource exhausted: OOM when allocating tensor about dkn HOT 4 OPEN

hwwang55 commented on June 6, 2024

Resource exhausted: OOM when allocating tensor

Comments (4)

zhhhzhang commented on June 6, 2024

I tried to use:

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9
config.gpu_options.allow_growth = True
config.log_device_placement = True

Although GPU memory usage is lower with this config, it still crashes with an OOM error when running eval.
Thu Jul 18 10:41:25 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 419.17       Driver Version: 419.17       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108... WDDM  | 00000000:05:00.0  On |                  N/A |
|  0%   56C    P2    64W / 275W |   8936MiB / 11264MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108... WDDM  | 00000000:09:00.0  On |                  N/A |
|  0%   49C    P8    17W / 275W |    602MiB / 11264MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+

So I tried to use one GPU for training and another GPU for eval, using the code below:

with tf.device('/gpu:0'):
    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.local_variables_initializer())
        for step in range(args.n_epochs):
            # training
            start_list = list(range(0, train_data.size, args.batch_size))
            np.random.shuffle(start_list)
            for start in start_list:
                end = start + args.batch_size
                model.train(sess, get_feed_dict(model, train_data, start, end))

            config2 = tf.ConfigProto(device_count={'GPU': 1}, log_device_placement=True)
            config2.gpu_options.allow_growth = True
            with tf.Session(config=config2) as sess2:
                sess2.run(tf.global_variables_initializer())
                sess2.run(tf.local_variables_initializer())
                # evaluation
                train_auc = model.eval(sess2, get_feed_dict(model, train_data, 0, int(train_data.size)))
                test_auc = model.eval(sess2, get_feed_dict(model, test_data, 0, test_data.size))
                print('epoch %d    train_auc: %.4f    test_auc: %.4f' % (step, train_auc, test_auc))

But it does not work: GPU 0 is still used for eval, showing "W T:\src\github\tensorflow\tensorflow\core\framework\op_kernel.cc:1318] OP_REQUIRES failed at conv_ops.cc:673 : Resource exhausted: OOM when allocating tensor with shape[442410,128,9,1] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc"

How did you finally solve this problem? Thanks.
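A note on why GPU 0 is probably still chosen: in TF 1.x, `device_count={'GPU': 1}` caps how many GPUs a session may use (one, which is GPU 0); it does not select the GPU with index 1. Wrapping the session in `tf.device('/gpu:0')` also does not move ops that were already created when the model was built. In addition, `sess2` runs the initializers again, so as far as I understand TF 1.x sessions it would evaluate freshly initialized weights rather than the ones trained in `sess`. If the goal is only to run a whole process on the second card, one common approach is to restrict the visible devices before TensorFlow is imported. The sketch below assumes that setup; `restrict_to_gpu` is a hypothetical helper name, not part of DKN or TensorFlow.

```python
import os


def restrict_to_gpu(index, environ=None):
    """Make only one physical GPU visible to CUDA in this process.

    Must be called before TensorFlow (or any other CUDA library) is imported,
    because the variables are read when CUDA initializes.
    `environ` defaults to os.environ; passing a dict is only for testing.
    """
    env = os.environ if environ is None else environ
    # Order devices by PCI bus id so the index matches what nvidia-smi prints.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Only this GPU is exposed; inside the process it then appears as GPU 0.
    env["CUDA_VISIBLE_DEVICES"] = str(index)
    return env


if __name__ == "__main__":
    restrict_to_gpu(1)
    # import tensorflow as tf  # import only after the environment is set
    print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Both cards in the nvidia-smi output above have the same 11264MiB, so moving eval to the other GPU may not remove the OOM by itself if the evaluation still feeds the whole dataset at once.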

hwwang55 commented on June 6, 2024

Hi! This is kind of weird because the default batch size is not that large. Reducing the batch size might help.

skeletonli commented on June 6, 2024

Thank you for your reply.
I tried setting batch_size to 64 and even 32, but I still get the error.
I found that the problem appears in the train() function in train.py:

        # evaluation
        train_auc = model.eval(sess, get_feed_dict(model, train_data, 0, train_data.size))

It loads all of train_data into the feed_dict, regardless of batch_size.
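One possible workaround, sketched under assumptions rather than taken from the DKN code, is to run evaluation slice by slice and compute a single AUC over the collected scores. Here `predict_fn(start, end)` is a hypothetical callable that returns the labels and scores for one slice (for example, a thin wrapper around `sess.run` with `get_feed_dict(model, data, start, end)`; whether the model exposes suitable tensors is an assumption). `collect_in_batches` and `rank_auc` are illustrative names. The AUC is computed once over all samples, because averaging per-batch AUCs would generally give a different number.

```python
def collect_in_batches(predict_fn, n_samples, batch_size):
    """Call predict_fn on consecutive slices and gather all labels and scores."""
    labels, scores = [], []
    for start in range(0, n_samples, batch_size):
        end = min(start + batch_size, n_samples)  # last slice may be shorter
        batch_labels, batch_scores = predict_fn(start, end)
        labels.extend(batch_labels)
        scores.extend(batch_scores)
    return labels, scores


def rank_auc(labels, scores):
    """AUC from the rank-sum (Mann-Whitney) formula, averaging ranks over ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of samples whose score ties with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


if __name__ == "__main__":
    # Toy data standing in for model outputs.
    toy_labels = [0, 0, 1, 1]
    toy_scores = [0.1, 0.4, 0.35, 0.8]
    labels, scores = collect_in_batches(
        lambda s, e: (toy_labels[s:e], toy_scores[s:e]), len(toy_labels), 3
    )
    print(rank_auc(labels, scores))  # prints 0.75
```

With this structure the peak memory during evaluation depends on the slice size rather than on the size of the dataset.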

In addition, I used nvidia-smi to check GPU memory usage while running this code:

def train(args, train_data, test_data):
    model = DKN(args)
    with tf.Session() as sess:
        ...

My GPUs use almost all of their memory.

How can I solve this problem, please?

train_data size: 14747
test_data size: 408
word_embs size: 401650
entity_embs size: 91000
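These sizes appear to line up with the error message quoted earlier in this thread: the failing tensor has shape [442410,128,9,1], and 442410 = 14747 × 30, which would be consistent with the whole training set being fed at once with a per-sample factor of 30 (what that factor represents is an assumption here; it may be the click-history length). A quick back-of-the-envelope calculation of what one such float32 tensor costs, with `tensor_bytes` as a hypothetical helper name:

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Memory needed for one dense tensor; 4 bytes per element for float32."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * bytes_per_element


if __name__ == "__main__":
    shape = (442410, 128, 9, 1)  # shape taken from the OOM message
    size = tensor_bytes(shape)
    print(size)                      # prints 2038625280
    print(round(size / 2 ** 30, 2))  # prints 1.9 (GiB)
    print(14747 * 30)                # prints 442410
```

So a single intermediate tensor of the evaluation pass needs roughly 1.9 GiB, and a convolution usually needs more than one buffer of that order, which may explain why lowering batch_size for training does not help.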
