
Comments (4)

zhhhzhang commented on June 6, 2024

I tried using:
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9
config.gpu_options.allow_growth=True
config.log_device_placement=True

Although GPU memory usage went down, evaluation still crashes with an OOM error.
Thu Jul 18 10:41:25 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 419.17       Driver Version: 419.17       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108... WDDM  | 00000000:05:00.0  On |                  N/A |
|  0%   56C    P2    64W / 275W |   8936MiB / 11264MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108... WDDM  | 00000000:09:00.0  On |                  N/A |
|  0%   49C    P8    17W / 275W |    602MiB / 11264MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+

So I tried to train on one GPU and evaluate on the other, using the code below:

with tf.device('/gpu:0'):
    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.local_variables_initializer())

        for step in range(args.n_epochs):
            # training
            start_list = list(range(0, train_data.size, args.batch_size))
            np.random.shuffle(start_list)
            for start in start_list:
                end = start + args.batch_size
                model.train(sess, get_feed_dict(model, train_data, start, end))

            config2 = tf.ConfigProto(device_count={'GPU': 1}, log_device_placement=True)
            config2.gpu_options.allow_growth = True
            with tf.Session(config=config2) as sess2:
                sess2.run(tf.global_variables_initializer())
                sess2.run(tf.local_variables_initializer())
                # evaluation
                train_auc = model.eval(sess2, get_feed_dict(model, train_data, 0, int(train_data.size)))
                test_auc = model.eval(sess2, get_feed_dict(model, test_data, 0, test_data.size))
                print('epoch %d    train_auc: %.4f    test_auc: %.4f' % (step, train_auc, test_auc))

But it does not work: GPU 0 is still used for evaluation, and I get "W T:\src\github\tensorflow\tensorflow\core\framework\op_kernel.cc:1318] OP_REQUIRES failed at conv_ops.cc:673 : Resource exhausted: OOM when allocating tensor with shape[442410,128,9,1] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc"

How did you finally solve this problem? Thanks.
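
A note on why evaluation may still land on GPU 0 with the code above: as far as the TensorFlow 1.x `ConfigProto` documentation describes it, `device_count` caps how many devices of a given type a session may use, so `{'GPU': 1}` means "at most one GPU", not "GPU number 1". In addition, a second `tf.Session` keeps its own variable values, so re-running the initializers in `sess2` would likely evaluate freshly initialized weights rather than the trained ones. One common way to keep a whole process off GPU 0 is the CUDA environment variable `CUDA_VISIBLE_DEVICES`, set before TensorFlow is imported. The sketch below uses only the standard library; `restrict_visible_gpus` is a hypothetical helper name, not part of TensorFlow or DKN.

```python
import os


def restrict_visible_gpus(gpu_ids):
    """Expose only the given physical GPU indices to CUDA.

    Hypothetical helper for illustration. It must run before TensorFlow
    (or any other CUDA-initializing library) is imported, because the
    variable is read when the CUDA runtime starts up.
    """
    value = ",".join(str(gpu_id) for gpu_id in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Make only physical GPU 1 visible; inside the process it then appears as GPU:0.
restrict_visible_gpus([1])

# import tensorflow as tf  # import only after the variable has been set
```

This only changes which card is used. It does not shrink the tensor that fails to allocate, so the OOM can still occur on the other GPU if the evaluation input stays the same size.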

from dkn.

hwwang55 commented on June 6, 2024

Hi! This is kind of weird because the default batch size is not that large. Reducing the batch size might help.


skeletonli commented on June 6, 2024

Thank you for your reply.
I tried setting batch_size to 64 and even 32, but I still get the error.
I found that the problem comes from this line in the train() function of train.py:

        # evaluation
        train_auc = model.eval(sess, get_feed_dict(model, train_data, 0, train_data.size))

It loads all of train_data into the feed_dict at once, so the training batch size has no effect on evaluation.
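
One way around feeding everything in a single call is to evaluate in slices and compute the metric once over the merged results. The sketch below is plain Python and makes several assumptions: `predict_batch(start, end)` is a hypothetical callable that returns the labels and scores for one slice (in the thread's code it would wrap a `sess.run` on the model's label and score tensors with `get_feed_dict(model, train_data, start, end)`, which is a guess about DKN's internals), and `auc` is a small pairwise implementation used here only to keep the example self-contained. Averaging per-batch AUC values is generally not the same as the AUC over the full set, which is why the predictions are merged first.

```python
def batched_predictions(predict_batch, n_samples, batch_size):
    """Call predict_batch(start, end) on consecutive slices and merge the results."""
    labels, scores = [], []
    for start in range(0, n_samples, batch_size):
        end = min(start + batch_size, n_samples)
        batch_labels, batch_scores = predict_batch(start, end)
        labels.extend(batch_labels)
        scores.extend(batch_scores)
    return labels, scores


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    correct = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return correct / (len(pos) * len(neg))


# Toy stand-in for the model: slices of fixed labels and scores.
toy_labels = [1, 0, 1, 0, 1]
toy_scores = [0.9, 0.2, 0.6, 0.7, 0.8]


def toy_predict_batch(start, end):
    return toy_labels[start:end], toy_scores[start:end]


labels, scores = batched_predictions(toy_predict_batch, len(toy_labels), batch_size=2)
print("AUC over all batches: %.4f" % auc(labels, scores))
```

With this pattern the largest tensor on the GPU scales with `batch_size` rather than with the full dataset size.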

In addition, when I check GPU usage with nvidia-smi while running this code:
def train(args, train_data, test_data):
    model = DKN(args)
    with tf.Session() as sess:
        ...

both of my GPUs use almost all of their memory, as shown below:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 419.17       Driver Version: 419.17       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108... WDDM  | 00000000:05:00.0  On |                  N/A |
|  0%   51C    P8    16W / 275W |   9863MiB / 11264MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108... WDDM  | 00000000:09:00.0  On |                  N/A |
|  0%   51C    P2    64W / 275W |   9429MiB / 11264MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

How can I solve this problem, please?

train_data size:14747
test_data size:408
word_embs size :401650
entity_embs size:91000
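
These sizes line up with the OOM message reported in this thread: the failing tensor has shape [442410,128,9,1], and 442410 = 14747 × 30, that is, exactly 30 rows per training sample. This is consistent with the whole training set being fed in one call. The factor 30 would match a per-sample click-history length, but that reading is an assumption and is not stated in the thread. The sketch below works out how large that single float32 tensor is.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor; 4 bytes per element for float32."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * bytes_per_element


oom_shape = (442410, 128, 9, 1)  # shape taken from the OOM message
train_size = 14747               # train_data size reported above

size_bytes = tensor_bytes(oom_shape)
size_gib = size_bytes / 2 ** 30
rows_per_sample = oom_shape[0] // train_size  # assumed meaning: rows contributed per sample

print("tensor size: %d bytes (%.2f GiB)" % (size_bytes, size_gib))
print("rows per training sample: %d" % rows_per_sample)
```

A single intermediate tensor of roughly 1.9 GiB, alongside the other activations of the same pass, makes an OOM on an 11 GiB card plausible even when the training batch size is small.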


skeletonli commented on June 6, 2024

I tried using:
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9
config.gpu_options.allow_growth=True
config.log_device_placement=True

Although GPU memory usage went down, evaluation still crashes with an OOM error.
Thu Jul 18 10:41:25 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 419.17       Driver Version: 419.17       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108... WDDM  | 00000000:05:00.0  On |                  N/A |
|  0%   56C    P2    64W / 275W |   8936MiB / 11264MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108... WDDM  | 00000000:09:00.0  On |                  N/A |
|  0%   49C    P8    17W / 275W |    602MiB / 11264MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+

So I tried to train on one GPU and evaluate on the other, using the code below:

with tf.device('/gpu:0'):
    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.local_variables_initializer())

        for step in range(args.n_epochs):
            # training
            start_list = list(range(0, train_data.size, args.batch_size))
            np.random.shuffle(start_list)
            for start in start_list:
                end = start + args.batch_size
                model.train(sess, get_feed_dict(model, train_data, start, end))

            config2 = tf.ConfigProto(device_count={'GPU': 1}, log_device_placement=True)
            config2.gpu_options.allow_growth = True
            with tf.Session(config=config2) as sess2:
                sess2.run(tf.global_variables_initializer())
                sess2.run(tf.local_variables_initializer())
                # evaluation
                train_auc = model.eval(sess2, get_feed_dict(model, train_data, 0, int(train_data.size)))
                test_auc = model.eval(sess2, get_feed_dict(model, test_data, 0, test_data.size))
                print('epoch %d    train_auc: %.4f    test_auc: %.4f' % (step, train_auc, test_auc))

But it does not work: GPU 0 is still used for evaluation, and I get "W T:\src\github\tensorflow\tensorflow\core\framework\op_kernel.cc:1318] OP_REQUIRES failed at conv_ops.cc:673 : Resource exhausted: OOM when allocating tensor with shape[442410,128,9,1] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc"
