keras-adamw's Issues

AttributeError: can't set attribute

Hi,

I'm using Keras 2.3.1 and TF '1.15.2-dlenv_tfe'.

my code:

model = Model(inputs=inp, outputs=out)
optimizer = AdamW(model, lr=1e-4)
model.compile(loss='mse', optimizer=optimizer)

And I'm getting:

     68         with K.name_scope(self.__class__.__name__):
     69             self.iterations = K.variable(0, dtype='int64', name='iterations')
---> 70             self.lr = K.variable(lr, name='lr')
     71             self.beta_1 = K.variable(beta_1, name='beta_1')
     72             self.beta_2 = K.variable(beta_2, name='beta_2')

AttributeError: can't set attribute

Error when using SGDW in a complex project

I was trying to use SGDW in a project, but it seems to cause an error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation resample_p6/conv2d/kernel/Initializer/random_uniform/sub: Could not satisfy explicit device specification '' because the node {{colocation_node resample_p6/conv2d/kernel/Initializer/random_uniform/sub}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:GPU:0].

The error occurs only with the SGDW optimizer and not with AdamW (I haven't tried NadamW).

The project I tried to apply SGDW to is EfficientDet, which is quite a complex project; nevertheless, this shouldn't happen, and I am not sure what the cause of the problem is. When it is used in a small network like the one provided in example.py, there doesn't seem to be any problem.
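One generic mitigation for colocation errors of this kind, offered only as an assumption and not verified against EfficientDet, is to enable soft device placement so TensorFlow may fall back to a compatible device:

import tensorflow as tf

# Let TF place an op on another device when the requested placement
# cannot be satisfied (hypothetical workaround, untested for this issue).
tf.config.set_soft_device_placement(True)

# For TF1-style graph/session execution, the equivalent would be:
# config = tf.compat.v1.ConfigProto(allow_soft_placement=True)
# sess = tf.compat.v1.Session(config=config)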

Warm Restart

Thank you for developing AdamW!

I have a question about warm restarts. Is it necessary to force-set t_cur = 0 at the end of each training epoch? (variant 1)
Or is t_cur automatically reset to 0 after reaching total_iterations? (variant 2)

(variant 1)
def on_epoch_end(...):
    ...
    K.set_value(self.model.optimizer.t_cur, 0)  # WARM RESTART
    ...

(variant 2)
trainset_size = 1000
batch_size = 64
optimizer = AdamW(..., total_iterations=15, batch_size=batch_size)

And correct me if I'm wrong. Best practice for setting 'total_iterations' is:
total_iterations = int(trainset_size / batch_size)
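For reference, a minimal sketch of variant 1 as a Keras callback, assuming the tf.keras path (TF_KERAS=1); the reset value follows the usage shown in this repo's examples and is not a confirmed answer to the question above:

import os; os.environ["TF_KERAS"] = '1'
from tensorflow.keras import backend as K
from tensorflow.keras.callbacks import Callback

class WarmRestart(Callback):
    """Resets the optimizer's t_cur at the end of every epoch (sketch)."""
    def on_epoch_end(self, epoch, logs=None):
        # The example elsewhere in this repo sets t_cur to -1 one iteration
        # before the restart point; 0 at epoch end is the variant asked about here.
        K.set_value(self.model.optimizer.t_cur, 0)

# hypothetical usage: model.fit(x, y, epochs=3, callbacks=[WarmRestart()])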

AdaBelief

Thank you very much for your work on this project! It really is an excellent contribution to provide an up-to-date version of AdamW that allows layer-dependent learning rates. I'm wondering what your thoughts are about AdaBelief and if you'd want to add it as an option to this package?

WeightDecay is incorrectly normalized

Hey,

First of all, thank you for this library; it works great in general.

I just wanted to point out that I think the weight decay is incorrectly normalized with respect to the batch size. From the original paper, the normalized weight decay formula is as follows:

λ = λ_norm * sqrt(b / (B*T)), where b is the batch size, B is the number of epochs, and, most importantly, T is the total number of training samples.

The code assumes that total_iterations is set equal to B*T; however, iterations are counted by the number of batch updates, which equals steps_per_epoch * epochs.

This is missing the number of samples in each batch, b; to get back to the originally used B*T, we need sqrt(b / (b * total_iterations)) = sqrt(1 / total_iterations).

Therefore, batch_size should be set to 1 if total_iterations or total_iterations_wd is set as described in the examples here.
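A small numeric sketch of the argument above; the concrete values of T, B, b and the sqrt(batch_size / total_iterations) form are my own illustrative assumptions:

import math

T, B, b = 50_000, 10, 64          # training samples, epochs, batch size (hypothetical)
lambda_norm = 0.05

total_iterations = (T // b) * B   # iterations counted as batch updates

paper = lambda_norm * math.sqrt(b / (B * T))                         # paper: sqrt(b / (B*T))
with_batch_size_b = lambda_norm * math.sqrt(b / total_iterations)    # off by roughly sqrt(b)
with_batch_size_1 = lambda_norm * math.sqrt(1 / total_iterations)    # matches the paper's value

print(paper, with_batch_size_b, with_batch_size_1)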

Import issue while using tensorflow.keras

Hi,

Very good work for implementing this :)
However, the TF_KERAS environment variable doesn't work to select the right AdamW from the right optimizers file ... I set the variable at the beginning of my code, and debug mode shows it is present.
Therefore, I'm doing from keras_adamw.optimizers_v2 import AdamW and it works :)

It seems that with the direct import from keras_adamw import AdamW, it goes into optimizers225.py.
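For reference, a minimal sketch of the import order used by the other examples in this repo; the assumption that the variable must be set before keras_adamw is imported is my reading of those examples, not a confirmed diagnosis of the issue above:

import os
os.environ["TF_KERAS"] = '1'   # set before keras_adamw is imported

from tensorflow.keras.models import Model  # tf.keras, matching TF_KERAS=1
from keras_adamw import AdamW              # should now resolve to the tf.keras (optimizers_v2) implementation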

Thanks for your work

keras.legacy no longer present

Looks like keras.legacy is no longer part of Keras 2.4:

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-5-c6757b174f2e> in <module>()
----> 1 from keras_adamw import AdamW
      2 import tensorflow_hub as hub
      3 import os
      4 
      5 # Load compressed models from tensorflow_hub

1 frames
/usr/local/lib/python3.6/dist-packages/keras_adamw/optimizers.py in <module>()
      1 import numpy as np
      2 from keras import backend as K
----> 3 from keras.legacy import interfaces
      4 from keras.optimizers import Optimizer
      5 from .utils import _init_weight_decays, _apply_weight_decays, _check_args

ModuleNotFoundError: No module named 'keras.legacy'

AttributeError: 'L1' object has no attribute 'l2'

I have the following code:

lr_multipliers = {'lstm_1': 0.5}
optimizer = AdamW(lr=1e-4, model=model_AdamW, lr_multipliers=lr_multipliers,
                  use_cosine_annealing=True, total_iterations=24)
model_AdamW.compile(optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

and I got the error: AttributeError: 'L1' object has no attribute 'l2'

IMPORTANT: upgrade to 1.23

Versions 1.2 and 1.21 use an erroneous decay formula, decaying l1 as l2 and vice versa; this is fixed in 1.23. Pardon the mishap.

SGDW doesn't work

Hi!
Thanks for all your effort. This code really helps when implementing a custom optimizer.

There seems to be an issue with SGDW. The sample code from the README works fine with AdamW, but crashes when using SGDW:

import os; os.environ["TF_KERAS"]='1'
import numpy as np
from tensorflow.keras.layers import Input, Dense, LSTM
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l1, l2, l1_l2
import keras_adamw

ipt   = Input(shape=(120, 4))
x     = LSTM(60, activation='relu', name='lstm_1',
             kernel_regularizer=l1(1e-4), recurrent_regularizer=l2(2e-4))(ipt)
out   = Dense(1, activation='sigmoid', kernel_regularizer=l1_l2(1e-4, 2e-4))(x)
model = Model(ipt, out)

lr_multipliers = {'lstm_1': 0.5}

optimizer = keras_adamw.SGDW(lr=1e-4, model=model)
model.compile(optimizer, loss='binary_crossentropy')

for epoch in range(3):
    for iteration in range(24):
        x = np.random.rand(10, 120, 4) # dummy data
        y = np.random.randint(0, 2, (10, 1)) # dummy labels
        loss = model.train_on_batch(x, y)
        print("Iter {} loss: {}".format(iteration + 1, "%.3f" % loss))
    print("EPOCH {} COMPLETED\n".format(epoch + 1))

returns

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-d2bca98bfb4f> in <module>()
     21         x = np.random.rand(10, 120, 4) # dummy data
     22         y = np.random.randint(0, 2, (10, 1)) # dummy labels
---> 23         loss = model.train_on_batch(x, y)
     24         print("Iter {} loss: {}".format(iteration + 1, "%.3f" % loss))
     25     print("EPOCH {} COMPLETED\n".format(epoch + 1))

8 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py in wrapper(*args, **kwargs)
    966           except Exception as e:  # pylint:disable=broad-except
    967             if hasattr(e, "ag_error_metadata"):
--> 968               raise e.ag_error_metadata.to_exception(e)
    969             else:
    970               raise

ValueError: in user code:

    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:571 train_function  *
        outputs = self.distribute_strategy.run(
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:951 run  **
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2290 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2649 _call_for_each_replica
        return fn(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:541 train_step  **
        self.trainable_variables)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:1814 _minimize
        optimizer.apply_gradients(zip(gradients, trainable_variables))
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:508 apply_gradients
        "name": name,
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2420 merge_call
        return self._merge_call(merge_fn, args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2427 _merge_call
        return merge_fn(self._strategy, *args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:592 _distributed_apply  **
        var, apply_grad_to_update_var, args=(grad,), group=False))
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2013 update
        return self._update(var, fn, args, kwargs, group)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2659 _update
        return self._update_non_slot(var, fn, (var,) + tuple(args), kwargs, group)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2665 _update_non_slot
        result = fn(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:567 apply_grad_to_update_var  **
        update_op = self._resource_apply_dense(grad, var, **apply_kwargs)
    /usr/local/lib/python3.6/dist-packages/keras_adamw/optimizers_v2.py:672 _resource_apply_dense
        m = K.zeros(K.int_shape(var))
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py:1333 zeros
        return variable(v, dtype=dtype, name=name)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py:845 variable
        constraint=constraint)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py:261 __call__
        return cls._variable_v2_call(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py:255 _variable_v2_call
        shape=shape)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py:66 getter
        return captured_getter(captured_previous, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2562 creator
        return next_creator(**kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py:66 getter
        return captured_getter(captured_previous, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2562 creator
        return next_creator(**kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py:66 getter
        return captured_getter(captured_previous, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2562 creator
        return next_creator(**kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py:66 getter
        return captured_getter(captured_previous, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2562 creator
        return next_creator(**kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py:66 getter
        return captured_getter(captured_previous, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py:511 invalid_creator_scope
        "tf.function-decorated function tried to create "

    ValueError: tf.function-decorated function tried to create variables on non-first call.

I'm using tensorflow-gpu version 2.2 and tf.keras.

ValueError: Could not interpret optimizer identifier: <keras_adamw.optimizers_v2.AdamW object at 0x0000021E2F81D220>

Hello:
When I run the example, I had the error "ValueError: Could not interpret optimizer identifier: <keras_adamw.optimizers_v2.AdamW object at 0x0000021E2F81D220>". It may be the difference between keras and tf.keras. But when I change the imports as follows:
import numpy as np
import os
os.environ['TF_KERAS'] = '1'
from tensorflow.keras.layers import Input, Dense, LSTM
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l1, l2, l1_l2

from keras_adamw import AdamW

and I run under the TensorFlow environment.

Usage & concept questions

It works perfectly for me. Thank you for sharing and developing this repo. I think this idea really works (at least for my problem).

Thanks,
Chong

Note to users

Currently the implementation isn't fully compatible with tf.keras, tf.python.keras, or TensorFlow 2.0

I'm working on addressing all of these. TF 2 brought sweeping changes that complicate compatibility, along with some bugs. You can "watch" the repo to be notified of the next release (update).

Actual lr seems fixed during training

I am a bit confused about the optimizer's actual lr at each batch.

I have noticed that there is a (now closed) issue, "Usage & concept questions", where you refer to the actual lr (learning rate) being lr*eta_t.

But if I use your example as a basis and plot the lr at each batch, there does not appear to be any fluctuation in the actual lr, regardless of the values eta_t is assigned.

from tensorflow.keras import backend as K
import os
os.environ["TF_KERAS"] = '1'
os.environ["TF_EAGER"] = '0'

from tensorflow.keras.layers import Input, Dense, LSTM
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l1, l2, l1_l2

import numpy as np
import matplotlib.pyplot as plt

from keras_adamw import AdamW
from keras_adamw.utils import K_eval

USE_CPU = True

if USE_CPU:
    os.environ['CUDA_VISIBLE_DEVICES'] = ''

ipt = Input(shape=(120, 4))
x = LSTM(60, activation='relu', name='lstm_1',
         kernel_regularizer=l1(1e-4), recurrent_regularizer=l2(2e-4))(ipt)
out = Dense(1, activation='sigmoid', kernel_regularizer=l1_l2(1e-4, 2e-4))(x)
model = Model(ipt, out)

lr_multipliers = {'lstm_1': 0.5}

optimizer = AdamW(lr=1e-4, model=model, lr_multipliers=lr_multipliers,
                  use_cosine_annealing=True, total_iterations=24)
model.compile(optimizer, loss='binary_crossentropy')

eta_history = []
lr_history = []
for epoch in range(3):
    for iteration in range(24):
        x = np.random.rand(10, 120, 4)  # dummy data
        y = np.random.randint(0, 2, (10, 1))  # dummy labels
        loss = model.train_on_batch(x, y)
        eta_t = K_eval(model.optimizer.eta_t, K)
        eta_history.append(eta_t)
        t_cur = K_eval(model.optimizer.t_cur, K)
        lr = K_eval(model.optimizer.lr, K)  # K.eval(model.optimizer.lr)
        lr_history.append(lr)
        eta_max = K_eval(model.optimizer.eta_max, K)
        eta_min = K_eval(model.optimizer.eta_min, K)

        print('Iter {} t_cur: {} - lr: {} - eta_max: {} - eta_min: {}'.format(iteration + 1, t_cur, lr, eta_max, eta_min))
        print("Iter {} loss: {} - eta_t: {}".format(iteration + 1, "%.3f" % loss, eta_t))
        if iteration == (24 - 2):
            K.set_value(model.optimizer.t_cur, -1)  # WARM RESTART
    print("EPOCH {} COMPLETED\n".format(epoch + 1))

plt.plot(eta_history, linewidth=2)
plt.xlim(0, len(eta_history))
plt.ylim(0, 1.05)
plt.ylabel('eta_t', weight='bold', fontsize=15)
plt.xlabel('Train iterations', weight='bold', fontsize=15)
plt.gcf().set_size_inches(10, 5)
plt.show()
plt.close()

plt.plot(lr_history, linewidth=2)
plt.xlim(0, len(lr_history))
plt.ylim(0.9*np.min(lr_history), 1.1*np.max(lr_history))
plt.ylabel('lr', weight='bold', fontsize=15)
plt.xlabel('Train iterations', weight='bold', fontsize=15)
plt.gcf().set_size_inches(10, 5)
plt.show()

Iter 1 t_cur: 1 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 1 loss: 0.691 - eta_t: 0.9953429698944092
Iter 2 t_cur: 2 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 2 loss: 0.694 - eta_t: 0.9814586639404297
Iter 3 t_cur: 3 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 3 loss: 0.704 - eta_t: 0.9586056470870972
Iter 4 t_cur: 4 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 4 loss: 0.689 - eta_t: 0.927209734916687
Iter 5 t_cur: 5 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 5 loss: 0.682 - eta_t: 0.8878556489944458
Iter 6 t_cur: 6 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 6 loss: 0.708 - eta_t: 0.8412765264511108
Iter 7 t_cur: 7 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 7 loss: 0.684 - eta_t: 0.788340151309967
Iter 8 t_cur: 8 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 8 loss: 0.691 - eta_t: 0.7300325036048889
Iter 9 t_cur: 9 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 9 loss: 0.701 - eta_t: 0.6674398183822632
Iter 10 t_cur: 10 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 10 loss: 0.690 - eta_t: 0.6017280220985413
Iter 11 t_cur: 11 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 11 loss: 0.699 - eta_t: 0.5341211557388306
Iter 12 t_cur: 12 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 12 loss: 0.699 - eta_t: 0.46587878465652466
Iter 13 t_cur: 13 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 13 loss: 0.687 - eta_t: 0.39827197790145874
Iter 14 t_cur: 14 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 14 loss: 0.713 - eta_t: 0.3325602114200592
Iter 15 t_cur: 15 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 15 loss: 0.709 - eta_t: 0.2699674367904663
Iter 16 t_cur: 16 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 16 loss: 0.688 - eta_t: 0.21165981888771057
Iter 17 t_cur: 17 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 17 loss: 0.692 - eta_t: 0.15872341394424438
Iter 18 t_cur: 18 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 18 loss: 0.687 - eta_t: 0.1121443510055542
Iter 19 t_cur: 19 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 19 loss: 0.684 - eta_t: 0.07279029488563538
Iter 20 t_cur: 20 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 20 loss: 0.693 - eta_t: 0.04139435291290283
Iter 21 t_cur: 21 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 21 loss: 0.699 - eta_t: 0.018541336059570312
Iter 22 t_cur: 22 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 22 loss: 0.699 - eta_t: 0.00465703010559082
Iter 23 t_cur: 23 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 23 loss: 0.678 - eta_t: 0.0
Iter 24 t_cur: 0 - lr: 9.999999747378752e-05 - eta_max: 1.0 - eta_min: 0.0
Iter 24 loss: 0.696 - eta_t: 1.0
EPOCH 1 COMPLETED
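As a side note, the quantity described as the actual rate in the referenced issue is lr*eta_t, which the plot above cannot show because model.optimizer.lr itself stays constant; a minimal sketch of logging it instead, reusing the K_eval helper and variables from the code above:

# inside the training loop above, after train_on_batch:
effective_lr = K_eval(model.optimizer.lr, K) * K_eval(model.optimizer.eta_t, K)
lr_history.append(effective_lr)  # this curve should follow the cosine-annealed eta_t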

Comparison Against Adam

Is it possible for you to benchmark your implementation of AdamW against TensorFlow's implementation of Adam on multiple datasets? It would be useful information for users deciding whether AdamW is the right choice. I would also be interested in the difference in time per epoch/step.

Invalid argument: Input to reshape is a tensor with 4300800 values, but the requested shape has 19268370432

Hi,

I am using keras-adamw with bert-for-tf2 under the AMD ROCm environment, and sometimes I get an error like the following:

File "bert-decept.py", line 543, in
history = fit_model(model, data, BATCH_SIZE, EPOCHS, tensorboard_callback, model_checkpoint_callback,
File "bert-decept.py", line 438, in fit_model
history = model.fit(
File "/home/papadako/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/training_v1.py", line 766, in fit
return func.fit(
File "/home/papadako/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/training_arrays.py", line 649, in fit
return fit_loop(
File "/home/papadako/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/training_arrays.py", line 386, in model_iteration
batch_outs = f(ins_batch)
File "/home/papadako/.local/lib/python3.8/site-packages/tensorflow/python/keras/backend.py", line 3631, in call
fetched = self._callable_fn(*array_vals,
File "/home/papadako/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1470, in call
ret = tf_session.TF_SessionRunCallable(self._session._session,
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Input to reshape is a tensor with 4300800 values, but the requested shape has 19268370432
[[{{node bert_1/encoder/layer_7/attention/self/query/Tensordot}}]]
[[Func/training_2/AdamW/gradients/gradients/bert_1/encoder/layer_7/output/dropout_62/cond_grad/StatelessIf/then/_11515/input/_23174/_9837]]
(1) Invalid argument: Input to reshape is a tensor with 4300800 values, but the requested shape has 19268370432
[[{{node bert_1/encoder/layer_7/attention/self/query/Tensordot}}]]
0 successful operations.
0 derived errors ignored.

or

File "bert-decept.py", line 543, in
history = fit_model(model, data, BATCH_SIZE, EPOCHS, tensorboard_callback, model_checkpoint_callback,
File "bert-decept.py", line 438, in fit_model
history = model.fit(
File "/home/papadako/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/training_v1.py", line 766, in fit
return func.fit(
File "/home/papadako/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/training_arrays.py", line 649, in fit
return fit_loop(
File "/home/papadako/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/training_arrays.py", line 386, in model_iteration
batch_outs = f(ins_batch)
File "/home/papadako/.local/lib/python3.8/site-packages/tensorflow/python/keras/backend.py", line 3631, in call
fetched = self._callable_fn(*array_vals,
File "/home/papadako/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1470, in call
ret = tf_session.TF_SessionRunCallable(self._session._session,
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Size 0 must be non-negative, not -1737945760
[[{{node bert/encoder/layer_5/attention/self/query/Tensordot/Reshape}}]]
[[Func/training/AdamW/gradients/gradients/bert/encoder/layer_9/output/dropout_30/cond_grad/StatelessIf/then/_696/input/_2295/_3389]]
(1) Invalid argument: Size 0 must be non-negative, not -1737945760
[[{{node bert/encoder/layer_5/attention/self/query/Tensordot/Reshape}}]]

At least to my inexperienced eyes it seems like an invalid pointer reference, so probably not a problem related to AdamW but rather to ROCm. Does anyone have any idea/insight into what might be the problem?

Best regards
Panagiotis

how to log learning_rate

Hi, I have used your AdamW in my project, but when I log the learning_rate to my log file or TensorBoard with self.optimizer.lr, it is always the initial learning rate. Can you tell me how to log the changing learning rate at each step? Thanks
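A minimal sketch of one way to do this, assuming the tf.keras path and taking the effective rate to be lr*eta_t as discussed in the issues above; the callback name and where the values are written are illustrative:

import os; os.environ["TF_KERAS"] = '1'
from tensorflow.keras import backend as K
from tensorflow.keras.callbacks import Callback
from keras_adamw.utils import K_eval

class EffectiveLRLogger(Callback):
    """Records lr * eta_t after every batch (sketch)."""
    def __init__(self):
        super().__init__()
        self.history = []

    def on_train_batch_end(self, batch, logs=None):
        lr = K_eval(self.model.optimizer.lr, K)
        eta_t = K_eval(self.model.optimizer.eta_t, K)
        # Write this value to your log file / TensorBoard instead of optimizer.lr
        self.history.append(lr * eta_t)

# hypothetical usage: model.fit(x, y, callbacks=[EffectiveLRLogger()])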
