keras-transformer's Introduction

Keras-Transformer

Keras-transformer is a Python library implementing the nuts and bolts for building (Universal) Transformer models with Keras, along with examples of how it can be applied.

The library supports:

  • positional encoding and embeddings,
  • attention masking,
  • memory-compressed attention,
  • ACT (adaptive computation time),
  • a general implementation of BERT (because the Transformer is mainly applied to NLP tasks).

It allows you to piece together a multi-step Transformer model in a flexible way, for example:

transformer_block = TransformerBlock(
    name='transformer',
    num_heads=8,
    residual_dropout=0.1,
    attention_dropout=0.1,
    use_masking=True)
add_coordinate_embedding = TransformerCoordinateEmbedding(
    transformer_depth,
    name='coordinate_embedding')
    
output = transformer_input # shape: (<batch size>, <sequence length>, <input size>)
for step in range(transformer_depth):
    output = transformer_block(
        add_coordinate_embedding(output, step=step))

All pieces of the model (like self-attention, the activation function, layer normalization) are available as Keras layers, so, if necessary, you can build your own version of the Transformer by re-arranging or replacing some of them.
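For example, here is a minimal sketch of a complete language-model-style network assembled from these pieces. The layer names match those used in the repository's examples; the extras module path and the hyper-parameter values are assumptions:

from keras.layers import Input, Softmax
from keras.models import Model
from keras_transformer.transformer import TransformerBlock
from keras_transformer.position import TransformerCoordinateEmbedding
# The embedding helpers are assumed to live in `extras`; adjust the import if needed.
from keras_transformer.extras import ReusableEmbedding, TiedOutputEmbedding

vocab_size, max_len, model_dim, transformer_depth = 10000, 128, 64, 5  # hypothetical sizes

word_ids = Input(shape=(max_len,), dtype='int32', name='word_ids')
embedding_layer = ReusableEmbedding(vocab_size, model_dim, input_length=max_len,
                                    name='word_embeddings')
coordinate_embedding = TransformerCoordinateEmbedding(transformer_depth,
                                                      name='coordinate_embedding')
transformer_block = TransformerBlock(name='transformer', num_heads=8,
                                     residual_dropout=0.1, attention_dropout=0.1,
                                     use_masking=True)
output_logits = TiedOutputEmbedding(projection_dropout=0.1,
                                    name='word_prediction_logits')

# Embed the tokens, apply the (shared) block `transformer_depth` times,
# then project back onto the vocabulary with tied weights.
output, embedding_matrix = embedding_layer(word_ids)
for step in range(transformer_depth):
    output = transformer_block(coordinate_embedding(output, step=step))
probabilities = Softmax(name='word_predictions')(
    output_logits([output, embedding_matrix]))

model = Model(inputs=[word_ids], outputs=[probabilities])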

The (Universal) Transformer is a deep learning architecture described in two of arguably the most impressive DL papers of 2017 and 2018: "Attention Is All You Need" and "Universal Transformers", by the Google Research and Google Brain teams.

The authors introduced the idea of recurrent multi-head self-attention, which has inspired a steady wave of new research models ever since. These models demonstrate new state-of-the-art results on various NLP tasks, including translation, parsing, question answering, and even some algorithmic tasks.

Installation

To install the library, first clone the repository

git clone https://github.com/kpot/keras-transformer.git

then switch to the cloned directory and run pip

cd keras-transformer
pip install .

Please note that the project requires Python >= 3.6.

Language modelling examples with BERT and GPT

This repository contains simple examples showing how Keras-transformer works. They are not a rigorous evaluation of the model's capabilities, but rather a demonstration of how to use the code.

The code trains simple language-modeling networks on the WikiText-2 dataset and evaluates their perplexity. The model is either a vanilla Transformer or (by default) an Adaptive Universal Transformer, with five layers. Each can be trained using either:

  • Generative pre-training (GPT), which uses masked self-attention to prevent the model from "looking into the future".
  • BERT, which doesn't restrict self-attention, allowing the model to fill in the gaps using both left and right context (see the sketch after this list).
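In terms of this library, the distinction maps onto the use_masking flag of TransformerBlock. A minimal sketch, with placeholder hyper-parameter values:

from keras_transformer.transformer import TransformerBlock

# GPT-style: causal masking, a position can only attend to itself and earlier positions
gpt_block = TransformerBlock(
    name='gpt_block', num_heads=8,
    residual_dropout=0.1, attention_dropout=0.1,
    use_masking=True)

# BERT-style: no masking, every position can attend to both left and right context
bert_block = TransformerBlock(
    name='bert_block', num_heads=8,
    residual_dropout=0.1, attention_dropout=0.1,
    use_masking=False)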

To launch the code, you will first need to install the requirements listed in example/requirements.txt. Assuming you work from a Python virtual environment, you can do this by running

pip install -r example/requirements.txt

You will also need to make sure you have a backend for Keras. For instance, you can install TensorFlow (the examples were tested using TensorFlow and PlaidML as backends):

pip install tensorflow

Now you can launch the GPT example as

python -m example.run_gpt --save lm_model.h5

To see all command-line options and their default values, try

python -m example.run_gpt --help

If all goes well, after launching the example you should see the perplexity falling with each epoch.

Building vocabulary: 100%|█████████████████████████████████| 36718/36718 [00:04<00:00, 7642.33it/s]
Learning BPE...Done
Building BPE vocabulary: 100%|███████████████████████████████| 36718/36718 [00:06<00:00, 5743.74it/s]
Train on 9414 samples, validate on 957 samples
Epoch 1/50
9414/9414 [==============================] - 76s 8ms/step - loss: 7.0847 - perplexity: 1044.2455
    - val_loss: 6.3167 - val_perplexity: 406.5031
...

After 200 epochs (~5 hours) of training on a GeForce 1080 Ti, I got a validation perplexity of about 51.61 and a test perplexity of 50.82. The score can be improved further, but that is not the point of this demo.
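For reference, perplexity here is essentially the exponential of the per-token cross-entropy, which can be expressed as a Keras metric roughly like this (a sketch, not necessarily the exact metric the example scripts use):

import keras.backend as K

def perplexity(y_true, y_pred):
    # exp of the mean per-token cross-entropy (assumes sparse integer targets)
    cross_entropy = K.sparse_categorical_crossentropy(y_true, y_pred)
    return K.exp(K.mean(cross_entropy))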

The BERT model example can be launched similarly:

python -m example.run_bert --save lm_model.h5 --model vanilla

but you will need to be patient. BERT easily achieves better performance than GPT, but requires much more training time to converge.

keras-transformer's People

Contributors

kpot


keras-transformer's Issues

What are the inputs for a language model?

Hello:
Very nice code. I am trying to train my own language model using GPT. A language model, as the "Attention Is All You Need" paper says, is built by predicting each next word from all of its previous words. But I found that the input of your GPT model is not like that.
For example: if I have a sentence [5, 242, 5, 6, 354], I suppose the input x should be [5], [5, 242], [5, 242, 5], [5, 242, 5, 6], and y should be [[242], [5], [6], [354]]; but the input x in your model is [5, 242, 5, 6, 354] and y is [[242], [5], [6], [354]], and I could not find any reshaping of x in your code.
Could you please explain this difference? Thank you.
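For context, with masked self-attention a single pass over the whole sequence produces a prediction at every position at once, which is equivalent to feeding all the prefixes separately. A hypothetical illustration of how such (x, y) pairs are typically built, not necessarily this repository's exact preprocessing:

import numpy as np

tokens = np.array([5, 242, 5, 6, 354])  # hypothetical token ids for one sentence

# With causal (masked) self-attention the model sees the whole sequence at once,
# and position t may only attend to positions <= t, so the targets are simply
# the inputs shifted one step to the left:
x = tokens[:-1]                  # [5, 242, 5, 6]
y = tokens[1:].reshape(-1, 1)    # [[242], [5], [6], [354]]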

ImportError: cannot import name 'Activations' on GPT example

I followed the installation instructions in the readme, using TensorFlow as the backend, but trying to run the GPT example gives this error:

  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/beau101023/transformer_project/keras-transformer/example/run_gpt.py", line 18, in <module>
    from .models import (
  File "/home/beau101023/transformer_project/keras-transformer/example/models.py", line 9, in <module>
    from keras_transformer.transformer import TransformerACT, TransformerBlock
  File "/home/beau101023/transformer_project/keras-transformer/keras_transformer/transformer.py", line 9, in <module>
    from keras.layers import Layer, Add, activations, Dropout
ImportError: cannot import name 'activations'

I am running this on an Ubuntu 18.04 LTS instance in VirtualBox, Python version 3.6.8, in a virtual environment.

Running python -m example.run_gpt --help gives the same error.

Variable input length

Hello,

Thanks for this great project.
I would like to use a variable input length, i.e.
input_tensor = Input([None, input_dim], name='X_%s' % input_feat)
but this generates an error in the attention code:
pre_q, pre_k, pre_v = [
    K.reshape(
        # K.slice(qkv, (0, i * d_model), (-1, d_model)),
        qkv[:, i * d_model:(i + 1) * d_model],
        (-1, seq_len, self.num_heads, d_model // self.num_heads))
    for i in range(3)]

Can you please suggest a solution for this problem?

Shape order for the input passed to the transformer

Hi Kirill, thanks for the great work! It's great to have this in keras!

I'm trying to use the transformer, but I'm not sure if I'm handling the shapes correctly. I have data in the form of 10 sequences of 40-length vectors (something like 10 timesteps of vectors that have 40 features), so I'm using a Keras Model and my input layer is inputNiveles = Input(shape=(10, 40), dtype='float', name="input_niveles"). If I purposely put a wrong number of heads in the transformer, the error I get is this one:
"The size of the last dimension of the input (40) must be evenly divisible by the number of the attention heads 11"
But aren't the heads supposed to act at the level of the sequence, not the features, making the error say something like "input (10) must be..."?
Is the transformer expecting the number of steps (the sequence) to be the last dimension?
I'm also using the coordinate embedding layer.
add_coordinate_embedding2 = TransformerCoordinateEmbedding(
    transformer_depth, name='coordinate_embedding2')

transformer_block2 = TransformerBlock(
    name='transformer2', num_heads=10, residual_dropout=0.0,
    attention_dropout=0.0, use_masking=True)

nivelesOut = inputNiveles

for step in range(transformer_depth):
    nivelesOut = transformer_block2(
        add_coordinate_embedding2(nivelesOut, step=step))

nivelesOut = Flatten(name="aplane_niveles")(nivelesOut)

Thank you very much Kirill
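For context, the attention layers split the last (feature) dimension of the input across the heads, so that dimension must be divisible by num_heads; the sequence length is unaffected. A tiny check using the shapes from this issue:

seq_len, model_dim = 10, 40        # from Input(shape=(10, 40)) above
num_heads = 10
assert model_dim % num_heads == 0  # 40 features split into 10 heads of size 4 each
# num_heads = 11 fails, because 40 is not evenly divisible by 11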

Transformer encoder layer instead of Bidirectional LSTM

So I want to change the Keras bidirectional LSTM layer below into a Transformer encoder:

lstmLayer = keras.layers.Bidirectional( keras.layers.CuDNNLSTM(args.rnnSize, return_sequences = True, recurrent_initializer = 'glorot_uniform' ) )(inputLayer)

So can this be accomplished using your library? The rest of the code remains the same; I just want to replace the bidirectional LSTM layers with a Transformer.

I would really appreciate your help. Thanks.

Creating a seq2seq type of architecture

Hi, thanks for this excellent repo. It would be great if you could show a small example of using the transformer block for any type of text-to-text conversion. I am trying to use this repo for benchmarking the Transformer vs. a classical LSTM + attention on general text-to-text tasks.

Expected Input shape?

Hi,
Thanks for implementing this library. For the universal GPT transformer example you provided, what is the expected input like?
For your reference, I'm trying to solve a text summarization problem and my input is an article of words. From my understanding, I need to map each word to an index and pass that as input.

So, if my input is "an apple a day keeps the doctor away", my encoded sequence will look something like [52, 34, 25, 23, 56, 12, 57, 45]

I am doing this for every article, so my input shape will be (train_set_size, max_seq_length, 1).

However, I'm getting error for this input shape. How do I fix this?

Also, what is the expected output shape?

Thanks.

decoder and encoder model

Hey!

nice code man!

Can you reproduce the results of the original code? If I understand correctly, you only implemented the encoder side?

Best,
Luca

bug in TransformerACT

    def initialize_control_tensors(self, halting):
        self.zeros_like_halting = K.zeros_like(
            halting, name='zeros_like_halting')
        self.ones_like_halting = K.ones_like(
            halting, name='ones_like_halting')
        self.remainder = self.zeros_like_halting    # <----- when `remainder` changes, `zeros_like_halting` changes with it
        self.active_steps = self.ones_like_halting  # <----- same aliasing problem here
        self.halt_budget = self.ones_like_halting - self.halt_epsilon

This problem makes training unstable or impossible.

using keras model_from_json fails in some cases

AddCoordinateEncoding is loaded instead of AddPositionalEncoding.
Apply the following patch to position.py

@@ -135,5 +135,5 @@
 get_custom_objects().update({
     'TransformerCoordinateEmbedding': TransformerCoordinateEmbedding,
     'AddCoordinateEncoding': AddCoordinateEncoding,
-    'AddPositionalEncoding': AddCoordinateEncoding,
+    'AddPositionalEncoding': AddPositionalEncoding,
 })

Using the model without ReusableEmbedding and TiedOutputEmbedding

Hi, I just want to ask how to use the model without using these embedding layers. I tried to write it like this and got an exception.

x = np.array([[1, 1, 1]] * 1000)
y = np.array([[1]] * 1000)

input_layer = Input(shape=x.shape[1:], dtype='int32', name="input_layer")

transformer_layer = TransformerBlock(name="transformer", num_heads=2, use_masking=True, vanilla_wiring=False,
                                     residual_dropout=0.1, attention_dropout=0.1)

coordinate_embedding_layer = TransformerCoordinateEmbedding(1, name='coordinate_embedding')

transformer_act_layer = TransformerACT(name='adaptive_computation_time')

output_avg = Lambda(lambda in_x: K.mean(in_x, axis=1), name='avg_layer')
output_softmax_layer = Softmax(name='output_softmax')

next_step_input = input_layer
next_step_input = coordinate_embedding_layer(next_step_input, step=0)
next_step_input = transformer_layer(next_step_input)
next_step_input, act_output = transformer_act_layer(next_step_input)

transformer_act_layer.finalize()

final_output = output_softmax_layer(inputs=output_avg(act_output))

model = Model(inputs=[input_layer], outputs=[final_output])

optimizer = Adam(
    lr=0.1, beta_1=0.6, beta_2=0.999)

model.compile(
    optimizer,
    loss=sparse_categorical_crossentropy,
    metrics=["accuracy"])

I got an exception like this:

  File "/home/gregory112/go/src/keras-transformer/venv/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 558, in make_tensor_proto
    str_values = [compat.as_bytes(x) for x in proto_values]
  File "/home/gregory112/go/src/keras-transformer/venv/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 558, in <listcomp>
    str_values = [compat.as_bytes(x) for x in proto_values]
  File "/home/gregory112/go/src/keras-transformer/venv/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 61, in as_bytes
    (bytes_or_text,))
TypeError: Expected binary or unicode string, got None

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "try.py", line 64, in <module>
    main()
  File "try.py", line 35, in main
    next_step_input = coordinate_embedding_layer(next_step_input, step=0)
  File "/home/gregory112/go/src/keras-transformer/venv/lib/python3.6/site-packages/keras/engine/base_layer.py", line 431, in __call__
    self.build(unpack_singleton(input_shapes))
  File "/home/gregory112/go/src/keras-transformer/keras_transformer/position.py", line 113, in build
    trainable=True)
  File "/home/gregory112/go/src/keras-transformer/venv/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/gregory112/go/src/keras-transformer/venv/lib/python3.6/site-packages/keras/engine/base_layer.py", line 249, in add_weight
    weight = K.variable(initializer(shape),
  File "/home/gregory112/go/src/keras-transformer/venv/lib/python3.6/site-packages/keras/initializers.py", line 112, in __call__
    dtype=dtype, seed=self.seed)
  File "/home/gregory112/go/src/keras-transformer/venv/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 4139, in random_uniform
    dtype=dtype, seed=seed)
  File "/home/gregory112/go/src/keras-transformer/venv/lib/python3.6/site-packages/tensorflow/python/ops/random_ops.py", line 239, in random_uniform
    shape = _ShapeTensor(shape)
  File "/home/gregory112/go/src/keras-transformer/venv/lib/python3.6/site-packages/tensorflow/python/ops/random_ops.py", line 44, in _ShapeTensor
    return ops.convert_to_tensor(shape, dtype=dtype, name="shape")
  File "/home/gregory112/go/src/keras-transformer/venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1039, in convert_to_tensor
    return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
  File "/home/gregory112/go/src/keras-transformer/venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1097, in convert_to_tensor_v2
    as_ref=False)
  File "/home/gregory112/go/src/keras-transformer/venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1175, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/gregory112/go/src/keras-transformer/venv/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 304, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/home/gregory112/go/src/keras-transformer/venv/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 245, in constant
    allow_broadcast=True)
  File "/home/gregory112/go/src/keras-transformer/venv/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 283, in _constant_impl
    allow_broadcast=allow_broadcast))
  File "/home/gregory112/go/src/keras-transformer/venv/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 562, in make_tensor_proto
    "supported type." % (type(values), values))
TypeError: Failed to convert object of type <class 'tuple'> to Tensor. Contents: (None, 3). Consider casting elements to a supported type.

In my case, the values in x (the input dataset) will all be one-hot encoded, so I think I can skip the ReusableEmbedding layer. How can I do this?

Loading pre-trained BERT

Does your code support loading a pre-trained version of BERT, in order to use it for fine-tuning?

How to generate text with the trained model

Hello:
I appreciate your code because it is easy to read. But I don't know how to generate text with the trained model; I only know how to evaluate the trained model's accuracy. Could you please tell me how? Thank you.

Correct text generation with GPT

Hello.

Following the examples, I've successfully trained a universal transformer GPT model on a custom training corpus. Validation perplexity values are good, but I'm struggling to generate readable text with the model.

I pad arbitrary input up to the maximum sequence length, pass the sequence through the model, and take the logits for the next token. So, if an input is 5 tokens long, it's padded up to MAX_SEQ_LEN (256 in my case). Then I use the 6th token's logits to do top-k sampling, replace the sequence with the new token appended, and pass it through again.

Am I doing something wrong? Can the padding be an issue?
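For reference, a minimal sketch of the sampling loop described above, with hypothetical names; it assumes a trained model that maps a padded batch of token ids to per-position probability distributions:

import numpy as np

def generate(model, prompt_ids, max_seq_len=256, steps=50, k=10, pad_id=0):
    """Top-k sampling: pad the context, predict, sample the next token, repeat."""
    sequence = list(prompt_ids)
    for _ in range(steps):
        context = sequence[-max_seq_len:]
        padded = np.full((1, max_seq_len), pad_id, dtype='int32')
        padded[0, :len(context)] = context
        # The prediction for the next token sits at the position of the last real token
        probs = model.predict(padded)[0, len(context) - 1]
        top_k = np.argsort(probs)[-k:]
        next_token = np.random.choice(top_k, p=probs[top_k] / probs[top_k].sum())
        sequence.append(int(next_token))
    return sequence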

compression window error

It throws the following error when I set compression_window_size to an integer:

Failed to convert object of type <class 'tuple'> to Tensor. Contents: (-1, 16, None). Consider casting elements to a supported type.

create tag?

Currently I, and I suspect many others, are installing this via master. Could a release be tagged and pushed for better reproducibility?

Wrong predictions when reloading model

Hello,

Thanks for the easy implementation!

I have problems reloading the model and using it in a production-style pipeline.
I can clearly see during training and development that the model learns and gets better, but once I reload the model, the predictions are gibberish.

model = load_model('model.h5', custom_objects={'TransformerCoordinateEmbedding': TransformerCoordinateEmbedding})

Anyone else encountering these problems?

Regards
Andreas

How can I get the output word embeddings?

I trained a dataset using the vanilla transformer, but the .h5 file does not contain the output word embeddings generated by the model. Is there any way I can access them?

Also, how can I use pre-trained word vectors instead of the BPE encoding to start with?

What is significance of transformer_depth?

Hi,

Thank you so much for this repo!! It's amazing and very useful. Can you please help me understand the parameter transformer_depth? And what is the difference between num_heads and transformer_depth?

How to load the model?

I save the model with model.save(), but when I load it with keras.models.load_model it reports an error. How should I load the model?
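A hedged sketch of one way to do this: pass the library's custom layers via custom_objects (the exact set depends on how the model was built), and use compile=False if you only need inference:

from keras.models import load_model
from keras_transformer.transformer import TransformerBlock, TransformerACT
from keras_transformer.position import TransformerCoordinateEmbedding

# Extend this dict with every custom layer the saved model actually uses
# (e.g. ReusableEmbedding / TiedOutputEmbedding for the language-model examples).
model = load_model('lm_model.h5', compile=False, custom_objects={
    'TransformerBlock': TransformerBlock,
    'TransformerACT': TransformerACT,
    'TransformerCoordinateEmbedding': TransformerCoordinateEmbedding,
})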

There was a typo in the diagram showing the arrangement of the layers in the Universal Transformer paper

I saw that TransformerBlock was designed with two modes: vanilla and non-vanilla wiring. As documented, the vanilla wiring is used for the plain Transformer and the non-vanilla one for the Universal Transformer. The fact is, there is no difference between the position of the dropout in the vanilla Transformer and in the universal one.

"We apply dropout [33] to the output of each sub-layer, it is added to the sub-layer input and normalized"

This stays the same in the Universal Transformer. If you look at the figure in the Universal Transformer paper, there is a typo in the picture. Refer to this issue from tensor2tensor:

tensorflow/tensor2tensor#1215

This is the typo diagram:
https://images.app.goo.gl/gnjZLc4RVTndh7Fd7

This is the correct diagram, from a presentation by Mostafa Dehghani himself, as referenced in that GitHub issue:
http://mostafadehghani.com/wp-content/uploads/2018/08/Universal_Transformers.pdf/#page=11

So I guess the correct implementation is to use vanilla_wiring=True all the time? Just out of curiosity (and to help my research too): as written in the documentation of TransformerBlock, why do you think it is more reasonable to use the wiring shown in the old diagram?

Need explanation about how the model works

I tried to recreate the model you built in run_gpt.py with this simple code.

x = np.array([[1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1]])
y = np.array([[1], [1], [1], [1], [1]])
y = np.expand_dims(y, axis=-1)
print(x.shape) # (5, 3)
print(y.shape) # (5, 1, 1)

input_layer = Input(shape=x.shape[1:], dtype='int32', name="input_layer")

transformer_layer = TransformerBlock(name="transformer", num_heads=2, use_masking=True, vanilla_wiring=False, residual_dropout=0.1, attention_dropout=0.1)

reusable_embedding_layer = ReusableEmbedding(10, 2, input_length=x.shape[1], name="reusable_embedding")

output_layer = TiedOutputEmbedding(projection_dropout=0.1, name='word_prediction_logits')

coordinate_embedding_layer = TransformerCoordinateEmbedding(1, name='coordinate_embedding')

transformer_act_layer = TransformerACT(name='adaptive_computation_time')

output_softmax_layer = Softmax(name='output_softmax')

next_step_input, embedding_matrix = reusable_embedding_layer(input_layer)
next_step_input = coordinate_embedding_layer(next_step_input, step=0)
next_step_input = transformer_layer(next_step_input)
next_step_input, act_output = transformer_act_layer(next_step_input)

transformer_act_layer.finalize()

final_output = output_softmax_layer(inputs=output_layer(inputs=[act_output, embedding_matrix]))

model = Model(inputs=[input_layer], outputs=[final_output])

optimizer = Adam(
        lr=0.1, beta_1=0.6, beta_2=0.999)

model.compile(
        optimizer,
        loss=sparse_categorical_crossentropy,
        metrics=["accuracy"])

model.summary(150)

model.fit(x, y) # This will raise an exception because x and y have different sizes

# print(model.predict([1, 1, 1]))

A simple universal transformer with ACT and depth=1. Looking at the model summary, the final output always seems to be tied to the number of elements in one sequence (in this case 3) and the number of possible classes (i.e. vocabulary size; in this case 10). Why is it not outputting a 1x10 output, one per position? I just don't get it. I don't understand the meaning of the output if interpreted like this. The output of the softmax is a probability distribution, yes, I get that. But why is it 3x10 and not 1x10?

In the case of an LSTM, the output is one per time step; I can just use the output of the last timestep, for example. So I thought it should be the same with this model. There are also other things that I think are a bit different from the paper (in particular, the paper describes the output of the encoder as having a dimension of 512). What you implemented in run_gpt.py is, I think, only the encoder?

Can you help me clear things up?

Example in simple time-series

Hello,

It would be nice if you could provide some simple examples of how to apply these models in a simple multivariable time-series scenario (with plain values in a sequence: no embeddings, no words/text, etc.), if possible.

Like multivariable time-series classification (or regression), for generic use.

Multivariable time-series classification:
input (time-steps , variables )
output (class)

Thanks in advance!

multi-head behaviour

According to the "attention is all you need" paper, each head performs attention over the entire set of queries, keys and values, but with their own personal weights. Your implementation divides the last dimension over the heads. Why?
