google-research / retvec

RETVec is an efficient, multilingual, and adversarially-robust text vectorizer.

License: Apache License 2.0

Python 32.04% Jupyter Notebook 40.14% PureBasic 24.74% Shell 0.05% HTML 1.64% TypeScript 1.22% JavaScript 0.17%

deep-learning python tensorflow text-classification natural-language-processing nlp

retvec's Introduction

RETVec: Resilient & Efficient Text Vectorizer

Overview

RETVec is a next-gen text vectorizer designed to be efficient and multilingual, and to provide built-in adversarial resilience using robust word embeddings trained with similarity learning. You can read the paper here.

RETVec is trained to be resilient against character-level manipulations including insertion, deletion, typos, homoglyphs, LEET substitution, and more. The RETVec model is trained on top of a novel character encoder which can encode all UTF-8 characters and words efficiently. Thus, RETVec works out-of-the-box on over 100 languages without the need for a lookup table or fixed vocabulary size. Furthermore, RETVec is a layer, which means that it can be inserted into any TF model without the need for a separate pre-processing step.
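To make the "no lookup table" idea concrete, here is a minimal sketch of one way to turn any character into a fixed-size vector: binary-encode its Unicode code point. This is not RETVec's implementation. The names encode_char, encode_word and differing_bits are hypothetical, and the 24-bit width and 16-character word length are assumptions chosen for illustration.

```python
from typing import List

BITS_PER_CHAR = 24  # assumption: wide enough for any Unicode code point (max 0x10FFFF)
MAX_WORD_LEN = 16   # assumption: words are truncated or zero-padded to this length


def encode_char(ch: str) -> List[int]:
    """Return the code point of `ch` as a list of bits, least significant bit first."""
    code_point = ord(ch)
    return [(code_point >> i) & 1 for i in range(BITS_PER_CHAR)]


def encode_word(word: str) -> List[List[int]]:
    """Encode a word as a MAX_WORD_LEN x BITS_PER_CHAR bit matrix, with no vocabulary."""
    rows = [encode_char(ch) for ch in word[:MAX_WORD_LEN]]
    padding = [[0] * BITS_PER_CHAR for _ in range(MAX_WORD_LEN - len(rows))]
    return rows + padding


def differing_bits(a: str, b: str) -> int:
    """Count the bit positions where the encodings of two words differ."""
    return sum(
        x != y
        for row_a, row_b in zip(encode_word(a), encode_word(b))
        for x, y in zip(row_a, row_b)
    )


if __name__ == "__main__":
    # Any script works the same way, since nothing is looked up in a table.
    for word in ("hello", "he11o", "日本語"):
        print(word, differing_bits("hello", word))
```

An encoding like this only guarantees that every word gets some representation. The resilience to typos and homoglyphs described above comes from the embedding model trained on top of the encoder, not from the bit encoding itself.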

RETVec's speed and size (~200k instead of millions of parameters) also make it a great choice for on-device and web use cases. It is natively supported in TensorFlow Lite via custom ops in TensorFlow Text, and we provide a JavaScript implementation of RETVec which allows you to deploy web models via TensorFlow.js.

Please see our example colabs on how to get started with training your own models with RETVec. train_retvec_model_tf.ipynb is a great starting point for training a TF model using RETVec.

Demos

To see RETVec in action, visit our demos.

Getting started

Installation

You can use pip to install the latest TensorFlow version of RETVec:

pip install retvec

RETVec has been tested on TensorFlow 2.6+ and Python 3.8+.

Basic Usage

You can use RETVec as the vectorization layer in any TensorFlow model with just a single line of code. RETVec operates on raw strings with pre-processing options built-in (e.g. lowercasing text). For example:

import tensorflow as tf
from tensorflow.keras import layers
from retvec.tf import RETVecTokenizer

# Placeholder: set this to the number of classes in your dataset
NUM_CLASSES = 2

# Define the input layer, which accepts raw strings
inputs = layers.Input(shape=(1, ), name="input", dtype=tf.string)

# Add the RETVec Tokenizer layer using the RETVec embedding model -- that's it!
x = RETVecTokenizer(sequence_length=128)(inputs)

# Create your model like normal
# e.g. a simple LSTM model for classification with NUM_CLASSES classes
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64))(x)
outputs = layers.Dense(NUM_CLASSES, activation='softmax')(x)
model = tf.keras.Model(inputs, outputs)

Then you can compile, train, and save your model as usual. As demonstrated in our paper, models trained using RETVec are more resilient against adversarial attacks and typos, and are computationally efficient. RETVec is also supported in TFJS and TF Lite, making it well suited to on-device mobile and web use cases.

Colabs

Detailed example colabs for RETVec can be found under notebooks. These are a good way to get started with using RETVec. You can run the notebooks in Google Colab by clicking the Google Colab button. If none of the examples are similar to your use case, please let us know!

We have the following example colabs:

Training RETVec-based models using TensorFlow: train_retvec_model_tf.ipynb for GPU/CPU training, and train_tpu.ipynb for a TPU-compatible training example.
Converting RETVec models into TF Lite models to run on-device: tf_lite_retvec.ipynb
(Coming soon!) Using RETVec JS to deploy RETVec models on the web using TensorFlow.js

Citing

Please cite this reference if you use RETVec in your research:

@article{retvec2023,
    title={RETVec: Resilient and Efficient Text Vectorizer},
    author={Elie Bursztein and Marina Zhang and Owen Vallis and Xinyu Jia and Alexey Kurakin},
    year={2023},
    eprint={2302.09207}
}

Contributing

To contribute to the project, please check out the contribution guidelines. Thank you!

Disclaimer

This is not an official Google product.

retvec's Issues

Create React/Svelte web component for RETVec

TensorFlow Lite Colab

Colab on how to use RETVec with TF Lite for on-device use cases

Update README

Create a comprehensive README with installation instructions and links to demo colabs, paper, etc.

ValueError raised upon calling RetVecTokenizer function

Hi!

I am trying to use RETVec as an embedding layer for an email spam classification project. When the RETVecTokenizer layer is called, I get a ValueError, which is described below:

File format not supported: filepath=/root/.keras/retvec-v1. Keras 3 only supports V3 .keras files and legacy H5 format files (.h5 extension). Note that the legacy SavedModel format is not supported by load_model() in Keras 3. In order to reload a TensorFlow SavedModel as an inference-only layer in Keras 3, use keras.layers.TFSMLayer(/root/.keras/retvec-v1, call_endpoint='serving_default') (note that your call_endpoint might have a different name).

Here is the code I wrote:

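import tensorflow as tf
from tensorflow.keras import layers
from retvec.tf import RETVecTokenizer

inputs = layers.Input(shape=(1, ), name="token", dtype=tf.string)

# Assumed: the tokenizer call is missing from the post as captured;
# without it, x is undefined. This is the call that raises the error.
x = RETVecTokenizer(sequence_length=128)(inputs)

lstm_1 = tf.keras.layers.LSTM(20, dropout=0.2, return_sequences=True)(x)
lstm_2 = tf.keras.layers.LSTM(20, dropout=0.2, return_sequences=True)(lstm_1)
flatten = tf.keras.layers.Flatten()(lstm_2)
dropout = tf.keras.layers.Dropout(0.2, name="dropout")(flatten)
dense = tf.keras.layers.Dense(1, activation="sigmoid")(dropout)

model = tf.keras.Model(inputs=[inputs], outputs=[dense])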

How to resolve this?
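The error message says Keras 3 cannot load the SavedModel that RETVecTokenizer loads. One possible workaround, sketched below, is to ask TensorFlow to use the legacy Keras 2 implementation. This assumes TensorFlow 2.16 or later with the tf-keras package installed (pip install tf-keras), and it has not been verified against the setup in this report. Pinning TensorFlow to 2.15 or earlier, where tf.keras is still Keras 2, is another option.

```python
import os

# TensorFlow reads this variable at import time, so it must be set before
# `import tensorflow` runs for the first time in the process.
os.environ["TF_USE_LEGACY_KERAS"] = "1"

# Import TensorFlow and RETVec only after the variable is set, for example:
#   import tensorflow as tf
#   from retvec.tf import RETVecTokenizer
```

In a notebook, this has to go in the first cell, and the runtime must be restarted if TensorFlow was already imported.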

Fix import

https://github.com/google-research/retvec/blob/1615625edcff1ae5f517bcfbd14210371ce01c57/setup.py#LL62C14-L62C19

Try to remove unused / non critical dependencies like twine.

[Version 0.1.0] Refactoring for initial launch

Rename package from tensorflow_retvec back to RetVec
Clean up retvec layers
Fix tests

Android demo app for emotion classification.

I attempted to save the tflite_retvec model as a TF Lite model for deployment on an Android device, but I am having difficulty running it. Could you please share an Android demo app so I can understand how it performs on Android devices?

Why is the val_acc just 0.51?

In your notebook, why is the val_acc just 0.51? That does not seem good.

I applied the model to Chinese sentiment classification data. The performance also does not seem good, and I don't know why.

Nive

Abnormal GPU memory increase

I am using the RETVec model as the embedding method, but GPU memory usage grows abnormally, up to 40 GB.
tensorflow 2.15.0
keras 2.15.0
tf-keras 2.15.0
retvec 1.0.1

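Code used to build the tokenizer:

import tensorflow as tf
from tensorflow.keras import layers
from retvec.tf import RETVecTokenizer

def get_retvec_tokenizer(model_path):
    # Build the tokenizer model on the CPU
    with tf.device('/CPU:0'):
        inputs = layers.Input(shape=(1,), dtype=tf.string)
        outputs = RETVecTokenizer(model=model_path)(inputs)
        retvec = tf.keras.Model(inputs=inputs, outputs=outputs)

    return retvec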

Initializing the model state:

Running inference with the model state:

Is the code to generate the augmented data available anywhere?

In the paper, the authors write

Augmentations Token augmentation consists of randomly inserting up to 4 typos per token up to 25% of the token length. This is consistent with an observed maximum human error frequency of around 20% [11]. We use 22 distinct typo augmentations, which can be grouped into four categories: deletion, insertion, substitution, and transposition. For each token, we randomly select a target augmentation percentage between 0-25%, and for each augmentation step we randomly apply an augmentation from one of the four typo categories. The full list of augmentations used is reported in Appendix D.

Is the code to apply these augmentations available anywhere? I'd like to use & adapt it for my specific use-case.
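For anyone who wants a starting point while waiting for an answer, here is a minimal sketch of the procedure as the quoted passage describes it. It is not the authors' code and does not reproduce the 22 augmentations listed in Appendix D: it implements one simple operation per category (deletion, insertion, substitution, transposition), and all function names are hypothetical. Inserted and substituted characters are drawn from lowercase ASCII letters, which is an assumption.

```python
import random
import string

MAX_TYPOS = 4        # from the quoted passage: up to 4 typos per token
MAX_FRACTION = 0.25  # from the quoted passage: up to 25% of the token length


def _delete(chars, rng):
    # Never delete the last remaining character, so tokens do not vanish.
    if len(chars) > 1:
        del chars[rng.randrange(len(chars))]


def _insert(chars, rng):
    chars.insert(rng.randrange(len(chars) + 1), rng.choice(string.ascii_lowercase))


def _substitute(chars, rng):
    if chars:
        chars[rng.randrange(len(chars))] = rng.choice(string.ascii_lowercase)


def _transpose(chars, rng):
    # Swap two adjacent characters.
    if len(chars) > 1:
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]


OPS = (_delete, _insert, _substitute, _transpose)


def augment_token(token, rng):
    """Apply a random number of typos to one token, within the limits above."""
    target = rng.uniform(0.0, MAX_FRACTION)  # target augmentation percentage
    # Rounding down keeps the typo count at or below 25% of the token length,
    # which also means tokens shorter than 4 characters are left unchanged.
    n_typos = min(MAX_TYPOS, int(target * len(token)))
    chars = list(token)
    for _ in range(n_typos):
        rng.choice(OPS)(chars, rng)
    return "".join(chars)


def augment_text(text, seed=None):
    """Split on whitespace, augment each token, and join with single spaces."""
    rng = random.Random(seed)
    return " ".join(augment_token(tok, rng) for tok in text.split())


if __name__ == "__main__":
    print(augment_text("adversarial classification example", seed=0))
```

Passing a seed makes the output reproducible, which helps when comparing models trained on the same augmented data. To adapt this to a specific use case, the OPS tuple is the place to add operations such as keyboard-neighbor substitutions or homoglyph swaps.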