RETVec: Resilient & Efficient Text Vectorizer

Overview

RETVec is a next-gen text vectorizer designed to be efficient and multilingual, and to provide built-in adversarial resilience using robust word embeddings trained with similarity learning. You can read the paper here.

RETVec is trained to be resilient against character-level manipulations including insertion, deletion, typos, homoglyphs, LEET substitution, and more. The RETVec model is trained on top of a novel character encoder which can encode all UTF-8 characters and words efficiently. Thus, RETVec works out-of-the-box on over 100 languages without the need for a lookup table or fixed vocabulary size. Furthermore, RETVec is a layer, which means that it can be inserted into any TF model without the need for a separate pre-processing step.
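
To make these manipulations concrete, here is a small illustrative sketch in plain Python. It is not part of RETVec: the substitution tables, helper names, and toy vocabulary are all assumptions chosen for illustration. It shows how LEET substitution, homoglyphs, and deletion produce strings that an exact-match, fixed-vocabulary lookup no longer recognizes, which is the failure mode RETVec is designed to resist.

```python
# Illustrative only: these tables and helpers are assumptions, not RETVec code.
LEET = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"})
# Latin letters mapped to visually similar Cyrillic letters (homoglyphs).
HOMOGLYPHS = str.maketrans({"a": "\u0430", "e": "\u0435", "o": "\u043e"})


def leet(word: str) -> str:
    """Replace some letters with look-alike digits."""
    return word.translate(LEET)


def homoglyph(word: str) -> str:
    """Replace some Latin letters with look-alike Cyrillic letters."""
    return word.translate(HOMOGLYPHS)


def delete_char(word: str, index: int) -> str:
    """Drop the character at the given position."""
    return word[:index] + word[index + 1:]


# A toy fixed vocabulary, as used by lookup-table tokenizers.
vocabulary = {"free", "password", "account"}

for variant in (leet("password"), homoglyph("account"), delete_char("free", 1)):
    print(repr(variant), "in vocabulary:", variant in vocabulary)
```

Each manipulated variant misses the vocabulary even though a human reader would still recognize the word.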

RETVec's speed and size (~200k instead of millions of parameters) also make it a great choice for on-device and web use cases. It is natively supported in TensorFlow Lite via custom ops in TensorFlow Text, and we provide a JavaScript implementation of RETVec which allows you to deploy web models via TensorFlow.js.

Please see our example colabs on how to get started with training your own models with RETVec. train_retvec_model_tf.ipynb is a great starting point for training a TF model using RETVec.

Demos

To see RETVec in action, visit our demos.

Getting started

Installation

You can use pip to install the latest TensorFlow version of RETVec:

pip install retvec

RETVec has been tested on TensorFlow 2.6+ and Python 3.8+.

Basic Usage

You can use RETVec as the vectorization layer in any TensorFlow model with just a single line of code. RETVec operates on raw strings with pre-processing options built-in (e.g. lowercasing text). For example:

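import tensorflow as tf
from tensorflow.keras import layers
from retvec.tf import RETVecTokenizer

NUM_CLASSES = 2  # placeholder: set to the number of classes in your task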

# Define the input layer, which accepts raw strings
inputs = layers.Input(shape=(1, ), name="input", dtype=tf.string)

# Add the RETVec Tokenizer layer using the RETVec embedding model -- that's it!
x = RETVecTokenizer(sequence_length=128)(inputs)

# Create your model like normal
# e.g. a simple LSTM model for classification with NUM_CLASSES classes
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64))(x)
outputs = layers.Dense(NUM_CLASSES, activation='softmax')(x)
model = tf.keras.Model(inputs, outputs)

Then you can compile, train, and save your model as usual! As demonstrated in our paper, models trained using RETVec are more resilient against adversarial attacks and typos, and are also computationally efficient. RETVec also offers support in TFJS and TF Lite, making it perfect for on-device mobile and web use cases.

Colabs

Detailed example colabs for RETVec can be found under notebooks. These are a good way to get started with RETVec. You can run the notebooks in Google Colab by clicking the Google Colab button. If none of the examples is similar to your use case, please let us know!

We have the following example colabs:

  • Training RETVec-based models using TensorFlow: train_retvec_model_tf.ipynb for GPU/CPU training, and train_tpu.ipynb for a TPU-compatible training example.
  • Converting RETVec models into TF Lite models to run on-device: tf_lite_retvec.ipynb
  • (Coming soon!) Using RETVec JS to deploy RETVec models on the web using TensorFlow.js

Citing

Please cite this reference if you use RETVec in your research:

@article{retvec2023,
    title={RETVec: Resilient and Efficient Text Vectorizer},
    author={Elie Bursztein and Marina Zhang and Owen Vallis and Xinyu Jia and Alexey Kurakin},
    year={2023},
    eprint={2302.09207}
}

Contributing

To contribute to the project, please check out the contribution guidelines. Thank you!

Disclaimer

This is not an official Google product.

retvec's Issues

Update README

Create a comprehensive README with installation instructions and links to demo colabs, paper, etc.

ValueError raised upon calling RETVecTokenizer

Hi!

I am trying to use RETVec as an embedding layer for an email spam classification project. When RETVecTokenizer is called, I get a ValueError, which is described below:

File format not supported: filepath=/root/.keras/retvec-v1. Keras 3 only supports V3 .keras files and legacy H5 format files (.h5 extension). Note that the legacy SavedModel format is not supported by load_model() in Keras 3. In order to reload a TensorFlow SavedModel as an inference-only layer in Keras 3, use keras.layers.TFSMLayer(/root/.keras/retvec-v1, call_endpoint='serving_default') (note that your call_endpoint might have a different name).

Here is the code I wrote:

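inputs = layers.Input(shape=(1, ), name="token", dtype=tf.string)
# The tokenizer call is missing from the snippet as posted; default arguments are assumed here.
x = RETVecTokenizer()(inputs)

lstm_1 = tf.keras.layers.LSTM(20, dropout=0.2, return_sequences=True)(x)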
lstm_2 = tf.keras.layers.LSTM(20, dropout=0.2, return_sequences=True)(lstm_1)
flatten = tf.keras.layers.Flatten()(lstm_2)
dropout = tf.keras.layers.Dropout(0.2, name="dropout")(flatten)
dense = tf.keras.layers.Dense(1, activation="sigmoid")(dropout)

model = tf.keras.Model(inputs=[inputs], outputs=[dense])

How to resolve this?
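
The error message indicates that Keras 3 is being asked to load the RETVec SavedModel, which load_model() does not support in Keras 3. One possible workaround, offered as an assumption rather than a confirmed fix for this report, is to make tf.keras resolve to Keras 2 by installing the tf-keras package and setting the TF_USE_LEGACY_KERAS environment variable before TensorFlow is first imported. The helper name below is hypothetical.

```python
import os
import sys


def use_legacy_keras() -> bool:
    """Ask TensorFlow to back tf.keras with Keras 2 (the tf-keras package).

    The flag is only read when TensorFlow is first imported, so this returns
    False without changing anything if TensorFlow is already loaded.
    """
    if "tensorflow" in sys.modules:
        return False
    os.environ["TF_USE_LEGACY_KERAS"] = "1"
    return True


use_legacy_keras()
# import tensorflow as tf  # import TensorFlow only after the flag is set
```

This assumes `pip install tf-keras` has been run in the same environment. Whether it resolves the error for a given RETVec and TensorFlow version combination would need to be checked.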

Android demo app for emotion classification.

I attempted to save the tflite_retvec model as a TF Lite model for deployment on an Android device, but I am having difficulty running it. Could you please share an Android demo app so that I can understand how it performs on Android devices?

Why is val_acc only 0.51?

In your notebook, why is val_acc only 0.51? That does not seem good.

I applied the model to Chinese sentiment classification data, and the performance also seems poor. I don't know why.

GPU memory abnormal increase

When using the RETVec model as the embedding method, GPU memory usage increases abnormally, up to 40 GB. Versions:
tensorflow 2.15.0
keras 2.15.0
tf-keras 2.15.0
retvec 1.0.1

How the model is used:

import tensorflow as tf
from tensorflow.keras import layers
from retvec.tf import RETVecTokenizer

def get_retvec_tokenizer(model_path):
    with tf.device('/CPU:0'):
        inputs = layers.Input(shape=(1,), dtype=tf.string)
        outputs = RETVecTokenizer(model=model_path)(inputs)
        retvec = tf.keras.Model(inputs=inputs, outputs=outputs)

    return retvec

Memory state after initializing the model:
image

Memory state during inference:
image
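
One thing worth ruling out, stated as an assumption rather than a diagnosis of this report: by default TensorFlow reserves nearly all of the memory on every visible GPU as soon as the GPU is initialized, regardless of how much the model needs, so monitoring tools can show a very large footprint even when the layer itself is placed on the CPU. A minimal sketch of switching to on-demand allocation through the TF_FORCE_GPU_ALLOW_GROWTH environment variable follows; `tf.config.experimental.set_memory_growth` is the in-code equivalent.

```python
import os

# Must be set before TensorFlow initializes the GPU, i.e. before the first import.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# import tensorflow as tf  # import TensorFlow only after the variable is set
```

If memory still climbs to the same level with on-demand allocation enabled, the growth comes from actual allocations and not from the default reservation.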

Is the code to generate the augmented data available anywhere?

In the paper, the authors write

Augmentations Token augmentation consists of randomly inserting up to 4 typos per token up to 25% of the token length. This is consistent with an observed maximum human error frequency of around 20% [11]. We use 22 distinct typo augmentations, which can be grouped into four categories: deletion, insertion, substitution, and transposition. For each token, we randomly select a target augmentation percentage between 0-25%, and for each augmentation step we randomly apply an augmentation from one of the four typo categories. The full list of augmentations used is reported in Appendix D.

Is the code to apply these augmentations available anywhere? I'd like to use & adapt it for my specific use-case.
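
For readers who want a starting point, the quoted description can be sketched in plain Python. This is not the authors' implementation: the function name, parameters, and the choice of lowercase ASCII letters for inserted or substituted characters are assumptions, and it covers only the four typo categories, not the 22 specific augmentations listed in Appendix D.

```python
import random
import string

CATEGORIES = ("deletion", "insertion", "substitution", "transposition")


def augment_token(token: str, rng: random.Random,
                  max_pct: float = 0.25, max_typos: int = 4) -> str:
    """Apply random typos to one token, loosely following the quoted passage."""
    if len(token) < 2:
        return token
    # Pick a target augmentation percentage in [0, max_pct], capped at max_typos.
    target_pct = rng.uniform(0.0, max_pct)
    num_typos = min(max_typos, int(len(token) * target_pct))
    chars = list(token)
    for _ in range(num_typos):
        op = rng.choice(CATEGORIES)
        if op == "deletion" and len(chars) > 1:
            del chars[rng.randrange(len(chars))]
        elif op == "insertion":
            chars.insert(rng.randrange(len(chars) + 1),
                         rng.choice(string.ascii_lowercase))
        elif op == "substitution":
            chars[rng.randrange(len(chars))] = rng.choice(string.ascii_lowercase)
        elif op == "transposition" and len(chars) > 1:
            # Swap two adjacent characters.
            i = rng.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


rng = random.Random(0)  # seeded so runs are repeatable
print([augment_token(w, rng) for w in "resilient text vectorizer".split()])
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible and makes it easy to swap in a different character pool for non-Latin scripts.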

Separate code for better maintainability

  • Keep only what is needed for inference under tf/.
  • Move training code, layers, and utils under training/.
  • Create a tokenizer class (we can do this later) for use without layers, and move the relevant code from the layer into that class.
