BERT Subword Tokenizer for Machine Translation

This repository implements a wrapper for generating a WordPiece vocabulary and a BERT tokenizer model from a dataset using the tensorflow-text package. The tokenizers generated with this wrapper script are used in the research article: Power Law Graph Transformer for Machine Translation and Representation Learning.

A detailed explanation of subword tokenizers and WordPiece vocabulary generation can be found at Subword Tokenizers @ tensorflow.org.
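
To make the idea of subword tokenization concrete, the following is a minimal pure-Python sketch of greedy longest-match-first splitting against a WordPiece-style vocabulary. It is an illustration only, not the tensorflow-text implementation: the function name `wordpiece_tokenize`, the toy vocabulary, and the `##` continuation prefix are assumptions chosen for the example.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", suffix_prefix="##"):
    """Split one word into subword pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the continuation prefix.
                candidate = suffix_prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; a generated vocabulary would be far larger.
toy_vocab = {"[PAD]", "[UNK]", "[START]", "[END]",
             "token", "##izer", "##s", "trans", "##late"}

print(wordpiece_tokenize("tokenizers", toy_vocab))
print(wordpiece_tokenize("translate", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

A word that cannot be covered by vocabulary pieces maps to the unknown token, which is why `[UNK]` appears among the reserved tokens in the sample run.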

Key features

  • Generates a WordPiece vocabulary and a BERT tokenizer from a tensorflow dataset for machine translation.
  • Simple interface that takes in all the arguments and generates the vocabulary and tokenizer model.

Sample Run:

The sample run generates a vocabulary and tokenizer model for the PT-EN machine translation task from the tensorflow dataset ted_hrlr_translate/pt_to_en.

Initialize the parameters for the BERT vocabulary generator and tokenizer:

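import make_vocab_tokenizer as mvt

# Special tokens reserved in the vocabulary.
reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]
# Options for the BERT tokenizer; lower_case=True lowercases text before tokenizing.
bert_tokenizer_params = {"lower_case": True}
# Arguments for WordPiece vocabulary generation.
bert_vocab_args = {
    "vocab_size": 15000,
    "reserved_tokens": reserved_tokens,
    "bert_tokenizer_params": bert_tokenizer_params,
    "learn_params": {},
}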

Generate the vocabulary and tokenizer model:

make_vocab_tok = mvt.bert_src_tgt_tokenizer(
    src_lang='pt',
    tgt_lang='en',
    BATCH_SIZE=1024,
    dataset_file='ted_hrlr_translate/pt_to_en',
    train_percent=None,
    src_vocab_path="./ted_hrlr_translate_pt_vocab.txt",
    tgt_vocab_path="./ted_hrlr_translate_en_vocab.txt",
    model_name="./ted_hrlr_translate_pt_en_tokenizer",
    load_tokenizer_model=False,
    make_tokenizer=True,
    bert_tokenizer_params=bert_tokenizer_params,
    reserved_tokens=reserved_tokens,
    bert_vocab_args=bert_vocab_args
)
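
After a run like the one above, the vocabulary files can be inspected directly. The sketch below assumes the files are plain text with one token per line and the reserved tokens listed first, as in the tensorflow.org tutorial; the README does not state the file layout, so treat this as an assumption. The helper `load_vocab` and the toy file are hypothetical stand-ins, so the example runs without the generated files.

```python
import os
import tempfile


def load_vocab(path):
    """Read a vocabulary file (one token per line) into a token -> id mapping."""
    with open(path, encoding="utf-8") as f:
        tokens = [line.rstrip("\n") for line in f]
    return {token: idx for idx, token in enumerate(tokens)}


# Toy stand-in for a generated file such as ./ted_hrlr_translate_en_vocab.txt
# (assumed layout: one token per line, reserved tokens first).
toy_tokens = ["[PAD]", "[UNK]", "[START]", "[END]", "the", "##s", "token"]

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "toy_vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(toy_tokens) + "\n")
    vocab = load_vocab(vocab_path)

# Look up the ids assigned to the reserved tokens.
reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]
reserved_ids = [vocab[token] for token in reserved_tokens]
print(reserved_ids)
```

Checking the ids of the reserved tokens is a quick sanity test before training, since padding and start/end handling in a translation model typically depend on them.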
