captoken's Introduction

Tokenizers typically encode a capitalized word differently from its lowercased form ("Hello" does not share its representation with "hello"). This introduces a fair amount of redundancy into a tokenizer's vocabulary.
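To make the redundancy concrete, here is a minimal sketch that groups vocabulary entries differing only in case. The vocabulary is a made-up toy list, not the vocabulary of any real tokenizer, and `case_redundant_groups` is a hypothetical helper introduced only for illustration.

```python
from collections import defaultdict


def case_redundant_groups(vocab):
    """Group tokens that become identical once lowercased.

    Hypothetical helper: returns only the groups with more than one member,
    i.e. the tokens that differ from each other purely by capitalization.
    """
    groups = defaultdict(set)
    for token in vocab:
        groups[token.lower()].add(token)
    return {key: members for key, members in groups.items() if len(members) > 1}


# Toy vocabulary (an assumption for illustration, not a real tokenizer's vocab).
toy_vocab = ["hello", "Hello", "HELLO", "world", "World", "the", "cat"]

groups = case_redundant_groups(toy_vocab)

# Each group needs only one lowercased entry; every extra member is redundant.
redundant = sum(len(members) - 1 for members in groups.values())
print(groups)
print("redundant entries:", redundant)
```

In this toy list, three of the seven entries exist only to encode a different capitalization of a word that is already present.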

At its core, a language model is simply a model of humans pressing a keyboard. What if we took that literally, and "placed in" the SHIFT and Caps Lock button presses needed to produce text? That way, capitalization could be represented separately from the words themselves.

We do this with four specialized tokens, <shift>, <capss>, <capse> and <bksp> (each always tokenized as a single unit, never split), which absorb this syntactic information and leave all the other text lowercased.
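The idea can be sketched in a few lines of Python. This is an illustrative approximation, not the repository's actual implementation: it assumes that a single uppercase letter is marked with <shift>, that a run of two or more uppercase letters is wrapped in <capss> ... <capse>, and that the input is ASCII and does not already contain these marker strings. The <bksp> token is left out because its exact role is not described here. The function names `encode` and `decode` are hypothetical.

```python
import re

SHIFT, CAPS_START, CAPS_END = "<shift>", "<capss>", "<capse>"


def encode(text):
    """Lowercase the text, recording capitalization as marker tokens.

    Assumed rules (a sketch, not the library's confirmed behavior):
    runs of 2+ uppercase letters are wrapped in <capss>...<capse>,
    and any remaining single uppercase letter is prefixed with <shift>.
    """
    # Handle Caps Lock style runs first so their letters are already lowercase.
    text = re.sub(
        r"[A-Z]{2,}",
        lambda m: CAPS_START + m.group().lower() + CAPS_END,
        text,
    )
    # Any uppercase letter still left is a single SHIFT press.
    return re.sub(r"[A-Z]", lambda m: SHIFT + m.group().lower(), text)


def decode(encoded):
    """Invert encode(): replay the marker tokens to restore capitalization."""
    encoded = re.sub(
        re.escape(CAPS_START) + r"(.*?)" + re.escape(CAPS_END),
        lambda m: m.group(1).upper(),
        encoded,
        flags=re.S,
    )
    return re.sub(
        re.escape(SHIFT) + r"(.)",
        lambda m: m.group(1).upper(),
        encoded,
        flags=re.S,
    )


sample = "Hello NASA, meet the iPhone"
encoded = encode(sample)
print(encoded)
print(decode(encoded) == sample)
```

With this scheme, "Hello" and "hello" share the same lowercased characters, and the only difference between them is one leading <shift> marker.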

Current Status

There are four self-contained notebooks in this repository, inside the notebooks directory:

  1. Intro.ipynb - This should be the first notebook examined: it contains a fully worked example using captoken.
  2. GPT2 Tokenizer.ipynb - This contains baseline experiments on the GPT-2 tokenizer, to quantify the number of redundant tokens.
  3. Train Tokenizer - This trains the two 16k-vocabulary SentencePiece tokenizers on the Wikipedia dataset.
  4. Newsgroups.ipynb - This notebook (and its Caps variant) runs the experiments and produces the plots using the two trained tokenizers.

captoken's People

Contributors

irhum
