Tokenizers treat capitalized and lowercased versions of a word as distinct tokens ("Hello" does not share its representation with "hello"). This results in a fair amount of redundancy in the tokenizer's vocabulary.
At its core, a language model is simply a model of humans pressing a keyboard. What if we took that literally, and "placed in" the SHIFT and Caps Lock button presses needed to produce the text? That way, capitalization could be represented separately from the words themselves.
We do this with four specialized tokens, `<shift>`, `<capss>`, `<capse>` and `<bksp>` (which are always tokenized whole, never split), that absorb this syntactic information, leaving all other text lowercased.
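For illustration, here is a minimal sketch of what such an encoding step might look like. The token names follow the README, but the exact rules (including when `<bksp>` is emitted) are defined in the notebooks, so treat this as a toy version rather than the actual implementation:

```python
SHIFT, CAPS_S, CAPS_E = "<shift>", "<capss>", "<capse>"

def encode_caps(text: str) -> str:
    """Lowercase text, inserting keyboard-style markers for capitalization.

    Fully upper-case words are wrapped in <capss>...<capse> (Caps Lock on/off),
    and words with only a leading capital are prefixed with <shift>.
    """
    out = []
    for word in text.split():
        if len(word) > 1 and word.isupper():
            out.append(f"{CAPS_S}{word.lower()}{CAPS_E}")
        elif word[:1].isupper():
            out.append(f"{SHIFT}{word.lower()}")
        else:
            out.append(word)
    return " ".join(out)

print(encode_caps("Hello WORLD hello again"))
# -> <shift>hello <capss>world<capse> hello again
```

Decoding simply replays the "key presses": the markers are removed and the capitalization they describe is re-applied.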
There are four self-contained notebooks in this repository, inside `notebooks/`:

- `Intro.ipynb` - This should be the first notebook examined: it contains a fully worked example using `captoken`.
- `GPT2 Tokenizer.ipynb` - contains baseline experiments on the GPT-2 tokenizer, to quantify the number of redundant tokens (see the first sketch after this list).
- `Train Tokenizer` - This trains the two 16k-vocabulary SentencePiece tokenizers on the Wikipedia dataset (see the SentencePiece sketch below).
- `Newsgroups.ipynb` - This (and the `Caps` variant) performs the experiments and plots using the two trained tokenizers.
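As a rough illustration of the redundancy measured in `GPT2 Tokenizer.ipynb` (not necessarily the notebook's exact methodology), one can count vocabulary entries whose lower-cased form is also a distinct entry:

```python
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
vocab = tok.get_vocab()  # maps token string -> id

def lowered(t: str) -> str:
    # GPT-2's byte-level BPE marks a leading space with 'Ġ'; keep that
    # marker as-is and lowercase only the rest of the token.
    return ("Ġ" + t[1:].lower()) if t.startswith("Ġ") else t.lower()

redundant = [t for t in vocab if lowered(t) != t and lowered(t) in vocab]
print(f"{len(redundant)} of {len(vocab)} tokens have a lower-cased twin")
```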
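And here is a minimal sketch of how a 16k-vocabulary SentencePiece tokenizer with these user-defined symbols could be trained; the file names and trainer options are illustrative assumptions, not necessarily those used in `Train Tokenizer`:

```python
import sentencepiece as spm

# Hypothetical corpus file: one lower-cased, caps-token-annotated line per article.
spm.SentencePieceTrainer.train(
    input="wiki_processed.txt",          # assumed pre-processed Wikipedia dump
    model_prefix="wiki_caps_16k",
    vocab_size=16000,
    user_defined_symbols=["<shift>", "<capss>", "<capse>", "<bksp>"],  # kept whole, never split
)

# The resulting model can then be loaded for encoding:
sp = spm.SentencePieceProcessor(model_file="wiki_caps_16k.model")
```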