For all I can tell, this will likely use the sentencepiece defaults (i.e., 8000 tokens

Vocabulary size when training the tokenizer (lit-llama issue, 4 comments, closed)

kno10 commented on May 13, 2024

Vocabulary size when training the tokenizer

from lit-llama.

Comments (4)

lantiga commented on May 13, 2024

Good point. In the tests we load the tokenizer from the checkpoint, but one should be able to set it from there as well.

Thanks, doing that right now.

awaelchli commented on May 13, 2024

Yes, we could add that parameter. Note that we haven't actually used this train() function for LLaMA; it is only there for the train.py skeleton script, which for now runs on Shakespeare, and tokenizer.train() is applied to that data. Just an FYI, but the suggestion sounds good.
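For readers following along, here is a minimal sketch of what a train() wrapper with an explicit vocabulary size might look like. It is not the actual lit-llama code: the helper name `build_trainer_args`, the `tokenizer` model prefix, and the default of 32000 are assumptions made for illustration. The only library call used is `SentencePieceTrainer.train` with its `input`, `model_prefix`, and `vocab_size` options.

```python
from pathlib import Path


def build_trainer_args(input_path: str, destination: str, vocab_size: int = 32000) -> dict:
    """Assemble keyword arguments for SentencePieceTrainer.train.

    Hypothetical helper; the default of 32000 is an assumption for this sketch.
    """
    if vocab_size <= 0:
        raise ValueError("vocab_size must be positive")
    return {
        "input": input_path,
        # Assumed naming: model files are written as <destination>/tokenizer.*
        "model_prefix": str(Path(destination) / "tokenizer"),
        # Passing vocab_size explicitly avoids falling back to the library default.
        "vocab_size": vocab_size,
    }


def train(input_path: str, destination: str, vocab_size: int = 32000) -> None:
    """Train a SentencePiece model with an explicit vocabulary size."""
    # Imported lazily so build_trainer_args works without sentencepiece installed.
    from sentencepiece import SentencePieceTrainer

    SentencePieceTrainer.train(**build_trainer_args(input_path, destination, vocab_size))
```

For a small corpus such as the Shakespeare demo, a caller would pass a much lower value, for example `train("input.txt", "out", vocab_size=100)` (the paths and the number here are placeholders, not values taken from the repository).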

kno10 commented on May 13, 2024

Fair enough. I'd assume it makes more sense to pre-train the tokenizer independently anyway (on a manually chosen sample of suitable size). It may be fine to use a size that suits demo examples such as the Shakespeare one.

lantiga commented on May 13, 2024

Done! Yes, for Shakespeare I set it to a lower number explicitly. Thanks for opening the issue, @kno10.
