
llama-tokenizer-js's People

Contributors

belladoreai, imoneoi


llama-tokenizer-js's Issues

Discussion: using this with sqlite-vss

https://github.com/asg017/sqlite-vss is a bring-your-own-vectors database; it is compatible with any embedding or vector data you have. Consider using OpenAI's Embeddings API, HuggingFace's Inference API, sentence-transformers, or any of these open source models. You can insert vectors into vss0 tables as JSON or raw bytes.

So I just need to split a long article into 1000-word chunks, pass them to llama-tokenizer-js, then store the result in sqlite-vss? Later I can pass query words to llama-tokenizer-js and use the result as a vector to search in sqlite-vss?

(Can I treat tokens as a vector that can be used in the vector DB?)
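Note that a tokenizer yields integer token ids, which is what you count to size a chunk; the vectors sqlite-vss searches come from an embedding model such as those listed above. The chunking step can be sketched like this (the `countTokens` parameter and the 4-characters-per-token `approxCount` are illustrative stand-ins; in practice something like `llamaTokenizer.encode(text).length` would supply the count):

```javascript
// Split text into chunks that stay under a token budget, so each chunk
// fits an embedding model's input window. countTokens is pluggable.
function chunkByTokenBudget(words, countTokens, budget) {
  const chunks = [];
  let current = [];
  for (const word of words) {
    const candidate = [...current, word].join(' ');
    if (current.length > 0 && countTokens(candidate) > budget) {
      // Adding this word would exceed the budget: close the chunk.
      chunks.push(current.join(' '));
      current = [word];
    } else {
      current.push(word);
    }
  }
  if (current.length > 0) chunks.push(current.join(' '));
  return chunks;
}

// Toy token counter for illustration: roughly one token per 4 characters.
const approxCount = (s) => Math.ceil(s.length / 4);
const chunks = chunkByTokenBudget('a b c d e f'.split(' '), approxCount, 2);
```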

CommonJS

Hey, it would be really awesome if we could have a CommonJS version published to npm too, to avoid having to battle CJS/ESM issues :)
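One common way to serve both module systems is Node's conditional `exports` map in package.json; a sketch (the `.cjs` filename is hypothetical; `llama-tokenizer.js` is the existing ESM entry point):

```json
{
  "name": "llama-tokenizer-js",
  "type": "module",
  "main": "llama-tokenizer.cjs",
  "exports": {
    "import": "./llama-tokenizer.js",
    "require": "./llama-tokenizer.cjs"
  }
}
```

With this shape, `import` resolves to the ESM file and `require()` to the CommonJS build, so consumers on either side never see the other format.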

Typescript support

Importing it in a typescript file gives

TSError: ⨯ Unable to compile TypeScript: server.ts:5:33 - error TS7016: Could not find a declaration file for module 'llama-tokenizer-js'. '.../node_modules/llama-tokenizer-js/llama-tokenizer.js' implicitly has an 'any' type. Try `npm i --save-dev @types/llama-tokenizer-js` if it exists or add a new declaration (.d.ts) file containing `declare module 'llama-tokenizer-js';`

@types/llama-tokenizer-js doesn't exist
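Until the package ships its own types, a local ambient declaration works, as the error message suggests; a sketch (the `encode`/`decode` signatures are assumptions based on the package's documented usage, not taken from its source):

```typescript
// llama-tokenizer-js.d.ts -- minimal ambient declaration (sketch).
// Signatures below are assumed, not confirmed by the package.
declare module 'llama-tokenizer-js' {
  const llamaTokenizer: {
    encode(text: string): number[];
    decode(tokenIds: number[]): string;
  };
  export default llamaTokenizer;
}
```

Placing this file anywhere covered by the project's `tsconfig.json` `include` paths silences TS7016.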

Code Llama support

Summary

It would be great to support Code Llama tokenization.

Context

I know that the Code Llama model is based on Llama 2, but I haven't found any clear evidence that they share the same tokenizer. If the current implementation already supports Code Llama tokenization, it would be great to state this clearly in the README.

`performance` not available

The `performance` global is not always available, such as in the Next.js environment. I created a fix that only uses it when it is available.

#7
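A guard along the lines the fix describes might look like this (a sketch; the actual change is in the linked PR and may differ):

```javascript
// Fall back to Date.now() when the performance global is missing,
// so optional timing code never throws in restricted environments.
const now = (typeof performance !== 'undefined' && performance.now)
  ? () => performance.now()
  : () => Date.now();

const t0 = now();
// ... run tokenization ...
const elapsedMs = now() - t0;
```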

Contributions

Hi there,

There are new models coming out every day that I'd like to test, and this tool is quite useful for making sure the instruction and response remain within the model's context window.

However, to extend your work and contribute changes/additions back instead of duplicating it, the following part of the code needs to be explained:

    1. base64-encoded vocabulary
    2. base64-encoded merge binary

How do you compose the base64 vocab and merge binary? It's the only opaque part of this project; everything else looks good and easy to work with.

If you document that part, I can generate those base64 strings for whatever model I happen to be working with and create a pull request for each. You'd have to add a way to select a model via some config, e.g. `config.modelName`.

Thanks
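The base64 step itself can at least be sketched generically: in Node, `Buffer` converts arbitrary text or binary data to and from base64. (The vocab contents below are hypothetical; how the real vocab and merge data are extracted from the model is exactly what this issue asks to have documented.)

```javascript
// Hypothetical vocab dump -- stands in for real tokenizer data.
const vocabText = 'token_a\ntoken_b\ntoken_c';

// Encode to the kind of base64 string that gets inlined into the source.
const vocabBase64 = Buffer.from(vocabText, 'utf-8').toString('base64');

// Decoding back at load time recovers the original bytes.
const roundTripped = Buffer.from(vocabBase64, 'base64').toString('utf-8');
```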

Colors

Hi

Cool stuff and thanks for sharing!

I can get the tokens in an array and get the length of that array to find out the number of tokens.

But how do I know which token corresponds to which characters, so I can color them the way you do in the demo? Just wondering if that's secret functionality that isn't part of this open source package...

Thanks,
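One way to recover per-token text for coloring is to decode each token id individually. A sketch, with a toy vocabulary standing in for the package's real `decode` (note this per-id approach may be inexact for byte-level/SentencePiece pieces that only form valid text in combination):

```javascript
// Map each token id to its own decoded substring, ready for coloring.
function tokenSpans(ids, decode) {
  return ids.map((id) => ({ id, text: decode([id]) }));
}

// Toy decode for illustration only -- not the package's vocabulary.
const toyVocab = { 1: 'Hel', 2: 'lo', 3: '!' };
const toyDecode = (ids) => ids.map((id) => toyVocab[id]).join('');

const spans = tokenSpans([1, 2, 3], toyDecode);
// Each span can then be wrapped in a colored element in the UI.
```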

Integration into 🤗 Transformers.js (+ generalization/minor improvements)

First of all, I just wanted to thank you for this implementation! It is significantly faster than my original version of the Llama tokenizer (which would often just break for extremely large input texts). I've decided to generalize your implementation so that it can be applied to other BPE-based tokenizers (e.g., GPT-4). Here's my current PR for it to be added to Transformers.js.

Here's an example playground for some other tokenizers: demo.

I also made a few fixes/improvements to your version:

  1. Caching system (only noticeable for non-Llama tokenizers; up to 2x speedup for GPT-4).
  2. Instead of your `x * 100000 + leftNode.origPos` "trick", I add a fractional "positional bias" to the score, which allows tiebreaks to be resolved correctly. With the original trick, if the input text is longer than 100k characters (so `origPos` exceeds 100000) and, in some rare situation, the difference in `bpe_ranks` is small, the result could be incorrect.
  3. I modified the algorithm to operate with tokens (instead of token ids), mainly to allow interoperability with the current implementation.
  4. Fixed `top` not being defined in `this._heap[top] = value;`.
  5. Documentation improvements.

Feel free to use any of this in your version too!
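The fractional positional bias from point 2 can be sketched like this (the names are illustrative, not the actual Transformers.js code):

```javascript
// Add a fraction strictly less than 1 to the rank, so the positional
// tiebreak can never outweigh a difference of 1 in rank itself.
function scoreWithBias(rank, pos, textLength) {
  return rank + pos / textLength;
}

// With the multiplicative trick, a position > 100000 could flip the
// ordering of two ranks; the fractional bias cannot.
const textLength = 200000;
const a = scoreWithBias(5, 150000, textLength); // lower rank, late position
const b = scoreWithBias(6, 0, textLength);      // higher rank, early position
// a = 5.75 < b = 6, so the lower rank still wins.
```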

Convert Tokenizer to a Class

Love this project, solid work! It would be better DX and a more maintainable codebase if `LlamaTokenizer` were a class though! 🙏
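A class wrapper of the kind being requested might look like this (a sketch over hypothetical free functions, not the package's real internals):

```javascript
// Wrap standalone encode/decode functions behind a class interface,
// so state (e.g. a chosen model's vocab) could later live on instances.
class LlamaTokenizer {
  constructor(encodeFn, decodeFn) {
    this.encodeFn = encodeFn;
    this.decodeFn = decodeFn;
  }
  encode(text) { return this.encodeFn(text); }
  decode(ids) { return this.decodeFn(ids); }
}

// Toy functions for illustration only.
const tok = new LlamaTokenizer(
  (text) => text.split(' ').map((w) => w.length),
  (ids) => ids.join(' ')
);
```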
