
llama-tokenizer-js's People

Contributors

belladoreai, imoneoi


llama-tokenizer-js's Issues

Discussion: using this with sqlite-vss

https://github.com/asg017/sqlite-vss is a bring-your-own-vectors database; it is compatible with any embedding or vector data you have. Consider using OpenAI's Embeddings API, HuggingFace's Inference API, sentence-transformers, or any of these open source models. You can insert vectors into vss0 tables as JSON or raw bytes.

So I just need to split a long article into 1000-word chunks, pass them to llama-tokenizer-js, then store the result in sqlite-vss? Later I can pass query words to llama-tokenizer-js and use the result as a vector to search in sqlite-vss?

(Can I treat tokens as a vector that can be used in the vector DB?)
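Note that a tokenizer yields integer token ids, which is what you count to size a chunk; the vectors sqlite-vss searches come from an embedding model such as those listed above. The chunking step can be sketched like this (the `countTokens` parameter and the 4-characters-per-token `approxCount` are illustrative stand-ins; in practice something like `llamaTokenizer.encode(text).length` would supply the count):

```javascript
// Split text into chunks that stay under a token budget, so each chunk
// fits an embedding model's input window. countTokens is pluggable.
function chunkByTokenBudget(words, countTokens, budget) {
  const chunks = [];
  let current = [];
  for (const word of words) {
    const candidate = [...current, word].join(' ');
    if (current.length > 0 && countTokens(candidate) > budget) {
      // Adding this word would exceed the budget: close the chunk.
      chunks.push(current.join(' '));
      current = [word];
    } else {
      current.push(word);
    }
  }
  if (current.length > 0) chunks.push(current.join(' '));
  return chunks;
}

// Toy token counter for illustration: roughly one token per 4 characters.
const approxCount = (s) => Math.ceil(s.length / 4);
const chunks = chunkByTokenBudget('a b c d e f'.split(' '), approxCount, 2);
```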

CommonJS

Hey, it would be really awesome if we could have a CommonJS version published to npm too, to avoid having to battle CJS/ESM issues :)
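One common way to serve both module systems is Node's conditional `exports` map in package.json; a sketch (the `.cjs` filename is hypothetical; `llama-tokenizer.js` is the existing ESM entry point):

```json
{
  "name": "llama-tokenizer-js",
  "type": "module",
  "main": "llama-tokenizer.cjs",
  "exports": {
    "import": "./llama-tokenizer.js",
    "require": "./llama-tokenizer.cjs"
  }
}
```

With this shape, `import` resolves to the ESM file and `require()` to the CommonJS build, so consumers on either side never see the other format.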

Typescript support

Importing it in a typescript file gives

TSError: ⨯ Unable to compile TypeScript: server.ts:5:33 - error TS7016: Could not find a declaration file for module 'llama-tokenizer-js'. '.../node_modules/llama-tokenizer-js/llama-tokenizer.js' implicitly has an 'any' type. Try `npm i --save-dev @types/llama-tokenizer-js` if it exists or add a new declaration (.d.ts) file containing `declare module 'llama-tokenizer-js';`

@types/llama-tokenizer-js doesn't exist
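Until the package ships its own types, a local ambient declaration works, as the error message suggests; a sketch (the `encode`/`decode` signatures are assumptions based on the package's documented usage, not taken from its source):

```typescript
// llama-tokenizer-js.d.ts -- minimal ambient declaration (sketch).
// Signatures below are assumed, not confirmed by the package.
declare module 'llama-tokenizer-js' {
  const llamaTokenizer: {
    encode(text: string): number[];
    decode(tokenIds: number[]): string;
  };
  export default llamaTokenizer;
}
```

Placing this file anywhere covered by the project's `tsconfig.json` `include` paths silences TS7016.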

Code Llama support

Summary

It would be great to support Code Llama tokenization.

Context

I know that the Code Llama model is based on Llama 2, but I haven't found any clear evidence that they share the same tokenizer. If the current implementation already supports Code Llama tokenization, it would be great to state this clearly in the README.

`performance` not available

The `performance` global is not always available, such as in the Next.js environment. I created a fix that only uses it when it is available.

#7
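A guard along the lines the fix describes might look like this (a sketch; the actual change is in the linked PR and may differ):

```javascript
// Fall back to Date.now() when the performance global is missing,
// so optional timing code never throws in restricted environments.
const now = (typeof performance !== 'undefined' && performance.now)
  ? () => performance.now()
  : () => Date.now();

const t0 = now();
// ... run tokenization ...
const elapsedMs = now() - t0;
```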

Contributions

Hi there,

There are new models coming out every day that I'd like to test, and this tool is quite useful for making sure the instruction and response remain within the model's context window.

However, to extend your work and contribute changes/additions back instead of duplicating it, the following part of the code needs to be explained:

    1. base64-encoded vocabulary
    2. base64-encoded merge binary

How do you compose the base64 vocab and merge binary? It's the only opaque part of this project; everything else looks good and easy to work with.

If you document that part, I can generate those base64 strings for whatever model I happen to be working with and create a pull request for each. You'd have to add a way to select a model via some config, e.g. `config.modelName`.

Thanks
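The base64 step itself can at least be sketched generically: in Node, `Buffer` converts arbitrary text or binary data to and from base64. (The vocab contents below are hypothetical; how the real vocab and merge data are extracted from the model is exactly what this issue asks to have documented.)

```javascript
// Hypothetical vocab dump -- stands in for real tokenizer data.
const vocabText = 'token_a\ntoken_b\ntoken_c';

// Encode to the kind of base64 string that gets inlined into the source.
const vocabBase64 = Buffer.from(vocabText, 'utf-8').toString('base64');

// Decoding back at load time recovers the original bytes.
const roundTripped = Buffer.from(vocabBase64, 'base64').toString('utf-8');
```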

Colors

Hi

Cool stuff and thanks for sharing!

I can get the tokens in an array and get the length of that array to find out the number of tokens.

But how do I know which token corresponds to which characters, so I can color them the way you do in the demo? Just wondering if that's secret functionality that isn't part of this open source package...

Thanks,
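One way to recover per-token text for coloring is to decode each token id individually. A sketch, with a toy vocabulary standing in for the package's real `decode` (note this per-id approach may be inexact for byte-level/SentencePiece pieces that only form valid text in combination):

```javascript
// Map each token id to its own decoded substring, ready for coloring.
function tokenSpans(ids, decode) {
  return ids.map((id) => ({ id, text: decode([id]) }));
}

// Toy decode for illustration only -- not the package's vocabulary.
const toyVocab = { 1: 'Hel', 2: 'lo', 3: '!' };
const toyDecode = (ids) => ids.map((id) => toyVocab[id]).join('');

const spans = tokenSpans([1, 2, 3], toyDecode);
// Each span can then be wrapped in a colored element in the UI.
```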

Integration into 🤗 Transformers.js (+ generalization/minor improvements)

First of all, I just wanted to thank you for this implementation! It is significantly faster than my original version of the Llama tokenizer (which would often just break for extremely large input texts). I've decided to generalize your implementation so that it can be applied to other BPE-based tokenizers (e.g., GPT-4). Here's my current PR for it to be added to Transformers.js.

Here's an example playground for some other tokenizers: demo.

I also made a few fixes/improvements to your version:

  1. Caching system (only noticeable for non-Llama tokenizers; up to 2x speedup for GPT-4).
  2. Instead of your `x * 100000 + leftNode.origPos` "trick", I add a fractional "positional bias" to the score, which allows tiebreaks to be resolved correctly. With the original trick, if the input text is longer than 100k characters (so `origPos` exceeds 100000) and, in some rare situation, the difference in `bpe_ranks` is small, the result could be incorrect.
  3. I modified the algorithm to operate with tokens (instead of token ids), mainly to allow interoperability with the current implementation.
  4. Fixed `top` not being defined in `this._heap[top] = value;`.
  5. Documentation improvements.

Feel free to use any of this in your version too!
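The fractional positional bias from point 2 can be sketched like this (the names are illustrative, not the actual Transformers.js code):

```javascript
// Add a fraction strictly less than 1 to the rank, so the positional
// tiebreak can never outweigh a difference of 1 in rank itself.
function scoreWithBias(rank, pos, textLength) {
  return rank + pos / textLength;
}

// With the multiplicative trick, a position > 100000 could flip the
// ordering of two ranks; the fractional bias cannot.
const textLength = 200000;
const a = scoreWithBias(5, 150000, textLength); // lower rank, late position
const b = scoreWithBias(6, 0, textLength);      // higher rank, early position
// a = 5.75 < b = 6, so the lower rank still wins.
```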

Convert Tokenizer to a Class

Love this project, solid work! It would be better DX and a more maintainable codebase if `LlamaTokenizer` were a class though! 🙏
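A class wrapper of the kind being requested might look like this (a sketch over hypothetical free functions, not the package's real internals):

```javascript
// Wrap standalone encode/decode functions behind a class interface,
// so state (e.g. a chosen model's vocab) could later live on instances.
class LlamaTokenizer {
  constructor(encodeFn, decodeFn) {
    this.encodeFn = encodeFn;
    this.decodeFn = decodeFn;
  }
  encode(text) { return this.encodeFn(text); }
  decode(ids) { return this.decodeFn(ids); }
}

// Toy functions for illustration only.
const tok = new LlamaTokenizer(
  (text) => text.split(' ').map((w) => w.length),
  (ids) => ids.join(' ')
);
```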
