belladoreai / llama-tokenizer-js
JS tokenizer for LLaMA and LLaMA 2
Home Page: https://belladoreai.github.io/llama-tokenizer-js/example-demo/build/
License: MIT License
https://github.com/asg017/sqlite-vss is a bring-your-own-vectors database; it is compatible with any embedding or vector data you have. Consider using OpenAI's Embeddings API, Hugging Face's Inference API, sentence-transformers, or any of these open source models. You can insert vectors into `vss0` tables as JSON or raw bytes.
So I just need to split a long article into 1000-word chunks, pass them to llama-tokenizer-js, and then store the results in sqlite-vss? Later I can pass query words to llama-tokenizer-js and use the output as a vector to search in sqlite-vss?
(Can I regard tokens as a vector that can be used in the vector DB?)
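The first step described above, splitting a long article into 1000-word chunks before tokenizing, can be sketched as follows. This is a hypothetical helper, not part of llama-tokenizer-js:

```javascript
// Hypothetical helper (not part of llama-tokenizer-js): split text
// into chunks of at most `size` words, to be tokenized one at a time.
function chunkByWords(text, size = 1000) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let i = 0; i < words.length; i += size) {
    chunks.push(words.slice(i, i + size).join(" "));
  }
  return chunks;
}
```

Each chunk would then be passed to the tokenizer (or, for sqlite-vss, to an embedding model) separately.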
Hey, it would be really awesome if we could have a commonjs version published to npm too, to avoid having to worry about battling cjs/esm issues :)
Importing it in a TypeScript file gives:

TSError: ⨯ Unable to compile TypeScript: server.ts:5:33 - error TS7016: Could not find a declaration file for module 'llama-tokenizer-js'. '.../node_modules/llama-tokenizer-js/llama-tokenizer.js' implicitly has an 'any' type. Try `npm i --save-dev @types/llama-tokenizer-js` if it exists or add a new declaration (.d.ts) file containing `declare module 'llama-tokenizer-js';`

`@types/llama-tokenizer-js` doesn't exist.
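As a stopgap until the package ships its own types, a minimal ambient declaration file can be added to the consuming project. The encode/decode signatures below are assumptions based on the README's usage examples, not an official typing:

```typescript
// llama-tokenizer-js.d.ts — minimal ambient declaration (stopgap).
// Signatures are assumed from the README's examples, not verified.
declare module "llama-tokenizer-js" {
  const llamaTokenizer: {
    encode(text: string): number[];
    decode(tokenIds: number[]): string;
  };
  export default llamaTokenizer;
}
```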
It would be great to support Code Llama tokenization.
I understand that the Code Llama model is based on Llama 2, but I haven't found any clear evidence that they share the same tokenizer. If the current implementation already supports Code Llama tokenization, it would be great to state this clearly in the README.
The `performance` API is not always available, such as in the Next.js environment. I created a fix that uses it only when it is present.
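A minimal sketch of the kind of fallback described above (an assumption about the approach, not the project's actual patch): use `performance.now()` when the Performance API exists, otherwise fall back to `Date.now()`:

```javascript
// Guarded timing helper: prefer the high-resolution Performance API,
// fall back to Date.now() where `performance` is undefined.
const now =
  typeof performance !== "undefined" && typeof performance.now === "function"
    ? () => performance.now()
    : () => Date.now();

const start = now();
// ... run tokenization ...
const elapsedMs = now() - start;
```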
Seems they are not properly parsed.
Hi there,
There are new models coming out every day that I'd like to test, and this tool is quite useful for making sure the instruction and response remain within the model's context window.
However, in order not to repeat your work and instead extend it and contribute back any changes/additions, the following part of the code needs to be explained:
How do you compose the base64 vocab and merge binary? It's the only opaque part in this project. Everything else looks good and easy to work with.
If you document that part, I can generate those base64 strings for whatever model I happen to be working with and create a pull request for each. You'd have to add a way to select a model via some config, e.g. config.modelName.
Thanks
Hi
Cool stuff and thanks for sharing!
I can get the tokens in an array and get the length of that array to find out the number of tokens.
But how do I know which token corresponds to which characters so I can color them the way you do in the demo? Just wondering if that's a secret functionality that is not part of this open source package...
Thanks,
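One way to recover token-to-character spans for coloring is to decode each token id individually and accumulate character offsets. The sketch below uses a toy vocabulary as a stand-in for the real tokenizer's decode function (an assumption for illustration; per-token decoding can also be lossy with SentencePiece-style tokenizers, e.g. around leading spaces and byte-fallback tokens):

```javascript
// Map each token id to the substring it covers, by decoding tokens
// one at a time and tracking character offsets.
function tokenSpans(tokenIds, decode) {
  const spans = [];
  let offset = 0;
  for (const id of tokenIds) {
    const text = decode([id]);
    spans.push({ id, text, start: offset, end: offset + text.length });
    offset += text.length;
  }
  return spans;
}

// Toy vocabulary for demonstration only.
const toyVocab = { 1: "Hel", 2: "lo", 3: " world" };
const toyDecode = (ids) => ids.map((i) => toyVocab[i]).join("");
```

With a real tokenizer, its own decode function would be passed in place of `toyDecode`.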
First of all, I just wanted to thank you for this implementation! It is significantly faster than my original version of the Llama tokenizer (which would often just break for extremely large input texts). I've decided to generalize your implementation so that it can be applied to other BPE-based tokenizers (e.g., GPT-4). Here's my current PR for it to be added to Transformers.js.
Here's an example playground for some other tokenizers: demo.
I also made a few fixes/improvements to your version:
- Instead of the `x * 100000 + leftNode.origPos` "trick", I add a fractional "positional bias" to the score, which allows tiebreaks to be resolved correctly. For example, if the input text is longer than 100k characters (so origPos can exceed 100000), then in rare situations where the difference in bpe_ranks is small, the trick could result in incorrect behaviour.
- `top` is not defined in llama-tokenizer-js/llama-tokenizer.js (line 140 in a9c05be).
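The two tiebreak schemes described above can be contrasted with a sketch (illustrative only, not the actual code from either project). Lower score means merged first; ranks are integers from bpe_ranks:

```javascript
// Old "trick": scale rank so position acts as a tiebreak — but only
// while origPos stays below 100000.
const oldScore = (rank, origPos) => rank * 100000 + origPos;

// Fractional positional bias: the bias is always < 1, so position can
// never outweigh a rank difference, regardless of input length.
const newScore = (rank, origPos, textLen) => rank + origPos / (textLen + 1);

// With inputs longer than 100k characters the old trick can invert
// priorities: a worse rank at an early position beats a better rank
// at a late position.
const textLen = 200000;
oldScore(5, 0) < oldScore(4, 150000); // true  -> wrong merge order
newScore(5, 0, textLen) < newScore(4, 150000, textLen); // false -> correct
```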
Feel free to use any of this in your version too!
Love this project, solid work! The DX would be better and the codebase more maintainable if LlamaTokenizer were a class, though!