
tiktoken-rs's Introduction

tiktoken-rs


Rust library for tokenizing text with OpenAI models using tiktoken.

This library provides a set of ready-made tokenizers for working with GPT, tiktoken and related OpenAI models. Use cases cover tokenizing and counting tokens in text inputs.

This library is built on top of the tiktoken library and includes some additional features and enhancements for ease of use with Rust code.

Examples

For full working examples for all supported features, see the examples directory in the repository.

Usage

  1. Add the crate to your project with cargo
cargo add tiktoken-rs

Then, in your Rust code, call the API:

Counting token length

use tiktoken_rs::p50k_base;

let bpe = p50k_base().unwrap();
let tokens = bpe.encode_with_special_tokens(
  "This is a sentence   with spaces"
);
println!("Token count: {}", tokens.len());

Counting max_tokens parameter for a chat completion request

use tiktoken_rs::{get_chat_completion_max_tokens, ChatCompletionRequestMessage};

let messages = vec![
    ChatCompletionRequestMessage {
        content: Some("You are a helpful assistant that only speaks French.".to_string()),
        role: "system".to_string(),
        name: None,
        function_call: None,
    },
    ChatCompletionRequestMessage {
        content: Some("Hello, how are you?".to_string()),
        role: "user".to_string(),
        name: None,
        function_call: None,
    },
    ChatCompletionRequestMessage {
        content: Some("Parlez-vous francais?".to_string()),
        role: "system".to_string(),
        name: None,
        function_call: None,
    },
];
let max_tokens = get_chat_completion_max_tokens("gpt-4", &messages).unwrap();
println!("max_tokens: {}", max_tokens);

Counting max_tokens parameter for a chat completion request with async-openai

You need to enable the async-openai feature in your Cargo.toml file.

use tiktoken_rs::async_openai::get_chat_completion_max_tokens;
use async_openai::types::{ChatCompletionRequestMessage, Role};

let messages = vec![
    ChatCompletionRequestMessage {
        content: Some("You are a helpful assistant that only speaks French.".to_string()),
        role: Role::System,
        name: None,
        function_call: None,
    },
    ChatCompletionRequestMessage {
        content: Some("Hello, how are you?".to_string()),
        role: Role::User,
        name: None,
        function_call: None,
    },
    ChatCompletionRequestMessage {
        content: Some("Parlez-vous francais?".to_string()),
        role: Role::System,
        name: None,
        function_call: None,
    },
];
let max_tokens = get_chat_completion_max_tokens("gpt-4", &messages).unwrap();
println!("max_tokens: {}", max_tokens);

tiktoken supports these encodings used by OpenAI models:

Encoding name        OpenAI models
cl100k_base          ChatGPT models, text-embedding-ada-002
p50k_base            Code models, text-davinci-002, text-davinci-003
p50k_edit            Edit models like text-davinci-edit-001, code-davinci-edit-001
r50k_base (or gpt2)  GPT-3 models like davinci

See the examples in the repo for use cases. For more context on the different tokenizers, see the OpenAI Cookbook.
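If you are not sure which encoding a given model uses, the crate also exposes a lookup by model name. A minimal sketch, assuming get_bpe_from_model is available at the crate root:

use tiktoken_rs::get_bpe_from_model;

// Look up the tokenizer that matches a model name instead of
// hard-coding cl100k_base or p50k_base.
let bpe = get_bpe_from_model("gpt-4").unwrap();
let tokens = bpe.encode_with_special_tokens("This is a sentence   with spaces");
println!("Token count: {}", tokens.len());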

Encountered any bugs?

If you encounter any bugs or have any suggestions for improvements, please open an issue on the repository.

Acknowledgements

Thanks to @spolu for the original code and the .tiktoken files.

License

This project is licensed under the MIT License.


tiktoken-rs's Issues

Token counts are incorrect when request includes function messages

When using the relatively recent "Functions" feature of the ChatGPT API, it seems like tiktoken-rs underestimates the total number of tokens in the request. Here's a minimal example request:

{
  "messages": [
    {
      "role": "system",
      "content": "You are a friendly chatbot.\n"
    },
    {
      "role": "assistant",
      "content": "Hello, I am a friendly chatbot!\n"
    },
    {
      "role": "user",
      "content": "What is the weather in New York?"
    },
    {
      "content": "",
      "function_call": {
        "arguments": "{\n  \"city\": \"New York\"\n}",
        "name": "get_weather"
      },
      "role": "assistant"
    },
    {
      "role": "function",
      "name": "get_weather",
      "content": "{\"temperature\": 72, \"conditions\": \"partly_cloudy\"}"
    }
  ],
  "model": "gpt-4-0613",
  "temperature": 0,
  "stream": false
}

I get this response from OpenAI:

{
    // ...
    "usage": {
        "prompt_tokens": 78,
        "completion_tokens": 19,
        "total_tokens": 97
    }
}

...indicating the request consumed 78 tokens for the prompt. However, tiktoken_rs::num_tokens_from_messages returns a value of 66 tokens.
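For reference, a minimal reproduction along the lines of the README examples above. This is a sketch only: it assumes tiktoken_rs exposes a FunctionCall struct with name and arguments fields for use in ChatCompletionRequestMessage; adjust to the actual types if they differ.

use tiktoken_rs::{num_tokens_from_messages, ChatCompletionRequestMessage, FunctionCall};

let messages = vec![
    ChatCompletionRequestMessage {
        role: "user".to_string(),
        content: Some("What is the weather in New York?".to_string()),
        name: None,
        function_call: None,
    },
    // Assistant turn that calls a function instead of answering in text.
    ChatCompletionRequestMessage {
        role: "assistant".to_string(),
        content: Some("".to_string()),
        name: None,
        // FunctionCall with `name`/`arguments` fields is an assumption here.
        function_call: Some(FunctionCall {
            name: "get_weather".to_string(),
            arguments: "{\"city\": \"New York\"}".to_string(),
        }),
    },
];

// Compare this estimate against the prompt_tokens value reported by the API.
let estimate = num_tokens_from_messages("gpt-4-0613", &messages).unwrap();
println!("estimated prompt tokens: {}", estimate);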

Support for other LLMs?

Is this tokenizer library limited to certain models? The reason I ask is that I've been trying to add a new local LLM (codellama:7b-code) to the Zed editor, and while I can get it to go through the motions of working...

[screenshot: SCR-20240224-nflr]

...under the surface there are signs that it's not as simple as I'd hoped...

[screenshot: SCR-20240224-nfzl]

It's been suggested to me that the problem is because the Zed project uses your tiktoken library and that the library would need updating. I'm operating at the very limits of my abilities here, so any thoughts on this would be really appreciated 😉

Any chance of isolating some dependencies with features?

In particular, async-openai is fairly heavy in terms of dependencies for what it's used for here, and it blocked compiling my attempted Ruby binding tiktoken_ruby on some platforms because of some sort of difficulty with openssl.

Also, a side note: thanks for putting together this little library; it made a Ruby wrapper much easier and gave hints about how it all needs to fit together.

async-openai 0.18 support

OpenAI's spec introduced some new capabilities that led to some larger changes in the async-openai crate. I'm more than happy to contribute this work to tiktoken-rs but I wanted to open the dialogue first about one of the larger structural changes:

The biggest thing (imo) is that the structure of ChatCompletionRequestMessage was changed to support a different data type per role.

This makes getting message content less trivial, but it seemed like a necessity since OpenAI now supports user messages that include images when invoking certain models, so this will require a data structure change. I'm not very familiar with this space outside of general application, so I'm looking for input on how these new user messages should be handled in terms of token counting.
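To make the structural point concrete without depending on the real async-openai 0.18 types, here is a self-contained sketch with illustrative stand-in types. It only shows why token counting now has to match on role-specific variants rather than read a single content field:

// Illustrative stand-ins only; the real async-openai 0.18 types differ in detail.
enum RequestMessage {
    System { content: String },
    User { content: UserContent },
    Assistant { content: Option<String> },
}

enum UserContent {
    Text(String),
    // Image parts carry no obvious text to count.
    ImageUrl(String),
}

// Token counting has to inspect each role-specific variant
// instead of reading one `content: Option<String>` field.
fn countable_text(msg: &RequestMessage) -> Option<&str> {
    match msg {
        RequestMessage::System { content } => Some(content.as_str()),
        RequestMessage::User { content: UserContent::Text(text) } => Some(text.as_str()),
        RequestMessage::User { content: UserContent::ImageUrl(_) } => None,
        RequestMessage::Assistant { content } => content.as_deref(),
    }
}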

BPE memory leak

I don't know if I'm using it wrong, but creating a new BPE allocates around 20 MB of memory that is never released. On top of that, the async_openai::get_max_tokens_chat_message function creates a new BPE internally, so there is significant memory usage that is never released after every call.
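Until encoders are cached inside the crate, one workaround is to build the BPE once per process and reuse it everywhere. A minimal sketch using the once_cell crate (an extra dependency, not something tiktoken-rs provides), and assuming CoreBPE can live in a static:

use once_cell::sync::Lazy;
use tiktoken_rs::{cl100k_base, CoreBPE};

// Build the encoder once; the large ranking tables are then allocated
// a single time instead of on every call.
static CL100K: Lazy<CoreBPE> = Lazy::new(|| cl100k_base().unwrap());

fn count_tokens(text: &str) -> usize {
    CL100K.encode_with_special_tokens(text).len()
}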

cl100k usage

Greetings.

How can we use cl100k with this library? I can see that the file for it is in the source directory, but there is no function to load it. Would it be the same as loading the others? If so, I can do it myself; I'm just new to this topic and don't know if there's something more complex that would have to be done.

Thanks.
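For reference, recent versions load it the same way as the other encodings; a minimal sketch:

use tiktoken_rs::cl100k_base;

// cl100k_base is loaded just like p50k_base and r50k_base.
let bpe = cl100k_base().unwrap();
let tokens = bpe.encode_with_special_tokens("tokenize this with cl100k_base");
println!("Token count: {}", tokens.len());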

Update to async-openai 0.16.2

Following the recent API update of ChatGPT, Async OpenAI has undergone numerous modifications in its codebase, necessitating significant adjustments to align this library with the updated version.

How could you add parallelism to make the encoding faster?

On lines 140-141 of lib.rs, there is a comment where the author mentions he tried threading with rayon but noticed it wasn't much faster than Python threads.

Currently the Python version gets me the token length in ~0.26 seconds while this crate takes ~1.8 seconds, so I propose we add threading back to speed up the process.

I am still a bit new to Rust, so this post is mostly meant to gather suggestions on how we would go about integrating threading.
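One way to get more throughput without any change to the crate is to share a single encoder across threads and parallelise over inputs. A minimal sketch with rayon (an extra dependency), assuming CoreBPE can be shared between threads:

use rayon::prelude::*;
use tiktoken_rs::cl100k_base;

let bpe = cl100k_base().unwrap();
let documents = vec!["first document", "second document", "third document"];

// Encode the documents in parallel, sharing one CoreBPE across worker threads.
let token_counts: Vec<usize> = documents
    .par_iter()
    .map(|doc| bpe.encode_with_special_tokens(doc).len())
    .collect();
println!("{:?}", token_counts);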

Expose `_decode_native`?

Hey, thanks for the library.

My use case is that I want to split text into tokens, and then split it into chunks based on that. Splitting naively may result in invalid utf-8 (if the token is not at a unicode boundary).

There are different ways to deal with this. In a perfect world, I would know which tokens end a Unicode character and only consider those, but that's probably not feasible without major changes.

Alternatively, if I have access to the output of _decode_native, I can for example trim the last token(s) or use from_utf8_lossy.

What are your thoughts?
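Even without `_decode_native`, something similar is possible with the public API by dropping trailing tokens from a chunk until it decodes cleanly. A minimal sketch, assuming decode returns an error when the token slice is not valid UTF-8:

use tiktoken_rs::p50k_base;

let bpe = p50k_base().unwrap();
let tokens = bpe.encode_with_special_tokens("some long text to split into chunks");

// Take at most `chunk_size` tokens, then back off until the prefix decodes
// to valid UTF-8 (a chunk boundary may fall in the middle of a character).
let chunk_size = 5.min(tokens.len());
let mut end = chunk_size;
let chunk = loop {
    match bpe.decode(tokens[..end].to_vec()) {
        Ok(text) => break text,
        Err(_) if end > 0 => end -= 1,
        Err(e) => panic!("could not decode any prefix: {}", e),
    }
};
println!("chunk: {:?}", chunk);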

Using tiktoken encode throws an error

Using version 0.1:

fn main() {
    let bpe = tiktoken_rs::tiktoken::cl100k_base().unwrap();
    println!("{:?}", bpe.encode_with_special_tokens("hello world"))
}

it outputs:

[15339, 1917]

Using version 0.2:

fn main() {
    let bpe = tiktoken_rs::cl100k_base().unwrap();
    println!("{:?}", bpe.encode_with_special_tokens("hello world"))
}

it throws an error:

dyld[65262]: Library not loaded: @rpath/libpython3.8.dylib
  Referenced from: /Projects/rust/test/target/debug/test
  Reason: tried: '/Projects/rust/test/target/debug/deps/libpython3.8.dylib' (no such file), '/Projects/rust/test/target/debug/libpython3.8.dylib' (no such file), '/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/x86_64-apple-darwin/lib/libpython3.8.dylib' (no such file), '/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/libpython3.8.dylib' (no such file), '/lib/libpython3.8.dylib' (no such file), '/usr/local/lib/libpython3.8.dylib' (no such file), '/usr/lib/libpython3.8.dylib' (no such file)
fish: Job 1, 'cargo run' terminated by signal SIGABRT (Abort)

Use case: splitting text into tokens.

As far as I am able to tell, tiktoken-rs can encode text into a vector of token ids, and you can then decode those ids back into the original text. However, my use case is to split my text into the decoded tokens. I don't see a way to do this via the API. Having a method like "split_by_tokens" would be super helpful.
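Something like the requested helper can be approximated with the existing API by decoding each token id on its own; a minimal sketch (token ids that are not valid UTF-8 by themselves are replaced with a placeholder):

use tiktoken_rs::p50k_base;

let bpe = p50k_base().unwrap();
let tokens = bpe.encode_with_special_tokens("This is a sentence   with spaces");

// Decode one token at a time to recover the text each token covers.
let pieces: Vec<String> = tokens
    .iter()
    .map(|&tok| {
        bpe.decode(vec![tok])
            .unwrap_or_else(|_| String::from("<non UTF-8 token>"))
    })
    .collect();
println!("{:?}", pieces);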

Make the crate independent of async_openai

The crate seems to use async_openai only for its message types, but that is a pretty small thing that should not require bringing a whole crate into the dependencies. I just want to use this crate with another implementation of an OpenAI API wrapper, not async_openai, but it's the only thing that gets in the way.
