🚀 JTokkit - Java Tokenizer Kit

Welcome to JTokkit, a Java tokenizer library designed for use with OpenAI models.

EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
Encoding enc = registry.getEncoding(EncodingType.CL100K_BASE);
assertEquals("hello world", enc.decode(enc.encode("hello world")));

// Or get the tokenizer corresponding to a specific OpenAI model
enc = registry.getEncodingForModel(ModelType.TEXT_EMBEDDING_ADA_002);

📖 Introduction

JTokkit aims to be a fast and efficient tokenizer for natural language processing tasks that use OpenAI models. It provides an easy-to-use interface for tokenizing input text, for example to count the tokens a request to the GPT-3.5 model will require. The library grew out of the need for capabilities in the JVM ecosystem similar to those the tiktoken library provides for Python.
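
As a rough illustration of the token-counting use case, the sketch below checks whether a prompt fits into a token budget. `TokenBudget` and `countWhitespaceWords` are hypothetical names that are not part of JTokkit, and the whitespace counter is only a stand-in so the example runs without the library. With JTokkit on the classpath, one would presumably pass something like `text -> enc.encode(text).size()` instead.

```java
import java.util.function.ToIntFunction;

/** Hypothetical helper (not part of JTokkit): checks a prompt against a token budget. */
final class TokenBudget {
    private final ToIntFunction<String> tokenCounter;
    private final int maxTokens;

    TokenBudget(ToIntFunction<String> tokenCounter, int maxTokens) {
        if (maxTokens < 0) {
            throw new IllegalArgumentException("maxTokens must not be negative");
        }
        this.tokenCounter = tokenCounter;
        this.maxTokens = maxTokens;
    }

    /** Tokens left after the prompt; negative if the prompt exceeds the budget. */
    int remaining(String prompt) {
        return maxTokens - tokenCounter.applyAsInt(prompt);
    }

    /** True if the prompt needs no more than maxTokens tokens. */
    boolean fits(String prompt) {
        return remaining(prompt) >= 0;
    }

    /**
     * Stand-in counter that counts whitespace-separated words.
     * This is NOT how OpenAI encodings count tokens; it only keeps the sketch runnable.
     */
    static int countWhitespaceWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }
}
```

Taking the counter as a `ToIntFunction<String>` keeps the budget logic independent of the tokenizer, so the stand-in can be swapped for a real encoding without touching the rest.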

🤖 Features

✅ Implements encoding and decoding via r50k_base, p50k_base, p50k_edit and cl100k_base

✅ Easy-to-use API

✅ Easy extensibility for custom encoding algorithms

✅ Zero Dependencies

✅ Supports Java 8 and above

✅ Fast and efficient performance

🔨 Handling of special tokens during encoding (not started)

📊 Performance

JTokkit is 2 to 3 times faster than a comparable tokenizer.

For details on the benchmark, see the benchmark directory.

๐Ÿ› ๏ธ Installation

You can install JTokkit by adding the following dependency to your Maven project:

<dependency>
    <groupId>com.knuddels</groupId>
    <artifactId>jtokkit</artifactId>
    <version>0.3.0</version>
</dependency>

Alternatively, using Gradle:

dependencies {
    implementation 'com.knuddels:jtokkit:0.3.0'
}

🔰 Getting Started

To use JTokkit, simply create a new EncodingRegistry and use getEncoding to retrieve the encoding you want to use. You can then use the encode and decode methods to encode and decode text.

EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
Encoding enc = registry.getEncoding(EncodingType.CL100K_BASE);
List<Integer> encoded = enc.encode("This is a sample sentence.");
// encoded = [2028, 374, 264, 6205, 11914, 13]
        
String decoded = enc.decode(encoded);
// decoded = "This is a sample sentence."

// Or get the tokenizer based on the model type
Encoding secondEnc = registry.getEncodingForModel(ModelType.TEXT_EMBEDDING_ADA_002);
// enc == secondEnc

The EncodingRegistry and Encoding classes are thread-safe and can be freely shared among components.
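
To illustrate what sharing one instance among components can look like, the sketch below submits several texts to a thread pool, with every task using the same counter object. `SharedCounterDemo` and `countAll` are hypothetical names, and the `ToIntFunction<String>` is a stand-in for a shared encoding (for example `text -> enc.encode(text).size()`), so the example runs without JTokkit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToIntFunction;

/** Hypothetical demo (not part of JTokkit): one shared counter used from many tasks. */
final class SharedCounterDemo {

    /** Counts tokens for each text in parallel; results keep the order of the input list. */
    static List<Integer> countAll(ToIntFunction<String> sharedCounter, List<String> texts, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String text : texts) {
                // Every task calls the same sharedCounter instance; this is only safe
                // because the counter is assumed to be thread-safe.
                Callable<Integer> task = () -> sharedCounter.applyAsInt(text);
                futures.add(pool.submit(task));
            }
            List<Integer> counts = new ArrayList<>();
            for (Future<Integer> future : futures) {
                counts.add(future.get());
            }
            return counts;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while counting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("counting task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the shared object carries no per-call state, the tasks need no locking of their own.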

➰ Extending JTokkit

You may want to extend JTokkit to support custom encodings. To do so, you have two options:

  1. Implement the Encoding interface and register it with the EncodingRegistry
EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
Encoding customEncoding = new CustomEncoding();
registry.registerEncoding(customEncoding);
  2. Add new parameters for use with the existing BPE algorithm
EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
GptBytePairEncodingParams params = new GptBytePairEncodingParams(
        "custom-name",
        Pattern.compile("some custom pattern"),
        encodingMap,
        specialTokenEncodingMap
);
registry.registerGptBytePairEncoding(params);

Afterwards you can use the custom encodings alongside the default ones and access them by using registry.getEncoding("custom-name"). See the JavaDoc for more details.
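
For readers unfamiliar with byte pair encoding, the toy sketch below shows the roles a split pattern and a ranked vocabulary presumably play in the second option. It is not JTokkit's implementation: `ToyBpe` is a hypothetical class, it works on Java characters rather than UTF-8 bytes, and it ignores special tokens. The pattern first cuts the text into pieces, then each piece starts as single characters and the adjacent pair with the lowest rank is merged repeatedly until no known pair remains.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy character-level BPE (hypothetical; real GPT encodings merge UTF-8 bytes). */
final class ToyBpe {
    private final Pattern splitPattern;
    private final Map<String, Integer> ranks;              // piece -> token id; lower id merges first
    private final Map<Integer, String> pieces = new HashMap<>(); // token id -> piece, for decoding

    ToyBpe(Pattern splitPattern, Map<String, Integer> ranks) {
        this.splitPattern = splitPattern;
        this.ranks = new HashMap<>(ranks);
        for (Map.Entry<String, Integer> entry : ranks.entrySet()) {
            pieces.put(entry.getValue(), entry.getKey());
        }
    }

    /** Splits the text with the pattern and encodes every match independently. */
    List<Integer> encode(String text) {
        List<Integer> tokens = new ArrayList<>();
        Matcher matcher = splitPattern.matcher(text);
        while (matcher.find()) {
            tokens.addAll(encodePiece(matcher.group()));
        }
        return tokens;
    }

    private List<Integer> encodePiece(String piece) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < piece.length(); i++) {
            parts.add(String.valueOf(piece.charAt(i)));
        }
        while (parts.size() > 1) {
            // Find the adjacent pair whose merged form has the lowest rank.
            int bestIndex = -1;
            int bestRank = Integer.MAX_VALUE;
            for (int i = 0; i < parts.size() - 1; i++) {
                Integer rank = ranks.get(parts.get(i) + parts.get(i + 1));
                if (rank != null && rank < bestRank) {
                    bestRank = rank;
                    bestIndex = i;
                }
            }
            if (bestIndex < 0) {
                break; // no known pair left to merge
            }
            parts.set(bestIndex, parts.get(bestIndex) + parts.get(bestIndex + 1));
            parts.remove(bestIndex + 1);
        }
        List<Integer> ids = new ArrayList<>();
        for (String part : parts) {
            Integer id = ranks.get(part);
            if (id == null) {
                throw new IllegalArgumentException("no token for: " + part);
            }
            ids.add(id);
        }
        return ids;
    }

    /** Concatenates the pieces behind the token ids. */
    String decode(List<Integer> tokens) {
        StringBuilder text = new StringBuilder();
        for (int id : tokens) {
            String piece = pieces.get(id);
            if (piece == null) {
                throw new IllegalArgumentException("unknown token id: " + id);
            }
            text.append(piece);
        }
        return text.toString();
    }
}
```

With a vocabulary such as `a=0, b=1, c=2, ab=3, abc=4` and the pattern `\S+|\s+`, the piece `abc` first merges `a` and `b` into `ab`, then `ab` and `c` into `abc`, yielding a single token.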

📄 License

JTokkit is licensed under the MIT License. See the LICENSE file for more information.
