
Artificial Intelligence Controller Interface (AICI)

The Artificial Intelligence Controller Interface (AICI) lets you build Controllers that constrain and direct the output of a Large Language Model (LLM) in real time. Controllers are flexible programs capable of implementing constrained decoding, dynamic editing of prompts and generated text, and coordination of execution across multiple, parallel generations. Controllers incorporate custom logic during token-by-token decoding and maintain state over the course of an LLM request. This allows diverse Controller strategies, from programmatic or query-based decoding to multi-agent conversations, to execute efficiently in tight integration with the LLM itself.

The purpose of AICI is to make it easy to build and experiment with both existing and entirely new Controller strategies for improving LLM generations. By abstracting away implementation details of the underlying LLM inference and serving engine, AICI aims to simplify the development of Controllers, make it easier to write fast Controllers, and ease compatibility across LLM inference and serving engines.

AICI is designed for both local and cloud execution, including (eventually) multi-tenant LLM deployments. Controllers are implemented as light-weight WebAssembly (Wasm) modules which run on the same machine as the LLM inference engine, utilizing the CPU while the GPU is busy with token generation. AICI is one layer in the inference stack, and is designed to allow control libraries such as Guidance, LMQL, and others to run on top of it and gain both efficiency and performance improvements, as well as portability across LLM inference and serving engines.

AICI currently integrates with llama.cpp, HuggingFace Transformers, and rLLM (custom tch-based LLM inference engine), with vLLM in the works.

AICI is:

  • Flexible: Controllers can be written in any language that can compile to Wasm (Rust, C, C++, ...), or be interpreted inside Wasm (Python, JavaScript, ...)
  • Secure: Controllers are sandboxed and cannot access the filesystem, network, or any other resources
  • Fast: Wasm modules are compiled to native code and run in parallel with the LLM inference engine, inducing only a minimal overhead to the generation process

AICI is a prototype, designed and built at Microsoft Research.

Table of Contents

QuickStart: Example Walkthrough

In this quickstart, we'll guide you through the following steps:

  • Set up rLLM Server and AICI Runtime.
  • Build and deploy a Controller.
  • Use AICI to control LLM output, so you can customize an LLM to follow specific rules when generating text.

Development Environment Setup

To compile AICI components, you need to set up your development environment for Rust. For this quickstart you also need Python 3.11 or later to create a controller.

Windows WSL / Linux / macOS

Note

Windows users: please use WSL2 or the included devcontainer. Adding native Windows support is tracked here.

macOS users: please make sure you have the Xcode command line tools installed by running xcode-select -p; if they are not installed, run xcode-select --install.

CUDA: the CUDA build relies on a specific libtorch installation. It's highly recommended to use the included devcontainer.

If you're using the devcontainer, you can skip to the next section.

Using the system package manager, install the necessary tools for building code in the repository, including git, cmake and ccache.

For instance in WSL / Ubuntu using apt:

sudo apt-get install --assume-yes --no-install-recommends \
    build-essential cmake ccache pkg-config libssl-dev libclang-dev clang llvm-dev git-lfs

or using Homebrew on macOS:

brew install git cmake ccache

Then install Rust, Rustup and Cargo, following the instructions provided here and here:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

After installation, verify that the rustup --version command is accessible by running it from the terminal. If the command isn't recognized, try opening a new terminal session.

Next, install the wasm32-wasi Rust target:

rustup target add wasm32-wasi

If you already had Rust installed, or are getting complaints from Cargo about outdated versions, run:

rustup update

Finally, to work with Python controllers and scripts (like the one in this tutorial), install the required packages:

pip install pytest pytest-forked ujson posix_ipc numpy requests

Build and start rLLM server and AICI Runtime

The rLLM server has two backends, one based on libtorch and CUDA (rllm-cuda), and the other based on llama.cpp (rllm-llamacpp).

The rllm-cuda backend only works with NVIDIA GPUs with compute capability 8.0 or later (A100 and later; RTX 30x0 and later) and requires a fiddly libtorch setup; it's strongly recommended to use the included devcontainer. While this guide focuses on the rllm-llamacpp backend, the build steps are the same for rllm-cuda, modulo the folder name.

After completing the development environment setup above, clone the AICI repository and proceed with the next steps outlined below.

Use the following command to build and run aicirt and rllm-llamacpp:

cd rllm/rllm-llamacpp
./server.sh phi2

You can pass other model names as an argument (run ./server.sh without arguments to see the available models). You can also use a HuggingFace URL to a .gguf file or a local path to a .gguf file. (For rllm-cuda, use a HuggingFace model id or a path to a folder.)

./server.sh orca

You can find more details about rllm-llamacpp here.

The rLLM server provides an HTTP interface, used for configuration tasks and for processing requests. You can also use this interface to quickly verify that the server is running. For instance, if you open http://127.0.0.1:4242/v1/models, you should see:

{
  "object": "list",
  "data": [
    {
      "object": "model",
      "id": "TheBloke/phi-2-GGUF",
      "created": 946810800,
      "owned_by": "owner"
    }
  ]
}

confirming that the selected model is loaded.
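
You can also query the same endpoint from a script. The short sketch below uses the requests package installed earlier and the default address shown above (adjust the host and port if you changed them):

import requests

# Check that the rLLM server is up and see which model it loaded.
resp = requests.get("http://127.0.0.1:4242/v1/models")
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])  # e.g. TheBloke/phi-2-GGUF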

Control AI output using AICI controllers

AICI allows hosting custom logic, called Controllers, which initiate, terminate, and interact with LLM token generation. Controllers take input arguments, process them, and return a result with logs, LLM tokens, and variables.

The repository includes some examples, in particular:

  • jsctrl: a controller that accepts JavaScript code as input for execution. This code can interact with the model to generate text and tokens.
  • pyctrl: a controller that accepts Python code as input for execution. This code can also interact with the model to generate text and tokens.

In this example we'll use pyctrl to manage token generation with a simple Python script. You can build and upload pyctrl yourself if you want; however, by default the server automatically downloads the latest pyctrl release from GitHub.

In general, controllers require building and deployment, while scripts (Python or JavaScript) are sent with each request.

The following illustrates the relationship between the rLLM server, the AICI runtime, and the controller:

erDiagram
    Host    ||--|{ CPU : ""
    Host    ||--|{ GPU : ""
    
    CPU     ||--|| "rLLM Server" : execute
    CPU     ||--|{ "AICI Runtime" : execute

    "AICI Runtime" ||--|| "Controller" : instantiate

    GPU     ||--|{ "LLM token generation" : execute

Controlling the LLM token generation

Suppose we aim for a model to generate a list, adhering to a specific format and containing only five items.

Typically, achieving this involves prompt engineering, crafting the prompt precisely with clear instructions, such as:

What are the five most popular types of vehicles?
Return the result as a numbered list.
Do not add explanations, only the list.

The prompt would also vary depending on the model in use, given that each model tends to add explanations and interprets instructions in different ways.

With AICI, we shift control back to code, and we can simplify the prompt to:

What are the most popular types of vehicles?

using code to:

  1. Limit the list to 5 items
  2. Prevent the model from adding an initial explanation
  3. Format the output as a numbered list
  4. Stop the model from adding text after the list

Let's create a list-of-five.py Python file with the following content:

import pyaici.server as aici

# Force the model to generate a well formatted list of 5 items, e.g.
#   1. name 1
#   2. name 2
#   3. name 3
#   4. name 4
#   5. name 5
async def main():
    
    # This is the prompt we want to run.
    # Note how the prompt doesn't mention a number of vehicles or how to format the result.
    prompt = "What are the most popular types of vehicles?\n"

    # Tell the model to generate the prompt string, ie. let's start with the prompt "to complete"
    await aici.FixedTokens(prompt)

    # Store the current position in the token generation process
    marker = aici.Label()

    for i in range(1,6):
      # Tell the model to generate the list number
      await aici.FixedTokens(f"{i}.")

      # Wait for the model to generate a vehicle name and end with a new line
      await aici.gen_text(stop_at = "\n")

    await aici.FixedTokens("\n")

    # Store the tokens generated in a result variable
    aici.set_var("result", marker.text_since())

aici.start(main())

Running the script is not too different from sending a prompt; in this case, we're sending the control logic and the prompt together.

To see the final result, execute the following command:

./aici.sh run list-of-five.py

Result:

Running with tagged AICI Controller: gh:microsoft/aici/pyctrl
[0]: FIXED 'What are the most popular types of vehicles?\n'
[0]: FIXED '1.'
[0]: GEN ' Cars\n'
[0]: FIXED '2.'
[0]: GEN ' Motorcycles\n'
[0]: FIXED '3.'
[0]: GEN ' Bicycles\n'
[0]: FIXED '4.'
[0]: GEN ' Trucks\n'
[0]: FIXED '5.'
[0]: GEN ' Boats\n'
[0]: FIXED '\n'
[DONE]
[Response] What are the most popular types of vehicles?
1. Cars
2. Motorcycles
3. Bicycles
4. Trucks
5. Boats

response saved to tmp/response.json
Usage: {'sampled_tokens': 16, 'ff_tokens': 37, 'cost': 69}
Timing: {'http_response': 0.05193686485290527, 'data0': 0.05199289321899414, 'first_token': 0.0658726692199707, 'last_token': 0.1784682273864746}
Tokens/sec: {'prompt': 861.0913072488067, 'sampling': 89.65181217019571}
Storage: {'result': '1. Cars\n2. Motorcycles\n3. Bicycles\n4. Trucks\n5. Boats\n\n'}
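
The same pattern can be combined with constrained decoding. The sketch below is illustrative only: it assumes gen_text also accepts regex= and max_tokens= options, as used in other pyctrl samples in the repository; check the pyctrl documentation for the exact parameter names before relying on it.

import pyaici.server as aici

# Sketch (assumed regex= / max_tokens= options): force each item to be a single
# capitalized word followed by a newline.
async def main():
    await aici.FixedTokens("What are the most popular types of vehicles?\n")
    marker = aici.Label()
    for i in range(1, 6):
        await aici.FixedTokens(f"{i}.")
        await aici.gen_text(regex=r" [A-Z][a-z]+\n", max_tokens=8)
    aici.set_var("result", marker.text_since())

aici.start(main())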

Comprehensive Guide: Exploring Further

This repository contains a number of components, and which ones you need depends on your use case.

You can use an existing controller module. We provide PyCtrl and JsCtrl that let you script controllers using server-side Python and JavaScript, respectively. The pyaici package contains the aici command-line tool that lets you upload and run scripts with any controller (we also provide a REST API definition for the curious).

🧑‍💻 Python code samples for scripting PyCtrl and a JavaScript Hello World for JSCtrl

We anticipate libraries will be built on top of controllers. We provide an example in promptlib - a client-side Python library that interacts with DeclCtrl via the pyaici package.

🧑‍💻 Example notebook that uses PromptLib to interact with DeclCtrl.

The controllers can be run in a cloud or local AICI-enabled LLM inference engine. You can run the provided reference engine (rLLM) locally with either libtorch+CUDA or llama.cpp backend.

To develop a new controller, use a Rust starter project that shows usage of the aici_abi library, which simplifies implementing the low-level AICI interface.

🧑‍💻 Sample code for a minimal new controller to get you started

To add AICI support to a new LLM inference engine, you will need to implement the LLM side of the protocol that talks to the AICI runtime.

Finally, you may want to modify any of the provided components - PRs are most welcome!

Architecture

AICI abstracts the LLM inference engine from the controller and vice versa, as in the diagram below. The rounded nodes are aspirational. Additional layers can be built on top - we provide promptlib, but we strongly believe that Guidance, LMQL, SGLang, Outlines, jsonformer, LMFE, etc. can also run on top of AICI (either with custom controllers or utilizing PyCtrl or JsCtrl).

graph TD
    PyCtrl -- AICI --> aicirt[AICI-runtime]
    JsCtrl -- AICI --> aicirt
    guidance([GuidanceCtrl]) -- AICI --> aicirt
    lmql([LMQL Ctrl]) -- AICI --> aicirt
    aicirt -- POSIX SHM --> rLLM
    aicirt -- POSIX SHM --> llama[llama.cpp]
    aicirt -- POSIX SHM --> pyaici
    pyaici -- Python --> vLLM(vLLM)
    pyaici -- Python --> hf[HF Transformers]

The pyaici package makes it easier to integrate AICI with Python-based LLM inference engines. Take a look at the integration with HuggingFace Transformers, though note that it doesn't support forking (generation of multiple sequences in parallel). The vLLM REST server is currently out of date; please use rllm-cuda or rllm-llamacpp for now.

Security

  • aicirt runs in a separate process, and can run under a different user than the LLM engine
  • Wasm modules are sandboxed by Wasmtime
  • Wasm modules only have access to the aici_host_* functions, implemented in hostimpl.rs
  • aicirt also exposes a partial WASI interface; however, almost all of its functions are no-ops, except for fd_write, which shims file descriptors 1 and 2 (stdout and stderr) to print debug messages
  • each Wasm module runs in a separate process, helping with Spectre/Meltdown mitigation and allowing limits on CPU usage

In particular, Wasm modules cannot access the filesystem, network, or any other resources. They also cannot spin threads or access any timers (this is relevant for Spectre/Meltdown attacks).

Performance

Most of the computation in AICI Controllers occurs on the CPU, in parallel with logit generation on the GPU. Generation occurs in steps, where logits are generated in parallel for a new token for each sequence in a batch (typically between 1 and 50 sequences). This involves reading the whole model and the KV caches for the sequences in the batch from GPU memory. For optimal batch throughput, the model and KV caches should occupy a major fraction of GPU memory, and reading all of that memory takes about 40ms on an A100 GPU (80GB).

Thus, each step of generation takes on the order of 20-50ms. With careful engineering, this is more than enough to compute the set of allowed tokens in Rust compiled to Wasm. These can be combined either natively in Rust, or via Python or JavaScript interpreters we provide.

For example, computing the allowed token set over the 32,000-token vocabulary of the Llama model takes:

  • about 2.0ms for the Yacc grammar of the C programming language
  • about 0.3ms for a regular expression
  • about 0.2ms for a substring constraint over a 4 kB string

The above numbers are for a single sequence; however, each sequence is processed in a separate process, so as long as there are more cores than sequences (which is typical), they do not change. They also include the overhead of calling into the Python interpreter implemented in Wasm, and then back into Rust-generated Wasm code for the constraint itself. They are all well within the 20-50ms budget, so they do not affect generation time at all.

There is also some overhead in the critical path of sampling. It comes down to about 0.3ms per generation step when executing 10 sequences in parallel (this is irrespective of the constraint used). The overhead goes up to around 0.7ms for 40 sequences (though it has not been fully optimized yet).

WebAssembly is designed to have minimal overhead compared to native code. In our experience, highly optimized Rust code is less than 2x slower when run in Wasmtime than when run natively. This is 10-100x better than JavaScript or Python.

All measurements were done on an AMD EPYC 7V13 with an NVIDIA A100 GPU with 80GB of VRAM.

Flexibility

The low-level interface that the AICI runtime provides allows for:

  • interaction with the LLM inference engine before, during, and after every generated token
  • constraining decoding to a set of tokens
  • backtracking KV-cache to a previous state
  • fast-forwarding several tokens at a time (if they are known)
  • forking generation into multiple branches
  • communication between forks via shared variables
  • utility functions for converting between tokens and byte strings

It can be utilized from any language that compiles to Wasm.

This repository provides a Rust library that makes it easy to implement controllers in Rust, along with efficient implementations of specific constraints (regular expressions, yacc grammars, substrings). We also provide Python and JavaScript interpreters that let you glue these constraints together. All of these can be easily extended.
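
As a concrete illustration of two of the capabilities listed above, forking and shared variables, here is a minimal Python sketch on top of pyctrl. It assumes pyaici.server exposes a fork() coroutine that splits generation into parallel branches and returns the branch index, as in the repository's samples; the exact name and semantics may differ.

import pyaici.server as aici

# Sketch only (assumed fork() API): generate two proposals in parallel branches
# and publish each one through a shared variable.
async def main():
    await aici.FixedTokens("Propose a name for a new bicycle model.\n")

    # Assumption: fork(2) creates two branches and returns 0 or 1.
    branch = await aici.fork(2)

    await aici.FixedTokens(f"Proposal {branch + 1}: ")
    marker = aici.Label()
    await aici.gen_text(stop_at="\n")

    # Shared variables let branches (and the client) read each other's results.
    aici.set_var(f"proposal_{branch}", marker.text_since())

aici.start(main())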

Acknowledgements

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.


aici's Issues

Action required: migrate or opt-out of migration to GitHub inside Microsoft

Migrate non-Open Source or non-External Collaboration repositories to GitHub inside Microsoft

In order to protect and secure Microsoft, private or internal repositories in GitHub for Open Source which are not related to open source projects or which require collaboration with 3rd parties (customers, partners, etc.) must be migrated to GitHub inside Microsoft, a.k.a. GitHub Enterprise Cloud with Enterprise Managed Users (GHEC EMU).

Action

✍️ Please RSVP to opt-in or opt-out of the migration to GitHub inside Microsoft.

❗ Only users with admin permission in the repository are allowed to respond. Failure to provide a response will result in your repository getting automatically archived. 🔒

Instructions

Reply with a comment on this issue containing one of the optin or optout commands below.

✅ Opt-in to migrate

@gimsvc optin --date <target_migration_date in mm-dd-yyyy format>

Example: @gimsvc optin --date 03-15-2023

OR

❌ Opt-out of migration

@gimsvc optout --reason <staging|collaboration|delete|other>

Example: @gimsvc optout --reason staging

Options:

  • staging : This repository will ship as Open Source or go public
  • collaboration : Used for external or 3rd party collaboration with customers, partners, suppliers, etc.
  • delete : This repository will be deleted because it is no longer needed.
  • other : Other reasons not specified

Need more help? 🖐️

check on pipelining

Check if vLLM does any pipelining - if so, aicirt will have trouble, as it assumes one set of requests at a time.

llama CPU phi2 model wrong generation

   Compiling rquickjs v0.4.0 (https://github.com/mmoskal/rquickjs?rev=5b0e3b24d5021d3cd4981d3693fd7bd1a106314c#5b0e3b24)
    Finished release [optimized + debuginfo] target(s) in 35.89s
built: /Users/markus/src/aici/target/wasm32-wasi/release/aici_jsctrl.wasm, 3.14 MiB
upload module... 3213kB -> 10087kB id:3508abaf
[0]: FIXED "Ultimate answer is to the life, universe and everything is "
[0]: GEN-OPT {regex: /\d\d/}
[0]: regex constraint: "\\d\\d"
[0]: dfa: 160 bytes
[0]: GEN "𝫘𝫘𝫘𝫘𝫘𝫘𝫘𝫘𝫘𝫘"
[0]: JsCtrl: done
[DONE]
[Response] Ultimate answer is to the life, universe and everything is 𝫘𝫘𝫘𝫘𝫘𝫘𝫘𝫘𝫘𝫘

response saved to tmp/response.json
Usage: {'sampled_tokens': 21, 'ff_tokens': 32, 'cost': 74}
Storage: {}

Allow tagging wasm modules

In particular, I want pyvm-latest and pyvm-2023-12-23-08-12 kinds of tags so people don't have to upload their own.

add "prompted virtualized repl agent" as pseudo model

A dockerized application, reproducibly built from the output of the named REPL agent's generative script with the passed parameters and ENV variables.

The ability to spawn instances of such applications' REPL environments on demand (for example, docker's interactive bash) and use them as conversational models.

very low temperature causes crash in sampling

The logits tensor is float16, and we use -100 to ban a token. A temperature setting below around 0.0003 causes an overflow and the following crash:

  File "/workspaces/aici/vllm/vllm/model_executor/layers/sampler.py", line 409, in _sample
    parent_seq_ids, next_token_ids = _sample_from_generation_tokens(
  File "/workspaces/aici/vllm/vllm/model_executor/layers/sampler.py", line 356, in _sample_from_generation_tokens
    next_token_ids = torch.multinomial(probs,
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
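
A plausible minimal reproduction of the described overflow, assuming the sampler divides the float16 logits by the temperature before the softmax (as vLLM's sampling path roughly does), is sketched below; it triggers the same multinomial error:

import torch

# Sketch of the suspected failure mode: at temperature 0.0003, an ordinary
# logit of 20 becomes ~66667, which exceeds the float16 maximum (~65504) and
# overflows to inf, while the -100 "ban" value overflows to -inf. The softmax
# then produces nan, and torch.multinomial rejects the probability tensor.
logits = torch.tensor([20.0, -100.0], dtype=torch.float16)
probs = torch.softmax(logits / 0.0003, dim=-1)
print(probs)  # tensor([nan, nan], dtype=torch.float16)
torch.multinomial(probs, num_samples=1)  # RuntimeError: probability tensor contains `inf`, `nan` ...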

rx matching empty string

We currently fail here for a simple regex like '\d*':

impl FunctionalRecognizer<RecRxState> for RecRx {
    fn initial(&self) -> RecRxState {
        self.dfa
            .universal_start_state(regex_automata::Anchored::Yes)
            .expect("dfa has no universal start state; make sure it doesn't match empty string")
    }
}

use processes not threads for workers

Advantages:

  • OS-level Spectre mitigations
  • we can limit the time externally (via kill(2)); don't have to use epochs in WASM - 28% faster
  • we can also do tighter limits on memory and time (also during compilation of module)
  • we can use real fork(2) for forking search

The worker would first compile the WASM module (if needed) and then run it. The worker is for a single WASM module only.

Things to sort out:

  • communication mechanism (possibly existing MessageChannel - name it after process PID, delete first)
  • is it somehow possible to drop some privileges before running WASM?

better folder structure

Proposed structure:

top
    README.md
    aici.sh

top (compliance)
    CODE_OF_CONDUCT.md
    LICENSE
    NOTICE.md
    SECURITY.md
    SUPPORT.md
    TRANSPARENCY.md

top (rust)
    Cargo.lock
    Cargo.toml
    rustfmt.toml
    target/

top (python)
    pytest.ini

docs/ ?
    REST.md
    proxy.md (move to README.md)

aici/
    aici_abi/
    aicirt/

controllers/
    declctrl/
    jsctrl/
    pyctrl/
    uppercase/

rllm/
    rllm-cpp/ -> rllm-llama-cpp/
    rllm-cuda/
    rllm-lib/
    llama-cpp-low/
    tch-cuda/

scripts/
    harness/ -> py/

py/
    promptlib/
    pyaici/
    tests/
    vllm/
    setup.py

scheduler

Investigate what kinds of limits the scheduler should enforce - number of tokens, number of KV entries. What latency should it target for a request?

backtracking and fast-forward in vllm

fast-forward (FF) - generating multiple zero-entropy tokens, extending existing generation (or prompt)

SequenceGroup (SG) - a bunch of sequences being generated from a single user request, sharing the prompt; they can only be swapped in and out as a whole

  • we may want to create a new SG when the WASM module requests a fork
  • each model forward pass is either for all prompt tokens or all generation tokens; this seems deeply ingrained in the CUDA kernels
  • it's unclear how to FF - it may be easier to create a new SG and recompute everything?
  • backtracking should be easy (just cut some variable and free blocks), but FF typically follows, so it's unclear if this is useful
  • the KV cache sits in blocks; each block is 16 tokens long (there is maybe 1 megabyte (± an order of magnitude) of data per token, though)

async side channel comms

The side MessageChannel is used to compile modules. It should also be used to initialize the recognizer outside of any steps.

It also needs to be made async.

add more lr grammars

  • latex
  • markdown
  • typst
  • rust
  • ipynb
  • python
  • js, ts, html, css, json, xml, toml, etc. formats

Warning(s) when starting rLLM-cpp

Steps

cd rllm-cpp
./cpp-server.sh phi2

Result

  • The server builds and starts
  • The log contains one warning: WARN [llama_cpp_low] llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 )
  • This log entry looks like a warning but is logged as INFO 7 times: INFO [hf_hub] Token file not found "/Users/tester/.cache/huggingface/token"

System

  • system: macOS 14.3, Apple M3
  • cargo 1.75.0
  • cmake version 3.28.2
  • ccache version 4.9

prompt sharing for faster page attn

Right now (validate this!) the paged attn kernel doesn't take advantage of the fact that a significant part of the prompt may be shared between many queries - probably the kernel could be modified to only read these prompt entries once.

integrate with llama.cpp

It seems it would be relatively easy. The idea would be to implement a Rust server that calls into llama.cpp library.

The following Rust crates implement bindings:

and probably more. One of these will likely need to be updated to expose the needed llama.cpp APIs:

  • llama_kv_cache_seq_rm
  • llama_kv_cache_seq_cp
  • batch APIs
  • sampling APIs

With these it looks like we can do forking, backtracking, fast-forwarding, and also constraints.

rich AICI interface

Draft here:

pub type TokenId = u32;

pub enum AiciInstruction {
    /// Stop the current sequence.
    /// Similar to strong bias to EOS.
    Stop,

    /// Sample next token in the current sequence, using given bias.
    /// `bias.len()` is size of vocabulary.
    LogitBias { bias: Vec<f32> },

    /// First pop `backtrack` tokens,
    /// then force next tokens to be generated to be `ff_tokens`.
    /// `backtrack` can be 0, and `ff_tokens` can be empty but not both.
    Splice {
        backtrack: u32,
        ff_tokens: Vec<TokenId>,
    },

    /// Fork the current sequence into `num_children` sequences (including current one).
    /// `resume_fork(0)` will be called on this VM, while children will be resumed
    /// with `resume_fork(1)` ... `resume_fork(num_children - 1)`
    /// (thus, `Fork {1}` will not create any new sequences).
    Fork { num_children: u32 },

    /// Wait until all listed variables are available for reading,
    /// and all listed sequences have finished executing.
    WaitAll {
        variables: Vec<(SeqId, String)>,
        finished: Vec<SeqId>,
    },
}

pub struct SeqId {
    // ...
}

pub trait AiciHost {
    /// Log a string.
    fn println(&self, msg: &str);

    /// Initialize a new instances of TokTrie, which can iterate over all tokens,
    /// turn token-ids to byte buffers, give vocab size, id of EOS token etc.
    fn read_token_trie(&self) -> TokTrie;

    /// Read argument passed by the user (typically JSON).
    fn read_arg(&self) -> Vec<u8>;

    /// Tokenize given UTF8 string.
    fn tokenize(&self, s: &str) -> Vec<TokenId>;

    /// Return Id of the current sequence.
    fn sequence_id(&self) -> SeqId;

    /// Return list of all sequences that are currently running.
    fn running_sequences(&self) -> Vec<SeqId>;

    /// Read variable from a given sequence.
    fn read_variable(&self, seq: SeqId, name: &str) -> Option<Vec<u8>>;

    /// Write variable in the current sequence.
    fn write_variable(&self, name: &str, val: &[u8]);
}

pub trait AiciHighLevelVm {
    /// Create a new instance of constraint.
    fn new(host: impl AiciHost) -> Self;

    /// Executed once, at the beginning.
    fn process_prompt(&mut self, prompt: Vec<TokenId>) -> AiciInstruction;

    /// Executed for every sampled token.
    fn append_token(&mut self, token: TokenId) -> AiciInstruction;

    /// Executed in response to AiciInstruction::Fork.
    fn resume_fork(&mut self, seqs_in_group: Vec<SeqId>, self_id: SeqId) -> AiciInstruction;
}

SeqId should start at 0 for every group

Right now the seq id returned in aici_host_self_seq_id(), and then via the streaming interface, is global to the server. This allows someone to figure out how much a server is used, which is something we may want to hide.
