microsoft / aici
AICI: Prompts as (Wasm) Programs
License: MIT License
Both in ModuleRegistry and Stepper: if unused for too long, just delete them. This will help with latency.
We currently fail here for simple regexes like '\d*', which match the empty string:
impl FunctionalRecognizer<RecRxState> for RecRx {
    fn initial(&self) -> RecRxState {
        self.dfa
            .universal_start_state(regex_automata::Anchored::Yes)
            .expect("dfa has no universal start state; make sure it doesn't match empty string")
    }
}
Right now it's imported by comms.py, which is imported from __init__.py.
If two reqs try to add the same module in parallel, make sure we get the right result from fs::write()
stop_at is a string, not a token; it can occur anywhere inside a token. Do we keep the final token?
And also setup.py
Docs:
WIP for shm only: win.patch
The logits tensor is float16, and we use -100 to ban a token. A temperature setting below around 0.0003 causes overflow (presumably because dividing the float16 logits by such a small temperature exceeds the float16 maximum of 65504, producing inf) and the following crash:
File "/workspaces/aici/vllm/vllm/model_executor/layers/sampler.py", line 409, in _sample
parent_seq_ids, next_token_ids = _sample_from_generation_tokens(
File "/workspaces/aici/vllm/vllm/model_executor/layers/sampler.py", line 356, in _sample_from_generation_tokens
next_token_ids = torch.multinomial(probs,
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
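A rough sketch of the arithmetic (assuming the half crate for f16; where exactly vllm performs the temperature division may differ):

use half::f16; // external `half` crate, used here only to illustrate the overflow

fn main() {
    // An ordinary logit (~20) divided by temperature 0.0003 exceeds f16::MAX (65504).
    let scaled = f16::from_f32(20.0 / 0.0003);
    assert!(scaled.is_infinite());
    // The -100 ban value overflows at even larger temperatures (anything below ~0.0015).
    let banned = f16::from_f32(-100.0 / 0.0003);
    assert!(banned.is_infinite() && banned.is_sign_negative());
    println!("scaled = {scaled}, banned = {banned}");
}

Once any scaled logit becomes inf, the softmax in the sampler produces NaN, which is what torch.multinomial then rejects.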
Right now (validate this!) the paged attention kernel doesn't take advantage of the fact that a significant part of the prompt may be shared between many queries; the kernel could probably be modified to read these shared prompt entries only once.
Proposed structure:
top
  README.md
  aici.sh
top (compliance)
  CODE_OF_CONDUCT.md
  LICENSE
  NOTICE.md
  SECURITY.md
  SUPPORT.md
  TRANSPARENCY.md
top (rust)
  Cargo.lock
  Cargo.toml
  rustfmt.toml
  target/
top (python)
  pytest.ini
docs/ ?
  REST.md
  proxy.md (move to README.md)
aici/
  aici_abi/
  aicirt/
controllers/
  declctrl/
  jsctrl/
  pyctrl/
  uppercase/
rllm/
  rllm-cpp/ -> rllm-llama-cpp/
  rllm-cuda/
  rllm-lib/
  llama-cpp-low/
  tch-cuda/
scripts/
harness/ -> py/
py/
  promptlib/
  pyaici/
  tests/
  vllm/
  setup.py
In particular I want pyvm-latest and pyvm-2023-12-23-08-12 kind of tags so people don't have to upload their own.
Draft here:
pub type TokenId = u32;

pub enum AiciInstruction {
    /// Stop the current sequence.
    /// Similar to a strong bias to EOS.
    Stop,

    /// Sample the next token in the current sequence, using the given bias.
    /// `bias.len()` is the size of the vocabulary.
    LogitBias { bias: Vec<f32> },

    /// First pop `backtrack` tokens,
    /// then force the next tokens to be generated to be `ff_tokens`.
    /// `backtrack` can be 0, and `ff_tokens` can be empty, but not both.
    Splice {
        backtrack: u32,
        ff_tokens: Vec<TokenId>,
    },

    /// Fork the current sequence into `num_children` sequences (including the current one).
    /// `resume_fork(0)` will be called on this VM, while children will be resumed
    /// with `resume_fork(1)` ... `resume_fork(num_children - 1)`
    /// (thus, `Fork { num_children: 1 }` will not create any new sequences).
    Fork { num_children: u32 },

    /// Wait until all listed variables are available for reading,
    /// and all listed sequences have finished executing.
    WaitAll {
        variables: Vec<(SeqId, String)>,
        finished: Vec<SeqId>,
    },
}

pub struct SeqId {
    // ...
}

pub trait AiciHost {
    /// Log a string.
    fn println(&self, msg: &str);
    /// Initialize a new instance of TokTrie, which can iterate over all tokens,
    /// turn token ids into byte buffers, give vocab size, id of the EOS token, etc.
    fn read_token_trie(&self) -> TokTrie;
    /// Read the argument passed by the user (typically JSON).
    fn read_arg(&self) -> Vec<u8>;
    /// Tokenize a given UTF-8 string.
    fn tokenize(&self, s: &str) -> Vec<TokenId>;
    /// Return the id of the current sequence.
    fn sequence_id(&self) -> SeqId;
    /// Return the list of all sequences that are currently running.
    fn running_sequences(&self) -> Vec<SeqId>;
    /// Read a variable from a given sequence.
    fn read_variable(&self, seq: SeqId, name: &str) -> Option<Vec<u8>>;
    /// Write a variable in the current sequence.
    fn write_variable(&self, name: &str, val: &[u8]);
}

pub trait AiciHighLevelVm {
    /// Create a new instance of the constraint.
    fn new(host: impl AiciHost) -> Self;
    /// Executed once, at the beginning.
    fn process_prompt(&mut self, prompt: Vec<TokenId>) -> AiciInstruction;
    /// Executed for every sampled token.
    fn append_token(&mut self, token: TokenId) -> AiciInstruction;
    /// Executed in response to AiciInstruction::Fork.
    fn resume_fork(&mut self, seqs_in_group: Vec<SeqId>, self_id: SeqId) -> AiciInstruction;
}
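To make the draft concrete, here is a minimal sketch (not part of the draft; the ForceText name and its logic are illustrative) of a controller implementing the proposed trait: it fast-forwards a fixed string into the sequence and then stops.

struct ForceText {
    ff: Vec<TokenId>,
}

impl AiciHighLevelVm for ForceText {
    fn new(host: impl AiciHost) -> Self {
        // The forced text is hard-coded here; a real controller would read it via host.read_arg().
        let ff = host.tokenize("Hello, world!");
        ForceText { ff }
    }

    fn process_prompt(&mut self, _prompt: Vec<TokenId>) -> AiciInstruction {
        // Append the pre-tokenized text without sampling.
        AiciInstruction::Splice {
            backtrack: 0,
            ff_tokens: self.ff.clone(),
        }
    }

    fn append_token(&mut self, _token: TokenId) -> AiciInstruction {
        // Once the forced tokens have been reported back, end the sequence.
        AiciInstruction::Stop
    }

    fn resume_fork(&mut self, _seqs_in_group: Vec<SeqId>, _self_id: SeqId) -> AiciInstruction {
        // This controller never issues Fork, so there is nothing to resume.
        AiciInstruction::Stop
    }
}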
Investigate what kind of limits the scheduler should enforce - number of tokens, number of KV-entries. What latency should it target for a request?
fast-forward (FF) - generating multiple zero-entropy tokens, extending existing generation (or prompt)
SequenceGroup (SG) - a bunch of sequences being generated from a single user req, sharing the prompt; they can be only swapped in and out as a whole
It seems it would be relatively easy. The idea would be to implement a Rust server that calls into llama.cpp library.
The following Rust crates implement bindings:
and probably more. Either one of these needs to be updated (likely) to expose the needed llama.cpp APIs:
With these it looks like we can do forking, backtracking, fast-forwarding, and also constraints.
"pre_process" and "post_process" have been combined into "post_pre_process"
Check if vllm does any pipelining - if so, aicirt will have trouble, as it assumes one set of requests at a time.
The side MessageChannel is used to compile modules. It should also be used to initialize the recognizer outside of any steps.
It also needs to be made async.
also explain how to use 'aici jsinit'
Advantages:
The worker would first compile the WASM module (if needed) and then run it. The worker is for a single WASM module only.
Things to sort out:
MessageChannel (name it after process PID, delete first)

In order to protect and secure Microsoft, private or internal repositories in GitHub for Open Source which are not related to open source projects or require collaboration with 3rd parties (customers, partners, etc.) must be migrated to GitHub inside Microsoft, a.k.a. GitHub Enterprise Cloud with Enterprise Managed Users (GHEC EMU).
✍️ Please RSVP to opt-in or opt-out of the migration to GitHub inside Microsoft.
❗ Only users with admin permission in the repository are allowed to respond. Failure to provide a response will result in your repository getting automatically archived. 🔒
Reply with a comment on this issue containing one of the following optin or optout command options below.
✅ Opt-in to migrate
@gimsvc optin --date <target_migration_date in mm-dd-yyyy format>
Example:
@gimsvc optin --date 03-15-2023
OR
❌ Opt-out of migration
@gimsvc optout --reason <staging|collaboration|delete|other>
Example:
@gimsvc optout --reason staging
Options:
staging: This repository will ship as Open Source or go public
collaboration: Used for external or 3rd party collaboration with customers, partners, suppliers, etc.
delete: This repository will be deleted because it is no longer needed.
other: Other reasons not specified
This: aici/promptlib/promptlib/aici.py, line 27 in 5fd57e4
Pyvm wouldn't require cuda
Rllm would be quicker to build than vllm
Vllm would allow building all
pre = softmax(logits)
post = softmax(logits + bias)
dropped = sum(max(0.0, pre[i] - post[i]) for i in range(len(post)))
If dropped is close to 1, we're going against the model.
The timeout is 1000ms, and we seem to be hitting it sometimes on macOS M3
cc @dluc
latex
markdown
typst
rust
ipynb
python
js, ts, html, css, json, xml, toml, and other formats
Possibly via POST /v1/run?controller=xyz
?
We should use a trait for the Tensor and Model types. Right now it uses cfg().
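A minimal sketch of what that could look like (the Backend trait and its methods here are illustrative, not the actual rllm types): engine code becomes generic over a backend trait instead of being gated by cfg().

pub trait Backend {
    type Tensor;
    type Model;
    fn load_model(&self, path: &str) -> Self::Model;
    fn logits(&self, model: &Self::Model, tokens: &[u32]) -> Self::Tensor;
}

// One impl per backend (e.g. tch/CUDA, llama.cpp); the rest of the engine
// takes a generic `B: Backend` instead of switching on cfg().
struct DummyBackend;

impl Backend for DummyBackend {
    type Tensor = Vec<f32>;
    type Model = ();
    fn load_model(&self, _path: &str) -> Self::Model {}
    fn logits(&self, _model: &Self::Model, tokens: &[u32]) -> Self::Tensor {
        vec![0.0; tokens.len()]
    }
}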
cd rllm-cpp
./cpp-server.sh phi2
WARN [llama_cpp_low] llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 )
INFO [hf_hub] Token file not found "/Users/tester/.cache/huggingface/token"
Right now we take one "file". We should allow multiple files, to allow for sub-modules and arguments.
The README links to https://github.com/microsoft/aici/tree/main/aici_ast_runner/c.y but the file does not exist in the repository
For example, when wasmtime version is updated.
https://docs.rs/wasmtime/latest/wasmtime/struct.Engine.html#method.detect_precompiled
Right now we extend OpenAI completion.
Similar to max_tokens, but in different units.
Turn the LLM output into a string, re-encode it, and compare token ids.
vllm uses the max number of possible forks in a sequence group for scheduling; that max should also be limited.
While at it, also measure mem transfer speed and see how many KV entries can be transferred in a single inference round
a dockerized application, reproducibly built from the output of the generative script of the REPL agent, with passed parameters and ENV variables
the ability to spawn, on demand, instances of such applications' REPL environments (for example, docker's interactive bash) as conversational models
one_of = ["foo", "bar", "baz"]
Similar to rx = r"foo|bar|baz", but may be tokenized more efficiently.
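A rough sketch of why (the tokenize signature is hypothetical, standing in for the host tokenizer): with one_of, each alternative can be tokenized once up front, so the controller can follow whole token sequences instead of walking a character-level regex automaton.

// Pre-tokenize each alternative; generation can then fast-forward or bias
// along these token sequences directly.
fn one_of_as_token_seqs(tokenize: impl Fn(&str) -> Vec<u32>, options: &[&str]) -> Vec<Vec<u32>> {
    options.iter().map(|&s| tokenize(s)).collect()
}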
Compiling rquickjs v0.4.0 (https://github.com/mmoskal/rquickjs?rev=5b0e3b24d5021d3cd4981d3693fd7bd1a106314c#5b0e3b24)
Finished release [optimized + debuginfo] target(s) in 35.89s
built: /Users/markus/src/aici/target/wasm32-wasi/release/aici_jsctrl.wasm, 3.14 MiB
upload module... 3213kB -> 10087kB id:3508abaf
[0]: FIXED "Ultimate answer is to the life, universe and everything is "
[0]: GEN-OPT {regex: /\d\d/}
[0]: regex constraint: "\\d\\d"
[0]: dfa: 160 bytes
[0]: GEN ""
[0]: JsCtrl: done
[DONE]
[Response] Ultimate answer is to the life, universe and everything is
response saved to tmp/response.json
Usage: {'sampled_tokens': 21, 'ff_tokens': 32, 'cost': 74}
Storage: {}
Right now the seq id returned by aici_host_self_seq_id(), and then via the streaming interface, is global to the server. This allows someone to figure out how much a server is used, which is something we may want to hide.
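One possible way to hide this (an illustrative sketch, not the current implementation): remap global sequence ids to request-local ids before they are exposed over the streaming interface.

use std::collections::HashMap;

// Maps server-global sequence ids to small ids that are local to one request.
struct SeqIdRemap {
    next: u32,
    map: HashMap<u32, u32>,
}

impl SeqIdRemap {
    fn new() -> Self {
        SeqIdRemap { next: 0, map: HashMap::new() }
    }

    fn local(&mut self, global: u32) -> u32 {
        if let Some(&id) = self.map.get(&global) {
            return id;
        }
        let id = self.next;
        self.next += 1;
        self.map.insert(global, id);
        id
    }
}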