
aici's Issues

rx matching empty string

We currently fail here for a simple rx like '\d*':

impl FunctionalRecognizer<RecRxState> for RecRx {
    fn initial(&self) -> RecRxState {
        self.dfa
            .universal_start_state(regex_automata::Anchored::Yes)
            .expect("dfa has no universal start state; make sure it doesn't match empty string")
    }
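
One possible direction (a sketch only, not verified against regex_automata's semantics): when `universal_start_state` returns `None`, fall back to the anchored start state for the beginning of a haystack, since the recognizer always starts matching at position 0. The helper below and the `dense::DFA<Vec<u32>>` type are assumptions about the surrounding code; check the current regex_automata API before relying on `start_state_forward` here.

use regex_automata::{
    dfa::{dense, Automaton},
    util::primitives::StateID,
    Anchored, Input,
};

// Sketch: avoid panicking for patterns like '\d*' that have no universal start state.
fn initial_state(dfa: &dense::DFA<Vec<u32>>) -> StateID {
    dfa.universal_start_state(Anchored::Yes).unwrap_or_else(|| {
        // The recognizer always begins at the start of the text, so the anchored
        // start state for an empty haystack is a plausible fallback.
        dfa.start_state_forward(&Input::new("").anchored(Anchored::Yes))
            .expect("cannot compute anchored start state")
    })
}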

very low temperature causes crash in sampling

The logits tensor is float16 and we use -100 to ban a token. Setting the temperature below around 0.0003 causes an overflow (the temperature-scaled ban value far exceeds float16's finite range of about ±65504) and the following crash:

  File "/workspaces/aici/vllm/vllm/model_executor/layers/sampler.py", line 409, in _sample
    parent_seq_ids, next_token_ids = _sample_from_generation_tokens(
  File "/workspaces/aici/vllm/vllm/model_executor/layers/sampler.py", line 356, in _sample_from_generation_tokens
    next_token_ids = torch.multinomial(probs,
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
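
For reference, a standalone illustration of the float16 saturation (not vllm's actual code path; assumes the half crate is available as a dependency):

use half::f16;

fn main() {
    let ban_logit = -100.0f32;
    for temp in [0.01f32, 0.002, 0.0003] {
        // Simulate storing the temperature-scaled logit back into a float16 tensor.
        let scaled = f16::from_f32(ban_logit / temp);
        println!("temp={temp}: scaled ban logit = {scaled}");
    }
    // float16's largest finite magnitude is 65504, so the last value saturates to -inf.
}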

prompt sharing for faster page attn

Right now (validate this!) the paged attention kernel doesn't take advantage of the fact that a significant part of the prompt may be shared between many queries; the kernel could probably be modified to read these shared prompt entries only once.

better folder structure

Proposed structure:

top
    README.md
    aici.sh

top (compliance)
    CODE_OF_CONDUCT.md
    LICENSE
    NOTICE.md
    SECURITY.md
    SUPPORT.md
    TRANSPARENCY.md

top (rust)
    Cargo.lock
    Cargo.toml
    rustfmt.toml
    target/

top (python)
    pytest.ini

docs/ ?
    REST.md
    proxy.md (move to README.md)

aici/
    aici_abi/
    aicirt/

controllers/
    declctrl/
    jsctrl/
    pyctrl/
    uppercase/

rllm/
    rllm-cpp/ -> rllm-llama-cpp/
    rllm-cuda/
    rllm-lib/
    llama-cpp-low/
    tch-cuda/

scripts/
    harness/ -> py/

py/
    promptlib/
    pyaici/
    tests/
    vllm/
    setup.py

Allow tagging wasm modules

In particular, I want tags like pyvm-latest and pyvm-2023-12-23-08-12 so people don't have to upload their own modules.
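
A rough sketch of what a tagging request could look like (struct and field names are made up, not the actual interface; assumes serde for serialization):

use serde::{Deserialize, Serialize};

// Hypothetical tagging request: attach human-readable tags to an already-uploaded module.
#[derive(Serialize, Deserialize)]
struct TagRequest {
    /// Module id returned from upload.
    module_id: String,
    /// Tags to attach, e.g. ["pyvm-latest", "pyvm-2023-12-23-08-12"].
    tags: Vec<String>,
}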

rich AICI interface

Draft here:

pub type TokenId = u32;

pub enum AiciInstruction {
    /// Stop the current sequence.
    /// Similar to strong bias to EOS.
    Stop,

    /// Sample next token in the current sequence, using given bias.
    /// `bias.len()` is the size of the vocabulary.
    LogitBias { bias: Vec<f32> },

    /// First pop `backtrack` tokens,
    /// then force next tokens to be generated to be `ff_tokens`.
    /// `backtrack` can be 0, and `ff_tokens` can be empty but not both.
    Splice {
        backtrack: u32,
        ff_tokens: Vec<TokenId>,
    },

    /// Fork the current sequence into `num_children` sequences (including current one).
    /// `resume_fork(0)` will be called on this VM, while children will be resumed
    /// with `resume_fork(1)` ... `resume_fork(num_children - 1)`
    /// (thus, `Fork {1}` will not create any new sequences).
    Fork { num_children: u32 },

    /// Wait until all listed variables are available for reading,
    /// and all listed sequences have finished executing.
    WaitAll {
        variables: Vec<(SeqId, String)>,
        finished: Vec<SeqId>,
    },
}

pub struct SeqId {
    // ...
}

pub trait AiciHost {
    /// Log a string.
    fn println(&self, msg: &str);

    /// Initialize a new instance of TokTrie, which can iterate over all tokens,
    /// turn token-ids to byte buffers, give vocab size, id of EOS token etc.
    fn read_token_trie(&self) -> TokTrie;

    /// Read argument passed by the user (typically JSON).
    fn read_arg(&self) -> Vec<u8>;

    /// Tokenize given UTF8 string.
    fn tokenize(&self, s: &str) -> Vec<TokenId>;

    /// Return Id of the current sequence.
    fn sequence_id(&self) -> SeqId;

    /// Return list of all sequences that are currently running.
    fn running_sequences(&self) -> Vec<SeqId>;

    /// Read variable from a given sequence.
    fn read_variable(&self, seq: SeqId, name: &str) -> Option<Vec<u8>>;

    /// Write variable in the current sequence.
    fn write_variable(&self, name: &str, val: &[u8]);
}

pub trait AiciHighLevelVm {
    /// Create a new instance of constraint.
    fn new(host: impl AiciHost) -> Self;

    /// Executed once, at the beginning.
    fn process_prompt(&mut self, prompt: Vec<TokenId>) -> AiciInstruction;

    /// Executed for every sampled token.
    fn append_token(&mut self, token: TokenId) -> AiciInstruction;

    /// Executed in response to AiciInstruction::Fork.
    fn resume_fork(&mut self, seqs_in_group: Vec<SeqId>, self_id: SeqId) -> AiciInstruction;
}
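
To make the shape of the interface concrete, here is a toy controller written against the draft above (the struct, the fixed string, and the exact callback sequencing are illustrative assumptions, not part of the proposal): it forces a fixed completion and then stops.

pub struct FixedOutput {
    ff_tokens: Vec<TokenId>,
}

impl AiciHighLevelVm for FixedOutput {
    fn new(host: impl AiciHost) -> Self {
        // Tokenize the fixed completion up front; the host is not needed afterwards.
        let ff_tokens = host.tokenize("Hello, world!");
        FixedOutput { ff_tokens }
    }

    fn process_prompt(&mut self, _prompt: Vec<TokenId>) -> AiciInstruction {
        // Force the whole fixed token sequence in one splice (no backtracking).
        AiciInstruction::Splice {
            backtrack: 0,
            ff_tokens: self.ff_tokens.clone(),
        }
    }

    fn append_token(&mut self, _token: TokenId) -> AiciInstruction {
        // Once anything else is appended, end the sequence.
        AiciInstruction::Stop
    }

    fn resume_fork(&mut self, _seqs_in_group: Vec<SeqId>, _self_id: SeqId) -> AiciInstruction {
        // This controller never issues Fork, so just stop defensively.
        AiciInstruction::Stop
    }
}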

scheduler

Investigate what kind of limits the scheduler should enforce: number of tokens, number of KV-entries. What latency should it target for a request?
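
As a starting point for the investigation, the knobs could be grouped roughly like this (a sketch; names and units are made up):

// Hypothetical limits the scheduler could enforce per step / per request.
pub struct SchedulerLimits {
    /// Maximum number of tokens processed in a single forward pass.
    pub max_tokens_per_step: usize,
    /// Maximum number of KV-cache entries held across all running sequences.
    pub max_kv_entries: usize,
    /// Target end-to-end latency per request, in milliseconds.
    pub target_latency_ms: u64,
}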

backtracking and fast-forward in vllm

fast-forward (FF) - generating multiple zero-entropy tokens, extending existing generation (or prompt)

SequenceGroup (SG) - a bunch of sequences being generated from a single user request, sharing the prompt; they can only be swapped in and out as a whole

  • we may want to create a new SG when WASM module requests a fork
  • each model forward pass is either for all prompt tokens or all generation tokens; this seems deeply ingrained in the CUDA kernels
  • it's unclear how to FF - it may be easier to create a new SG and recompute everything?
  • backtracking should be easy (just cut some variables and free blocks), but FF typically follows, so it's unclear whether backtracking alone is useful (see the sketch after this list)
  • the KV cache sits in blocks; each block holds 16 tokens (though there is maybe 1 megabyte, give or take an order of magnitude, of data per token)
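
The sketch below (a hypothetical helper, not vllm code) shows the bookkeeping a combined backtrack + fast-forward splice implies for a sequence's tokens and its 16-token KV blocks:

const BLOCK_SIZE: usize = 16;

/// Apply `Splice { backtrack, ff_tokens }` to a sequence's token list and return
/// how many KV blocks the backtrack frees (the fast-forward tokens still need a
/// forward pass and may allocate new blocks).
fn apply_splice(tokens: &mut Vec<u32>, backtrack: usize, ff_tokens: &[u32]) -> usize {
    let blocks = |n: usize| (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    let blocks_before = blocks(tokens.len());
    tokens.truncate(tokens.len().saturating_sub(backtrack)); // drop KV entries
    let freed = blocks_before - blocks(tokens.len());
    tokens.extend_from_slice(ff_tokens); // to be computed in the next forward pass
    freed
}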

integrate with llama.cpp

It seems this would be relatively easy: the idea is to implement a Rust server that calls into the llama.cpp library.

Several Rust crates implement bindings, and probably more exist. One of these likely needs to be updated to expose the needed llama.cpp APIs:

  • llama_kv_cache_seq_rm
  • llama_kv_cache_seq_cp
  • batch APIs
  • sampling APIs

With these it looks like we can do forking, backtracking, fast-forwarding, and also constraints.
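
A rough sketch of how backtracking and forking could map onto those KV-cache calls (the FFI signatures below are approximated from llama.h at the time; verify them against the current header before relying on this):

// Crate-root attribute just to keep the C-style names from the header.
#![allow(non_camel_case_types)]

use std::os::raw::c_int;

pub type llama_seq_id = c_int;
pub type llama_pos = c_int;

#[repr(C)]
pub struct llama_context {
    _private: [u8; 0],
}

extern "C" {
    // Remove KV entries of `seq_id` in the position range [p0, p1).
    fn llama_kv_cache_seq_rm(ctx: *mut llama_context, seq_id: llama_seq_id, p0: llama_pos, p1: llama_pos);
    // Copy KV entries of `seq_id_src` to `seq_id_dst` in the position range [p0, p1).
    fn llama_kv_cache_seq_cp(
        ctx: *mut llama_context,
        seq_id_src: llama_seq_id,
        seq_id_dst: llama_seq_id,
        p0: llama_pos,
        p1: llama_pos,
    );
}

/// Backtracking: drop the last `n` tokens of a sequence from the KV cache.
unsafe fn backtrack(ctx: *mut llama_context, seq: llama_seq_id, cur_len: llama_pos, n: llama_pos) {
    llama_kv_cache_seq_rm(ctx, seq, cur_len - n, cur_len);
}

/// Forking: let `child` share the parent's KV entries for positions [0, cur_len).
unsafe fn fork_seq(ctx: *mut llama_context, parent: llama_seq_id, child: llama_seq_id, cur_len: llama_pos) {
    llama_kv_cache_seq_cp(ctx, parent, child, 0, cur_len);
}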

check on pipelining

Check if vllm does any pipelining; if so, aicirt will have trouble, as it assumes one set of requests at a time.

async side channel comms

The side MessageChannel is used to compile modules. It should also be used to initialize the recognizer outside of any steps.

It also needs to be made async.

use processes not threads for workers

Advantages:

  • OS-level Spectre mitigations
  • we can limit time externally (via kill(2)) and don't have to use epochs in WASM, which is about 28% faster
  • we can also enforce tighter limits on memory and time (including during module compilation)
  • we can use real fork(2) for forking search

Each worker handles a single WASM module: it would first compile the module (if needed) and then run it.
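
A minimal sketch of the external time limit (the worker binary name and the polling approach are assumptions):

use std::process::{Child, Command};
use std::thread::sleep;
use std::time::{Duration, Instant};

// Hypothetical worker binary that compiles and runs a single WASM module.
fn spawn_worker(wasm_path: &str) -> std::io::Result<Child> {
    Command::new("aici-worker").arg(wasm_path).spawn()
}

fn run_with_deadline(wasm_path: &str, limit: Duration) -> std::io::Result<()> {
    let mut child = spawn_worker(wasm_path)?;
    let start = Instant::now();
    loop {
        if let Some(status) = child.try_wait()? {
            println!("worker exited: {status}");
            return Ok(());
        }
        if start.elapsed() > limit {
            child.kill()?; // external time limit instead of WASM epoch interruption
            let _ = child.wait()?;
            println!("worker killed after {:?}", limit);
            return Ok(());
        }
        sleep(Duration::from_millis(10));
    }
}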

Things to sort out:

  • communication mechanism (possibly the existing MessageChannel; name it after the process PID and delete any existing one first)
  • is it somehow possible to drop some privileges before running WASM?

Action required: migrate or opt-out of migration to GitHub inside Microsoft

Migrate non-Open Source or non-External Collaboration repositories to GitHub inside Microsoft

In order to protect and secure Microsoft, private or internal repositories in GitHub for Open Source which are not related to open source projects and do not require collaboration with 3rd parties (customers, partners, etc.) must be migrated to GitHub inside Microsoft, a.k.a. GitHub Enterprise Cloud with Enterprise Managed Users (GHEC EMU).

Action

✍️ Please RSVP to opt-in or opt-out of the migration to GitHub inside Microsoft.

❗Only users with admin permission in the repository are allowed to respond. Failure to provide a response will result in your repository getting automatically archived.🔒

Instructions

Reply with a comment on this issue containing one of the optin or optout commands below.

✅ Opt-in to migrate

@gimsvc optin --date <target_migration_date in mm-dd-yyyy format>

Example: @gimsvc optin --date 03-15-2023

OR

❌ Opt-out of migration

@gimsvc optout --reason <staging|collaboration|delete|other>

Example: @gimsvc optout --reason staging

Options:

  • staging : This repository will ship as Open Source or go public
  • collaboration : Used for external or 3rd party collaboration with customers, partners, suppliers, etc.
  • delete : This repository will be deleted because it is no longer needed.
  • other : Other reasons not specified

Need more help? 🖐️

add more lr grammars

  • latex
  • markdown
  • typst
  • rust
  • ipynb
  • python
  • js, ts, html, css, json, xml, toml, and other such formats

Warning(s) when starting rLLM-cpp

Steps

cd rllm-cpp
./cpp-server.sh phi2

Result

  • The server builds and starts
  • The log contains one warning: WARN [llama_cpp_low] llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 )
  • This log entry looks like a warning but is logged as INFO 7 times: INFO [hf_hub] Token file not found "/Users/tester/.cache/huggingface/token"

System

  • system: macOS 14.3, Apple M3
  • cargo 1.75.0
  • cmake version 3.28.2
  • ccache version 4.9

add "prompted virtualized repl agent" as pseudo model

A dockerized application, reproducibly built from the output of the REPL agent's generative script given the passed parameters and ENV variables.

The ability to spawn instances of such applications' REPL environments on demand (for example, docker's interactive bash) and use them as conversational models.

llama CPU phi2 model wrong generation

   Compiling rquickjs v0.4.0 (https://github.com/mmoskal/rquickjs?rev=5b0e3b24d5021d3cd4981d3693fd7bd1a106314c#5b0e3b24)

    Finished release [optimized + debuginfo] target(s) in 35.89s

built: /Users/markus/src/aici/target/wasm32-wasi/release/aici_jsctrl.wasm, 3.14 MiB

upload module... 3213kB -> 10087kB id:3508abaf

[0]: FIXED "Ultimate answer is to the life, universe and everything is "

[0]: GEN-OPT {regex: /\d\d/}

[0]: regex constraint: "\\d\\d"

[0]: dfa: 160 bytes

[0]: GEN "𝫘𝫘𝫘𝫘𝫘𝫘𝫘𝫘𝫘𝫘"

[0]: JsCtrl: done

[DONE]

[Response] Ultimate answer is to the life, universe and everything is 𝫘𝫘𝫘𝫘𝫘𝫘𝫘𝫘𝫘𝫘
 
response saved to tmp/response.json

Usage: {'sampled_tokens': 21, 'ff_tokens': 32, 'cost': 74}

Storage: {}

SeqId should start at 0 for every group

Right now the seq id returned by aici_host_self_seq_id(), and then exposed via the streaming interface, is global to the server. This lets someone figure out how heavily a server is used, which is something we may want to hide.
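
A possible fix is to remap the global ids to group-local ones before they leave the server; a minimal sketch (names hypothetical):

use std::collections::HashMap;

/// Per-SequenceGroup remapping from server-global sequence ids to local ids 0, 1, 2, ...
struct SeqIdMapper {
    next_local: u32,
    global_to_local: HashMap<u32, u32>,
}

impl SeqIdMapper {
    fn new() -> Self {
        SeqIdMapper {
            next_local: 0,
            global_to_local: HashMap::new(),
        }
    }

    /// Return the group-local id for a global sequence id, assigning ids in the
    /// order sequences are first seen within the group.
    fn local_id(&mut self, global: u32) -> u32 {
        if let Some(&id) = self.global_to_local.get(&global) {
            return id;
        }
        let id = self.next_local;
        self.next_local += 1;
        self.global_to_local.insert(global, id);
        id
    }
}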
