microsoft / aici
AICI: Prompts as (Wasm) Programs
License: MIT License
Both in ModuleRegistry and Stepper: if unused for too long, just delete them. This will help with latency.
We currently fail here for simple regexes like '\d*', which match the empty string:
impl FunctionalRecognizer<RecRxState> for RecRx {
    fn initial(&self) -> RecRxState {
        self.dfa
            .universal_start_state(regex_automata::Anchored::Yes)
            .expect("dfa has no universal start state; make sure it doesn't match empty string")
    }
}
Right now it's imported by comms.py, which is imported from __init__.py.
If two reqs try to add the same module in parallel, make sure we get the right result from fs::write()
stop_at is a string, not a token; it can occur anywhere inside a token. Do we keep the final token?
And also setup.py
Docs:
WIP for shm only: win.patch
The logits tensor is float16, and we use -100 to ban a token. A temperature setting below around 0.0003 causes overflow (presumably because dividing the float16 logits by such a small temperature exceeds the float16 maximum of 65504, producing inf) and the following crash:
File "/workspaces/aici/vllm/vllm/model_executor/layers/sampler.py", line 409, in _sample
parent_seq_ids, next_token_ids = _sample_from_generation_tokens(
File "/workspaces/aici/vllm/vllm/model_executor/layers/sampler.py", line 356, in _sample_from_generation_tokens
next_token_ids = torch.multinomial(probs,
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
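A rough sketch of the arithmetic (assuming the half crate for f16; where exactly vllm performs the temperature division may differ):

use half::f16; // external `half` crate, used here only to illustrate the overflow

fn main() {
    // An ordinary logit (~20) divided by temperature 0.0003 exceeds f16::MAX (65504).
    let scaled = f16::from_f32(20.0 / 0.0003);
    assert!(scaled.is_infinite());
    // The -100 ban value overflows at even larger temperatures (anything below ~0.0015).
    let banned = f16::from_f32(-100.0 / 0.0003);
    assert!(banned.is_infinite() && banned.is_sign_negative());
    println!("scaled = {scaled}, banned = {banned}");
}

Once any scaled logit becomes inf, the softmax in the sampler produces NaN, which is what torch.multinomial then rejects.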
Right now (validate this!) the paged attention kernel doesn't take advantage of the fact that a significant part of the prompt may be shared between many queries; the kernel could probably be modified to read these shared prompt entries only once.
Proposed structure:
top
  README.md
  aici.sh
top (compliance)
  CODE_OF_CONDUCT.md
  LICENSE
  NOTICE.md
  SECURITY.md
  SUPPORT.md
  TRANSPARENCY.md
top (rust)
  Cargo.lock
  Cargo.toml
  rustfmt.toml
  target/
top (python)
  pytest.ini
docs/ ?
  REST.md
  proxy.md (move to README.md)
aici/
  aici_abi/
  aicirt/
controllers/
  declctrl/
  jsctrl/
  pyctrl/
  uppercase/
rllm/
  rllm-cpp/ -> rllm-llama-cpp/
  rllm-cuda/
  rllm-lib/
  llama-cpp-low/
  tch-cuda/
scripts/
harness/ -> py/
py/
  promptlib/
  pyaici/
  tests/
  vllm/
  setup.py
In particular I want pyvm-latest and pyvm-2023-12-23-08-12 kind of tags so people don't have to upload their own.
Draft here:
pub type TokenId = u32;

pub enum AiciInstruction {
    /// Stop the current sequence.
    /// Similar to a strong bias to EOS.
    Stop,

    /// Sample the next token in the current sequence, using the given bias.
    /// `bias.len()` is the size of the vocabulary.
    LogitBias { bias: Vec<f32> },

    /// First pop `backtrack` tokens,
    /// then force the next tokens to be generated to be `ff_tokens`.
    /// `backtrack` can be 0, and `ff_tokens` can be empty, but not both.
    Splice {
        backtrack: u32,
        ff_tokens: Vec<TokenId>,
    },

    /// Fork the current sequence into `num_children` sequences (including the current one).
    /// `resume_fork(0)` will be called on this VM, while children will be resumed
    /// with `resume_fork(1)` ... `resume_fork(num_children - 1)`
    /// (thus, `Fork { num_children: 1 }` will not create any new sequences).
    Fork { num_children: u32 },

    /// Wait until all listed variables are available for reading,
    /// and all listed sequences have finished executing.
    WaitAll {
        variables: Vec<(SeqId, String)>,
        finished: Vec<SeqId>,
    },
}

pub struct SeqId {
    // ...
}

pub trait AiciHost {
    /// Log a string.
    fn println(&self, msg: &str);
    /// Initialize a new instance of TokTrie, which can iterate over all tokens,
    /// turn token ids into byte buffers, give vocab size, id of the EOS token, etc.
    fn read_token_trie(&self) -> TokTrie;
    /// Read the argument passed by the user (typically JSON).
    fn read_arg(&self) -> Vec<u8>;
    /// Tokenize a given UTF-8 string.
    fn tokenize(&self, s: &str) -> Vec<TokenId>;
    /// Return the id of the current sequence.
    fn sequence_id(&self) -> SeqId;
    /// Return the list of all sequences that are currently running.
    fn running_sequences(&self) -> Vec<SeqId>;
    /// Read a variable from a given sequence.
    fn read_variable(&self, seq: SeqId, name: &str) -> Option<Vec<u8>>;
    /// Write a variable in the current sequence.
    fn write_variable(&self, name: &str, val: &[u8]);
}

pub trait AiciHighLevelVm {
    /// Create a new instance of the constraint.
    fn new(host: impl AiciHost) -> Self;
    /// Executed once, at the beginning.
    fn process_prompt(&mut self, prompt: Vec<TokenId>) -> AiciInstruction;
    /// Executed for every sampled token.
    fn append_token(&mut self, token: TokenId) -> AiciInstruction;
    /// Executed in response to AiciInstruction::Fork.
    fn resume_fork(&mut self, seqs_in_group: Vec<SeqId>, self_id: SeqId) -> AiciInstruction;
}
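To make the draft concrete, here is a minimal sketch (not part of the draft; the ForceText name and its logic are illustrative) of a controller implementing the proposed trait: it fast-forwards a fixed string into the sequence and then stops.

struct ForceText {
    ff: Vec<TokenId>,
}

impl AiciHighLevelVm for ForceText {
    fn new(host: impl AiciHost) -> Self {
        // The forced text is hard-coded here; a real controller would read it via host.read_arg().
        let ff = host.tokenize("Hello, world!");
        ForceText { ff }
    }

    fn process_prompt(&mut self, _prompt: Vec<TokenId>) -> AiciInstruction {
        // Append the pre-tokenized text without sampling.
        AiciInstruction::Splice {
            backtrack: 0,
            ff_tokens: self.ff.clone(),
        }
    }

    fn append_token(&mut self, _token: TokenId) -> AiciInstruction {
        // Once the forced tokens have been reported back, end the sequence.
        AiciInstruction::Stop
    }

    fn resume_fork(&mut self, _seqs_in_group: Vec<SeqId>, _self_id: SeqId) -> AiciInstruction {
        // This controller never issues Fork, so there is nothing to resume.
        AiciInstruction::Stop
    }
}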
Investigate what kind of limits the scheduler should enforce - number of tokens, number of KV-entries. What latency should it target for a request?
fast-forward (FF) - generating multiple zero-entropy tokens, extending existing generation (or prompt)
SequenceGroup (SG) - a bunch of sequences being generated from a single user req, sharing the prompt; they can be only swapped in and out as a whole
It seems it would be relatively easy. The idea would be to implement a Rust server that calls into llama.cpp library.
The following Rust crates implement bindings:
and probably more. Either one of these needs to be updated (likely) to expose the needed llama.cpp APIs:
With these it looks like we can do forking, backtracking, fast-forwarding, and also constraints.
"pre_process" and "post_process" have been combined into "post_pre_process"
Check if vllm does any pipelining - if so, aicirt will have trouble, as it assumes one set of requests at a time.
The side MessageChannel is used to compile modules. It should also be used to initialize the recognizer outside of any steps.
It also needs to be made async.
also explain how to use 'aici jsinit'
Advantages:
The worker would first compile the WASM module (if needed) and then run it. The worker is for a single WASM module only.
Things to sort out:
MessageChannel (name it after process PID, delete first)

In order to protect and secure Microsoft, private or internal repositories in GitHub for Open Source which are not related to open source projects or require collaboration with 3rd parties (customers, partners, etc.) must be migrated to GitHub inside Microsoft, a.k.a. GitHub Enterprise Cloud with Enterprise Managed Users (GHEC EMU).
✍️ Please RSVP to opt-in or opt-out of the migration to GitHub inside Microsoft.
❗ Only users with admin permission in the repository are allowed to respond. Failure to provide a response will result in your repository getting automatically archived. 🔒
Reply with a comment on this issue containing one of the following optin or optout command options below.
✅ Opt-in to migrate
@gimsvc optin --date <target_migration_date in mm-dd-yyyy format>
Example:
@gimsvc optin --date 03-15-2023
OR
❌ Opt-out of migration
@gimsvc optout --reason <staging|collaboration|delete|other>
Example:
@gimsvc optout --reason staging
Options:
staging: This repository will ship as Open Source or go public
collaboration: Used for external or 3rd party collaboration with customers, partners, suppliers, etc.
delete: This repository will be deleted because it is no longer needed.
other: Other reasons not specified
This: aici/promptlib/promptlib/aici.py, line 27 in 5fd57e4
Pyvm wouldn't require cuda
Rllm would be quicker to build than vllm
Vllm would allow building all
pre = softmax(logits)
post = softmax(logits + bias)
dropped = sum(max(0.0, pre[i] - post[i]) for i in range(len(post)))
If dropped is close to 1, we're going against the model.
The timeout is 1000ms, and we seem to be hitting it sometimes on macOS M3
cc @dluc
latex
markdown
typst
rust
ipynb
python
js, ts, html, css, json, xml, toml, and other formats
Possibly via POST /v1/run?controller=xyz
?
We should use a trait for the Tensor and Model types. Right now it uses cfg().
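A minimal sketch of what that could look like (the Backend trait and its methods here are illustrative, not the actual rllm types): engine code becomes generic over a backend trait instead of being gated by cfg().

pub trait Backend {
    type Tensor;
    type Model;
    fn load_model(&self, path: &str) -> Self::Model;
    fn logits(&self, model: &Self::Model, tokens: &[u32]) -> Self::Tensor;
}

// One impl per backend (e.g. tch/CUDA, llama.cpp); the rest of the engine
// takes a generic `B: Backend` instead of switching on cfg().
struct DummyBackend;

impl Backend for DummyBackend {
    type Tensor = Vec<f32>;
    type Model = ();
    fn load_model(&self, _path: &str) -> Self::Model {}
    fn logits(&self, _model: &Self::Model, tokens: &[u32]) -> Self::Tensor {
        vec![0.0; tokens.len()]
    }
}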
cd rllm-cpp
./cpp-server.sh phi2
WARN [llama_cpp_low] llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 )
INFO [hf_hub] Token file not found "/Users/tester/.cache/huggingface/token"
Right now we take one "file". We should allow multiple files, to allow for sub-modules and arguments.
The README links to https://github.com/microsoft/aici/tree/main/aici_ast_runner/c.y but the file does not exist in the repository
For example, when wasmtime version is updated.
https://docs.rs/wasmtime/latest/wasmtime/struct.Engine.html#method.detect_precompiled
Right now we extend OpenAI completion.
Similar to max_tokens, but in different units.
Turn the LLM output into a string, re-encode it, and compare token ids.
vllm uses the max number of possible forks in a sequence group for scheduling; that max should also be limited.
While at it, also measure mem transfer speed and see how many KV entries can be transferred in a single inference round
a dockerized application, reproducibly built from the output of the generative script of the REPL agent, with passed parameters and ENV variables
the ability to spawn, on demand, instances of such applications' REPL environments (for example, docker's interactive bash) as conversational models
one_of = ["foo", "bar", "baz"]
Similar to rx = r"foo|bar|baz", but may be tokenized more efficiently.
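A rough sketch of why (the tokenize signature is hypothetical, standing in for the host tokenizer): with one_of, each alternative can be tokenized once up front, so the controller can follow whole token sequences instead of walking a character-level regex automaton.

// Pre-tokenize each alternative; generation can then fast-forward or bias
// along these token sequences directly.
fn one_of_as_token_seqs(tokenize: impl Fn(&str) -> Vec<u32>, options: &[&str]) -> Vec<Vec<u32>> {
    options.iter().map(|&s| tokenize(s)).collect()
}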
Compiling rquickjs v0.4.0 (https://github.com/mmoskal/rquickjs?rev=5b0e3b24d5021d3cd4981d3693fd7bd1a106314c#5b0e3b24)
Finished release [optimized + debuginfo] target(s) in 35.89s
built: /Users/markus/src/aici/target/wasm32-wasi/release/aici_jsctrl.wasm, 3.14 MiB
upload module... 3213kB -> 10087kB id:3508abaf
[0]: FIXED "Ultimate answer is to the life, universe and everything is "
[0]: GEN-OPT {regex: /\d\d/}
[0]: regex constraint: "\\d\\d"
[0]: dfa: 160 bytes
[0]: GEN ""
[0]: JsCtrl: done
[DONE]
[Response] Ultimate answer is to the life, universe and everything is
response saved to tmp/response.json
Usage: {'sampled_tokens': 21, 'ff_tokens': 32, 'cost': 74}
Storage: {}
Right now the seq id returned by aici_host_self_seq_id(), and then via the streaming interface, is global to the server. This allows someone to figure out how much a server is used, which is something we may want to hide.
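One possible way to hide this (an illustrative sketch, not the current implementation): remap global sequence ids to request-local ids before they are exposed over the streaming interface.

use std::collections::HashMap;

// Maps server-global sequence ids to small ids that are local to one request.
struct SeqIdRemap {
    next: u32,
    map: HashMap<u32, u32>,
}

impl SeqIdRemap {
    fn new() -> Self {
        SeqIdRemap { next: 0, map: HashMap::new() }
    }

    fn local(&mut self, global: u32) -> u32 {
        if let Some(&id) = self.map.get(&global) {
            return id;
        }
        let id = self.next;
        self.next += 1;
        self.map.insert(global, id);
        id
    }
}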