
lmql's Introduction


LMQL

A programming language for large language models.
Documentation »

Explore Examples · Playground IDE · Report Bug


LMQL is a programming language for large language models (LLMs) based on a superset of Python. LMQL offers a novel way of interweaving traditional programming with the ability to call LLMs in your code. It goes beyond traditional templating languages by integrating LLM interaction natively at the level of your program code.

Help us shape the next major version of LMQL by filling out the LMQL developer survey: https://forms.gle/pGvAicNpUhS1rAkK9

Explore LMQL

An LMQL program reads like standard Python, but top-level strings are interpreted as query strings: They are passed to an LLM, where template variables like [GREETINGS] are automatically completed by the model:

"Greet LMQL:[GREETINGS]\n" where stops_at(GREETINGS, ".") and not "\n" in GREETINGS

if "Hi there" in GREETINGS:
    "Can you reformulate your greeting in the speech of \
     victorian-era English: [VIC_GREETINGS]\n" where stops_at(VIC_GREETINGS, ".")

"Analyse what part of this response makes it typically victorian:\n"

for i in range(4):
    "-[THOUGHT]\n" where stops_at(THOUGHT, ".")

"To summarize:[SUMMARY]"

Program Output:


LMQL allows you to express programs that contain both traditional algorithmic logic and LLM calls. At any point during execution, you can prompt an LLM on program variables in combination with standard natural language prompting, to leverage model reasoning capabilities in the context of your program.

To better control LLM behavior, you can use the where keyword to specify constraints and data types for the generated text. This enables you to guide the model's reasoning process and to constrain intermediate outputs using an expressive constraint language.
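For example, a sketch combining a data-type constraint with a membership constraint (variable names are illustrative; see the documentation for the full constraint language):

"Pick a number between 1 and 100:[N]\n" where INT(N)
"Pick one of red, green or blue:[COLOR]" where COLOR in ["red", "green", "blue"]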

Beyond this linear form of scripting, LMQL also supports a number of decoding algorithms to execute your program, such as argmax, sample or even advanced branching decoders like beam search and best_k.
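For example, a decoder clause is written at the top of a standalone query (a minimal sketch; model name and decoder arguments are illustrative):

sample(temperature=0.8)
   "A dad joke:[JOKE]"
from
   "openai/text-davinci-003"
where
   STOPS_AT(JOKE, "\n")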

Learn more about LMQL by exploring the Example Showcase, by running your own programs in our browser-based Playground IDE or by reading the documentation.

Feature Overview

LMQL is designed to make working with language models like OpenAI and 🤗 Transformers more efficient and powerful through its advanced functionality, including multi-variable templates, conditional distributions, constraints, datatypes and control flow.

Getting Started

To install the latest version of LMQL, run the following command with Python 3.10 installed.

pip install lmql

Local GPU Support: If you want to run models on a local GPU, make sure to install LMQL in an environment with a GPU-enabled installation of PyTorch >= 1.11 (cf. https://pytorch.org/get-started/locally/) and install via pip install lmql[hf].

Running LMQL Programs

After installation, you can launch the LMQL playground IDE with the following command:

lmql playground

Using the LMQL playground requires an installation of Node.js. If you are in a conda-managed environment, you can install Node.js via conda install nodejs=14.20 -c conda-forge. Otherwise, please see the official Node.js website https://nodejs.org/en/download/ for instructions on how to install it on your system.

This launches a browser-based playground IDE, including a showcase of many exemplary LMQL programs. If the IDE does not launch automatically, go to http://localhost:3000.

Alternatively, lmql run can be used to execute local .lmql files. Note that when using local HuggingFace Transformers models in the Playground IDE or via lmql run, you have to first launch an instance of the LMQL Inference API for the corresponding model via the command lmql serve-model.
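For example, to serve a model locally before querying it (the model identifier is a placeholder; see lmql serve-model --help for the exact arguments):

lmql serve-model gpt2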

Configuring OpenAI API Credentials

If you want to use OpenAI models, you have to configure your API credentials. To do so, you can either define the OPENAI_API_KEY environment variable or create a file api.env in the active working directory, with the following contents:

openai-org: <org identifier>
openai-secret: <api secret>

For system-wide configuration, you can also create an api.env file at $HOME/.lmql/api.env or at the project root of your LMQL distribution (e.g. src/ in a development copy).

Alternatively, you can use the LMQL-specific environment variables LMQL_OPENAI_SECRET and LMQL_OPENAI_ORG.
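For example, in a shell session (values are placeholders):

export LMQL_OPENAI_SECRET=<api secret>
export LMQL_OPENAI_ORG=<org identifier>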

Installing the Latest Development Version

To install the latest (bleeding-edge) version of LMQL, you can also run the following command:

pip install git+https://github.com/eth-sri/lmql

This will install the lmql package directly from the main branch of this repository. We do not continuously test the main version, so it may be less stable than the latest PyPI release.

Contributing

LMQL is a community-centric project. If you are interested in contributing to LMQL, please see the contributing guidelines for more information, and reach out to us via Discord. We are looking forward to your contributions!

Setting Up a Development Environment

To set up a conda environment for local LMQL development with GPU support, run the following commands:

# prepare conda environment
conda env create -f scripts/conda/requirements.yml -n lmql
conda activate lmql

# registers the `lmql` command in the current shell
source scripts/activate-dev.sh

Operating System: The GPU-enabled version of LMQL was tested to work on Ubuntu 22.04 with CUDA 12.0 and on Windows 10 via WSL2 with CUDA 11.7. The no-GPU version (see below) was tested to work on Ubuntu 22.04, macOS 13.2 Ventura, and Windows 10 via WSL2.

Development without GPU

This section outlines how to set up an LMQL development environment without local GPU support. Note that LMQL without local GPU support only supports the use of API-integrated models like openai/text-davinci-003. Please see the OpenAI API documentation (https://platform.openai.com/docs/models/gpt-3-5) to learn more about the set of available models.

To set up a conda environment for LMQL with no GPU support, run the following commands:

# prepare conda environment
conda env create -f scripts/conda/requirements-no-gpu.yml -n lmql-no-gpu
conda activate lmql-no-gpu

# registers the `lmql` command in the current shell
source scripts/activate-dev.sh

lmql's People

Contributors

4onen, ambroser53, bleugreen, charles-dyfis-net, codewordkey, endolith, festinuz, fjfricke, jerryjliu, jrat6856, kharvd, khushchopra, lachlangray, laiso, lbeurerkellner, lepture, mariagrandury, minosvasilias, mosheduminer, nielstron, reuank, robcaulk, saibo-creator, silaccia, stevenbedrick, tillfalko, tmrolle, tomexmachina, veqtor, viehzeug


lmql's Issues

Using triple quotes instead of single quotes with lmql>=0.0.6.1

Consider the following code:

import asyncio

import lmql


@lmql.query
async def write_catch_phrase1(adj: str, noun: str):
    '''
    argmax "Write a catchphrase for the following company: {adj} {noun}. [catchphrase]" from "chatgpt"
    '''


@lmql.query
async def write_catch_phrase2(adj: str, noun: str):
    '''
    argmax
    """
        Write a catchphrase for the following company: {adj} {noun}. [catchphrase]
    """
    from "chatgpt"
    '''


async def main():
    catchphrase1 = (await write_catch_phrase1("colorful", "socks"))[0].variables["catchphrase"]
    print(catchphrase1)
    catchphrase2 = (await write_catch_phrase2("colorful", "socks"))[0].variables["catchphrase"]
    print(catchphrase2)


asyncio.run(main())

The function write_catch_phrase1 (using single quotes) works with lmql==0.0.5.1 and lmql==0.0.6.1, while the second function write_catch_phrase2 works only with lmql==0.0.5.1. The issue persists when using the functions as chains in langchain chains.
The error I get is: AttributeError: 'LMQLTokenizer' object has no attribute 'vocab_range'. Did you mean: 'vocab_size'?

Max token length

How do I specify token length in lmql run?

Is it supposed to be specified in .lmql file?
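Elsewhere in this document, max_len is passed as a decoder argument inside the query itself, e.g. (a sketch with illustrative values):

argmax(max_len=128)
   "Write a one-sentence answer:[ANSWER]"
from
   "openai/text-davinci-003"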

lmql.run_file *args problem

In the __init__.py::run_file definition, *args needs to come before the keyword arguments; otherwise Python tries to use them as an output_writer.

I was also thinking it would make sense to allow passing query variables to run_file as **kwargs instead.
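A hypothetical sketch of the proposed signature (names follow this issue, not necessarily the actual implementation):

async def run_file(filepath, *args, output_writer=None, **kwargs):
    # with *args placed before the keyword arguments, extra positional
    # arguments are no longer consumed as the output_writer
    ...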

Use typing in server/client

TL;DR

As of now, the server code is pretty hard to comprehend.

I propose adding typing; I can rewrite the server using FastAPI. In addition to specifying the request/response schema, it serves a /docs endpoint that contains examples, so there is no need to use Postman/cURL.
It would also be easier to maintain, because someone using an incompatible version would get a ValidationError instead of some other runtime error somewhere else.
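A minimal sketch of what a typed FastAPI endpoint could look like (route and field names are illustrative, not the actual LMQL inference API):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    model: str
    prompt: str
    max_tokens: int = 32

class GenerateResponse(BaseModel):
    text: str

@app.post("/generate", response_model=GenerateResponse)
async def generate(request: GenerateRequest) -> GenerateResponse:
    # incompatible payloads fail early with a ValidationError, and /docs
    # exposes the schema together with examples
    return GenerateResponse(text="")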

Example where it would help - adding a new model interface

I've started working on adding RWKV integration, but I've run into the following problems:

  • both server and client code need to be changed; I think I've updated the server successfully, but on the client side it doesn't work
  • RWKV uses tokenizers.Tokenizer, which has a different interface than the tokenizer classes actually instantiated in transformers
  • I have problems with the dclib library (actually, what's up with that? Is dclib a kind of abstraction over different models?)

@lbeurerkellner could you point me in the right direction?

Please let me know what you think.

Anyway, happy Easter!

Tokens removed by STOPS_BEFORE count towards the length of the output

In the following query, we would not expect the query to end on the first occurrence of "wall" (output sentence: "what did the fish say when it hit the wall?"), since the length of the sequence right before the word is 37. However, since the word "wall" is also counted in the overall length of the sequence, len(JOKE) > 40 evaluates to true, even though it shouldn't yet.

argmax
   """A list of good dad jokes. A indicates the punchline
   Q: How does a penguin build its house?
   A: Igloos it together.
   Q: Which knight invented King Arthur's Round Table?
   A: Sir Cumference.
   Q:[JOKE]"""
from
   "openai/text-davinci-003"
where
   STOPS_BEFORE(JOKE, "wall") and len(JOKE) > 40

Running asyncio.get_event_loop().run_until_complete(query(args)) more than once deadlocks program

For the following query:

@lmql.query
async def s(output, material_text: str):
    '''
sample(temperature=0.5)
   """Write 2 one-sentence true summaries and 4 one-sentence false summaries in the style of a first-grade reading comprehension test.  Replace "I" with a third-person word. Only use information from the paragraph.
   Paragraph:
   {material_text}
   True summaries:{nl}"""
   for i in range(2):
      "- [correct_summaries]"
      output.append(correct_summaries)
   """False summaries:{nl}"""
   for i in range(4):
      "- [incorrect_summaries]"
      output.append(incorrect_summaries)
from
   "openai/text-davinci-003"
where
   STOPS_AT(correct_summaries, "{nl}") and STOPS_AT(incorrect_summaries, "{nl}")'''

If I run

l=[]
asyncio.get_event_loop().run_until_complete(s(l, "some text"))

it executes just fine, but the second time it seems to produce a deadlock.
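A possible workaround (a sketch only; it does not address the underlying issue) is to let asyncio create and dispose of a fresh event loop per call instead of reusing the default loop:

import asyncio

l = []
# asyncio.run() sets up a new event loop for each call and closes it afterwards
asyncio.run(s(l, "some text"))
asyncio.run(s(l, "some other text"))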

Add support for logit_bias for OpenAI Chat models / support for cl100k_base tokenizer

Currently, lmql doesn't support logit bias-based constraints for openai/gpt-3.5-turbo and openai/gpt-4. However, the OpenAI documentation states that logit_bias is indeed supported by the API. The reason why it currently doesn't work might have something to do with the fact that these models use a different tokenizer, cl100k_base, which doesn't seem to be well documented. For example, for the "list of things not to forget when going to the sea" example, here's the current prompt generated by lmql:

{
  "model": "gpt-4",
  "max_tokens": 32,
  "temperature": 0.8,
  "user": "lmql",
  "stream": true,
  "logit_bias": {
    "16012": 100,
    "16598": 100,
    "24541": 100
  },
  "messages": [
    {
      "role": "user",
      "content": "A list of things not to forget when going to the sea (not travelling): \n- Sunglasses \n-  Ur \n- "
    }
  ]
}

However, tokens 16012, 16598 and 24541 in cl100k_base are "Urls", " Crusher" and ".getType", which explains why the model generates a completion "Ur". The solution might be to use the https://github.com/openai/tiktoken library:

>>> import tiktoken
>>> enc = tiktoken.get_encoding("cl100k_base")
>>> enc.encode("Volleyball")
[53, 35619, 4047]
>>> enc.encode("Sunscreen")
[31192, 8337]
>>> enc.encode("Bathing Suite")
[33, 44661, 21652]

And indeed using logit biases

{
    "53": 100,
    "31192": 100,
    "33": 100
}

I'm getting {"content":"Sun"}
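A sketch of how the bias dictionary could be built with tiktoken, so that every sub-token of each candidate string is biased rather than token ids from a mismatched vocabulary (candidate strings taken from the example above):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
candidates = ["Volleyball", "Sunscreen", "Bathing Suite"]

logit_bias = {}
for text in candidates:
    # bias every sub-token of the candidate, not just the first one
    for token_id in enc.encode(text):
        logit_bias[str(token_id)] = 100

print(logit_bias)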

Multiple STOPS_BEFORE constraint issue

The following query results in an error (both in the Python package and the Playground)

sample(temperature=0.8, openai_chunksize=128, max_len=128, chatty_openai=True)
    "The movie review in positive sentiment is: '[OUTPUT]"
FROM
    "openai/text-ada-001"
WHERE
    STOPS_BEFORE(OUTPUT, "\\n") and STOPS_BEFORE(OUTPUT, "n")

AssertionError: The specified set of constraints contains multiple incompatible postprocessing operations for the same variable. The conflicting operations are: and . Please make sure the used constraints implement postprocess_order for each other, to use them together.

The same query with "STOPS_AT" instead of "STOPS_BEFORE" works.

STOPS_BEFORE on multiple tokens retains the first token

The following query returns "What did the fish say when" instead of the expected "What did the fish say".

argmax(max_len=80)
   """A list of good dad jokes. A indicates the punchline
   Q: How does a penguin build its house?
   A: Igloos it together.
   Q: Which knight invented King Arthur's Round Table?
   A: Sir Cumference.
   Q:[JOKE]"""
from
   "openai/text-davinci-003"
where
   STOPS_BEFORE(JOKE, "when it hit")

AssertionError: token_ids are not unique

This query reliably causes an issue with the beam decoder:

beam(2)
    thing = "a cake"
    "Q: I'm going to make {thing}. And need help.\n"
    "A: First, you'll need the following items:\n"
    " -[ITEM]"
    items = [ITEM]
    while True:
        "[APPEND]"
        if APPEND == "\nand that's all.\n\n":
            break
        "[ITEM]"
        items.append(ITEM)
from 
    "openai/text-davinci-003"
where 
    APPEND in set([" -", "\nand that's all.\n\n"]) and
    STOPS_AT(ITEM, "\n")
  File "<exec>", line 12, in cli
  File "/home/pyodide/lmql/ui/live/livelib.py", line 92, in async_cli
    result = await endpoint.fct(input, *endpoint_args, **kwargs)
  File "/home/pyodide/lmql/ui/live/live.py", line 75, in lmql
    result = await lmql.run(code, output_writer=output_writer)
  File "/home/pyodide/lmql/__init__.py", line 107, in run
    return await run_file(temp_lmql_file, output_writer=output_writer)
  File "/home/pyodide/lmql/__init__.py", line 99, in run_file
    return await module.query(**kwargs)
  File "/home/pyodide/lmql/runtime/lmql_runtime.py", line 134, in __acall__
    results = await interpreter.run(self.fct, **query_kwargs)
  File "/home/pyodide/lmql/runtime/prompt_interpreter.py", line 812, in run
    return await pool.run()
  File "/home/pyodide/lmql/runtime/prompt_interpreter.py", line 731, in run
    query_result: Optional[List[DecoderHead]] = await self.intepreter.model.query(self.initial_prompt, where, rewriter, active_prompter)
  File "/home/pyodide/lmql/runtime/openai_integration.py", line 974, in query
    return await self.adapter.query(prompt, mask_logits_processor, head_input_id_rewriter, active_prompt_rewriter, dclib_model, self.decoder_args)
  File "/home/pyodide/lmql/runtime/dclib/lmql_adapter.py", line 180, in query
    async for _ in decoder_fct(prompt_ids, **decoder_args):
  File "/home/pyodide/lmql/runtime/dclib/decoders.py", line 156, in beam_search
    h = h.extend(await model.topk_continuations(h, k=n))
  File "/home/pyodide/lmql/runtime/openai_integration.py", line 581, in topk_continuations
    return await sequences.aelement_wise(op_topk)
  File "/home/pyodide/lmql/runtime/dclib/dclib_array.py", line 312, in aelement_wise
    result_items = await asyncio.gather(*[op_with_path(path, seqs, *args, **kwargs) for path, seqs in self.sequences.items()])
  File "/lib/python3.10/asyncio/futures.py", line 284, in __await__
    yield self  # This tells Task to wait for completion.
  File "/lib/python3.10/asyncio/tasks.py", line 304, in __wakeup
    future.result()
  File "/lib/python3.10/asyncio/futures.py", line 201, in result
    raise self._exception
  File "/lib/python3.10/asyncio/tasks.py", line 232, in __step
    result = coro.send(None)
  File "/home/pyodide/lmql/runtime/dclib/dclib_array.py", line 311, in op_with_path
    return path, await op(element, *args, **kwargs)
  File "/home/pyodide/lmql/runtime/openai_integration.py", line 559, in op_topk
    assert len(token_ids) == len(set(token_ids)), "token_ids are not unique"
AssertionError: token_ids are not unique

serve_model doesn't terminate children

I'm using

serve_model_process = subprocess.Popen(["python", "-m", "lmql.model.serve", args.model, "--cuda"])

to serve a model in the background of a python script. However, when I call serve_model_process.terminate() it doesn't properly terminate the process. When I look into ps, I see there are actually multiple processes running that are serving the model.

Is there a better way to terminate the subprocess that serves the model in python or is this an issue with LMQL's termination logic?
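As a possible workaround on POSIX systems (a sketch; not a statement about LMQL's own termination logic), the server can be started in its own process group so the whole group can be signalled at once:

import os
import signal
import subprocess

# start the server in a new session so it and any worker processes it
# spawns share one process group
serve_model_process = subprocess.Popen(
    ["python", "-m", "lmql.model.serve", "gpt2", "--cuda"],  # model name is a placeholder
    start_new_session=True,
)

# later: terminate the whole process group instead of only the parent
os.killpg(os.getpgid(serve_model_process.pid), signal.SIGTERM)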

(Love this repo by the way, thanks so much for everyone's hard work 💯 )

How does INT constraint work?

argmax
"A number: [N]"
from
'openai/text-ada-001'
where
INT(N)

If for some reason the token with the highest log probability is not an integer, what would LMQL do?

Node v14.20 is outdated and not available for ARM on Mac

When trying to create a conda environment on a M1 Mac, I'm getting the following error:

$ conda env create -f scripts/conda/requirements-no-gpu.yml -n lmql
Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound:
  - conda-forge::nodejs=14.20

14.20 is a very old version of Node, and setting it to the latest v18.15.0 in the requirements file seems to fix the issue

LMQL Playground ends in react error

Installed the latest versions of Python, Node.js and npm, plus react-scripts. Getting the following:

C:\WINDOWS\system32>lmql playground
[lmql playground C:\Users\Simulara\AppData\Local\Programs\Python\Python310\lib\site-packages, liveserver=localhost:3004, ui=localhost:3000]

changed 1 package in 848ms
yarn install v1.22.19
yarn install v1.22.19
[1/4] Resolving packages...
[1/4] Resolving packages...
success Already up-to-date.
Done in 0.43s.
'PORT' is not recognized as an internal or external command,
operable program or batch file.
success Already up-to-date.
Done in 1.63s.
'REACT_APP_BUILD_COMMIT' is not recognized as an internal or external command,
operable program or batch file.

C:\WINDOWS\system32>

Question about amount of API calls

Thank you for the great work!
For a query like the sunscreen example:
"A list of things not to forget when going to the sea (not travelling): \n"
"- [THING]\n"
"- [THING]\n"

From a quick skim of the paper, my understanding is that there will be an API call for each THING in each line (two calls). Is that correct?

Also, it seems like the OpenAI API already gives some control over the decoding strategy (especially with the introduction of logit bias, which can be used to upsample/downsample certain tokens). Is the additional value this library provides over that its constraint support?

OR constraint issue

The following query leads to an error (both in the Python package and the Playground):

sample(temperature=0.8, openai_chunksize=128, max_len=128, chatty_openai=True)
     "The movie review in positive sentiment is: '[OUTPUT]"
FROM
    "openai/text-ada-001"
WHERE
    STOPS_BEFORE(OUTPUT, "\\n") or len(OUTPUT) < 100

AttributeError: module 'lmql' has no attribute 'OrOp'

More advanced stopping conditions: STOPS_AT/STOPS_BEFORE with regex/lists

It would be very helpful to be able to have more advanced stopping conditions in STOPS_AT/STOPS_BEFORE. One use case for stopping conditions with lists instead of strings is that:

argmax(chatty_openai=True, max_len=128)
   """[SENTENCE]"""
from
   "openai/text-davinci-003"
where
   STOPS_AT(SENTENCE, [".", "?", "!"])

is a lot easier than:

argmax(chatty_openai=True, max_len=128)
   """[SENTENCE]"""
from
   "openai/text-davinci-003"
where
   STOPS_AT(SENTENCE, ".") and STOPS_AT(SENTENCE, "?") and STOPS_AT(SENTENCE, "!")

For more advanced conditions based on regexes, one could look at the "calculator" example from the "tool-augmented queries". Using regexes, it would be possible to do this without few-shot examples (note that my regexes might not be exactly correct):

import re

def calc(expr):
      expr = re.sub(r"[^0-9+\-*/().,]", "", expr)
      try:
         return eval(expr)
      except Exception:
         return ""

argmax(openai_chunksize=64, max_len=2048)
      QUESTION = "Josh decides to try flipping a house.  He buys a house for $80,000 and then puts in $50,000 in repairs.  This increased the value of the house by 150%.  How much profit did he make?"
      # prompt template
      "Q: {QUESTION}\n"
      "Let's think step by step.\n"
      for i in range(4):
         "[REASONING]"
         "[CALC]"
         if CALC.endswith("="):
            " {calc(CALC)}>>"
      # Note: the last CALC would contain the RESULT.
from 
      'openai/text-davinci-003'
where
      STOPS_BEFORE(REASONING, r"^[^\d]+[\d]+$") and
      STOPS_AT(CALC, "=") and
      STOPS_BEFORE(CALC, r"^[0-9+\-*/().,]+[a-zA-Z?!\n]$")

Multiple values per VAR in LMQLResult

Hey thanks for your work!!
I notice that changing the LMQLResult from:

class LMQLResult:
    # full prompt with all variables substituted
    prompt: str
    # a dictionary of all assigned template variable values
    variables: Dict[str, str]

to:

class LMQLResult:
    # full prompt with all variables substituted
    prompt: str
    # a dictionary of all assigned template variable values
    variables: Dict[str, List[str]]

where the values taken by VAR get appended to lmql_result.variables['VAR']

would simplify certain python integrations like:

@lmql.query
async def generate_summaries(material_text: str, n_summaries: int, n_distractors: int, output):
    '''
sample(temperature=0.5)
   """Write 2 one-sentence true summaries and 4 one-sentence false summaries in the style of a first-grade reading comprehension test.  Replace "I" with a third-person word. Only use information from the paragraph.
   Paragraph:
   {material_text}
   True summaries:/n"""
   for i in range(n_summaries):
      "- [SUMMARY]"
      output['SUMMARY'].append(SUMMARY.strip())
   """False summaries:"""
   for i in range(n_distractors):
      "- [DISTRACTOR]"
      output['DISTRACTOR'].append(DISTRACTOR.strip())
from
   "openai/text-davinci-003"
where
   STOPS_AT(SUMMARY, "/n") and STOPS_AT(DISTRACTOR, "/n")
    '''

would become

@lmql.query
async def generate_summaries(material_text: str, n_summaries: int, n_distractors: int):
    '''
sample(temperature=0.5)
   """Write 2 one-sentence true summaries and 4 one-sentence false summaries in the style of a first-grade reading comprehension test.  Replace "I" with a third-person word. Only use information from the paragraph.
   Paragraph:
   {material_text}
   True summaries:/n"""
   for i in range(n_summaries):
      "- [SUMMARY]"
   """False summaries:"""
   for i in range(n_distractors):
      "- [DISTRACTOR]"
from
   "openai/text-davinci-003"
where
   STOPS_AT(SUMMARY, "/n") and STOPS_AT(DISTRACTOR, "/n")
    '''

Or, even better, a way to access the context of the LMQL query from the rest of the Python code.

HF Llama Indexing Issue

I'm currently trying to use the huggingface LlamaForCausalLM. I downloaded the weights, and hosting / inference work great.

The only problem is that there is some tokenization issue going on.

My .lmql file looks like this:

argmax """Once upon a time[GENERATION]""" from "llama/hf_llama7b" where len(GENERATION) < 40

However, my output is a bit messed up:


I think the issue has to do with the Llama tokenizer always prepending the <s> token (beginning of string) to any text it tokenizes. This creates two potential failure modes that could result in the behavior above:

  1. it potentially throws off any indexing that somehow doesn't account for the <s>
  2. if lmql tokenizes any intermediate text, we incorrectly add the <s> again. (For intermediate text tokenizations we should pass add_special_tokens=False.)

The issue is that I'm not familiar with the tokenization logic for lmql. If someone gave me a quick high-level rundown of all the relevant tokenization files to change, I can submit a PR.
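For reference, this is how a HF tokenizer distinguishes the two cases described above (a sketch; the checkpoint path is a placeholder and this is not LMQL's actual tokenization code):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/hf_llama7b")  # placeholder path

# default behaviour: special tokens such as <s> are prepended
with_bos = tokenizer("Once upon a time")["input_ids"]

# for intermediate continuations, special tokens should be skipped
without_bos = tokenizer("Once upon a time", add_special_tokens=False)["input_ids"]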

As a user I would like a flag for automatic left truncation

If query execution exceeds the maximum token limit of the model or memory limitations, LMQL should provide an option to automatically left-truncate the active prompt, to enable continuous (infinite) generation (by discarding early parts of the prompt if needed).

Feature Proposal: We want to provide a flag autotruncate=<VALUE> where

  • VALUE: bool: enables automatic left truncation based on the inherent model limit (e.g. 2048)
  • VALUE: int: enables automatic left truncation for token sequences that are >= VALUE
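Hypothetical usage of the proposed flag, passed like other decoder arguments (not yet implemented):

sample(autotruncate=True, max_len=4096)
   "An endless story:[STORY]"
from
   "openai/text-davinci-003"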

Step budget exceeded

Even after setting the max_len variable, I still exceed the step budget. Looking into the code, it looks like I need to set the step_budget parameter on the argmax instead.

I'm not sure if the docs need updating or if the code needs changing.
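For reference, step_budget would then be passed like the other decoder arguments used throughout this document (values are illustrative):

argmax(step_budget=4096, max_len=512)
   "Write a short story:[STORY]"
from
   "openai/text-davinci-003"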

Here's the error I received:

warning: step budget exceeded
... (stack trace)
AssertionError: decoder argmax did not finish with dc.finish(), which means no result sequences were returned. 
The reason for this may be a too small max_len value (max_len=9999999999999999) 

Using version 0.0.5.1

P.S. Really enjoying the LMQL declarative prompting, feels a bit like React vs jQuery (as compared to Langchain)

Length constraints unexpected behaviour

Consider the following query:

sample(temperature=0.8, openai_chunksize=32, max_len=128, chatty_openai=True)
    "The movie review in positive sentiment is: '[OUTPUT]"
FROM
    "openai/text-ada-001"
WHERE
    len(OUTPUT) == 50

This constraint is very specific, but seems to be ignored by LMQL. As a result, I get the following error:
AssertionError: The decoder returned a sequence that exceeds the provided max_len (max_len=64, sequence length=64). To increase the max_len, please provide a corresponding max_len argument to the decoder function.

A similar story when we require OUTPUT to be of at least length 50:

sample(temperature=0.8, openai_chunksize=32, max_len=128, chatty_openai=True)
    "The movie review in positive sentiment is: '[OUTPUT]"
FROM
    "openai/text-ada-001"
WHERE
    len(OUTPUT) > 50

In this case, LMQL keeps querying the model, despite len(OUTPUT) > 50 already having occurred. At the end, no error occurs because the longer output still satisfies the constraint. Not sure how often this constraint appears in relevant use cases though.

Programmatic query creation

Is there a way, or an intention to add one, to create queries programmatically in Python instead of through the DSL, e.g. with a builder pattern or something else?

My reason for wanting this is that the only way I can think of to add dynamic control flow within the confines of the DSL is to use Python metaprogramming with strings and exec, which is ugly, error-prone and risky.

Consider the example from the homepage:

import re
from lmql.demo import gsm8k_samples

def calc(expr):

      expr = re.sub(r"[^0-9+\-*/().]", "", expr)
      return eval(expr)

argmax(openai_chunksize=64, max_len=2048)
      QUESTION = "Josh decides to try flipping
      a house.  He buys a house for $80,000
      and then puts in $50,000 in repairs.
      This increased the value of the
      house by 150%.  How much profit did
      he make?"
      # few shot samples
      "{gsm8k_samples()}"
      # prompt template
      "Q: {QUESTION}\n"
      "Let's think step by step.\n"
      for i in range(4):
         "[REASON_OR_CALC]"
         if REASON_OR_CALC.endswith("<<"):
            " [EXPR]"
            " {calc(EXPR)
}>>"
         elif REASON_OR_CALC.endswith("So the answer"):
            break
      "is[RESULT]"
from 
      'openai/text-davinci-003'
where
      STOPS_AT(REASON_OR_CALC, "<<") and
      STOPS_AT(EXPR, "=") and
      STOPS_AT(REASON_OR_CALC, "So the answer")
        

This prompt could be made to use many different versions of calc, and each version of calc might require different generation paths/variables; this would require more branches, potentially a branch for each version of calc. How might you dynamically add these branches to the prompt? Also, how would you account for different variables introduced in each branch possibly needing different criteria?

Model as a parameter

I saw that Python variables are now supported in the "where" clause – that's great news, thank you! 🙂

It would also be nice if variables were supported in the "from" clause:

@lmql.query
async def hello(model: str, length: int):
    """
    argmax
        "Hello[WHO]"
    from
        model
    where
        len(WHO) < length
    """

asyncio.run(hello(
    "openai/text-ada-001",
    10,
))

However, I would even prefer the following syntax:

@lmql.query("openai/text-ada-001")
async def hello(length: int):
    """
    argmax
        "Hello[WHO]"
    where
        len(WHO) < length
    """

Alternative LLM backend support

I'm really excited for this project. This could be huge for on-device LLMs. It would pair extremely nicely with projects like GPT4All, RWKV, and whatever else is on the horizon.

It would be great if there was an API for plugging in arbitrary models. What information do you need from the LLM to apply lmql?

Query ignoring condition and then failing for unknown reason

The following query fails:

argmax(chatty_openai=True, max_len=256)
   "A rhyme:\n"
   "Verse: [RHYME_START]\n"
   for i in range(5):
      "Verse: [RHYME]\n"
from
   'openai/text-davinci-003'
where
   len(TOKENS(RHYME)) == 5 and len(TOKENS(RHYME_START)) == 5

I am not sure why, but the first RHYME variable does not stop after five tokens. However, if I remove the RHYME_START variable completely, the query works fine again. I also ran several variations of this query where sometimes it works and sometimes it fails at the second RHYME variable. I cannot really figure out what is going wrong though.

Visibility into LLM requests

Is there already any way to easily see what are the requests to the LLM being made? (Esp. for API based LLMs.) Just having trouble parsing excerpts like this from the docs (e.g. is it actually generating $O(n^2)$ requests where $n$ scales with length?), and what are the logit biases generated for various constraint expressions, etc.:


Apologies if this already exists and I'm just missing it!

Documentation: Add note on double escaping "\n" being necessary for lmql.run, but not for @lmql.query.

In the following query, I am still required to write "\\n" instead of "\n" in order to be able to compile the query in the Python package (note that this specific query also fails for other reasons, see #61):

argmax(chatty_openai=True, max_len=256)
   "A rhyme:\\n"
   "Verse: [RHYME_START]\\n"
   for i in range(5):
      "Verse: [RHYME]\\n"
from
   'openai/text-davinci-003'
where
   len(TOKENS(RHYME)) == 5 and len(TOKENS(RHYME_START)) == 5

Further token efficiency on `in` conditions

Thinking about conditions such as where SOMETHING in ['aaa', 'bbb', 'ccc']: say the model produces the token a for SOMETHING. Can LMQL be made efficient enough to infer that the only possibility is then aaa, and skip to the end of SOMETHING?

minimal example for inference-as-a-service

Congrats on this amazing project! We are interested in testing LMQL for inference-as-a-service. It would be great if the documentation contained a minimal example of using LMQL as a library that:

  1. loads a HF model once
  2. handles LMQL programs passed as strings

[Windows non-WSL] playground won't start

The playground won't start on a Windows system (all requirements met), with the error:
'REACT_APP_BUILD_COMMIT' is not recognized as an internal or external command,
operable program or batch file.
Please advise.

LMQL not supporting multibyte characters, causing garbled output for Japanese queries

I have encountered an issue with LMQL not supporting multibyte characters. While it works fine for English queries, I am facing garbled output when trying to run Japanese queries.

@lmql.query
async def en_query():
    '''
    argmax
        "Q: Who are you?\n"
        "A: [WHAT]"
    from
        "chatgpt"
    '''

response = await en_query()
print(response[0].prompt)


@lmql.query
async def ja_query():
    '''
    argmax
        "Q: あなたは誰?\n"
        "A: [WHAT]"
    from
        "chatgpt"
    '''

response = await ja_query()
print(response[0].prompt)


It appears that the problem stems from the code not being able to handle multibyte characters when separating the strings returned from LLMs character by character.

Is there any plan to support multibyte characters in the future?
I am more than willing to help and cooperate in any way necessary to resolve this issue.

AttributeError: 'NoneType' object has no attribute 'query_head'

I'm getting the following error and it's not clear why.

  File "/Users/robertchandler/code/ai/gpt_experiments/venv/lib/python3.10/site-packages/lmql/runtime/interpreter.py", line 647, in run
    assert state.query_head.result is not None, "decoder designates sequence {} as finished but the underyling query program has not produced a result. This is likekly a decoder bug. Decoder in use {}".format(await s.str(), decoder_args["decoder"])
AttributeError: 'NoneType' object has no attribute 'query_head'

My LMQL script is:

argmax(step_budget=20000, openai_chunksize=128)
        """{pre_prompt}
        Question: {question}
        """
        for i in range(100):
            "Thought:[THOUGHT]\n"
            print(f"Had thought: {red}{THOUGHT.strip()}{reset}")

            "[ACTION_OR_ANSWER][ACTION_OR_ANSWER_OUTPUT]\n"
            action_or_answer = ACTION_OR_ANSWER.strip()
            if action_or_answer == "Action:":
                cleaned_action = ACTION_OR_ANSWER_OUTPUT.strip()
                "Action Input: ```[ACTION_INPUT]```"
                cleaned_input = ACTION_INPUT.strip().strip('```').strip()
                tool_names_to_tool = {t.__name__: t for t in tools}
                if cleaned_action in tool_names_to_tool:
                    result = tool_names_to_tool[cleaned_action](cleaned_input)
                else:
                    result = f"'{cleaned_action}' is not a valid action, actions should be one of: {tool_names_mapping(tools)}"
                "\nObservation: {result}\n"
            else:
                break
    from
        "chatgpt"
    where
        # ACTION_OR_ANSWER in ["Action", "Final Answer"]
        STOPS_AT(THOUGHT, "\n") and
        STOPS_AT(ACTION_OR_ANSWER, ":") and
        STOPS_AT(ACTION_OR_ANSWER_OUTPUT, "\n") and
        STOPS_AT(ACTION_INPUT, "```")

I'm using version 0.0.5.1

When max_len is exceeded the interpreter will always return an error, even if the sequence would already be valid.

I have the following query:

sample(temperature=0.8, openai_chunksize=32, max_len=64, chatty_openai=True)
    "The movie review in positive sentiment is: '[OUTPUT]"
FROM
    "openai/text-ada-001"

The OUTPUT variable in this case is thus constrained only by the max_len parameter. Unfortunately, this leads to the following error:

AssertionError: The decoder returned a sequence that exceeds the provided max_len (max_len=64, sequence length=64). To increase the max_len, please provide a corresponding max_len argument to the decoder function.

I could give a constraint like "len(OUTPUT) < 100", but I would like to avoid doing this, since I am not really interested in the exact length of OUTPUT, only that it isn't too long (the model starts drifting if it is too long).

Running non-reloading hosted version of playground

Hi, I've gotten the model serving endpoint up in a private instance, but I'm having difficulty starting a simple playground UI that can interact with it.

Am I missing anything obvious about which entry points or config to use?

Would I be better off using the Pyodide-based UI like the public playground uses?

Question regarding Python functions

Congratulations on this exciting release! I have some questions regarding the Python functions that can be called from the LMQL query. In the paper, you mention that such functions are assumed to be pure and deterministic.

Is it enforced by lmql, i.e. does lmql take charge of caching these function calls? Or is this left to the user?

I'm asking because it could be useful to call non-deterministic functions or functions with side effects so ideally it would be possible to not cache some function calls.
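If caching were left to the user, deterministic helpers could be memoized explicitly while side-effecting ones stay uncached, e.g. (a sketch of a user-side option, not of LMQL's actual behaviour):

import random
from functools import lru_cache

@lru_cache(maxsize=None)
def square(x: int) -> int:
    # deterministic helper: safe to cache across calls from the query
    return x * x

def roll() -> int:
    # non-deterministic helper: intentionally left uncached
    return random.randint(1, 6)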

Prepending <|endoftext|> to the prompt leads to discrepancies between LMQL and openai API

LMQL automatically prepends the token "<|endoftext|>" to the beginning of the prompt (as a BOS token). However, when querying OpenAI models through the playground/API, this doesn't happen in the prompt (they might do it on their end). Thus, prepending the token might actually not be correct. Also, when querying models with the "argmax" option, lmql returns different output than the API/playground, which is not what one wants.
