c0sogi / llama-api


An OpenAI-like LLaMA inference API

License: MIT License

Python 99.23% Batchfile 0.11% Shell 0.11% Dockerfile 0.55%

api exllama fastapi llama llamacpp

llama-api's Introduction

Python 3.8 / 3.9 / 3.10 / 3.11 on Windows / Linux / MacOS

About this repository

This project aims to provide a simple way to run LLama.cpp and Exllama models as an OpenAI-like API server.

You can use this server to run the models in your own application, or use it as a standalone API server!

Before you start

Python 3.8 / 3.9 / 3.10 / 3.11 is required to run the server. You can download it from https://www.python.org/downloads/
llama.cpp: To use the cuBLAS (NVIDIA GPU) build of llama.cpp on Windows, download CUDA Toolkit 11.8.
ExLlama: To use ExLlama, install the prerequisites of this repository. Windows users may need to install both MSVC 2022 and CUDA Toolkit 11.8.

How to run server

All required packages will be installed automatically with this command.

python -m main --install-pkgs

If you already have all required packages installed, you can skip the installation with this command.

python -m main

Options:

usage: main.py [-h] [--port PORT] [--max-workers MAX_WORKERS]
               [--max-semaphores MAX_SEMAPHORES]
               [--max-tokens-limit MAX_TOKENS_LIMIT] [--api-key API_KEY]
               [--no-embed] [--tunnel] [--install-pkgs] [--force-cuda]
               [--skip-torch-install] [--skip-tf-install] [--skip-compile]
               [--no-cache-dir] [--upgrade]

options:
  -h, --help            show this help message and exit
  --port PORT, -p PORT  Port to run the server on; default is 8000
  --max-workers MAX_WORKERS, -w MAX_WORKERS
                        Maximum number of process workers to run; default is 1
  --max-semaphores MAX_SEMAPHORES, -s MAX_SEMAPHORES
                        Maximum number of process semaphores to permit;
                        default is 1
  --max-tokens-limit MAX_TOKENS_LIMIT, -l MAX_TOKENS_LIMIT
                        Set the maximum number of tokens to `max_tokens`. This
                        is needed to limit the number of tokens
                        generated.Default is None, which means no limit.        
  --api-key API_KEY, -k API_KEY
                        API key to use for the server
  --no-embed            Disable embeddings endpoint
  --tunnel, -t          Tunnel the server through cloudflared
  --install-pkgs, -i    Install all required packages before running the        
                        server
  --force-cuda, -c      Force CUDA version of pytorch to be used when
                        installing pytorch. e.g. torch==2.0.1+cu118
  --skip-torch-install, --no-torch
                        Skip installing pytorch, if `install-pkgs` is set       
  --skip-tf-install, --no-tf
                        Skip installing tensorflow, if `install-pkgs` is set    
  --skip-compile, --no-compile
                        Skip compiling the shared library of LLaMA C++ code     
  --no-cache-dir, --no-cache
                        Disable caching of pip installs, if `install-pkgs` is   
                        set
  --upgrade, -u         Upgrade all packages and repositories before running    
                        the server
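
As a rough illustration of how the options above combine, the following sketch assembles a launch command in Python. The helper name `build_server_command` is hypothetical and not part of this project; only the flags themselves are taken from the help text above.

```python
import sys
from typing import List, Optional


def build_server_command(
    port: int = 8000,
    max_workers: int = 1,
    api_key: Optional[str] = None,
    install_pkgs: bool = False,
) -> List[str]:
    """Assemble an argv list for `python -m main` (hypothetical helper)."""
    command = [sys.executable, "-m", "main", "--port", str(port)]
    command += ["--max-workers", str(max_workers)]
    if api_key is not None:
        command += ["--api-key", api_key]
    if install_pkgs:
        command.append("--install-pkgs")
    return command


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(build_server_command(port=8080, max_workers=2)))
```

The resulting list can be handed to `subprocess.run` from the repository root if you want to start the server from another script.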

Unique features

On-Demand Model Loading
- The project loads the model defined in model_definitions.py into the worker process when its name is sent in the request JSON body. The worker keeps using the cached model; when a request for a different model comes in, it unloads the existing model and loads the new one.
Parallelism and Concurrency Enabled
- Because requests are handled by a process pool, both parallelism and concurrency are available. Pass the --max-workers $NUM_WORKERS option when starting the server. This only applies when simultaneous requests target different models. Requests for the same model wait until the semaphore frees a slot.
Auto Dependency Installation
- The project automatically clones the required repositories and installs the required dependencies, including pytorch and tensorflow, when the server is started. It does this by checking the pyproject.toml or requirements.txt file in the root directory of this project or of the other repositories. pyproject.toml is converted into requirements.txt with poetry. If you want to add more dependencies, simply add them to the file.
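
The on-demand loading behaviour described above can be pictured with a small sketch. This is not the project's implementation: `SingleModelCache` and its loader functions are hypothetical names, and a real worker would hold actual model weights rather than strings.

```python
from typing import Any, Callable, Dict, Optional


class SingleModelCache:
    """Keep at most one loaded model, swapping it when another is requested."""

    def __init__(self, loaders: Dict[str, Callable[[], Any]]) -> None:
        self._loaders = loaders  # model name -> function that loads it
        self._name: Optional[str] = None
        self._model: Any = None
        self.load_count = 0  # how many times a model was (re)loaded

    def get(self, name: str) -> Any:
        if name not in self._loaders:
            raise KeyError(f"model not defined: {name}")
        if name != self._name:
            # A different model was requested: drop the old one, load the new one.
            self._model = None
            self._model = self._loaders[name]()
            self._name = name
            self.load_count += 1
        return self._model


if __name__ == "__main__":
    cache = SingleModelCache(
        {"my_ggml": lambda: "ggml weights", "my_gptq": lambda: "gptq weights"}
    )
    cache.get("my_ggml")
    cache.get("my_ggml")  # served from the cache, no reload
    cache.get("my_gptq")  # triggers unload + load
    print(cache.load_count)
```

Repeated requests for the same model reuse the cached instance, which is why switching models back and forth costs a full reload each time.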

How can I get the models?

1. Automatic download (Recommended)

Just set the model_path of your model definition in model_definitions.py to an actual HuggingFace repository and run the server. The server will automatically download the model from HuggingFace the first time the model is requested.

2. Manual download

You can download the models manually if you want. I prefer to use the following link to download the models

For LLama.cpp models: Download the gguf file from the GGML model page. Choose the quantization method you prefer. The gguf file name will be the model_path.

The LLama.cpp model must be put here as a gguf file, in models/ggml/.

For example, if you downloaded a q4_k_m quantized model from this link, the model_path has to be mythomax-l2-kimiko-v2-13b.Q4_K_M.gguf.

Available quantizations: q4_0, q4_1, q5_0, q5_1, q8_0, q2_K, q3_K_S, q3_K_M, q3_K_L, q4_K_S, q4_K_M, q5_K_S, q6_K
For Exllama models: Download three files from the GPTQ model page: config.json / tokenizer.model / *.safetensors and put them in a folder. The folder name will be the model_path.

The Exllama GPTQ model must be put here as a folder, in models/gptq/.

For example, if you downloaded 3 files from this link,
- orca-mini-7b-GPTQ-4bit-128g.no-act.order.safetensors
- tokenizer.model
- config.json
then put them in a folder, say orca_mini_7b, so that it contains the 3 files. The model_path has to be the folder name.
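
Before starting the server, it can help to confirm that a manually downloaded GPTQ folder has the three expected files. The sketch below is only an illustration: `missing_gptq_files` is a hypothetical helper, and the file names are the ones listed above.

```python
from pathlib import Path
from typing import List


def missing_gptq_files(model_dir: Path) -> List[str]:
    """Return the names of expected files that are absent from model_dir."""
    missing = []
    for name in ("config.json", "tokenizer.model"):
        if not (model_dir / name).is_file():
            missing.append(name)
    # Any *.safetensors file counts; the exact file name varies per model.
    if not any(model_dir.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing


if __name__ == "__main__":
    print(missing_gptq_files(Path("models/gptq/orca_mini_7b")))
```

An empty list means the folder layout matches what is described above; it says nothing about whether the files themselves are valid.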

Where to define the models

Define llama.cpp and exllama models in model_definitions.py. You can set all the parameters needed to load the models there; refer to the example in the file. Alternatively, you can define the models in any Python script whose file name includes both model and def, e.g. my_model_def.py. The file must include at least one LLM model definition (LlamaCppModel or ExllamaModel). You can also define an openai_replacement_models dictionary in the file to replace the OpenAI models with your own models. For example,

# my_model_def.py
from llama_api.schemas.models import LlamaCppModel, ExllamaModel

# `my_ggml` and `my_ggml2` are two definitions of the same model.
my_ggml = LlamaCppModel(model_path="TheBloke/MythoMax-L2-Kimiko-v2-13B-GGUF", max_total_tokens=4096)
my_ggml2 = LlamaCppModel(model_path="models/ggml/mythomax-l2-kimiko-v2-13b.Q4_K_M.gguf", max_total_tokens=4096)

# `my_gptq` and `my_gptq2` are two definitions of the same model.
my_gptq = ExllamaModel(model_path="TheBloke/orca_mini_7B-GPTQ", max_total_tokens=8192)
my_gptq2 = ExllamaModel(model_path="models/gptq/orca_mini_7b", max_total_tokens=8192)

# You can replace the openai models with your own models.
openai_replacement_models = {"gpt-3.5-turbo": "my_ggml", "gpt-4": "my_gptq2"}

If you don't set the RoPE frequency base and scaling factor in the model definition, they are calculated and set automatically, on the assumption that you are using a Llama2 model.
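
To make the automatic scaling a little more concrete: a load log quoted in one of the issues further down this page reports `freq_scale = 0.25` when `max_total_tokens=16384` is requested for a model whose `n_ctx_train` is 4096. That is consistent with a linear scale equal to the trained context divided by the requested context. The sketch below reproduces only that ratio, as an assumption. `linear_rope_scale` is a hypothetical name, and how the project derives the frequency base is not covered here.

```python
def linear_rope_scale(max_total_tokens: int, trained_tokens: int = 4096) -> float:
    """Guess at the linear RoPE scale: trained context / requested context.

    Assumption: 4096 is the Llama 2 training context, and no scaling is
    applied when the requested context fits inside it.
    """
    if max_total_tokens <= trained_tokens:
        return 1.0
    return trained_tokens / max_total_tokens


if __name__ == "__main__":
    for ctx in (4096, 8192, 16384):
        print(ctx, linear_rope_scale(ctx))
```

If the automatic values do not suit your model, the issue threads below show `rope_freq_base` and `rope_freq_scale` being passed explicitly in a model definition.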

Usage: Langchain integration

Langchain allows you to incorporate custom language models seamlessly. This guide will walk you through setting up your own custom model, replacing OpenAI models, and running text or chat completions.

Defining Your Custom Model

First, you need to define your custom language model in a Python file, for instance, my_model_def.py. This file should include the definition of your custom model.

# my_model_def.py
from llama_api.schemas.models import LlamaCppModel, ExllamaModel

mythomax_l2_13b_gptq = ExllamaModel(
    model_path="TheBloke/MythoMax-L2-13B-GPTQ",  # automatic download
    max_total_tokens=4096,
)

In the example above, we've defined a custom model named mythomax_l2_13b_gptq using the ExllamaModel class.

Replacing OpenAI Models

You can replace an OpenAI model with your custom model using the openai_replacement_models dictionary. Add your custom model to this dictionary in the my_model_def.py file.

# my_model_def.py (Continued)
openai_replacement_models = {"gpt-3.5-turbo": "mythomax_l2_13b_gptq"}

Here, we replaced the gpt-3.5-turbo model with our custom mythomax_l2_13b_gptq model.
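
Conceptually, the replacement is a dictionary lookup on the model name sent by the client. The following sketch shows that idea only; `resolve_model_name` is a hypothetical function, and the server's actual lookup logic may differ.

```python
from typing import Dict

openai_replacement_models: Dict[str, str] = {"gpt-3.5-turbo": "mythomax_l2_13b_gptq"}


def resolve_model_name(requested: str, replacements: Dict[str, str]) -> str:
    """Return the replacement for an OpenAI model name, or the name unchanged."""
    return replacements.get(requested, requested)


if __name__ == "__main__":
    print(resolve_model_name("gpt-3.5-turbo", openai_replacement_models))
    print(resolve_model_name("mythomax_l2_13b_gptq", openai_replacement_models))
```

A client that was written against OpenAI can therefore keep sending `gpt-3.5-turbo` without knowing which local model answers.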

Running Text/Chat Completions

Finally, you can utilize your custom model in Langchain for performing text and chat completions.

# langchain_test.py
from langchain.chat_models import ChatOpenAI
from os import environ

environ["OPENAI_API_KEY"] = "Bearer foo"

chat_model = ChatOpenAI(
    model="gpt-3.5-turbo",
    openai_api_base="http://localhost:8000/v1",
)
print(chat_model.predict("hi!"))

Now, running langchain_test.py will use your custom model for completions. Note that the 'function call' feature only works with LlamaCppModel. That's it! You've successfully integrated a custom model into Langchain. Enjoy your enhanced text and chat completions!

Usage: Text Completion

Now, you can send a request to the server.

import requests

url = "http://localhost:8000/v1/completions"
payload = {
    "model": "my_ggml",
    "prompt": "Hello, my name is",
    "max_tokens": 30,
    "top_p": 0.9,
    "temperature": 0.9,
    "stop": ["\n"]
}
response = requests.post(url, json=payload)
print(response.json())

# Output:
# {'id': 'cmpl-243b22e4-6215-4833-8960-c1b12b49aa60', 'object': 'text_completion', 'created': 1689857470, 'model': 'D:/llama-api/models/ggml/mythomax-l2-kimiko-v2-13b.Q4_K_M.gguf', 'choices': [{'text': " John and I'm excited to share with you how I built a 6-figure online business from scratch! In this video series, I will", 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 6, 'completion_tokens': 30, 'total_tokens': 36}}
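
The response shown above is a nested dictionary, so in practice you usually want just the generated text and a hint about why generation stopped. The helper below is a minimal sketch based only on the fields visible in that sample output; `first_completion` is a hypothetical name.

```python
from typing import Any, Dict, Tuple


def first_completion(response_json: Dict[str, Any]) -> Tuple[str, bool]:
    """Return (text, truncated) for the first choice of a completion response.

    `truncated` is True when finish_reason is "length", which in the sample
    above coincides with completion_tokens reaching max_tokens.
    """
    choices = response_json.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    choice = choices[0]
    return choice["text"], choice.get("finish_reason") == "length"


if __name__ == "__main__":
    # Trimmed-down stand-in for the server response shown above.
    sample = {
        "object": "text_completion",
        "choices": [
            {"text": " John", "index": 0, "logprobs": None, "finish_reason": "length"}
        ],
    }
    print(first_completion(sample))
```

If `truncated` comes back True, raising `max_tokens` in the payload is the first thing to try.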

Usage: Chat Completion

import requests

url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello there!"}],
    "max_tokens": 30,
    "top_p": 0.9,
    "temperature": 0.9,
    "stop": ["\n"]
}
response = requests.post(url, json=payload)
print(response.json())

# Output:
# {'id': 'chatcmpl-da87a0b1-0f20-4e10-b731-ba483e13b450', 'object': 'chat.completion', 'created': 1689868843, 'model': 'D:/llama-api/models/gptq/orca_mini_7b', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': " Hi there! Sure, I'd be happy to help you with that. What can I assist you with?"}, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 11, 'completion_tokens': 23, 'total_tokens': 34}}
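
For chat requests it is often convenient to wrap payload construction and reply extraction in two small functions. The sketch below uses only the request and response fields shown in the example above; both function names are hypothetical, and the optional system message assumes the server accepts the usual OpenAI-style `system` role.

```python
from typing import Any, Dict, List, Optional


def build_chat_payload(
    model: str,
    user_message: str,
    system_prompt: Optional[str] = None,
    max_tokens: int = 30,
) -> Dict[str, Any]:
    """Build a request body for /v1/chat/completions."""
    messages: List[Dict[str, str]] = []
    if system_prompt is not None:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_message})
    return {"model": model, "messages": messages, "max_tokens": max_tokens}


def first_reply(response_json: Dict[str, Any]) -> str:
    """Return the assistant text of the first choice."""
    return response_json["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(build_chat_payload("gpt-4", "Hello there!"))
```

The payload can be posted with `requests.post(url, json=payload)` exactly as in the example above.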

Usage: Vector Embedding

You can also use the server to get embeddings of a text. For sentence encoders (e.g. universal-sentence-encoder/4), TensorFlow Hub is used. For other models, the embedding model is downloaded automatically from HuggingFace, and inference is done with Transformers and PyTorch.

import requests

url = "http://localhost:8000/v1/embeddings"
payload = {
  "model": "intfloat/e5-large-v2",  # You can also use `universal-sentence-encoder/4`
  "input": "hello world!"
}
response = requests.post(url, json=payload)
print(response.json())

# Output:
# {'object': 'list', 'model': 'intfloat/e5-large-v2', 'data': [{'index': 0, 'object': 'embedding', 'embedding': [0.28619545698165894, -0.8573919534683228, ...,  1.0349756479263306]}], 'usage': {'prompt_tokens': -1, 'total_tokens': -1}}
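
A common next step is to compare two embeddings. The sketch below pulls the vectors out of a response shaped like the one above and computes their cosine similarity in plain Python. Both function names are hypothetical, and the two-item response in the demo is made up for illustration; whether the endpoint accepts a list of inputs in one call is an assumption.

```python
import math
from typing import Any, Dict, List, Sequence


def embeddings_from_response(response_json: Dict[str, Any]) -> List[List[float]]:
    """Collect embedding vectors ordered by their `index` field."""
    items = sorted(response_json["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up response with tiny vectors, only to exercise the helpers.
    fake = {
        "object": "list",
        "data": [
            {"index": 1, "object": "embedding", "embedding": [0.0, 1.0]},
            {"index": 0, "object": "embedding", "embedding": [1.0, 0.0]},
        ],
    }
    first, second = embeddings_from_response(fake)
    print(cosine_similarity(first, second))
```

Values close to 1 indicate similar texts under the chosen embedding model.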


llama-api's Issues

Proxy to openAI

Hi!
I have a strange suggestion :) Add a proxy object that forwards requests to OpenAI if openai_replacement_models specifies openai_proxy (or something like it).

For example:
openai_replacement_models = {"gpt-3.5-turbo": "my_ggml", "gpt-4": "openai_proxy", "lllama": "another_ggml"}
If the user calls gpt-3.5-turbo, the API server will use my_ggml; if the user calls gpt-4, it sends the request to OpenAI.
This will make it easy to use both local llama and openai at the same time.

PS: Thanx so much for example with LangChain!

warning: failed to mlock 245760-byte buffer (after previously locking 0 bytes): Cannot allocate memory llm_load_tensors: mem required = 46494.72 MB (+ 1280.00 MB per state)

I am getting this memory problem when trying to run the model in llama-api. The exact same model works perfectly in oobabooga.

warning: failed to mlock 245760-byte buffer (after previously locking 0 bytes): Cannot allocate memory llm_load_tensors: mem required = 46494.72 MB (+ 1280.00 MB per state)

This is my model_definition:

llama2_70b_Q5_gguf = LlamaCppModel(
    model_path="llama-2-70b-chat.Q5_K_M.gguf",  # manual download
    max_total_tokens=4096,
)

Llama2_70b_q5_gguf - llama-api

llama_model_loader: - kv 18: general.quantization_version u32
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 4096
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: freq_base = 10000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = mostly Q5_K - Medium
llm_load_print_meta: model size = 68.98 B
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.23 MB
warning: failed to mlock 245760-byte buffer (after previously locking 0 bytes): Cannot allocate memory
llm_load_tensors: mem required = 46494.72 MB (+ 1280.00 MB per state)

Working from oobabooga Llama2_70b_q5_gguf

llama_model_loader: - kv 18: general.quantization_version u32
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 16384
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: freq_base = 10000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = mostly Q5_K - Medium
llm_load_print_meta: model size = 68.98 B
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.23 MB
llm_load_tensors: mem required = 46494.72 MB (+ 5120.00 MB per state)
....................................................................................................
llama_new_context_with_model: kv self size = 5120.00 MB
llama_new_context_with_model: compute buffer total size = 2097.47 MB

how to run this api in cpu only mode

Hello, can someone guide me on running this nice API in CPU-only mode?

Usage of embedding through langchain

Hello,

I appreciate this API, but I am struggling to use the embedding part with langchain, is there any support regarding how to (if possible) use the embedding with langchain?

Jordan

model_definitions.py

I know it's probably easy for everyone else, but I struggle every time I add a new model to test. I often get "model not found" when it seems the model is there. It would be a huge help if it didn't just return "model not found" but also the exact path and filename it is trying to load, either in the terminal window or in the returned info.

ps. when will the new branch be ready that handles these definitions better?

Thanks so much, Doug

Zephyr7b gives gobbledygook output but Mistral7b works fine.

Could there be some new format of gguf that we need to update the code for or something?

Generation stops at 251 tokens - works fine on oobabooga

I hate to be a pain. You have been so helpful already, but I am stuck.

My generations are ending prematurely: "finish_reason": "length" as seen below

{
  "id": "chatcmpl-4f6ac32a-287f-41ba-a4ec-8768e70ad2c3",
  "object": "chat.completion",
  "created": 1694531345,
  "model": "llama-2-70b-chat.Q5_K_M",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": " Despite AI argue that AI advancements in technology, humans will always be required i, some professions.\nSTERRT Artificial intelligence (AI) has made significant advancementsin the recent years, it's impact on various industries, including restaurants and bars. While AI cannot replace bartenders, therelatively few tasks, AI argue that humans will always be ne needed these establishments.\nSTILL be required in ssociated with sERvices sector. Here are r several reasons whythat AI explainBelow:\nFirstly, AI cannot"
      },
      "index": 0,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 123,
    "completion_tokens": 128,
    "total_tokens": 251
  }
}

My definition is:

llama2_70b_Q5_gguf = LlamaCppModel(
    model_path="llama-2-70b-chat.Q5_K_M.gguf",  # manual download
    max_total_tokens=16384,
    use_mlock=False,
)

When I load I get:

llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 16384
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: freq_base = 82684.0
llm_load_print_meta: freq_scale = 0.25
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = mostly Q5_K - Medium
llm_load_print_meta: model size = 68.98 B
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.23 MB
llm_load_tensors: mem required = 46494.72 MB (+ 5120.00 MB per state)
....................................................................................................
llama_new_context_with_model: kv self size = 5120.00 MB
llama_new_context_with_model: compute buffer total size = 2097.47 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |

From the server start screen I get:

llama2_70b_q5_gguf
model_path: llama-2-70b-chat.Q5_K_M.gguf / max_total_tokens: 16384 / auto_truncate: True / n_parts: -1 / n_gpu_layers: 30 / seed: -1 / f16_kv: True / logits_all: False / vocab_only: False / use_mlock: False / n_batch: 512 / last_n_tokens_size: 64 / use_mmap: True / cache: False / verbose: True / echo: True / cache_type: ram / cache_size: 2147483648 / low_vram: False / embedding: False / rope_freq_base: 82684.0 / rope_freq_scale: 0.25

I have tried:

Starting the server specifying the max tokens: python3 main.py --max-tokens-limit 4096
I have set my ulimit to unlimited
I have set max_total_tokens: 16384
I tried setting the rope settings to be the same as oobabooga:
rope_freq_base=10000,
rope_freq_scale=1,
BUT THESE SETTINGS WERE IGNORED.

The same model works perfectly on oobabooga.

I am not sure what else to try.

Thanks so so much, Doug

BUG: I found the model path bug!

So this has been driving me crazy. I thought I was losing my mind. So I finally figured it out.

In my model definitions I had:

WizardLM_70B_q4_GGUF = LlamaCppModel(
    model_path="wizardlm-70b-v1.0.Q4_K_M.gguf",  # manual download
    max_total_tokens=4096,
    use_mlock=False,
)

but when I listed the model definitions in the API I got:

{
  "id": "wizardlm_70b_q4_gguf",
  "object": "model",
  "owned_by": "LlamaCppModel",
  "permissions": [
    "model_path:wizardlm-70b-v1.0.Q4_K_M.gguf",

......

It converted the model id to lower case!!!!!!!!!!
So I changed my model definition to be all lower case AND IT WORKS!

So to fix this, either we need to clearly document that model definition variable names MUST be in lower case, or change the code so it does not convert them to lower case.

** But this is not the whole story. I have a working model definition with upper case letters working... So something I am saying is not correct. But the above procedure definitely fixed my problem.

FastAPI + llamapi issue

We are facing "ValueError - Can't patch loop of type <class 'uvloop.Loop'>" while using llamaapi with FastAPI. Are there any known issues and resolutions?

Support for ExLlama V2

https://github.com/turboderp/exllamav2

Long generations don't return data but server says 200 OK. Swagger screen just says LOADING forever.

How to reproduce:

1) Model being used:

wizardlm_70b_q4_gguf = LlamaCppModel(
    model_path="wizardlm-70b-v1.0.Q4_K_M.gguf",  # manual download
    max_total_tokens=4096,
    use_mlock=False,
)

2) From swagger run this query against the chat completion endpoint. Please note there are backslashes in front of quotes
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "The topic is: 'Infant baptism is not biblical'. Give me at least 5 points. Output a table with these 4 columns: 'Point For Title Sentence','Point For Explanation with quotes and examples (min 5 sentences)', 'Point Against Title Sentence','Point Against Explanation with quotes and examples (min 5 sentences)'."
    }
  ],
  "model": "wizardlm_70b_q4_gguf"
}

3) When the server completes the query it says:

llama_print_timings: load time = 70698.17 ms
llama_print_timings: sample time = 353.01 ms / 861 runs ( 0.41 ms per token, 2439.01 tokens per second)
llama_print_timings: prompt eval time = 56156.99 ms / 95 tokens ( 591.13 ms per token, 1.69 tokens per second)
llama_print_timings: eval time = 920273.58 ms / 860 runs ( 1070.09 ms per token, 0.93 tokens per second)
llama_print_timings: total time = 978060.67 ms
[2023-09-17 15:00:28,909] llama_api.server.pools.llama:INFO - 🦙 [done for wizardlm_70b_q4_gguf]: (elapsed time: 978.1s | tokens: 860( 0.9tok/s))
INFO: 216.8.141.240:47056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
doug@Ubuntu-2204-jammy-amd64-base:~/llama-api$

4) The swagger call still says LOADING infinitely

High RAM and CPU usage

When I run a model on my GPU, my CPU and RAM Usage is insanely high

How can I use a specific prompt template?

For example openchat 3.5 wants this prompt template format:

GPT4 User: {prompt}<|end_of_turn|>GPT4 Assistant:

I tried a few things and managed to crash the server, so I am stuck. Can anyone help? I think the author is away...

Stopped working after enabling CUDA

Hi, this was working really quite well on CPU for me, but after I gave the tool access to the paths for libcublas, it compiled with CUDA and now can't start or load because my 3080 does not have enough VRAM.

How do I completely force off CUDA so that I can use the tool again? I've tried taking the PATH and LD_ paths away, but the installer still seems to be building in CUDA mode.

Thanks

Set number of cores being used on cpu?

I am on a box with 19 physical cores, but it looks like only 9 or 10 are being used. Is there a way to specify the number of cores to use?

Using with LangChain instead of the OpenAI API

Thanks for a promising project!
Can I use llama-api with LangChain instead of OpenAI? Can you provide an example?

exllama GPU split

It's not clear from the documentation how to split VRAM over multiple GPUs with exllama.

Support min_p sampler

Support the min_p sampler, which is implemented in ExLlamaV2.

exllamav2

Please add support for exllamav2

Dumb question: definitions.py model parameters

I am very sorry for this newbie question. In the definitions.py there are a number of parameters for each model. I assume these correspond to the settings given on the model page. My question is how do I know the variable names you have used for each setting? For example:

airoboros_l2_13b_gguf = LlamaCppModel(
    model_path="TheBloke/Airoboros-L2-13B-2.1-GGUF",  # automatic download
    max_total_tokens=8192,
    rope_freq_base=26000,
    rope_freq_scale=0.5,
    n_gpu_layers=30,
    n_batch=8192,
)

rope_freq_base : It doesn't appear in any of your other examples. I assume your examples are a non-exhaustive usage of all possible parameters. How can I know the variable names you used? Is there a mapping chart somewhere?

Again I apologize for the newbie question that is probably painfully obvious to others.

Thanks, Doug

Is there a way to use this on google Colab and have the url be public?

I would love to use this in google Colab but I would need the url to be public, is there a way to do that with this?

Any way to define embeddings model in model_definitions.py?

First of all, thank you for creating llama-api, it really works great! Just wanted to ask: is there a possibility to add embeddings models as well to the model_definitions.py?

It seems that the automatic downloader sometimes gets corrupted or times out. I tried it with a smaller embeddings model and everything worked fine, it cached the model and embeddings work fine. But anything over roughly 100MB times out at some point, and I'm not sure why.

Alternatively, is there any way to manually put an embeddings model into the .cache folder? I'm not really sure about the structure here, it looks quite different than a regular model directory that I would download on my own.

Thank you!

PS: Happy to contribute a bit to the codebase if it is still actively maintained, as we will probably make some changes for better production serving. Even if it's just the README file, to explain how to serve it in production behind Nginx with load balancing and multiple instances on one server.