TinyLLM

TinyLLM? Yes, the name is a bit of a contradiction, but it means well. It's all about putting a large language model (LLM) on a tiny system that still delivers acceptable performance.

This project helps you build a small locally hosted LLM with a ChatGPT-like web interface using consumer grade hardware. To read more about my research with llama.cpp and LLMs, see research.md.

Key Features

  • Supports multiple LLMs (see list below)
  • Builds a local OpenAI API web service via Ollama, llama.cpp or vLLM.
  • Serves up a Chatbot web interface with customizable prompts that can access external websites (URLs), vector databases and other sources (e.g. news, stocks, weather).

Hardware Requirements

  • CPU: Intel, AMD or Apple Silicon
  • Memory: 8GB+ DDR4
  • Disk: 128GB+ SSD
  • GPU: NVIDIA (e.g. GTX 1060 6GB, RTX 3090 24GB) or Apple M1/M2
  • OS: Ubuntu Linux, MacOS
  • Software: Python 3, CUDA 12.2

Quickstart

TODO - Quick start setup script.

Manual Setup

# Clone the project
git clone https://github.com/jasonacox/TinyLLM.git
cd TinyLLM

Run a Local LLM

To run a local LLM, you will need an inference server for the model. This project recommends these options: vLLM, llama-cpp-python, and Ollama. All of these provide a built-in OpenAI API compatible web server that will make it easier for you to integrate with other tools.

Ollama Server (Option 1)

The Ollama project has made it super easy to install and run LLMs on a variety of systems (MacOS, Linux, Windows) with limited hardware. It serves up an OpenAI compatible API as well. The underlying LLM engine is llama.cpp. As with llama.cpp, the downside of this server is that it can only handle one session/prompt at a time. To run the Ollama server container:

# Install and run Ollama server
docker run -d --gpus=all \
    -v $PWD/ollama:/root/.ollama \
    -p 11434:11434 \
    -p 8000:11434 \
    --restart unless-stopped \
    --name ollama \
    ollama/ollama

# Download and test run the llama3 model
docker exec -it ollama ollama run llama3

# Tell server to keep model loaded in GPU
curl http://localhost:11434/api/generate -d '{"model": "llama3", "keep_alive": -1}'

Ollama supports several models (LLMs): https://ollama.com/library. If you set up the docker container mentioned above, you can download and run them using:

# Download and run Phi-3 Mini, open model by Microsoft.
docker exec -it ollama ollama run phi3

# Download and run mistral 7B model, by Mistral AI
docker exec -it ollama ollama run mistral

If you use the TinyLLM Chatbot (see below) with Ollama, make sure you specify the model via LLM_MODEL="llama3". This will cause Ollama to download and run this model. It may take a while to start on first run unless you run one of the ollama run or curl commands above.
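
Since Ollama serves an OpenAI compatible API, you can sanity check the endpoint with a plain curl request before wiring up any other tools. This is a minimal sketch that assumes the container above is running, the llama3 model has been pulled, and the 8000 port mapping from the docker run command is in place:

# Send a test chat completion to the OpenAI compatible endpoint
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "llama3", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'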

vLLM Server (Option 2)

vLLM offers a robust OpenAI API compatible web server that supports multiple simultaneous inference threads (sessions). It automatically downloads the models you specify from HuggingFace and runs extremely well in containers. vLLM requires GPUs with more VRAM since it uses non-quantized models. AWQ models are also available, and more optimizations are underway in the project to reduce the memory footprint. Note: for GPUs with a compute capability of 6 or lower (Pascal architecture; see the GPU table), follow the details here instead.

# Build Container
cd vllm
./build.sh 

# Make a Directory to store Models
mkdir models

# Edit run.sh or run-awq.sh to pull the model you want to use. Mistral is set by default.
# Run the Container - This will download the model on the first run
./run.sh  

# The trailing logs will be displayed so you can see the progress. Use ^C to exit without
# stopping the container. 
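
Once the logs show the model has loaded, you can confirm what vLLM is serving by querying the OpenAI compatible models endpoint. A quick check, assuming run.sh exposes vLLM on the default port 8000:

# List the model(s) currently served by vLLM
curl http://localhost:8000/v1/models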

Llama-cpp-python Server (Option 3)

The llama-cpp-python OpenAI API compatible web server is easy to set up and use. It runs optimized GGUF models that work well on many consumer grade GPUs with small amounts of VRAM. As with Ollama, a downside with this server is that it can only handle one session/prompt at a time. The steps below outline how to set up and run the server from the command line. Read the details in llmserver to see how to set it up as a persistent service or docker container on your Linux host.

# Uninstall any old version of llama-cpp-python
pip3 uninstall llama-cpp-python -y

# Linux Target with Nvidia CUDA support
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip3 install llama-cpp-python==0.2.27 --no-cache-dir
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip3 install llama-cpp-python[server]==0.2.27 --no-cache-dir

# MacOS Target with Apple Silicon M1/M2
CMAKE_ARGS="-DLLAMA_METAL=on" pip3 install -U llama-cpp-python --no-cache-dir
pip3 install 'llama-cpp-python[server]'

# Download Models from HuggingFace
cd llmserver/models

# Get the Mistral 7B GGUF Q-5bit model Q5_K_M and Meta LLaMA-2 7B GGUF Q-5bit model Q5_K_M
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q5_K_M.gguf
wget https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf

# Run Test - API Server (run from the llmserver directory so ./models resolves)
cd ..
python3 -m llama_cpp.server \
    --model ./models/mistral-7b-instruct-v0.1.Q5_K_M.gguf \
    --host localhost \
    --n_gpu_layers 99 \
    --n_ctx 2048 \
    --chat_format llama-2
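
The llama-cpp-python server listens on port 8000 by default, so you can verify it is answering with a short test request. A rough example (adjust the host/port if you changed them above):

# Send a short test completion to the llama-cpp-python server
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"prompt": "The capital of France is", "max_tokens": 16}'

Because the server is built on FastAPI, interactive API docs should also be available at http://localhost:8000/docs.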

Run a Chatbot

The TinyLLM Chatbot is a simple web-based Python FastAPI app that allows you to chat with an LLM using the OpenAI API. It supports multiple sessions and remembers your conversation history. It also includes some RAG (Retrieval Augmented Generation) features:

  • Summarizing external websites and PDFs (paste a URL in chat window)
  • List top 10 headlines from current news (use /news)
  • Display company stock symbol and current stock price (use /stock <company>)
  • Provide current weather conditions (use /weather <location>)
  • Use a vector database for RAG queries - see the RAG page for details

# Move to chatbot folder
cd ../chatbot
touch prompts.json

# Pull and run latest container - see run.sh
docker run \
    -d \
    -p 5000:5000 \
    -e PORT=5000 \
    -e OPENAI_API_BASE="http://localhost:8000/v1" \
    -e LLM_MODEL="tinyllm" \
    -e USE_SYSTEM="false" \
    -e SENTENCE_TRANSFORMERS_HOME=/app/.tinyllm \
    -v $PWD/.tinyllm:/app/.tinyllm \
    --name chatbot \
    --restart unless-stopped \
    jasonacox/chatbot
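
To watch the chatbot start up and connect to your LLM backend, you can follow the container logs:

# Follow the chatbot logs (Ctrl+C stops the log view, not the container)
docker logs -f chatbot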

Example Session

Open http://localhost:5000 in your browser to start a session.


Read URLs

If a URL is pasted in the text box, the chatbot will read and summarize it.


Current News

The /news command will fetch the latest news and have the LLM summarize the top ten headlines. It will store the raw feed in the context prompt to allow follow-up questions.


Manual Setup

You can also test the chatbot server without docker using the following.

# Install required packages
pip3 install fastapi uvicorn python-socketio jinja2 openai bs4 pypdf requests lxml aiohttp

# Run the chatbot web server
python3 server.py
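
When running without docker, the chatbot is configured with the same environment variables shown in the docker example above. For example, a sketch pointing it at a local OpenAI compatible server (adjust the values for your setup):

# Point the chatbot at your LLM server, then start it
export OPENAI_API_BASE="http://localhost:8000/v1"
export LLM_MODEL="tinyllm"
python3 server.py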

LLM Models

Here are some suggested models that work well with llmserver (llama-cpp-python). You can test other models and different quantizations, but in my experiments, the Q5_K_M models performed the best. Below are the download links from HuggingFace as well as the model card's suggested context length and chat prompt mode.

LLM | Quantized | Link to Download | Context Length | Chat Prompt Mode
7B Models
Mistral v0.1 7B | 5-bit | mistral-7b-instruct-v0.1.Q5_K_M.gguf | 4096 | llama-2
Llama-2 7B | 5-bit | llama-2-7b-chat.Q5_K_M.gguf | 2048 | llama-2
Mistrallite 32K 7B | 5-bit | mistrallite.Q5_K_M.gguf | 16384 | mistrallite (can be glitchy)
10B Models
Nous-Hermes-2-SOLAR 10.7B | 5-bit | nous-hermes-2-solar-10.7b.Q5_K_M.gguf | 4096 | chatml
13B Models
Claude2 trained Alpaca 13B | 5-bit | claude2-alpaca-13b.Q5_K_M.gguf | 2048 | chatml
Llama-2 13B | 5-bit | llama-2-13b-chat.Q5_K_M.gguf | 2048 | llama-2
Vicuna 13B v1.5 | 5-bit | vicuna-13b-v1.5.Q5_K_M.gguf | 2048 | vicuna
Mixture-of-Experts (MoE) Models
Hai's Mixtral 11Bx2 MoE 19B | 5-bit | mixtral_11bx2_moe_19b.Q5_K_M.gguf | 4096 | chatml
Mixtral-8x7B v0.1 | 3-bit | Mixtral-8x7B-Instruct-v0.1-GGUF | 4096 | llama-2
Mixtral-8x7B v0.1 | 4-bit | Mixtral-8x7B-Instruct-v0.1-GGUF | 4096 | llama-2
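
To use one of these GGUF models with llama-cpp-python, pass the chat prompt mode and context length from the table to the server. For example, a sketch for the Nous-Hermes-2-SOLAR model listed above, assuming it has been downloaded into ./models:

# Serve a chatml-format model with its suggested 4096 context length
python3 -m llama_cpp.server \
    --model ./models/nous-hermes-2-solar-10.7b.Q5_K_M.gguf \
    --host localhost \
    --n_gpu_layers 99 \
    --n_ctx 4096 \
    --chat_format chatml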

Here are some suggested models that work well with vLLM.

LLM | Quantized | Link to Download | Context Length
Mistral v0.1 7B | None | mistralai/Mistral-7B-Instruct-v0.1 | 32k
Mistral v0.2 7B | None | mistralai/Mistral-7B-Instruct-v0.2 | 32k
Mistral v0.1 7B AWQ | AWQ | TheBloke/Mistral-7B-Instruct-v0.1-AWQ | 32k
Mixtral-8x7B | None | mistralai/Mixtral-8x7B-Instruct-v0.1 | 32k
Meta Llama-3 8B | None | meta-llama/Meta-Llama-3-8B-Instruct | 8k
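
If you want to try one of these models without editing run.sh, vLLM's OpenAI compatible server can also be launched directly. A rough sketch, assuming vLLM is installed in your Python environment (flags can change between vLLM releases):

# Serve Mistral 7B Instruct v0.2 on port 8000 with vLLM's OpenAI compatible server
python3 -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --port 8000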


Issues

Chatbot - Switch to WSGI server

INFO WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.

None of the "easy" conversions worked due to the way the threading works to support model output streaming (token streams to browser via socketio).

  • Gunicorn - No streaming token response and crashes
  • FastAPI + uvicorn + socketio - Moving to async: Test mostly works but again, no streaming

Ollama GPU support on Apple Silicon

When running Ollama via Docker (as mentioned in Option 1) on Apple Silicon with the --gpus=all flag, Docker cannot find a compatible device: Apple Silicon does not use NVIDIA GPUs, and Docker Desktop is not exposed to Apple's own GPU, so users may receive the following error message:

docker: Error response from daemon: could not select device driver "" with capabilities: [[GPU]].

I'd recommend adding the following guidance to the README (I can submit a PR):

**Apple Silicon GPU Support**:

Apple Silicon GPUs use the Metal Performance Shaders API, which is not as widely supported as NVIDIA's CUDA API. This means that Docker, which is commonly used to run applications in containers, does not detect or utilize the Apple Silicon GPU effectively.

**Docker Limitations**:
When running Ollama in Docker on an Apple Silicon Mac, the GPU is not detected, and the system falls back to using the CPU. This is because Docker images are typically configured to use NVIDIA GPU libraries, which are not compatible with Apple Silicon GPUs.

**Native Execution**:
Running Ollama natively on macOS, without Docker, can enable GPU acceleration. This approach leverages the Metal API directly, allowing better utilization of the Apple Silicon GPU.
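
For example, a minimal native setup might look like the following (a sketch that assumes Homebrew is installed; the installer from ollama.com works as well):

# Install and run Ollama natively on macOS so it can use the Metal GPU
brew install ollama
ollama serve &
ollama run llama3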

**Model Size and Memory Constraints**:
Large models may not fit within the GPU memory available on Apple Silicon Macs, leading to fallback to CPU usage. For efficient performance, use models that fit within the memory accessible to the GPU (approximately 10.5GB for a 16GB RAM system).

Chatbot - FastAPI Template Update

/usr/local/lib/python3.10/site-packages/starlette/templating.py:178: DeprecationWarning: The name is not the first parameter anymore. The first parameter should be the Request instance.
Replace TemplateResponse(name, {"request": request}) by TemplateResponse(request, name).

vLLM Pascal architecture fix no longer works

Hello, what a great project! Unfortunately, the vLLM fix for Pascal architecture no longer works on the main branch. vLLM changed the way it checks for compute capability, and I was unable to find how it's done in the current version. Would you be able to refactor the Pascal patch to make TinyLLM work with recent versions of vLLM? Much obliged!
