
h2oai / h2ogpt


Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://codellama.h2o.ai/

Home Page: http://h2o.ai

License: Apache License 2.0

Python 91.42% Dockerfile 0.03% Shell 0.60% Makefile 0.15% TeX 2.52% Groovy 0.26% Smarty 0.06% HTML 1.79% Jupyter Notebook 3.17%
chatgpt llm ai embeddings generative gpt gpt4all pdf private privategpt

h2ogpt's Introduction

h2oGPT

Turn ★ into ⭐ (top-right corner) if you like the project!

Query and summarize your documents or just chat with local private GPT LLMs using h2oGPT, an Apache V2 open-source project.

  • Private offline database of any documents (PDFs, Excel, Word, Images, Video Frames, Youtube, Audio, Code, Text, MarkDown, etc.)
    • Persistent database (Chroma, Weaviate, or in-memory FAISS) using accurate embeddings (instructor-large, all-MiniLM-L6-v2, etc.)
    • Efficient use of context using instruct-tuned LLMs (no need for LangChain's few-shot approach)
    • Parallel summarization and extraction, reaching an output of 80 tokens per second with the 13B LLaMa2 model
    • HYDE (Hypothetical Document Embeddings) for enhanced retrieval based upon LLM responses
  • Variety of models supported (LLaMa2, Mistral, Falcon, Vicuna, WizardLM; with AutoGPTQ, 4-bit/8-bit, LoRA, etc.)
    • GPU support from HF and LLaMa.cpp GGML models, and CPU support using HF, LLaMa.cpp, and GPT4ALL models
    • Attention Sinks for arbitrarily long generation (LLaMa-2, Mistral, MPT, Pythia, Falcon, etc.)
  • UI or CLI with streaming of all models
    • Upload and View documents through the UI (control multiple collaborative or personal collections)
    • Vision Models LLaVa, Claude-3, Gemini-Pro-Vision, GPT-4-Vision
    • Image Generation Stable Diffusion (sdxl-turbo, sdxl) and PlaygroundAI (playv2)
    • Voice STT using Whisper with streaming audio conversion
    • Voice TTS using MIT-Licensed Microsoft Speech T5 with multiple voices and Streaming audio conversion
    • Voice TTS using MPL2-Licensed TTS including Voice Cloning and Streaming audio conversion
    • AI Assistant Voice Control Mode for hands-free control of h2oGPT chat
    • Bake-off UI mode against many models at the same time
    • Easy Download of model artifacts and control over models like LLaMa.cpp through the UI
    • Authentication in the UI by user/password via Native or Google OAuth
    • State Preservation in the UI by user/password
  • Linux, Docker, macOS, and Windows support
  • Inference Servers support (oLLaMa, HF TGI server, vLLM, Gradio, ExLLaMa, Replicate, OpenAI, Azure OpenAI, Anthropic)
  • OpenAI-compliant
    • Server Proxy API (h2oGPT acts as a drop-in replacement for an OpenAI server); see the Python sketch after this list
    • Python client API (to talk to Gradio server)
  • JSON Mode with any model via code block extraction. Also supports MistralAI JSON mode, Claude-3 via function calling with strict Schema, OpenAI via JSON mode, and vLLM via guided_json with strict Schema
  • Web-Search integration with Chat and Document Q/A
  • Agents for Search, Document Q/A, Python Code, CSV frames (Experimental, best with OpenAI currently)
  • Evaluate performance using reward models
  • Quality maintained with over 1000 unit and integration tests taking over 4 GPU-hours
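
For example, the OpenAI-compliant server proxy mentioned above can be driven with the standard openai Python client. This is only a minimal sketch: the base URL/port, API key, and model name below are assumptions that depend on how you launch h2oGPT, not fixed defaults.

# Minimal sketch of talking to h2oGPT's OpenAI-compatible proxy with the
# standard `openai` client. The base_url, api_key, and model name are
# assumptions -- adjust them to match your own h2oGPT launch options.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",  # assumed proxy address/port
    api_key="EMPTY",                      # assumed placeholder key
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # assumed model name
    messages=[{"role": "user", "content": "Give three bullet points on why water is healthy."}],
    temperature=0.0,
)
print(response.choices[0].message.content)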

Get Started


To quickly try out h2oGPT with limited document Q/A capability, create a fresh Python 3.10 environment and run:

  • CPU or MAC (M1/M2):
    # for windows/mac use "set" or relevant environment setting mechanism
    export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"
  • Linux/Windows CPU/CUDA/ROC:
    # for windows/mac use "set" or relevant environment setting mechanism
    export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cu121 https://huggingface.github.io/autogptq-index/whl/cu121"
    # for cu118 use export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cu118 https://huggingface.github.io/autogptq-index/whl/cu118"

Then choose your llama_cpp_python build options by setting CMAKE_ARGS for your system, according to the llama_cpp_python backend documentation. E.g. CUDA on Linux:

export LLAMA_CUBLAS=1
export CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCMAKE_CUDA_ARCHITECTURES=all"
export FORCE_CMAKE=1

Note: for some reason the llama_cpp_python build fails unless all CUDA architectures are included, and building for all of them does take some time. Windows CUDA:

set CMAKE_ARGS=-DLLAMA_CUBLAS=on -DCMAKE_CUDA_ARCHITECTURES=all
set LLAMA_CUBLAS=1
set FORCE_CMAKE=1

The same note about including all CUDA architectures applies to the Windows build. Metal M1/M2:

export CMAKE_ARGS="-DLLAMA_METAL=on"
export FORCE_CMAKE=1

Then run the following commands on any system:

git clone https://github.com/h2oai/h2ogpt.git
cd h2ogpt
pip install -r requirements.txt
pip install -r reqs_optional/requirements_optional_langchain.txt

pip uninstall llama_cpp_python llama_cpp_python_cuda -y
pip install -r reqs_optional/requirements_optional_llamacpp_gpt4all.txt --no-cache-dir

pip install -r reqs_optional/requirements_optional_langchain.urls.txt
# GPL, only run next line if that is ok:
pip install -r reqs_optional/requirements_optional_langchain.gpllike.txt

# choose up to 32768 if have enough GPU memory:
python generate.py --base_model=TheBloke/Mistral-7B-Instruct-v0.2-GGUF --prompt_type=mistral --max_seq_len=4096

Next, open http://127.0.0.1:7860 or http://localhost:7860 in your browser. Choose a 13B model for better quality than 7B.

We recommend quantized models for most small-GPU systems, e.g. LLaMa-2-7B-Chat-GGUF for 9GB+ GPU memory or larger models like LLaMa-2-13B-Chat-GGUF if you have 16GB+ GPU memory.

See Offline for how to run h2oGPT offline.


Note that on all platforms, some packages such as DocTR, Unstructured, BLIP, Stable Diffusion, etc. download models at runtime, which can appear to stall operations in the UI. The download progress appears in the console logs.

Windows 10/11 64-bit with full document Q/A capability

  • One-Click Installer

    • CPU or GPU: Download h2oGPT Windows Installer (1.3GB file)
      • Once installed, feel free to change the icon's start directory from %HOMEDRIVE%\%HOMEPATH% to (e.g.) %HOMEDRIVE%\%HOMEPATH%\h2ogpt_data so that all created files (like the database) go there. All saved paths are relative to this path.
    • CPU: Click the h2oGPT icon in the Start menu. Give it about 15 seconds to open in a browser if many optional packages are included. By default, the browser will launch with the actual local IP address, not localhost.
    • GPU: Before starting, run the following commands (replace pseud with your user):
      C:\Users\pseud\AppData\Local\Programs\h2oGPT\Python\python.exe -m pip uninstall -y torch
      C:\Users\pseud\AppData\Local\Programs\h2oGPT\Python\python.exe -m pip install https://h2o-release.s3.amazonaws.com/h2ogpt/torch-2.1.2%2Bcu118-cp310-cp310-win_amd64.whl
      
      Now click the h2oGPT icon in the Start menu. Give it about 20 seconds to open in a browser if many optional packages are included. By default, the browser will launch with the actual local IP address, not localhost.
      • Some other users may have python located here: C:\Program Files (x86)\h2oGPT\Python\python.exe.
    • To debug any issues, run the following (replace pseud with your user):
      C:\Users\pseud\AppData\Local\Programs\h2oGPT\Python\python.exe "C:\Users\pseud\AppData\Local\Programs\h2oGPT\h2oGPT.launch.pyw"
      
      Any start-up exceptions are appended to log, e.g. C:\Users\pseud\h2ogpt_exception.log.
  • To control startup, tweak the python startup file, e.g. for user pseud: C:\Users\pseud\AppData\Local\Programs\h2oGPT\pkgs\win_run_app.py

    • In this Python code, set ENVs anywhere before main_h2ogpt() is called
      • E.g. os.environ['name'] = 'value', e.g. os.environ['n_jobs'] = '10' (the value must always be a string).
    • Environment variables can be changed, e.g.:
      • n_jobs: number of cores for various tasks
      • OMP_NUM_THREADS: thread count for LLaMa
      • CUDA_VISIBLE_DEVICES: which GPUs are used. Recommended to set to a single fast GPU, e.g. CUDA_VISIBLE_DEVICES=0 if you have multiple GPUs. Note that the UI cannot control which GPUs (or CPU mode) are used for LLaMa models.
      • Any CLI argument from python generate.py --help can be set as an environment variable named h2ogpt_x, e.g. h2ogpt_h2ocolors set to False (see the Python sketch at the end of this section).
      • Set h2ogpt_server_name to the machine's actual LAN IP address so other devices can reach the app, e.g. h2ogpt_server_name set to 192.168.1.172, and allow access through the firewall if Windows Defender is active.
  • One can tweak installed h2oGPT code at, e.g. C:\Users\pseud\AppData\Local\Programs\h2oGPT.

  • To terminate the app, go to the System tab, click Admin, and click Shutdown h2oGPT.

    • If startup fails, run from a console and check for errors, and kill any old Python processes.
  • Full Windows 10/11 Manual Installation Script

    • Single .bat file for installation (if you do not skip any optional packages, it takes about 9GB of disk space).
    • A base Conda env is recommended, since it allows DocTR, which requires pygobject; pygobject otherwise has no Windows support (except MSYS2, which h2oGPT cannot use).
    • It also allows the TTS package by Coqui, which is not currently enabled in the one-click installer.
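
Following up on the environment-variable control described above, here is a minimal sketch of what one might add near the top of win_run_app.py before main_h2ogpt() is called. The particular values are illustrative assumptions, not recommended defaults.

# Hypothetical edit to win_run_app.py: set environment variables before
# main_h2ogpt() runs. Values must always be strings; these settings are
# illustrative only.
import os

os.environ['n_jobs'] = '10'                          # cores for various tasks
os.environ['OMP_NUM_THREADS'] = '8'                  # thread count for LLaMa
os.environ['CUDA_VISIBLE_DEVICES'] = '0'             # pin to a single fast GPU
os.environ['h2ogpt_h2ocolors'] = 'False'             # any CLI arg as h2ogpt_<arg>
os.environ['h2ogpt_server_name'] = '192.168.1.172'   # expose the app on the LAN

# ... the installer's existing code then calls main_h2ogpt() ...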

Linux (CPU/CUDA) with full document Q/A capability


macOS (CPU/M1/M2) with full document Q/A capability

  • One-click Installers (experimental and subject to change; we haven't tested every feature with these installers, so we encourage the community to try them and report any issues)

    Mar 07, 2024

    Nov 08, 2023

    Download the runnable file and open it from the Finder. It will take a few minutes to unpack and run the application. These one-click installers are experimental. Report any issues with steps to reproduce at https://github.com/h2oai/h2ogpt/issues.

    Note: The app bundle is unsigned. If you experience any issues with running the app, run the following commands:

    $ xattr -dr com.apple.quarantine {file-path}/h2ogpt-osx-m1-gpu
    $ chmod +x {file-path}/h2ogpt-osx-m1-gpu
  • macOS Manual Install and Run Docs


Example Models

GPU mode requires CUDA support via torch and transformers. A 7B/13B model in 16-bit uses 14GB/26GB of GPU memory to store the weights (2 bytes per weight). Compression such as 4-bit precision (bitsandbytes, AWQ, GPTQ, etc.) can further reduce memory requirements down to less than 6GB when asking a question about your documents. (For more information, see low-memory mode.)
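
These figures follow from simple arithmetic: weight memory is roughly the number of parameters times the bytes per weight, and the KV cache and activations add context-dependent overhead on top. A quick sketch:

# Back-of-the-envelope weight-memory estimate: parameters x bytes per weight.
# KV-cache and activation overhead is extra and depends on context length.
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * (bits_per_weight / 8) / 1e9

print(weight_memory_gb(7, 16))   # ~14 GB for a 7B model in 16-bit
print(weight_memory_gb(13, 16))  # ~26 GB for a 13B model in 16-bit
print(weight_memory_gb(7, 4))    # ~3.5 GB for a 7B model at 4-bit, before overhead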

CPU mode uses GPT4ALL and LLaMa.cpp, e.g. gpt4all-j, requiring about 14GB of system RAM in typical use.


Live Demos

Inference Benchmarks for Summarization & Generation

Resources

Partners

Video Demo

demo2.mp4

YouTube 4K version: https://www.youtube.com/watch?v=_iktbj4obAI

Docs Guide

Experimental features

These are not part of normal installation instructions and are experimental.

  • Agents -- in alpha testing. Works best with OpenAI, but even that fails sometimes.

Roadmap

  • Integration of code and resulting LLMs with downstream applications and low/no-code platforms
  • Complement h2oGPT chatbot with other APIs like ToolBench
  • Enhance the model's code completion, reasoning, and mathematical capabilities, ensure factual correctness, minimize hallucinations, and avoid repetitive output
  • Add better agents for SQL and CSV question/answer

Development

  • To create a development environment for training and generation, follow the installation instructions.
  • To fine-tune any LLM models on your data, follow the fine-tuning instructions.
  • To run h2oGPT tests:
    pip install requirements-parser pytest-instafail pytest-random-order playsound==1.3.0
    conda install -c conda-forge gst-python
    sudo apt-get install gstreamer-1.0
    pip install pygame
    pytest --instafail -s -v tests
    # for client tests
    make -C client setup
    make -C client build
    pytest --instafail -s -v client/tests
    # for openai server test on already-running local server
    pytest -s -v -n 4 openai_server/test_openai_server.py::test_openai_client
    or tweak/run tests/test4gpus.sh to run tests in parallel.

Help

Acknowledgements

Why H2O.ai?

Our Makers at H2O.ai have built several world-class Machine Learning, Deep Learning and AI platforms:

We also built platforms for deployment and monitoring, and for data wrangling and governance:

  • H2O MLOps to deploy and monitor models at scale
  • H2O Feature Store in collaboration with AT&T
  • Open-source Low-Code AI App Development Frameworks Wave and Nitro
  • Open-source Python datatable (the engine for H2O Driverless AI feature engineering)

Many of our customers are creating models and deploying them enterprise-wide and at scale in the H2O AI Cloud:

We are proud to have over 25 (of the world's 280) Kaggle Grandmasters call H2O home, including three Kaggle Grandmasters who have made it to world #1.

Disclaimer

Please read this disclaimer carefully before using the large language model provided in this repository. Your use of the model signifies your agreement to the following terms and conditions.

  • Biases and Offensiveness: The large language model is trained on a diverse range of internet text data, which may contain biased, racist, offensive, or otherwise inappropriate content. By using this model, you acknowledge and accept that the generated content may sometimes exhibit biases or produce content that is offensive or inappropriate. The developers of this repository do not endorse, support, or promote any such content or viewpoints.
  • Limitations: The large language model is an AI-based tool and not a human. It may produce incorrect, nonsensical, or irrelevant responses. It is the user's responsibility to critically evaluate the generated content and use it at their discretion.
  • Use at Your Own Risk: Users of this large language model must assume full responsibility for any consequences that may arise from their use of the tool. The developers and contributors of this repository shall not be held liable for any damages, losses, or harm resulting from the use or misuse of the provided model.
  • Ethical Considerations: Users are encouraged to use the large language model responsibly and ethically. By using this model, you agree not to use it for purposes that promote hate speech, discrimination, harassment, or any form of illegal or harmful activities.
  • Reporting Issues: If you encounter any biased, offensive, or otherwise inappropriate content generated by the large language model, please report it to the repository maintainers through the provided channels. Your feedback will help improve the model and mitigate potential issues.
  • Changes to this Disclaimer: The developers of this repository reserve the right to modify or update this disclaimer at any time without prior notice. It is the user's responsibility to periodically review the disclaimer to stay informed about any changes.

By using the large language model provided in this repository, you agree to accept and comply with the terms and conditions outlined in this disclaimer. If you do not agree with any part of this disclaimer, you should refrain from using the model and any content generated by it.

Star History

Star History Chart

h2ogpt's People

Contributors

achraf-mer, aniketp04, antoninadert, anush008, arnocandel, blacksuan19, chathurindaranasinghe, efii, eltociear, eshamaaqib, fazpu, hemenkapadia, hsm207, jamesbraza, jefffohl, kohakublueleaf, lweren, mathanraj-sharma, mins0o, ozahavi, pseudotensor, robinliubin, ryanchesler, squidwardthetentacles, this, tloen, tomkraljevic, us8945, zainhaq-h2o, zba


h2ogpt's Issues

gradio matplotlib issue then Tcl_AsyncDelete: async handler deleted by the wrong thread

/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/gradio/utils.py:901: UserWarning: Starting a Matplotlib GUI outside of the main thread will likely fail.
  fig = plt.figure(figsize=(0.01, 0.01))
Exception ignored in: <function Image.__del__ at 0x7f17e015f2e0>
Traceback (most recent call last):
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/tkinter/__init__.py", line 4056, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f17e0107b50>
Traceback (most recent call last):
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/tkinter/__init__.py", line 388, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f17e0107b50>
Traceback (most recent call last):
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/tkinter/__init__.py", line 388, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f17e0107b50>
Traceback (most recent call last):
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/tkinter/__init__.py", line 388, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f17e0107b50>
Traceback (most recent call last):
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/tkinter/__init__.py", line 388, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Tcl_AsyncDelete: async handler deleted by the wrong thread
Aborted (core dumped)

API for LLM

Design an API for applications and for composability with the h2o LLM (along the lines of LangChain compatibility).

  • PR for langchain

input_ids are not moved to GPU

I'm running this locally with downloaded h2oai_pipeline:

import torch
from h2oai_pipeline import H2OTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("h2oai/h2ogpt-oig-oasst1-256-20b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("h2oai/h2ogpt-oig-oasst1-256-20b", torch_dtype=torch.bfloat16, device_map="auto")

generate_text = H2OTextGenerationPipeline(model=model, tokenizer=tokenizer)

res = generate_text("Why is drinking water so healthy?", return_full_text=True, max_new_tokens=100)
print(res[0]["generated_text"])

And while the generation works, I get this Warning:

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation. /opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py:1359: UserWarning: You are calling .generate() with `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`. warnings.warn(

Question 1: How do I make your custom pipeline move the input_ids to GPU?

Question 2: How do I make your custom pipeline set the pad_token_id to suppress the info log?

Question 3: The response from your custom pipeline is just plain text, no history. How do I build a conversation?

Thanks!
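
One hedged way to address Questions 1 and 2 with the plain transformers API (rather than the custom pipeline, whose internals are not shown in this issue) is to tokenize manually, move the inputs to the model's device, and pass pad_token_id explicitly:

# Sketch using plain transformers: tokenize, move the inputs to the model's
# device, and set pad_token_id explicitly to silence the open-end-generation
# warning. This bypasses H2OTextGenerationPipeline, so prompt formatting may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("h2oai/h2ogpt-oig-oasst1-256-20b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    "h2oai/h2ogpt-oig-oasst1-256-20b", torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Why is drinking water so healthy?", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id,  # avoids the pad_token_id info log
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))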

Increase in GPU memory usage as generation continues, imbalanced across GPUs

>>> import torch
>>> from transformers import pipeline
>>> from transformers import pipeline
>>> generate_text = pipeline(model="h2oai/h2ogpt-oasst1-512-20b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
>>> res = generate_text("Why is drinking water so healthy?", max_new_tokens=3000)
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.

During this long generation, GPU memory usage first starts out balanced, then becomes increasingly imbalanced.

Thu Apr 20 16:37:04 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000                On | 00000000:3B:00.0 Off |                  Off |
|  0%   45C    P2              105W / 250W|  12220MiB / 49140MiB |     33%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000                On | 00000000:5E:00.0 Off |                  Off |
|  0%   45C    P2               72W / 250W|  11744MiB / 49140MiB |     17%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000                On | 00000000:86:00.0 Off |                  Off |
|  0%   45C    P2               98W / 250W|  11744MiB / 49140MiB |     19%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000                On | 00000000:AF:00.0 Off |                  Off |
|  0%   45C    P2              103W / 250W|  11125MiB / 49140MiB |     23%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000                On | 00000000:3B:00.0 Off |                  Off |
|  0%   50C    P2               95W / 250W|  40566MiB / 49140MiB |     73%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000                On | 00000000:5E:00.0 Off |                  Off |
|  0%   48C    P2               76W / 250W|  15926MiB / 49140MiB |     36%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000                On | 00000000:86:00.0 Off |                  Off |
|  0%   48C    P2               87W / 250W|  15926MiB / 49140MiB |     10%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000                On | 00000000:AF:00.0 Off |                  Off |
|  0%   49C    P2              130W / 250W|  14682MiB / 49140MiB |     21%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

but it can then go back down by a lot, still during generation:

Thu Apr 20 16:47:17 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000                On | 00000000:3B:00.0 Off |                  Off |
|  0%   50C    P2               95W / 250W|  18334MiB / 49140MiB |     75%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000                On | 00000000:5E:00.0 Off |                  Off |
|  0%   49C    P2               74W / 250W|  17642MiB / 49140MiB |      8%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000                On | 00000000:86:00.0 Off |                  Off |
|  0%   50C    P2              117W / 250W|  17642MiB / 49140MiB |      6%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000                On | 00000000:AF:00.0 Off |                  Off |
|  0%   49C    P2              115W / 250W|  16139MiB / 49140MiB |     16%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Also eventually fails:

Traceback (most recent call last):
  <module>:1
  /home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/pipelines/text_generation.py:209 in __call__
      return super().__call__(text_inputs, **kwargs)
  /home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/pipelines/base.py:1109 in __call__
      return self.run_single(inputs, preprocess_params, forward_params, postproces...
  /home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/pipelines/base.py:1116 in run_single
      model_outputs = self.forward(model_inputs, **forward_params)
  /home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/pipelines/base.py:1015 in forward
      model_outputs = self._forward(model_inputs, **forward_params)
  /home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/pipelines/text_generation.py:251 in _forward
      generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=att...
  /home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/torch/utils/_contextlib.py:115 in decorate_context
      return func(*args, **kwargs)
  /home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/generation/utils.py:1437 in generate
      return self.greedy_search(
  /home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/generation/utils.py:2248 in greedy_search
      outputs = self(
  /home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/torch/nn/modules/module.py:1501 in _call_impl
      return forward_call(*args, **kwargs)
  /home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/accelerate/hooks.py:165 in new_forward
      output = old_forward(*args, **kwargs)
  /home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py:662 in forward
      outputs = self.gpt_neox(
  /home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/torch/nn/modules/module.py:1501 in _call_impl
      return forward_call(*args, **kwargs)
  /home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py:553 in forward
      outputs = layer(
  /home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/torch/nn/modules/module.py:1501 in _call_impl
      return forward_call(*args, **kwargs)
  /home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/accelerate/hooks.py:165 in new_forward
      output = old_forward(*args, **kwargs)
  /home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py:320 in forward
      attention_layer_outputs = self.attention(
  /home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/torch/nn/modules/module.py:1501 in _call_impl
      return forward_call(*args, **kwargs)
  /home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/accelerate/hooks.py:165 in new_forward
      output = old_forward(*args, **kwargs)
  /home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py:152 in forward
      attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_m...
  /home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py:219 in _attn
      attn_scores = torch.where(causal_mask, attn_scores, mask_value)
RuntimeError: The size of tensor a (2048) must match the size of tensor b (2049) at non-singleton dimension 3
>>> 

Ensemble multi-task LORAs

The plan is to develop multiple LoRAs. The point is that the base model can be inferenced once, then each new task can be:
1) base + first
2) -first + second
3) -second + third
etc.

So the base is only run forward once. This is a normal part of the LoRA paper.

A mixture-of-experts idea can then be used, where yet another LoRA is built, but this time it sits in front of all the other LoRA outputs as an ensemble model, to be able to handle the diverse tasks. In principle a lot less data is required for the ensemble LoRA, since it only has to choose which task LoRAs to blend.
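
As a rough sketch of the adapter-swapping part of this idea, the peft library can attach several task LoRAs to a single loaded base model and switch between them without reloading the base; the adapter paths below are hypothetical placeholders, and the mixture-of-experts ensemble LoRA itself is not shown.

# Sketch with peft: load the base once, attach task LoRAs, and swap between
# them so the base weights stay resident. Adapter paths are hypothetical.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/GPT-NeoXT-Chat-Base-20B", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "loras/task-first", adapter_name="first")
model.load_adapter("loras/task-second", adapter_name="second")

model.set_adapter("first")   # base + first
# ... run inference for task 1 ...
model.set_adapter("second")  # swap adapters: -first + second
# ... run inference for task 2 ...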

Benchmarks on 2xA6000 Ada vs 2xA100 80GB (roughly same speed)

2x A6000 Ada:

WORLD_SIZE=2 CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node=2 --nnodes=1 finetune.py --data_path=ShareGPT_unfiltered_cleaned_split.json.generate_human_bot.train_plain.json --num_epochs=1 --base_model=togethercomputer/GPT-NeoXT-Chat-Base-20B --prompt_type=plain --data_mix_in_path=None --micro_batch_size=4 --batch_size=16 --cutoff_len=1024 --run_id=4
54%|█████▌    | 2888/5311 [21:08:17<17:33:41, 26.09s/it]

chatbot: starlette.websockets.WebSocketDisconnect: 1001

Task exception was never retrieved
future: <Task finished name='xsce894h9ta_5' coro=<Queue.process_events() done, defined at /home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/queueing.py:343> exception=WebSocketDisconnect(1001)>
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/queueing.py", line 347, in process_events
    client_awake = await self.gather_event_data(event)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/queueing.py", line 220, in gather_event_data
    data, client_awake = await self.get_message(event, timeout=receive_timeout)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/queueing.py", line 453, in get_message
    data = await asyncio.wait_for(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/asyncio/tasks.py", line 494, in wait_for
    return fut.result()
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/starlette/websockets.py", line 133, in receive_json
    self._raise_on_disconnect(message)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/starlette/websockets.py", line 105, in _raise_on_disconnect
    raise WebSocketDisconnect(message["code"])
starlette.websockets.WebSocketDisconnect: 1001

Recover when GPU OOMs

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 22.20 GiB total capacity; 20.67 GiB already allocated; 4.12 MiB free; 21.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

This brings the app down, and it can no longer generate. Protect against GPU OOM, or at least recover without hanging.
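
One hedged way to at least recover rather than hang is to catch the CUDA OOM around the generate call, free the allocator cache, and surface an error to the UI. In this minimal sketch, run_generation and report_error are hypothetical placeholders, not h2oGPT functions.

# Sketch: catch CUDA OOM around generation, release cached memory, and report
# an error instead of bringing the whole app down. Placeholders only.
import torch

def generate_with_oom_guard(run_generation, report_error, **gen_kwargs):
    try:
        return run_generation(**gen_kwargs)
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()  # release cached blocks so later requests can proceed
        report_error("GPU out of memory; try a shorter prompt or fewer max_new_tokens.")
        return None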

raise ValueError("\n" + ParseException.explain(err, 0)) from None

A non-fatal matplotlib math-text processing issue seen in the HF demo:

Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/routes.py", line 401, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1305, in process_api
    data = self.postprocess_data(fn_index, result["prediction"], state)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1239, in postprocess_data
    prediction_value = block.postprocess(prediction_value)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/components.py", line 4626, in postprocess
    self._postprocess_chat_messages(message_pair[1]),
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/components.py", line 4599, in _postprocess_chat_messages
    return self.md.renderInline(chat_message)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/markdown_it/main.py", line 299, in renderInline
    return self.renderer.render(self.parseInline(src, env), self.options, env)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/markdown_it/renderer.py", line 87, in render
    result += self.renderInline(token.children, options, env)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/markdown_it/renderer.py", line 108, in renderInline
    result += self.rules[token.type](tokens, i, options, env)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/mdit_py_plugins/dollarmath/index.py", line 70, in render_math_inline
    content = _renderer(str(tokens[idx].content).strip(), {"display_mode": False})
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/utils.py", line 904, in tex2svg
    fig.savefig(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/figure.py", line 3343, in savefig
    self.canvas.print_figure(fname, **kwargs)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/backend_bases.py", line 2342, in print_figure
    self.figure.draw(renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/artist.py", line 95, in draw_wrapper
    result = draw(artist, renderer, *args, **kwargs)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/artist.py", line 72, in draw_wrapper
    return draw(artist, renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/figure.py", line 3140, in draw
    mimage._draw_list_compositing_images(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/image.py", line 131, in _draw_list_compositing_images
    a.draw(renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/artist.py", line 72, in draw_wrapper
    return draw(artist, renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/text.py", line 752, in draw
    bbox, info, descent = self._get_layout(renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/text.py", line 386, in _get_layout
    w, h, d = _get_text_metrics_with_cache(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/text.py", line 97, in _get_text_metrics_with_cache
    return _get_text_metrics_with_cache_impl(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/text.py", line 105, in _get_text_metrics_with_cache_impl
    return renderer_ref().get_text_width_height_descent(text, fontprop, ismath)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/backends/backend_svg.py", line 1317, in get_text_width_height_descent
    return self._text2path.get_text_width_height_descent(s, prop, ismath)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/textpath.py", line 60, in get_text_width_height_descent
    self.mathtext_parser.parse(s, 72, prop)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/mathtext.py", line 226, in parse
    return self._parse_cached(s, dpi, prop)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/mathtext.py", line 247, in _parse_cached
    box = self._parser.parse(s, fontset, fontsize, dpi)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/_mathtext.py", line 1995, in parse
    raise ValueError("\n" + ParseException.explain(err, 0)) from None
ValueError: 
$"{p1.Name} is {p1.Age} years old.");<br>    Console.WriteLine($
^
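One possible mitigation (a sketch, not what the demo does): escape `$` in chat messages before they reach the Gradio markdown renderer, so markdown_it's dollarmath plugin never hands C#-style `$"..."` strings to matplotlib's mathtext parser.

```python
# Hypothetical helper: neutralize inline-math delimiters before rendering.
# The exact replacement depends on what the markdown renderer expects.
def escape_dollars(chat_message: str) -> str:
    return chat_message.replace("$", "\\$")
```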

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/routes.py", line 401, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1302, in process_api
    result = await self.call_function(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1039, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/utils.py", line 491, in async_iteration
    return next(iterator)
  File "app.py", line 914, in bot
    for output in fun1(*tuple(args_list)):
  File "app.py", line 1346, in evaluate
    for output in CallbackToGenerator(generate, callback=None, **gen_kwargs):
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/_collections_abc.py", line 317, in __next__
    return self.send(None)
  File "/home/user/app/stopping.py", line 119, in send
    return self._put('send', value)
  File "/home/user/app/stopping.py", line 111, in _put
    raise val
  File "/home/user/app/stopping.py", line 95, in thread_func
    ret = func(callback=val_callback, **self.kwargs)
  File "app.py", line 1324, in generate
    model.generate(**kwargs)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/transformers/generation/utils.py", line 1485, in generate
    return self.sample(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/transformers/generation/utils.py", line 2560, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
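A hedged workaround (sketch only): catch the sampling failure and retry with greedy decoding, since the `inf`/`nan` probabilities come from the `torch.multinomial` sampling path.

```python
# Hypothetical wrapper, assumes a transformers model and tokenized `inputs`.
def generate_robust(model, inputs, **gen_kwargs):
    try:
        return model.generate(**inputs, **gen_kwargs)
    except RuntimeError as e:
        if "probability tensor" not in str(e):
            raise
        # Greedy decoding avoids torch.multinomial entirely.
        gen_kwargs.update(do_sample=False, num_beams=1)
        return model.generate(**inputs, **gen_kwargs)
```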

"Unable to locate package nvidia-container-toolkit" on Debian (Ubuntu) x86_64

Hi Team,

Nice work, and I appreciate your efforts on this project 🫡

I am trying to run the Docker container, and I hit the following issue when executing `sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit-base`:

Hit:1 http://eu-central-1.ec2.archive.ubuntu.com/ubuntu jammy InRelease
Hit:2 http://eu-central-1.ec2.archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:3 http://eu-central-1.ec2.archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:4 https://download.docker.com/linux/ubuntu jammy InRelease
Get:5 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Fetched 110 kB in 1s (195 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
E: Unable to locate package nvidia-container-toolkit-base

The solution I found was to run:

wget https://nvidia.github.io/nvidia-docker/gpgkey --no-check-certificate
sudo apt-key add gpgkey
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

sudo apt-get install -y nvidia-container-toolkit

This fixed the package problem, but I still get the following error for the command `docker run --runtime=nvidia --shm-size=64g -p 7860:7860 -v ${HOME}/.cache:/root/.cache --rm h2o-llm -it generate.py --base_model=EleutherAI/gpt-neox-20b --lora_weights=h2ogpt_lora_weights --prompt_type=human_bot`:

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

Could someone help me with this? I am trying to run the Docker container. I also tried `docker compose up`, but got the same result.

Have to push Stop twice: once to stop the streaming output, and again to stop the actual GPU generation. Fix this.

Tried adding click_event twice to cancels, but it didn't help.

Also, while the message stops instantly, generation might continue for 2-3 more seconds, since it is in the middle of hard generation.

Also, it's a bit uncontrolled: it hits this ValueError when generation finally stops:

Traceback (most recent call last):
  File "/data/jon/h2o-llm/callbacks.py", line 48, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "/data/jon/h2o-llm/generate.py", line 597, in generate_with_callback
    model.generate(**kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/peft/peft_model.py", line 581, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/generation/utils.py", line 1406, in generate
    return self.greedy_search(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/generation/utils.py", line 2256, in greedy_search
    if unfinished_sequences.max() == 0 or stopping_criteria(input_ids, scores):
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/generation/stopping_criteria.py", line 113, in __call__
    return any(criteria(input_ids, scores) for criteria in self)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/generation/stopping_criteria.py", line 113, in <genexpr>
    return any(criteria(input_ids, scores) for criteria in self)
  File "/data/jon/h2o-llm/callbacks.py", line 22, in __call__
    self.callback_func(input_ids[0])
  File "/data/jon/h2o-llm/callbacks.py", line 43, in _callback
    raise ValueError
ValueError
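One cleaner pattern (a sketch, not the current callbacks.py): have the Stop button set a `threading.Event` that a `StoppingCriteria` checks, so generation ends on the next decoding step without raising `ValueError` from inside the streaming callback.

```python
# Sketch: cooperative stop via StoppingCriteria; names are illustrative.
import threading
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnEvent(StoppingCriteria):
    def __init__(self, stop_event: threading.Event):
        self.stop_event = stop_event

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Returning True ends generation after the current token; no exception needed.
        return self.stop_event.is_set()

stop_event = threading.Event()
stopping_criteria = StoppingCriteriaList([StopOnEvent(stop_event)])
# The Stop button handler calls stop_event.set(); pass
# stopping_criteria=stopping_criteria into model.generate(...).
```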


Cannot train 'EleutherAI/gpt-neox-20b' on 2x 24GB cards

Need to step up to larger models with a permissive license. The 30B LLaMa works, but can't be used (license). 6B is too small and gives bad results, so the next best choice is gpt-neox-20b.

This works:
CUDA_VISIBLE_DEVICES=0,1 WORLD_SIZE=2 python finetune.py --data_path=alpaca_data_cleaned.json --base_model="decapoda-research/llama-30b-hf" --llama_type=True --ddp=False

This fails:
CUDA_VISIBLE_DEVICES=0,1 WORLD_SIZE=2 torchrun finetune.py --data_path=alpaca_data_cleaned.json --base_model="decapoda-research/llama-30b-hf" --llama_type=True --ddp=False

This fails (with torchrun and with python):
CUDA_VISIBLE_DEVICES=0,1 WORLD_SIZE=2 torchrun finetune.py --data_path=alpaca_data_cleaned.json --llama_type=False --ddp=False --lora_target_modules="['query_key_value']" --base_model="EleutherAI/gpt-neox-20b"
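For scale reference, a minimal sketch (assumptions: bitsandbytes and peft installed; this is not the exact finetune.py code path) of fitting gpt-neox-20b across two 24 GB cards via 8-bit weights plus a LoRA adapter on `query_key_value`, matching the --lora_target_modules flag above:

```python
# Sketch: 8-bit gpt-neox-20b sharded over both GPUs with a LoRA adapter.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
# (older peft versions call this prepare_model_for_int8_training)

base_model = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,    # requires bitsandbytes
    device_map="auto",    # shard layers across GPU 0 and GPU 1
    torch_dtype=torch.float16,
)
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["query_key_value"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```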

Add option to replace attention with flash attention

Flash attention has already been integrated into gpt-neox models here: https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/models/gpt.py#L215

We can add the swapped model definition as an option to the training and generation scripts and benchmark the speed difference.

Converting LLaMa and others might be more work. LLaMa uses a pretty standard-looking attention, but it's not clear how it differs from the PyTorch default; it might just need some layer names remapped: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L160
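As a frame of reference for the benchmark (not the flash-attn repo's integration), PyTorch 2.x exposes fused/flash kernels through `torch.nn.functional.scaled_dot_product_attention`, so a low-effort comparison point is to route the module's attention math through that call:

```python
# Sketch: fused attention that can dispatch to a flash kernel on supported GPUs.
# Expected shapes are [batch, heads, seq_len, head_dim]; requires PyTorch >= 2.0.
import torch
import torch.nn.functional as F

def causal_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```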

Something went wrong $"{p1.Name} is {p1.Age} years old.");<br> Console.WriteLine($ ^ ParseException: Expected end of text, found '$' (at char 0), (line:1, col:1)

Gradio error for certain inputs:

Downloading pytorch_model.bin: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 1.74G/1.74G [00:25<00:00, 67.6MB/s]
/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/deprecation.py:43: UserWarning: You have unused kwarg parameters in Row, please remove them: {'scale': 1}
  warnings.warn(
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
Started GUI
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
WARNING: Special characters in prompt
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/routes.py", line 401, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1305, in process_api
    data = self.postprocess_data(fn_index, result["prediction"], state)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1239, in postprocess_data
    prediction_value = block.postprocess(prediction_value)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/components.py", line 4626, in postprocess
    self._postprocess_chat_messages(message_pair[1]),
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/components.py", line 4599, in _postprocess_chat_messages
    return self.md.renderInline(chat_message)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/markdown_it/main.py", line 299, in renderInline
    return self.renderer.render(self.parseInline(src, env), self.options, env)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/markdown_it/renderer.py", line 87, in render
    result += self.renderInline(token.children, options, env)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/markdown_it/renderer.py", line 108, in renderInline
    result += self.rules[token.type](tokens, i, options, env)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/mdit_py_plugins/dollarmath/index.py", line 70, in render_math_inline
    content = _renderer(str(tokens[idx].content).strip(), {"display_mode": False})
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/utils.py", line 904, in tex2svg
    fig.savefig(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/figure.py", line 3343, in savefig
    self.canvas.print_figure(fname, **kwargs)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/backend_bases.py", line 2342, in print_figure
    self.figure.draw(renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/artist.py", line 95, in draw_wrapper
    result = draw(artist, renderer, *args, **kwargs)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/artist.py", line 72, in draw_wrapper
    return draw(artist, renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/figure.py", line 3140, in draw
    mimage._draw_list_compositing_images(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/image.py", line 131, in _draw_list_compositing_images
    a.draw(renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/artist.py", line 72, in draw_wrapper
    return draw(artist, renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/text.py", line 752, in draw
    bbox, info, descent = self._get_layout(renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/text.py", line 386, in _get_layout
    w, h, d = _get_text_metrics_with_cache(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/text.py", line 97, in _get_text_metrics_with_cache
    return _get_text_metrics_with_cache_impl(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/text.py", line 105, in _get_text_metrics_with_cache_impl
    return renderer_ref().get_text_width_height_descent(text, fontprop, ismath)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/backends/backend_svg.py", line 1317, in get_text_width_height_descent
    return self._text2path.get_text_width_height_descent(s, prop, ismath)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/textpath.py", line 60, in get_text_width_height_descent
    self.mathtext_parser.parse(s, 72, prop)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/mathtext.py", line 226, in parse
    return self._parse_cached(s, dpi, prop)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/mathtext.py", line 247, in _parse_cached
    box = self._parser.parse(s, fontset, fontsize, dpi)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/_mathtext.py", line 1995, in parse
    raise ValueError("\n" + ParseException.explain(err, 0)) from None
ValueError: 
$"{p1.Name} is {p1.Age} years old.");<br>    Console.WriteLine($
^
ParseException: Expected end of text, found '$'  (at char 0), (line:1, col:1)
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.

Not able to run inference with Docker

When I ran `sudo docker-compose up -d --build` and used `docker-compose logs -f` to check, I got the following errors.
My system has 32 GB of RAM and a Titan X GPU with 12 GB of VRAM:

h2ogpt-h2o-llm-1 | python generate.py --base_model='togethercomputer/GPT-NeoXT-Chat-Base-20B' --prompt_type='human_bot' --lora_weights='GPT-NeoXT-Chat-Base-20B.merged.json.8_epochs.57b2892c53df5b8cefac45f84d019cace803ef26.28'
h2ogpt-h2o-llm-1 |
h2ogpt-h2o-llm-1 |
h2ogpt-h2o-llm-1 | Using Model eleutherai/gpt-j-6b
h2ogpt-h2o-llm-1 | Get EleutherAI/gpt-j-6B model
h2ogpt-h2o-llm-1 | Traceback (most recent call last):
h2ogpt-h2o-llm-1 | File "/workspace/generate.py", line 1515, in
h2ogpt-h2o-llm-1 | fire.Fire(main)
h2ogpt-h2o-llm-1 | File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 141, in Fire
h2ogpt-h2o-llm-1 | component_trace = _Fire(component, args, parsed_flag_args, context, name)
h2ogpt-h2o-llm-1 | File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 475, in _Fire
h2ogpt-h2o-llm-1 | component, remaining_args = _CallAndUpdateTrace(
h2ogpt-h2o-llm-1 | File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
h2ogpt-h2o-llm-1 | component = fn(*varargs, **kwargs)
h2ogpt-h2o-llm-1 | File "/workspace/generate.py", line 249, in main
h2ogpt-h2o-llm-1 | go_gradio(**locals())
h2ogpt-h2o-llm-1 | File "/workspace/generate.py", line 490, in go_gradio
h2ogpt-h2o-llm-1 | model0, tokenizer0, device = get_model(**all_kwargs)
h2ogpt-h2o-llm-1 | File "/workspace/generate.py", line 358, in get_model
h2ogpt-h2o-llm-1 | device = get_device()
h2ogpt-h2o-llm-1 | File "/workspace/generate.py", line 256, in get_device
h2ogpt-h2o-llm-1 | raise RuntimeError("only cuda supported")
h2ogpt-h2o-llm-1 | RuntimeError: only cuda supported
h2ogpt-h2o-llm-1 | /usr/local/lib/python3.10/dist-packages/torch/cuda/init.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
h2ogpt-h2o-llm-1 | return torch._C._cuda_getDeviceCount() > 0
h2ogpt-h2o-llm-1 | /usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py:33: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
h2ogpt-h2o-llm-1 | warn("The installed version of bitsandbytes was compiled without GPU support. "

Train with all clean OSS data + model

Step 1: Get best open-source model:

model: togethercomputer/GPT-NeoXT-Chat-Base-20B https://huggingface.co/togethercomputer/GPT-NeoXT-Chat-Base-20B

Step 2: Get good open-source instruct data:

Inspired by
https://bair.berkeley.edu/blog/2023/04/03/koala/

Note: GPT-NeoXT-Chat-Base-20B was already trained on OIG data, so there is "nothing new" here; this is just fine-tuning on high-quality data. We need to include new, good datasets too.

Run these pytests to create data:
https://github.com/h2oai/h2o-llm/blob/8a1636e35bba5be28d41ab27719d0f70d7eccd91/scrape_dai_docs.py#L364-L398

Direct link to data (136 MB): https://slack-files.com/T0329MHH6-F051UHFFUTD-d93fe5bb76

Adversarial attack on reward models

Question: What do reward models really optimize for? How much assumed context do they have?

E.g., an adversarial attack might include:

  • arbitrary newlines (\n) after some average number of words
  • long, semi-random sequences of words arranged into paragraphs,
    i.e., just formatting.

It might still give a high score. If it detects coherence, etc., that would be impressive, since then it has to be as good as an LLM itself.

So reward models might assume a lot about the nature of the input data, e.g. that it is already human-readable, correct, etc.

How can RLHF prune wrong/hallucinated responses?

Also, humans may be picking up on trivial changes, like formatting, which are easy to train for. E.g.:

  • thesis at front
  • average words per sentence
  • average sentences per paragraph
  • new lines between paragraphs
  • summary at end.

At least the length part is easily chosen from available open data. Summaries can be generated from samsum-type models, and the thesis may not be as important for now.
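A hedged sketch of such a probe (the reward-model checkpoint below is just an assumed public example, not necessarily the one used here): score the same answer with and without injected formatting noise and compare.

```python
# Sketch: does a reward model's score survive pure formatting perturbations?
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "OpenAssistant/reward-model-deberta-v3-large-v2"  # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def score(question: str, answer: str) -> float:
    inputs = tokenizer(question, answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

def add_arbitrary_newlines(text: str, every: int = 7) -> str:
    # Pure formatting change: insert "\n" roughly every `every` words.
    words = text.split()
    return " ".join(w + ("\n" if i % every == every - 1 else "") for i, w in enumerate(words))

question = "Explain what a reward model does."
answer = "A reward model scores candidate responses so RLHF can prefer better ones."
print(score(question, answer), score(question, add_arbitrary_newlines(answer)))
```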

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces
Traceback (most recent call last):
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/gradio/routes.py", line 393, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/gradio/blocks.py", line 1059, in process_api
    result = await self.call_function(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/gradio/blocks.py", line 868, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/data/jon/h2o-llm/generate.py", line 132, in evaluate
    outputs = model.generate(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/peft/peft_model.py", line 581, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/generation/utils.py", line 1528, in generate
    return self.beam_sample(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/generation/utils.py", line 3126, in beam_sample
    next_tokens = torch.multinomial(probs, num_samples=2 * num_beams)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0


"recipe refers to # Recipe type## Recipes override any GUI settings- **'auto'**: all models and features automatically determined by experiment settings, toml settings, and feature_engineering_effort- **'compliant'** : like 'auto' except:    - *interpretability=10* (to avoid complexity, overrides GUI or python client chose for interpretability)    - *enable_glm='on'* (rest 'off', to avoid complexity and be compatible with algorithms supported by MLI)    - *fixed_ensemble_level=0*: Don't use any ensemble    - *feature_brain_level=0*(: No feature brain used (to ensure every restart is identical)    - *max_feature_interaction_depth=1*: interaction depth is set to 1 (no multi-feature interactions to avoid complexity)    - *target_transformer='identity'*: for regression (to avoid complexity)    - *check_distribution_shift_drop='off'*: Don't use distribution shift between train, valid, and test to drop features (bit risky without fine-tuning)- **'monotonic_gbm'** : like 'auto' except:    - *monotonicity_constraints_interpretability_switch=1*: enable monotonicity constraints    - *self.config.monotonicity_constraints_correlation_threshold = 0.01*: see below    - *monotonicity_constraints_drop_low_correlation_features=true*: drop features that aren't correlated with target by at least 0.01 (specified by parameter above)    - *fixed_ensemble_level=0*: Don't use any ensemble (to avoid complexity)    - *included_models=['LightGBMModel']*    - *included_transformers=['OriginalTransformer']*: only original (numeric) features will be used    - *feature_brain_level=0*: No feature brain used (to ensure every restart is identical)    - *monotonicity_constraints_log_level='high'*    - *autodoc_pd_max_runtime=-1*: no timeout for PDP creation in AutoDoc- **'kaggle'** : like 'auto' except:    - external validation set is concatenated with train set, with target marked as missing    - test set is concatenated with train set, with target marked as missing    - transformers that do not use the target are allowed to fit_transform across entire train + validation + test    - several config toml expert options open-up limits (e.g. more numerics are treated as categoricals)    - Note: If plentiful memory, can:        - choose kaggle mode and then change fixed_feature_interaction_depth to large negative number,    otherwise default number of features given to transformer is limited to 50 by default        - choose mutation_mode = \"full\", so even more types are transformations are done at once per transformer- **'nlp_model'**: Only enables NLP models that process pure text- **'nlp_transformer'**: Only enables NLP transformers that process pure text, while any model type is allowed- **'image_model'**: Only enables Image models that process pure images- **'image_transformer'**: Only enables Image transformers that process pure images, while any model type is allowed- **'unsupervised'**: Only enables unsupervised transformers, models and scorers- **'gpus_max'**: Maximize use of GPUs (e.g. use XGBoost, rapids, Optuna hyperparameter search, etc.)- **'more_overfit_protection'**: Potentially improve overfit, esp. for small data, by disabling target encoding and making GA behave like final model for tree counts and learning rate- **'feature_store_mojo'**: Creates a MOJO to be used as transformer in the H2O Feature Store, to augment data on a row-by-row level based on Driverless AI's feature engineering. 
Only includes transformers that don't depend on the target, since features like target encoding need to be created at model fitting time to avoid data leakage. And features like lags need to be created from the raw data, they can't be computed with a row-by-row MOJO transformer.Each pipeline building recipe mode can be chosen, and then fine-tuned using each expert settings.  Changing thepipeline building recipe will reset all pipeline building recipe options back to default and then re-apply thespecific rules for the new mode, which will undo any fine-tuning of expert options that are part of pipeline buildingrecipe rules.If choose to do new/continued/refitted/retrained experiment from parent experiment, the recipe rules are not re-appliedand any fine-tuning is preserved.  To reset recipe behavior, one can switch between 'auto' and the desired mode.  Thisway the new child experiment will use the default settings for the chosen recipe." Summarize the above into a single paragraph.
