h2oai / h2ogpt

Private chat with local GPT with documents, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://codellama.h2o.ai/

Home Page: http://h2o.ai

License: Apache License 2.0

Languages: Python 92.09%, Dockerfile 0.03%, Shell 0.59%, Makefile 0.09%, TeX 2.33%, Groovy 0.24%, Smarty 0.06%, HTML 1.66%, Jupyter Notebook 2.93%
Topics: chatgpt, llm, ai, embeddings, generative, gpt, gpt4all, pdf, private, privategpt

h2ogpt's Issues

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/routes.py", line 401, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1302, in process_api
    result = await self.call_function(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1039, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/utils.py", line 491, in async_iteration
    return next(iterator)
  File "app.py", line 914, in bot
    for output in fun1(*tuple(args_list)):
  File "app.py", line 1346, in evaluate
    for output in CallbackToGenerator(generate, callback=None, **gen_kwargs):
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/_collections_abc.py", line 317, in __next__
    return self.send(None)
  File "/home/user/app/stopping.py", line 119, in send
    return self._put('send', value)
  File "/home/user/app/stopping.py", line 111, in _put
    raise val
  File "/home/user/app/stopping.py", line 95, in thread_func
    ret = func(callback=val_callback, **self.kwargs)
  File "app.py", line 1324, in generate
    model.generate(**kwargs)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/transformers/generation/utils.py", line 1485, in generate
    return self.sample(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/transformers/generation/utils.py", line 2560, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
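This error typically surfaces when the sampling distribution handed to torch.multinomial is degenerate, e.g. temperature at or near 0 while do_sample=True, top_p/top_k filtering out every token, or NaNs from half-precision overflow. A minimal, hedged guard (these are generic transformers generate kwargs, not the app's own settings):

# Hedged guard: if sampling is requested with a (near-)zero temperature, fall back
# to greedy decoding instead of letting the softmax produce inf/nan probabilities.
gen_kwargs = dict(do_sample=True, temperature=0.0, top_p=0.9, max_new_tokens=256)
if gen_kwargs.get("do_sample") and gen_kwargs.get("temperature", 1.0) < 1e-4:
    gen_kwargs["do_sample"] = False
    gen_kwargs.pop("temperature", None)
    gen_kwargs.pop("top_p", None)
# outputs = model.generate(**inputs, **gen_kwargs)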

Benchmarks on 2xA6000 Ada vs 2xA100 80GB (roughly same speed)

2x A6000 Ada:

WORLD_SIZE=2 CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node=2 --nnodes=1 finetune.py --data_path=ShareGPT_unfiltered_cleaned_split.json.generate_human_bot.train_plain.json --num_epochs=1 --base_model=togethercomputer/GPT-NeoXT-Chat-Base-20B --prompt_type=plain --data_mix_in_path=None --micro_batch_size=4 --batch_size=16 --cutoff_len=1024 --run_id=4
54%|█████▍ | 2888/5311 [21:08:17<17:33:41, 26.09s/it]

Train with all clean OSS data + model

Step 1: Get best open-source model:

model: togethercomputer/GPT-NeoXT-Chat-Base-20B https://huggingface.co/togethercomputer/GPT-NeoXT-Chat-Base-20B

Step 2: Get good open-source instruct data:

Inspired by
https://bair.berkeley.edu/blog/2023/04/03/koala/

Note: GPT-NeoXT-Chat-Base-20B was already trained on OIG data, so this is "nothing new", just fine-tuning on high-quality data. We also need to include new, good datasets.

Run these pytests to create data:
https://github.com/h2oai/h2o-llm/blob/8a1636e35bba5be28d41ab27719d0f70d7eccd91/scrape_dai_docs.py#L364-L398

https://slack-files.com/T0329MHH6-F051UHFFUTD-d93fe5bb76 direct link to data (136MB)

not able to run inference on the docker

When I ran "sudo docker-compose up -d --build" and used docker-compose logs -f to check the logs, I got the following errors. My system has 32 GB of RAM and a Titan X GPU with 12 GB of VRAM:

h2ogpt-h2o-llm-1 | python generate.py --base_model='togethercomputer/GPT-NeoXT-Chat-Base-20B' --prompt_type='human_bot' --lora_weights='GPT-NeoXT-Chat-Base-20B.merged.json.8_epochs.57b2892c53df5b8cefac45f84d019cace803ef26.28'
h2ogpt-h2o-llm-1 |
h2ogpt-h2o-llm-1 |
h2ogpt-h2o-llm-1 | Using Model eleutherai/gpt-j-6b
h2ogpt-h2o-llm-1 | Get EleutherAI/gpt-j-6B model
h2ogpt-h2o-llm-1 | Traceback (most recent call last):
h2ogpt-h2o-llm-1 | File "/workspace/generate.py", line 1515, in <module>
h2ogpt-h2o-llm-1 | fire.Fire(main)
h2ogpt-h2o-llm-1 | File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 141, in Fire
h2ogpt-h2o-llm-1 | component_trace = _Fire(component, args, parsed_flag_args, context, name)
h2ogpt-h2o-llm-1 | File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 475, in _Fire
h2ogpt-h2o-llm-1 | component, remaining_args = _CallAndUpdateTrace(
h2ogpt-h2o-llm-1 | File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
h2ogpt-h2o-llm-1 | component = fn(*varargs, **kwargs)
h2ogpt-h2o-llm-1 | File "/workspace/generate.py", line 249, in main
h2ogpt-h2o-llm-1 | go_gradio(**locals())
h2ogpt-h2o-llm-1 | File "/workspace/generate.py", line 490, in go_gradio
h2ogpt-h2o-llm-1 | model0, tokenizer0, device = get_model(**all_kwargs)
h2ogpt-h2o-llm-1 | File "/workspace/generate.py", line 358, in get_model
h2ogpt-h2o-llm-1 | device = get_device()
h2ogpt-h2o-llm-1 | File "/workspace/generate.py", line 256, in get_device
h2ogpt-h2o-llm-1 | raise RuntimeError("only cuda supported")
h2ogpt-h2o-llm-1 | RuntimeError: only cuda supported
h2ogpt-h2o-llm-1 | /usr/local/lib/python3.10/dist-packages/torch/cuda/init.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
h2ogpt-h2o-llm-1 | return torch._C._cuda_getDeviceCount() > 0
h2ogpt-h2o-llm-1 | /usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py:33: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
h2ogpt-h2o-llm-1 | warn("The installed version of bitsandbytes was compiled without GPU support. "
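A quick in-container sanity check (a generic PyTorch snippet, not from the repo) can confirm whether the container actually sees the GPU. Error 804 ("forward compatibility was attempted on non supported HW") usually means the host NVIDIA driver is older than the CUDA runtime inside the image, so updating the host driver, or using an image built against the host's CUDA version, is the usual fix:

# Generic check, assuming PyTorch is installed inside the container.
import torch

print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("devices:", torch.cuda.device_count(), torch.cuda.get_device_name(0))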

Something went wrong $"{p1.Name} is {p1.Age} years old.");<br> Console.WriteLine($ ^ ParseException: Expected end of text, found '$' (at char 0), (line:1, col:1)

gradio error for certain inputs:

Downloading pytorch_model.bin: 100%|██████████| 1.74G/1.74G [00:25<00:00, 67.6MB/s]
/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/deprecation.py:43: UserWarning: You have unused kwarg parameters in Row, please remove them: {'scale': 1}
  warnings.warn(
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
Started GUI
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
WARNING: Special characters in prompt
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/routes.py", line 401, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1305, in process_api
    data = self.postprocess_data(fn_index, result["prediction"], state)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1239, in postprocess_data
    prediction_value = block.postprocess(prediction_value)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/components.py", line 4626, in postprocess
    self._postprocess_chat_messages(message_pair[1]),
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/components.py", line 4599, in _postprocess_chat_messages
    return self.md.renderInline(chat_message)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/markdown_it/main.py", line 299, in renderInline
    return self.renderer.render(self.parseInline(src, env), self.options, env)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/markdown_it/renderer.py", line 87, in render
    result += self.renderInline(token.children, options, env)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/markdown_it/renderer.py", line 108, in renderInline
    result += self.rules[token.type](tokens, i, options, env)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/mdit_py_plugins/dollarmath/index.py", line 70, in render_math_inline
    content = _renderer(str(tokens[idx].content).strip(), {"display_mode": False})
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/utils.py", line 904, in tex2svg
    fig.savefig(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/figure.py", line 3343, in savefig
    self.canvas.print_figure(fname, **kwargs)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/backend_bases.py", line 2342, in print_figure
    self.figure.draw(renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/artist.py", line 95, in draw_wrapper
    result = draw(artist, renderer, *args, **kwargs)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/artist.py", line 72, in draw_wrapper
    return draw(artist, renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/figure.py", line 3140, in draw
    mimage._draw_list_compositing_images(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/image.py", line 131, in _draw_list_compositing_images
    a.draw(renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/artist.py", line 72, in draw_wrapper
    return draw(artist, renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/text.py", line 752, in draw
    bbox, info, descent = self._get_layout(renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/text.py", line 386, in _get_layout
    w, h, d = _get_text_metrics_with_cache(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/text.py", line 97, in _get_text_metrics_with_cache
    return _get_text_metrics_with_cache_impl(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/text.py", line 105, in _get_text_metrics_with_cache_impl
    return renderer_ref().get_text_width_height_descent(text, fontprop, ismath)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/backends/backend_svg.py", line 1317, in get_text_width_height_descent
    return self._text2path.get_text_width_height_descent(s, prop, ismath)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/textpath.py", line 60, in get_text_width_height_descent
    self.mathtext_parser.parse(s, 72, prop)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/mathtext.py", line 226, in parse
    return self._parse_cached(s, dpi, prop)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/mathtext.py", line 247, in _parse_cached
    box = self._parser.parse(s, fontset, fontsize, dpi)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/_mathtext.py", line 1995, in parse
    raise ValueError("\n" + ParseException.explain(err, 0)) from None
ValueError: 
$"{p1.Name} is {p1.Age} years old.");<br>    Console.WriteLine($
^
ParseException: Expected end of text, found '$'  (at char 0), (line:1, col:1)
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
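One possible workaround (a sketch, not something from the issue) is to neutralize `$` in model output before it reaches the Chatbot's markdown renderer, since gradio's dollarmath plugin treats `$...$` as inline LaTeX and hands it to matplotlib's mathtext parser:

def escape_dollars(text: str) -> str:
    # Hypothetical helper: replace '$' with its HTML entity so the dollarmath
    # markdown plugin does not try to parse chat output as inline math.
    # Assumption: the Chatbot component renders the HTML entity back as '$'.
    return text.replace("$", "&#36;")

bot_reply = 'Console.WriteLine($"{p1.Name} is {p1.Age} years old.");'
history = [["show me C# string interpolation", escape_dollars(bot_reply)]]
# gr.Chatbot(value=history)  # gradio wiring elided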

"Unable to locate package nvidia-container-toolkit" on Debian (Ubuntu) x86_64

Hi Team,

Nice work and appreciate your efforts on this project 🫡

I am trying to run the Docker container and I had the following issue when executing the command sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit-base

Hit:1 http://eu-central-1.ec2.archive.ubuntu.com/ubuntu jammy InRelease
Hit:2 http://eu-central-1.ec2.archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:3 http://eu-central-1.ec2.archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:4 https://download.docker.com/linux/ubuntu jammy InRelease
Get:5 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Fetched 110 kB in 1s (195 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
E: Unable to locate package nvidia-container-toolkit-base

And the solution I found was to:

wget https://nvidia.github.io/nvidia-docker/gpgkey --no-check-certificate
sudo apt-key add gpgkey
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

sudo apt-get install -y nvidia-container-toolkit

This fixed the problem, but I still get the following error for the command docker run --runtime=nvidia --shm-size=64g -p 7860:7860 -v ${HOME}/.cache:/root/.cache --rm h2o-llm -it generate.py --base_model=EleutherAI/gpt-neox-20b --lora_weights=h2ogpt_lora_weights --prompt_type=human_bot:

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

Could someone help me with this? I am trying to run the Docker container. I also tried docker compose up, but I get the same error.

raise ValueError("\n" + ParseException.explain(err, 0)) from None

Some non-fatal matplotlib mathtext processing issue seen in the HF demo:

Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/routes.py", line 401, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1305, in process_api
    data = self.postprocess_data(fn_index, result["prediction"], state)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1239, in postprocess_data
    prediction_value = block.postprocess(prediction_value)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/components.py", line 4626, in postprocess
    self._postprocess_chat_messages(message_pair[1]),
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/components.py", line 4599, in _postprocess_chat_messages
    return self.md.renderInline(chat_message)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/markdown_it/main.py", line 299, in renderInline
    return self.renderer.render(self.parseInline(src, env), self.options, env)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/markdown_it/renderer.py", line 87, in render
    result += self.renderInline(token.children, options, env)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/markdown_it/renderer.py", line 108, in renderInline
    result += self.rules[token.type](tokens, i, options, env)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/mdit_py_plugins/dollarmath/index.py", line 70, in render_math_inline
    content = _renderer(str(tokens[idx].content).strip(), {"display_mode": False})
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/utils.py", line 904, in tex2svg
    fig.savefig(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/figure.py", line 3343, in savefig
    self.canvas.print_figure(fname, **kwargs)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/backend_bases.py", line 2342, in print_figure
    self.figure.draw(renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/artist.py", line 95, in draw_wrapper
    result = draw(artist, renderer, *args, **kwargs)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/artist.py", line 72, in draw_wrapper
    return draw(artist, renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/figure.py", line 3140, in draw
    mimage._draw_list_compositing_images(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/image.py", line 131, in _draw_list_compositing_images
    a.draw(renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/artist.py", line 72, in draw_wrapper
    return draw(artist, renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/text.py", line 752, in draw
    bbox, info, descent = self._get_layout(renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/text.py", line 386, in _get_layout
    w, h, d = _get_text_metrics_with_cache(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/text.py", line 97, in _get_text_metrics_with_cache
    return _get_text_metrics_with_cache_impl(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/text.py", line 105, in _get_text_metrics_with_cache_impl
    return renderer_ref().get_text_width_height_descent(text, fontprop, ismath)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/backends/backend_svg.py", line 1317, in get_text_width_height_descent
    return self._text2path.get_text_width_height_descent(s, prop, ismath)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/textpath.py", line 60, in get_text_width_height_descent
    self.mathtext_parser.parse(s, 72, prop)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/mathtext.py", line 226, in parse
    return self._parse_cached(s, dpi, prop)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/mathtext.py", line 247, in _parse_cached
    box = self._parser.parse(s, fontset, fontsize, dpi)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/_mathtext.py", line 1995, in parse
    raise ValueError("\n" + ParseException.explain(err, 0)) from None
ValueError: 
$"{p1.Name} is {p1.Age} years old.");<br>    Console.WriteLine($
^

Ensemble multi-task LORAs

Plan is to develop multiple LoRAs. The point is that the base can be inferenced once, then each new task can be:

1) base + first
2) - first + second
3) - second + third
etc.

So the base is only forwarded once. This is a normal part of the LoRA paper.

A mixture-of-experts idea can then be used, where yet another LoRA is built, but this time it sits in front of all the other LoRA outputs as an ensemble model, so it can handle the diverse tasks. In principle, a lot less data is required for the ensemble LoRA, since it only has to choose which task LoRAs to blend.
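A minimal sketch of swapping task-specific adapters on one base model with PEFT; the adapter repos here are hypothetical placeholders, and this shows only adapter switching, not the proposed ensemble LoRA:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_name = "h2oai/h2ogpt-oig-oasst1-256-20b"  # any causal LM; just an example
tokenizer = AutoTokenizer.from_pretrained(base_name, padding_side="left")
base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16, device_map="auto")

# Register several adapters on the same base; the large base weights load only once.
model = PeftModel.from_pretrained(base, "my-org/lora-task1", adapter_name="task1")  # hypothetical repo
model.load_adapter("my-org/lora-task2", adapter_name="task2")                       # hypothetical repo

model.set_adapter("task1")   # "base + first"
# ... generate for task 1 ...
model.set_adapter("task2")   # "- first + second": only the small LoRA weights switch
# ... generate for task 2 ...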

input_ids are not moved to GPU

I'm running this locally with the downloaded h2oai_pipeline:

import torch
from h2oai_pipeline import H2OTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("h2oai/h2ogpt-oig-oasst1-256-20b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("h2oai/h2ogpt-oig-oasst1-256-20b", torch_dtype=torch.bfloat16, device_map="auto")

generate_text = H2OTextGenerationPipeline(model=model, tokenizer=tokenizer)

res = generate_text("Why is drinking water so healthy?", return_full_text=True, max_new_tokens=100)
print(res[0]["generated_text"])

And while the generation works, I get this Warning:

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py:1359: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.
  warnings.warn(

Question 1: How do I make your custom pipeline move the input_ids to GPU?

Question 2: How do I make your custom pipeline set the pad_token_id to suppress the info log?

Question 3: The response from your custom pipeline is just plain text, no history. How do I build a conversation?

Thanks!
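For Questions 1 and 2, a minimal workaround sketch that bypasses the custom pipeline's device handling by tokenizing manually, moving the tensors to the model's device, and passing pad_token_id explicitly (this is just one approach, not the pipeline's official API):

# tokenizer and model are the objects created in the snippet above.
prompt = "Why is drinking water so healthy?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # with device_map="auto" this is usually the first shard's device

output_ids = model.generate(
    **inputs,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id,  # silences the "Setting pad_token_id..." message
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))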

Adversarial attack on reward models

Question: What do reward models really optimize for? How much assumed context do they have?

E.g. an adversarial attack might include:

  • arbitrary \n after some average number of words
  • a long semi-random sequence of words arranged in paragraphs, i.e. just formatting.

It might still give a high score. If the reward model detects coherence etc., that would be impressive, since it would then have to be as good as an LLM itself.

Reward models might also assume a lot about the nature of the input data, e.g. that it is already human-readable, correct, etc.

How can RLHF prune wrong/hallucinated responses?

Also, humans may be picking up on trivial changes, like formatting, which are easy to train for. E.g.:

  • thesis at front
  • average words per sentence
  • average sentences per paragraph
  • new lines between paragraphs
  • summary at end.

At least the length part is easily chosen from available open data. Summaries can be generated from samsum-type models, and the thesis may not be as important for now.
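A minimal sketch of probing formatting sensitivity, assuming an open reward model such as OpenAssistant/reward-model-deberta-v3-large-v2; the specific model and the newline perturbation are illustrative assumptions, not from this issue:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

reward_name = "OpenAssistant/reward-model-deberta-v3-large-v2"  # assumed reward model
tok = AutoTokenizer.from_pretrained(reward_name)
rm = AutoModelForSequenceClassification.from_pretrained(reward_name)

question = "Why is drinking water so healthy?"
answer = "Water regulates body temperature, transports nutrients, and keeps joints lubricated."
# Formatting-only perturbation: insert a newline after every few words.
words = answer.split()
perturbed = "\n".join(" ".join(words[i:i + 4]) for i in range(0, len(words), 4))

def score(q, a):
    inputs = tok(q, a, return_tensors="pt")
    with torch.no_grad():
        return rm(**inputs).logits[0].item()

print("original score :", score(question, answer))
print("perturbed score:", score(question, perturbed))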

API for LLM

Design an API for applications and composability with the h2o LLM (along the lines of LangChain compatibility); a hedged integration sketch follows below.

  • PR for langchain
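A minimal sketch of exposing an h2oGPT-style transformers pipeline to LangChain through its generic HuggingFacePipeline wrapper; this is a generic LangChain pattern of that era, not the project's own API, and the model name and prompt are illustrative:

import torch
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain

# Wrap a local text-generation pipeline so LangChain chains can call it.
hf_pipe = pipeline(
    "text-generation",
    model="h2oai/h2ogpt-oig-oasst1-256-20b",  # illustrative; any local causal LM works
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_new_tokens=128,
)
llm = HuggingFacePipeline(pipeline=hf_pipe)

prompt = PromptTemplate(input_variables=["question"], template="Q: {question}\nA:")
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run(question="Why is drinking water so healthy?"))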

Have to push Stop twice, once to stop the output and again to stop the actual GPU generation; fix

Tried adding click_event twice in cancel; it didn't help.

Also, while the message stops instantly, generation might continue for 2-3 seconds more, since it is in the middle of hard generation.

Also, it is a bit uncontrolled and hits this ValueError when generation finally stops:

Traceback (most recent call last):
  File "/data/jon/h2o-llm/callbacks.py", line 48, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "/data/jon/h2o-llm/generate.py", line 597, in generate_with_callback
    model.generate(**kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/peft/peft_model.py", line 581, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/generation/utils.py", line 1406, in generate
    return self.greedy_search(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/generation/utils.py", line 2256, in greedy_search
    if unfinished_sequences.max() == 0 or stopping_criteria(input_ids, scores):
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/generation/stopping_criteria.py", line 113, in __call__
    return any(criteria(input_ids, scores) for criteria in self)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/generation/stopping_criteria.py", line 113, in <genexpr>
    return any(criteria(input_ids, scores) for criteria in self)
  File "/data/jon/h2o-llm/callbacks.py", line 22, in __call__
    self.callback_func(input_ids[0])
  File "/data/jon/h2o-llm/callbacks.py", line 43, in _callback
    raise ValueError
ValueError
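The traceback shows generation being interrupted by raising ValueError from inside the callback. A minimal alternative sketch is a flag-based stopping criterion that ends generation cleanly when the UI Stop handler sets an event; the names here are hypothetical, not the repo's callbacks.py API:

import threading
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnEvent(StoppingCriteria):
    """Return True (stop generation) once an external threading.Event is set."""
    def __init__(self, stop_event: threading.Event):
        self.stop_event = stop_event

    def __call__(self, input_ids, scores, **kwargs) -> bool:
        return self.stop_event.is_set()

stop_event = threading.Event()
stopping_criteria = StoppingCriteriaList([StopOnEvent(stop_event)])

# model.generate(..., stopping_criteria=stopping_criteria)  # in the generation thread
# stop_event.set()                                          # in the UI "Stop" click handler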


gradio matplotlib issue then Tcl_AsyncDelete: async handler deleted by the wrong thread

/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/gradio/utils.py:901: UserWarning: Starting a Matplotlib GUI outside of the main thread will likely fail.
  fig = plt.figure(figsize=(0.01, 0.01))
Exception ignored in: <function Image.__del__ at 0x7f17e015f2e0>
Traceback (most recent call last):
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/tkinter/__init__.py", line 4056, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f17e0107b50>
Traceback (most recent call last):
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/tkinter/__init__.py", line 388, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f17e0107b50>
Traceback (most recent call last):
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/tkinter/__init__.py", line 388, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f17e0107b50>
Traceback (most recent call last):
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/tkinter/__init__.py", line 388, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f17e0107b50>
Traceback (most recent call last):
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/tkinter/__init__.py", line 388, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Tcl_AsyncDelete: async handler deleted by the wrong thread
Aborted (core dumped)
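Not from the log, just a common matplotlib workaround: forcing the non-interactive Agg backend before pyplot is imported avoids creating Tk objects outside the main thread, which is what triggers the Tcl_AsyncDelete crash:

# Force a thread-safe, non-interactive backend before anything imports pyplot.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(0.01, 0.01))  # now safe to create in a worker thread
plt.close(fig)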

Increase in GPU memory usage as generation continues, imbalanced across GPUs

>>> import torch
>>> from transformers import pipeline
>>> from transformers import pipeline
>>> generate_text = pipeline(model="h2oai/h2ogpt-oasst1-512-20b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
>>> res = generate_text("Why is drinking water so healthy?", max_new_tokens=3000)
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.

During this long generation, GPU memory usage starts out balanced, then becomes increasingly imbalanced:

Thu Apr 20 16:37:04 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000                On | 00000000:3B:00.0 Off |                  Off |
|  0%   45C    P2              105W / 250W|  12220MiB / 49140MiB |     33%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000                On | 00000000:5E:00.0 Off |                  Off |
|  0%   45C    P2               72W / 250W|  11744MiB / 49140MiB |     17%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000                On | 00000000:86:00.0 Off |                  Off |
|  0%   45C    P2               98W / 250W|  11744MiB / 49140MiB |     19%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000                On | 00000000:AF:00.0 Off |                  Off |
|  0%   45C    P2              103W / 250W|  11125MiB / 49140MiB |     23%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000                On | 00000000:3B:00.0 Off |                  Off |
|  0%   50C    P2               95W / 250W|  40566MiB / 49140MiB |     73%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000                On | 00000000:5E:00.0 Off |                  Off |
|  0%   48C    P2               76W / 250W|  15926MiB / 49140MiB |     36%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000                On | 00000000:86:00.0 Off |                  Off |
|  0%   48C    P2               87W / 250W|  15926MiB / 49140MiB |     10%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000                On | 00000000:AF:00.0 Off |                  Off |
|  0%   49C    P2              130W / 250W|  14682MiB / 49140MiB |     21%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

but memory usage can still go back down by a lot during generation:

Thu Apr 20 16:47:17 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000                On | 00000000:3B:00.0 Off |                  Off |
|  0%   50C    P2               95W / 250W|  18334MiB / 49140MiB |     75%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000                On | 00000000:5E:00.0 Off |                  Off |
|  0%   49C    P2               74W / 250W|  17642MiB / 49140MiB |      8%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000                On | 00000000:86:00.0 Off |                  Off |
|  0%   50C    P2              117W / 250W|  17642MiB / 49140MiB |      6%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000                On | 00000000:AF:00.0 Off |                  Off |
|  0%   49C    P2              115W / 250W|  16139MiB / 49140MiB |     16%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

It also eventually fails:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 209, in __call__
    return super().__call__(text_inputs, **kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1109, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postproces
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1116, in run_single
    model_outputs = self.forward(model_inputs, **forward_params)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1015, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 251, in _forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=att
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/generation/utils.py", line 1437, in generate
    return self.greedy_search(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/generation/utils.py", line 2248, in greedy_search
    outputs = self(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 662, in forward
    outputs = self.gpt_neox(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 553, in forward
    outputs = layer(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 320, in forward
    attention_layer_outputs = self.attention(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 152, in forward
    attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_m
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 219, in _attn
    attn_scores = torch.where(causal_mask, attn_scores, mask_value)
RuntimeError: The size of tensor a (2048) must match the size of tensor b (2049) at non-singleton dimension 3
>>> 
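The final RuntimeError (2048 vs 2049 at dimension 3) is consistent with generation running past the model's maximum context length, since max_new_tokens=3000 can push the sequence beyond the 2048-token window of GPT-NeoX-based models. A minimal, hedged sketch of capping new tokens to the window:

import torch
from transformers import pipeline

generate_text = pipeline(model="h2oai/h2ogpt-oasst1-512-20b", torch_dtype=torch.bfloat16,
                         trust_remote_code=True, device_map="auto")

prompt = "Why is drinking water so healthy?"
prompt_len = len(generate_text.tokenizer(prompt)["input_ids"])
# Assumption: the config exposes max_position_embeddings (true for GPT-NeoX-style models).
context_len = generate_text.model.config.max_position_embeddings
max_new = max(1, context_len - prompt_len)

res = generate_text(prompt, max_new_tokens=min(3000, max_new))
print(res[0]["generated_text"])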

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces
Traceback (most recent call last):
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/gradio/routes.py", line 393, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/gradio/blocks.py", line 1059, in process_api
    result = await self.call_function(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/gradio/blocks.py", line 868, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/data/jon/h2o-llm/generate.py", line 132, in evaluate
    outputs = model.generate(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/peft/peft_model.py", line 581, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/generation/utils.py", line 1528, in generate
    return self.beam_sample(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/generation/utils.py", line 3126, in beam_sample
    next_tokens = torch.multinomial(probs, num_samples=2 * num_beams)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
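
If the underlying problem is inf/nan creeping into the fp16 logits, one mitigation sometimes tried (a hedged sketch, not the repo's generate.py; whether it fixes the root cause here is an assumption) is to strip invalid values from the logits and renormalize before sampling:

# Hedged sketch: remove inf/nan logits and renormalize before sampling.
# `inputs` is assumed to be the tokenized prompt dict.
from transformers import InfNanRemoveLogitsProcessor, LogitsProcessorList

outputs = model.generate(
    **inputs,
    do_sample=True,
    num_beams=1,      # plain sampling; beam_sample above hit the same error
    logits_processor=LogitsProcessorList([InfNanRemoveLogitsProcessor()]),
    renormalize_logits=True,   # renormalize after all processors/warpers run
    max_new_tokens=256,
)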


"recipe refers to # Recipe type## Recipes override any GUI settings- **'auto'**: all models and features automatically determined by experiment settings, toml settings, and feature_engineering_effort- **'compliant'** : like 'auto' except:    - *interpretability=10* (to avoid complexity, overrides GUI or python client chose for interpretability)    - *enable_glm='on'* (rest 'off', to avoid complexity and be compatible with algorithms supported by MLI)    - *fixed_ensemble_level=0*: Don't use any ensemble    - *feature_brain_level=0*(: No feature brain used (to ensure every restart is identical)    - *max_feature_interaction_depth=1*: interaction depth is set to 1 (no multi-feature interactions to avoid complexity)    - *target_transformer='identity'*: for regression (to avoid complexity)    - *check_distribution_shift_drop='off'*: Don't use distribution shift between train, valid, and test to drop features (bit risky without fine-tuning)- **'monotonic_gbm'** : like 'auto' except:    - *monotonicity_constraints_interpretability_switch=1*: enable monotonicity constraints    - *self.config.monotonicity_constraints_correlation_threshold = 0.01*: see below    - *monotonicity_constraints_drop_low_correlation_features=true*: drop features that aren't correlated with target by at least 0.01 (specified by parameter above)    - *fixed_ensemble_level=0*: Don't use any ensemble (to avoid complexity)    - *included_models=['LightGBMModel']*    - *included_transformers=['OriginalTransformer']*: only original (numeric) features will be used    - *feature_brain_level=0*: No feature brain used (to ensure every restart is identical)    - *monotonicity_constraints_log_level='high'*    - *autodoc_pd_max_runtime=-1*: no timeout for PDP creation in AutoDoc- **'kaggle'** : like 'auto' except:    - external validation set is concatenated with train set, with target marked as missing    - test set is concatenated with train set, with target marked as missing    - transformers that do not use the target are allowed to fit_transform across entire train + validation + test    - several config toml expert options open-up limits (e.g. more numerics are treated as categoricals)    - Note: If plentiful memory, can:        - choose kaggle mode and then change fixed_feature_interaction_depth to large negative number,    otherwise default number of features given to transformer is limited to 50 by default        - choose mutation_mode = \"full\", so even more types are transformations are done at once per transformer- **'nlp_model'**: Only enables NLP models that process pure text- **'nlp_transformer'**: Only enables NLP transformers that process pure text, while any model type is allowed- **'image_model'**: Only enables Image models that process pure images- **'image_transformer'**: Only enables Image transformers that process pure images, while any model type is allowed- **'unsupervised'**: Only enables unsupervised transformers, models and scorers- **'gpus_max'**: Maximize use of GPUs (e.g. use XGBoost, rapids, Optuna hyperparameter search, etc.)- **'more_overfit_protection'**: Potentially improve overfit, esp. for small data, by disabling target encoding and making GA behave like final model for tree counts and learning rate- **'feature_store_mojo'**: Creates a MOJO to be used as transformer in the H2O Feature Store, to augment data on a row-by-row level based on Driverless AI's feature engineering. 
Only includes transformers that don't depend on the target, since features like target encoding need to be created at model fitting time to avoid data leakage. And features like lags need to be created from the raw data, they can't be computed with a row-by-row MOJO transformer.Each pipeline building recipe mode can be chosen, and then fine-tuned using each expert settings.  Changing thepipeline building recipe will reset all pipeline building recipe options back to default and then re-apply thespecific rules for the new mode, which will undo any fine-tuning of expert options that are part of pipeline buildingrecipe rules.If choose to do new/continued/refitted/retrained experiment from parent experiment, the recipe rules are not re-appliedand any fine-tuning is preserved.  To reset recipe behavior, one can switch between 'auto' and the desired mode.  Thisway the new child experiment will use the default settings for the chosen recipe." Summarize the above into a single paragraph.

Add option to replace attention with flash attention

Flash attention has already been integrated into gpt-neox models here: https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/models/gpt.py#L215

Can add the swapped model definition as an option to the training and generation scripts and benchmark the speed difference.

Converting LLaMA and others might be more work. It uses fairly standard-looking attention, but it's not clear how it differs from the PyTorch default; we might just need to remap some layer names: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L160
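
As a rough sketch of the kind of swap involved (not the HazyResearch integration linked above), GPT-NeoX attention could be monkey-patched to use PyTorch 2.x scaled_dot_product_attention, which can dispatch to a FlashAttention kernel on supported GPUs. The _attn signature below is an assumption about the installed transformers version, and the patch ignores padding masks, head_mask, and cached decoding, so it's only suitable for quick prefill-speed benchmarks:

# Illustrative monkey-patch only (assumes transformers' GPTNeoXAttention._attn
# signature and PyTorch >= 2.0); not valid for padded batches or KV-cache decoding.
import torch.nn.functional as F
from transformers.models.gpt_neox.modeling_gpt_neox import GPTNeoXAttention

def _sdpa_attn(self, query, key, value, attention_mask=None, head_mask=None):
    # query/key/value: [batch, num_heads, seq_len, head_size]
    attn_output = F.scaled_dot_product_attention(
        query, key, value,
        dropout_p=0.0,
        is_causal=True,        # full-sequence causal attention, no padding mask
    )
    return attn_output, None   # attention weights are not materialized

GPTNeoXAttention._attn = _sdpa_attn  # apply before running the model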

chatbot: starlette.websockets.WebSocketDisconnect: 1001

Task exception was never retrieved
future: <Task finished name='xsce894h9ta_5' coro=<Queue.process_events() done, defined at /home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/queueing.py:343> exception=WebSocketDisconnect(1001)>
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/queueing.py", line 347, in process_events
    client_awake = await self.gather_event_data(event)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/queueing.py", line 220, in gather_event_data
    data, client_awake = await self.get_message(event, timeout=receive_timeout)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/queueing.py", line 453, in get_message
    data = await asyncio.wait_for(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/asyncio/tasks.py", line 494, in wait_for
    return fut.result()
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/starlette/websockets.py", line 133, in receive_json
    self._raise_on_disconnect(message)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/starlette/websockets.py", line 105, in _raise_on_disconnect
    raise WebSocketDisconnect(message["code"])
starlette.websockets.WebSocketDisconnect: 1001

Cannot train 'EleutherAI/gpt-neox-20b' on 2x 24GB cards

Need to step up to larger models with a permissive license. 30B LLaMA works but can't be used (license). 6B is too small and gives bad results. So the next best choice is gpt-neox-20b.

this works:
CUDA_VISIBLE_DEVICES=0,1 WORLD_SIZE=2 python finetune.py --data_path=alpaca_data_cleaned.json --base_model="decapoda-research/llama-30b-hf" --llama_type=True --ddp=False

this fails:
CUDA_VISIBLE_DEVICES=0,1 WORLD_SIZE=2 torchrun finetune.py --data_path=alpaca_data_cleaned.json --base_model="decapoda-research/llama-30b-hf" --llama_type=True --ddp=False

this fails (with torchrun, and also when launched with plain python):
CUDA_VISIBLE_DEVICES=0,1 WORLD_SIZE=2 torchrun finetune.py --data_path=alpaca_data_cleaned.json --llama_type=False --ddp=False --lora_target_modules="['query_key_value']" --base_model="EleutherAI/gpt-neox-20b"
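
One option that might let the 20B model fit on 2x 24GB (a hedged sketch, not the repo's finetune.py; it uses naive layer sharding via device_map rather than DDP/torchrun, and exact behavior depends on the installed transformers/peft/bitsandbytes versions) is 8-bit loading plus a LoRA adapter on the fused query_key_value projection:

# Hedged sketch: int8 weights sharded across both GPUs, LoRA-only training.
# Newer peft renames prepare_model_for_int8_training to prepare_model_for_kbit_training.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

base_model = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,    # int8 weights via bitsandbytes (roughly 1 byte per parameter)
    device_map="auto",    # shard layers across GPU 0 and GPU 1
)
model = prepare_model_for_int8_training(model)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # GPT-NeoX fused QKV projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()       # only the LoRA weights should be trainable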

Recover when GPU OOMs

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 22.20 GiB total capacity; 20.67 GiB already allocated; 4.12 MiB free; 21.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

This brings the app down and it can no longer generate. Protect against GPU OOM, or at least recover without hanging.
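
A minimal sketch of one way to do that (the wrapper name and arguments are hypothetical, not the repo's API): catch the OOM around generate, free the CUDA cache, and let the UI report an error instead of crashing the worker:

# Hedged sketch: keep the app alive after a CUDA OOM during generation.
import gc
import torch

def generate_response(model, **gen_kwargs):
    try:
        with torch.no_grad():
            return model.generate(**gen_kwargs)
    except torch.cuda.OutOfMemoryError as e:
        gc.collect()
        torch.cuda.empty_cache()  # release cached blocks so later, smaller requests can run
        print(f"Generation failed with CUDA OOM: {e}")
        return None  # caller surfaces "out of memory, try a shorter prompt" to the UI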
