h2oai / h2ogpt

Private chat with local GPT with documents, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://codellama.h2o.ai/

Home Page: http://h2o.ai

License: Apache License 2.0

Languages: Python 92.09%, Dockerfile 0.03%, Shell 0.59%, Makefile 0.09%, TeX 2.33%, Groovy 0.24%, Smarty 0.06%, HTML 1.66%, Jupyter Notebook 2.93%
Topics: chatgpt, llm, ai, embeddings, generative, gpt, gpt4all, pdf, private, privategpt

h2ogpt's Issues

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/routes.py", line 401, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1302, in process_api
    result = await self.call_function(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1039, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/utils.py", line 491, in async_iteration
    return next(iterator)
  File "app.py", line 914, in bot
    for output in fun1(*tuple(args_list)):
  File "app.py", line 1346, in evaluate
    for output in CallbackToGenerator(generate, callback=None, **gen_kwargs):
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/_collections_abc.py", line 317, in __next__
    return self.send(None)
  File "/home/user/app/stopping.py", line 119, in send
    return self._put('send', value)
  File "/home/user/app/stopping.py", line 111, in _put
    raise val
  File "/home/user/app/stopping.py", line 95, in thread_func
    ret = func(callback=val_callback, **self.kwargs)
  File "app.py", line 1324, in generate
    model.generate(**kwargs)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/transformers/generation/utils.py", line 1485, in generate
    return self.sample(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/transformers/generation/utils.py", line 2560, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
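This error typically surfaces when the sampling distribution handed to torch.multinomial is degenerate, e.g. temperature at or near 0 while do_sample=True, top_p/top_k filtering out every token, or NaNs from half-precision overflow. A minimal, hedged guard (these are generic transformers generate kwargs, not the app's own settings):

# Hedged guard: if sampling is requested with a (near-)zero temperature, fall back
# to greedy decoding instead of letting the softmax produce inf/nan probabilities.
gen_kwargs = dict(do_sample=True, temperature=0.0, top_p=0.9, max_new_tokens=256)
if gen_kwargs.get("do_sample") and gen_kwargs.get("temperature", 1.0) < 1e-4:
    gen_kwargs["do_sample"] = False
    gen_kwargs.pop("temperature", None)
    gen_kwargs.pop("top_p", None)
# outputs = model.generate(**inputs, **gen_kwargs)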

Benchmarks on 2xA6000 Ada vs 2xA100 80GB (roughly same speed)

2x A6000 Ada:

WORLD_SIZE=2 CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node=2 --nnodes=1 finetune.py --data_path=ShareGPT_unfiltered_cleaned_split.json.generate_human_bot.train_plain.json --num_epochs=1 --base_model=togethercomputer/GPT-NeoXT-Chat-Base-20B --prompt_type=plain --data_mix_in_path=None --micro_batch_size=4 --batch_size=16 --cutoff_len=1024 --run_id=4
54%|█████▍ | 2888/5311 [21:08:17<17:33:41, 26.09s/it]

Train with all clean OSS data + model

Step 1: Get best open-source model:

model: togethercomputer/GPT-NeoXT-Chat-Base-20B https://huggingface.co/togethercomputer/GPT-NeoXT-Chat-Base-20B

Step 2: Get good open-source instruct data:

Inspired by
https://bair.berkeley.edu/blog/2023/04/03/koala/

Note: GPT-NeoXT-Chat-Base-20B was already trained on OIG data, so this is "nothing new", just fine-tuning on high-quality data. We also need to include new, good datasets.

Run these pytests to create data:
https://github.com/h2oai/h2o-llm/blob/8a1636e35bba5be28d41ab27719d0f70d7eccd91/scrape_dai_docs.py#L364-L398

https://slack-files.com/T0329MHH6-F051UHFFUTD-d93fe5bb76 direct link to data (136MB)

not able to run inference on the docker

When I ran "sudo docker-compose up -d --build" and used docker-compose logs -f to check the logs, I got the following errors. My system has 32 GB of RAM and a Titan X GPU with 12 GB of VRAM:

h2ogpt-h2o-llm-1 | python generate.py --base_model='togethercomputer/GPT-NeoXT-Chat-Base-20B' --prompt_type='human_bot' --lora_weights='GPT-NeoXT-Chat-Base-20B.merged.json.8_epochs.57b2892c53df5b8cefac45f84d019cace803ef26.28'
h2ogpt-h2o-llm-1 |
h2ogpt-h2o-llm-1 |
h2ogpt-h2o-llm-1 | Using Model eleutherai/gpt-j-6b
h2ogpt-h2o-llm-1 | Get EleutherAI/gpt-j-6B model
h2ogpt-h2o-llm-1 | Traceback (most recent call last):
h2ogpt-h2o-llm-1 | File "/workspace/generate.py", line 1515, in <module>
h2ogpt-h2o-llm-1 | fire.Fire(main)
h2ogpt-h2o-llm-1 | File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 141, in Fire
h2ogpt-h2o-llm-1 | component_trace = _Fire(component, args, parsed_flag_args, context, name)
h2ogpt-h2o-llm-1 | File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 475, in _Fire
h2ogpt-h2o-llm-1 | component, remaining_args = _CallAndUpdateTrace(
h2ogpt-h2o-llm-1 | File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
h2ogpt-h2o-llm-1 | component = fn(*varargs, **kwargs)
h2ogpt-h2o-llm-1 | File "/workspace/generate.py", line 249, in main
h2ogpt-h2o-llm-1 | go_gradio(**locals())
h2ogpt-h2o-llm-1 | File "/workspace/generate.py", line 490, in go_gradio
h2ogpt-h2o-llm-1 | model0, tokenizer0, device = get_model(**all_kwargs)
h2ogpt-h2o-llm-1 | File "/workspace/generate.py", line 358, in get_model
h2ogpt-h2o-llm-1 | device = get_device()
h2ogpt-h2o-llm-1 | File "/workspace/generate.py", line 256, in get_device
h2ogpt-h2o-llm-1 | raise RuntimeError("only cuda supported")
h2ogpt-h2o-llm-1 | RuntimeError: only cuda supported
h2ogpt-h2o-llm-1 | /usr/local/lib/python3.10/dist-packages/torch/cuda/init.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
h2ogpt-h2o-llm-1 | return torch._C._cuda_getDeviceCount() > 0
h2ogpt-h2o-llm-1 | /usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py:33: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
h2ogpt-h2o-llm-1 | warn("The installed version of bitsandbytes was compiled without GPU support. "
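A quick in-container sanity check (a generic PyTorch snippet, not from the repo) can confirm whether the container actually sees the GPU. Error 804 ("forward compatibility was attempted on non supported HW") usually means the host NVIDIA driver is older than the CUDA runtime inside the image, so updating the host driver, or using an image built against the host's CUDA version, is the usual fix:

# Generic check, assuming PyTorch is installed inside the container.
import torch

print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("devices:", torch.cuda.device_count(), torch.cuda.get_device_name(0))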

Something went wrong $"{p1.Name} is {p1.Age} years old.");<br> Console.WriteLine($ ^ ParseException: Expected end of text, found '$' (at char 0), (line:1, col:1)

gradio error for certain inputs:

Downloading pytorch_model.bin: 100%|██████████| 1.74G/1.74G [00:25<00:00, 67.6MB/s]
/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/deprecation.py:43: UserWarning: You have unused kwarg parameters in Row, please remove them: {'scale': 1}
  warnings.warn(
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
Started GUI
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
WARNING: Special characters in prompt
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/routes.py", line 401, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1305, in process_api
    data = self.postprocess_data(fn_index, result["prediction"], state)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1239, in postprocess_data
    prediction_value = block.postprocess(prediction_value)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/components.py", line 4626, in postprocess
    self._postprocess_chat_messages(message_pair[1]),
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/components.py", line 4599, in _postprocess_chat_messages
    return self.md.renderInline(chat_message)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/markdown_it/main.py", line 299, in renderInline
    return self.renderer.render(self.parseInline(src, env), self.options, env)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/markdown_it/renderer.py", line 87, in render
    result += self.renderInline(token.children, options, env)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/markdown_it/renderer.py", line 108, in renderInline
    result += self.rules[token.type](tokens, i, options, env)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/mdit_py_plugins/dollarmath/index.py", line 70, in render_math_inline
    content = _renderer(str(tokens[idx].content).strip(), {"display_mode": False})
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/utils.py", line 904, in tex2svg
    fig.savefig(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/figure.py", line 3343, in savefig
    self.canvas.print_figure(fname, **kwargs)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/backend_bases.py", line 2342, in print_figure
    self.figure.draw(renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/artist.py", line 95, in draw_wrapper
    result = draw(artist, renderer, *args, **kwargs)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/artist.py", line 72, in draw_wrapper
    return draw(artist, renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/figure.py", line 3140, in draw
    mimage._draw_list_compositing_images(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/image.py", line 131, in _draw_list_compositing_images
    a.draw(renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/artist.py", line 72, in draw_wrapper
    return draw(artist, renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/text.py", line 752, in draw
    bbox, info, descent = self._get_layout(renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/text.py", line 386, in _get_layout
    w, h, d = _get_text_metrics_with_cache(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/text.py", line 97, in _get_text_metrics_with_cache
    return _get_text_metrics_with_cache_impl(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/text.py", line 105, in _get_text_metrics_with_cache_impl
    return renderer_ref().get_text_width_height_descent(text, fontprop, ismath)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/backends/backend_svg.py", line 1317, in get_text_width_height_descent
    return self._text2path.get_text_width_height_descent(s, prop, ismath)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/textpath.py", line 60, in get_text_width_height_descent
    self.mathtext_parser.parse(s, 72, prop)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/mathtext.py", line 226, in parse
    return self._parse_cached(s, dpi, prop)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/mathtext.py", line 247, in _parse_cached
    box = self._parser.parse(s, fontset, fontsize, dpi)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/_mathtext.py", line 1995, in parse
    raise ValueError("\n" + ParseException.explain(err, 0)) from None
ValueError: 
$"{p1.Name} is {p1.Age} years old.");<br>    Console.WriteLine($
^
ParseException: Expected end of text, found '$'  (at char 0), (line:1, col:1)
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
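One possible workaround (a sketch, not something from the issue) is to neutralize `$` in model output before it reaches the Chatbot's markdown renderer, since gradio's dollarmath plugin treats `$...$` as inline LaTeX and hands it to matplotlib's mathtext parser:

def escape_dollars(text: str) -> str:
    # Hypothetical helper: replace '$' with its HTML entity so the dollarmath
    # markdown plugin does not try to parse chat output as inline math.
    # Assumption: the Chatbot component renders the HTML entity back as '$'.
    return text.replace("$", "&#36;")

bot_reply = 'Console.WriteLine($"{p1.Name} is {p1.Age} years old.");'
history = [["show me C# string interpolation", escape_dollars(bot_reply)]]
# gr.Chatbot(value=history)  # gradio wiring elided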

"Unable to locate package nvidia-container-toolkit" on Debian (Ubuntu) x86_64

Hi Team,

Nice work and appreciate your efforts on this project 🫡

I am trying to run the Docker container and I had the following issue when executing the command sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit-base

Hit:1 http://eu-central-1.ec2.archive.ubuntu.com/ubuntu jammy InRelease
Hit:2 http://eu-central-1.ec2.archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:3 http://eu-central-1.ec2.archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:4 https://download.docker.com/linux/ubuntu jammy InRelease
Get:5 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Fetched 110 kB in 1s (195 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
E: Unable to locate package nvidia-container-toolkit-base

And the solution I found was to:

wget https://nvidia.github.io/nvidia-docker/gpgkey --no-check-certificate
sudo apt-key add gpgkey
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

sudo apt-get install -y nvidia-container-toolkit

This fixed the problem, but I still get the following error for the command docker run --runtime=nvidia --shm-size=64g -p 7860:7860 -v ${HOME}/.cache:/root/.cache --rm h2o-llm -it generate.py --base_model=EleutherAI/gpt-neox-20b --lora_weights=h2ogpt_lora_weights --prompt_type=human_bot:

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

Could someone help me with this? I am trying to run the Docker container. I also tried docker compose up, but I get the same error.

raise ValueError("\n" + ParseException.explain(err, 0)) from None

Some non-fatal matplotlib mathtext processing issue seen in the HF demo:

Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/routes.py", line 401, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1305, in process_api
    data = self.postprocess_data(fn_index, result["prediction"], state)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1239, in postprocess_data
    prediction_value = block.postprocess(prediction_value)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/components.py", line 4626, in postprocess
    self._postprocess_chat_messages(message_pair[1]),
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/components.py", line 4599, in _postprocess_chat_messages
    return self.md.renderInline(chat_message)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/markdown_it/main.py", line 299, in renderInline
    return self.renderer.render(self.parseInline(src, env), self.options, env)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/markdown_it/renderer.py", line 87, in render
    result += self.renderInline(token.children, options, env)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/markdown_it/renderer.py", line 108, in renderInline
    result += self.rules[token.type](tokens, i, options, env)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/mdit_py_plugins/dollarmath/index.py", line 70, in render_math_inline
    content = _renderer(str(tokens[idx].content).strip(), {"display_mode": False})
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/utils.py", line 904, in tex2svg
    fig.savefig(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/figure.py", line 3343, in savefig
    self.canvas.print_figure(fname, **kwargs)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/backend_bases.py", line 2342, in print_figure
    self.figure.draw(renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/artist.py", line 95, in draw_wrapper
    result = draw(artist, renderer, *args, **kwargs)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/artist.py", line 72, in draw_wrapper
    return draw(artist, renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/figure.py", line 3140, in draw
    mimage._draw_list_compositing_images(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/image.py", line 131, in _draw_list_compositing_images
    a.draw(renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/artist.py", line 72, in draw_wrapper
    return draw(artist, renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/text.py", line 752, in draw
    bbox, info, descent = self._get_layout(renderer)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/text.py", line 386, in _get_layout
    w, h, d = _get_text_metrics_with_cache(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/text.py", line 97, in _get_text_metrics_with_cache
    return _get_text_metrics_with_cache_impl(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/text.py", line 105, in _get_text_metrics_with_cache_impl
    return renderer_ref().get_text_width_height_descent(text, fontprop, ismath)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/backends/backend_svg.py", line 1317, in get_text_width_height_descent
    return self._text2path.get_text_width_height_descent(s, prop, ismath)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/textpath.py", line 60, in get_text_width_height_descent
    self.mathtext_parser.parse(s, 72, prop)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/mathtext.py", line 226, in parse
    return self._parse_cached(s, dpi, prop)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/mathtext.py", line 247, in _parse_cached
    box = self._parser.parse(s, fontset, fontsize, dpi)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/matplotlib/_mathtext.py", line 1995, in parse
    raise ValueError("\n" + ParseException.explain(err, 0)) from None
ValueError: 
$"{p1.Name} is {p1.Age} years old.");<br>    Console.WriteLine($
^

Ensemble multi-task LORAs

Plan is to develop multiple LoRAs. The point is that the base can be inferenced once, then each new task can be:

1) base + first
2) - first + second
3) - second + third
etc.

So the base is only forwarded once. This is a normal part of the LoRA paper.

A mixture-of-experts idea can then be used, where yet another LoRA is built, but this time it sits in front of all the other LoRA outputs as an ensemble model, so it can handle the diverse tasks. In principle, a lot less data is required for the ensemble LoRA, since it only has to choose which task LoRAs to blend.
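A minimal sketch of swapping task-specific adapters on one base model with PEFT; the adapter repos here are hypothetical placeholders, and this shows only adapter switching, not the proposed ensemble LoRA:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_name = "h2oai/h2ogpt-oig-oasst1-256-20b"  # any causal LM; just an example
tokenizer = AutoTokenizer.from_pretrained(base_name, padding_side="left")
base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16, device_map="auto")

# Register several adapters on the same base; the large base weights load only once.
model = PeftModel.from_pretrained(base, "my-org/lora-task1", adapter_name="task1")  # hypothetical repo
model.load_adapter("my-org/lora-task2", adapter_name="task2")                       # hypothetical repo

model.set_adapter("task1")   # "base + first"
# ... generate for task 1 ...
model.set_adapter("task2")   # "- first + second": only the small LoRA weights switch
# ... generate for task 2 ...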

input_ids are not moved to GPU

I'm running this locally with the downloaded h2oai_pipeline:

import torch
from h2oai_pipeline import H2OTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("h2oai/h2ogpt-oig-oasst1-256-20b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("h2oai/h2ogpt-oig-oasst1-256-20b", torch_dtype=torch.bfloat16, device_map="auto")

generate_text = H2OTextGenerationPipeline(model=model, tokenizer=tokenizer)

res = generate_text("Why is drinking water so healthy?", return_full_text=True, max_new_tokens=100)
print(res[0]["generated_text"])

And while the generation works, I get this Warning:

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py:1359: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.
  warnings.warn(

Question 1: How do I make your custom pipeline move the input_ids to GPU?

Question 2: How do I make your custom pipeline set the pad_token_id to suppress the info log?

Question 3: The response from your custom pipeline is just plain text, no history. How do I build a conversation?

Thanks!
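For Questions 1 and 2, a minimal workaround sketch that bypasses the custom pipeline's device handling by tokenizing manually, moving the tensors to the model's device, and passing pad_token_id explicitly (this is just one approach, not the pipeline's official API):

# tokenizer and model are the objects created in the snippet above.
prompt = "Why is drinking water so healthy?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # with device_map="auto" this is usually the first shard's device

output_ids = model.generate(
    **inputs,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id,  # silences the "Setting pad_token_id..." message
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))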

Adversarial attack on reward models

Question: What do reward models really optimize for? How much assumed context do they have?

E.g. an adversarial attack might include:

  • arbitrary \n after some average number of words
  • a long semi-random sequence of words arranged in paragraphs, i.e. just formatting.

It might still give a high score. If the reward model detects coherence etc., that would be impressive, since it would then have to be as good as an LLM itself.

Reward models might also assume a lot about the nature of the input data, e.g. that it is already human-readable, correct, etc.

How can RLHF prune wrong/hallucinated responses?

Also, humans may be picking up on trivial changes, like formatting, which are easy to train for. E.g.:

  • thesis at front
  • average words per sentence
  • average sentences per paragraph
  • new lines between paragraphs
  • summary at end.

At least the length part is easily chosen from available open data. Summaries can be generated from samsum-type models, and the thesis may not be as important for now.
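A minimal sketch of probing formatting sensitivity, assuming an open reward model such as OpenAssistant/reward-model-deberta-v3-large-v2; the specific model and the newline perturbation are illustrative assumptions, not from this issue:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

reward_name = "OpenAssistant/reward-model-deberta-v3-large-v2"  # assumed reward model
tok = AutoTokenizer.from_pretrained(reward_name)
rm = AutoModelForSequenceClassification.from_pretrained(reward_name)

question = "Why is drinking water so healthy?"
answer = "Water regulates body temperature, transports nutrients, and keeps joints lubricated."
# Formatting-only perturbation: insert a newline after every few words.
words = answer.split()
perturbed = "\n".join(" ".join(words[i:i + 4]) for i in range(0, len(words), 4))

def score(q, a):
    inputs = tok(q, a, return_tensors="pt")
    with torch.no_grad():
        return rm(**inputs).logits[0].item()

print("original score :", score(question, answer))
print("perturbed score:", score(question, perturbed))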

API for LLM

Design an API for applications and composability with the h2o LLM (along the lines of LangChain compatibility); a hedged integration sketch follows below.

  • PR for langchain
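A minimal sketch of exposing an h2oGPT-style transformers pipeline to LangChain through its generic HuggingFacePipeline wrapper; this is a generic LangChain pattern of that era, not the project's own API, and the model name and prompt are illustrative:

import torch
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain

# Wrap a local text-generation pipeline so LangChain chains can call it.
hf_pipe = pipeline(
    "text-generation",
    model="h2oai/h2ogpt-oig-oasst1-256-20b",  # illustrative; any local causal LM works
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_new_tokens=128,
)
llm = HuggingFacePipeline(pipeline=hf_pipe)

prompt = PromptTemplate(input_variables=["question"], template="Q: {question}\nA:")
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run(question="Why is drinking water so healthy?"))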

Have to push Stop twice, once to stop the output and again to stop the actual GPU generation; fix

Tried adding click_event twice in cancel; it didn't help.

Also, while the message stops instantly, generation might continue for 2-3 seconds more, since it is in the middle of hard generation.

Also, it is a bit uncontrolled and hits this ValueError when generation finally stops:

Traceback (most recent call last):
  File "/data/jon/h2o-llm/callbacks.py", line 48, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "/data/jon/h2o-llm/generate.py", line 597, in generate_with_callback
    model.generate(**kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/peft/peft_model.py", line 581, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/generation/utils.py", line 1406, in generate
    return self.greedy_search(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/generation/utils.py", line 2256, in greedy_search
    if unfinished_sequences.max() == 0 or stopping_criteria(input_ids, scores):
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/generation/stopping_criteria.py", line 113, in __call__
    return any(criteria(input_ids, scores) for criteria in self)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/generation/stopping_criteria.py", line 113, in <genexpr>
    return any(criteria(input_ids, scores) for criteria in self)
  File "/data/jon/h2o-llm/callbacks.py", line 22, in __call__
    self.callback_func(input_ids[0])
  File "/data/jon/h2o-llm/callbacks.py", line 43, in _callback
    raise ValueError
ValueError
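The traceback shows generation being interrupted by raising ValueError from inside the callback. A minimal alternative sketch is a flag-based stopping criterion that ends generation cleanly when the UI Stop handler sets an event; the names here are hypothetical, not the repo's callbacks.py API:

import threading
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnEvent(StoppingCriteria):
    """Return True (stop generation) once an external threading.Event is set."""
    def __init__(self, stop_event: threading.Event):
        self.stop_event = stop_event

    def __call__(self, input_ids, scores, **kwargs) -> bool:
        return self.stop_event.is_set()

stop_event = threading.Event()
stopping_criteria = StoppingCriteriaList([StopOnEvent(stop_event)])

# model.generate(..., stopping_criteria=stopping_criteria)  # in the generation thread
# stop_event.set()                                          # in the UI "Stop" click handler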


gradio matplotlib issue then Tcl_AsyncDelete: async handler deleted by the wrong thread

/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/gradio/utils.py:901: UserWarning: Starting a Matplotlib GUI outside of the main thread will likely fail.
  fig = plt.figure(figsize=(0.01, 0.01))
Exception ignored in: <function Image.__del__ at 0x7f17e015f2e0>
Traceback (most recent call last):
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/tkinter/__init__.py", line 4056, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f17e0107b50>
Traceback (most recent call last):
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/tkinter/__init__.py", line 388, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f17e0107b50>
Traceback (most recent call last):
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/tkinter/__init__.py", line 388, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f17e0107b50>
Traceback (most recent call last):
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/tkinter/__init__.py", line 388, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f17e0107b50>
Traceback (most recent call last):
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/tkinter/__init__.py", line 388, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Tcl_AsyncDelete: async handler deleted by the wrong thread
Aborted (core dumped)
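Not from the log, just a common matplotlib workaround: forcing the non-interactive Agg backend before pyplot is imported avoids creating Tk objects outside the main thread, which is what triggers the Tcl_AsyncDelete crash:

# Force a thread-safe, non-interactive backend before anything imports pyplot.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(0.01, 0.01))  # now safe to create in a worker thread
plt.close(fig)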

Increase in GPU memory usage as generation continues, imbalanced across GPUs

>>> import torch
>>> from transformers import pipeline
>>> from transformers import pipeline
>>> generate_text = pipeline(model="h2oai/h2ogpt-oasst1-512-20b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
>>> res = generate_text("Why is drinking water so healthy?", max_new_tokens=3000)
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.

During this long generation, GPU memory usage starts out balanced, then becomes increasingly imbalanced:

Thu Apr 20 16:37:04 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000                On | 00000000:3B:00.0 Off |                  Off |
|  0%   45C    P2              105W / 250W|  12220MiB / 49140MiB |     33%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000                On | 00000000:5E:00.0 Off |                  Off |
|  0%   45C    P2               72W / 250W|  11744MiB / 49140MiB |     17%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000                On | 00000000:86:00.0 Off |                  Off |
|  0%   45C    P2               98W / 250W|  11744MiB / 49140MiB |     19%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000                On | 00000000:AF:00.0 Off |                  Off |
|  0%   45C    P2              103W / 250W|  11125MiB / 49140MiB |     23%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000                On | 00000000:3B:00.0 Off |                  Off |
|  0%   50C    P2               95W / 250W|  40566MiB / 49140MiB |     73%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000                On | 00000000:5E:00.0 Off |                  Off |
|  0%   48C    P2               76W / 250W|  15926MiB / 49140MiB |     36%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000                On | 00000000:86:00.0 Off |                  Off |
|  0%   48C    P2               87W / 250W|  15926MiB / 49140MiB |     10%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000                On | 00000000:AF:00.0 Off |                  Off |
|  0%   49C    P2              130W / 250W|  14682MiB / 49140MiB |     21%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

but memory usage can still go back down by a lot during generation:

Thu Apr 20 16:47:17 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000                On | 00000000:3B:00.0 Off |                  Off |
|  0%   50C    P2               95W / 250W|  18334MiB / 49140MiB |     75%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000                On | 00000000:5E:00.0 Off |                  Off |
|  0%   49C    P2               74W / 250W|  17642MiB / 49140MiB |      8%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000                On | 00000000:86:00.0 Off |                  Off |
|  0%   50C    P2              117W / 250W|  17642MiB / 49140MiB |      6%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000                On | 00000000:AF:00.0 Off |                  Off |
|  0%   49C    P2              115W / 250W|  16139MiB / 49140MiB |     16%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

It also eventually fails:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 209, in __call__
    return super().__call__(text_inputs, **kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1109, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postproces
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1116, in run_single
    model_outputs = self.forward(model_inputs, **forward_params)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1015, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 251, in _forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=att
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/generation/utils.py", line 1437, in generate
    return self.greedy_search(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/generation/utils.py", line 2248, in greedy_search
    outputs = self(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 662, in forward
    outputs = self.gpt_neox(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 553, in forward
    outputs = layer(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 320, in forward
    attention_layer_outputs = self.attention(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 152, in forward
    attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_m
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 219, in _attn
    attn_scores = torch.where(causal_mask, attn_scores, mask_value)
RuntimeError: The size of tensor a (2048) must match the size of tensor b (2049) at non-singleton dimension 3
>>> 
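The final RuntimeError (2048 vs 2049 at dimension 3) is consistent with generation running past the model's maximum context length, since max_new_tokens=3000 can push the sequence beyond the 2048-token window of GPT-NeoX-based models. A minimal, hedged sketch of capping new tokens to the window:

import torch
from transformers import pipeline

generate_text = pipeline(model="h2oai/h2ogpt-oasst1-512-20b", torch_dtype=torch.bfloat16,
                         trust_remote_code=True, device_map="auto")

prompt = "Why is drinking water so healthy?"
prompt_len = len(generate_text.tokenizer(prompt)["input_ids"])
# Assumption: the config exposes max_position_embeddings (true for GPT-NeoX-style models).
context_len = generate_text.model.config.max_position_embeddings
max_new = max(1, context_len - prompt_len)

res = generate_text(prompt, max_new_tokens=min(3000, max_new))
print(res[0]["generated_text"])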

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces
Traceback (most recent call last):
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/gradio/routes.py", line 393, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/gradio/blocks.py", line 1059, in process_api
    result = await self.call_function(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/gradio/blocks.py", line 868, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/data/jon/h2o-llm/generate.py", line 132, in evaluate
    outputs = model.generate(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/peft/peft_model.py", line 581, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/generation/utils.py", line 1528, in generate
    return self.beam_sample(
  File "/home/jon/miniconda3/envs/alpaca/lib/python3.10/site-packages/transformers/generation/utils.py", line 3126, in beam_sample
    next_tokens = torch.multinomial(probs, num_samples=2 * num_beams)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
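
If the underlying problem is inf/nan creeping into the fp16 logits, one mitigation sometimes tried (a hedged sketch, not the repo's generate.py; whether it fixes the root cause here is an assumption) is to strip invalid values from the logits and renormalize before sampling:

# Hedged sketch: remove inf/nan logits and renormalize before sampling.
# `inputs` is assumed to be the tokenized prompt dict.
from transformers import InfNanRemoveLogitsProcessor, LogitsProcessorList

outputs = model.generate(
    **inputs,
    do_sample=True,
    num_beams=1,      # plain sampling; beam_sample above hit the same error
    logits_processor=LogitsProcessorList([InfNanRemoveLogitsProcessor()]),
    renormalize_logits=True,   # renormalize after all processors/warpers run
    max_new_tokens=256,
)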


"recipe refers to # Recipe type## Recipes override any GUI settings- **'auto'**: all models and features automatically determined by experiment settings, toml settings, and feature_engineering_effort- **'compliant'** : like 'auto' except:    - *interpretability=10* (to avoid complexity, overrides GUI or python client chose for interpretability)    - *enable_glm='on'* (rest 'off', to avoid complexity and be compatible with algorithms supported by MLI)    - *fixed_ensemble_level=0*: Don't use any ensemble    - *feature_brain_level=0*(: No feature brain used (to ensure every restart is identical)    - *max_feature_interaction_depth=1*: interaction depth is set to 1 (no multi-feature interactions to avoid complexity)    - *target_transformer='identity'*: for regression (to avoid complexity)    - *check_distribution_shift_drop='off'*: Don't use distribution shift between train, valid, and test to drop features (bit risky without fine-tuning)- **'monotonic_gbm'** : like 'auto' except:    - *monotonicity_constraints_interpretability_switch=1*: enable monotonicity constraints    - *self.config.monotonicity_constraints_correlation_threshold = 0.01*: see below    - *monotonicity_constraints_drop_low_correlation_features=true*: drop features that aren't correlated with target by at least 0.01 (specified by parameter above)    - *fixed_ensemble_level=0*: Don't use any ensemble (to avoid complexity)    - *included_models=['LightGBMModel']*    - *included_transformers=['OriginalTransformer']*: only original (numeric) features will be used    - *feature_brain_level=0*: No feature brain used (to ensure every restart is identical)    - *monotonicity_constraints_log_level='high'*    - *autodoc_pd_max_runtime=-1*: no timeout for PDP creation in AutoDoc- **'kaggle'** : like 'auto' except:    - external validation set is concatenated with train set, with target marked as missing    - test set is concatenated with train set, with target marked as missing    - transformers that do not use the target are allowed to fit_transform across entire train + validation + test    - several config toml expert options open-up limits (e.g. more numerics are treated as categoricals)    - Note: If plentiful memory, can:        - choose kaggle mode and then change fixed_feature_interaction_depth to large negative number,    otherwise default number of features given to transformer is limited to 50 by default        - choose mutation_mode = \"full\", so even more types are transformations are done at once per transformer- **'nlp_model'**: Only enables NLP models that process pure text- **'nlp_transformer'**: Only enables NLP transformers that process pure text, while any model type is allowed- **'image_model'**: Only enables Image models that process pure images- **'image_transformer'**: Only enables Image transformers that process pure images, while any model type is allowed- **'unsupervised'**: Only enables unsupervised transformers, models and scorers- **'gpus_max'**: Maximize use of GPUs (e.g. use XGBoost, rapids, Optuna hyperparameter search, etc.)- **'more_overfit_protection'**: Potentially improve overfit, esp. for small data, by disabling target encoding and making GA behave like final model for tree counts and learning rate- **'feature_store_mojo'**: Creates a MOJO to be used as transformer in the H2O Feature Store, to augment data on a row-by-row level based on Driverless AI's feature engineering. 
Only includes transformers that don't depend on the target, since features like target encoding need to be created at model fitting time to avoid data leakage. And features like lags need to be created from the raw data, they can't be computed with a row-by-row MOJO transformer.Each pipeline building recipe mode can be chosen, and then fine-tuned using each expert settings.  Changing thepipeline building recipe will reset all pipeline building recipe options back to default and then re-apply thespecific rules for the new mode, which will undo any fine-tuning of expert options that are part of pipeline buildingrecipe rules.If choose to do new/continued/refitted/retrained experiment from parent experiment, the recipe rules are not re-appliedand any fine-tuning is preserved.  To reset recipe behavior, one can switch between 'auto' and the desired mode.  Thisway the new child experiment will use the default settings for the chosen recipe." Summarize the above into a single paragraph.

Add option to replace attention with flash attention

Flash attention has already been integrated into gpt-neox models here: https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/models/gpt.py#L215

Can add the swapped model definition as an option to the training and generation scripts and benchmark the speed difference.

Converting LLaMA and others might be more work. It uses fairly standard-looking attention, but it's not clear how it differs from the PyTorch default; we might just need to remap some layer names: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L160
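
As a rough sketch of the kind of swap involved (not the HazyResearch integration linked above), GPT-NeoX attention could be monkey-patched to use PyTorch 2.x scaled_dot_product_attention, which can dispatch to a FlashAttention kernel on supported GPUs. The _attn signature below is an assumption about the installed transformers version, and the patch ignores padding masks, head_mask, and cached decoding, so it's only suitable for quick prefill-speed benchmarks:

# Illustrative monkey-patch only (assumes transformers' GPTNeoXAttention._attn
# signature and PyTorch >= 2.0); not valid for padded batches or KV-cache decoding.
import torch.nn.functional as F
from transformers.models.gpt_neox.modeling_gpt_neox import GPTNeoXAttention

def _sdpa_attn(self, query, key, value, attention_mask=None, head_mask=None):
    # query/key/value: [batch, num_heads, seq_len, head_size]
    attn_output = F.scaled_dot_product_attention(
        query, key, value,
        dropout_p=0.0,
        is_causal=True,        # full-sequence causal attention, no padding mask
    )
    return attn_output, None   # attention weights are not materialized

GPTNeoXAttention._attn = _sdpa_attn  # apply before running the model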

chatbot: starlette.websockets.WebSocketDisconnect: 1001

Task exception was never retrieved
future: <Task finished name='xsce894h9ta_5' coro=<Queue.process_events() done, defined at /home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/queueing.py:343> exception=WebSocketDisconnect(1001)>
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/queueing.py", line 347, in process_events
    client_awake = await self.gather_event_data(event)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/queueing.py", line 220, in gather_event_data
    data, client_awake = await self.get_message(event, timeout=receive_timeout)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/queueing.py", line 453, in get_message
    data = await asyncio.wait_for(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/asyncio/tasks.py", line 494, in wait_for
    return fut.result()
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/starlette/websockets.py", line 133, in receive_json
    self._raise_on_disconnect(message)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/starlette/websockets.py", line 105, in _raise_on_disconnect
    raise WebSocketDisconnect(message["code"])
starlette.websockets.WebSocketDisconnect: 1001

Cannot train 'EleutherAI/gpt-neox-20b' on 2x 24GB cards

Need to step up to larger models with a permissive license. 30B LLaMA works but can't be used (license). 6B is too small and gives bad results. So the next best choice is gpt-neox-20b.

this works:
CUDA_VISIBLE_DEVICES=0,1 WORLD_SIZE=2 python finetune.py --data_path=alpaca_data_cleaned.json --base_model="decapoda-research/llama-30b-hf" --llama_type=True --ddp=False

this fails:
CUDA_VISIBLE_DEVICES=0,1 WORLD_SIZE=2 torchrun finetune.py --data_path=alpaca_data_cleaned.json --base_model="decapoda-research/llama-30b-hf" --llama_type=True --ddp=False

this fails (with torchrun, and also when launched with plain python):
CUDA_VISIBLE_DEVICES=0,1 WORLD_SIZE=2 torchrun finetune.py --data_path=alpaca_data_cleaned.json --llama_type=False --ddp=False --lora_target_modules="['query_key_value']" --base_model="EleutherAI/gpt-neox-20b"
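
One option that might let the 20B model fit on 2x 24GB (a hedged sketch, not the repo's finetune.py; it uses naive layer sharding via device_map rather than DDP/torchrun, and exact behavior depends on the installed transformers/peft/bitsandbytes versions) is 8-bit loading plus a LoRA adapter on the fused query_key_value projection:

# Hedged sketch: int8 weights sharded across both GPUs, LoRA-only training.
# Newer peft renames prepare_model_for_int8_training to prepare_model_for_kbit_training.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

base_model = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,    # int8 weights via bitsandbytes (roughly 1 byte per parameter)
    device_map="auto",    # shard layers across GPU 0 and GPU 1
)
model = prepare_model_for_int8_training(model)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # GPT-NeoX fused QKV projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()       # only the LoRA weights should be trainable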

Recover when GPU OOMs

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 22.20 GiB total capacity; 20.67 GiB already allocated; 4.12 MiB free; 21.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

This brings the app down and it can no longer generate. Protect against GPU OOM, or at least recover without hanging.
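
A minimal sketch of one way to do that (the wrapper name and arguments are hypothetical, not the repo's API): catch the OOM around generate, free the CUDA cache, and let the UI report an error instead of crashing the worker:

# Hedged sketch: keep the app alive after a CUDA OOM during generation.
import gc
import torch

def generate_response(model, **gen_kwargs):
    try:
        with torch.no_grad():
            return model.generate(**gen_kwargs)
    except torch.cuda.OutOfMemoryError as e:
        gc.collect()
        torch.cuda.empty_cache()  # release cached blocks so later, smaller requests can run
        print(f"Generation failed with CUDA OOM: {e}")
        return None  # caller surfaces "out of memory, try a shorter prompt" to the UI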
