https://github.com/google/BIG-bench
from h2ogpt.
https://github.com/EleutherAI/lm-evaluation-harness has out-of-the-box Hugging Face connectors and seems easiest to use.
Some of these tasks (e.g. arc_easy, arc_challenge, piqa) are also shown in the tweet above.
https://arxiv.org/pdf/2303.17564v1.pdf
CUDA_VISIBLE_DEVICES=0 torchrun main.py --model hf-causal --model_args pretrained=h2oai/h2ogpt-oig-oasst1-512-6.9b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oig-oasst1-512-6.9b.eval.log
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| boolq | 1 | acc | 0.6266 | ± | 0.0085 |
| arc_challenge | 0 | acc | 0.3225 | ± | 0.0137 |
| | | acc_norm | 0.3396 | ± | 0.0138 |
| openbookqa | 0 | acc | 0.2660 | ± | 0.0198 |
| | | acc_norm | 0.3660 | ± | 0.0216 |
| arc_easy | 0 | acc | 0.6776 | ± | 0.0096 |
| | | acc_norm | 0.6195 | ± | 0.0100 |
| hellaswag | 0 | acc | 0.4822 | ± | 0.0050 |
| | | acc_norm | 0.6465 | ± | 0.0048 |
| winogrande | 0 | acc | 0.6219 | ± | 0.0136 |
| piqa | 0 | acc | 0.7530 | ± | 0.0101 |
| | | acc_norm | 0.7606 | ± | 0.0100 |
h2ogpt-oig-oasst1-512-6.9b.eval.log
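For downstream comparisons, the table printed into these `.eval.log` files can be parsed with a short script. This is a sketch assuming the pipe-separated layout shown above (with blank task/version cells on `acc_norm` continuation rows); the exact log layout may vary by harness version.

```python
import re

def parse_eval_table(log_text):
    """Parse 'task | version | metric | value | ± | stderr' rows from an
    lm-evaluation-harness log into {task: {metric: (value, stderr)}}."""
    results = {}
    current_task = None
    row = re.compile(
        r"^\s*(\w*)\s*\|\s*(\d*)\s*\|\s*(\w+)\s*\|\s*([\d.]+)\s*\|\s*±\s*\|\s*([\d.]+)"
    )
    for line in log_text.splitlines():
        m = row.match(line)
        if not m:
            continue  # skips headers, separators, and non-table log lines
        task, _version, metric, value, stderr = m.groups()
        if task:  # continuation rows (acc_norm) leave the task cell blank
            current_task = task
        results.setdefault(current_task, {})[metric] = (float(value), float(stderr))
    return results

sample = """\
boolq | 1 | acc | 0.6266 | ± | 0.0085
arc_challenge | 0 | acc | 0.3225 | ± | 0.0137
 | | acc_norm | 0.3396 | ± | 0.0138
"""
print(parse_eval_table(sample))
```

With the parsed dict in hand, it is easy to tabulate several runs side by side or compute aggregate scores.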
CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental --model_args pretrained=h2oai/h2ogpt-oasst1-512-12b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oasst1-512-12b.eval.log
CUDA_VISIBLE_DEVICES=1 python main.py --model hf-causal-experimental --model_args pretrained=h2oai/h2ogpt-oasst1-512-20b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oasst1-512-20b.eval.log
h2ogpt-oasst1-512-12b.eval.log
hf-causal-experimental (pretrained=h2oai/h2ogpt-oasst1-512-12b), limit: None, provide_description: False, num_fewshot: 0, batch_size: None
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| arc_easy | 0 | acc | 0.6932 | ± | 0.0095 |
| | | acc_norm | 0.6225 | ± | 0.0099 |
| openbookqa | 0 | acc | 0.2900 | ± | 0.0203 |
| | | acc_norm | 0.3740 | ± | 0.0217 |
| winogrande | 0 | acc | 0.6369 | ± | 0.0135 |
| hellaswag | 0 | acc | 0.5140 | ± | 0.0050 |
| | | acc_norm | 0.6803 | ± | 0.0047 |
| piqa | 0 | acc | 0.7682 | ± | 0.0098 |
| | | acc_norm | 0.7661 | ± | 0.0099 |
| boolq | 1 | acc | 0.6685 | ± | 0.0082 |
| arc_challenge | 0 | acc | 0.3157 | ± | 0.0136 |
| | | acc_norm | 0.3507 | ± | 0.0139 |
h2ogpt-oasst1-512-20b.eval.log
hf-causal-experimental (pretrained=h2oai/h2ogpt-oasst1-512-20b), limit: None, provide_description: False, num_fewshot: 0, batch_size: None
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| hellaswag | 0 | acc | 0.5419 | ± | 0.0050 |
| | | acc_norm | 0.7259 | ± | 0.0045 |
| boolq | 1 | acc | 0.7125 | ± | 0.0079 |
| piqa | 0 | acc | 0.7742 | ± | 0.0098 |
| | | acc_norm | 0.7775 | ± | 0.0097 |
| openbookqa | 0 | acc | 0.2800 | ± | 0.0201 |
| | | acc_norm | 0.4000 | ± | 0.0219 |
| arc_challenge | 0 | acc | 0.3993 | ± | 0.0143 |
| | | acc_norm | 0.4420 | ± | 0.0145 |
| winogrande | 0 | acc | 0.6614 | ± | 0.0133 |
| arc_easy | 0 | acc | 0.7327 | ± | 0.0091 |
| | | acc_norm | 0.6894 | ± | 0.0095 |
https://huggingface.co/databricks/dolly-v2-12b
| model | openbookqa | arc_easy | winogrande | hellaswag | arc_challenge | piqa | boolq | gmean |
|---|---|---|---|---|---|---|---|---|
| EleutherAI/pythia-2.8b | 0.348 | 0.585859 | 0.589582 | 0.591217 | 0.323379 | 0.73395 | 0.638226 | 0.523431 |
| EleutherAI/pythia-6.9b | 0.368 | 0.604798 | 0.608524 | 0.631548 | 0.343857 | 0.761153 | 0.6263 | 0.543567 |
| databricks/dolly-v2-3b | 0.384 | 0.611532 | 0.589582 | 0.650767 | 0.370307 | 0.742655 | 0.575535 | 0.544886 |
| EleutherAI/pythia-12b | 0.364 | 0.627104 | 0.636148 | 0.668094 | 0.346416 | 0.760065 | 0.673394 | 0.559676 |
| EleutherAI/gpt-j-6B | 0.382 | 0.621633 | 0.651144 | 0.662617 | 0.363481 | 0.761153 | 0.655963 | 0.565936 |
| databricks/dolly-v2-12b | 0.408 | 0.63931 | 0.616417 | 0.707927 | 0.388225 | 0.757889 | 0.568196 | 0.56781 |
| databricks/dolly-v2-7b | 0.392 | 0.633838 | 0.607735 | 0.686517 | 0.406997 | 0.750816 | 0.644037 | 0.573487 |
| databricks/dolly-v1-6b | 0.41 | 0.62963 | 0.643252 | 0.676758 | 0.384812 | 0.773667 | 0.687768 | 0.583431 |
| EleutherAI/gpt-neox-20b | 0.402 | 0.683923 | 0.656669 | 0.7142 | 0.408703 | 0.784004 | 0.695413 | 0.602236 |
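The gmean column above is the geometric mean of the seven per-task accuracies. A quick sketch to verify one row of the table:

```python
import math

def gmean(values):
    """Geometric mean: exponential of the mean of the logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Per-task accuracies for EleutherAI/gpt-neox-20b from the table above.
accs = [0.402, 0.683923, 0.656669, 0.7142, 0.408703, 0.784004, 0.695413]
print(round(gmean(accs), 6))  # ≈ 0.602236, matching the gmean column
```

Geometric mean penalizes a model that is weak on any single task more than an arithmetic mean would, which is presumably why it was chosen for the aggregate column.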
https://static.nomic.ai/gpt4all/2023_GPT4All-J_Technical_Report_2.pdf
What the tasks look like:
tasks.zip
created with
python scripts/write_out.py --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --num_fewshot 5 --num_examples 10 --output_base_path tasks
Let's check whether Dolly v2 12B reproduces their reported numbers with the same command:
CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental --model_args pretrained=databricks/dolly-v2-12b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> dolly-v2-12b.eval.log
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| winogrande | 0 | acc | 0.6298 | ± | 0.0136 |
| arc_easy | 0 | acc | 0.6713 | ± | 0.0096 |
| | | acc_norm | 0.6380 | ± | 0.0099 |
| hellaswag | 0 | acc | 0.5420 | ± | 0.0050 |
| | | acc_norm | 0.7109 | ± | 0.0045 |
| piqa | 0 | acc | 0.7399 | ± | 0.0102 |
| | | acc_norm | 0.7541 | ± | 0.0100 |
| arc_challenge | 0 | acc | 0.3618 | ± | 0.0140 |
| | | acc_norm | 0.3823 | ± | 0.0142 |
| openbookqa | 0 | acc | 0.2980 | ± | 0.0205 |
| | | acc_norm | 0.4060 | ± | 0.0220 |
| boolq | 1 | acc | 0.5624 | ± | 0.0087 |
Yes, consistent with their reported numbers, so the command itself seems reasonable.
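A rough way to make "consistent" concrete is to check that each measured value lies within a couple of standard errors of the corresponding reported value. A sketch; the `reported` dict below is illustrative, not the actual Databricks numbers:

```python
def consistent(measured, reported, n_sigma=2.0):
    """True if every measured (value, stderr) pair lies within n_sigma
    standard errors of the corresponding reported value."""
    return all(
        abs(value - reported[task]) <= n_sigma * stderr
        for task, (value, stderr) in measured.items()
    )

# Measured values from the run above; reported values here are placeholders
# standing in for the numbers published for dolly-v2-12b.
measured = {"winogrande": (0.6298, 0.0136), "boolq": (0.5624, 0.0087)}
reported = {"winogrande": 0.64, "boolq": 0.57}
print(consistent(measured, reported))  # True: both within 2 standard errors
```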
https://arxiv.org/pdf/2302.13971.pdf
The undertrained older models do indeed perform worse:
CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental --model_args pretrained=h2oai/h2ogpt-oig-oasst1-256-20b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oig-oasst1-256-20b.eval.log
h2ogpt-oig-oasst1-256-20b.eval.log
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| hellaswag | 0 | acc | 0.5320 | ± | 0.0050 |
| | | acc_norm | 0.7115 | ± | 0.0045 |
| arc_easy | 0 | acc | 0.7138 | ± | 0.0093 |
| | | acc_norm | 0.6869 | ± | 0.0095 |
| boolq | 1 | acc | 0.6878 | ± | 0.0081 |
| piqa | 0 | acc | 0.7742 | ± | 0.0098 |
| | | acc_norm | 0.7786 | ± | 0.0097 |
| openbookqa | 0 | acc | 0.2760 | ± | 0.0200 |
| | | acc_norm | 0.3760 | ± | 0.0217 |
| arc_challenge | 0 | acc | 0.3686 | ± | 0.0141 |
| | | acc_norm | 0.3942 | ± | 0.0143 |
| winogrande | 0 | acc | 0.6630 | ± | 0.0133 |
CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental --model_args pretrained=h2oai/h2ogpt-oig-oasst1-256-12b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oig-oasst1-256-12b.eval.log
h2ogpt-oig-oasst1-256-12b.eval.log
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| hellaswag | 0 | acc | 0.5189 | ± | 0.0050 |
| | | acc_norm | 0.6930 | ± | 0.0046 |
| arc_challenge | 0 | acc | 0.3276 | ± | 0.0137 |
| | | acc_norm | 0.3797 | ± | 0.0142 |
| piqa | 0 | acc | 0.7628 | ± | 0.0099 |
| | | acc_norm | 0.7720 | ± | 0.0098 |
| winogrande | 0 | acc | 0.6527 | ± | 0.0134 |
| boolq | 1 | acc | 0.6602 | ± | 0.0083 |
| arc_easy | 0 | acc | 0.6999 | ± | 0.0094 |
| | | acc_norm | 0.6629 | ± | 0.0097 |
| openbookqa | 0 | acc | 0.2940 | ± | 0.0204 |
| | | acc_norm | 0.4020 | ± | 0.0219 |
Local changes applied to https://github.com/EleutherAI/lm-evaluation-harness at commit 4b701e228768052cfae9043dca13e82052ca5eea:
diff --git a/lm_eval/models/huggingface.py b/lm_eval/models/huggingface.py
index 4d3aa24..34b6967 100644
--- a/lm_eval/models/huggingface.py
+++ b/lm_eval/models/huggingface.py
@@ -76,10 +76,10 @@ class HuggingFaceAutoLM(BaseLM):
subfolder: Optional[str] = None,
revision: Optional[str] = "main",
batch_size: Optional[Union[int, str]] = 1,
- max_gen_toks: Optional[int] = 256,
+ max_gen_toks: Optional[int] = 512,
max_length: Optional[int] = None,
add_special_tokens: Optional[bool] = None,
- use_accelerate: Optional[bool] = False,
+ use_accelerate: Optional[bool] = True,
device_map_option: Optional[str] = "auto",
max_memory_per_gpu: Optional[Union[int, str]] = None,
max_cpu_memory: Optional[Union[int, str]] = None,
@@ -87,9 +87,9 @@ class HuggingFaceAutoLM(BaseLM):
dtype: Optional[Union[str, torch.dtype]] = None,
device: Optional[Union[int, str]] = "cuda",
peft: str = None,
- load_in_8bit: Optional[bool] = False,
+ load_in_8bit: Optional[bool] = True,
load_in_4bit: Optional[bool] = False,
- trust_remote_code: Optional[bool] = False,
+ trust_remote_code: Optional[bool] = True,
gptq_use_triton: Optional[bool] = False,
):
"""Initializes a HuggingFace `AutoModel` and `AutoTokenizer` for evaluation.
CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental --model_args pretrained=h2oai/h2ogpt-oig-oasst1-falcon-40b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oig-oasst1-falcon-40b.eval.log
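Instead of patching the defaults in `huggingface.py`, the same constructor keyword arguments shown in the diff (`use_accelerate`, `load_in_8bit`, `trust_remote_code`) can typically be passed per-run as `key=value` pairs in `--model_args`. A sketch; flag support varies by harness commit, so check the version you have checked out:

```shell
# Build the model_args string rather than editing huggingface.py defaults.
MODEL_ARGS="pretrained=h2oai/h2ogpt-oig-oasst1-falcon-40b"
MODEL_ARGS="$MODEL_ARGS,use_accelerate=True,load_in_8bit=True,trust_remote_code=True"

# Echo the full command for inspection instead of running it here.
echo python main.py --model hf-causal-experimental \
  --model_args "$MODEL_ARGS" \
  --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq \
  --device cuda
```

This keeps the harness checkout clean and makes each run's settings visible in the command line recorded at the top of the log.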