https://github.com/google/BIG-bench
from h2ogpt.
https://github.com/EleutherAI/lm-evaluation-harness has out-of-the-box Hugging Face connectors and seems easiest to use.
Some of these tasks (e.g. arc_easy, arc_challenge, piqa) are also shown in the tweet above.
https://arxiv.org/pdf/2303.17564v1.pdf
CUDA_VISIBLE_DEVICES=0 torchrun main.py --model hf-causal --model_args pretrained=h2oai/h2ogpt-oig-oasst1-512-6.9b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oig-oasst1-512-6.9b.eval.log
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| boolq | 1 | acc | 0.6266 | ± | 0.0085 |
| arc_challenge | 0 | acc | 0.3225 | ± | 0.0137 |
| | | acc_norm | 0.3396 | ± | 0.0138 |
| openbookqa | 0 | acc | 0.2660 | ± | 0.0198 |
| | | acc_norm | 0.3660 | ± | 0.0216 |
| arc_easy | 0 | acc | 0.6776 | ± | 0.0096 |
| | | acc_norm | 0.6195 | ± | 0.0100 |
| hellaswag | 0 | acc | 0.4822 | ± | 0.0050 |
| | | acc_norm | 0.6465 | ± | 0.0048 |
| winogrande | 0 | acc | 0.6219 | ± | 0.0136 |
| piqa | 0 | acc | 0.7530 | ± | 0.0101 |
| | | acc_norm | 0.7606 | ± | 0.0100 |
h2ogpt-oig-oasst1-512-6.9b.eval.log
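For downstream comparisons, the table printed into these `.eval.log` files can be parsed with a short script. This is a sketch assuming the pipe-separated layout shown above (with blank task/version cells on `acc_norm` continuation rows); the exact log layout may vary by harness version.

```python
import re

def parse_eval_table(log_text):
    """Parse 'task | version | metric | value | ± | stderr' rows from an
    lm-evaluation-harness log into {task: {metric: (value, stderr)}}."""
    results = {}
    current_task = None
    row = re.compile(
        r"^\s*(\w*)\s*\|\s*(\d*)\s*\|\s*(\w+)\s*\|\s*([\d.]+)\s*\|\s*±\s*\|\s*([\d.]+)"
    )
    for line in log_text.splitlines():
        m = row.match(line)
        if not m:
            continue  # skips headers, separators, and non-table log lines
        task, _version, metric, value, stderr = m.groups()
        if task:  # continuation rows (acc_norm) leave the task cell blank
            current_task = task
        results.setdefault(current_task, {})[metric] = (float(value), float(stderr))
    return results

sample = """\
boolq | 1 | acc | 0.6266 | ± | 0.0085
arc_challenge | 0 | acc | 0.3225 | ± | 0.0137
 | | acc_norm | 0.3396 | ± | 0.0138
"""
print(parse_eval_table(sample))
```

With the parsed dict in hand, it is easy to tabulate several runs side by side or compute aggregate scores.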
CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental --model_args pretrained=h2oai/h2ogpt-oasst1-512-12b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oasst1-512-12b.eval.log
CUDA_VISIBLE_DEVICES=1 python main.py --model hf-causal-experimental --model_args pretrained=h2oai/h2ogpt-oasst1-512-20b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oasst1-512-20b.eval.log
h2ogpt-oasst1-512-12b.eval.log
hf-causal-experimental (pretrained=h2oai/h2ogpt-oasst1-512-12b), limit: None, provide_description: False, num_fewshot: 0, batch_size: None
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| arc_easy | 0 | acc | 0.6932 | ± | 0.0095 |
| | | acc_norm | 0.6225 | ± | 0.0099 |
| openbookqa | 0 | acc | 0.2900 | ± | 0.0203 |
| | | acc_norm | 0.3740 | ± | 0.0217 |
| winogrande | 0 | acc | 0.6369 | ± | 0.0135 |
| hellaswag | 0 | acc | 0.5140 | ± | 0.0050 |
| | | acc_norm | 0.6803 | ± | 0.0047 |
| piqa | 0 | acc | 0.7682 | ± | 0.0098 |
| | | acc_norm | 0.7661 | ± | 0.0099 |
| boolq | 1 | acc | 0.6685 | ± | 0.0082 |
| arc_challenge | 0 | acc | 0.3157 | ± | 0.0136 |
| | | acc_norm | 0.3507 | ± | 0.0139 |
h2ogpt-oasst1-512-20b.eval.log
hf-causal-experimental (pretrained=h2oai/h2ogpt-oasst1-512-20b), limit: None, provide_description: False, num_fewshot: 0, batch_size: None
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| hellaswag | 0 | acc | 0.5419 | ± | 0.0050 |
| | | acc_norm | 0.7259 | ± | 0.0045 |
| boolq | 1 | acc | 0.7125 | ± | 0.0079 |
| piqa | 0 | acc | 0.7742 | ± | 0.0098 |
| | | acc_norm | 0.7775 | ± | 0.0097 |
| openbookqa | 0 | acc | 0.2800 | ± | 0.0201 |
| | | acc_norm | 0.4000 | ± | 0.0219 |
| arc_challenge | 0 | acc | 0.3993 | ± | 0.0143 |
| | | acc_norm | 0.4420 | ± | 0.0145 |
| winogrande | 0 | acc | 0.6614 | ± | 0.0133 |
| arc_easy | 0 | acc | 0.7327 | ± | 0.0091 |
| | | acc_norm | 0.6894 | ± | 0.0095 |
https://huggingface.co/databricks/dolly-v2-12b
| model | openbookqa | arc_easy | winogrande | hellaswag | arc_challenge | piqa | boolq | gmean |
|---|---|---|---|---|---|---|---|---|
| EleutherAI/pythia-2.8b | 0.348 | 0.585859 | 0.589582 | 0.591217 | 0.323379 | 0.73395 | 0.638226 | 0.523431 |
| EleutherAI/pythia-6.9b | 0.368 | 0.604798 | 0.608524 | 0.631548 | 0.343857 | 0.761153 | 0.6263 | 0.543567 |
| databricks/dolly-v2-3b | 0.384 | 0.611532 | 0.589582 | 0.650767 | 0.370307 | 0.742655 | 0.575535 | 0.544886 |
| EleutherAI/pythia-12b | 0.364 | 0.627104 | 0.636148 | 0.668094 | 0.346416 | 0.760065 | 0.673394 | 0.559676 |
| EleutherAI/gpt-j-6B | 0.382 | 0.621633 | 0.651144 | 0.662617 | 0.363481 | 0.761153 | 0.655963 | 0.565936 |
| databricks/dolly-v2-12b | 0.408 | 0.63931 | 0.616417 | 0.707927 | 0.388225 | 0.757889 | 0.568196 | 0.56781 |
| databricks/dolly-v2-7b | 0.392 | 0.633838 | 0.607735 | 0.686517 | 0.406997 | 0.750816 | 0.644037 | 0.573487 |
| databricks/dolly-v1-6b | 0.41 | 0.62963 | 0.643252 | 0.676758 | 0.384812 | 0.773667 | 0.687768 | 0.583431 |
| EleutherAI/gpt-neox-20b | 0.402 | 0.683923 | 0.656669 | 0.7142 | 0.408703 | 0.784004 | 0.695413 | 0.602236 |
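The gmean column above is the geometric mean of the seven per-task accuracies. A quick sketch to verify one row of the table:

```python
import math

def gmean(values):
    """Geometric mean: exponential of the mean of the logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Per-task accuracies for EleutherAI/gpt-neox-20b from the table above.
accs = [0.402, 0.683923, 0.656669, 0.7142, 0.408703, 0.784004, 0.695413]
print(round(gmean(accs), 6))  # ≈ 0.602236, matching the gmean column
```

Geometric mean penalizes a model that is weak on any single task more than an arithmetic mean would, which is presumably why it was chosen for the aggregate column.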
https://static.nomic.ai/gpt4all/2023_GPT4All-J_Technical_Report_2.pdf
What the tasks look like:
tasks.zip
created with
python scripts/write_out.py --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --num_fewshot 5 --num_examples 10 --output_base_path tasks
Let's check whether Dolly v2 12B reproduces their reported numbers with the same command:
CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental --model_args pretrained=databricks/dolly-v2-12b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> dolly-v2-12b.eval.log
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| winogrande | 0 | acc | 0.6298 | ± | 0.0136 |
| arc_easy | 0 | acc | 0.6713 | ± | 0.0096 |
| | | acc_norm | 0.6380 | ± | 0.0099 |
| hellaswag | 0 | acc | 0.5420 | ± | 0.0050 |
| | | acc_norm | 0.7109 | ± | 0.0045 |
| piqa | 0 | acc | 0.7399 | ± | 0.0102 |
| | | acc_norm | 0.7541 | ± | 0.0100 |
| arc_challenge | 0 | acc | 0.3618 | ± | 0.0140 |
| | | acc_norm | 0.3823 | ± | 0.0142 |
| openbookqa | 0 | acc | 0.2980 | ± | 0.0205 |
| | | acc_norm | 0.4060 | ± | 0.0220 |
| boolq | 1 | acc | 0.5624 | ± | 0.0087 |
Yes, consistent with their reported numbers, so the command itself seems reasonable.
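A rough way to make "consistent" concrete is to check that each measured value lies within a couple of standard errors of the corresponding reported value. A sketch; the `reported` dict below is illustrative, not the actual Databricks numbers:

```python
def consistent(measured, reported, n_sigma=2.0):
    """True if every measured (value, stderr) pair lies within n_sigma
    standard errors of the corresponding reported value."""
    return all(
        abs(value - reported[task]) <= n_sigma * stderr
        for task, (value, stderr) in measured.items()
    )

# Measured values from the run above; reported values here are placeholders
# standing in for the numbers published for dolly-v2-12b.
measured = {"winogrande": (0.6298, 0.0136), "boolq": (0.5624, 0.0087)}
reported = {"winogrande": 0.64, "boolq": 0.57}
print(consistent(measured, reported))  # True: both within 2 standard errors
```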
https://arxiv.org/pdf/2302.13971.pdf
The undertrained older models do indeed perform worse:
CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental --model_args pretrained=h2oai/h2ogpt-oig-oasst1-256-20b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oig-oasst1-256-20b.eval.log
h2ogpt-oig-oasst1-256-20b.eval.log
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| hellaswag | 0 | acc | 0.5320 | ± | 0.0050 |
| | | acc_norm | 0.7115 | ± | 0.0045 |
| arc_easy | 0 | acc | 0.7138 | ± | 0.0093 |
| | | acc_norm | 0.6869 | ± | 0.0095 |
| boolq | 1 | acc | 0.6878 | ± | 0.0081 |
| piqa | 0 | acc | 0.7742 | ± | 0.0098 |
| | | acc_norm | 0.7786 | ± | 0.0097 |
| openbookqa | 0 | acc | 0.2760 | ± | 0.0200 |
| | | acc_norm | 0.3760 | ± | 0.0217 |
| arc_challenge | 0 | acc | 0.3686 | ± | 0.0141 |
| | | acc_norm | 0.3942 | ± | 0.0143 |
| winogrande | 0 | acc | 0.6630 | ± | 0.0133 |
CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental --model_args pretrained=h2oai/h2ogpt-oig-oasst1-256-12b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oig-oasst1-256-12b.eval.log
h2ogpt-oig-oasst1-256-12b.eval.log
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| hellaswag | 0 | acc | 0.5189 | ± | 0.0050 |
| | | acc_norm | 0.6930 | ± | 0.0046 |
| arc_challenge | 0 | acc | 0.3276 | ± | 0.0137 |
| | | acc_norm | 0.3797 | ± | 0.0142 |
| piqa | 0 | acc | 0.7628 | ± | 0.0099 |
| | | acc_norm | 0.7720 | ± | 0.0098 |
| winogrande | 0 | acc | 0.6527 | ± | 0.0134 |
| boolq | 1 | acc | 0.6602 | ± | 0.0083 |
| arc_easy | 0 | acc | 0.6999 | ± | 0.0094 |
| | | acc_norm | 0.6629 | ± | 0.0097 |
| openbookqa | 0 | acc | 0.2940 | ± | 0.0204 |
| | | acc_norm | 0.4020 | ± | 0.0219 |
Local changes applied to https://github.com/EleutherAI/lm-evaluation-harness at commit 4b701e228768052cfae9043dca13e82052ca5eea:
diff --git a/lm_eval/models/huggingface.py b/lm_eval/models/huggingface.py
index 4d3aa24..34b6967 100644
--- a/lm_eval/models/huggingface.py
+++ b/lm_eval/models/huggingface.py
@@ -76,10 +76,10 @@ class HuggingFaceAutoLM(BaseLM):
subfolder: Optional[str] = None,
revision: Optional[str] = "main",
batch_size: Optional[Union[int, str]] = 1,
- max_gen_toks: Optional[int] = 256,
+ max_gen_toks: Optional[int] = 512,
max_length: Optional[int] = None,
add_special_tokens: Optional[bool] = None,
- use_accelerate: Optional[bool] = False,
+ use_accelerate: Optional[bool] = True,
device_map_option: Optional[str] = "auto",
max_memory_per_gpu: Optional[Union[int, str]] = None,
max_cpu_memory: Optional[Union[int, str]] = None,
@@ -87,9 +87,9 @@ class HuggingFaceAutoLM(BaseLM):
dtype: Optional[Union[str, torch.dtype]] = None,
device: Optional[Union[int, str]] = "cuda",
peft: str = None,
- load_in_8bit: Optional[bool] = False,
+ load_in_8bit: Optional[bool] = True,
load_in_4bit: Optional[bool] = False,
- trust_remote_code: Optional[bool] = False,
+ trust_remote_code: Optional[bool] = True,
gptq_use_triton: Optional[bool] = False,
):
"""Initializes a HuggingFace `AutoModel` and `AutoTokenizer` for evaluation.
CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental --model_args pretrained=h2oai/h2ogpt-oig-oasst1-falcon-40b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oig-oasst1-falcon-40b.eval.log
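Instead of patching the defaults in `huggingface.py`, the same constructor keyword arguments shown in the diff (`use_accelerate`, `load_in_8bit`, `trust_remote_code`) can typically be passed per-run as `key=value` pairs in `--model_args`. A sketch; flag support varies by harness commit, so check the version you have checked out:

```shell
# Build the model_args string rather than editing huggingface.py defaults.
MODEL_ARGS="pretrained=h2oai/h2ogpt-oig-oasst1-falcon-40b"
MODEL_ARGS="$MODEL_ARGS,use_accelerate=True,load_in_8bit=True,trust_remote_code=True"

# Echo the full command for inspection instead of running it here.
echo python main.py --model hf-causal-experimental \
  --model_args "$MODEL_ARGS" \
  --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq \
  --device cuda
```

This keeps the harness checkout clean and makes each run's settings visible in the command line recorded at the top of the log.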