
Comments (14)

arnocandel commented on May 22, 2024

https://github.com/google/BIG-bench

arnocandel commented on May 22, 2024

https://github.com/EleutherAI/lm-evaluation-harness has out-of-the-box HF connectors and seems easiest to use.
Some of the tasks, like ARC-Easy, ARC-Challenge, and PIQA, are also shown in the tweet above.

https://arxiv.org/pdf/2303.17564v1.pdf
[attached: tweet screenshot and benchmark table screenshots]

arnocandel commented on May 22, 2024

CUDA_VISIBLE_DEVICES=0 torchrun main.py --model hf-causal --model_args pretrained=h2oai/h2ogpt-oig-oasst1-512-6.9b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oig-oasst1-512-6.9b.eval.log

Task           Version  Metric    Value     Stderr
boolq                1  acc       0.6266  ± 0.0085
arc_challenge        0  acc       0.3225  ± 0.0137
                        acc_norm  0.3396  ± 0.0138
openbookqa           0  acc       0.2660  ± 0.0198
                        acc_norm  0.3660  ± 0.0216
arc_easy             0  acc       0.6776  ± 0.0096
                        acc_norm  0.6195  ± 0.0100
hellaswag            0  acc       0.4822  ± 0.0050
                        acc_norm  0.6465  ± 0.0048
winogrande           0  acc       0.6219  ± 0.0136
piqa                 0  acc       0.7530  ± 0.0101
                        acc_norm  0.7606  ± 0.0100

h2ogpt-oig-oasst1-512-6.9b.eval.log
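
For cross-model comparisons later in this thread, it helps to pull these numbers out of the logs programmatically. A minimal sketch (parse_eval_log is a hypothetical helper; it assumes the pipe-delimited summary table this version of the harness prints at the end of a run, so the regex may need adjusting for other layouts):

# Sketch: extract (task, metric, value) rows from an .eval.log summary table.
# Assumes the pipe-delimited table emitted by lm-evaluation-harness at end of run.
import re

ROW = re.compile(r"\|\s*(\w*)\s*\|\s*(\d*)\s*\|\s*(\w+)\s*\|\s*([\d.]+)\s*\|")

def parse_eval_log(path):
    results, task = {}, None
    with open(path) as f:
        for line in f:
            m = ROW.match(line)
            if not m:
                continue
            name, _version, metric, value = m.groups()
            task = name or task  # continuation rows (acc_norm) leave the task cell blank
            results.setdefault(task, {})[metric] = float(value)
    return results

print(parse_eval_log("h2ogpt-oig-oasst1-512-6.9b.eval.log"))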

arnocandel commented on May 22, 2024

CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental --model_args pretrained=h2oai/h2ogpt-oasst1-512-12b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oasst1-512-12b.eval.log
CUDA_VISIBLE_DEVICES=1 python main.py --model hf-causal-experimental --model_args pretrained=h2oai/h2ogpt-oasst1-512-20b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oasst1-512-20b.eval.log

arnocandel commented on May 22, 2024

h2ogpt-oasst1-512-12b.eval.log

hf-causal-experimental (pretrained=h2oai/h2ogpt-oasst1-512-12b), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

Task           Version  Metric    Value     Stderr
arc_easy             0  acc       0.6932  ± 0.0095
                        acc_norm  0.6225  ± 0.0099
openbookqa           0  acc       0.2900  ± 0.0203
                        acc_norm  0.3740  ± 0.0217
winogrande           0  acc       0.6369  ± 0.0135
hellaswag            0  acc       0.5140  ± 0.0050
                        acc_norm  0.6803  ± 0.0047
piqa                 0  acc       0.7682  ± 0.0098
                        acc_norm  0.7661  ± 0.0099
boolq                1  acc       0.6685  ± 0.0082
arc_challenge        0  acc       0.3157  ± 0.0136
                        acc_norm  0.3507  ± 0.0139

h2ogpt-oasst1-512-20b.eval.log

hf-causal-experimental (pretrained=h2oai/h2ogpt-oasst1-512-20b), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

Task           Version  Metric    Value     Stderr
hellaswag            0  acc       0.5419  ± 0.0050
                        acc_norm  0.7259  ± 0.0045
boolq                1  acc       0.7125  ± 0.0079
piqa                 0  acc       0.7742  ± 0.0098
                        acc_norm  0.7775  ± 0.0097
openbookqa           0  acc       0.2800  ± 0.0201
                        acc_norm  0.4000  ± 0.0219
arc_challenge        0  acc       0.3993  ± 0.0143
                        acc_norm  0.4420  ± 0.0145
winogrande           0  acc       0.6614  ± 0.0133
arc_easy             0  acc       0.7327  ± 0.0091
                        acc_norm  0.6894  ± 0.0095

arnocandel commented on May 22, 2024

https://huggingface.co/databricks/dolly-v2-12b
model                    openbookqa  arc_easy  winogrande  hellaswag  arc_challenge  piqa      boolq     gmean
EleutherAI/pythia-2.8b   0.348       0.585859  0.589582    0.591217   0.323379       0.73395   0.638226  0.523431
EleutherAI/pythia-6.9b   0.368       0.604798  0.608524    0.631548   0.343857       0.761153  0.6263    0.543567
databricks/dolly-v2-3b   0.384       0.611532  0.589582    0.650767   0.370307       0.742655  0.575535  0.544886
EleutherAI/pythia-12b    0.364       0.627104  0.636148    0.668094   0.346416       0.760065  0.673394  0.559676
EleutherAI/gpt-j-6B      0.382       0.621633  0.651144    0.662617   0.363481       0.761153  0.655963  0.565936
databricks/dolly-v2-12b  0.408       0.63931   0.616417    0.707927   0.388225       0.757889  0.568196  0.56781
databricks/dolly-v2-7b   0.392       0.633838  0.607735    0.686517   0.406997       0.750816  0.644037  0.573487
databricks/dolly-v1-6b   0.41        0.62963   0.643252    0.676758   0.384812       0.773667  0.687768  0.583431
EleutherAI/gpt-neox-20b  0.402       0.683923  0.656669    0.7142     0.408703       0.784004  0.695413  0.602236
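
The gmean column is the geometric mean of the seven task scores in a row. A quick sketch of that calculation, using the pythia-2.8b row from the table above as input:

# Geometric mean of the seven benchmark scores (pythia-2.8b row from the table above).
import math

scores = [0.348, 0.585859, 0.589582, 0.591217, 0.323379, 0.73395, 0.638226]
gmean = math.exp(sum(math.log(s) for s in scores) / len(scores))
print(round(gmean, 6))  # ~0.523431, matching the gmean column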

arnocandel commented on May 22, 2024

https://static.nomic.ai/gpt4all/2023_GPT4All-J_Technical_Report_2.pdf

arnocandel commented on May 22, 2024

What the tasks look like:
tasks.zip

created with
python scripts/write_out.py --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --num_fewshot 5 --num_examples 10 --output_base_path tasks
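
To skim what was written out, something along these lines works (it just dumps the beginning of whatever files write_out.py put into the output directory; file naming and format depend on the harness version):

# Peek at the few-shot prompts dumped by write_out.py into the tasks/ directory.
from pathlib import Path

for path in sorted(Path("tasks").glob("*")):
    if not path.is_file():
        continue
    text = path.read_text(errors="ignore")
    print(f"=== {path.name} ({len(text)} chars) ===")
    print(text[:400])  # first few hundred characters of each dump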

arnocandel commented on May 22, 2024

Let's see if the same command reproduces the reported numbers for Dolly v2 12B:

CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental --model_args pretrained=databricks/dolly-v2-12b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> dolly-v2-12b.eval.log

Task           Version  Metric    Value     Stderr
winogrande           0  acc       0.6298  ± 0.0136
arc_easy             0  acc       0.6713  ± 0.0096
                        acc_norm  0.6380  ± 0.0099
hellaswag            0  acc       0.5420  ± 0.0050
                        acc_norm  0.7109  ± 0.0045
piqa                 0  acc       0.7399  ± 0.0102
                        acc_norm  0.7541  ± 0.0100
arc_challenge        0  acc       0.3618  ± 0.0140
                        acc_norm  0.3823  ± 0.0142
openbookqa           0  acc       0.2980  ± 0.0205
                        acc_norm  0.4060  ± 0.0220
boolq                1  acc       0.5624  ± 0.0087

Yes, consistent with their reported numbers, so the command itself seems reasonable.
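
A rough side-by-side against the dolly-v2-12b row tabulated earlier (assuming that table reports acc_norm where the harness provides it and plain acc otherwise; treat that mapping as an assumption):

# Rough consistency check: this run vs. the dolly-v2-12b row from the comparison table above.
# Assumption: the tabulated numbers are acc_norm where available, otherwise acc.
measured = {   # from the eval log above
    "openbookqa": 0.4060, "arc_easy": 0.6380, "winogrande": 0.6298, "hellaswag": 0.7109,
    "arc_challenge": 0.3823, "piqa": 0.7541, "boolq": 0.5624,
}
tabulated = {  # dolly-v2-12b row from the earlier table
    "openbookqa": 0.408, "arc_easy": 0.63931, "winogrande": 0.616417, "hellaswag": 0.707927,
    "arc_challenge": 0.388225, "piqa": 0.757889, "boolq": 0.568196,
}
for task, value in measured.items():
    print(f"{task:14s} measured={value:.3f}  tabulated={tabulated[task]:.3f}  "
          f"diff={value - tabulated[task]:+.3f}")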

arnocandel commented on May 22, 2024

https://arxiv.org/pdf/2302.13971.pdf

arnocandel commented on May 22, 2024

The undertrained older models do indeed perform worse:

CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental --model_args pretrained=h2oai/h2ogpt-oig-oasst1-256-20b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oig-oasst1-256-20b.eval.log

h2ogpt-oig-oasst1-256-20b.eval.log

Task           Version  Metric    Value     Stderr
hellaswag            0  acc       0.5320  ± 0.0050
                        acc_norm  0.7115  ± 0.0045
arc_easy             0  acc       0.7138  ± 0.0093
                        acc_norm  0.6869  ± 0.0095
boolq                1  acc       0.6878  ± 0.0081
piqa                 0  acc       0.7742  ± 0.0098
                        acc_norm  0.7786  ± 0.0097
openbookqa           0  acc       0.2760  ± 0.0200
                        acc_norm  0.3760  ± 0.0217
arc_challenge        0  acc       0.3686  ± 0.0141
                        acc_norm  0.3942  ± 0.0143
winogrande           0  acc       0.6630  ± 0.0133

arnocandel commented on May 22, 2024

CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental --model_args pretrained=h2oai/h2ogpt-oig-oasst1-256-12b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oig-oasst1-256-12b.eval.log

h2ogpt-oig-oasst1-256-12b.eval.log

Task           Version  Metric    Value     Stderr
hellaswag            0  acc       0.5189  ± 0.0050
                        acc_norm  0.6930  ± 0.0046
arc_challenge        0  acc       0.3276  ± 0.0137
                        acc_norm  0.3797  ± 0.0142
piqa                 0  acc       0.7628  ± 0.0099
                        acc_norm  0.7720  ± 0.0098
winogrande           0  acc       0.6527  ± 0.0134
boolq                1  acc       0.6602  ± 0.0083
arc_easy             0  acc       0.6999  ± 0.0094
                        acc_norm  0.6629  ± 0.0097
openbookqa           0  acc       0.2940  ± 0.0204
                        acc_norm  0.4020  ± 0.0219

arnocandel commented on May 22, 2024

https://github.com/EleutherAI/lm-evaluation-harness
at commit 4b701e228768052cfae9043dca13e82052ca5eea, with the following local patch:

diff --git a/lm_eval/models/huggingface.py b/lm_eval/models/huggingface.py
index 4d3aa24..34b6967 100644
--- a/lm_eval/models/huggingface.py
+++ b/lm_eval/models/huggingface.py
@@ -76,10 +76,10 @@ class HuggingFaceAutoLM(BaseLM):
         subfolder: Optional[str] = None,
         revision: Optional[str] = "main",
         batch_size: Optional[Union[int, str]] = 1,
-        max_gen_toks: Optional[int] = 256,
+        max_gen_toks: Optional[int] = 512,
         max_length: Optional[int] = None,
         add_special_tokens: Optional[bool] = None,
-        use_accelerate: Optional[bool] = False,
+        use_accelerate: Optional[bool] = True,
         device_map_option: Optional[str] = "auto",
         max_memory_per_gpu: Optional[Union[int, str]] = None,
         max_cpu_memory: Optional[Union[int, str]] = None,
@@ -87,9 +87,9 @@ class HuggingFaceAutoLM(BaseLM):
         dtype: Optional[Union[str, torch.dtype]] = None,
         device: Optional[Union[int, str]] = "cuda",
         peft: str = None,
-        load_in_8bit: Optional[bool] = False,
+        load_in_8bit: Optional[bool] = True,
         load_in_4bit: Optional[bool] = False,
-        trust_remote_code: Optional[bool] = False,
+        trust_remote_code: Optional[bool] = True,
         gptq_use_triton: Optional[bool] = False,
     ):
         """Initializes a HuggingFace `AutoModel` and `AutoTokenizer` for evaluation.

CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental --model_args pretrained=h2oai/h2ogpt-oig-oasst1-falcon-40b --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq --device cuda &> h2ogpt-oig-oasst1-falcon-40b.eval.log
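
Depending on the harness version, the same settings can usually be passed as extra keyword arguments in the model_args string instead of patching the defaults; the keyword names below are the constructor arguments shown in the diff above, so treat the exact spelling as an assumption for other versions. A sketch of the programmatic form:

# Sketch: pass the flipped defaults (8-bit loading, accelerate sharding, remote code)
# as model_args kwargs instead of editing huggingface.py. Keyword names taken from
# the constructor in the diff above; availability depends on the harness version.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal-experimental",
    model_args="pretrained=h2oai/h2ogpt-oig-oasst1-falcon-40b,"
               "load_in_8bit=True,use_accelerate=True,trust_remote_code=True",
    tasks=["openbookqa", "arc_easy", "winogrande", "hellaswag",
           "arc_challenge", "piqa", "boolq"],
    num_fewshot=0,
    device="cuda",
)
print(evaluator.make_table(results))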
