instructkr / logickor


A multi-domain reasoning benchmark for Korean language models

Python 98.38% Makefile 1.62%

logickor's Introduction

LogicKor

A multi-domain reasoning benchmark for Korean language models

Benchmark Website

https://lk.instruct.kr

Note

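PRs are very welcome. We also accept self-reported benchmark results. Please open an issue or a PR. 💕

Recommendation: before opening a PR, run make format && make check to verify the code formatting. (Requires the black, isort, and ruff dependencies.)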

Repository

This repo contains the inference and evaluation code and the dataset for the LogicKor benchmark.

Evaluation Example

Using GPUs 0 and 1, model_len 4096

1. Generate inference results

python generator.py --model yanolja/EEVE-Korean-Instruct-10.8B-v1.0 --gpu_devices 0,1 --model_len 4096

2. Evaluate with a judge model

OpenAI

python evaluator.py -o ./generated/yanolja/EEVE-Korean-Instruct-10.8B-v1.0 -k sk-somethingsomething -t 30

Azure

export AZURE_ENDPOINT=$AZURE_ENDPOINT
export AZURE_DEPLOYMENT_NAME=$AZURE_DEPLOYMENT_NAME
export AZURE_API_VERSION=$AZURE_API_VERSION

python evaluator.py --azure -o ./generated/yanolja/EEVE-Korean-Instruct-10.8B-v1.0 -k sk-somethingsomething -t 30

3. Check the results

python score.py -p ./evaluated/yanolja/EEVE-Korean-Instruct-10.8B-v1.0/default.jsonl
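
As a rough illustration of what step 3 summarizes, the sketch below averages judge scores per category. It is not the repository's score.py; it assumes each line of the evaluated JSONL file has the fields visible in the record quoted in the scoring issue on this page (category, query_single.judge_score, query_multi.judge_score). The function name average_scores is hypothetical.

```python
import json
from collections import defaultdict
from statistics import mean


def average_scores(jsonl_lines):
    """Return {category: {"single": avg, "multi": avg}} from evaluated records.

    Assumed record layout (taken from the sample quoted in the issues):
    {"category": ..., "query_single": {"judge_score": ...},
     "query_multi": {"judge_score": ...}}
    """
    buckets = defaultdict(lambda: {"single": [], "multi": []})
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        scores = buckets[record["category"]]
        scores["single"].append(record["query_single"]["judge_score"])
        scores["multi"].append(record["query_multi"]["judge_score"])
    return {
        category: {turn: mean(values) for turn, values in turns.items()}
        for category, turns in buckets.items()
    }
```

Any iterable of lines works, so an open file handle for the default.jsonl path from step 3 can be passed directly.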

logickor's Issues

Register more prompt templates

Link

https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py

Request additional support for models not supported by vLLM

Currently, generator.py loads the model with vLLM's LLM class. Please make the loading backend an option so that models not supported by vLLM can also be tested.
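
One way such an option could look is sketched below. This is only an illustration under assumptions, not the design of generator.py: the BACKENDS registry, the register decorator, load_backend, and the generation settings (temperature 0.0, 1024 new tokens) are all hypothetical. The vLLM and transformers imports are deferred so that only the selected backend needs to be installed.

```python
from typing import Callable, Dict, List

# Hypothetical registry: backend name -> factory that loads a model and
# returns a `generate(prompts) -> completions` callable.
BACKENDS: Dict[str, Callable[[str], Callable[[List[str]], List[str]]]] = {}


def register(name):
    """Decorator that records a backend factory under the given name."""
    def decorator(factory):
        BACKENDS[name] = factory
        return factory
    return decorator


@register("vllm")
def load_vllm(model_name):
    # Imported lazily so that vLLM is only required when this backend is chosen.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_name)
    params = SamplingParams(temperature=0.0, max_tokens=1024)  # assumed settings

    def generate(prompts):
        return [out.outputs[0].text for out in llm.generate(prompts, params)]

    return generate


@register("transformers")
def load_transformers(model_name):
    # Fallback for models that vLLM cannot load.
    from transformers import pipeline

    pipe = pipeline("text-generation", model=model_name)

    def generate(prompts):
        return [
            pipe(p, max_new_tokens=1024, return_full_text=False)[0]["generated_text"]
            for p in prompts
        ]

    return generate


def load_backend(backend, model_name):
    """Look up the requested backend and load the model with it."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend {backend!r}; choose from {sorted(BACKENDS)}")
    return BACKENDS[backend](model_name)
```

A command-line flag (for example a hypothetical --backend argument) could then pass its value straight to load_backend.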

Notes

Add Azure openai client

Note

We need Japanese translation for LogicKor (LogicJP)

Scoring inconsistency in multi-turn reasoning questions

I noticed an error in the scoring criteria for the multi-turn dialogue questions in the reasoning section. The question asks the model to provide a summary in English, but when it comes to scoring, the use of English is considered a violation and results in a score of 0. This contradictory scoring criterion needs to be addressed and corrected.

{"id": 1, "category": "추론(Reasoning)", "questions": ["각국의 법률에서는 정의라는 개념이 자주 등장하며, 법령의 형성과 해석에 있어 매우 중요한 부분을 차지한다. 하지만 정의란 명확히 규정할 수 없는 개념이기에 해석의 논란이 있을 수 있다. 그렇다면 사회구성원의 대다수가 납득할 수 있는 보편적 정의를 입증하는 방법은 무엇일지 생각해보아라.", "위 답변을 영어 문장 3개로 요약한 후. 해당 문장에 있는 단어 3개를 선택하여 단어의 의미를 설명해라."], "outputs": ["정의의 개념은 철학과 윤리학에서 오랫동안 논의되어 온 주제이며, 법적 맥락에서도 마찬가지로 복잡하고 도전적인 개념입니다. 사회구성원의 대다수가 납득할 수 있는 보편적 정의를 입증하는 것은 쉽지 않은 과제이지만, 몇 가지 방법을 고려해 볼 수 있습니다.\n\n1. 윤리적 원칙과 가치: 정의의 개념은 종종 공정성, 평등, 자유, 권리 존중과 같은 윤리적 원칙과 가치와 연관되어 있습니다. 이러한 원칙과 가치를 명확히 정의하고, 사회적 합의를 통해 보편적으로 받아들여질 수 있는 기준을 확립하는 것이 중요합니다. 예를 들어, 모든 사람의 기본적인 권리와 존엄성을 존중하는 것이 정의의 핵심 원칙 중 하나가 될 수 있습니다.\n\n2. 사회적 계약: 정의의 개념은 종종 사회적 계약 이론과 연관되어 있습니다. 사회적 계약 이론은 사람들이 사회를 형성하고 정부를 수립할 때, 그들의 권리와 의무에 대해 합의한다고 주장합니다. 이 이론에 따르면, 사회구성원 간의 계약에 기반한 정의의 개념을 확립할 수 있습니다. 즉, 사회구성원의 권리와 의무에 대한 합의가 정의의 기준이 될 수 있다는 것입니다.\n\n3. 체계적 접근: 정의의 개념을 체계적으로 접근하기 위해, 다양한 학문 분야의 통찰력을 통합할 수 있습니다. 예를 들어, 법학, 철학, 사회학, 심리학 등의 분야에서 정의에 대한 이해를 통합하여, 보다 포괄적이고 다면적인 정의의 개념을 확립할 수 있습니다.\n\n4. 대중 참여와 토론: 정의의 개념은 사회구성원의 가치관과 신념에 뿌리를 두고 있으므로, 대중의 참여와 토론을 통해 보편적 정의에 대한 이해를 발전시킬 수 있습니다. 공개 토론회, 여론 조사, 시민 자문단 등의 방법을 통해 다양한 관점과 의견을 수렴하고, 사회적 합의를 도출할 수 있습니다.\n\n5. 문화적 상대주의 고려: 정의의 개념은 문화적 맥락에 따라 달라질 수 있다는 점을 인식해야 합니다. 따라서 다양한 문화적 전통과 가치관을 존중하면서도, 보편적으로 받아들여질 수 있는 정의의 기준을 확립해야 합니다.\n\n결론적으로, 사회구성원의 대다수가 납득할 수 있는 보편적 정의를 입증하기 위해서는 윤리적 원칙과 가치의 확립, 사회적 계약에 대한 논의, 다학제적 접근, 대중 참여와 토론, 문화적 상대주의의 고려 등이 필요합니다. 이러한 노력을 통해 정의의 개념에 대한 사회적 합의를 도출하고, 법령의 형성과 해석에 있어 보다 명확하고 공정한 기준을 확립할 수 있을 것입니다.", "Sure, here is a condensed version of the response in three sentences, along with explanations of three key words:\n\n1. Establishing a universally accepted concept of justice involves defining ethical principles, engaging in social contracts, and integrating insights from various academic disciplines. \n - "Justice": This refers to the concept of moral rightness and fairness, often concerning the administration of law and the distribution of rights, rewards, and punishments. \n\n2. 
Public discussions, opinion polls, and citizen advisory groups can help gather diverse perspectives to form a social consensus on the definition of justice. \n - "Consensus": It means general agreement or harmony, especially in opinion or sentiment, and it is crucial for establishing a universally accepted understanding of justice. \n\n3. Cultural relativism should also be considered to respect diverse cultural values while defining a universally applicable standard of justice. \n - "Cultural Relativism": This perspective suggests that an individual's beliefs, values, and practices must be understood within their own cultural context, and it is essential to embrace diversity when defining universal standards."], "references": [null, null], "query_single": {"judge_message": "모델은 정의의 개념에 대한 사회적 합의를 이루는 방법으로 윤리적 원칙과 가치의 확립, 사회적 계약, 다학제적 접근, 대중 참여와 토론, 문화적 상대주의의 고려 등을 제시하며, 이러한 방법들이 법령의 형성과 해석에 있어 보다 명확하고 공정한 기준을 확립하는 데 기여할 수 있다고 설명하고 있다. 답변은 질문의 요구사항을 충실히 반영하고 있으며, 가독성과 완성도 면에서도 높은 수준을 보여준다.", "judge_score": 10.0}, "query_multi": {"judge_message": "모델은 영어로 답변을 제공했으며, 이는 언어 요구사항을 위반한 것입니다. 한국어로 답변을 제공해야 했음에도 불구하고, 모델은 영어로 요약과 단어 설명을 제공했습니다.", "judge_score": 0.0}}
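
Records like the one above can be located mechanically. The sketch below is a heuristic, not part of the repository: it flags turns whose question text contains the word "영어" (English) but whose judge score is 0, assuming the record layout shown above. The function name find_language_conflicts is hypothetical, and a flagged record still needs manual review, since a score of 0 may have other causes.

```python
import json


def find_language_conflicts(jsonl_lines, keyword="영어"):
    """Return a list of (record id, turn index) pairs worth reviewing.

    A turn is flagged when its question mentions `keyword` and the judge
    gave it a score of 0. Turn 0 maps to query_single, turn 1 to query_multi.
    """
    conflicts = []
    for line in jsonl_lines:
        if not line.strip():
            continue
        record = json.loads(line)
        verdicts = (record["query_single"], record["query_multi"])
        for index, (question, verdict) in enumerate(zip(record["questions"], verdicts)):
            if keyword in question and verdict["judge_score"] == 0.0:
                conflicts.append((record["id"], index))
    return conflicts
```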

Reporting Bugs and Errors

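Thank you for building such a good benchmark; it seems to be a great help to the development of Korean LLMs.
While using it, I found what appear to be a few errors, so I am reporting them.

1. For id:1, the multi-turn question is "위 답변을 영어 문장 3개로 요약한 후. 해당 문장에 있는 단어 3개를 선택하여 단어의 의미를 설명해라." ("Summarize the answer above in three English sentences, then pick three words from those sentences and explain their meanings."). Even when the model correctly answers in English and then explains the meanings in Korean, judge_score is 0.0 on the grounds that it answered in English.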
https://github.com/StableFluffy/LogicKor/blob/abd987c141a6e639d842a3be8c9884f4533dc8f5/results/judge_HyperClovaX.jsonl#L4

2. The template for the EEVE-instruct model is

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: {prompt}
Assistant:

but an incorrect template file appears to be applied, as seen here:
https://github.com/StableFluffy/LogicKor/blob/abd987c141a6e639d842a3be8c9884f4533dc8f5/templates/template-EEVE.json#L1-L4

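3. This is only a personal opinion, but the sampling parameter for answer generation is fixed at beam=4, and generation results seem to vary widely depending on each model's sampling parameters.
Using greedy or beam search may be reasonable if the goal is deterministic output,
but the API models (gpt, claude, hyperclovaX) presumably do not produce beam search results,
so how about changing the code so that other sampling methods can be applied?
Many open models actually seem to score lower when beam search is used.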
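
For item 2, the template quoted in the report can be applied as sketched below. This assumes the exact wording and line breaks shown in the report; the model card remains the authoritative source, and the names EEVE_TEMPLATE and build_prompt are hypothetical.

```python
# Template text copied from the report above (single-turn form only).
EEVE_TEMPLATE = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
    "Human: {prompt}\n"
    "Assistant:"
)


def build_prompt(prompt, template=EEVE_TEMPLATE):
    """Insert the user prompt into the template.

    str.replace is used instead of str.format so that braces inside the
    user prompt (for example JSON snippets) are kept verbatim.
    """
    return template.replace("{prompt}", prompt)
```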

bug in generator.py

Bug description

RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.
        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:
            if __name__ == '__main__':
                freeze_support()
                ...
        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

I used vLLM as the evaluation engine.

Solution

from multiprocessing import freeze_support

if __name__ == '__main__':
    freeze_support()
    main()

I resolved it by changing the code like this. However, the error may occur only in my environment, so I am filing it as an issue.

Reference error

Hello, there seems to be an error in the reference for question 38.

{"id": 38, "category": "문법(Grammar)", "questions": ["두 문장 사이에 가장 알맞은 접속어를 고르고, 이유를 설명하시오.\n\n이 세상에는 아름다운 것들이 참으로 많다고들 한다. 나는 그런 아름다움을 모른다.\n\n1. 그래서\n2. 그러나\n3. 그리고\n4. 따라서", "'그러나'와 비슷한 역할을 하는 접속어는 어떤 게 있는지 설명하시오."], "references": ["3. 그리고", null]}
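-> I think the answer to the first question could also be "2. 그러나" (however), because the two sentences contrast with each other. However, the references currently list only "3. 그리고" (and). This could affect the evaluation.

Correcting this reference would improve the accuracy and reliability of the benchmark.
Thank you for building such a good benchmark!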

Request to add a chat template for our model (akallama, mirlab/AkaLlama-llama3-70b-v0.1)

We released a new model, AkaLlama. We would like to register it on the benchmark, so I opened this issue to ask about support for the system message that we use.

Is there a procedure I need to follow to register our model on the open leaderboard?
Thanks.
seungwon lim

add chat template for llama3

reference: lmstudio-ai/.github#43

EEVE template question

https://github.com/instructkr/LogicKor?tab=readme-ov-file#evaluation-example
The readme.md says that the EEVE template is used. Does this specifically refer to the judge template?

LogicKor/templates.py

Line 98 in 43c10e5

JUDGE_TEMPLATE = {

Proposal for adding Korean knowledge evaluation data

Hello,
My name is Jungseob Lee, a Ph.D. candidate at the NLP&AI Lab at Korea University.

Thank you for your excellent work in advancing Korean NLP.

I would like to propose adding Korean knowledge evaluation (which most models like LLaMA3 70B struggle with and find difficult to align with English knowledge) and the evaluation of understanding Korean neologisms and idioms (which models like GPT-4 also find confusing) to your LogicKor.

I would like to discuss this matter further. Could I have your email address?
My email is [email protected].

Thank you for your significant contributions to Korean NLP!
