Topic: evaluation
Something interesting about evaluation
evaluation,Short and sweet LISP editing
User: abo-abo
Home Page: http://oremacs.com/lispy/
evaluation,⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍
User: amenra
Home Page: https://amenra.github.io/ranx
evaluation,Python implementation of the IOU Tracker
User: bochinski
Home Page: http://www.nue.tu-berlin.de
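The IOU Tracker associates detections across frames by thresholding the intersection-over-union of their bounding boxes. A minimal stdlib sketch of that core quantity (the `(x1, y1, x2, y2)` box convention here is an assumption for illustration, not the repo's API):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) with x2 > x1 and y2 > y1 -- this
    coordinate convention is an assumption, not the repo's API.
    """
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x1 square: intersection 1, union 7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```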
evaluation,Case Recommender: A Flexible and Extensible Python Framework for Recommender Systems
Organization: caserec
evaluation,CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark (a benchmark for Chinese medical information processing)
Organization: cbluebenchmark
Home Page: https://tianchi.aliyun.com/dataset/dataDetail?dataId=95414&lang=en-us
evaluation,ERRor ANnotation Toolkit: Automatically extract and classify grammatical errors in parallel original and corrected sentences.
User: chrisjbryant
evaluation,☁️ 🚀 📊 📈 Evaluating state of the art in AI
Organization: cloud-cv
Home Page: https://eval.ai
evaluation,SuperCLUE: A Comprehensive Benchmark for Chinese General-Purpose Foundation Models
Organization: cluebenchmark
Home Page: https://www.superclueai.com
evaluation,A Simple Math and Pseudo-C# Expression Evaluator in One C# File. Can also execute small C#-like scripts
User: codingseb
evaluation,Avalanche: an End-to-End Library for Continual Learning based on PyTorch.
Organization: continualai
Home Page: http://avalanche.continualai.org
evaluation,Simple Safe Sandboxed Extensible Expression Evaluator for Python
User: danthedeckie
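The idea behind a sandboxed expression evaluator is to parse the expression into an AST and walk only a whitelist of node types, so arbitrary code can never run. A minimal stdlib sketch of that whitelisting pattern (this is not simpleeval's API, just the technique it embodies):

```python
import ast
import operator

# Whitelisted binary operators -- anything outside this table is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def safe_eval(expr):
    """Evaluate basic arithmetic without ever calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("disallowed syntax")
    return walk(ast.parse(expr, mode="eval"))

print(safe_eval("2 * (3 + 4)"))  # 14; safe_eval("__import__('os')") raises
```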
evaluation,An extensive evaluation and comparison of 28 state-of-the-art superpixel algorithms on 5 datasets.
User: davidstutz
evaluation,A General Toolbox for Identifying Object Detection Errors
User: dbolya
Home Page: https://dbolya.github.io/tide
evaluation,XAI - An eXplainability toolbox for machine learning
Organization: ethicalml
Home Page: https://ethical.institute/principles.html#commitment-3
evaluation,Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
Organization: google-deepmind
Home Page: https://arxiv.org/abs/2403.18802
evaluation,FuzzBench - Fuzzer benchmarking as a service.
Organization: google
Home Page: https://google.github.io/fuzzbench/
evaluation,A toolbox repository to help evaluate various methods that perform image matching on a pair of images.
User: grumpyzhou
evaluation,🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
Organization: huggingface
Home Page: https://huggingface.co/docs/evaluate
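Metric libraries like this one are typically called with parallel lists of predictions and references and return a dict of scores. A dependency-free accuracy sketch of that calling convention (the function name is ours, a stdlib stand-in, not the library's API):

```python
def compute_accuracy(predictions, references):
    """Fraction of predictions equal to their references.

    Mirrors the predictions/references convention common to metric
    libraries; this helper itself is an illustrative stand-in.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}

print(compute_accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # {'accuracy': 0.75}
```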
evaluation,An open-source visual programming environment for battle-testing prompts to LLMs.
User: ianarawjo
Home Page: https://chainforge.ai
evaluation,A collection of datasets that pair questions with SQL queries.
User: jkkummerfeld
Home Page: http://jkk.name/text2sql-data/
evaluation,Arbitrary expression evaluation for golang
User: knetic
evaluation,🪢 Open source LLM engineering platform. Observability, metrics, evals, prompt management, testing, prompt playground, datasets, LLM evaluations -- 🍊YC W23 🤖 integrate via Typescript, Python / Decorators, OpenAI, Langchain, LlamaIndex, Litellm, Instructor, Mistral, Perplexity, Claude, Gemini, Vertex
Organization: langfuse
Home Page: https://langfuse.com/docs
evaluation,The production toolkit for LLMs. Observability, prompt management and evaluations.
Organization: lunary-ai
Home Page: https://lunary.ai
evaluation,Evaluation code for various unsupervised automated metrics for Natural Language Generation.
Organization: maluuba
Home Page: http://arxiv.org/abs/1706.09799
evaluation,Python package for the evaluation of odometry and SLAM
User: michaelgrupp
Home Page: https://michaelgrupp.github.io/evo/
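A standard odometry/SLAM evaluation quantity is the absolute trajectory error (ATE), usually reported as an RMSE over aligned poses. A minimal 2D stdlib sketch, assuming the trajectories are already associated and in the same frame (tools like evo also handle timestamp association and alignment):

```python
import math

def ate_rmse(estimated, reference):
    """Root-mean-square absolute trajectory error over (x, y) positions.

    Assumes pose-to-pose association and frame alignment were done
    upstream -- a simplification relative to full SLAM evaluation.
    """
    if len(estimated) != len(reference):
        raise ValueError("trajectories must have equal length")
    sq_errors = [(ex - rx) ** 2 + (ey - ry) ** 2
                 for (ex, ey), (rx, ry) in zip(estimated, reference)]
    return math.sqrt(sum(sq_errors) / len(sq_errors))

est = [(0.0, 0.0), (1.1, 0.0), (2.0, 0.1)]
ref = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
print(ate_rmse(est, ref))
```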
evaluation,A unified evaluation framework for large language models
Organization: microsoft
Home Page: http://aka.ms/promptbench
evaluation,The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".
User: mlgroupjlu
Home Page: https://arxiv.org/abs/2307.03109
evaluation,🤘 awesome-semantic-segmentation
User: mrgloom
evaluation,OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Organization: open-compass
Home Page: https://opencompass.org.cn/
evaluation,Open-source evaluation toolkit of large vision-language models (LVLMs), support GPT-4v, Gemini, QwenVLPlus, 30+ HF models, 15+ benchmarks
Organization: open-compass
Home Page: https://rank.opencompass.org.cn/leaderboard-multimodal
evaluation,Expression evaluation in golang
Organization: paesslerag
evaluation,SemanticKITTI API for visualizing dataset, processing data, and evaluating results.
Organization: prbonn
Home Page: http://semantic-kitti.org
evaluation,Test your prompts, models, and RAGs. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models with CI/CD integration.
Organization: promptfoo
Home Page: https://www.promptfoo.dev/
evaluation,Behavioral "black-box" testing for recommender systems
Organization: reclist
Home Page: https://reclist.io
evaluation,Building a modern functional compiler from first principles. (http://dev.stephendiehl.com/fun/)
User: sdiehl
evaluation,Multi-class confusion matrix library in Python
User: sepandhaghighi
Home Page: http://pycm.io
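A multi-class confusion matrix is just a table counting (actual, predicted) label pairs, from which per-class metrics are derived. A minimal stdlib sketch of that structure (the dict-of-dicts layout is an illustrative choice, not pycm's API):

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Build a {true_label: {predicted_label: count}} table.

    The nested-dict layout is an illustrative choice; libraries
    like pycm expose richer objects with derived statistics.
    """
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    return {t: {p: counts[(t, p)] for p in labels} for t in labels}

cm = confusion_matrix(["cat", "dog", "cat", "bird"],
                      ["cat", "cat", "cat", "bird"])
print(cm)  # "dog" misclassified as "cat" shows up at cm["dog"]["cat"]
```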
evaluation,Python Single Object Tracking Evaluation
User: strangerzhang
evaluation,An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
Organization: tatsu-lab
Home Page: https://tatsu-lab.github.io/alpaca_eval/
evaluation,TCExam is a CBA (Computer-Based Assessment) system (e-exam, CBT - Computer Based Testing) for universities, schools and companies, that enables educators and trainers to author, schedule, deliver, and report on surveys, quizzes, tests and exams.
Organization: tecnickcom
Home Page: http://www.tcexam.org
evaluation,Resource, Evaluation and Detection Papers for ChatGPT
Organization: thu-keg
evaluation,High-fidelity performance metrics for generative models in PyTorch
User: toshas
evaluation,AutoPrompt: Automatic Prompt Construction for Masked Language Models.
Organization: ucinlp
evaluation,UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.
Organization: uptrain-ai
Home Page: https://uptrain.ai/
evaluation,Klipse is a JavaScript plugin for embedding interactive code snippets in tech blogs.
User: viebel
Home Page: http://blog.klipse.tech/
evaluation,Visual Object Tracking (VOT) challenge evaluation toolkit
Organization: votchallenge
evaluation,Research on value evaluation and alignment for Chinese large language models
Organization: x-plug
evaluation,(IROS 2020, ECCVW 2020) Official Python Implementation for "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics"
User: xinshuoweng
Home Page: http://www.xinshuoweng.com/
evaluation,recommender system library for the CLR (.NET)
User: zenogantner
Home Page: http://mymedialite.net
evaluation,End-to-end Automatic Speech Recognition for Mandarin and English in TensorFlow
User: zzw922cn
evaluation,C# Eval Expression | Evaluate, Compile, and Execute C# code and expression at runtime.
Organization: zzzprojects
Home Page: https://eval-expression.net/