ai-evaluation

Here are 163 public repositories matching this topic...

aden-hive / hive

Outcome-driven agent development framework that evolves

python agent automation awesome self-hosted openai autonomous-agents human-in-the-loop claude agent-framework self-improving ai-evaluation anthropic agent-skills claude-code self-improving-ai self-improving-agent observability-ai

Updated Mar 16, 2026
Python

cvs-health / uqlm

UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection

uncertainty-quantification uncertainty-estimation ai-safety confidence-score hallucination confidence-estimation ai-evaluation llm llm-evaluation llm-safety hallucination-evaluation hallucination-detection hallucination-mitigation llm-hallucination

Updated Mar 13, 2026
Python

lechmazur / confabulations

Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.

benchmark leaderboard gemini llama language-model claude rag o1 hallucinations ai-evaluation llm gemini-pro llm-benchmarking confabulations deepseek-r1 o3-mini

Updated Aug 7, 2025
HTML

guestrin-lab / deepscholar

Build and benchmark deep research

dataset-generation benchmark-suite evaluation-framework ai-evaluation deep-research

Updated Feb 16, 2026
Python

rungalileo / agent-leaderboard

Ranking LLMs on agentic tasks

ai evaluation ai-agents synthetic-data ai-evaluation llms ai-benchmark agent-evaluation

Updated Nov 18, 2025
Jupyter Notebook

METR / vivaria

Vivaria is METR's tool for running evaluations and conducting agent elicitation research.

ai elicitation ai-evaluation evals

Updated Feb 15, 2026
TypeScript

taoAIGC / AICompare

Open multiple AI sites with one click and view the AI results

ai gemini poe claude perplexity ai-evaluation llm chatgpt

Updated Mar 4, 2026
JavaScript

kereva-dev / kereva-scanner

Code scanner to check for issues in prompts and LLM calls

cli security ai linter evaluation code-scanning red-teaming ai-security hallucination ai-evaluation llm prompt-injection llm-security ai-code-review llm-evaluation owasp-llm-top-10 ai-performance ai-red-teaming llm-performance

Updated Apr 6, 2025
Python

Vvkmnn / awesome-ai-eval

☑️ A curated list of tools, methods & platforms for evaluating AI reliability in real applications

Updated Feb 12, 2026

lechmazur / deception

Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.

nlp machine-learning gemini llama language-model model-evaluation ai-safety mistral claude disinformation ai-security ai-benchmarks ai-evaluation llm llm-benchmarking gpt4o

Updated Mar 20, 2025

solana8800 / langeval

Evaluation Infrastructure for AI Agents

ai-evaluation agent-evaluation ai-evals

Updated Feb 25, 2026
TypeScript

HZYAI / RagScore

⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or CLI. Privacy-first, async, visual reports.

privacy jupyter mcp evaluation colab dataset-generation synthetic-data fine-tuning rag qa-generation ai-evaluation llm llmops local-llm ollama rag-evaluation llm-as-a-judge

Updated Mar 13, 2026
Python

meshkovQA / Eval-ai-library

Comprehensive AI Model Evaluation Framework with advanced techniques including Temperature-Controlled Verdict Aggregation via Generalized Power Mean. Support for multiple LLM providers and 15+ evaluation metrics for RAG systems and AI agents.

ai-evaluation llm-evaluation ai-evaluation-tools ai-evaluation-metrics aieval ai-evaluation-framework

Updated Mar 10, 2026
Python

HiThink-Research / FinMTM

FinMTM: A Multi-Turn Multimodal Benchmark for Financial Reasoning and Agent Evaluation

finance benchmark financial-analysis ai-evaluation ai-benchmarking financial-llm

Updated Feb 6, 2026
Python

greynewell / matchspec

Eval framework. Define correct, test against it, get results.

Updated Feb 17, 2026
Go

future-agi / cookbooks

Example Projects integrated with Future AGI Tech Stack for easy AI development

finance marketing development evaluation interview cookbooks healthcare ai-agents mlops ai-evaluation rag-chatbot agentic-ai

Updated Mar 13, 2026
Python

METR / inspect-action

Running UK AISI's Inspect in the Cloud

ai inspect elicitation ai-evaluation evals

Updated Mar 16, 2026
Python

greynewell / evaldriven.org

Ship evals before you ship features.

Updated Feb 25, 2026
Nunjucks

Arnoldlarry15 / ARES-Dashboard

AI Red Team Operations Console

nlp machine-learning jwt typescript ai frontend backend full-stack auth0 api-security red-teaming ai-security responsible-ai trustworthy-ai ai-evaluation llm model-auditing

Updated Jan 29, 2026
TypeScript

hyeonsangjeon / gdpval-realworks

Benchmark LLMs on real professional tasks, not academic puzzles. YAML-driven experiment pipeline + live React dashboard for GDPVal Gold Subset (220 tasks across 11 industries).

Updated Mar 9, 2026
Python
