The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
Evaluate your LLM's response with Prometheus and GPT4 💯
👩‍⚖️ Agent-as-a-Judge: The Magic for Open-Endedness
Dingo: A Comprehensive AI Data, Model and Application Quality Evaluation Tool
Inference-time scaling for LLMs-as-a-judge.
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
A native policy enforcement layer for AI coding agents. Built on OPA/Rego.
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
CodeUltraFeedback: aligning large language models to coding preferences (TOSEM 2025)
The repository for the survey of Bias and Fairness in IR with LLMs.
Solving Inequality Proofs with Large Language Models.
First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)
(NeurIPS 2025) Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"
[ICLR 2026 Oral] Generative Universal Verifier as Multimodal Meta-Reasoner
A set of tools to create synthetically-generated data from documents
⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or CLI. Privacy-first, async, visual reports.
Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024)
OmicsBench: Distinguishing Multi-Omics Reasoning from Shortcut Learning in Large Language Models
The official repository for our EMNLP 2024 paper, Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability.
Code and data for Koo et al's ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
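Most of the projects listed above implement some variant of the same core LLM-as-a-judge loop: build a grading prompt around a question and a candidate answer, ask a judge model for a structured verdict, and parse a score out of its reply. The sketch below illustrates that general pattern only; it is not taken from any repository on this page, and `call_model` is a hypothetical callable standing in for whatever LLM client you use.

```python
# Minimal, generic sketch of the LLM-as-a-judge pattern.
# `call_model` is a hypothetical prompt -> completion callable supplied by
# the caller; the rubric wording is illustrative, not from any listed repo.
from dataclasses import dataclass
from typing import Callable
import re

JUDGE_PROMPT = """You are an impartial evaluator.
Question:
{question}

Candidate answer:
{answer}

Rate the answer from 1 (poor) to 5 (excellent) for correctness and helpfulness.
Reply in the form: Score: <1-5>. Reason: <one sentence>."""


@dataclass
class Verdict:
    score: int       # 1-5 rating extracted from the judge's reply
    rationale: str   # free-text justification from the judge
    raw: str         # full judge output, kept for auditing


def judge_answer(
    question: str,
    answer: str,
    call_model: Callable[[str], str],
) -> Verdict:
    """Ask a judge model to grade one answer and parse its verdict."""
    reply = call_model(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"Score:\s*([1-5])", reply)
    score = int(match.group(1)) if match else 1  # conservative fallback
    reason = reply.split("Reason:", 1)[-1].strip() if "Reason:" in reply else reply
    return Verdict(score=score, rationale=reason, raw=reply)


if __name__ == "__main__":
    # Stub judge for demonstration; replace with a real model call.
    fake_judge = lambda prompt: "Score: 4. Reason: Mostly correct, minor omission."
    print(judge_answer("What is 2 + 2?", "4", fake_judge))
```

The repositories above differ mainly in what they wrap around this loop: structured rubrics, pairwise or reference-based comparison, fine-tuned judge models, bias audits of the judge itself, or inference-time scaling of the verdict step.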