DebateFlow

A benchmark for multi-turn debate judgment in large language models.

What this is

Current argumentation benchmarks evaluate argument quality in isolation -- a single text scored along rhetorical or logical dimensions. DebateFlow tests whether LLMs can judge multi-turn debates: given a four-turn transcript and a scoring rubric, predict the winner and score each side along dimensions that require attending to the full arc of the exchange.

Each debate follows the Karl Popper format: four turns (Affirmative opening, Negative response, Affirmative rebuttal, Negative closing) on a stated resolution. Debates are generated synthetically via LLM-vs-LLM, with one side optionally receiving an injected weakness (weak evidence, argument dropping, logical gaps, or burden-of-proof failure). This gives each debate a known ground-truth failure mode for fine-grained error analysis.

Evaluation rubric

  • Clash engagement: Did each side address the opponent's arguments or talk past them?
  • Burden fulfillment: Did each side meet its burden of proof?
  • Rebuttal quality: Specificity and depth of refutations
  • Argument extension: Did arguments develop across turns, or merely repeat the opening?
  • Strategic adaptation: Did speakers adjust their approach based on the opponent's actual moves?

The last two dimensions are central to competitive debate judging but absent from existing argument quality taxonomies.
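A judgment over this rubric can be checked mechanically. The validator below is a sketch that assumes an integer 1-5 score per dimension per side and a binary winner; the real scale and anchors are defined in SPEC.md, so the shape here is illustrative only:

```python
DIMENSIONS = (
    "clash_engagement",
    "burden_fulfillment",
    "rebuttal_quality",
    "argument_extension",
    "strategic_adaptation",
)

def validate_judgment(judgment: dict) -> list[str]:
    """Return a list of problems with a judge's output (empty list = valid)."""
    problems = []
    if judgment.get("winner") not in ("affirmative", "negative"):
        problems.append("winner must be 'affirmative' or 'negative'")
    for side in ("affirmative", "negative"):
        scores = judgment.get("scores", {}).get(side, {})
        for dim in DIMENSIONS:
            value = scores.get(dim)
            # Assumed scale: integers 1 through 5.
            if not isinstance(value, int) or not 1 <= value <= 5:
                problems.append(f"{side}.{dim} must be an integer in 1-5")
    return problems
```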

Project structure

pyproject.toml              Project config and dependencies
resolutions.yaml            12 seed resolutions (policy, values, empirical)
plans/
    SPEC.md                  Benchmark specification
    PLAN.md                  Implementation plan
    VOICE-SPEC.md            Voice synthesis spec
    TELEGRAM-JUDGING-SPEC.md Telegram judging interface spec
src/debateflow/
    models.py                Pydantic data models
    providers.py             LLM provider factory (Anthropic + OpenAI)
    prompts.py               System prompts and weakness injection templates
    generator.py             4-turn debate generation pipeline
    compile.py               JSONL compilation and statistics
    publish.py               HuggingFace Hub publication
    dataset_card.py          Dataset card template
    cli.py                   Typer CLI entry point
    server.py                Annotation server with on-demand TTS
    voice.py                 ElevenLabs TTS wrapper
    telegram_judging.py      Telegram judging session management
    agreement.py             Inter-annotator agreement computation
    static/
        annotate.html        Browser-based annotation tool
        review.html          Annotation review tool
output/
    debates/                 Generated debate JSON files
    annotations/             Human annotation JSON files
    audio/                   Cached TTS audio (MP3)
tests/
    test_models.py
    test_prompts.py

Development setup

Requires Python 3.11+ and uv.

git clone <repo-url> && cd debateflow
uv sync

Copy .env.example to .env and fill in the keys you need:

DF_ANTHROPIC_API_KEY=...    # for debate generation (Anthropic models)
DF_OPENAI_API_KEY=...       # for debate generation (OpenAI models)
DF_ELEVENLABS_API_KEY=...   # for voice synthesis (annotation server)
DF_HF_TOKEN=...             # for publishing to HuggingFace Hub
DF_HF_REPO=...              # e.g. your-username/debateflow

Not all keys are needed for every task. Generation requires the LLM provider key(s), the annotation server requires ElevenLabs, and publishing requires HuggingFace.

Generating debates

# Generate 10 debates with default models
uv run debateflow generate -n 10

# Use specific models per side
uv run debateflow generate -n 5 \
    --aff-provider anthropic --aff-model claude-sonnet-4-20250514 \
    --neg-provider openai --neg-model gpt-4o

# Filter by topic category or force a weakness type
uv run debateflow generate -n 5 --category values
uv run debateflow generate -n 3 --weakness argument_dropping

# View dataset statistics
uv run debateflow stats

# Compile individual JSONs into a single JSONL
uv run debateflow compile

Annotating debates

The annotation tool runs in the browser. Start the server:

uv run debateflow serve

Then open http://localhost:5733. The server:

  • Serves the annotation UI at /
  • Loads debates from output/debates/ (click "Load from Server" on the setup screen)
  • Provides on-demand text-to-speech via ElevenLabs -- click Play on any turn to hear it spoken, or Play All for sequential playback
  • Caches synthesized audio to output/audio/ so repeated plays don't hit the API

Enter your annotator ID, load debates, and score each one. Annotations download as JSON files that go into output/annotations/.

Voice playback is optional -- annotation works without an ElevenLabs key; the Play buttons simply won't function.

Annotation commands

# Check annotation progress
uv run debateflow annotate-status

# Compute inter-annotator agreement (needs 2+ annotators on same debates)
uv run debateflow annotate-agreement
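For the winner label, pairwise agreement is typically summarized with Cohen's kappa (chance-corrected agreement between two annotators). The minimal implementation below illustrates the idea; it is not necessarily what agreement.py computes:

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa between two annotators' labels over the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled the same.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each annotator's label frequencies.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if expected == 1.0:  # degenerate case: both always use one label
        return 1.0
    return (observed - expected) / (1 - expected)
```

Kappa of 1.0 means perfect agreement; 0.0 means no better than chance.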

Publishing

# Dry run -- generates JSONL and dataset card locally
uv run debateflow publish --repo your-username/debateflow --dry-run

# Push to HuggingFace Hub
uv run debateflow publish --repo your-username/debateflow --public

Design docs

See plans/ for the full specifications:

  • SPEC.md — benchmark design, rubric dimensions, and score-level anchors
  • PLAN.md — implementation plan for the generation pipeline
  • VOICE-SPEC.md — ElevenLabs voice synthesis for spoken debate playback
  • TELEGRAM-JUDGING-SPEC.md — Telegram-based annotation flow via OpenClaw

Tests

uv run pytest tests/
