Research corpus

A growing corpus of structured argumentative speech.

Every round on Debate AI produces a motion-tagged, format-aware, graded exchange of arguments. The internal corpus feeds a nightly learning loop that sharpens the AI on real usage. An opt-in subset is available for licensing to AI research organizations studying argumentative dialogue, voice-mode language models, and debate pedagogy.

Internal corpus rounds: all rounds, used by the nightly learning loop
Voice-round transcripts: real-time argumentation, no audio stored
Formats actively captured: APDA, BP, WSDC, Asian, PF, LD, Policy, Congress, MUN
Languages represented: including Hindi, Spanish, Mandarin

Why this dataset is hard to replicate

Adversarial, not monologic

Open-web text is mostly one side talking past the other. Every row here is a real-time rebuttal under a format clock, often with POI interruptions. That is the kind of training data that scraped op-eds, podcasts, and Reddit threads don't cover.

Format-aware register

Policy spreads tagged cards. APDA stays impromptu. LD argues value/criterion. BP runs whip extensions. Each format has its own grammar of speech, and the corpus captures the register switching that a general-purpose model has no signal for.

Graded outcomes

Rounds get user ratings (1–5) and judge ballots with speaker points and weighing. Together these can serve as preference-pair labels for RLHF or DPO without commissioning a separate human-labeling pipeline.
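As a rough illustration of how ratings could turn into preference pairs, here is a minimal sketch. It assumes that several rated outputs answer the same prompt, which the corpus may or may not guarantee. The field names follow the row shape described further down this page; RatedRow, PreferencePair, and buildPreferencePairs are hypothetical names, not part of any Debate AI export.

```typescript
// Hypothetical subset of a corpus row: only the fields this sketch needs.
interface RatedRow {
  systemPrompt: string;
  userPrompt: string;
  output: string;
  rating?: number; // 1-5, absent when the user gave no rating
}

// The (prompt, chosen, rejected) triple commonly used for DPO-style training.
interface PreferencePair {
  prompt: string;
  chosen: string;
  rejected: string;
}

// Assumption: rows with identical system and user prompts are comparable.
function buildPreferencePairs(rows: RatedRow[]): PreferencePair[] {
  // Group rated outputs by the exact prompt they answered.
  const byPrompt: { [prompt: string]: { output: string; rating: number }[] } = {};
  for (const row of rows) {
    if (typeof row.rating !== "number") continue; // skip unrated rows
    const prompt = row.systemPrompt + "\n\n" + row.userPrompt;
    if (!byPrompt[prompt]) byPrompt[prompt] = [];
    byPrompt[prompt].push({ output: row.output, rating: row.rating });
  }

  // For each prompt, pair the highest-rated output with the lowest-rated one.
  const pairs: PreferencePair[] = [];
  for (const prompt of Object.keys(byPrompt)) {
    const group = byPrompt[prompt].slice().sort((a, b) => b.rating - a.rating);
    const best = group[0];
    const worst = group[group.length - 1];
    // Emit a pair only when the ratings actually differ.
    if (best.rating > worst.rating) {
      pairs.push({ prompt, chosen: best.output, rejected: worst.output });
    }
  }
  return pairs;
}
```

Prompts with a single rated output, or with tied ratings, yield no pair. Judge ballots and speaker points are left out of this sketch because their structure is not described here.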

What's in the licensable subset

Opt-in only. The toggle lives in every user's profile, off by default, with the legal terms in privacy §6. When a user turns it on, future rounds (typed and voice) carry a contributable: true flag; everything else stays internal.

Each row, after anonymization, has roughly this shape:

{
  motion: "THBT the means justify the ends",
  side: "GOV" | "OPP" | "PRO" | "CON" | "AFF" | "NEG" | "...",
  format: "apda" | "bp" | "worlds" | "pf" | "ld" | "policy" | "...",
  kind: "case" | "rebuttal" | "judge" | "voice_round" | "...",
  systemPrompt: "[format-aware system block fed to the model]",
  userPrompt: "[user-side text + prior turns]",
  output: "[the AI's reply, or the user-turn block for human rows]",
  durationMs: 12340,
  context: { language: "en", persona: "debater", ... },
  rating: 4, // 1-5, when given
  saved: false,
  contributable: true, // stamped at write time
  createdAt: "2026-05-25T17:34:01Z"
}
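For anyone consuming an export, the shape above could be typed along these lines. This is a sketch under assumptions: the field names come from the example row, but which fields are optional, and whether the side, format, and kind lists are closed, is not stated, so they are typed as plain strings here. CorpusRow and licensableRows are hypothetical names.

```typescript
// Hypothetical typing of one corpus row, based on the example shape.
interface CorpusRow {
  motion: string;
  side: string; // "GOV", "OPP", "PRO", "CON", "AFF", "NEG", ...
  format: string; // "apda", "bp", "worlds", "pf", "ld", "policy", ...
  kind: string; // "case", "rebuttal", "judge", "voice_round", ...
  systemPrompt: string;
  userPrompt: string;
  output: string;
  durationMs: number;
  // Only language and persona are shown in the example; other keys may exist.
  context: { language: string; persona: string; [key: string]: unknown };
  rating?: number; // 1-5, present only when the user rated the round
  saved: boolean;
  contributable: boolean; // stamped at write time
  createdAt: string; // ISO 8601 timestamp
}

// Keep only rows whose author had opted in when the row was written.
function licensableRows(rows: CorpusRow[]): CorpusRow[] {
  return rows.filter((row) => row.contributable === true);
}
```

Filtering on the stamped flag, rather than on the user's current profile setting, matches the description that the flag is written with each round and applies to future rounds only.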

Anonymized means stripped of name, email, account id, IP, and any device fingerprints. What remains is the speech and its structural metadata. Voice audio is never stored; only the text transcript is eligible.
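The stripping step might look something like the sketch below. The identity field names (name, email, accountId, ip, deviceFingerprint) are assumptions chosen to mirror the categories listed above; the actual column names are not given here.

```typescript
// Hypothetical names for the identity fields that anonymization removes.
const IDENTITY_FIELDS: string[] = [
  "name",
  "email",
  "accountId",
  "ip",
  "deviceFingerprint",
];

// Return a copy of the row without any top-level identity fields.
// The input row is left unmodified.
function stripIdentity(row: { [key: string]: unknown }): { [key: string]: unknown } {
  const cleaned: { [key: string]: unknown } = {};
  for (const key of Object.keys(row)) {
    if (IDENTITY_FIELDS.indexOf(key) === -1) {
      cleaned[key] = row[key];
    }
  }
  return cleaned;
}
```

This only drops top-level metadata fields. It does not scan the transcript text itself for names a speaker may have said aloud, which would need a separate pass.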

Per-format internal counts

Snapshot from the last nightly aggregation. Includes all generations, not just the opt-in subset, so you can see where the volume is concentrated.


The growth curve, not the row count

Volume today is small. What's compounding is the architecture: a learning loop that's been writing every generation to the corpus since 2026-05-13, a consent layer that went live 2026-05-25, and a daily distillation pass that re-shapes the AI based on rated outputs. The licensable subset is just starting; the wedge is what the dataset becomes at scale, not what it is this week.

License inquiries

Open to conversations with AI research orgs, academic labs, and dataset aggregators. Happy to share a sample export under NDA and walk through the schema in detail.

feedback@debateai.com
Read the consent terms