Why Consensus Is Not Correctness in Multi-Agent AI
When agents unanimously agree on the wrong answer, the real question is not who won the debate — but when it is safe to act.
Multi-agent debate improves LLM reasoning. After several rounds of deliberation, agents update their beliefs, challenge each other's reasoning, and often converge on a shared answer. This convergence feels reassuring — if three different models agree, the answer must be right.
Except when it isn't. Agreement among agents may be informative, but it is not sufficient evidence for safe automated action. When models converge on a wrong answer through social reinforcement, consensus-based stopping commits that error to an automated action with no recourse. In our experiments on MMLU-Pro, we found that 23.9% of initially-disputed cases end in unanimous wrong agreement — the agents debate, converge, and agree on an incorrect answer with high apparent confidence.
[Figure: The cost of acting without calibrated refusal]
The Missing Piece: Calibrated Refusal
The problem is not that debate doesn't work — it does improve reasoning. The problem is what happens after debate. Every deployed pipeline ultimately reduces the debate to a single answer and acts on it. Majority voting, weighted averaging, argmax — all of these discard uncertainty that could have flagged the error.
The real deployment question is not "who won the debate?" but "when is it safe to act?" Current systems have no answer to this question.
LLM agents are known to conform to perceived majority opinions, a form of social reinforcement that can produce wrong consensus: agents converge on an incorrect answer with high apparent confidence. Stability-based stopping detects agreement but cannot tell whether that agreement is correct. The result is uncalibrated commitment: a wrong consensus feeds directly into automated action with no safety check.
Conformal Social Choice: The Framework
We introduce Conformal Social Choice, a post-hoc decision layer that converts debate outputs into calibrated act-versus-escalate decisions. The framework has four stages:
[Figure: The four-stage Conformal Social Choice pipeline: Debate (T rounds) → Pool → Calibration → Act or Escalate]
Stage 1: Verbalized probability elicitation. Heterogeneous agents (in our experiments: Claude Haiku, DeepSeek-R1, and Qwen-3 32B) debate for T rounds. Rather than extracting token-level log-probabilities — unavailable for many proprietary APIs — each agent outputs explicit numerical probabilities for every answer option. This is a fully black-box approach.
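As a rough sketch of what black-box elicitation can look like in practice, the snippet below prompts each agent for a JSON probability map over the answer options and renormalizes whatever comes back. The prompt wording and the `parse_verbalized_probs` helper are illustrative assumptions, not the exact protocol from our experiments.

```python
import json
import re

# Hypothetical prompt template; the exact wording used in the experiments is not reproduced here.
ELICIT_PROMPT = (
    "Question: {question}\nOptions: {options}\n"
    "After weighing the other agents' arguments, output a JSON object mapping each "
    "option letter to your probability that it is correct. Probabilities must sum to 1."
)

def parse_verbalized_probs(agent_reply: str, option_labels: list[str]) -> dict[str, float]:
    """Extract the JSON probability dict from an agent's reply and renormalize it."""
    match = re.search(r"\{.*\}", agent_reply, flags=re.DOTALL)
    probs = {k: 0.0 for k in option_labels}
    if match:
        try:
            raw = json.loads(match.group(0))
            for k in option_labels:
                probs[k] = max(float(raw.get(k, 0.0)), 0.0)
        except (json.JSONDecodeError, TypeError, ValueError):
            pass
    total = sum(probs.values())
    if total == 0:  # unparseable reply: fall back to a uniform distribution
        return {k: 1.0 / len(option_labels) for k in option_labels}
    return {k: v / total for k, v in probs.items()}
```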
Stage 2: Social probability aggregation. Agent beliefs are combined via a linear opinion pool — a weighted mixture that preserves the full probability distribution, not just the top vote. Unlike majority voting, this retains the intensity of preferences: an agent assigning 0.6 to option A contributes differently than one assigning 0.99.
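Concretely, the pool is a convex combination of the agents' distributions. A minimal sketch, assuming uniform weights by default (the helper name `linear_opinion_pool` is ours for illustration):

```python
import numpy as np

def linear_opinion_pool(agent_probs: np.ndarray, weights: np.ndarray | None = None) -> np.ndarray:
    """Combine per-agent probability vectors (rows summing to 1) into one social distribution."""
    n_agents = agent_probs.shape[0]
    if weights is None:
        weights = np.full(n_agents, 1.0 / n_agents)   # uniform weights unless specified
    weights = weights / weights.sum()
    return weights @ agent_probs                      # weighted mixture, still a distribution

# Preference intensity is preserved: 0.6 and 0.99 for option A contribute different mass.
pooled = linear_opinion_pool(np.array([[0.60, 0.40],
                                       [0.99, 0.01],
                                       [0.50, 0.50]]))
print(pooled)  # approximately [0.697, 0.303]
```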
Stage 3: Conformal calibration. Using a held-out calibration set, split conformal prediction transforms the social probabilities into prediction sets with a marginal coverage guarantee: the correct answer is included with probability ≥ 1−α. This requires no assumptions on individual model calibration — only exchangeability.
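The sketch below follows the textbook split conformal recipe for classification, with the nonconformity score taken as one minus the pooled probability of the true answer and the usual finite-sample quantile correction; treat it as an illustration of the idea rather than our exact implementation.

```python
import numpy as np

def conformal_threshold(cal_probs: np.ndarray, cal_labels: np.ndarray, alpha: float = 0.05) -> float:
    """
    Split conformal calibration on pooled social probabilities.

    cal_probs:  (n_cal, n_options) pooled distributions for held-out calibration questions.
    cal_labels: (n_cal,) index of the correct option for each calibration question.
    Returns the nonconformity threshold q_hat giving >= 1 - alpha marginal coverage.
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]    # nonconformity: 1 - p(correct answer)
    level = np.ceil((n + 1) * (1 - alpha)) / n            # finite-sample quantile correction
    return float(np.quantile(scores, min(level, 1.0), method="higher"))

def prediction_set(pooled_probs: np.ndarray, q_hat: float) -> list[int]:
    """All options whose nonconformity score is at or below the calibrated threshold."""
    return [i for i, p in enumerate(pooled_probs) if 1.0 - p <= q_hat]
```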
Stage 4: Hierarchical action policy. This is the key decision mechanism. Singleton sets (one answer) → autonomous action. Larger sets → human escalation. Empty sets → anomaly flag. The prediction set is not just for evaluation — it directly determines whether the system acts or asks for help.
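In code, the policy reduces to a small dispatch on the size of the prediction set; the `Decision` enum and `action_policy` names below are illustrative.

```python
from enum import Enum

class Decision(Enum):
    ACT = "act"            # singleton prediction set: commit autonomously
    ESCALATE = "escalate"  # multiple candidates survive: route to a human
    FLAG = "flag"          # empty set: anomaly, outside calibrated behaviour

def action_policy(pred_set: list[int]) -> tuple[Decision, int | None]:
    """Map a conformal prediction set to an act / escalate / flag decision."""
    if len(pred_set) == 1:
        return Decision.ACT, pred_set[0]
    if not pred_set:
        return Decision.FLAG, None
    return Decision.ESCALATE, None

# Example: action_policy([2]) -> (Decision.ACT, 2); action_policy([0, 3]) -> (Decision.ESCALATE, None)
```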
The conformal layer does not make debate more accurate. It makes debate failures actionable. By refusing to act on cases where debate is confidently wrong, the remaining decisions achieve dramatically higher accuracy — a selection effect of calibrated refusal, not a reasoning improvement.
What the Numbers Show
On eight MMLU-Pro domains with three agents, at α=0.05:
[Table: Results on MMLU-Pro (8 domains, 3 heterogeneous agents)]
The tradeoff is explicit: improved safety comes at the cost of reduced automation. Not every case is resolved autonomously; some are escalated to human review. But the operating point is user-adjustable via α. A stricter α (say 0.01) escalates more cases but provides stronger guarantees; a relaxed α (say 0.20) automates more but with a thinner safety margin.
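To make the tradeoff concrete, one can sweep α and count how often the prediction set collapses to a singleton. The sketch below reuses the `conformal_threshold` and `prediction_set` helpers from the Stage 3 sketch and runs on synthetic probabilities, so the printed rates are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cal, n_test, n_options = 500, 500, 10

# Synthetic stand-ins for the pooled social probabilities; real inputs come from the debate.
cal_probs  = rng.dirichlet(np.ones(n_options), size=n_cal)
cal_labels = rng.integers(0, n_options, size=n_cal)
test_probs = rng.dirichlet(np.ones(n_options), size=n_test)

for alpha in (0.01, 0.05, 0.20):
    q_hat = conformal_threshold(cal_probs, cal_labels, alpha)   # helper from the Stage 3 sketch
    sets = [prediction_set(p, q_hat) for p in test_probs]
    automated = sum(len(s) == 1 for s in sets) / len(sets)
    print(f"alpha={alpha:.2f}  automated on {automated:.1%} of cases")
```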
Why Act-versus-Escalate, Not Just Better Answers
Most work on multi-agent debate asks: how can we make agents reason better? or when should debate stop? These are important questions, but they miss the deployment reality. In practice, the output of a debate feeds into an automated pipeline — placing an order, sending a response, making a recommendation. The question that matters for safety is not whether the debate was good, but whether the system should be allowed to act on its conclusion.
Conformal Social Choice reframes multi-agent debate from an accuracy-maximization problem into a decision problem with calibrated risk control. The system doesn't claim to always be right. It claims that when it chooses to act, its probability of being wrong is bounded — and when it isn't confident enough, it says so.
That distinction — between being right and knowing when you might be wrong — is the difference between a system you demo and a system you deploy.