Structural Uncertainty in LLM Reasoning

Why answer agreement is blind to reasoning instability — and how self-preference rankings reveal what dispersion metrics miss.

Suppose you ask a language model the same logical reasoning question five times. All five answers agree. Standard evaluation would conclude: low uncertainty, high reliability. But what if the model, when asked to compare and rank its own reasoning paths, produces a completely different ordering every time?

This is the gap that motivated our work on structural uncertainty. The core observation is simple but consequential: answer-level agreement can mask reasoning-level instability. Five candidates may all arrive at "20" — the same wrong answer — while the model's own preferences over how they reason flip entirely from one evaluation to the next.

Dispersion-Based View: all five candidates answer "20"; answer entropy = 0; reports high confidence; misses that the answer is wrong.

Structural View: all five candidates answer "20"; self-preference rankings are unstable; reports reasoning inconsistency; detects the unstable reasoning behind the agreement.

Same answers, opposite uncertainty signals — the scenario that dispersion methods miss

What Structural Uncertainty Actually Measures

The key idea is to probe not what the model answers, but how stably it evaluates its own reasoning. Given a question, we generate multiple candidate solutions, then ask the same model to judge pairwise preferences among them — essentially: "which of these two reasoning paths is better?" We repeat this across multiple independent trials, each using a different random spanning tree over the candidates to select which pairs to compare.

Sample N candidates → pairwise self-preference judgments → Bradley-Terry + PageRank aggregation → entropy decomposition

The structural uncertainty pipeline: from candidates to consistency signal
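
To make the first two stages concrete, here is a minimal sketch of one trial in Python. The judge(question, cand_a, cand_b) callable is hypothetical (it stands in for whatever prompt asks the model which of two reasoning paths is better), and the spanning tree is built by simple random attachment, which connects all N candidates with N-1 comparisons but is not necessarily the exact construction used in the paper.

```python
# Sketch of one structural-uncertainty trial. `judge` is a hypothetical
# callable returning 0 if the first candidate's reasoning is preferred, 1 otherwise.
import random

def spanning_tree_pairs(n_candidates: int, rng: random.Random) -> list[tuple[int, int]]:
    """Select N-1 candidate pairs that form a random spanning tree.

    Random-attachment construction: each new node connects to a uniformly
    chosen node already in the tree, so every candidate ends up compared
    to at least one other.
    """
    order = list(range(n_candidates))
    rng.shuffle(order)
    pairs = []
    for i in range(1, n_candidates):
        parent = rng.choice(order[:i])   # attach to a node already in the tree
        pairs.append((parent, order[i]))
    return pairs

def run_trial(question: str, candidates: list[str], judge, seed: int) -> dict[tuple[int, int], int]:
    """Collect sparse pairwise self-preference judgments for one trial."""
    rng = random.Random(seed)
    judgments = {}
    for a, b in spanning_tree_pairs(len(candidates), rng):
        winner = judge(question, candidates[a], candidates[b])  # 0 -> a wins, 1 -> b wins
        judgments[(a, b)] = winner
    return judgments
```

Repeating run_trial with different seeds yields the M independent trials, each comparing a different spanning tree of candidate pairs.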

The sparse pairwise judgments are aggregated into a global ranking distribution using Bradley-Terry modeling with PageRank normalization. Each trial m produces a ranking distribution π^(m) over the candidates. Across M trials, these distributions form an ensemble that we decompose into two complementary entropy-based signals.
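
As a rough sketch of the aggregation step (the paper's exact Bradley-Terry estimator and PageRank normalization may differ), one can turn a trial's sparse wins into a distribution by running a PageRank-style random walk that flows probability mass toward candidates that win their comparisons:

```python
# Hedged sketch: one trial's pairwise judgments (format from the previous
# sketch) -> a ranking distribution pi over candidates.
import numpy as np

def ranking_distribution(judgments: dict[tuple[int, int], int],
                         n_candidates: int,
                         damping: float = 0.85,
                         iters: int = 100) -> np.ndarray:
    # wins[i, j] = 1 if candidate i was preferred over candidate j
    wins = np.zeros((n_candidates, n_candidates))
    for (a, b), winner in judgments.items():
        i, j = (a, b) if winner == 0 else (b, a)
        wins[i, j] += 1.0

    # Column j is a distribution over the candidates that beat j;
    # unjudged candidates fall back to a uniform column.
    transition = np.full((n_candidates, n_candidates), 1.0 / n_candidates)
    for j in range(n_candidates):
        col = wins[:, j]
        if col.sum() > 0:
            transition[:, j] = col / col.sum()

    # PageRank-style power iteration with damping.
    pi = np.full(n_candidates, 1.0 / n_candidates)
    for _ in range(iters):
        pi = (1 - damping) / n_candidates + damping * transition @ pi
    return pi / pi.sum()   # ranking distribution over candidates
```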

Two Components, Two Failure Modes

The decomposition separates total structural uncertainty into:

Across-Trial Instability: different spanning trees yield different rankings; the model cannot stably decide which reasoning path is best; correlates negatively with accuracy; signals unreliable reasoning.

Within-Trial Ambiguity: the ranking is spread across candidates within a single trial; multiple plausible reasoning paths compete; correlates positively with accuracy on reasoning tasks; signals a rich solution space.

The two components of structural uncertainty relate to accuracy in opposite directions

This asymmetry is one of the most interesting findings. Across-trial instability is bad — when the model can't consistently decide which reasoning path is better, the reasoning itself is unreliable. But within-trial ambiguity can actually be a positive signal on reasoning tasks: it means multiple plausible solution strategies coexist, which tends to occur when the model genuinely understands the problem well enough to produce diverse valid approaches.

Instability across evaluations signals unreliable reasoning. Ambiguity within evaluations signals a rich solution space. The distinction matters — they carry opposite information about correctness.
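
One plausible way to instantiate this decomposition, assuming the M per-trial ranking distributions are stacked as rows of a matrix Pi (the paper's exact formulas may differ): the average per-trial entropy captures within-trial ambiguity, and the gap between the entropy of the averaged distribution and that average, a generalized Jensen-Shannon divergence, captures across-trial instability.

```python
# Illustrative decomposition of the ranking-distribution ensemble into the
# two entropy-based components; an assumption, not the paper's exact recipe.
import numpy as np

def entropy(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def decompose(Pi: np.ndarray) -> tuple[float, float]:
    """Pi has shape (M, N): M trials, each a distribution over N candidates."""
    mean_pi = Pi.mean(axis=0)
    within = float(np.mean([entropy(row) for row in Pi]))   # ambiguity inside each trial
    across = entropy(mean_pi) - within                      # disagreement between trials (>= 0)
    return across, within
```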

When the Signal Breaks Down

Structural uncertainty is not a universal confidence estimator — and recognizing its limits is itself a contribution. Across five LLMs (Claude Sonnet 4.5, GPT-OSS 20B, Qwen 3 32B, Amazon Nova Premier, DeepSeek R1) and eight benchmarks, a clear pattern emerges:

On mathematical and logical reasoning tasks (MATH-500, AMC-23, AIME, Math-Synth), structural signals provide strong complementary information to answer dispersion. The combination of structural uncertainty with self-consistency entropy improves identification of unreliable reasoning instances.

On factual retrieval tasks (HotpotQA), the structural signal collapses toward uniformity. When the task requires retrieving a fact rather than constructing a reasoning path, there is no meaningful diversity in solution strategies — so self-preference rankings become uninformative.

Reasoning tasks (MATH, AMC, AIME): structural uncertainty complements dispersion.
Knowledge tasks (MMLU-Pro, TruthfulQA): partial complementarity.
Retrieval tasks (HotpotQA): the structural signal collapses.

Task regime determines whether structural uncertainty is informative

Key Insight

This regime sensitivity is itself diagnostic. The collapse of structural uncertainty on retrieval tasks identifies the boundary where reasoning-level consistency evaluation ceases to be useful — making structural uncertainty a regime-sensitive evaluator rather than a universal metric.

Why This Matters for Evaluating Logical Reasoning

Standard self-consistency — sampling multiple answers and checking whether they agree — is the most widely used post-hoc reliability signal for LLM reasoning. It works well when errors are diverse. But it has a blind spot: systematic errors that produce agreement. When a model's failure mode is to consistently arrive at the same wrong answer through different flawed reasoning paths, self-consistency reports low uncertainty while the reasoning is fundamentally unreliable.
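
For contrast, the answer-level self-consistency signal is just the entropy of the sampled final answers, so the opening scenario, five identical wrong answers, scores zero. A minimal illustration (the function name is mine):

```python
# Answer-level self-consistency entropy: measures agreement among sampled
# answers regardless of whether they are correct.
from collections import Counter
import math

def answer_entropy(answers: list[str]) -> float:
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Five candidates that all reach the same wrong answer look perfectly confident:
print(answer_entropy(["20", "20", "20", "20", "20"]))  # 0.0 -> reports low uncertainty
```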

Structural uncertainty addresses exactly this gap. By probing the stability of the model's own preferences over its reasoning — not just the answers — it detects a form of inconsistency that answer-level metrics cannot see. The combination of both signals provides strictly more information than either alone, at least in the reasoning regimes where structural diversity exists.
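
As a hedged illustration of combining the two signals into a single unreliability score (the additive form and the weights are placeholders, not the paper's calibration):

```python
# Illustrative combination of answer-level and structural signals.
def unreliability_score(answer_entropy: float, across_trial_instability: float,
                        w_answer: float = 1.0, w_struct: float = 1.0) -> float:
    # Either diverse final answers or unstable self-preference rankings pushes
    # the score up; agreement on a wrong answer is caught by the structural
    # term even when the answer term is zero.
    return w_answer * answer_entropy + w_struct * across_trial_instability
```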

The broader implication is that evaluating LLM reasoning requires looking beyond what the model says, to how consistently it evaluates what it says. Consistency of judgment over reasoning paths is a fundamentally different signal from consistency of final answers — and for logical reasoning, it is often the more informative one.