Structural Uncertainty in LLM Reasoning
Why answer agreement is blind to reasoning instability — and how self-preference rankings reveal what dispersion metrics miss.
Suppose you ask a language model the same logical reasoning question five times. All five answers agree. Standard evaluation would conclude: low uncertainty, high reliability. But what if the model, when asked to compare and rank its own reasoning paths, produces a completely different ordering every time?
This is the gap that motivated our work on structural uncertainty. The core observation is simple but consequential: answer-level agreement can mask reasoning-level instability. Five candidates may all arrive at "20" — the same wrong answer — while the model's own preferences over how they reason flip entirely from one evaluation to the next.
Same answers, opposite uncertainty signals — the scenario that dispersion methods miss
What Structural Uncertainty Actually Measures
The key idea is to probe not what the model answers, but how stably it evaluates its own reasoning. Given a question, we generate multiple candidate solutions, then ask the same model to judge pairwise preferences among them — essentially: "which of these two reasoning paths is better?" We repeat this across multiple independent trials, each using a different random spanning tree over the candidates to select which pairs to compare.
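To make the pair-selection step concrete, here is a minimal sketch of drawing a random spanning tree over the candidates. The function name and the attach-to-a-random-node scheme are illustrative assumptions, not necessarily the paper's exact sampler.

```python
import random

def spanning_tree_pairs(candidates, rng=random):
    """Pick a sparse set of comparison pairs via a random spanning tree
    over the candidates (an illustrative sampler, not necessarily the
    paper's exact scheme). A spanning tree over n candidates yields
    n - 1 judgments, connecting every candidate while staying far
    cheaper than all n * (n - 1) / 2 comparisons."""
    order = list(candidates)
    rng.shuffle(order)
    pairs = []
    in_tree = [order[0]]                 # candidates already connected
    for cand in order[1:]:
        partner = rng.choice(in_tree)    # attach each new candidate to a random tree node
        pairs.append((cand, partner))
        in_tree.append(cand)
    return pairs

# Example: 5 candidate solutions -> 4 pairwise judgments per trial,
# with a fresh tree (hence fresh pairs) drawn on every trial.
pairs = spanning_tree_pairs(["c1", "c2", "c3", "c4", "c5"])
```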
The structural uncertainty pipeline: candidates → self-preference judgments aggregated with PageRank → entropy decomposition into a consistency signal
The sparse pairwise judgments are aggregated into a global ranking distribution using Bradley-Terry modeling with PageRank normalization. Each trial m produces a ranking distribution π(m) over the candidates. Across M independent trials, these distributions form an ensemble that we decompose into two complementary entropy-based signals.
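As one concrete aggregation scheme, here is a plain PageRank over the pairwise preference graph. The paper specifies Bradley-Terry modeling with PageRank normalization; this sketch simplifies to PageRank alone, and the damping value and function name are illustrative.

```python
import numpy as np

def ranking_distribution(wins, damping=0.85, iters=100):
    """Aggregate pairwise judgments into a ranking distribution via
    PageRank over the preference graph (a simplified sketch; the paper
    combines Bradley-Terry modeling with PageRank normalization).

    wins[i, j] = number of times candidate j was preferred over i,
    so probability mass flows from losers toward winners."""
    n = wins.shape[0]
    # Row-normalize: each candidate spreads its mass over the
    # candidates that beat it; never-losing candidates spread uniformly.
    out = wins.sum(axis=1, keepdims=True)
    trans = np.where(out > 0, wins / np.maximum(out, 1), 1.0 / n)
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        pi = (1 - damping) / n + damping * (trans.T @ pi)
    return pi / pi.sum()
```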
Two Components, Two Failure Modes
The decomposition separates total structural uncertainty into two components:
Across-trial instability: how much the ranking distributions π(m) disagree with one another from trial to trial.
Within-trial ambiguity: how spread out each individual trial's ranking distribution is over the candidates.
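One standard decomposition with exactly these two components, assuming the signals are measured as Shannon entropies, is the Jensen-Shannon identity: the entropy of the mean ranking distribution splits into across-trial divergence plus mean within-trial entropy. Whether the paper uses this exact estimator is my assumption; the sketch below illustrates the shape of the decomposition.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def decompose(trial_dists):
    """Split total structural uncertainty via the Jensen-Shannon identity:
        H(mean(pi)) = JSD(pi_1, ..., pi_M) + mean_m H(pi_m)
    (one standard decomposition; the paper's estimator may differ)."""
    pis = np.asarray(trial_dists, dtype=float)
    within = np.mean([entropy(p) for p in pis])   # ambiguity inside each trial
    across = entropy(pis.mean(axis=0)) - within   # disagreement across trials (JSD)
    return across, within

# Two trials that each rank confidently but disagree with each other:
across, within = decompose([[0.96, 0.02, 0.02], [0.02, 0.96, 0.02]])
# -> across ~0.58 (high instability), within ~0.20 (low ambiguity)
```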
The two components of structural uncertainty relate to accuracy in opposite directions
This asymmetry is one of the most interesting findings. Across-trial instability is bad — when the model can't consistently decide which reasoning path is better, the reasoning itself is unreliable. But within-trial ambiguity can actually be a positive signal on reasoning tasks: it means multiple plausible solution strategies coexist, which tends to occur when the model genuinely understands the problem well enough to produce diverse valid approaches.
Instability across evaluations signals unreliable reasoning. Ambiguity within evaluations signals a rich solution space. The distinction matters — they carry opposite information about correctness.
When the Signal Breaks Down
Structural uncertainty is not a universal confidence estimator — and recognizing its limits is itself a contribution. Across five LLMs (Claude Sonnet 4.5, GPT-OSS 20B, Qwen 3 32B, Amazon Nova Premier, DeepSeek R1) and eight benchmarks, a clear pattern emerges:
On mathematical and logical reasoning tasks (MATH-500, AMC-23, AIME, Math-Synth), structural signals provide strong complementary information to answer dispersion. The combination of structural uncertainty with self-consistency entropy improves identification of unreliable reasoning instances.
On factual retrieval tasks (HotpotQA), the structural signal collapses toward uniformity. When the task requires retrieving a fact rather than constructing a reasoning path, there is no meaningful diversity in solution strategies — so self-preference rankings become uninformative.
Task regime determines whether structural uncertainty is informative
This regime sensitivity is itself diagnostic. The collapse of structural uncertainty on retrieval tasks identifies the boundary where reasoning-level consistency evaluation ceases to be useful — making structural uncertainty a regime-sensitive evaluator rather than a universal metric.
Why This Matters for Evaluating Logical Reasoning
Standard self-consistency — sampling multiple answers and checking whether they agree — is the most widely used post-hoc reliability signal for LLM reasoning. It works well when errors are diverse. But it has a blind spot: systematic errors that produce agreement. When a model's failure mode is to consistently arrive at the same wrong answer through different flawed reasoning paths, self-consistency reports low uncertainty while the reasoning is fundamentally unreliable.
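To see the blind spot numerically, here is the standard answer-dispersion computation (function and variable names are mine): unanimous wrong answers score exactly like unanimous correct ones.

```python
from collections import Counter
import math

def self_consistency_entropy(answers):
    """Entropy of the empirical distribution over final answers:
    the standard answer-dispersion signal."""
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in Counter(answers).values())

# Systematic error: five paths, same wrong answer -> entropy 0.0,
# indistinguishable from five correct agreements.
print(self_consistency_entropy(["20"] * 5))                       # 0.0
print(self_consistency_entropy(["20", "18", "22", "20", "17"]))   # ~1.33: diverse errors are visible
```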
Structural uncertainty addresses exactly this gap. By probing the stability of the model's own preferences over its reasoning — not just the answers — it detects a form of inconsistency that answer-level metrics cannot see. The combination of both signals provides strictly more information than either alone, at least in the reasoning regimes where structural diversity exists.
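A minimal sketch of one such combination, where the mixing weight and function name are illustrative assumptions rather than the paper's method:

```python
def combined_unreliability(answer_entropy, across_trial_instability, alpha=0.5):
    """Blend the answer-dispersion entropy (from the previous sketch)
    with structural across-trial instability. alpha is an illustrative
    mixing weight, not a value from the paper."""
    return alpha * answer_entropy + (1 - alpha) * across_trial_instability

# Unanimous answers give dispersion entropy 0.0; a high structural
# instability term still flags the instance as unreliable.
score = combined_unreliability(answer_entropy=0.0, across_trial_instability=1.2)  # -> 0.6
```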
The broader implication is that evaluating LLM reasoning requires looking beyond what the model says, to how consistently it evaluates what it says. Consistency of judgment over reasoning paths is a fundamentally different signal from consistency of final answers — and for logical reasoning, it is often the more informative one.