Structural Uncertainty in LLM Reasoning

Why answer agreement is blind to reasoning instability — and how self-preference rankings reveal what dispersion metrics miss.

Suppose you ask a language model the same logical reasoning question five times. All five answers agree. Standard evaluation would conclude: low uncertainty, high reliability. But what if the model, when asked to compare and rank its own reasoning paths, produces a completely different ordering every time?

This is the gap that motivated our work on structural uncertainty. The core observation is simple but consequential: answer-level agreement can mask reasoning-level instability. Five candidates may all arrive at "20" — the same wrong answer — while the model's own preferences over how they reason flip entirely from one evaluation to the next.

Dispersion-Based View: all five candidates answer "20"; answer entropy = 0; reports high confidence; misses that the answer is wrong.

Structural View: all five candidates answer "20"; self-preference rankings are unstable; reports reasoning inconsistency; detects the unstable reasoning behind the agreement.

Same answers, opposite uncertainty signals — the scenario that dispersion methods miss

What Structural Uncertainty Actually Measures

The key idea is to probe not what the model answers, but how stably it evaluates its own reasoning. Given a question, we generate multiple candidate solutions, then ask the same model to judge pairwise preferences among them — essentially: "which of these two reasoning paths is better?" We repeat this across multiple independent trials, each using a different random spanning tree over the candidates to select which pairs to compare.

Sample N candidates → pairwise self-preference judgments → Bradley-Terry + PageRank aggregation → entropy decomposition

The structural uncertainty pipeline: from candidates to consistency signal
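
To make the first two stages concrete, here is a minimal sketch of one trial in Python. The judge(question, cand_a, cand_b) callable is hypothetical (it stands in for whatever prompt asks the model which of two reasoning paths is better), and the spanning tree is built by simple random attachment, which connects all N candidates with N-1 comparisons but is not necessarily the exact construction used in the paper.

```python
# Sketch of one structural-uncertainty trial. `judge` is a hypothetical
# callable returning 0 if the first candidate's reasoning is preferred, 1 otherwise.
import random

def spanning_tree_pairs(n_candidates: int, rng: random.Random) -> list[tuple[int, int]]:
    """Select N-1 candidate pairs that form a random spanning tree.

    Random-attachment construction: each new node connects to a uniformly
    chosen node already in the tree, so every candidate ends up compared
    to at least one other.
    """
    order = list(range(n_candidates))
    rng.shuffle(order)
    pairs = []
    for i in range(1, n_candidates):
        parent = rng.choice(order[:i])   # attach to a node already in the tree
        pairs.append((parent, order[i]))
    return pairs

def run_trial(question: str, candidates: list[str], judge, seed: int) -> dict[tuple[int, int], int]:
    """Collect sparse pairwise self-preference judgments for one trial."""
    rng = random.Random(seed)
    judgments = {}
    for a, b in spanning_tree_pairs(len(candidates), rng):
        winner = judge(question, candidates[a], candidates[b])  # 0 -> a wins, 1 -> b wins
        judgments[(a, b)] = winner
    return judgments
```

Repeating run_trial with different seeds yields the M independent trials, each comparing a different spanning tree of candidate pairs.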

The sparse pairwise judgments are aggregated into a global ranking distribution using Bradley-Terry modeling with PageRank normalization. Each trial m produces a ranking distribution π^(m) over the candidates. Across M trials, these distributions form an ensemble that we decompose into two complementary entropy-based signals.
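
As a rough sketch of the aggregation step (the paper's exact Bradley-Terry estimator and PageRank normalization may differ), one can turn a trial's sparse wins into a distribution by running a PageRank-style random walk that flows probability mass toward candidates that win their comparisons:

```python
# Hedged sketch: one trial's pairwise judgments (format from the previous
# sketch) -> a ranking distribution pi over candidates.
import numpy as np

def ranking_distribution(judgments: dict[tuple[int, int], int],
                         n_candidates: int,
                         damping: float = 0.85,
                         iters: int = 100) -> np.ndarray:
    # wins[i, j] = 1 if candidate i was preferred over candidate j
    wins = np.zeros((n_candidates, n_candidates))
    for (a, b), winner in judgments.items():
        i, j = (a, b) if winner == 0 else (b, a)
        wins[i, j] += 1.0

    # Column j is a distribution over the candidates that beat j;
    # unjudged candidates fall back to a uniform column.
    transition = np.full((n_candidates, n_candidates), 1.0 / n_candidates)
    for j in range(n_candidates):
        col = wins[:, j]
        if col.sum() > 0:
            transition[:, j] = col / col.sum()

    # PageRank-style power iteration with damping.
    pi = np.full(n_candidates, 1.0 / n_candidates)
    for _ in range(iters):
        pi = (1 - damping) / n_candidates + damping * transition @ pi
    return pi / pi.sum()   # ranking distribution over candidates
```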

Two Components, Two Failure Modes

The decomposition separates total structural uncertainty into:

Across-Trial Instability: different spanning trees yield different rankings; the model cannot stably decide which reasoning path is best; correlates negatively with accuracy; signals unreliable reasoning.

Within-Trial Ambiguity: the ranking is spread across candidates within a single trial; multiple plausible reasoning paths compete; correlates positively with accuracy on reasoning tasks; signals a rich solution space.

The two components of structural uncertainty relate to accuracy in opposite directions

This asymmetry is one of the most interesting findings. Across-trial instability is bad — when the model can't consistently decide which reasoning path is better, the reasoning itself is unreliable. But within-trial ambiguity can actually be a positive signal on reasoning tasks: it means multiple plausible solution strategies coexist, which tends to occur when the model genuinely understands the problem well enough to produce diverse valid approaches.

Instability across evaluations signals unreliable reasoning. Ambiguity within evaluations signals a rich solution space. The distinction matters — they carry opposite information about correctness.
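
One plausible way to instantiate this decomposition, assuming the M per-trial ranking distributions are stacked as rows of a matrix Pi (the paper's exact formulas may differ): the average per-trial entropy captures within-trial ambiguity, and the gap between the entropy of the averaged distribution and that average, a generalized Jensen-Shannon divergence, captures across-trial instability.

```python
# Illustrative decomposition of the ranking-distribution ensemble into the
# two entropy-based components; an assumption, not the paper's exact recipe.
import numpy as np

def entropy(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def decompose(Pi: np.ndarray) -> tuple[float, float]:
    """Pi has shape (M, N): M trials, each a distribution over N candidates."""
    mean_pi = Pi.mean(axis=0)
    within = float(np.mean([entropy(row) for row in Pi]))   # ambiguity inside each trial
    across = entropy(mean_pi) - within                      # disagreement between trials (>= 0)
    return across, within
```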

When the Signal Breaks Down

Structural uncertainty is not a universal confidence estimator — and recognizing its limits is itself a contribution. Across five LLMs (Claude Sonnet 4.5, GPT-OSS 20B, Qwen 3 32B, Amazon Nova Premier, DeepSeek R1) and eight benchmarks, a clear pattern emerges:

On mathematical and logical reasoning tasks (MATH-500, AMC-23, AIME, Math-Synth), structural signals provide strong complementary information to answer dispersion. The combination of structural uncertainty with self-consistency entropy improves identification of unreliable reasoning instances.

On factual retrieval tasks (HotpotQA), the structural signal collapses toward uniformity. When the task requires retrieving a fact rather than constructing a reasoning path, there is no meaningful diversity in solution strategies — so self-preference rankings become uninformative.

Reasoning tasks (MATH, AMC, AIME): structural uncertainty complements dispersion.
Knowledge tasks (MMLU-Pro, TruthfulQA): partial complementarity.
Retrieval tasks (HotpotQA): the structural signal collapses.

Task regime determines whether structural uncertainty is informative

Key Insight

This regime sensitivity is itself diagnostic. The collapse of structural uncertainty on retrieval tasks identifies the boundary where reasoning-level consistency evaluation ceases to be useful — making structural uncertainty a regime-sensitive evaluator rather than a universal metric.

Why This Matters for Evaluating Logical Reasoning

Standard self-consistency — sampling multiple answers and checking whether they agree — is the most widely used post-hoc reliability signal for LLM reasoning. It works well when errors are diverse. But it has a blind spot: systematic errors that produce agreement. When a model's failure mode is to consistently arrive at the same wrong answer through different flawed reasoning paths, self-consistency reports low uncertainty while the reasoning is fundamentally unreliable.
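
For contrast, the answer-level self-consistency signal is just the entropy of the sampled final answers, so the opening scenario, five identical wrong answers, scores zero. A minimal illustration (the function name is mine):

```python
# Answer-level self-consistency entropy: measures agreement among sampled
# answers regardless of whether they are correct.
from collections import Counter
import math

def answer_entropy(answers: list[str]) -> float:
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Five candidates that all reach the same wrong answer look perfectly confident:
print(answer_entropy(["20", "20", "20", "20", "20"]))  # 0.0 -> reports low uncertainty
```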

Structural uncertainty addresses exactly this gap. By probing the stability of the model's own preferences over its reasoning — not just the answers — it detects a form of inconsistency that answer-level metrics cannot see. The combination of both signals provides strictly more information than either alone, at least in the reasoning regimes where structural diversity exists.
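
As a hedged illustration of combining the two signals into a single unreliability score (the additive form and the weights are placeholders, not the paper's calibration):

```python
# Illustrative combination of answer-level and structural signals.
def unreliability_score(answer_entropy: float, across_trial_instability: float,
                        w_answer: float = 1.0, w_struct: float = 1.0) -> float:
    # Either diverse final answers or unstable self-preference rankings pushes
    # the score up; agreement on a wrong answer is caught by the structural
    # term even when the answer term is zero.
    return w_answer * answer_entropy + w_struct * across_trial_instability
```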

The broader implication is that evaluating LLM reasoning requires looking beyond what the model says, to how consistently it evaluates what it says. Consistency of judgment over reasoning paths is a fundamentally different signal from consistency of final answers — and for logical reasoning, it is often the more informative one.