Reliable Reasoning Requires More Than Accuracy
Can language models tell when they are wrong? The answer reveals a persistent tension between expressiveness and calibration.
A language model that solves a math problem correctly is useful. A language model that solves a math problem correctly and knows it got it right is far more useful. And a language model that can look at two of its own solutions and reliably pick the better one — that is the capability we actually need for autonomous reasoning systems.
This is the question we set out to formalize: to what extent can LLMs evaluate the reliability of their own reasoning? Not just whether they get the right answer, but whether they can distinguish their correct outputs from incorrect ones, and whether their confidence in that distinction is calibrated.
Self-Evaluation as a Testable Capability
Here is the experimental setup. Give a model a problem. Let it generate two distinct solutions. Then ask the same model: which solution is better, and how confident are you? An external judge ensemble determines the ground truth. This gives us two measurable quantities for each problem category, described below.
[Figure] The self-evaluation protocol: models judge their own reasoning against expert consensus.
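To make the protocol concrete, here is a minimal Python sketch. The interfaces are hypothetical stand-ins rather than anything from the paper: model.generate for sampling a solution, model.self_evaluate for the pick-and-confidence prompt, and judge_ensemble for the external ground-truth judgment.

```python
# Illustrative sketch of the self-evaluation protocol (hypothetical interfaces).
def run_protocol(model, judge_ensemble, problems):
    records = []
    for problem in problems:
        # Two distinct solutions from the model under evaluation.
        solution_a = model.generate(problem)
        solution_b = model.generate(problem)

        # The model judges its own work: which solution is better,
        # and with what confidence (a probability in [0, 1])?
        choice, confidence = model.self_evaluate(problem, solution_a, solution_b)

        # The external judge ensemble supplies the ground-truth preference.
        true_better = judge_ensemble(problem, solution_a, solution_b)

        records.append({
            "choice": choice,          # "A" or "B": the model's pick
            "confidence": confidence,  # self-reported confidence in that pick
            "correct": choice == true_better,
        })
    return records
```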
Self-Evaluation Accuracy (SEA) — how often the model picks the objectively better solution. Calibration Error — how well the model's reported confidence aligns with its actual success rate. A model that picks correctly 70% of the time but claims 95% confidence is poorly calibrated, even if its accuracy is decent.
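Given records like those in the sketch above, both quantities are easy to compute: SEA is the fraction of correct picks, and calibration error is shown here as a standard binned expected calibration error (ECE). The binning scheme is the usual reliability-diagram estimator; the paper's exact calibration metric may differ in its details.

```python
import numpy as np

def self_evaluation_accuracy(records):
    """SEA: fraction of problems where the model picked the objectively better solution."""
    return float(np.mean([r["correct"] for r in records]))

def expected_calibration_error(records, n_bins=10):
    """Binned ECE: size-weighted average gap between reported confidence
    and empirical accuracy within each confidence bin."""
    confidences = np.array([r["confidence"] for r in records], dtype=float)
    correct = np.array([r["correct"] for r in records], dtype=float)

    # Assign each confidence to one of n_bins equal-width bins on [0, 1].
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # weight by the fraction of samples in this bin
    return ece
```

In the example from the paragraph above (70% accuracy at 95% claimed confidence), nearly all of the mass lands in the top confidence bin with a gap of roughly 0.25, which is exactly the failure the ECE is designed to expose.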
From Accuracy to VC Theory
The deeper question is not just "how good is this model at self-evaluation?" but "what is the capacity of its self-evaluation?" — how many distinct problem types can it reliably evaluate across?
This is where classical learning theory becomes relevant. The Vapnik-Chervonenkis (VC) dimension measures the capacity of a hypothesis class — how many points it can shatter. But classical VC theory handles binary classifiers, not probabilistic predictors that output confidence scores. LLMs don't just say "correct" or "incorrect" — they say "I'm 85% sure this one is better."
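For reference, the classical definitions behind that sentence: a hypothesis class shatters a set of points if it can realize every possible binary labeling of them, and the VC dimension is the size of the largest such set.

```latex
% Classical setting: binary hypotheses h : \mathcal{X} \to \{0, 1\}.
% \mathcal{H} shatters S = \{x_1, \ldots, x_m\} if every labeling is realizable:
\forall (y_1, \ldots, y_m) \in \{0, 1\}^m \;\; \exists\, h \in \mathcal{H} :\; h(x_i) = y_i \text{ for all } i,
% and the VC dimension is the size of the largest shatterable set:
\mathrm{VCdim}(\mathcal{H}) = \max \bigl\{\, m : \text{some } S \subseteq \mathcal{X} \text{ with } |S| = m \text{ is shattered by } \mathcal{H} \,\bigr\}.
```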
We introduce two extensions that bridge this gap:
[Figure] Two complexity metrics for LLM self-evaluation capability: PVC and C-PVC.
The key theoretical result: PVC and C-PVC yield generalization bounds and sample complexity guarantees that mirror classical VC theory. The classical structure — O(√(d/n)) convergence, Ω(d/ε²) sample complexity — carries over, with the VC dimension replaced by PVC or C-PVC. This means the framework is not just a metric — it provides formal guarantees about how self-evaluation performance on a test set relates to true self-evaluation capability.
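For orientation, the classical statements being mirrored have roughly the following shape, with the capacity term d now standing for PVC or C-PVC. The exact constants and logarithmic factors depend on the formal theorems, which are not reproduced here.

```latex
% Classical-form guarantees, with d the capacity (here PVC or C-PVC), n the
% number of evaluation problems, err the true self-evaluation error, and
% \widehat{err}_n the error observed on the n test problems.
% With probability at least 1 - \delta:
\bigl| \widehat{\mathrm{err}}_n - \mathrm{err} \bigr|
  \;\le\; O\!\left( \sqrt{\frac{d + \log(1/\delta)}{n}} \right),
\qquad
n \;=\; \Omega\!\left( \frac{d}{\epsilon^2} \right)
\ \text{samples are needed to estimate } \mathrm{err} \text{ within } \epsilon.
```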
The Expressiveness-Calibration Tradeoff
We tested eleven 7-8B parameter models across three benchmarks: 360 mathematical reasoning problems, TruthfulQA for factual accuracy, and CommonsenseQA for commonsense reasoning. The central empirical finding is a systematic inverse correlation:
Models exhibiting enhanced self-evaluation expressiveness consistently demonstrate diminished calibration fidelity. The better a model is at discriminating correct from incorrect, the worse its confidence scores reflect reality.
Models like s1.1-7B and Qwen2.5-7B-Instruct achieve high PVC-VUS scores — strong discriminative self-assessment — but their confidence is poorly calibrated. Conversely, JiuZhang3.0-7B shows superior calibration (lowest ECE, smallest PVC-VUS gap) but lower discriminative power. No model excels at both simultaneously.
[Figure] The expressiveness-calibration spectrum across eleven 7-8B models.
This tradeoff is not an artifact of specific architectures or training methods. It persists across different model families (Qwen, Llama, DeepSeek, Mistral), different training paradigms (SFT, DPO, GRPO, distillation, RL), and different cognitive domains (math, factual, commonsense). It suggests a persistent constraint that current probabilistic reasoning systems have not yet overcome.
Domain-Specific Self-Evaluation
The tradeoff is not uniform across domains. Some models show stronger self-evaluation on mathematical reasoning while others excel on factual or commonsense tasks. This suggests that self-evaluation capability is not a single axis — it is domain-dependent, shaped by training data, fine-tuning objectives, and the cognitive structure of the task.
Mathematical reasoning, with its clear logical structure, often allows models to evaluate solution quality through internal consistency checks. Factual and commonsense reasoning, where correctness depends on world knowledge rather than logical derivation, presents a different self-evaluation challenge.
Why This Matters
For autonomous AI systems — those that must operate without constant human oversight — self-evaluation is not optional. A system that can solve problems but cannot assess when it might be wrong is difficult to trust in high-stakes settings.
The PVC/C-PVC framework adds a capacity-based lens for measuring and comparing this capability across models. And the expressiveness-calibration tension it reveals poses a concrete challenge: we need models that are both discriminative and well-calibrated in their self-assessment. Current 7-8B models have not solved this. Whether scaling, better training objectives, or different architectures can ease the tension remains an open question — and one of the most important for building AI systems we can actually trust.