Reliable Reasoning Requires More Than Accuracy
Can language models tell when they are wrong? The answer reveals a persistent tension between expressiveness and calibration.
A language model that solves a math problem correctly is useful. A language model that solves a math problem correctly and knows it got it right is far more useful. And a language model that can look at two of its own solutions and reliably pick the better one — that is the capability we actually need for autonomous reasoning systems.
This is the question we set out to formalize: to what extent can LLMs evaluate the reliability of their own reasoning? Not just whether they get the right answer, but whether they can distinguish their correct outputs from incorrect ones, and whether their confidence in that distinction is calibrated.
Self-Evaluation as a Testable Capability
Here is the experimental setup. Give a model a problem. Let it generate two distinct solutions. Then ask the same model: which solution is better, and how confident are you? An external judge ensemble determines the ground truth. This gives us two measurable quantities for each problem category, described below.
[Figure] The self-evaluation protocol: models judge their own reasoning against expert consensus.
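To make the protocol concrete, here is a minimal Python sketch. The interfaces are hypothetical stand-ins rather than anything from the paper: model.generate for sampling a solution, model.self_evaluate for the pick-and-confidence prompt, and judge_ensemble for the external ground-truth judgment.

```python
# Illustrative sketch of the self-evaluation protocol (hypothetical interfaces).
def run_protocol(model, judge_ensemble, problems):
    records = []
    for problem in problems:
        # Two distinct solutions from the model under evaluation.
        solution_a = model.generate(problem)
        solution_b = model.generate(problem)

        # The model judges its own work: which solution is better,
        # and with what confidence (a probability in [0, 1])?
        choice, confidence = model.self_evaluate(problem, solution_a, solution_b)

        # The external judge ensemble supplies the ground-truth preference.
        true_better = judge_ensemble(problem, solution_a, solution_b)

        records.append({
            "choice": choice,          # "A" or "B": the model's pick
            "confidence": confidence,  # self-reported confidence in that pick
            "correct": choice == true_better,
        })
    return records
```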
Self-Evaluation Accuracy (SEA) — how often the model picks the objectively better solution. Calibration Error — how well the model's reported confidence aligns with its actual success rate. A model that picks correctly 70% of the time but claims 95% confidence is poorly calibrated, even if its accuracy is decent.
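Given records like those in the sketch above, both quantities are easy to compute: SEA is the fraction of correct picks, and calibration error is shown here as a standard binned expected calibration error (ECE). The binning scheme is the usual reliability-diagram estimator; the paper's exact calibration metric may differ in its details.

```python
import numpy as np

def self_evaluation_accuracy(records):
    """SEA: fraction of problems where the model picked the objectively better solution."""
    return float(np.mean([r["correct"] for r in records]))

def expected_calibration_error(records, n_bins=10):
    """Binned ECE: size-weighted average gap between reported confidence
    and empirical accuracy within each confidence bin."""
    confidences = np.array([r["confidence"] for r in records], dtype=float)
    correct = np.array([r["correct"] for r in records], dtype=float)

    # Assign each confidence to one of n_bins equal-width bins on [0, 1].
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # weight by the fraction of samples in this bin
    return ece
```

In the example from the paragraph above (70% accuracy at 95% claimed confidence), nearly all of the mass lands in the top confidence bin with a gap of roughly 0.25, which is exactly the failure the ECE is designed to expose.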
From Accuracy to VC Theory
The deeper question is not just "how good is this model at self-evaluation?" but "what is the capacity of its self-evaluation?" — how many distinct problem types can it reliably evaluate across?
This is where classical learning theory becomes relevant. The Vapnik-Chervonenkis (VC) dimension measures the capacity of a hypothesis class — how many points it can shatter. But classical VC theory handles binary classifiers, not probabilistic predictors that output confidence scores. LLMs don't just say "correct" or "incorrect" — they say "I'm 85% sure this one is better."
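For reference, the classical definitions behind that sentence: a hypothesis class shatters a set of points if it can realize every possible binary labeling of them, and the VC dimension is the size of the largest such set.

```latex
% Classical setting: binary hypotheses h : \mathcal{X} \to \{0, 1\}.
% \mathcal{H} shatters S = \{x_1, \ldots, x_m\} if every labeling is realizable:
\forall (y_1, \ldots, y_m) \in \{0, 1\}^m \;\; \exists\, h \in \mathcal{H} :\; h(x_i) = y_i \text{ for all } i,
% and the VC dimension is the size of the largest shatterable set:
\mathrm{VCdim}(\mathcal{H}) = \max \bigl\{\, m : \text{some } S \subseteq \mathcal{X} \text{ with } |S| = m \text{ is shattered by } \mathcal{H} \,\bigr\}.
```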
We introduce two extensions that bridge this gap:
[Figure] Two complexity metrics for LLM self-evaluation capability: PVC and C-PVC.
The key theoretical result: PVC and C-PVC yield generalization bounds and sample complexity guarantees that mirror classical VC theory. The classical structure — O(√(d/n)) convergence, Ω(d/ε²) sample complexity — carries over, with the VC dimension replaced by PVC or C-PVC. This means the framework is not just a metric — it provides formal guarantees about how self-evaluation performance on a test set relates to true self-evaluation capability.
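For orientation, the classical statements being mirrored have roughly the following shape, with the capacity term d now standing for PVC or C-PVC. The exact constants and logarithmic factors depend on the formal theorems, which are not reproduced here.

```latex
% Classical-form guarantees, with d the capacity (here PVC or C-PVC), n the
% number of evaluation problems, err the true self-evaluation error, and
% \widehat{err}_n the error observed on the n test problems.
% With probability at least 1 - \delta:
\bigl| \widehat{\mathrm{err}}_n - \mathrm{err} \bigr|
  \;\le\; O\!\left( \sqrt{\frac{d + \log(1/\delta)}{n}} \right),
\qquad
n \;=\; \Omega\!\left( \frac{d}{\epsilon^2} \right)
\ \text{samples are needed to estimate } \mathrm{err} \text{ within } \epsilon.
```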
The Expressiveness-Calibration Tradeoff
We tested eleven 7-8B parameter models across three benchmarks: 360 mathematical reasoning problems, TruthfulQA for factual accuracy, and CommonsenseQA for commonsense reasoning. The central empirical finding is a systematic inverse correlation:
Models exhibiting enhanced self-evaluation expressiveness consistently demonstrate diminished calibration fidelity. The better a model is at discriminating correct from incorrect, the worse its confidence scores reflect reality.
Models like s1.1-7B and Qwen2.5-7B-Instruct achieve high PVC-VUS scores — strong discriminative self-assessment — but their confidence is poorly calibrated. Conversely, JiuZhang3.0-7B shows superior calibration (lowest ECE, smallest PVC-VUS gap) but lower discriminative power. No model excels at both simultaneously.
[Figure] The expressiveness-calibration spectrum across eleven 7-8B models.
This tradeoff is not an artifact of specific architectures or training methods. It persists across different model families (Qwen, Llama, DeepSeek, Mistral), different training paradigms (SFT, DPO, GRPO, distillation, RL), and different cognitive domains (math, factual, commonsense). It suggests a persistent constraint that current probabilistic reasoning systems have not yet overcome.
Domain-Specific Self-Evaluation
The tradeoff is not uniform across domains. Some models show stronger self-evaluation on mathematical reasoning while others excel on factual or commonsense tasks. This suggests that self-evaluation capability is not a single axis — it is domain-dependent, shaped by training data, fine-tuning objectives, and the cognitive structure of the task.
Mathematical reasoning, with its clear logical structure, often allows models to evaluate solution quality through internal consistency checks. Factual and commonsense reasoning, where correctness depends on world knowledge rather than logical derivation, presents a different self-evaluation challenge.
Why This Matters
For autonomous AI systems — those that must operate without constant human oversight — self-evaluation is not optional. A system that can solve problems but cannot assess when it might be wrong is difficult to trust in high-stakes settings.
The PVC/C-PVC framework adds a capacity-based lens for measuring and comparing this capability across models. And the expressiveness-calibration tension it reveals poses a concrete challenge: we need models that are both discriminative and well-calibrated in their self-assessment. Current 7-8B models have not solved this. Whether scaling, better training objectives, or different architectures can ease the tension remains an open question — and one of the most important for building AI systems we can actually trust.