
What is Calibration?

Synopsis: Even though the concept of confidence calibration seems simple, there is a lot of hidden nuance that can throw you off. After finishing my paper on LLM calibration (Generalized Correctness Models), I wanted to share, in a short Q and A format, some of the basic ideas I would have benefited from knowing from the start.

Q: What is calibration?
A: Calibration is the idea that a “confidence score” output by a model or system should line up with its accuracy: a 70% confidence should correspond to a 70% chance of a correct answer. Calibration measures one desirable quality of confidence scores, and it should be used in conjunction with metrics like accuracy and AUROC that measure other important qualities of those scores (Guo et al., 2017).

Q: Is it really useful to have good confidence scores? Isn't having good accuracy enough?
A: Good confidence scores are widely applicable: they enable us to understand honesty in a model (Kadavath et al., 2022), identify hallucinations (Zhou et al., 2025), route to experts when the model is unconfident (Hu et al., 2024), rejection sample (Chuang et al., 2025), and even serve as an RL signal to improve the quality of a model’s behavior (Li et al., 2025b). For example, if you can sample from the model multiple times, picking the output with the highest confidence can increase your performance.
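To make that last point concrete, here is a minimal sketch of confidence-based selection over multiple samples. `generate_answer` and `confidence_of` are hypothetical placeholders for however you sample from your model and score its answers; this is not the method from any particular paper.

```python
def best_of_n(question, generate_answer, confidence_of, n=8):
    """Sample n answers and keep the one the model is most confident in."""
    candidates = [generate_answer(question) for _ in range(n)]
    scored = [(confidence_of(question, ans), ans) for ans in candidates]
    # Pick the answer with the highest confidence score.
    best_conf, best_answer = max(scored, key=lambda pair: pair[0])
    return best_answer, best_conf
```

If the confidence scores are well calibrated (and not constant), the selected answer is more likely to be correct than a single draw.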

Q: Sounds like there are a lot of applications, so better calibration will generally improve performance?
A: No, not necessarily. If your accuracy is 50% (e.g., a random classifier), you can simply reply with 50% confidence on every prediction and be perfectly calibrated; a random classifier is not useful! More generally, any predictor that outputs a single constant confidence equal to its accuracy is perfectly calibrated. For calibration to be useful, you need both high accuracy and good calibration, along with a diverse confidence histogram that doesn’t assign the same confidence to every prediction.
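As a quick sanity check on that degenerate case, here is a toy sketch (just simulated coin flips, no real model) showing that a constant 50% confidence matches the accuracy of a 50%-accurate classifier exactly:

```python
import random

random.seed(0)
n = 10_000
# A "classifier" that is right half the time and always reports 50% confidence.
correct = [random.random() < 0.5 for _ in range(n)]
confidences = [0.5] * n

accuracy = sum(correct) / n
mean_confidence = sum(confidences) / n
print(f"accuracy={accuracy:.3f}, mean confidence={mean_confidence:.3f}")
# The confidence matches the accuracy almost exactly ("perfectly calibrated"),
# yet the predictions themselves carry no useful information.
```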

Q: That makes sense. Then to extract a model's confidences, can we just measure how uncertain it is about its responses? If we give the model a question and it generates one answer six times and another answer four times, its confidence should be 60%, right?
A: No, that’s related to an idea called uncertainty estimation. Uncertainty estimation attempts to determine the variance in a model’s outputs (Kuhn et al., 2023). But critically, how uncertain a model is does not always correspond to how often it is correct. For example, when an easy question has multiple correct answers, the model could be uncertain as to which answer to pick, but regardless, it should be 100% confident about its correctness. While some uncertainty estimation techniques are relevant to confidence calibration, we ultimately want to align the model’s confidences with some concept of ground truth correctness — how often the model’s responses are really correct.
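Here is a toy sketch of that distinction. `sample_answer` and `is_correct` are hypothetical placeholders standing in for model sampling and a ground-truth check; the point is only that agreement between samples and correctness of samples are different quantities.

```python
def agreement_and_correctness(question, sample_answer, is_correct, n=10):
    """Contrast an uncertainty-style score (sample agreement) with the
    calibration target (how often the sampled answers are actually correct)."""
    answers = [sample_answer(question) for _ in range(n)]
    # Uncertainty-style score: fraction of samples matching the most common answer.
    most_common = max(set(answers), key=answers.count)
    agreement = answers.count(most_common) / n
    # Calibration target: fraction of sampled answers that are correct.
    correctness = sum(is_correct(question, a) for a in answers) / n
    return agreement, correctness

# For an easy question with two equally good phrasings, agreement can be ~0.5
# while correctness is ~1.0: the model is "uncertain" about which answer to
# give, but its confidence in being correct should still be high.
```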

Q: So how do we measure calibration? When you ask a language model a question, isn't its response to that question fixed?
A: Yes, but calibration is not a simple instance-level question. I like to use dataset-level metrics such as Expected Calibration Error (ECE) and Root Mean Squared Calibration Error (RMSCE) to evaluate how well confidence aligns with accuracy across many samples (Guo et al., 2017). In practice, ECE divides predictions into confidence bins (e.g., 0–10%, 10–20%, …), measures the difference between the average confidence and the actual accuracy in each bin, and then averages those differences weighted by bin size. RMSCE instead uses adaptive bins and takes the root mean square rather than the absolute difference, which arguably makes it a more reliable comparison measure but less immediately interpretable. Many other metrics exist (the landscape is large, and while you'll see ECE in most calibration papers, it may not be the most accurate), and all of them seek to quantify whether, in instances where a model says “I’m 80% confident”, it is indeed correct about 80% of the time.
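To make that concrete, here is a minimal sketch of equal-width-bin ECE in Python, written for this post rather than taken from any particular library or paper:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: bin-size-weighted average of |accuracy - mean confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        # Include the right edge only in the last bin so confidence 1.0 isn't dropped.
        if hi < 1.0:
            in_bin = (confidences >= lo) & (confidences < hi)
        else:
            in_bin = (confidences >= lo) & (confidences <= hi)
        if in_bin.any():
            bin_accuracy = correct[in_bin].mean()
            bin_confidence = confidences[in_bin].mean()
            ece += in_bin.mean() * abs(bin_accuracy - bin_confidence)
    return ece

# Usage on a synthetic, well-calibrated predictor: correctness is drawn with
# probability equal to the stated confidence, so ECE should be close to 0.
rng = np.random.default_rng(0)
conf = rng.uniform(0.0, 1.0, 10_000)
correct = rng.uniform(0.0, 1.0, 10_000) < conf
print(expected_calibration_error(conf, correct))
```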

