Beyond Binary Rewards:
Training LMs to Reason About Their Uncertainty


Intro


Recent advances in reasoning training, particularly reinforcement learning with verifiable rewards (RLVR), have improved the accuracy of large language models (LLMs). These approaches optimize for correctness, encouraging models to output the right answer when possible.

However, these same methods tend to produce overconfident models that are more prone to hallucination. Because the reward signal focuses solely on final answer correctness, models are incentivized to guess — even when they are uncertain. This is especially problematic in high-stakes settings like healthcare and law, where confident errors can be harmful or even dangerous.

RLCR (Reinforcement Learning with Calibration Rewards) addresses this gap. We introduce a reinforcement learning framework that trains models to reason about their own uncertainty, rewarding not just accuracy but also calibrated confidence.

Instead of encouraging blind certainty, RLCR incentivizes models to reflect, evaluate their own outputs, and communicate uncertainty when appropriate. The result is a model that performs better (higher accuracy ✅) and knows when it's likely to be wrong (better calibration 🎯).

In domains where trust, safety, and interpretability matter, this dual optimization — getting the right answer, and knowing when you might not — makes all the difference.


Examples

Below, you can see examples of RLCR in action. Click on the tabs to see the model's confidence and answer for each example and compare it to the RLVR and base model outputs!



These examples are from the RLCR models trained on Math.

Method

RLCR makes a simple shift: instead of rewarding only correctness, we reward both accuracy and calibrated confidence.

  • 💡 We reward models for being right and for knowing when they're right.
  • 💡 We move uncertainty reasoning into training — teaching the model to reflect on its own uncertainty while solving the task.
  • 💡 With a simple tweak to the reward function, RLCR enables LLMs to improve both performance and self-awareness.

During training, the model reasons jointly about the task and its own uncertainty, producing both an answer and a confidence estimate.
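To make this concrete, here is a minimal sketch of what such an output might look like and how the answer and confidence could be parsed from it. The tag names and the parsing code are illustrative assumptions, not necessarily the exact format used by the released models.

import re

# Illustrative rollout format (tag names are an assumption): the model
# solves the task, reflects on its uncertainty, and then emits an answer
# together with a confidence in [0, 1].
sample_output = """
<think> 12 * 13 = 156, so the area is 156. </think>
<answer> 156 </answer>
<analysis> The arithmetic is short and I double-checked it. </analysis>
<confidence> 0.95 </confidence>
"""

def parse_answer_and_confidence(text):
    """Extract the final answer and self-reported confidence from a rollout."""
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    conf = re.search(r"<confidence>(.*?)</confidence>", text, re.DOTALL)
    parsed_answer = answer.group(1).strip() if answer else None
    parsed_conf = float(conf.group(1)) if conf else 0.5  # fallback if no confidence is emitted
    return parsed_answer, parsed_conf

print(parse_answer_and_confidence(sample_output))  # ('156', 0.95)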

Our reward combines two terms:

  • 🎯 Correctness — Is the answer right?
  • 📏 Calibration — Does the confidence reflect actual correctness?

Confidently wrong or uncertainly right responses are penalized. This encourages models to learn not just what to answer, but how sure to be.

Reward Functions

In RLCR, the reward for an incorrect answer is never higher than the reward for a correct answer, so improving calibration never comes at the cost of accuracy.


Traditional RLVR reward:

$$ R_{\text{RLVR}} = \text{Correctness} = \mathbb{1}(\text{prediction} = \text{label}) $$

RLCR reward:

$$ R_{\text{RLCR}} = \text{Correctness} - \left( \text{Confidence} - \text{Correctness} \right)^2 $$

where Correctness is 1 if the prediction is correct and 0 otherwise, and Confidence is the model's self-estimated probability that its prediction is correct. A correct answer always earns a reward between 0 and 1, while an incorrect answer always earns a reward between -1 and 0, so an incorrect prediction never receives a higher reward than a correct one.
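As a concrete illustration, here is a minimal sketch of the two reward functions in Python (the correctness signal would come from a task verifier, and the confidence from the model's own output):

def rlvr_reward(is_correct):
    """Traditional RLVR: binary reward for final-answer correctness."""
    return 1.0 if is_correct else 0.0

def rlcr_reward(is_correct, confidence):
    """RLCR: correctness plus a Brier-score calibration term.

    confidence is the model's self-reported probability (in [0, 1])
    that its answer is correct.
    """
    correctness = 1.0 if is_correct else 0.0
    return correctness - (confidence - correctness) ** 2

# Confidently wrong answers are penalized the most, and confidently
# correct answers are rewarded the most:
rlcr_reward(False, 0.9)  # -0.81
rlcr_reward(False, 0.1)  # -0.01
rlcr_reward(True, 0.9)   #  0.99
rlcr_reward(True, 0.1)   #  0.19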


What We Prove:

✅ RLCR provably optimizes for both accuracy & calibration.

✅ Calibration comes at no cost to accuracy.

✅ Works with any bounded proper scoring rule — we use the Brier score.


Results 📊

On diverse QA & math benchmarks (both in-domain and out-of-distribution):

  • Accuracy stays on par with (or better than) RL baselines, while calibration error is reduced by up to 90% (a sketch of the metric follows below).
  • ✨ Outperforms post-hoc classifiers & probes on calibration.
  • RLVR degrades calibration in OOD tasks, while RLCR significantly improves it.
[Figure: Results chart]
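For reference, the sketch below computes expected calibration error (ECE), a standard way to quantify calibration by comparing average confidence to accuracy within confidence bins; it is included as an illustration of the metric, not as the paper's exact evaluation code.

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the gap between
    accuracy and mean confidence in each bin, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece

# A perfectly calibrated model that reports 0.8 confidence is right 80% of the time.
expected_calibration_error([0.9, 0.9, 0.6, 0.3], [1, 1, 1, 0])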

Can confidence scores help at test time? 🚀

Yes — confidence estimates can be directly integrated into test-time scaling strategies to improve performance when additional compute is available.

We explore two simple yet effective techniques:

  • Max-Confidence Selection: Choose the response with the highest self-reported confidence.
  • Confidence-Weighted Majority Voting: Aggregate multiple responses, weighting each vote by its confidence score.

These strategies yield better accuracy and calibration as compute scales. 📈
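A minimal sketch of the two strategies above, assuming each sample is an (answer, confidence) pair parsed from an independent rollout:

from collections import defaultdict

def max_confidence_selection(samples):
    """Return the answer from the rollout with the highest self-reported confidence."""
    best_answer, _ = max(samples, key=lambda pair: pair[1])
    return best_answer

def confidence_weighted_majority_vote(samples):
    """Aggregate answers, weighting each vote by its confidence score."""
    votes = defaultdict(float)
    for answer, confidence in samples:
        votes[answer] += confidence
    return max(votes, key=votes.get)

samples = [("156", 0.95), ("156", 0.70), ("146", 0.40)]
max_confidence_selection(samples)           # '156'
confidence_weighted_majority_vote(samples)  # '156' (weight 1.65 vs. 0.40)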

[Figure: Inference-time scaling strategies]

Does Explicitly Reasoning About Uncertainty Help? 🧠

To investigate the impact of uncertainty reasoning within the chain-of-thought (CoT), we trained two types of classifiers:


  • Baseline: Trained on model solutions and final answers.
  • Analysis: Trained on the same, but with explicit uncertainty reasoning included in the CoT.

  • Result: The analysis classifier outperformed the baseline, particularly at smaller classifier sizes. Larger classifiers can infer confidence from the solution alone, but smaller ones benefit from having uncertainty reasoning made explicit.


🔍 Classifier capacity shapes the optimal CoT content — a key insight for future work in prompting and model alignment.
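As a rough illustration of this comparison (not the paper's actual classifier architecture or features), one could train two simple text classifiers to predict correctness, differing only in whether the uncertainty-reasoning span is included in their input:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_correctness_classifier(texts, is_correct):
    """Fit a classifier that predicts whether a solution text is correct."""
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(texts, is_correct)
    return clf

# Baseline: solution + final answer only.
# Analysis: the same text with the explicit uncertainty reasoning appended.
# baseline_clf = train_correctness_classifier(solutions, labels)
# analysis_clf = train_correctness_classifier(solutions_with_uncertainty, labels)
# Comparing their held-out accuracy (or AUC) isolates the effect of making
# uncertainty reasoning explicit in the chain of thought.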

[Figure: Classifier results]

BibTeX

@misc{damani2025binaryrewardstraininglms,
  title={Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty},
  author={Mehul Damani and Isha Puri and Stewart Slocum and Idan Shenfeld and Leshem Choshen and Yoon Kim and Jacob Andreas},
  year={2025},
  eprint={2507.16806},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.16806},
}



Code



For all code, models, and data, check out the RLCR GitHub Repo. We provide a detailed README with instructions for setting up, training, and evaluating the model.