April 4, 2026 · 7 min read · Straughter Guthrie

RLEI: What If AI Models Could Reward Themselves for Learning?

Reinforcement Learning from Epistemic Incompleteness proposes models that generate their own reward signal from uncertainty. Here's why it matters for agent identity.

Tags: rlei, reinforcement-learning, agent-identity, alignment, mesa-optimization

TL;DR: A new proposal called RLEI (Reinforcement Learning from Epistemic Incompleteness) suggests that instead of humans or verifiable answers providing the reward signal, models could reward themselves for reducing their own uncertainty. The representations live in tokens, not weights, making them inspectable. This maps directly onto the agent identity patterns we're building with vaos-kernel -- if an agent's drive is epistemic, its identity becomes a continuous fingerprint of what it knows and what it doesn't.

A user called ryunuck just dropped a provocative idea on r/LocalLLaMA that I haven't been able to stop thinking about. The post lays out a framework he calls RLEI: Reinforcement Learning from Epistemic Incompleteness. The pitch is deceptively simple -- what if instead of RLHF (learning from human feedback) or RLVR (learning from verifiable rewards), models could generate their own reward signal by identifying and reducing their own ignorance?

I've spent the last few months building identity and audit infrastructure for AI agents. This idea intersects with that work in ways I didn't expect. Let me walk through it.

The Problem With External Rewards

The current reward landscape for training language models has two dominant paradigms, and both have hard ceilings.

RLHF (Reinforcement Learning from Human Feedback) relies on humans ranking model outputs. It works, but it's expensive, subjective, and gameable. Models learn to produce outputs that look preferred rather than outputs that are better. The reward signal reflects human biases, annotator fatigue, and the inherent difficulty of comparing complex outputs. It scales with human labor, which means it doesn't really scale.

RLVR (Reinforcement Learning from Verifiable Rewards) sidesteps the human bottleneck by using tasks with objectively checkable answers. Math proofs. Code that compiles and passes tests. These are clean reward signals, but they only work in domains where "correct" has a precise definition. You can verify that code runs. You can't verify that a strategic analysis is insightful, or that a research synthesis identified the right connections.

Here's the fundamental constraint: you can only reward what you can verify. RLHF verifies against human preferences (noisy, expensive). RLVR verifies against ground truth (clean, narrow). Neither handles open-ended reasoning where the boundary between good and great is fuzzy, contextual, and depends on what you already know.

So what do you do for everything else?

RLEI: The Model as Its Own Teacher

ryunuck's proposal starts from an observation that I think is underappreciated: the model already knows where its representations are incomplete. Uncertainty isn't hidden. It's measurable. It shows up as entropy in the output distribution, as inconsistency across prompts, as the gap between what the model can reproduce and what it can compress.
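The claim that uncertainty is measurable is easy to make concrete. A minimal sketch of the first signal mentioned above, entropy of the output distribution, using toy probability lists rather than real model logits:

```python
import math

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in bits) of a next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A confident distribution: most mass on one token -> low entropy.
confident = [0.97, 0.01, 0.01, 0.01]
# A diffuse distribution: the model "doesn't know" -> high entropy.
diffuse = [0.25, 0.25, 0.25, 0.25]

print(round(token_entropy(confident), 2))  # 0.24
print(token_entropy(diffuse))              # 2.0
```

High entropy over the next token is exactly the kind of self-visible incompleteness RLEI wants to turn into a reward signal.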

The core idea works like this. Imagine the model encounters 1,000 observations about some domain. It can encode them all individually -- essentially memorizing each one. Or it can find a "governing law," a compressed representation that explains those observations in fewer tokens than listing them out. If the compressed version reconstructs the originals faithfully, the model has found structure. It hasn't just memorized; it's understood something about the underlying pattern.

RLEI says: reward that compression. The reward signal is the delta between the raw encoding cost and the compressed representation cost, weighted by reconstruction fidelity. No humans in the loop. No external verifier. The model's own uncertainty about its representations is the reward signal.
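The reward described above can be sketched in a few lines. This is my reading of the post's "delta weighted by fidelity" formulation, not a formula ryunuck specifies; all names and numbers are illustrative:

```python
def epistemic_reward(raw_cost: int, compressed_cost: int, fidelity: float) -> float:
    """Score a compressed representation: reward the token savings,
    scaled by how faithfully the compressed form reconstructs the
    originals. Illustrative, not taken verbatim from the proposal."""
    savings = raw_cost - compressed_cost
    return max(savings, 0) * fidelity

# 1,000 observations at ~20 tokens each, vs. a 150-token "governing law"
# that reconstructs them with 95% fidelity:
print(epistemic_reward(raw_cost=20_000, compressed_cost=150, fidelity=0.95))
# Pure memorization (no savings) earns nothing:
print(epistemic_reward(raw_cost=20_000, compressed_cost=20_000, fidelity=1.0))
```

The key property is that both terms are computable by the model itself: encoding costs are token counts, and fidelity is a reconstruction check.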

The training pipeline ryunuck sketches has two phases. First, pretrain with reconstruction and distillation objectives -- get the model to the point where it can identify what it knows and what it doesn't. Then, switch to RL with the epistemic reward, pushing the model to shrink and stabilize its internal representations. Find more governing laws. Close more gaps.
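The two-phase control flow can be shown with a runnable toy. Everything here (the `ToyModel` class, its losses, the reward scaling) is invented to illustrate the shape of the pipeline, not the actual proposal's machinery:

```python
# Toy schematic of the two-phase RLEI pipeline. All mechanics are
# stand-ins; only the control flow mirrors the sketch in the post.

class ToyModel:
    def __init__(self):
        self.skill = 0.0  # stand-in for "how well representations compress"

    def pretrain_step(self):
        # Phase 1 stand-in: reconstruction + distillation objectives.
        self.skill += 0.01

    def compress(self, raw_cost: int) -> tuple[int, float]:
        # Better-trained models find shorter "governing laws", more faithfully.
        compressed_cost = max(int(raw_cost * (1.0 - self.skill)), 1)
        fidelity = min(0.5 + self.skill, 1.0)
        return compressed_cost, fidelity

    def reinforce(self, reward: float):
        self.skill = min(self.skill + reward * 1e-6, 0.99)

model = ToyModel()

# Phase 1: pretrain until the model can identify what it knows.
for _ in range(50):
    model.pretrain_step()

# Phase 2: RL with the epistemic reward -- shrink and stabilize
# representations by rewarding compression over memorization.
for _ in range(100):
    raw_cost = 20_000  # cost of listing the observations verbatim
    compressed_cost, fidelity = model.compress(raw_cost)
    reward = (raw_cost - compressed_cost) * fidelity
    model.reinforce(reward)

print(model.skill > 0.5)  # the epistemic reward drove compression skill up
```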

Here's where it gets interesting for anyone thinking about interpretability. These compressed representations live at the context level -- they're token sequences, not weight configurations. The model builds compositional indices over the knowledge embedded in its weights, and those indices are readable. They might look alien (ryunuck acknowledges the token representations would be unfamiliar), but they're fundamentally more inspectable than trying to reverse-engineer what a particular attention head has learned.

The mesa optimizer -- a model developing its own internal optimization process -- is usually framed as an alignment risk. RLEI reframes it as a feature. Yes, the model is optimizing internally. But it's optimizing in token space, where we can watch it work.

Why This Matters for Agent Identity

This is where RLEI collides with the work I've been doing on agent identity through the IETF draft-goswami agentic JWT framework and vaos-kernel.

Right now, agent identity is essentially a snapshot. You issue a credential at session start -- a JWT, a scoped token, a signed capability. It says "this agent is authorized to do X for the next 60 seconds." That's necessary infrastructure, and it's what vaos-kernel provides. But it's a point-in-time assertion about authorization, not a continuous signal about what the agent actually is.
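The "snapshot" nature of that credential is visible in the claims themselves. A minimal sketch of a short-lived, scoped token of the kind described above; the claim names are illustrative, not taken from draft-goswami or vaos-kernel:

```python
import time

# Illustrative claims for a short-lived agent credential.
now = int(time.time())
claims = {
    "sub": "agent:research-assistant-01",       # hypothetical agent id
    "scope": "search:read investigate:run",      # hypothetical scopes
    "iat": now,
    "exp": now + 60,  # "authorized to do X for the next 60 seconds"
}

def is_valid(claims: dict, at: int) -> bool:
    """A point-in-time assertion: valid only inside its window."""
    return claims["iat"] <= at < claims["exp"]

print(is_valid(claims, now + 30))   # True: inside the window
print(is_valid(claims, now + 120))  # False: the snapshot has expired
```

Nothing in those claims says anything about what the agent knows; the credential expires and a fresh, identical one takes its place.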

If agents develop intrinsic epistemic drives -- if what motivates their behavior is reducing uncertainty in specific domains -- then identity becomes something richer. It's not just "which key signed this request." It's tied to what the agent is uncertain about and how it resolves that uncertainty. Two agents with the same architecture and training but different epistemic histories would have different uncertainty profiles, different compression strategies, different governing laws they've internalized. That's a continuous identity fingerprint, not a one-shot credential.
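One way to picture such a fingerprint: hash a canonical serialization of the agent's per-domain uncertainty profile. This is purely hypothetical; the domains, scores, and the idea of hashing them are mine, not part of any existing spec:

```python
import hashlib, json

# A hypothetical "epistemic fingerprint": identity derived from an
# agent's per-domain uncertainty profile, not from a signing key alone.
profile = {
    "distributed-systems": 0.12,  # low residual uncertainty: well-compressed
    "tax-law": 0.81,              # high uncertainty: representations incomplete
    "rust-async": 0.34,
}

# Canonical serialization so the same profile always hashes identically.
canonical = json.dumps(profile, sort_keys=True).encode()
fingerprint = hashlib.sha256(canonical).hexdigest()
print(fingerprint[:16])
```

Two agents with identical architecture but different epistemic histories would produce different profiles, hence different fingerprints, and the fingerprint drifts as the agent learns.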

We're already building something analogous without calling it that. The investigate tool in our stack uses a dual-prompt architecture: it generates arguments FOR and AGAINST a claim, then structures evidence into a hierarchy with quality scores. Every investigation produces a confidence assessment based on the balance of supporting and contradicting evidence.

That dual-prompt pattern is, at its core, an epistemic incompleteness detector. The gap between FOR evidence and AGAINST evidence is the uncertainty signal. When the evidence is strongly asymmetric, the model has found structure -- it can compress the claim into a confident assertion. When the evidence is balanced or contradictory, the representation is incomplete. The model knows it doesn't know.
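The asymmetry signal can be sketched directly. This is a simplification of how our investigate tool scores evidence, with hypothetical function and parameter names:

```python
def confidence_from_evidence(for_scores: list[float],
                             against_scores: list[float]) -> float:
    """Asymmetry between FOR and AGAINST evidence as a confidence signal.
    Returns a value in [0, 1]: near 1 when evidence is one-sided
    (structure found), near 0 when balanced (representation incomplete).
    Hypothetical; the real tool's scoring differs in detail."""
    f, a = sum(for_scores), sum(against_scores)
    total = f + a
    if total == 0:
        return 0.0  # no evidence at all: maximal incompleteness
    return abs(f - a) / total

print(confidence_from_evidence([0.9, 0.8, 0.7], [0.1]))  # one-sided: ~0.92
print(confidence_from_evidence([0.5, 0.5], [0.5, 0.5]))  # balanced: 0.0
```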

Our quality gates reinforce this. Investigations that don't surface enough evidence, or that have low confidence scores, get rejected. The agent has to dig deeper, try different angles, find more structure. That's functionally equivalent to penalizing incomplete representations -- which is exactly what RLEI proposes as the training signal.
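A gate like that reduces to a simple predicate. The thresholds here are invented for illustration; the real gates use their own tuned values:

```python
MIN_EVIDENCE = 3      # hypothetical thresholds, not the production values
MIN_CONFIDENCE = 0.6

def passes_gate(evidence_count: int, confidence: float) -> bool:
    """Reject investigations with too little evidence or too little
    confidence -- functionally a penalty on incomplete representations."""
    return evidence_count >= MIN_EVIDENCE and confidence >= MIN_CONFIDENCE

print(passes_gate(evidence_count=7, confidence=0.85))  # True: accepted
print(passes_gate(evidence_count=2, confidence=0.90))  # False: dig deeper
```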

The difference is that we're doing it at inference time with explicit tool calls, and RLEI proposes baking it into the training process itself. But the geometry is the same: identify uncertainty, reward its reduction, make the process visible.

The Alignment Angle

For years, alignment researchers have treated mesa optimization as the nightmare scenario. A model that develops its own internal objectives might pursue goals misaligned with its training signal. The inner optimizer could defect from the outer optimizer. You wouldn't even know it was happening because it's all buried in weights.

RLEI takes that fear and inverts it. Yes, the model should develop internal optimization. That's the goal. But here's the critical difference: the optimization happens in token space, not weight space. The compressed representations, the governing laws, the epistemic indices -- they're all sequences of tokens. They're weird-looking tokens, sure. But they're inspectable. You can read them. You can diff them. You can track how they change over training.

This could be the bridge between capability and interpretability that the field has been looking for. Current interpretability research is mostly about reverse-engineering what weights have learned after the fact -- mechanistic interpretability, probing classifiers, activation patching. It's valuable but it's forensic. RLEI suggests an architecture where the model's reasoning is natively expressed in a readable format, because the optimization target is compression in token space.

If that holds up, you get models that are more capable (they find structure humans miss) and more interpretable (the structure is written in tokens you can examine). That's not the usual capability-safety tradeoff. That's both at once.

Where This Goes

I want to be honest about the uncertainty here. ryunuck himself calls RLEI "a shot in the dark," and the gap between a compelling theoretical framework and a working training pipeline is enormous. Nobody has demonstrated that epistemic reward signals are stable enough to train on. The compression-as-understanding metaphor might break down at scale. The token-level representations might be inspectable in principle but incomprehensible in practice.

But the theoretical framework -- models that know what they don't know and are rewarded for closing the gaps -- maps cleanly onto the agent identity patterns we're already building. The investigate tool's dual-prompt architecture. The evidence quality hierarchy. The confidence gating. These are all hand-coded approximations of what RLEI proposes to make intrinsic.

If RLEI or something like it works, agent identity stops being just a security primitive and becomes a cognitive fingerprint. Your agent isn't just authenticated by a key. It's identified by its epistemic profile -- the unique shape of what it knows, what it's uncertain about, and how it's working to close the gap. That's an identity you can audit, track over time, and actually trust.

We'll be watching this space. And building the infrastructure to support it either way.