Provenance-Linked Evidence Graphs: Tracking the Science Behind Every Line of Code
What if every line of code knew how trustworthy its science is? A system that annotates implementations with evidence quality scores, not just citations.
Straughter Guthrie — April 4, 2026
Most code has no memory of why it exists. A function gets written, reviewed, merged, and within weeks nobody remembers whether the design choice came from a peer-reviewed study, a blog post, or a guess. For ordinary software this is fine. For AI agent infrastructure that governments and enterprises will depend on, it is not.
Over the past year I have been building vaos-kernel, the only independent reference implementation of Goswami's IETF draft for agentic JWT credentials (draft-goswami-agentic-jwt-00). The kernel handles identity, intent fingerprinting, and audit trails for autonomous AI agents. Every design decision traces back to a research paper, a benchmark, or a standards document.
But tracing back to a paper is not enough. The question that matters is not "which paper said to do this?" but "how strong is the evidence in that paper?"
This post describes the Provenance-Linked Evidence Graph — a system that annotates every implementation decision with the quality of the science behind it, makes that annotation queryable, and updates itself when implementation reveals new evidence the original authors did not have.
The Problem with Citation Anchoring
Tools like AutoResearchClaw (UNC/Stanford, ~10K GitHub stars) have built impressive four-layer citation verification pipelines. Given a paper reference, they can confirm the paper exists, check that the citation is not hallucinated, verify the DOI resolves, and cross-reference against Semantic Scholar or OpenAlex.
This answers the question: Is this paper real?
It does not answer: Is the science in this paper any good?
A citation-verified reference to a single observational study with n=12 and no replication looks identical to a citation-verified reference to a Cochrane systematic review synthesizing forty randomized controlled trials. Both pass verification. They should not carry the same weight in an implementation.
When I built vaos-kernel's BLAKE2b-256 intent fingerprinting (benchmarked at 50,774 requests per second with 0.5% overhead), the design drew on multiple sources with very different evidence quality. Some claims had deep empirical backing. Others rested on a single author's benchmarks — including my own. I needed a way to see that difference at a glance, in the code itself.
What the Evidence Graph Looks Like
The system layers on top of two existing components in the VAOS research stack:
paper2code turns research papers into citation-anchored implementations. Every function, every constant, every design choice carries a comment tracing it to a specific paper section:
// §3.2 — "BLAKE2b-256 intent fingerprinting"
func FingerprintIntent(payload []byte) [32]byte {
    return blake2b.Sum256(payload)
}
You know where the idea came from. You do not know how much to trust it.
Investigate is an adversarial epistemic engine. Given a claim, it runs dual prompts — one arguing FOR the claim, one arguing AGAINST — and scores the evidence using a hierarchy: systematic reviews and meta-analyses carry 3x weight, randomized controlled trials carry 2x, observational studies carry 1.5x, and expert opinion or theoretical argument carries 1x.
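The hierarchy above can be sketched as a small scoring function. The tier weights (3x, 2x, 1.5x, 1x) come straight from the post; how Investigate actually aggregates FOR and AGAINST evidence into a single 0-1 score is not specified, so the normalization below (weighted support over total weighted evidence) is purely an illustrative assumption, as are the type names.

```go
package main

import "fmt"

// EvidenceKind is one tier of the evidence hierarchy described in the post.
type EvidenceKind int

const (
	SystematicReview EvidenceKind = iota // systematic reviews / meta-analyses
	RCT                                  // randomized controlled trials
	Observational                        // observational studies
	ExpertOpinion                        // expert opinion / theoretical argument
)

// weight returns the multiplier for each tier (values from the post).
func weight(k EvidenceKind) float64 {
	switch k {
	case SystematicReview:
		return 3.0
	case RCT:
		return 2.0
	case Observational:
		return 1.5
	default:
		return 1.0
	}
}

// score is an assumed aggregation: weighted FOR evidence divided by the
// total weighted evidence on both sides, yielding a value in [0, 1].
func score(forEv, againstEv []EvidenceKind) float64 {
	var forW, total float64
	for _, k := range forEv {
		forW += weight(k)
		total += weight(k)
	}
	for _, k := range againstEv {
		total += weight(k)
	}
	if total == 0 {
		return 0
	}
	return forW / total
}

func main() {
	// Two systematic reviews and one RCT in favor, one expert opinion against.
	fmt.Printf("%.2f\n", score(
		[]EvidenceKind{SystematicReview, SystematicReview, RCT},
		[]EvidenceKind{ExpertOpinion},
	))
}
```

Under this toy aggregation, a claim backed by strong study designs stays high even with some dissent, while one resting only on expert opinion cannot climb far.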
The Evidence Graph connects these two systems:
// §3.2 — "BLAKE2b-256 intent fingerprinting"
// Evidence: 3 supporting papers (2 systematic reviews, 1 RCT)
// Investigate score: 0.87 (grounded store)
// Confidence: HIGH — core claim well-supported
func FingerprintIntent(payload []byte) [32]byte {
    return blake2b.Sum256(payload)
}
Compare that to:
// §4.1 — "60-second JIT credential lifetime"
// Evidence: 1 observational study, author's own benchmark
// Investigate score: 0.34 (belief store only)
// Confidence: LOW — claim rests on thin evidence
func NewCredential(agentID string) *JWT {
    return &JWT{
        Subject:   agentID,
        ExpiresIn: 60 * time.Second,
    }
}
The 60-second lifetime is not wrong. It works. But the evidence basis is thin. Anyone reading this code immediately knows where the soft spots are.
Making It Queryable
The Evidence Graph stores scores in a structured, machine-readable format that supports queries like:
- "Show me all code depending on claims with evidence score below 0.5." — Every function resting on weak evidence, surfaced before a release or audit.
- "What implementations break if this paper is retracted?" — Instant blast radius analysis when retractions happen.
- "Which parts of vaos-kernel have the strongest empirical backing?" — For presenting to IETF working groups or NIST reviewers.
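The first of those queries can be sketched against a JSONL store (the format the graph currently uses, per the roadmap below). The field names `symbol`, `claim`, and `score` are illustrative assumptions, not the real schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// EvidenceRecord is a hypothetical one-line-per-claim JSONL record.
type EvidenceRecord struct {
	Symbol string  `json:"symbol"` // the annotated function
	Claim  string  `json:"claim"`  // the claim it depends on
	Score  float64 `json:"score"`  // Investigate score, 0 to 1
}

// weakClaims answers "show me all code depending on claims with evidence
// score below threshold" by scanning the JSONL store line by line.
func weakClaims(jsonl string, threshold float64) []EvidenceRecord {
	var out []EvidenceRecord
	for _, line := range strings.Split(strings.TrimSpace(jsonl), "\n") {
		var r EvidenceRecord
		if err := json.Unmarshal([]byte(line), &r); err != nil {
			continue // skip malformed lines
		}
		if r.Score < threshold {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	store := `{"symbol":"FingerprintIntent","claim":"BLAKE2b-256 intent fingerprinting","score":0.87}
{"symbol":"NewCredential","claim":"60-second JIT credential lifetime","score":0.34}`
	for _, r := range weakClaims(store, 0.5) {
		fmt.Printf("%s: %q (score %.2f)\n", r.Symbol, r.Claim, r.Score)
	}
}
```

Run against the two annotated functions from earlier, this surfaces only `NewCredential`, exactly the "soft spot" the low-confidence comment flags.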
The Recursive Loop
The most interesting property: when implementation contradicts its source material, that is new evidence.
When benchmarks do not match a paper's claimed performance, that enters the graph. When an ambiguity audit reveals underspecified algorithms, that enters the graph. The system that consumed research to build software is now producing research about that software.
Read → Build → Discover → Publish → Read. That is a research institution's knowledge cycle, automated.
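The Discover step of that loop can be sketched as a benchmark check that feeds the graph. The post's measured 50,774 req/s is real; the paper's claimed figure, the 10% tolerance, and every type name here are hypothetical, chosen only to show a contradiction becoming an evidence entry.

```go
package main

import "fmt"

// Finding is a hypothetical new evidence entry produced by implementation.
type Finding struct {
	Claim    string
	Source   string // where the contradicting evidence came from
	Supports bool   // false means it contradicts the claim
}

// checkBenchmark compares a measured value against a paper's claimed value.
// If the relative deviation exceeds tolerance, it emits an AGAINST finding
// for the evidence graph.
func checkBenchmark(claim string, claimed, measured, tolerance float64) (Finding, bool) {
	dev := (measured - claimed) / claimed
	if dev < 0 {
		dev = -dev
	}
	if dev <= tolerance {
		return Finding{}, false // measurement consistent with the claim
	}
	return Finding{Claim: claim, Source: "local benchmark", Supports: false}, true
}

func main() {
	var graph []Finding
	// Hypothetical paper claim of 80k req/s versus the measured 50,774 req/s:
	// a ~36% deviation, well past a 10% tolerance.
	if f, contradicts := checkBenchmark("hash throughput >= 80k req/s", 80000, 50774, 0.10); contradicts {
		graph = append(graph, f) // the implementation just produced research
	}
	fmt.Println(len(graph), "new evidence entries")
}
```

The design choice worth noting: the finding is appended, not overwritten onto the original score. The graph keeps both the paper's claim and the contradicting measurement, so the next scoring pass weighs them against each other.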
What's Next
- Automated graph construction — paper2code automatically invokes Investigate on every claim
- Graph database migration — from JSONL to proper graph store for complex traversal
- Live evidence updates — Semantic Scholar alerts trigger automatic score updates
- Cross-implementation linking — as other IETF implementations emerge, link their evidence graphs
- Public evidence dashboard — queryable interface at vaos.sh for radical transparency
The code is at github.com/jmanhype/vaos-kernel. The evidence graph work is in active development.
Straughter Guthrie builds autonomous research infrastructure at VAOS. He submitted a public comment to NIST NCCoE on AI Agent Identity and maintains the only independent reference implementation of IETF draft-goswami-agentic-jwt-00. Find him at @StraughterG on X.