How the Investigate Tool Works: Adversarial Verification for AI Agents

A 1,700-line Elixir module that argues against its own findings before reporting them. Here's the architecture.

Tags: investigate, adversarial-verification, vaos, elixir


Most AI agent tools retrieve information. The investigate tool retrieves information, then builds a case against it.

The module lives at lib/optimal_system_agent/tools/builtins/investigate.ex in vas-swarm. It is roughly 1,700 lines of Elixir. Its job is to take a natural language question, search the academic literature, extract claims with citations, and then run an adversarial process that stress-tests its own findings before producing a final report.

This post walks through what it does, how it does it, and where it falls short.

The Architecture: Two Stores, Not One

The investigate tool borrows its core design from Active Epistemic Control (AEC), a framework described in Qu 2025. The paper addresses a straightforward problem: when an agent uses a learned model to fill knowledge gaps, how do you prevent model errors from contaminating decisions?

AEC's answer is a strict two-store separation:

- A belief store holds unverified claims: popular assertions, common assumptions, model predictions. Beliefs may guide exploration, but they never justify conclusions.
- A grounded store holds verified claims: peer-reviewed findings and empirical data. Only grounded claims determine the final answer.

In the original paper, this applies to robotic planning. In the investigate tool, it applies to claims about the world. A claim that "homeopathy has clinical support" might enter the belief store because it appears in popular discourse. It stays out of the grounded store unless peer-reviewed evidence backs it.

This distinction is the load-bearing wall of the entire system.
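In Elixir terms, the separation can be sketched as two collections that never merge. This is a hypothetical illustration, not the module's actual data structures; the module name and field names are invented:

```elixir
defmodule TwoStore do
  # Hypothetical sketch of AEC's two-store separation.
  # Claims enter the belief store freely; the grounded store
  # only accepts claims backed by peer-reviewed evidence.
  defstruct belief: [], grounded: []

  def add_claim(%__MODULE__{} = store, claim, peer_reviewed?: true) do
    %{store | grounded: [claim | store.grounded]}
  end

  def add_claim(%__MODULE__{} = store, claim, peer_reviewed?: false) do
    %{store | belief: [claim | store.belief]}
  end

  # Only grounded claims may serve as the basis for an answer.
  def answer_basis(%__MODULE__{grounded: grounded}), do: grounded
end
```

The asymmetry is in `answer_basis/1`: nothing in the belief store can reach the verdict, no matter how many claims accumulate there.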

The Pipeline

Here is the full flow from input question to final report:

 INPUT: Natural language question
   |
   v
 [1] QUERY FORMULATION
   |   Topic -> structured search queries
   |   (multiple angles, synonyms, negations)
   |
   v
 [2] LITERATURE SEARCH
   |   Semantic Scholar API
   |   OpenAlex API
   |   (parallel requests, result merging)
   |
   v
 [3] EVIDENCE EXTRACTION
   |   Parse abstracts
   |   Extract claims + citation metadata
   |   Tag each claim: supports / opposes / neutral
   |
   v
 [4] TWO-STORE CLASSIFICATION
   |   +------------------+-------------------+
   |   | BELIEF STORE     | GROUNDED STORE    |
   |   | Popular claims,  | Peer-reviewed,    |
   |   | common assertions| empirical data,   |
   |   | model predictions| verified findings |
   |   +------------------+-------------------+
   |
   v
 [5] ADVERSARIAL DUAL-PROMPT
   |   Generate COUNTER-ARGUMENT against findings
   |   Evaluate both sides (pro and counter)
   |   Score argument strength
   |
   v
 [6] CITATION VERIFICATION
   |   Do cited papers actually say
   |   what the agent claims they say?
   |   Cross-reference abstracts against claims
   |
   v
 [7] FINAL SYNTHESIS
      Confidence levels
      Evidence quality ratings
      Asymmetry assessment
      Structured report

 OUTPUT: Investigation report with verdict

Each stage deserves explanation.

Stage 1: Query Formulation

The tool does not search for the question as-is. It decomposes the topic into multiple structured queries designed to surface evidence from different angles. If the input is "Is homeopathy effective?", the queries might include terms for clinical trials, meta-analyses, placebo comparisons, and mechanism-of-action studies. This increases the chance of finding contradictory evidence, which is the point.
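A minimal sketch of the fan-out, assuming a fixed set of search angles. In the real module the decomposition is presumably LLM-driven rather than template-based, and the angle list here is illustrative:

```elixir
defmodule QueryFormulation do
  # Hypothetical sketch: fan a topic out into multiple search angles
  # so contradictory evidence has a chance to surface.
  @angles ["clinical trial", "meta-analysis", "placebo controlled", "mechanism of action"]

  def formulate(topic) do
    Enum.map(@angles, fn angle -> "#{topic} #{angle}" end)
  end
end
```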

Stage 2: Literature Search

Two APIs run in parallel:

- Semantic Scholar, for abstracts and citation-rich paper metadata
- OpenAlex, for broad open-access coverage of the scholarly record

Results are merged and deduplicated. Without a Semantic Scholar API key, the tool is rate-limited and typically returns around 5 papers per investigation. With a key, that number climbs to 15-20. This is a real constraint discussed below.
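The parallel fan-out and merge can be sketched with `Task.async/1`. The search functions below are stand-ins for real HTTP clients, and deduplication by normalized title is an assumption about how the merge works:

```elixir
defmodule LiteratureSearch do
  # Hypothetical sketch of stage 2: query both APIs concurrently,
  # flatten the results, and deduplicate by normalized title.
  def run(query, semantic_scholar_fn, openalex_fn) do
    [semantic_scholar_fn, openalex_fn]
    |> Enum.map(fn search -> Task.async(fn -> search.(query) end) end)
    |> Enum.flat_map(&Task.await/1)
    |> Enum.uniq_by(fn paper -> String.downcase(paper.title) end)
  end
end
```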

Stage 3: Evidence Extraction

For each paper, the tool pulls the abstract and uses an LLM call to extract discrete claims. Each claim gets tagged with its citation source and a directional label: does this claim support the hypothesis, oppose it, or remain neutral?

The extraction prompt is specific. It asks for empirical findings, not author opinions or speculative discussion sections.
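The shape of an extracted claim might look like the following. The struct and field names are hypothetical; only the three-way directional label is described in the text:

```elixir
defmodule Claim do
  # Hypothetical shape of an extracted claim; field names are illustrative.
  @enforce_keys [:text, :citation, :direction]
  defstruct [:text, :citation, :direction]

  @directions [:supports, :opposes, :neutral]

  # Constructor that rejects any direction outside the three-way tag set.
  def new(text, citation, direction) when direction in @directions do
    %__MODULE__{text: text, citation: citation, direction: direction}
  end
end
```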

Stage 4: Two-Store Classification

This is where AEC earns its keep. Every extracted claim goes into one of the two stores:

- Belief store: popular claims, common assertions, model predictions
- Grounded store: peer-reviewed findings, empirical data, verified results

The classification is not binary in practice -- the tool assigns a grounding confidence to each claim and uses a threshold to sort them. But the conceptual split is strict: beliefs guide the search, grounded facts determine the answer.
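The thresholded sort can be sketched in a few lines. The 0.7 cutoff is an invented placeholder, not the module's actual threshold:

```elixir
defmodule Grounding do
  # Hypothetical sketch: each claim carries a grounding confidence in [0, 1];
  # claims at or above the threshold go to the grounded store, the rest
  # to the belief store. The threshold value is illustrative.
  @threshold 0.7

  def classify(claims) do
    # Returns {grounded, belief}
    Enum.split_with(claims, fn claim -> claim.confidence >= @threshold end)
  end
end
```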

Stage 5: Adversarial Dual-Prompt

This is the unusual part. After the tool assembles its evidence and reaches a preliminary conclusion, it generates a structured counter-argument against that conclusion. The counter-argument prompt receives the same evidence and is instructed to build the strongest possible case in the opposite direction.

Then a second evaluation pass scores both the original argument and the counter-argument on internal consistency, evidence quality, and logical coherence. If the counter-argument scores higher, the tool revises its conclusion.

This is not a formality. The adversarial pass exists because LLMs exhibit confirmation bias during evidence synthesis -- they tend to weight the first coherent narrative they construct. Forcing a counter-argument disrupts that pattern.
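The decision rule at the end of the adversarial pass reduces to a comparison. This sketch assumes the three scoring dimensions named above are summed; in the real pipeline the scores would come from a second LLM evaluation pass, and the aggregation may differ:

```elixir
defmodule Adversarial do
  # Hypothetical sketch of stage 5's decision rule: score both sides,
  # revise the conclusion only if the counter-argument wins.
  def resolve(pro, counter) do
    if score(counter) > score(pro), do: {:revised, counter}, else: {:kept, pro}
  end

  defp score(%{consistency: c, evidence: e, coherence: h}), do: c + e + h
end
```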

Stage 6: Citation Verification

The tool checks its own work. For each claim attributed to a specific paper, it cross-references the claim text against the actual abstract content. If a claim cannot be substantiated by the abstract it cites, it gets flagged and downweighted.

This catches a common failure mode: the LLM generates a plausible-sounding claim and attaches it to a real paper that does not actually make that claim. Citation verification does not eliminate this problem, but it catches the obvious cases.
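One simple way to implement such a check is lexical overlap between the claim and the cited abstract. This is a sketch under that assumption; the real module may use an LLM judgment instead, and the 0.5 threshold is invented:

```elixir
defmodule CitationCheck do
  # Hypothetical sketch: flag a claim when too few of its content words
  # appear in the abstract it cites. The 0.5 threshold is illustrative.
  def supported?(claim_text, abstract) do
    claim_words = tokens(claim_text)
    abstract_words = tokens(abstract)
    overlap = MapSet.intersection(claim_words, abstract_words) |> MapSet.size()
    overlap / max(MapSet.size(claim_words), 1) >= 0.5
  end

  defp tokens(text) do
    text
    |> String.downcase()
    |> String.split(~r/[^a-z0-9]+/, trim: true)
    |> MapSet.new()
  end
end
```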

Stage 7: Final Synthesis

The output is a structured report containing:

- Confidence levels for each conclusion
- Evidence quality ratings for the supporting claims
- An asymmetry assessment (how lopsided the verified evidence is)
- A final verdict

The Homeopathy Test Case

The clearest demonstration of the pipeline is the homeopathy investigation.

Input: "Is homeopathy effective?"

Result: asymmetric_evidence_against

The numbers:

 Metric                                  Count
 Belief store citations                  5
 Grounded store citations                3
 Verified citations against homeopathy   4
 Verified citations for homeopathy       0

The belief store captured claims about traditional use and patient-reported satisfaction. The grounded store captured meta-analyses and systematic reviews, all of which found no effect beyond placebo. The adversarial pass attempted to build a case for homeopathy using the available evidence and failed -- the counter-argument scored lower than the original conclusion on every metric.

The tool got this right. It distinguished between "people believe this works" and "controlled experiments show this works," then correctly reported the asymmetry.

Provider

The investigate tool runs on GLM-4.7 via Zhipu (OSA_DEFAULT_PROVIDER=zhipu). This is the default inference provider for VAOS agent operations. GLM-4.7 handles the multi-turn prompting required for evidence extraction, adversarial generation, and citation verification within a single investigation cycle.

Limitations

These are real, not hypothetical.

Paper coverage is thin without an API key. The Semantic Scholar free tier returns roughly 5 papers per query. With an authenticated key, coverage expands to 15-20. That is a three- to fourfold difference in evidence surface area. For well-studied topics like homeopathy, 5 papers suffice. For niche questions, it may not.

Investigations take 15-20 minutes. The pipeline involves multiple sequential LLM calls (query formulation, evidence extraction per paper, adversarial generation, citation verification) plus API latency for literature search. This is not a real-time tool. It is a background process.

Ranking is term-overlap, not semantic similarity. When merging results from Semantic Scholar and OpenAlex, the tool ranks by keyword overlap rather than embedding-based similarity. This means it can miss relevant papers that use different terminology for the same concepts.
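A sketch of why this matters, using a bare-bones overlap score (the scoring function is invented for illustration). A paper that says "efficacy of homeopathic remedies" scores zero against the query "homeopathy effectiveness" even though it is exactly on topic:

```elixir
# Term-overlap ranking (a sketch of the limitation): papers that use
# synonyms score zero even when they are topically identical.
rank = fn query, title ->
  q = query |> String.downcase() |> String.split() |> MapSet.new()
  t = title |> String.downcase() |> String.split() |> MapSet.new()
  MapSet.intersection(q, t) |> MapSet.size()
end
```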

No cross-investigation caching. If you investigate "Is homeopathy effective?" and then investigate "Do alternative medicines work?", the second investigation starts from zero. There is no shared evidence cache between related topics. Each investigation is independent.

What This Means

The investigate tool is not a search engine. It is closer to a structured literature review that argues with itself. The two-store architecture prevents the agent from conflating popular belief with empirical evidence. The adversarial pass prevents the agent from over-committing to its first interpretation.

The result is a tool that takes 15 minutes instead of 15 seconds, but produces output with citation-level traceability and an explicit record of how it stress-tested its own conclusions.

That tradeoff is intentional. For questions where getting the answer wrong has consequences, speed is the wrong optimization target. Verification is the right one.