From 0 to 3,320 Tests: What We Learned Building VAOS

Real bugs, real fixes, and what 3,320 tests actually cover in a 100K-line Elixir agent framework.



VAOS is three Elixir libraries and a chunk of Rust. Here are the numbers:

Total: roughly 100,000 lines of Elixir, 20,000 lines of Rust, 3,636 tests across the ecosystem.

This post is not about how great our test suite is. It is about what we got wrong, what the tests actually cover, and what they do not. If you are evaluating VAOS, this is the honest engineering state of the project as of March 2025.

Six Bugs Worth Talking About

1. The gRPC Client That Does Nothing

Open grpc_client.ex and look at call_grpc. It returns fake {:ok, ...} responses for every call. The entire gRPC integration is stubs.

The gRPC layer was specced during the initial architecture phase. Interfaces were defined, modules were built against those interfaces, and tests were written that exercise the module boundaries. The actual connection to a real gRPC service never happened.

We kept the stubs because removing them would break the module interface that other components depend on. The stubs are documented as non-functional in the module doc and in our internal tracker.

The lesson: stub code that passes tests gives false confidence. Our test suite reports green for gRPC-related paths, but nothing is actually being tested. The coverage number lies. We know it lies, and we have chosen to leave it that way for now rather than rip out a module boundary we will need later.
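For illustration, the pattern looks roughly like this. The module and function shapes below are hypothetical, not the actual VAOS client; the one mitigation worth copying is the log line, which at least makes stub hits visible at runtime:

```elixir
# Hypothetical sketch of a gRPC stub that "passes" every test.
defmodule GRPCClientStub do
  require Logger

  # Returns a canned success for every call, so any test that only
  # pattern-matches {:ok, _} goes green without a server ever existing.
  def call_grpc(service, method, _request) do
    Logger.warning("gRPC stub hit: #{service}/#{method} (no real call made)")
    {:ok, %{status: :stubbed, service: service, method: method}}
  end
end
```

Tagging the response with `:stubbed` also lets a smoke test assert the opposite -- that production paths never see a stubbed payload.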

2. CostTracker Crashes on Boot

The cost tracking GenServer crashes immediately on startup. The SQLite3 schema migration does not match the table structure the GenServer expects. The routes still reference the GenServer, so any request to /api/costs returns a 500.

This happened because the schema was added after the GenServer was written. Over several weeks of development, the two drifted apart. The migration creates columns the GenServer does not know about, and the GenServer expects columns the migration does not create.

The lesson: database migration ordering matters more than it seems. When the schema and the code that reads it are authored at different times by different people (or the same person in different mental states), drift is the default outcome. We need migration tests that assert schema shape, not just migration success.
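One way to sketch such a shape check, kept dependency-free by abstracting the database call (the module, table, and column names below are hypothetical, not the real CostTracker schema):

```elixir
# Hypothetical schema-shape check: compare the columns the GenServer
# expects against what the migration actually created.
defmodule SchemaCheck do
  @expected_columns ~w(id model tokens_in tokens_out cost_usd inserted_at)

  # `query_fun` abstracts the DB call so this sketch stays dependency-free;
  # in a real test you would pass in the result of
  # "PRAGMA table_info(cost_entries)" from your SQLite adapter.
  def drift(query_fun) do
    actual = query_fun.("PRAGMA table_info(cost_entries)")

    %{
      missing: @expected_columns -- actual,
      unexpected: actual -- @expected_columns
    }
  end
end
```

A test that asserts `drift/1` returns empty lists fails the moment either side moves, which is exactly the failure mode that bit us.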

3. Process.sleep in a GenServer

x_publisher.ex contains Process.sleep chains inside a handle_call callback. During publishing, the GenServer blocks for 30 or more seconds.

This was written as rate limiting for the X (formerly Twitter) API, which enforces per-window limits, and the "solution" was to sleep between posts. The problem: handle_call is synchronous. While the process sleeps, every other message in its mailbox waits. If anything else in the system sends a call to this GenServer during that window, it times out or queues up behind a 30-second wall.

The fix is straightforward: use Process.send_after to schedule the next publish, or offload the work to a Task that can sleep without blocking the GenServer's mailbox. We have not applied the fix yet because the publishing cadence is low enough that it has not caused a production incident. But it will.

The lesson: never sleep inside a GenServer callback. The BEAM gives you better tools. Use them.
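A minimal sketch of the send_after approach, assuming a simple in-state queue and a fixed 30-second cadence (the module name, queue shape, and interval are illustrative, not x_publisher.ex's actual code):

```elixir
# Sketch: schedule each publish with Process.send_after instead of
# sleeping inside handle_call. Callers get a reply immediately; the
# rate limit is enforced by the drain timer, not by blocking.
defmodule Publisher do
  use GenServer

  @interval_ms 30_000

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  # Enqueue and return at once; the caller is never blocked.
  def publish(pid, post), do: GenServer.call(pid, {:publish, post})

  @impl true
  def init(:ok), do: {:ok, %{queue: :queue.new(), timer: nil}}

  @impl true
  def handle_call({:publish, post}, _from, state) do
    state = %{state | queue: :queue.in(post, state.queue)}
    # Start the drain timer only if one is not already pending.
    state = if state.timer, do: state, else: %{state | timer: schedule()}
    {:reply, :queued, state}
  end

  @impl true
  def handle_info(:drain, state) do
    case :queue.out(state.queue) do
      {{:value, post}, rest} ->
        send_to_api(post)
        # Respect the rate limit by scheduling the next drain, not sleeping.
        {:noreply, %{state | queue: rest, timer: schedule()}}

      {:empty, _} ->
        {:noreply, %{state | timer: nil}}
    end
  end

  defp schedule, do: Process.send_after(self(), :drain, @interval_ms)

  # Placeholder for the real API call.
  defp send_to_api(_post), do: :ok
end
```

The mailbox stays responsive between drains, so unrelated calls to this process are answered in microseconds instead of queuing behind a sleep.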

4. Blanket Rescues That Hide Real Errors

memory_store.ex has more than 10 functions that look like this:

def some_operation(args) do
  try do
    # actual logic
  rescue
    _ -> {:error, :unknown}
  end
end

Every function catches every exception and returns {:error, :unknown}. Out-of-memory errors, argument errors, pattern match failures -- they all become :unknown. When something goes wrong in the memory store, the logs say nothing useful. We have spent hours debugging issues that would have been obvious if we had let the process crash with a real stack trace.

The lesson: rescue specific exceptions. rescue ArgumentError -> is fine. rescue _ -> is not. The entire point of OTP supervisors is to handle unexpected crashes. Let them do their job. Swallowing errors does not make your system more reliable. It makes failures invisible.
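A small sketch of the narrow-rescue style (names illustrative; String.to_existing_atom/1 stands in for any call with one known, handleable failure mode):

```elixir
# Sketch: rescue only the exception you can actually handle; let
# everything else crash up to the supervisor with a real stack trace.
defmodule MemoryStoreExample do
  # Narrow rescue: only ArgumentError becomes a tagged error tuple.
  # An out-of-memory or match error here still crashes loudly.
  def fetch(key) when is_binary(key) do
    {:ok, String.to_existing_atom(key)}
  rescue
    ArgumentError -> {:error, {:unknown_key, key}}
  end
end
```

Note the function-level rescue: idiomatic Elixir rarely needs an explicit try block for this.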

5. Bcrypt: Present in Code, Absent in Dependencies

auth.ex calls Bcrypt.hash_pwd_salt/1 for password hashing. The module compiles without errors. At runtime, it crashes because bcrypt_elixir is not listed in mix.exs.

Elixir resolves remote module calls at runtime. The compiler does not verify that a module will be available when the call executes -- at most it emits a warning -- and otherwise assumes you know what you are doing. We did not.

This was caught by a manual test, not by the test suite. None of the auth tests exercise the actual hashing path because the test setup uses pre-hashed fixtures.

The lesson: compilation does not mean correctness. If your tests use fixtures that bypass the code path where a dependency is called, you will not know the dependency is missing until someone actually runs it.
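One cheap guard is a test that asserts every runtime dependency the code calls is actually loadable. A sketch using Code.ensure_loaded?/1 (the module list you would pass is project-specific):

```elixir
# Sketch: report which of the listed modules cannot be loaded in the
# current runtime -- a missing mix.exs dependency shows up here, not
# at the first real login attempt.
defmodule DepCheck do
  # Returns the subset of `modules` that fail to load.
  def missing(modules), do: Enum.reject(modules, &Code.ensure_loaded?/1)
end
```

In our case, `DepCheck.missing([Bcrypt])` in a test would have returned `[Bcrypt]` long before a manual test did.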

6. 200% CPU From a BEAM Bug, Not the LLM

OSA (our agent runtime) was consuming 200% CPU continuously. Fans spun up, machines throttled, and eventually processes crashed. The assumption was obvious: it must be model inference. Running an LLM agent framework, high CPU usage, clearly the model is the bottleneck.

It was not. The root cause was a memory and CPU bug in the BEAM process itself -- unrelated to any model calls. After profiling with :observer and :recon, we identified the hot path and fixed it. OSA now runs at 117MB memory usage and 7% CPU.

The lesson: profile before you assume. "It is probably the LLM" is a reasonable first guess and it was wrong. The fix took an afternoon once we stopped guessing and started measuring.
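A stdlib-only sketch of that first measurement step, ranking live processes by reductions (the BEAM's unit of work, and a decent CPU proxy), similar in spirit to what :recon.proc_count/2 reports:

```elixir
# Sketch: find the busiest processes on the node by reduction count.
defmodule HotProcs do
  def top(n \\ 5) do
    Process.list()
    |> Enum.map(fn pid -> {pid, Process.info(pid, :reductions)} end)
    # Process.info/2 returns nil for processes that exited mid-scan.
    |> Enum.filter(fn {_pid, info} -> info != nil end)
    |> Enum.map(fn {pid, {:reductions, r}} -> {pid, r} end)
    |> Enum.sort_by(fn {_pid, r} -> -r end)
    |> Enum.take(n)
  end
end
```

Running this a few seconds apart and diffing the counts points straight at the hot process, which is how the culprit here was isolated.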

What 3,320 Tests Actually Cover

The test suite is not a vanity number. Here is what it exercises:

What They Do Not Cover

The Honest State

VAOS works for its intended use case: single-node agent orchestration with epistemic governance. An agent can receive a signal, classify it, select tools, execute a plan, track its own reasoning, and produce auditable output. That pipeline is well-tested and runs in production.

It is not production-hardened for multi-tenant SaaS. There is no tenant isolation, no horizontal scaling story, and no battle-tested auth layer (see bug number 5). The test suite catches regressions but does not prove correctness. Property-based tests get us closer, but 3,636 tests across 120,000 lines of code is roughly one test per 33 lines. That is decent coverage by count, but count is a poor proxy for confidence.

We publish these numbers and these bugs because the alternative is to say "we have thousands of tests" and let people assume everything works. It does not all work. Some of it is stubbed. Some of it crashes on boot. Some of it will block your GenServer for 30 seconds if you look at it wrong.

What we can say: the core agent orchestration pipeline -- signal in, reasoning out, tools executed, results tracked -- is solid. We know where the gaps are, they are documented, and we are closing them in priority order.

The code is the proof. The tests are the evidence. The bugs are the context you need to evaluate both.