
Evaluation History

AgentDocs keeps dogfood results as a timeline, not a single polished number. That matters because the product is improving along two axes at once:

  • compiler reliability: can docs be collected, normalized, built, searched, and audited deterministically;
  • agent workflow quality: can a coding agent safely reuse the right context across sessions without mixing stale, deprecated, or wrong-version evidence.

Prepared website crawl artifacts are rebuilt from stored normalized pages unless a run explicitly says it was a live recrawl.
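One way to check the compiler-reliability axis is to build the same inputs twice and compare the outputs byte for byte. The sketch below is only an illustration of that idea, not AgentDocs' own mechanism: `tree_digest` and `builds_match` are hypothetical helper names, and the build output is assumed to be a plain directory of files.

```python
import hashlib
from pathlib import Path


def tree_digest(root: Path) -> str:
    """Hash every file under root (relative path plus bytes) in a stable order."""
    digest = hashlib.sha256()
    files = [p for p in root.rglob("*") if p.is_file()]
    # Sort by relative POSIX path so the order does not depend on the OS
    # or on where the build directory lives.
    for path in sorted(files, key=lambda p: p.relative_to(root).as_posix()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def builds_match(first: Path, second: Path) -> bool:
    """True when two build output directories are byte-for-byte identical."""
    return tree_digest(first) == tree_digest(second)
```

If two builds from the same stored normalized pages produce different digests, something in the pipeline (timestamps, iteration order, absolute paths) is leaking into the output.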

Run History

| Date | Run | What changed | Result |
| --- | --- | --- | --- |
| June 11, 2026 | Initial real-world baseline | Ran AgentDocs across self-docs, Hono, Fastify, Supabase, TanStack Query, Next.js, Octokit, and Prisma preparation. | Established baseline failures and risks: Supabase MDX stopped the build, Fastify local retrieval favored v3 migration evidence, TanStack broad retrieval mixed frameworks, Prisma preparation was blocked on Windows filenames. |
| June 12, 2026 | Post-hardening rerun | Added context facets, tolerant MDX diagnostics, repo-source hardening, and regression assertions. | Successful prepared targets rebuilt deterministically. Supabase completed with diagnostics; Fastify and TanStack filtered retrieval became context-safe; broad mixed-context retrieval emitted warnings. |
| June 16, 2026 | Agent workflow layer rerun | Added status, handoff, verify-context, setup-agent, rebuild --changed, watch, agent-brief.md, build-state freshness, and richer MCP tools. | All documented prepared targets passed dogfood regression again. status reported fresh across all 9 rerun targets. Workflow verification passed where a matching task pack existed and exposed missing exact-goal task packs elsewhere. |
| June 20, 2026 | Metrics and routing instrumentation | Added a metrics reference and dogfood routing benchmark capture with explicit --routing-goal and --expect-route flags. | Offline fixtures now verify one expected task-pack route. Future dogfood rows can report routing accuracy without making all routing goals hard failures. |
| June 20, 2026 | Routing improvements | Added built-in route-handler, query-invalidation, and schema-validation task families, then expanded the offline routing fixture. | Offline fixtures now verify four expected task-pack routes and keep routing accuracy separate from readiness score. |
| June 20, 2026 | Full Phase 5 dogfood rerun | Reran the nine documented prepared targets and populated routing metrics in .dogfood/regression-summary.csv. | Stable repeated builds across all targets. Fastify local, TanStack local, Next.js prepared crawl, Supabase, and AgentDocs exact routing passed. Hono quickstart routing failed on local and prepared-crawl targets. |
| June 23, 2026 | Parser format expansion | Integrated Sphinx/reST and AsciiDoc format normalizers, transclusion resolution, and include-gap doctor auditing. | Sphinx/reST and AsciiDoc/Antora formats supported and verified against django, cpython, spring-framework, and airflow targets. |
| June 26, 2026 | Active Evaluation Sandbox | Verified the evaluation sandbox harness and added a mock-verified octokit-pagination task to benchmark AgentDocs against control groups. | Confirmed successful runs. octokit-pagination on gpt-4o-mini showed a +100% Success Rate Delta (Control failed, Experimental passed in 7 turns). dummy-sdk on gpt-4o showed 2 turns saved (Experimental passed in 5 turns, Control in 7). |

June 26, 2026 Active Evaluation Sandbox

These runs benchmarked the active evaluation sandbox harness using the dummy-sdk task and a newly created octokit-pagination task, which runs against 14 pages of local Octokit REST documentation.

| Task | Model | Control Group | Experimental (MCP) | Turns Saved | Success Delta | Result |
| --- | --- | --- | --- | --- | --- | --- |
| Dummy SDK | gpt-4o | Passed (7 turns) | Passed (5 turns) | 2 | 0% | Experimental used search_docs and get_page to compile exactly what was needed. |
| Octokit Pagination | gpt-4o | Failed (10 turns) | Passed (7 turns) | 3 | +100% | Control failed on ESM module boundaries; Experimental passed in 7 turns. Optimized get_page cut token usage by 74.4% (from 55k to 14.1k tokens). |
| Fastify Validation | gpt-4o | Failed (10 turns) | Passed (5 turns) | 5 | +100% | Control forgot schema nesting; Experimental passed in 5 turns. Optimized get_page cut token usage by 48.7% (from 66.3k to 33.9k tokens). |
| AgentDocs Config | gpt-4o-mini | Passed (4 turns) | Passed (5 turns) | -1 | 0% | Custom API discovery. Control "peeked" at the import snippet directly inside the grep match output. |
| Next.js App Router | gpt-4o | Passed (8 turns) | Passed (7 turns) | 1 | 0% | Complex Pages-to-App Router migration. Experimental saved 1 turn, but schema overhead added 17% more tokens. |
| AWS JS SDK v3 | gpt-4o | Passed (4 turns) | Passed (3 turns) | 1 | 0% | DynamoDB client pagination. Experimental saved 1 turn and 9.8k tokens (72% saved!) by avoiding grep context bloat. |
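The Turns Saved and Success Delta columns can be read as simple differences between the control and experimental runs. The sketch below reproduces that arithmetic under assumed definitions (control turns minus experimental turns, and experimental pass minus control pass as a percentage). The `Run` type and function names are hypothetical and are not the sandbox harness API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    passed: bool
    turns: int


def turns_saved(control: Run, experimental: Run) -> int:
    """Positive when the experimental run needed fewer turns."""
    return control.turns - experimental.turns


def success_delta_pct(control: Run, experimental: Run) -> int:
    """+100 when only experimental passed, -100 when only control passed, else 0."""
    return (int(experimental.passed) - int(control.passed)) * 100


def token_reduction_pct(before: float, after: float) -> float:
    """Share of tokens removed, as a percentage rounded to one decimal."""
    return round((before - after) / before * 100, 1)


# Octokit Pagination row: Control failed in 10 turns, Experimental passed in 7.
octokit_control = Run(passed=False, turns=10)
octokit_experimental = Run(passed=True, turns=7)
```

With these definitions, `turns_saved(octokit_control, octokit_experimental)` gives 3, `success_delta_pct(...)` gives 100, and `token_reduction_pct(55_000, 14_100)` gives 74.4, matching the Octokit Pagination row. A negative value, as in the AgentDocs Config row, means the experimental run took more turns than control.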

June 23, 2026 Parser Format Expansion Rerun

These targets test the expanded format parsers (Sphinx/reST and AsciiDoc/Antora) and transclusion resolution plumbing on large-scale real-world documentation estates.

| Target | Pages | Task packs | Readiness | Source Coverage | Status | Result |
| --- | --- | --- | --- | --- | --- | --- |
| Django | 671 | 8 | 92 | 100% | Passed | Sphinx/reST parser support successfully compiling .txt/.rst docs. |
| CPython | 556 | 8 | 79 | 99.6% | Passed | Full .rst tree compilation. |
| Spring Framework | 469 | 6 | 79 | 99.5% | Passed | AsciiDoc/Antora parser support compiling .adoc files. |
| Airflow | 1,617 | 10 | 79 | 86% | Passed | Mixed reST parser support compiling .rst/.txt docs with transclusion gap tracking. |

June 20, 2026 Phase 5 Full Dogfood Rerun

| Target | Pages | Task packs | Readiness | Routing | Result |
| --- | --- | --- | --- | --- | --- |
| AgentDocs self-docs | 13 | 4 | 79 | 1/1 | Passed; setup goal routes to installation. |
| Hono local docs | 85 | 7 | 93 | 1/2 | Build passed; Cloudflare Workers routes to deployment, but quickstart selects installation. |
| Fastify local docs | 43 | 5 | 91 | 2/2 | Passed; schema validation and migration route exactly. |
| Supabase local MDX | 737 | 11 | 79 | 1/1 | Passed; auth/RLS routes to authentication, with MDX coverage gap reported. |
| TanStack Query local docs | 411 | 9 | 79 | 1/1 | Passed; React mutation invalidation routes to query-invalidation. |
| Octokit local docs | 14 | 4 | 93 | report-only | Passed; auth request handoff selected authentication without a strict expectation. |
| Next.js prepared crawl | 100 | 8 | 88 | 1/1 | Passed; App Router POST route routes to route-handlers. |
| Hono prepared crawl | 100 | 4 | 79 | 0/1 | Build passed; quickstart selects authentication and remains a routing gap. |
| Fastify prepared crawl | 100 | 6 | 83 | 1/1 strict | Passed; migration routes exactly and schema-validation was captured report-only. |
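The Routing column mixes strict fractions such as 1/2 with report-only entries. As a rough sketch of how such cells could be aggregated while keeping routing accuracy separate from the readiness score, the code below sums matched and expected routes and skips report-only rows. The parsing rules and function names are assumptions for illustration, and the source does not publish an overall accuracy figure.

```python
import re
from typing import Iterable, Optional, Tuple

# Matches a leading "matched/expected" fraction, e.g. "1/2" or "1/1 strict".
ROUTING_CELL = re.compile(r"^(\d+)/(\d+)")


def parse_routing(cell: str) -> Optional[Tuple[int, int]]:
    """Return (matched, expected), or None for cells like "report-only"."""
    match = ROUTING_CELL.match(cell.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def routing_accuracy(cells: Iterable[str]) -> Optional[float]:
    """Matched routes divided by expected routes over all strict rows."""
    matched = expected = 0
    for cell in cells:
        parsed = parse_routing(cell)
        if parsed is None:
            continue  # report-only rows carry no strict expectation
        matched += parsed[0]
        expected += parsed[1]
    return matched / expected if expected else None


# Routing cells from the table above, in row order.
phase5_cells = ["1/1", "1/2", "2/2", "1/1", "1/1",
                "report-only", "1/1", "0/1", "1/1 strict"]
```

Under these assumptions, `routing_accuracy(phase5_cells)` evaluates to 0.8 (8 of 10 strict expectations matched), with both misses coming from the Hono quickstart goal.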

June 16, 2026 Workflow-Layer Rerun

| Target | Pages | Task packs | Readiness | Regression | Workflow-layer signal |
| --- | --- | --- | --- | --- | --- |
| AgentDocs self-docs | 13 | 3 | 90 | Passed | Fresh; self-dogfood task remains passed. Exact serve MCP context verification had no matching task pack, which points to task-routing coverage rather than build failure. |
| Hono local docs | 85 | 7 | 93 | Passed | Fresh; handoff selected deployment; verify-context passed for Cloudflare Worker deployment. |
| Fastify local docs | 43 | 4 | 93 | Passed | Fresh; unfiltered migration correctly warns about mixed v3/v4/v5 context. Exact Fastify v5 schema goal needs stronger task-pack routing. |
| Supabase local MDX | 737 | 9 | 94 | Passed | Fresh; handoff selected authentication; verify-context passed for auth and Row Level Security. |
| TanStack Query local docs | 411 | 7 | 90 | Passed | Fresh; broad framework queries warn about mixed context. Exact React mutation-invalidation goal needs stronger task-pack routing. |
| Octokit local docs | 14 | 4 | 95 | Passed | Fresh; compact REST docs baseline remains stable. Exact auth request goal did not map to a generated task pack. |
| Next.js prepared crawl | 100 | 7 | 90 | Passed | Fresh from prepared crawl rebuild; exact App Router POST route goal needs stronger task-pack routing. |
| Hono prepared crawl | 100 | 4 | 81 | Passed | Fresh from prepared crawl rebuild; live recrawl remains opt-in. |
| Fastify prepared crawl | 100 | 5 | 85 | Passed | Fresh from prepared crawl rebuild; migration still routes to the V5 Migration Guide first. |
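Several rows above mention warnings about mixed context, for example unfiltered Fastify migration results spanning v3, v4, and v5. The following is a minimal sketch of that kind of check, assuming each retrieval hit carries an optional version facet. The `Hit` type, the function names, and the warning text are hypothetical and do not describe AgentDocs internals.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Hit:
    title: str
    version: Optional[str]  # None when the page has no version facet


def mixed_version_warning(hits: Iterable[Hit]) -> Optional[str]:
    """Return a warning when results span more than one version facet."""
    versions = sorted({hit.version for hit in hits if hit.version is not None})
    if len(versions) > 1:
        return "mixed context: results span versions " + ", ".join(versions)
    return None


def filter_by_version(hits: Iterable[Hit], version: str) -> List[Hit]:
    """Keep only hits tagged with the requested version."""
    return [hit for hit in hits if hit.version == version]


# Illustrative hits only; the titles are made up for the example.
hits = [
    Hit("Migration guide (v3)", "v3"),
    Hit("Migration guide (v4)", "v4"),
    Hit("Migration guide (v5)", "v5"),
]
```

Here `mixed_version_warning(hits)` returns a warning string, while `mixed_version_warning(filter_by_version(hits, "v5"))` returns `None`. That mirrors the difference between broad retrieval, which warns, and filtered retrieval, which is context-safe.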

What Improved

  • Freshness is now measurable through .agentdocs/state/build-state.json and agentdocs status.
  • agentdocs handoff gives agents a reusable task entry point instead of forcing them to start with raw search.
  • agentdocs verify-context separates successful builds from task-specific safety. A failure can now mean "the docs compiled, but this exact task needs better task-pack routing or narrower facets."
  • MCP exposes richer task-oriented tools while staying read-only.
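To illustrate what "measurable freshness" can mean in practice, the sketch below compares content hashes recorded at build time with hashes of the current sources. It is written under stated assumptions: the real schema of .agentdocs/state/build-state.json is not shown on this page, so the recorded state is modeled as a hypothetical mapping from source path to SHA-256 digest, and the function names are invented for the example.

```python
import hashlib
from typing import Dict, List


def content_hash(text: str) -> str:
    """SHA-256 digest of a source document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stale_paths(recorded: Dict[str, str], sources: Dict[str, str]) -> List[str]:
    """Paths that were added, removed, or changed since the recorded build.

    recorded: hypothetical build state, path -> digest at build time.
    sources:  current source text, path -> content.
    """
    current = {path: content_hash(text) for path, text in sources.items()}
    all_paths = recorded.keys() | current.keys()
    return sorted(p for p in all_paths if recorded.get(p) != current.get(p))


def is_fresh(recorded: Dict[str, str], sources: Dict[str, str]) -> bool:
    """A build is fresh when no source differs from the recorded state."""
    return not stale_paths(recorded, sources)
```

A list of stale paths like this is also the natural input for an incremental rebuild, since only the changed sources need to be processed again.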

What Is Still Open

The workflow-layer rerun did not turn dependency implementation tasks into passes. Most remain unknown until an agent actually completes the task using only AgentDocs context. The new verification failures are useful because they show where built-in task families should expand next: route handlers, schema validation, React mutation invalidation, and SDK request/auth workflows.

Phase 3 adds deterministic measurement for those routing gaps. See Routing Benchmarks Phase 3 for the runner contract and Evaluation Metrics Reference for field definitions. Phase 4-5 begins closing the measured gaps with conservative built-in task families; see Routing Improvements Phase 4-5.

Released under the MIT License.