Evaluation History

AgentDocs keeps dogfood results as a timeline, not a single polished number. That matters because the product is improving along two axes at once:

compiler reliability: can docs be collected, normalized, built, searched, and audited deterministically;
agent workflow quality: can a coding agent safely reuse the right context across sessions without mixing stale, deprecated, or wrong-version evidence.

Prepared website crawl artifacts are rebuilt from stored normalized pages unless a run explicitly says it was a live recrawl.

Run History

Date	Run	What changed	Result
June 11, 2026	Initial real-world baseline	Ran AgentDocs across self-docs, Hono, Fastify, Supabase, TanStack Query, Next.js, Octokit, and Prisma preparation.	Established baseline failures and risks: Supabase MDX stopped the build, Fastify local retrieval favored v3 migration evidence, TanStack broad retrieval mixed frameworks, Prisma preparation was blocked on Windows filenames.
June 12, 2026	Post-hardening rerun	Added context facets, tolerant MDX diagnostics, repo-source hardening, and regression assertions.	Successful prepared targets rebuilt deterministically. Supabase completed with diagnostics; Fastify and TanStack filtered retrieval became context-safe; broad mixed-context retrieval emitted warnings.
June 16, 2026	Agent workflow layer rerun	Added `status`, `handoff`, `verify-context`, `setup-agent`, `rebuild --changed`, `watch`, `agent-brief.md`, build-state freshness, and richer MCP tools.	All documented prepared targets passed dogfood regression again. `status` reported fresh across all 9 rerun targets. Workflow verification passed where a matching task pack existed and exposed missing exact-goal task packs elsewhere.
June 20, 2026	Metrics and routing instrumentation	Added a metrics reference and dogfood routing benchmark capture with explicit `--routing-goal` and `--expect-route` flags.	Offline fixtures now verify one expected task-pack route. Future dogfood rows can report routing accuracy without making all routing goals hard failures.
June 20, 2026	Routing improvements	Added built-in route-handler, query-invalidation, and schema-validation task families, then expanded the offline routing fixture.	Offline fixtures now verify four expected task-pack routes and keep routing accuracy separate from readiness score.
June 20, 2026	Full Phase 5 dogfood rerun	Reran the nine documented prepared targets and populated routing metrics in `.dogfood/regression-summary.csv`.	Stable repeated builds across all targets. Fastify local, TanStack local, Next.js prepared crawl, Supabase, and AgentDocs exact routing passed. Hono quickstart routing failed on local and prepared-crawl targets.
June 23, 2026	Parser format expansion	Integrated Sphinx/reST and AsciiDoc format normalizers, transclusion resolution, and include-gap doctor auditing.	Sphinx/reST and AsciiDoc/Antora formats supported and verified against django, cpython, spring-framework, and airflow targets.
June 26, 2026	Active Evaluation Sandbox	Verified the evaluation sandbox harness and added a mock-verified `octokit-pagination` task to benchmark AgentDocs against control groups.	Confirmed successful runs. `octokit-pagination` on `gpt-4o-mini` showed a +100% Success Rate Delta (Control failed, Experimental passed in 7 turns). `dummy-sdk` on `gpt-4o` showed 2 turns saved (Experimental passed in 5 turns, Control in 7).

June 26, 2026 Active Evaluation Sandbox

These runs benchmarked the active evaluation sandbox harness using the `dummy-sdk` task and a newly created `octokit-pagination` task (run against 14 pages of local Octokit REST documentation).

Task	Model	Control Group	Experimental (MCP)	Turns Saved	Success Delta	Result
Dummy SDK	`gpt-4o`	Passed (7 turns)	Passed (5 turns)	2	0%	Experimental used `search_docs` and `get_page` to compile exactly what was needed.
Octokit Pagination	`gpt-4o`	Failed (10 turns)	Passed (7 turns)	3	+100%	Control failed on ESM module boundaries; Experimental passed in 7 turns. Optimized `get_page` cut token usage by 74.4% (from 55k to 14.1k tokens).
Fastify Validation	`gpt-4o`	Failed (10 turns)	Passed (5 turns)	5	+100%	Control forgot schema nesting; Experimental passed in 5 turns. Optimized `get_page` cut token usage by 48.7% (from 66.3k to 33.9k tokens).
AgentDocs Config	`gpt-4o-mini`	Passed (4 turns)	Passed (5 turns)	-1	0%	Custom API discovery. Control "peeked" at the import snippet directly inside the grep match output.
Next.js App Router	`gpt-4o`	Passed (8 turns)	Passed (7 turns)	1	0%	Complex Pages-to-App Router migration. Experimental saved 1 turn, but schema overhead added 17% more tokens.
AWS JS SDK v3	`gpt-4o`	Passed (4 turns)	Passed (3 turns)	1	0%	DynamoDB client pagination. Experimental saved 1 turn and 9.8k tokens (72% saved!) by avoiding grep context bloat.
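The derived columns in the table above can be reproduced with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the definitions are inferred from the table values (turns saved as Control minus Experimental turns, success delta as the difference in pass outcome, token reduction as a relative decrease), not taken from the sandbox harness itself.

```python
def turns_saved(control_turns: int, experimental_turns: int) -> int:
    """Positive when the Experimental (MCP) run needed fewer turns."""
    return control_turns - experimental_turns


def success_delta(control_passed: bool, experimental_passed: bool) -> int:
    """Percentage-point difference in pass outcome for a single task."""
    return (int(experimental_passed) - int(control_passed)) * 100


def token_reduction_pct(before: float, after: float) -> float:
    """Relative token reduction, rounded to one decimal place."""
    return round((before - after) / before * 100, 1)


# Octokit Pagination row: Control failed in 10 turns, Experimental passed in 7.
print(turns_saved(10, 7))                   # 3
print(success_delta(False, True))           # 100
print(token_reduction_pct(55_000, 14_100))  # 74.4

# AgentDocs Config row: Control was one turn faster, so the value is negative.
print(turns_saved(4, 5))                    # -1
```

A negative turns-saved value, as in the AgentDocs Config row, simply means the Control run finished in fewer turns.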

June 23, 2026 Parser Format Expansion Rerun

These targets test the expanded format parsers (Sphinx/reST and AsciiDoc/Antora) and transclusion resolution plumbing on large-scale real-world documentation estates.

Target	Pages	Task packs	Readiness	Source Coverage	Status	Result
Django	671	8	92	100%	Passed	Sphinx/reST parser support successfully compiling `.txt`/`.rst` docs.
CPython	556	8	79	99.6%	Passed	Full `.rst` tree compilation.
Spring Framework	469	6	79	99.5%	Passed	AsciiDoc/Antora parser support compiling `.adoc` files.
Airflow	1,617	10	79	86%	Passed	Mixed reST parser support compiling `.rst`/`.txt` docs with transclusion gap tracking.

June 20, 2026 Phase 5 Full Dogfood Rerun

Target	Pages	Task packs	Readiness	Routing	Result
AgentDocs self-docs	13	4	79	1/1	Passed; setup goal routes to installation.
Hono local docs	85	7	93	1/2	Build passed; Cloudflare Workers routes to deployment, but quickstart selects installation.
Fastify local docs	43	5	91	2/2	Passed; schema validation and migration route exactly.
Supabase local MDX	737	11	79	1/1	Passed; auth/RLS routes to authentication, with MDX coverage gap reported.
TanStack Query local docs	411	9	79	1/1	Passed; React mutation invalidation routes to query-invalidation.
Octokit local docs	14	4	93	report-only	Passed; auth request handoff selected authentication without a strict expectation.
Next.js prepared crawl	100	8	88	1/1	Passed; App Router POST route routes to route-handlers.
Hono prepared crawl	100	4	79	0/1	Build passed; quickstart selects authentication and remains a routing gap.
Fastify prepared crawl	100	6	83	1/1 strict	Passed; migration routes exactly and schema-validation was captured report-only.
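To read the Routing column as a single figure, the strict results can be summed while report-only rows are left out. This is a minimal sketch under stated assumptions: the cell values are copied from the table above, the helper names are hypothetical, and the aggregation rule (skip any cell without a `passed/total` fraction) is an assumption rather than the documented behavior of the dogfood runner.

```python
import re
from typing import Optional

# Routing cells copied from the Phase 5 table above.
ROUTING_CELLS = {
    "AgentDocs self-docs": "1/1",
    "Hono local docs": "1/2",
    "Fastify local docs": "2/2",
    "Supabase local MDX": "1/1",
    "TanStack Query local docs": "1/1",
    "Octokit local docs": "report-only",
    "Next.js prepared crawl": "1/1",
    "Hono prepared crawl": "0/1",
    "Fastify prepared crawl": "1/1 strict",
}


def parse_routing(cell: str) -> Optional[tuple[int, int]]:
    """Return (passed, total) for cells like '1/2'; None for report-only rows."""
    match = re.match(r"(\d+)/(\d+)", cell)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def routing_accuracy(cells: dict[str, str]) -> tuple[int, int]:
    """Sum strict routing results, skipping rows without a fraction."""
    passed = total = 0
    for cell in cells.values():
        parsed = parse_routing(cell)
        if parsed is None:
            continue  # assumed: report-only rows do not count toward accuracy
        passed += parsed[0]
        total += parsed[1]
    return passed, total


passed, total = routing_accuracy(ROUTING_CELLS)
print(f"{passed}/{total}")  # 8/10
```

Under this reading, the two misses are both Hono quickstart goals, one on the local target and one on the prepared crawl.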

June 16, 2026 Workflow-Layer Rerun

Target	Pages	Task packs	Readiness	Regression	Workflow-layer signal
AgentDocs self-docs	13	3	90	Passed	Fresh; self-dogfood task remains passed. Exact `serve MCP context` verification had no matching task pack, which points to task-routing coverage rather than build failure.
Hono local docs	85	7	93	Passed	Fresh; `handoff` selected `deployment`; `verify-context` passed for Cloudflare Worker deployment.
Fastify local docs	43	4	93	Passed	Fresh; unfiltered migration correctly warns about mixed v3/v4/v5 context. Exact Fastify v5 schema goal needs stronger task-pack routing.
Supabase local MDX	737	9	94	Passed	Fresh; `handoff` selected `authentication`; `verify-context` passed for auth and Row Level Security.
TanStack Query local docs	411	7	90	Passed	Fresh; broad framework queries warn about mixed context. Exact React mutation-invalidation goal needs stronger task-pack routing.
Octokit local docs	14	4	95	Passed	Fresh; compact REST docs baseline remains stable. Exact auth request goal did not map to a generated task pack.
Next.js prepared crawl	100	7	90	Passed	Fresh from prepared crawl rebuild; exact App Router POST route goal needs stronger task-pack routing.
Hono prepared crawl	100	4	81	Passed	Fresh from prepared crawl rebuild; live recrawl remains opt-in.
Fastify prepared crawl	100	5	85	Passed	Fresh from prepared crawl rebuild; migration still routes to the V5 Migration Guide first.

What Improved

Freshness is now measurable through `.agentdocs/state/build-state.json` and `agentdocs status`.
`agentdocs handoff` gives agents a reusable task entry point instead of forcing them to start with raw search.
`agentdocs verify-context` separates successful builds from task-specific safety. A failure can now mean "the docs compiled, but this exact task needs better task-pack routing or narrower facets."
MCP exposes richer task-oriented tools while staying read-only.

What Is Still Open

The workflow-layer rerun did not turn dependency implementation tasks into passes. Most remain unknown until an agent actually completes the task using only AgentDocs context. The new verification failures are useful because they show where built-in task families should expand next: route handlers, schema validation, React mutation invalidation, and SDK request/auth workflows.

Phase 3 adds deterministic measurement for those routing gaps. See Routing Benchmarks Phase 3 for the runner contract and Evaluation Metrics Reference for field definitions. Phase 4-5 begins closing the measured gaps with conservative built-in task families; see Routing Improvements Phase 4-5.