Evaluation Metrics Reference

This page defines the beta reporting fields used by AgentDocs dogfood and benchmark runs. The fields are stable enough for public interpretation during the beta, but historical rows may omit newer fields. Missing historical values should be labeled, not treated as zero.

Summary Artifacts

Each pnpm regression:dogfood run writes local machine-readable evidence:

```txt
results/
  summary.json
  summary.csv
```

summary.json is the complete evidence record. summary.csv is the compact single-row form used for cross-target tables such as .dogfood/regression-summary.csv.
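As a minimal sketch of consuming the single-row form, the snippet below parses CSV text and keeps blank cells as `None` rather than coercing them to zero. It assumes the file has a header line followed by one data row, and the sample columns are illustrative only; the real `summary.csv` layout may differ.

```python
import csv
import io


def read_summary_row(csv_text: str) -> dict:
    """Parse a single-row summary CSV; blank cells become None, never 0."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if len(rows) != 1:
        raise ValueError(f"expected exactly one data row, found {len(rows)}")
    return {key: (value if value != "" else None) for key, value in rows[0].items()}


# Hypothetical sample content; the real column set may differ.
sample = "target,pages,readiness,routing_accuracy\nexample-docs,85,93,\n"
row = read_summary_row(sample)
print(row)
```

Values stay as strings here so that a caller decides how to convert each field, and a missing `routing_accuracy` is visibly absent instead of looking like a score of zero.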

Core Fields

| Field | Meaning |
| --- | --- |
| `target` | Stable label for the prepared docs target. |
| `pages` | Normalized source pages accepted into the AgentDocs model. For crawls, this is bounded by scope and page budget. |
| `chunks` | Heading-aware text units generated from pages and written to `chunks.jsonl`. |
| `entities` | Deterministic graph entities such as packages, imports, environment variables, CLI commands, routes, versions, warnings, and concepts. |
| `task_packs` | Evidence-linked task bundles generated for families such as quickstart, authentication, migration, errors, deployment, and configuration. |
| `readiness` | `agentdocs doctor` score out of 100. This is an audit score, not proof that an agent can complete a task. |
| `repeat_build_hash_match` | Whether two consecutive builds produced the same generated-artifact hash. |
| `broken_links` | Known broken internal links among collected pages. Uncollected out-of-scope links are tracked separately by doctor. |
| `warnings` | Doctor warnings that require review but do not necessarily fail a run. |
| `deprecations` | Deprecated concepts found through deterministic extraction. |

Coverage Fields

| Field | Meaning |
| --- | --- |
| `source_coverage_ratio` | `compiledFiles / intendedFiles` for the configured source scope. |
| `source_coverage_gap` | Reason or severity for incomplete source coverage. |
| `supportedFiles` | Markdown/MDX files in scope that AgentDocs can ingest today. |
| `unsupportedFiles` | Docs-like files in scope that AgentDocs currently counts but does not parse, such as reST or AsciiDoc. |
| `compiledFiles` | Supported files that produced usable or degraded normalized pages. |

Use source coverage to avoid false confidence. A one-page Markdown result from a mostly reST or AsciiDoc corpus is a coverage gap, not a representative pass.
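The ratio itself is simple arithmetic. A rough sketch follows, assuming `intendedFiles` is the count of files the configured scope was meant to cover. Returning `None` for an empty scope is a choice made for this example, not documented AgentDocs behavior.

```python
from typing import Optional


def source_coverage_ratio(compiled_files: int, intended_files: int) -> Optional[float]:
    """compiledFiles / intendedFiles, or None when the scope is empty (assumed)."""
    if compiled_files < 0 or intended_files < 0:
        raise ValueError("file counts must be non-negative")
    if intended_files == 0:
        return None  # assumption: an empty scope has no meaningful ratio
    return compiled_files / intended_files


# One Markdown page compiled out of a hypothetical 40-file, mostly reST corpus.
ratio = source_coverage_ratio(1, 40)
print(ratio)
```

A ratio this low signals a coverage gap even if every other metric on the row looks healthy.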

Search And Routing Fields

| Field | Meaning |
| --- | --- |
| `search_auth_good` | Human judgment for whether the standard authentication search was useful. |
| `search_quickstart_good` | Human judgment for whether the standard quickstart search was useful. |
| `topSearchResults` | Top five search results for standard and target-specific queries. |
| `routing_goals` | Number of task-routing goals evaluated with `agentdocs handoff` and `agentdocs verify-context`. |
| `routing_expected` | Number of routing goals with an explicit expected task-pack ID. |
| `routing_passed` | Expected routing goals whose selected task pack matched the expected ID list. |
| `routing_failed` | Expected routing goals whose selected task pack was missing or unexpected. |
| `routing_accuracy` | `routing_passed / routing_expected`, or blank when no expected routes were declared. |
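The blank case for `routing_accuracy` is easy to get wrong when aggregating rows. The sketch below is one way to compute it, with `None` standing in for the blank cell (an assumption of this example):

```python
from typing import Optional


def routing_accuracy(routing_passed: int, routing_expected: int) -> Optional[float]:
    """routing_passed / routing_expected; None stands in for a blank cell."""
    if routing_expected == 0:
        return None  # no expected routes declared, so no accuracy to report
    if not 0 <= routing_passed <= routing_expected:
        raise ValueError("routing_passed must be between 0 and routing_expected")
    return routing_passed / routing_expected


print(routing_accuracy(1, 2))
print(routing_accuracy(0, 0))
```

Keeping the blank case as `None` prevents a run with no declared expectations from being averaged in as 0% accuracy.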

Routing classifications are deterministic:

| Classification | Meaning |
| --- | --- |
| `matched_exact` | Selected task pack is one of the expected task-pack IDs. |
| `matched_related` | A task pack was selected, but no expectation was declared or the selected pack was not expected. |
| `fallback` | No task pack was selected; AgentDocs fell back to source search and goal-bundle evidence. |
| `unsafe_mixed_context` | Verification found mixed task or search context, such as multiple versions or frameworks. |
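The four classifications can be read as a small decision function. The sketch below is illustrative only. The function name, its parameters, and in particular the order of the checks (mixed context first, then fallback) are assumptions; this page does not state which classification wins when several conditions hold.

```python
from typing import Optional, Sequence


def classify_route(
    selected: Optional[str],
    expected: Sequence[str],
    mixed_context: bool,
) -> str:
    """Map one routing outcome to a classification label (assumed precedence)."""
    if mixed_context:
        return "unsafe_mixed_context"  # assumption: safety finding takes priority
    if selected is None:
        return "fallback"
    if selected in expected:
        return "matched_exact"
    # A pack was selected, but it was either unexpected or nothing was declared.
    return "matched_related"


# Hypothetical task-pack IDs, used only for illustration.
print(classify_route("auth-pack", ["auth-pack"], False))
print(classify_route("deploy-pack", [], False))
```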

Routing benchmarks are report-only unless a run declares --expect-route.

Agent Task Field

agent_task_passed is a human judgment. It remains unknown until an agent actually completes the target workflow using the generated AgentDocs context. Readiness, search quality, and routing accuracy are supporting evidence, not a replacement for this task result.

Missing Metric Reasons

Use these labels instead of unexplained N/A:

| Reason | Meaning |
| --- | --- |
| `unsupported_format` | The docs corpus exists, but the dominant format is not parsed yet. |
| `scale_limited` | The corpus is supported but too large for the current run budget. |
| `scope_mismatch` | The selected source scope is not representative of the intended docs. |
| `retrieval_mismatch` | The build succeeded, but common task queries returned irrelevant material. |
| `historical_metric_not_captured` | The run predates this metric. |
| `preparation_blocked` | The target could not be prepared before AgentDocs ran. |
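When rendering a cross-target table, a small helper can enforce this rule by refusing to print a missing value without one of the labels above. This is a hedged sketch; `render_metric` is a hypothetical helper and not part of AgentDocs.

```python
MISSING_REASONS = frozenset({
    "unsupported_format",
    "scale_limited",
    "scope_mismatch",
    "retrieval_mismatch",
    "historical_metric_not_captured",
    "preparation_blocked",
})


def render_metric(value, reason=None) -> str:
    """Render a table cell: the value if present, otherwise a documented reason."""
    if value is not None:
        return str(value)  # a real 0 is still a value and is printed as "0"
    if reason not in MISSING_REASONS:
        raise ValueError("a missing metric needs one of the documented reason labels")
    return reason


print(render_metric(93))
print(render_metric(None, "historical_metric_not_captured"))
```

Raising on an unlabeled gap keeps unexplained blanks from slipping into published tables.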

Worked Example

If a row says:

```txt
pages=85, task_packs=7, readiness=93, repeat_build_hash_match=true,
routing_expected=2, routing_passed=1, routing_failed=1,
agent_task_passed=unknown
```

read it as:

AgentDocs compiled the target deterministically and produced strong audit signals, but one declared task goal did not route to the expected task pack. Because the implementation task has not been completed by an agent, agent_task_passed stays unknown.
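The same reading can be checked mechanically. The sketch below parses the example row and confirms that the routing counts agree with each other. The `key=value` parsing is specific to this illustration, and the check that passed plus failed equals expected is an inference from the field definitions rather than a documented invariant.

```python
def parse_row(text: str) -> dict:
    """Split comma- or newline-separated key=value pairs into a dict of strings."""
    pairs = [item.strip() for item in text.replace("\n", ",").split(",")]
    return dict(pair.split("=", 1) for pair in pairs if pair)


row_text = """pages=85, task_packs=7, readiness=93, repeat_build_hash_match=true,
routing_expected=2, routing_passed=1, routing_failed=1,
agent_task_passed=unknown"""

row = parse_row(row_text)
expected = int(row["routing_expected"])
passed = int(row["routing_passed"])
failed = int(row["routing_failed"])

# Inferred consistency check: every expected goal either passed or failed.
consistent = passed + failed == expected
accuracy = passed / expected if expected else None

print(consistent, accuracy, row["agent_task_passed"])
```

Here the counts are consistent and half of the declared routes matched, while `agent_task_passed` stays `unknown` because no metric on the row can stand in for the human judgment.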

Released under the MIT License.