Merged
2 changes: 1 addition & 1 deletion apps/web/astro.config.mjs
@@ -38,7 +38,7 @@ export default defineConfig({
sidebar: [
{ label: 'Getting Started', autogenerate: { directory: 'docs/getting-started' } },
{ label: 'Evaluation', autogenerate: { directory: 'docs/evaluation' } },
-{ label: 'Evaluators', autogenerate: { directory: 'docs/evaluators' } },
+{ label: 'Graders', autogenerate: { directory: 'docs/graders' } },
{ label: 'Targets', autogenerate: { directory: 'docs/targets' } },
{ label: 'Tools', autogenerate: { directory: 'docs/tools' } },
{ label: 'Guides', autogenerate: { directory: 'docs/guides' } },
18 changes: 9 additions & 9 deletions apps/web/src/content/docs/docs/evaluation/batch-cli.mdx
@@ -14,14 +14,14 @@ Use batch CLI evaluation when:
- An external tool processes multiple inputs in a single invocation (e.g., AML screening, bulk classification)
- The runner reads the eval YAML directly to extract all tests
- Output is JSONL with records keyed by test `id`
-- Each test has its own evaluator to validate its corresponding output record
+- Each test has its own grader to validate its corresponding output record

## Execution Flow

1. **AgentV** invokes the batch runner once, passing `--eval <yaml-path>` and `--output <jsonl-path>`
2. **Batch runner** reads the eval YAML, extracts all tests, processes them, and writes JSONL output keyed by `id`
3. **AgentV** parses the JSONL and routes each record to its matching test by `id`
-4. **Per-test evaluators** validate the output for each test independently
+4. **Per-test graders** validate the output for each test independently
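Conceptually, the JSONL parsing and routing in steps 2 and 3 look like the sketch below (illustrative only; `parseJsonl` and `routeById` are hypothetical names, not AgentV internals):

```typescript
// Hypothetical sketch of JSONL routing by test id, not AgentV's actual code.
type BatchRecord = { id: string; text?: string };

// Step 2 output: one JSON object per non-empty line, keyed by `id`.
function parseJsonl(jsonl: string): BatchRecord[] {
  return jsonl
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as BatchRecord);
}

// Step 3: route each record to its matching test by `id`; records
// without a matching test are dropped here for simplicity.
function routeById(records: BatchRecord[], testIds: Set<string>): Map<string, BatchRecord> {
  const routed = new Map<string, BatchRecord>();
  for (const record of records) {
    if (testIds.has(record.id)) routed.set(record.id, record);
  }
  return routed;
}
```

Each routed record is then handed to that test's grader (step 4).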

## Eval File Structure

@@ -109,7 +109,7 @@ JSONL where each line is a JSON object with an `id` matching a test:
{"id": "case-002", "text": "{\"decision\": \"REVIEW\", ...}"}
```

-The `id` field must match the test `id` for AgentV to route output to the correct evaluator.
+The `id` field must match the test `id` for AgentV to route output to the correct grader.

### Output with Tool Trajectory

@@ -138,11 +138,11 @@ To enable `tool_trajectory` evaluation, include `output` with `tool_calls`:
}
```

-AgentV extracts tool calls directly from `output[].tool_calls[]` for `tool_trajectory` evaluators.
+AgentV extracts tool calls directly from `output[].tool_calls[]` for `tool_trajectory` graders.
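A minimal sketch of that extraction (message shapes are assumptions based on the JSONL example above, not AgentV's exact schema):

```typescript
// Flatten tool calls from output messages, preserving order of appearance.
type ToolCall = { name: string; arguments?: Record<string, unknown> };
type OutputMessage = { role: string; content?: string; tool_calls?: ToolCall[] };

function extractToolCalls(output: OutputMessage[]): ToolCall[] {
  return output.flatMap((message) => message.tool_calls ?? []);
}
```

The flattened list is what a `tool_trajectory` grader compares against its expected sequence.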

-## Evaluator Implementation
+## Grader Implementation

-Each test has its own evaluator that validates the batch runner output. The evaluator receives the standard `code_grader` input via stdin.
+Each test has its own grader that validates the batch runner output. The grader receives the standard `code_grader` input via stdin.

**Input (stdin):**
```json
@@ -164,7 +164,7 @@ Each test has its own evaluator that validates the batch runner output. The eval
}
```

-### Example Evaluator
+### Example Grader

```typescript
import fs from 'node:fs';
@@ -233,7 +233,7 @@ expected_output:
reasons: []
```

-The evaluator extracts these fields and compares them against the parsed candidate output.
+The grader extracts these fields and compares them against the parsed candidate output.

## Target Configuration

@@ -259,7 +259,7 @@ Key settings:

## Best Practices

-1. **Use unique test IDs** -- the batch runner and AgentV use `id` to route outputs to the correct evaluator
+1. **Use unique test IDs** -- the batch runner and AgentV use `id` to route outputs to the correct grader
2. **Structured input** -- put structured data in `user.content` for the runner to extract
3. **Structured expected_output** -- define expected output as objects for easy comparison
4. **Deterministic runners** -- batch runners should produce consistent output for reliable testing
42 changes: 21 additions & 21 deletions apps/web/src/content/docs/docs/evaluation/eval-cases.mdx
@@ -5,7 +5,7 @@ sidebar:
order: 2
---

-Tests are individual test entries within an evaluation file. Each test defines input messages, expected outcomes, and optional evaluator overrides.
+Tests are individual test entries within an evaluation file. Each test defines input messages, expected outcomes, and optional grader overrides.

## Basic Structure

@@ -29,9 +29,9 @@ tests:
| `expected_output` | No | Expected response for comparison (string, object, or message array). Alias: `expected_output` |
| `execution` | No | Per-case execution overrides (for example `target`, `skip_defaults`) |
| `workspace` | No | Per-case workspace config (overrides suite-level) |
-| `metadata` | No | Arbitrary key-value pairs passed to evaluators and workspace scripts |
+| `metadata` | No | Arbitrary key-value pairs passed to graders and workspace scripts |
| `rubrics` | No | Structured evaluation criteria |
-| `assertions` | No | Per-test evaluators |
+| `assertions` | No | Per-test graders |

## Input

@@ -55,7 +55,7 @@ When suite-level `input` is defined in the eval file, those messages are prepend

## Expected Output

-Optional reference response for comparison by evaluators. A string expands to a single assistant message:
+Optional reference response for comparison by graders. A string expands to a single assistant message:

```yaml
expected_output: "42"
@@ -71,7 +71,7 @@ expected_output:

## Per-Case Execution Overrides

-Override the default target or evaluators for specific tests:
+Override the default target or graders for specific tests:

```yaml
tests:
@@ -87,7 +87,7 @@ tests:
prompt: ./graders/depth.md
```

-Per-case `assertions` evaluators are **merged** with root-level `assertions` evaluators — test-specific evaluators run first, then root-level defaults are appended. To opt out of root-level defaults for a specific test, set `execution.skip_defaults: true`:
+Per-case `assertions` graders are **merged** with root-level `assertions` graders — test-specific graders run first, then root-level defaults are appended. To opt out of root-level defaults for a specific test, set `execution.skip_defaults: true`:

```yaml
assertions:
@@ -162,11 +162,11 @@ Operational checkout state belongs under `workspace.repos[].checkout.base_commit

## Per-Test Assertions

-The `assertions` field defines evaluators directly on a test. It supports both deterministic assertion types and LLM-based rubric evaluation.
+The `assertions` field defines graders directly on a test. It supports both deterministic assertion types and LLM-based rubric evaluation.

### Deterministic Assertions

-These evaluators run without an LLM call and produce binary (0 or 1) scores:
+These graders run without an LLM call and produce binary (0 or 1) scores:

| Type | Value | Description |
|------|-------|-------------|
@@ -251,7 +251,7 @@ tests:
value: ["true/false", "boolean", "expected value"]
```

-Assertion evaluators auto-generate a `name` when one is not provided (e.g., `contains-DENIED`, `is_json`).
+Assertion graders auto-generate a `name` when one is not provided (e.g., `contains-DENIED`, `is_json`).
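That naming could be implemented roughly like this (a sketch matching the two documented examples; the real rules may cover more cases):

```typescript
// Derive a default grader name from an assertion's type and value,
// e.g. `contains-DENIED` for a value-carrying check, `is_json` for a bare one.
type Assertion = { type: string; value?: unknown; name?: string };

function assertionName(assertion: Assertion): string {
  if (assertion.name) return assertion.name; // an explicit name always wins
  if (typeof assertion.value === 'string') return `${assertion.type}-${assertion.value}`;
  return assertion.type;
}
```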

### Rubric Assertions

@@ -283,7 +283,7 @@ tests:

### Required Gates

-Any evaluator in `assertions` can be marked as `required`. When a required evaluator fails, the overall test verdict is `fail` regardless of the aggregate score.
+Any grader in `assertions` can be marked as `required`. When a required grader fails, the overall test verdict is `fail` regardless of the aggregate score.

| Value | Behavior |
|-------|----------|
@@ -303,23 +303,23 @@ assertions:
weight: 1.0
```

-Required gates are evaluated after all evaluators run. If any required evaluator falls below its threshold, the verdict is forced to `fail`.
+Required gates are evaluated after all graders run. If any required grader falls below its threshold, the verdict is forced to `fail`.
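The gating described above might look like this in code (a sketch assuming scores and the threshold share one scale; names are illustrative, not AgentV's internals):

```typescript
// Sketch of required-gate verdict logic: a failing required grader forces
// `fail` even when the weighted aggregate clears the threshold.
type GraderResult = { score: number; weight: number; required?: boolean };

function verdict(results: GraderResult[], threshold: number): 'pass' | 'fail' {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  const aggregate =
    totalWeight === 0 ? 0 : results.reduce((sum, r) => sum + r.score * r.weight, 0) / totalWeight;
  // Required gates are checked after all graders have run.
  const gateFailed = results.some((r) => r.required && r.score < threshold);
  return gateFailed || aggregate < threshold ? 'fail' : 'pass';
}
```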

### Assertions Merge Behavior

`assertions` can be defined at both suite and test levels:

-- Per-test `assertions` evaluators run first.
-- Suite-level `assertions` evaluators are appended automatically.
+- Per-test `assertions` graders run first.
+- Suite-level `assertions` graders are appended automatically.
- Set `execution.skip_defaults: true` on a test to skip suite-level defaults.

## How `criteria` and `assertions` Interact

-The `criteria` field is a **data field** that describes what the response should accomplish. It is not an evaluator itself — how it gets used depends on whether `assertions` is present.
+The `criteria` field is a **data field** that describes what the response should accomplish. It is not a grader itself — how it gets used depends on whether `assertions` is present.

### No `assertions` — implicit LLM grader

-When a test has no `assertions` field, a default `llm-grader` evaluator runs automatically and uses `criteria` as the evaluation prompt:
+When a test has no `assertions` field, a default `llm-grader` runs automatically and uses `criteria` as the evaluation prompt:

```yaml
tests:
@@ -342,14 +342,14 @@ tests:
input: Generate the spreadsheet report
```

-### `assertions` present — explicit evaluators only
+### `assertions` present — explicit graders only

-When `assertions` is defined, only the declared evaluators run. No implicit grader is added. Graders that are declared (such as `llm-grader`, `code-grader`, or `rubrics`) receive `criteria` as input automatically.
+When `assertions` is defined, only the declared graders run. No implicit grader is added. Graders that are declared (such as `llm-grader`, `code-grader`, or `rubrics`) receive `criteria` as input automatically.

-If `assertions` contains only deterministic evaluators (like `contains` or `regex`), the `criteria` field is not evaluated and a warning is emitted:
+If `assertions` contains only deterministic graders (like `contains` or `regex`), the `criteria` field is not evaluated and a warning is emitted:

```
-Warning: Test 'my-test': criteria is defined but no evaluator in assertions
+Warning: Test 'my-test': criteria is defined but no grader in assertions
will evaluate it. Add 'type: llm-grader' to assertions, or remove criteria
if it is documentation-only.
```
@@ -367,7 +367,7 @@ tests:
value: "fix"
```

-When you need a custom file conversion for only one grader, add `preprocessors` directly to that evaluator:
+When you need a custom file conversion for only one grader, add `preprocessors` directly to that grader:

```yaml
preprocessors:
Expand All @@ -389,7 +389,7 @@ tests:

## Metadata

-Pass additional context to evaluators via the `metadata` field:
+Pass additional context to graders via the `metadata` field:

```yaml
tests:
8 changes: 4 additions & 4 deletions apps/web/src/content/docs/docs/evaluation/eval-files.mdx
@@ -5,7 +5,7 @@ sidebar:
order: 1
---

-Evaluation files define the test cases, targets, and evaluators for an evaluation run. AgentV supports two formats: YAML and JSONL.
+Evaluation files define the test cases, targets, and graders for an evaluation run. AgentV supports two formats: YAML and JSONL.

## Suites

@@ -41,7 +41,7 @@ tests:
| `execution` | Default execution config (`target`, `fail_on_error`, `threshold`, etc.) |
| `workspace` | Suite-level workspace config — inline object or string path to an [external workspace file](/docs/guides/workspace-pool/#external-workspace-config) |
| `tests` | Array of individual tests, or a string path to an external file |
-| `assertions` | Suite-level evaluators appended to each test unless `execution.skip_defaults: true` is set on the test |
+| `assertions` | Suite-level graders appended to each test unless `execution.skip_defaults: true` is set on the test |
| `input` | Suite-level input messages prepended to each test's input unless `execution.skip_defaults: true` is set on the test |

### Metadata Fields
@@ -76,7 +76,7 @@ tests:

### Suite-level Assertions

-The `assertions` field is the canonical way to define suite-level evaluators. Suite-level assertions are appended to every test's evaluators unless a test sets `execution.skip_defaults: true`.
+The `assertions` field is the canonical way to define suite-level graders. Suite-level assertions are appended to every test's graders unless a test sets `execution.skip_defaults: true`.

```yaml
description: API response validation
@@ -92,7 +92,7 @@ tests:
input: Check API health
```

-`assertions` supports all evaluator types, including deterministic assertion types (`contains`, `regex`, `is_json`, `equals`) and `rubrics`. See [Tests](/docs/evaluation/eval-cases/#per-test-assertions) for per-test assertions usage.
+`assertions` supports all grader types, including deterministic assertion types (`contains`, `regex`, `is_json`, `equals`) and `rubrics`. See [Tests](/docs/evaluation/eval-cases/#per-test-assertions) for per-test assertions usage.

### Assertion Includes

12 changes: 6 additions & 6 deletions apps/web/src/content/docs/docs/evaluation/examples.mdx
@@ -69,7 +69,7 @@ tests:
```
````

-## Multi-Evaluator
+## Multi-Grader

Combine a code grader and an LLM grader on the same test:

@@ -86,7 +86,7 @@ tests:
- name: json_format_validator
type: code-grader
command: [uv, run, validate_json.py]
-cwd: ./evaluators
+cwd: ./graders
- name: content_evaluator
type: llm-grader
prompt: ./graders/semantic_correctness.md
@@ -363,11 +363,11 @@ tests:
- The batch runner reads the eval YAML via `--eval` flag and outputs JSONL keyed by `id`
- Put structured data in `user.content` as objects for the runner to extract
- Use `expected_output` with object fields for structured expected output
-- Each test has its own evaluator to validate its portion of the output
+- Each test has its own grader to validate its portion of the output

## Suite-level Input

-Share a common prompt or system instruction across all tests. Suite-level `input` messages are prepended to each test's input — like suite-level `assertions` for evaluators:
+Share a common prompt or system instruction across all tests. Suite-level `input` messages are prepended to each test's input — like suite-level `assertions` for graders:

```yaml
description: Travel assistant evaluation
@@ -418,11 +418,11 @@ See the [suite-level-input example](https://github.com/EntityProcess/agentv/tree
- Show the pattern, not rigid templates
- Allow for natural language variation
- Focus on semantic correctness over exact matching
-- Evaluators handle the actual validation logic
+- Graders handle the actual validation logic

## Showcases

For complete end-to-end workflows that combine multiple features, see the showcases in [`examples/showcase/`](https://github.com/EntityProcess/agentv/tree/main/examples/showcase):

-- **[Multi-Model Benchmark](https://github.com/EntityProcess/agentv/tree/main/examples/showcase/multi-model-benchmark)** — targets matrix × weighted metrics × trials × compare workflow. Runs the same tests against multiple models, scores with weighted evaluators, measures variability, and compares results side-by-side.
+- **[Multi-Model Benchmark](https://github.com/EntityProcess/agentv/tree/main/examples/showcase/multi-model-benchmark)** — targets matrix × weighted metrics × trials × compare workflow. Runs the same tests against multiple models, scores with weighted graders, measures variability, and compares results side-by-side.
- **[Export Screening](https://github.com/EntityProcess/agentv/tree/main/examples/showcase/export-screening)** — classification eval with confusion matrix metrics and CI gating.
6 changes: 3 additions & 3 deletions apps/web/src/content/docs/docs/evaluation/rubrics.mdx
@@ -22,7 +22,7 @@ tests:
- States time complexity
```

-All strings are collected into a single rubrics evaluator automatically.
+All strings are collected into a single rubrics grader automatically.

### Full form for advanced options

@@ -120,9 +120,9 @@ score = sum(criterion_score / 10 * weight) / sum(total_weights)
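Written out, the aggregation formula above is (a direct transcription; criterion scores assumed on a 0-10 scale):

```typescript
// score = sum(criterion_score / 10 * weight) / sum(total_weights)
type Criterion = { score: number; weight: number }; // score on a 0-10 scale

function rubricScore(criteria: Criterion[]): number {
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  if (totalWeight === 0) return 0; // avoid division by zero on an empty rubric
  const weighted = criteria.reduce((sum, c) => sum + (c.score / 10) * c.weight, 0);
  return weighted / totalWeight;
}
```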

## Authoring Rubrics

-Write rubric criteria directly in `assertions`. If you want help choosing between plain assertions, deterministic evaluators, and rubric or LLM-based grading, use the `agentv-eval-writer` skill. Keep the evaluator choice driven by the criteria rather than one fixed recipe.
+Write rubric criteria directly in `assertions`. If you want help choosing between plain assertions, deterministic graders, and rubric or LLM-based grading, use the `agentv-eval-writer` skill. Keep the grader choice driven by the criteria rather than one fixed recipe.

-## Combining with Other Evaluators
+## Combining with Other Graders

Rubrics work alongside code and LLM graders:

8 changes: 4 additions & 4 deletions apps/web/src/content/docs/docs/evaluation/running-evals.mdx
@@ -75,7 +75,7 @@ agentv eval --dry-run evals/my-eval.yaml
```

:::note
-Dry-run returns mock responses that don't match evaluator output schemas. Use it only for testing harness flow, not evaluator logic.
+Dry-run returns mock responses that don't match grader output schemas. Use it only for testing harness flow, not grader logic.
:::

### Custom Output Directory
@@ -163,7 +163,7 @@ Each eval test case produces a trace with:
- **LLM call spans** (`chat <model>`) — model name, token usage (input/output/cached)
- **Tool call spans** (`execute_tool <name>`) — tool name, arguments, results (with `--otel-capture-content`)
- **Turn spans** (`agentv.turn.N`) — groups messages by conversation turn (with `--otel-group-turns`)
-- **Evaluator events** — per-grader scores attached to the root span
+- **Grader events** — per-grader scores attached to the root span

:::tip[Claude provider + trace-claude-code plugin]
When using the Claude provider, AgentV injects `CC_PARENT_SPAN_ID` and `CC_ROOT_SPAN_ID` into the Claude subprocess. If the [trace-claude-code](https://github.com/braintrustdata/braintrust-claude-plugin) plugin is installed, it attaches Claude Code CLI-level tool spans (Read, Write, Bash, etc.) as children of the AgentV eval trace, giving you full visibility into both the eval framework and the agent's internal actions.
@@ -331,14 +331,14 @@ This is the same interface that agent-orchestrated evals use — the EVAL.yaml t

## Offline Grading

-Grade existing agent sessions without re-running them. Import a transcript, then run deterministic evaluators:
+Grade existing agent sessions without re-running them. Import a transcript, then run deterministic graders:

```bash
# List sessions and import one
agentv import claude --list
agentv import claude --session-id <uuid>

-# Run evaluators against the imported transcript
+# Run graders against the imported transcript
agentv eval evals/my-eval.yaml --transcript .agentv/transcripts/claude-<id>.jsonl
```

2 changes: 1 addition & 1 deletion apps/web/src/content/docs/docs/evaluation/sdk.mdx
@@ -90,7 +90,7 @@ export default defineCodeGrader(({ trace, outputText }) => ({

`defineCodeGrader` graders are referenced in YAML with `type: code-grader` and `command: [bun, run, grader.ts]`. `defineAssertion` uses convention-based discovery instead — just place in `.agentv/assertions/` and reference by name.

-For detailed patterns, input/output contracts, and language-agnostic examples, see [Code Graders](/docs/evaluators/code-graders/).
+For detailed patterns, input/output contracts, and language-agnostic examples, see [Code Graders](/docs/graders/code-graders/).

## Programmatic API

@@ -72,5 +72,5 @@ Results appear in `.agentv/results/runs/<timestamp>/index.jsonl` with scores, re

- Learn about [eval file formats](/docs/evaluation/eval-files/)
- Configure [targets](/docs/targets/configuration/) for different providers
-- Create [custom evaluators](/docs/evaluators/custom-evaluators/)
+- Create [custom graders](/docs/graders/custom-graders/)
- If setup drifts, rerun: `agentv init`