Python: feat(evals): add ground_truth support for similarity evaluator #5234
Open
chetantoshniwal wants to merge 3 commits into main
Conversation
- Include expected_output as ground_truth in Foundry JSONL dataset rows (see the example row below)
- Add ground_truth to item schema and data mapping for similarity evaluator
- Add expected_output parameter to evaluate_workflow
- Add similarity Pattern 3 to evaluate_agent and evaluate_workflow samples
- Add tests for ground_truth in dataset, schema, and evaluate_workflow
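For concreteness, a dataset row with ground truth might look like the following; the field names match the mapping code quoted later on this page, while the values are made up:

```python
import json

# Illustrative Foundry JSONL dataset row produced by this change; only the
# field names are taken from the PR, the values are invented.
row = {
    "query": "What is the capital of France?",
    "response": "The capital of France is Paris.",
    "ground_truth": "Paris",  # copied from EvalItem.expected_output
}
print(json.dumps(row))
```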
Python Test Coverage Report • Python Unit Test Overview
Pull request overview
Adds “ground truth” support to Foundry-backed similarity evaluation by mapping EvalItem.expected_output into the Foundry JSONL field ground_truth, and exposing expected_output on evaluate_workflow() to enable reference-based evaluators (e.g., SIMILARITY) for workflows.
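A minimal usage sketch of the new parameter, assuming the call shape implied by the description (`workflow` and `queries` are stand-ins; the real signature may differ):

```python
# Sketch only: evaluate_workflow() and expected_output come from this PR, but
# the full signature is not shown here, so treat the other arguments as assumptions.
results = await evaluate_workflow(
    workflow,                                    # the workflow under test
    queries=["What is the capital of France?"],  # run-and-evaluate path
    expected_output="Paris",                     # stamped onto the overall workflow EvalItems
)
```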
Changes:
- Extend Foundry JSONL item schema + data mapping to include `ground_truth` for similarity evaluation (a guess at the resulting mapping follows this list).
- Add `expected_output` parameter to `evaluate_workflow()` and stamp it onto overall workflow `EvalItem`s when running via `queries`.
- Update Foundry eval samples to include a similarity/ground-truth pattern; add tests covering schema/mapping/dataset output.
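The PR does not show `_build_testing_criteria`'s output, so the following is only a guess at what the similarity grader's data mapping might look like, using the `{{item.field}}` template style of OpenAI-style evals:

```python
# Hypothetical testing criterion for the similarity evaluator; the structure
# actually produced by _build_testing_criteria may differ.
similarity_criterion = {
    "name": "similarity",
    "data_mapping": {
        "query": "{{item.query}}",
        "response": "{{item.response}}",
        "ground_truth": "{{item.ground_truth}}",  # the mapping this PR adds
    },
}
```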
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| python/packages/foundry/agent_framework_foundry/_foundry_evals.py | Adds ground-truth evaluator mapping + schema support (item schema sketched below); stamps expected_output into JSONL ground_truth. |
| python/packages/core/agent_framework/_evaluation.py | Adds expected_output to evaluate_workflow() and stamps it onto overall items in the run+evaluate path. |
| python/packages/foundry/tests/test_foundry_evals.py | Adds tests for ground-truth mapping/schema and workflow expected_output stamping/validation. |
| python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_workflow_sample.py | Adds Pattern 3 sample demonstrating similarity evaluation with ground truth. |
| python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_agent_sample.py | Adds Pattern 3 similarity sample with expected_output. |
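Per the table above, `_foundry_evals.py` also extends the item schema. `_build_item_schema`'s exact output is not shown either; as an assumption, the ground-truth flag probably just adds a string property along these lines:

```python
# Illustrative JSON-schema fragment for dataset rows when has_ground_truth is
# True; the real _build_item_schema output may differ.
item_schema = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "response": {"type": "string"},
        "ground_truth": {"type": "string"},  # added only when some item sets expected_output
    },
    "required": ["query", "response"],
}
```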
Comments suppressed due to low confidence (1)
python/packages/foundry/agent_framework_foundry/_foundry_evals.py:717
- Similarity (and any evaluator in _GROUND_TRUTH_EVALUATORS) always gets a data_mapping for ground_truth, but there’s no local validation that all items provide expected_output/ground_truth. If a caller requests similarity without expected_output, this will still create an eval definition and fail provider-side with a less actionable error. Add a preflight check here to raise a clear ValueError when a ground-truth-required evaluator is selected and any item is missing expected_output (or alternatively filter out those evaluators similarly to _filter_tool_evaluators).
```python
        # Resolve evaluators with auto-detection
        resolved = _resolve_default_evaluators(self._evaluators, items=items)
        # Filter tool evaluators if items don't have tools
        resolved = _filter_tool_evaluators(resolved, items)
        # Standard JSONL dataset path
        return await self._evaluate_via_dataset(items, resolved, eval_name)

    # -- Internal evaluation paths --

    async def _evaluate_via_dataset(
        self,
        items: Sequence[EvalItem],
        evaluators: list[str],
        eval_name: str,
    ) -> EvalResults:
        """Evaluate using JSONL dataset upload path."""
        dicts: list[dict[str, Any]] = []
        for item in items:
            # Build JSONL dict directly from split_messages + converter
            # to avoid splitting the conversation twice.
            effective_split = item.split_strategy or self._conversation_split
            query_msgs, response_msgs = item.split_messages(effective_split)
            query_text = " ".join(m.text for m in query_msgs if m.role == "user" and m.text).strip()
            response_text = " ".join(m.text for m in response_msgs if m.role == "assistant" and m.text).strip()
            d: dict[str, Any] = {
                "query": query_text,
                "response": response_text,
                "query_messages": AgentEvalConverter.convert_messages(query_msgs),
                "response_messages": AgentEvalConverter.convert_messages(response_msgs),
            }
            if item.tools:
                d["tool_definitions"] = [
                    {"name": t.name, "description": t.description, "parameters": t.parameters()} for t in item.tools
                ]
            if item.context:
                d["context"] = item.context
            if item.expected_output:
                d["ground_truth"] = item.expected_output
            dicts.append(d)

        has_context = any("context" in d for d in dicts)
        has_ground_truth = any("ground_truth" in d for d in dicts)
        has_tools = any("tool_definitions" in d for d in dicts)

        eval_obj = await self._client.evals.create(
            name=eval_name,
            data_source_config={  # type: ignore[arg-type] # pyright: ignore[reportArgumentType]
                "type": "custom",
                "item_schema": _build_item_schema(
                    has_context=has_context, has_ground_truth=has_ground_truth, has_tools=has_tools
                ),
                "include_sample_schema": True,
            },
            testing_criteria=_build_testing_criteria(  # type: ignore[arg-type] # pyright: ignore[reportArgumentType]
                evaluators,
                self._model,
                include_data_mapping=True,
            ),
        )
```
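The suppressed comment above asks for a preflight check instead of a provider-side failure. A minimal sketch of such a check, reusing `EvalItem` and `_GROUND_TRUTH_EVALUATORS` from the excerpt (the helper itself is hypothetical):

```python
from typing import Sequence

def _require_ground_truth(evaluators: Sequence[str], items: Sequence[EvalItem]) -> None:
    """Hypothetical preflight: raise a clear error when a reference-based
    evaluator is selected but some item lacks expected_output."""
    needed = [e for e in evaluators if e in _GROUND_TRUTH_EVALUATORS]
    missing = [i for i, item in enumerate(items) if not item.expected_output]
    if needed and missing:
        raise ValueError(
            f"Evaluators {needed} require expected_output on every item, "
            f"but items at indices {missing} have none."
        )
```

Called before `_evaluate_via_dataset`, this fails fast with an actionable message rather than surfacing a provider error after dataset upload.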
Outdated review comments (resolved):
- python/packages/foundry/agent_framework_foundry/_foundry_evals.py
- python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_workflow_sample.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
alliscode approved these changes on Apr 13, 2026
Motivation and Context
#5135