
Python: feat(evals): add ground_truth support for similarity evaluator #5234

Open
chetantoshniwal wants to merge 3 commits into main from feature/similarity-ground-truth

Conversation

@chetantoshniwal
Contributor

  • Include expected_output as ground_truth in Foundry JSONL dataset rows
  • Add ground_truth to item schema and data mapping for similarity evaluator
  • Add expected_output parameter to evaluate_workflow
  • Add similarity Pattern 3 to evaluate_agent and evaluate_workflow samples
  • Add tests for ground_truth in dataset, schema, and evaluate_workflow
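The first two bullets can be pictured with a small standalone sketch (not the PR's actual code): `EvalItem.expected_output` is written to the Foundry JSONL field `ground_truth`. The row keys (`query`, `response`, `ground_truth`) come from the PR; the plain dict below is a stand-in for an `EvalItem`.

```python
import json

# Stand-in for an EvalItem with an expected_output set.
item = {"query": "What is 2 + 2?", "response": "4", "expected_output": "4"}

# Build the Foundry JSONL dataset row; expected_output maps to ground_truth.
row = {"query": item["query"], "response": item["response"]}
if item.get("expected_output"):
    row["ground_truth"] = item["expected_output"]

jsonl_line = json.dumps(row)
print(jsonl_line)
```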

Motivation and Context

#5135

Description

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

Copilot AI review requested due to automatic review settings April 13, 2026 22:07
@chetantoshniwal chetantoshniwal changed the title feat(evals): add ground_truth support for similarity evaluator .NET feat(evals): add ground_truth support for similarity evaluator Apr 13, 2026
@github-actions github-actions bot changed the title .NET feat(evals): add ground_truth support for similarity evaluator Python: .NET feat(evals): add ground_truth support for similarity evaluator Apr 13, 2026
moonbox3 (Contributor) commented Apr 13, 2026

Python Test Coverage

Python Test Coverage Report

| File | Stmts | Miss | Cover | Missing |
|------|-------|------|-------|---------|
| packages/core/agent_framework/_evaluation.py | 655 | 72 | 89% | 164, 172, 485, 487, 615, 618, 697–699, 704, 741–744, 800–801, 804, 810–812, 816, 849–851, 907, 943, 955–957, 962, 986–991, 1084, 1162–1163, 1165–1169, 1175, 1214, 1562, 1564, 1572, 1582, 1586, 1631, 1649–1650, 1728, 1730, 1736, 1744, 1759, 1797, 1803–1807, 1839, 1870–1871, 1873, 1898–1899, 1904 |
| packages/foundry/agent_framework_foundry/_foundry_evals.py | 249 | 4 | 98% | 444, 449, 629, 694 |
| TOTAL | 27220 | 3183 | 88% | |

Python Unit Test Overview

| Tests | Skipped | Failures | Errors | Time |
|-------|---------|----------|--------|------|
| 5449 | 20 💤 | 0 ❌ | 0 🔥 | 1m 23s ⏱️ |

Contributor

Copilot AI left a comment


Pull request overview

Adds “ground truth” support to Foundry-backed similarity evaluation by mapping EvalItem.expected_output into the Foundry JSONL field ground_truth, and exposing expected_output on evaluate_workflow() to enable reference-based evaluators (e.g., SIMILARITY) for workflows.

Changes:

  • Extend Foundry JSONL item schema + data mapping to include ground_truth for similarity evaluation.
  • Add expected_output parameter to evaluate_workflow() and stamp it onto overall workflow EvalItems when running via queries.
  • Update Foundry eval samples to include a similarity/ground-truth pattern; add tests covering schema/mapping/dataset output.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
|------|-------------|
| python/packages/foundry/agent_framework_foundry/_foundry_evals.py | Adds ground-truth evaluator mapping + schema support; stamps expected_output into JSONL ground_truth. |
| python/packages/core/agent_framework/_evaluation.py | Adds expected_output to evaluate_workflow() and stamps it onto overall items in the run+evaluate path. |
| python/packages/foundry/tests/test_foundry_evals.py | Adds tests for ground-truth mapping/schema and workflow expected_output stamping/validation. |
| python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_workflow_sample.py | Adds Pattern 3 sample demonstrating similarity evaluation with ground truth. |
| python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_agent_sample.py | Adds Pattern 3 similarity sample with expected_output. |
Comments suppressed due to low confidence (1)

python/packages/foundry/agent_framework_foundry/_foundry_evals.py:717

  • Similarity (and any evaluator in _GROUND_TRUTH_EVALUATORS) always gets a data_mapping for ground_truth, but there’s no local validation that all items provide expected_output/ground_truth. If a caller requests similarity without expected_output, this will still create an eval definition and fail provider-side with a less actionable error. Add a preflight check here to raise a clear ValueError when a ground-truth-required evaluator is selected and any item is missing expected_output (or alternatively filter out those evaluators similarly to _filter_tool_evaluators).
        # Resolve evaluators with auto-detection
        resolved = _resolve_default_evaluators(self._evaluators, items=items)
        # Filter tool evaluators if items don't have tools
        resolved = _filter_tool_evaluators(resolved, items)

        # Standard JSONL dataset path
        return await self._evaluate_via_dataset(items, resolved, eval_name)

    # -- Internal evaluation paths --

    async def _evaluate_via_dataset(
        self,
        items: Sequence[EvalItem],
        evaluators: list[str],
        eval_name: str,
    ) -> EvalResults:
        """Evaluate using JSONL dataset upload path."""
        dicts: list[dict[str, Any]] = []
        for item in items:
            # Build JSONL dict directly from split_messages + converter
            # to avoid splitting the conversation twice.
            effective_split = item.split_strategy or self._conversation_split
            query_msgs, response_msgs = item.split_messages(effective_split)

            query_text = " ".join(m.text for m in query_msgs if m.role == "user" and m.text).strip()
            response_text = " ".join(m.text for m in response_msgs if m.role == "assistant" and m.text).strip()

            d: dict[str, Any] = {
                "query": query_text,
                "response": response_text,
                "query_messages": AgentEvalConverter.convert_messages(query_msgs),
                "response_messages": AgentEvalConverter.convert_messages(response_msgs),
            }
            if item.tools:
                d["tool_definitions"] = [
                    {"name": t.name, "description": t.description, "parameters": t.parameters()} for t in item.tools
                ]
            if item.context:
                d["context"] = item.context
            if item.expected_output:
                d["ground_truth"] = item.expected_output
            dicts.append(d)

        has_context = any("context" in d for d in dicts)
        has_ground_truth = any("ground_truth" in d for d in dicts)
        has_tools = any("tool_definitions" in d for d in dicts)

        eval_obj = await self._client.evals.create(
            name=eval_name,
            data_source_config={  # type: ignore[arg-type]  # pyright: ignore[reportArgumentType]
                "type": "custom",
                "item_schema": _build_item_schema(
                    has_context=has_context, has_ground_truth=has_ground_truth, has_tools=has_tools
                ),
                "include_sample_schema": True,
            },
            testing_criteria=_build_testing_criteria(  # type: ignore[arg-type]  # pyright: ignore[reportArgumentType]
                evaluators,
                self._model,
                include_data_mapping=True,
            ),
        )

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@chetantoshniwal chetantoshniwal changed the title Python: .NET feat(evals): add ground_truth support for similarity evaluator Python: feat(evals): add ground_truth support for similarity evaluator Apr 14, 2026