feat(eval): add ClassifierEvaluator (pure-metadata aggregator) by ajay-kesavan · Pull Request #1674 · UiPath/uipath-python

ajay-kesavan · 2026-05-21T17:00:57Z

What

Adds run-level aggregators to the eval framework — starting with a classification aggregator that builds a confusion matrix + precision/recall/F1 across a fixed class list. Works for both coded and low-code agents.
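
As a rough illustration of the run-level math (the real aggregation is described as living in the C# backend, not in this SDK), the sketch below builds a confusion matrix and macro precision/recall/F1 from (expected, actual) pairs. The function name `classification_report` and the nine sample pairs are hypothetical; only the class names book / cancel / reschedule come from the sample described in this PR.

```python
from collections import Counter


def classification_report(pairs, classes):
    """Build a confusion matrix and per-class precision/recall/F1.

    pairs: iterable of (expected, actual) labels.
    classes: fixed class list; labels outside it are skipped (an assumption).
    """
    counts = Counter((e, a) for e, a in pairs if e in classes and a in classes)
    # matrix[i][j] = datapoints with expected classes[i] and actual classes[j]
    matrix = [[counts[(e, a)] for a in classes] for e in classes]

    per_class = {}
    for i, label in enumerate(classes):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                 # expected this label, got another
        fp = sum(row[i] for row in matrix) - tp  # got this label, expected another
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}

    macro_f1 = sum(m["f1"] for m in per_class.values()) / len(classes)
    return matrix, per_class, macro_f1


CLASSES = ["book", "cancel", "reschedule"]
# Hypothetical pairs: 9 datapoints, 6 classified correctly.
PAIRS = [
    ("book", "book"), ("book", "book"), ("book", "cancel"),
    ("cancel", "cancel"), ("cancel", "cancel"), ("cancel", "reschedule"),
    ("reschedule", "reschedule"), ("reschedule", "reschedule"), ("reschedule", "book"),
]

matrix, per_class, macro_f1 = classification_report(PAIRS, CLASSES)
print(matrix)              # [[2, 1, 0], [0, 2, 1], [1, 0, 2]]
print(round(macro_f1, 3))  # 0.667
```

With these made-up pairs every class has precision and recall of 2/3, so the macro F1 is also 2/3.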

Architectural decisions (full record in Confluence: Design for Precision and Recall §5)

1. Aggregator config lives ON the evaluator, not as a separate evaluator.
We rejected a standalone Classifier evaluator (with a source_evaluator ID pointer) because it forced users to add two evaluators and copy an opaque ID, and the pointer broke during low-code→coded conversion. Config now lives on ExactMatch.aggregators — single source of truth, no cross-evaluator FK, travels with the evaluator JSON file.

2. Transport via per-datapoint justification, not the evaluator snapshot.
The snapshot mechanism is coded-only (low-code uses a different entity). The justification is persisted by BOTH pipelines, so it's the portable channel. ExactMatchJustification / the legacy details string carry the aggregators list per datapoint (identical content; deduped downstream).
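
Because every datapoint carries the same aggregators list, a downstream consumer has to collapse the copies before computing anything. The sketch below shows one way that dedup could look, written in Python for illustration only (the actual post-pass is described as C#). The helper name `collect_aggregators` and the `"type"` key are assumptions, not part of this PR.

```python
import json


def collect_aggregators(justifications):
    """Return distinct aggregator specs found across per-datapoint justifications.

    Each item is a dict (coded path), a JSON string (legacy path), or None.
    """
    seen = {}
    for raw in justifications:
        if raw is None:
            continue
        if isinstance(raw, str):
            try:
                payload = json.loads(raw)
            except json.JSONDecodeError:
                continue  # a plain-text legacy justification carries no aggregators
        else:
            payload = raw
        if not isinstance(payload, dict):
            continue
        for spec in payload.get("aggregators") or []:
            # Canonical JSON is the dedup key, so key order does not matter.
            seen.setdefault(json.dumps(spec, sort_keys=True), spec)
    return list(seen.values())


spec = {"type": "classification", "classes": ["book", "cancel", "reschedule"]}
justifications = [
    {"expected": "book", "actual": "book", "aggregators": [spec]},
    json.dumps({"expected": "cancel", "actual": "book", "aggregators": [spec]}),
    None,
    "free-form legacy text",
]
distinct = collect_aggregators(justifications)
print(len(distinct))  # 1
```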

Changes in this PR (SDK)

ExactMatchEvaluatorConfig.aggregators (optional) + ExactMatchJustification.aggregators; evaluate() embeds it.
New _aggregators.py: AggregatorSpec / ClassificationAggregatorSpec.
LegacyExactMatchEvaluator (low-code): gains an aggregators field; emits a JSON-string details of {expected, actual, aggregators} (legacy J=str, so the string passes through _serialize_justification verbatim → lands in EvalScore.Justification on the C# side).
Reverted the _build_evaluator_snapshot aggregators extension (no longer the transport).
Deleted the old standalone ClassifierEvaluator + EvaluatorType.CLASSIFIER.
Version 2.10.70 → 2.10.72.
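
To make the shape of the change concrete, here is a minimal sketch using plain dataclasses as stand-ins for the SDK's Pydantic models. The class names loosely follow the ones listed above, but the `type` discriminator value, the 1.0/0.0 scoring, and the free-standing `evaluate` function are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass


@dataclass
class ClassificationAggregatorSpec:
    classes: list[str]
    type: str = "classification"  # hypothetical discriminator value


@dataclass
class ExactMatchConfig:
    # Optional, default None: existing evaluator files stay valid.
    aggregators: list[ClassificationAggregatorSpec] | None = None


@dataclass
class ExactMatchJustification:
    expected: str
    actual: str
    aggregators: list[ClassificationAggregatorSpec] | None = None


def evaluate(config, expected, actual):
    """Score one datapoint and copy the config's aggregators into the justification."""
    score = 1.0 if expected == actual else 0.0  # assumed exact-match scoring
    return score, ExactMatchJustification(expected, actual, config.aggregators)


plain_score, plain = evaluate(ExactMatchConfig(), "book", "book")
cfg = ExactMatchConfig([ClassificationAggregatorSpec(["book", "cancel", "reschedule"])])
agg_score, with_agg = evaluate(cfg, "book", "cancel")

print(asdict(plain)["aggregators"])     # None
print(asdict(with_agg)["aggregators"])  # [{'classes': [...], 'type': 'classification'}]
```

The first call models the compatibility case: with no aggregators configured, nothing extra appears in the justification.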

Compatibility

Optional config field (default None) → existing evaluators unchanged. When unset, no aggregators in the justification and the downstream pass no-ops.

Test plan

mypy clean
ExactMatch round-trips config with/without aggregators
End-to-end with coded ExactMatch + classification aggregator
End-to-end with low-code (legacy) ExactMatch + classification aggregator

Adds a new evaluator type whose role is to carry a `classes` list and a `source_evaluator` name to downstream consumers. It does not compute classification metrics per datapoint. That work moves to the Studio Web C# backend, which reads each datapoint's agent output and the source evaluator's expected label after the per-datapoint loop finishes, scans the output for each configured class, and builds the confusion matrix.

The per-datapoint evaluate() returns score=0.0 with a ClassifierJustification(classes, source_evaluator) details payload. This payload survives the existing CLI -> backend wire path via StudioWebProgressReporter._serialize_justification (json.dumps of the model_dump), arriving in the backend as a JSON string inside CodedEvaluatorScore.Justification where the C# layer can read it.

Replaces the design in earlier draft PRs #1669 and #1663: the SDK no longer owns the dataset-level computation. The pure-config approach is ~50 LOC instead of ~1500 LOC of dataset-evaluator framework + worker workflow + factory + child workflow plumbing.

Files:
- src/uipath/eval/evaluators/classifier_evaluator.py: new (~90 LOC)
- src/uipath/eval/evaluators/__init__.py: re-export + EVALUATORS list
- src/uipath/eval/evaluators/evaluator.py: discriminator + Union entry
- src/uipath/eval/models/models.py: EvaluatorType.CLASSIFIER
- tests/evaluators/test_classifier_evaluator.py: 9 unit tests, all passing

Verified:
- pytest tests/evaluators tests/cli/eval --no-cov -> 824 passed
- ruff check / ruff format / mypy -> clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A minimal 3-class intent classification agent (book / cancel / reschedule) that exercises the new ClassifierEvaluator end-to-end via `uipath eval`. Mirrors the wire shape Studio Web will see once the C# backend and frontend PRs land, so SDK changes can be validated standalone before the full stack is brought up.

Layout:
- main.py: keyword classifier returning {"intent": "..."}
- evaluations/eval-sets/main.json
- evaluations/evaluators/intent_match.json: per-datapoint ExactMatch on .intent
- evaluations/evaluators/intent_classifier.json: new uipath-classifier with classes + sourceEvaluator
- README.md: Path A (SDK CLI) + Path B (Studio Web) instructions

Each datapoint has `evaluationCriterias.intent_classifier: {}` (the runtime skips evaluators that aren't keyed there). 6/9 datapoints are correctly classified by design; the resulting (expected, actual) pairs flow through the existing CLI -> backend wire path inside the classifier's justification payload as classes/source_evaluator metadata.

Verified live:
- ExactMatch averages to ≈0.67 (6/9 correct).
- ClassifierEvaluator emits {"expected":"","actual":"","classes":[...], "source_evaluator":"intent_match"} per datapoint.
- Plugging the (expected, actual) pairs from the resulting output into the same confusion-matrix math the C# helper implements yields macro F1 of 0.667 on this fixture, which is the number Studio Web's Aggregations panel would render once the backend pipeline is live.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pydantic's generic resolution leaves T = typing.Any when a TypeVar is parameterized with its own bound (BaseEvaluationCriteria here), so BaseEvaluator[BaseEvaluationCriteria, ...] tripped the runtime's "X must be a subclass of BaseEvaluationCriteria" guard at load time:

Failed to create evaluator from file 'evaluations/evaluators/classifier-*.json': typing.Any must be a subclass of BaseEvaluationCriteria.

Introduce an empty ClassifierEvaluationCriteria(BaseEvaluationCriteria) subclass and parameterize Config + Evaluator with it. Mirrors how every other built-in evaluator (ExactMatch via OutputEvaluationCriteria, etc.) provides a concrete criteria type even when no per-datapoint fields are needed.
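
The guard behavior described in this commit can be pictured with the sketch below. It does not reproduce Pydantic's generic resolution; it only simulates a guard that receives an already-resolved type, so `check_criteria_type` and the stub base class are hypothetical stand-ins rather than the SDK's code.

```python
from typing import Any


class BaseEvaluationCriteria:
    """Stub for the SDK base class (assumption: the real one is a Pydantic model)."""


class ClassifierEvaluationCriteria(BaseEvaluationCriteria):
    """Empty concrete subclass: no per-datapoint fields are needed."""


def check_criteria_type(resolved):
    """Reject anything that is not a BaseEvaluationCriteria subclass."""
    if not (isinstance(resolved, type) and issubclass(resolved, BaseEvaluationCriteria)):
        raise TypeError(f"{resolved} must be a subclass of BaseEvaluationCriteria")
    return resolved


# A concrete subclass passes the guard.
accepted = check_criteria_type(ClassifierEvaluationCriteria)

# typing.Any, which is what the commit says the TypeVar resolved to, is rejected.
try:
    check_criteria_type(Any)
    rejected_message = None
except TypeError as exc:
    rejected_message = str(exc)

print(accepted.__name__)
print(rejected_message)
```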

Replaces the standalone ClassifierEvaluator with an `aggregators` config field on per-datapoint evaluators (ExactMatch first). Run-level classification metrics are now driven by the host evaluator's config, not by a separate evaluator with a source-evaluator ID reference.

Design rationale (see Confluence "Design for Precision and Recall" §5.2): the standalone evaluator forced users to add TWO evaluators and copy an opaque ID between them. Moving aggregator config onto the evaluator that already emits the labels keeps the source of truth in one place and makes the JSON file portable across conversions (e.g. low-code -> coded).

- New module `_aggregators.py` with AggregatorSpec / ClassificationAggregatorSpec
- ExactMatchEvaluatorConfig gains optional `aggregators: list[AggregatorSpec] | None`. The Python runtime ignores the field; it's metadata for the downstream C# aggregation pass.
- `_progress_reporter.py:_build_evaluator_snapshot` now also emits `aggregators` so the field flows into EvaluatorRun.EvaluatorSnapshot and the C# layer can discover it without consulting the eval set definition file separately. Bug fix: previously the builder only emitted prompt+model (LLM-judge only), so for ExactMatch the dict was empty and the snapshot ended up null in the wire payload.
- ClassifierEvaluator, ClassifierEvaluationCriteria, ClassifierJustification, ClassifierEvaluatorConfig: all deleted.
- EvaluatorType.CLASSIFIER enum value removed.
- Discriminator union in evaluator.py drops the Classifier branch.

Version bump 2.10.70 -> 2.10.72 (the previous .71 was an unused dev cache-bust). The new ExactMatch.aggregators field is a public API change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

sonarqubecloud · 2026-05-24T02:21:40Z

Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 90%)

See analysis details on SonarQube Cloud

Switches aggregator transport from the evaluator snapshot to the per-datapoint justification (the snapshot path was coded-only; the justification path works for both coded and low-code).

- ExactMatchJustification gains an optional `aggregators` field; evaluate() embeds config.aggregators into the justification it already emits.
- Reverts the _build_evaluator_snapshot extension (no longer the transport).

Design: aggregator config lives on the evaluator (single source of truth, no cross-evaluator FK), travels per-datapoint in the justification, and is computed once by the C# post-pass. See Confluence "Design for Precision and Recall" §5.

uv.lock: sync uipath 2.10.70 -> 2.10.72 (version bumped for the public ExactMatch.aggregators field + to invalidate uv's build cache during dev).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds run-level aggregator support to the legacy (low-code) ExactMatch so it reaches parity with the coded ExactMatch.

- LegacyExactMatchEvaluatorConfig / the evaluator gain an optional `aggregators` field (top-level, aliased "aggregators"), deserialized from the legacy evaluator JSON authored by the low-code editor.
- evaluate() emits a JSON-string `details` of {expected, actual, aggregators} when aggregators are configured. Legacy justification is typed `str` (J=str), so _serialize_justification passes the string through verbatim; it lands in EvalScore.Justification on the C# side, where the low-code aggregation pass reads it.

No behavioral change when aggregators is unset (details stays None).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
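
The legacy behavior described in this commit might look roughly like the sketch below. The helper name `legacy_details` is hypothetical, the `"type"` key is an assumption, and treating only `None` as "unset" is a guess at the intended semantics.

```python
import json


def legacy_details(expected, actual, aggregators=None):
    """Return the JSON-string details, or None when no aggregators are configured."""
    if aggregators is None:
        return None  # unchanged behavior for existing evaluators
    return json.dumps(
        {"expected": expected, "actual": actual, "aggregators": aggregators}
    )


spec = {"type": "classification", "classes": ["book", "cancel", "reschedule"]}

unset = legacy_details("book", "book")
configured = legacy_details("book", "cancel", [spec])

# A consumer that receives the string verbatim can parse it back.
parsed = json.loads(configured)
print(unset)                  # None
print(parsed["aggregators"])  # [{'type': 'classification', 'classes': [...]}]
```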

Chibionos

PR #1674 Review — eval aggregators (`ExactMatch.aggregators`)

Verdict: 🔴 Request changes — does not work as shipped.

The design pivoted mid-PR (commit e92e734) from a standalone ClassifierEvaluator to run-level aggregators on ExactMatch. The production SDK code reflects the new design and is architecturally sound — but the test file, the sample, and the README were left behind referencing the deleted design.

I verified everything empirically on the PR head (checked out, ran the suites, ran the sample):

pytest tests/evaluators → 1 error, 0 tests collected (import crash — C1)
with the broken file removed → 3 FAILED, 812 passed (H2)
sample README "Path A" uipath eval run → crashes (C2)

2 critical / 3 high / 2 medium / 1 low. Findings inline.

✅ Architecture (verified OK — not blocking)

The SDK plumbing itself is correct: _serialize_justification does json.dumps(model_dump()) for BaseModel justifications and passes legacy str through verbatim; validate_justification re-validates the dict back into ExactMatchJustification; the aggregator round-trip is alias-safe (populate_by_name=True, no multi-word fields). The shipped code works — the scaffolding is stale.
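
The two serialization branches described above can be sketched as follows. This is a hypothetical mirror, not the SDK's `_serialize_justification`: the dataclass only mimics a Pydantic model's `model_dump()`, and the `None` branch is an assumption.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass


@dataclass
class StructuredJustification:
    """Stand-in for a Pydantic justification model (only model_dump is mimicked)."""

    expected: str
    actual: str
    aggregators: list | None = None

    def model_dump(self):
        return asdict(self)


def serialize_justification(justification):
    """Structured models become JSON; legacy strings pass through verbatim."""
    if justification is None:
        return None
    if isinstance(justification, str):
        return justification
    return json.dumps(justification.model_dump())


original = StructuredJustification("book", "cancel", [{"classes": ["book", "cancel"]}])
coded_wire = serialize_justification(original)
legacy_wire = serialize_justification('{"expected": "book", "actual": "book"}')

# Re-validating the decoded dict recovers an equal object (the round trip).
restored = StructuredJustification(**json.loads(coded_wire))
print(restored == original)  # True
print(legacy_wire)           # {"expected": "book", "actual": "book"}
```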

🎚️ Slop meter

28/100 slop · 0/100 tunnel-vision. ⚠️ The meter's behavior_without_test=0 is a false negative — it sees a test file present but can't tell it tests deleted code (see H3).

🤖 Codex recheck

Could not run — local Codex auth token expired.

🔵 Before merge

Branch is CONFLICTING and bumps 2.10.70 → 2.10.72 (also collides with #1632's bump). Rebase on main.

Adversarial + deep-review pass; all claims verified on the PR head.

Chibionos · 2026-06-02T01:23:32Z

 [project]
 name = "uipath"
-version = "2.10.70"
+version = "2.10.72"


🔵 L1 — version bump + conflict

2.10.70 → 2.10.72; the original commit message notes that .71 was an unused dev cache-bust. The branch is CONFLICTING and this line will collide with #1632's bump (→ 2.10.68). Rebase before merge.

Ack — handing the rebase + version-bump conflict back to @ajay-kesavan to resolve locally. Leaving this thread open until the rebase lands so the conversation tracks the final version number.

- Replace stale ClassifierEvaluator tests with real ExactMatch and LegacyExactMatch aggregator round-trip coverage (C1, H3).
- Rebuild classifier_demo to attach aggregators to intent_match ExactMatch config instead of a separate dead evaluator (C2, M1, H1).
- Pin ExactMatchJustification in existing justification-type tests (H2).
- Modernize typing: Optional[X] -> X | None (M2).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

This PR extends the eval framework to support run-level aggregators (starting with a classification aggregator) by attaching aggregator config to ExactMatch and carrying it through to downstream consumers via per-datapoint justifications (coded: structured justification model; low-code/legacy: JSON-string details).

Changes:

Add aggregator spec models (ClassificationAggregatorSpec) and expose them from uipath.eval.evaluators.
Extend coded ExactMatchEvaluator to accept aggregators in config and emit them in ExactMatchJustification; extend legacy LegacyExactMatchEvaluator to emit aggregator metadata via JSON-string details.
Add tests pinning the wire shape and add an end-to-end sample project; bump SDK version to 2.10.72.

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
packages/uipath/uv.lock	Bumps locked package version to 2.10.72.
packages/uipath/pyproject.toml	Bumps project version to 2.10.72.
packages/uipath/src/uipath/eval/evaluators/_aggregators.py	Introduces aggregator spec model(s) (classification).
packages/uipath/src/uipath/eval/evaluators/exact_match_evaluator.py	Adds `aggregators` to config and new `ExactMatchJustification` carrying aggregator metadata.
packages/uipath/src/uipath/eval/evaluators/legacy_exact_match_evaluator.py	Adds optional `aggregators` field and emits JSON-string `details` when configured.
packages/uipath/src/uipath/eval/evaluators/__init__.py	Re-exports aggregator spec types.
packages/uipath/tests/evaluators/test_exact_match_aggregators.py	New tests validating aggregator propagation and wire-format round-trip.
packages/uipath/tests/evaluators/test_evaluator_schemas.py	Updates justification schema expectations for ExactMatch.
packages/uipath/tests/evaluators/test_evaluator_methods.py	Updates justification-type assertions for ExactMatch.
packages/uipath/samples/classifier_demo/uipath.json	Adds sample function manifest.
packages/uipath/samples/classifier_demo/README.md	Documents the end-to-end demo flow for classification aggregation.
packages/uipath/samples/classifier_demo/pyproject.toml	Adds sample project metadata/deps.
packages/uipath/samples/classifier_demo/main.py	Adds sample keyword classifier agent.
packages/uipath/samples/classifier_demo/bindings.json	Adds sample bindings.
packages/uipath/samples/classifier_demo/evaluations/evaluators/intent_match.json	Sample evaluator JSON showing ExactMatch + classification aggregator config.
packages/uipath/samples/classifier_demo/evaluations/eval-sets/main.json	Sample eval set fixture for the demo.


@@ -0,0 +1,42 @@
+"""Tiny intent-classification agent for the ClassifierEvaluator demo.


+[project]
+name = "classifier-demo"
+version = "0.0.1"
+description = "Tiny intent-classification agent that exercises the new ClassifierEvaluator end-to-end via `uipath eval`."


github-actions bot added the test:uipath-langchain (triggers tests in the uipath-langchain-python repository) and test:uipath-integrations labels May 21, 2026

ajay-kesavan marked this pull request as ready for review May 21, 2026 17:34

This was referenced May 22, 2026

feat(eval): add dataset-level evaluator framework with precision/recall/f-score #1669

Draft

feat(eval): classification evaluator schemas + sample projects + e2e tests #1663

Draft

ajay-kesavan and others added 2 commits May 27, 2026 09:28

Chibionos reviewed Jun 2, 2026


Chibionos mentioned this pull request Jun 2, 2026

feat(eval): add --simulation flag to uipath debug #1632

Open

2 tasks

Copilot AI review requested due to automatic review settings June 17, 2026 22:44

Copilot started reviewing on behalf of ajay-kesavan June 17, 2026 22:44 View session

Copilot AI reviewed Jun 17, 2026


ajay-kesavan closed this Jun 19, 2026

ajay-kesavan wants to merge 7 commits into main from feat/eval-classifier-evaluator

3 participants
