feat(eval): classification evaluator schemas + sample projects + e2e tests by ajay-kesavan · Pull Request #1663 · UiPath/uipath-python

ajay-kesavan · 2026-05-20T00:48:06Z

Summary

Completes the classification evaluator feature shipped in #1397 by adding the three pieces that PR didn't carry:

Generated type schemas — BinaryClassificationEvaluator.json and MulticlassClassificationEvaluator.json under packages/uipath/src/uipath/eval/evaluators_types/, produced by python -m uipath.eval.evaluators_types.generate_types. These are the machine-readable schemas external tooling (Flow UI evaluator picker, uip maestro flow eval) uses to know each evaluator's config / criteria / justification shape.
Sample projects under packages/uipath/samples/:
- binary_classification_agent/ — rule-based spam/ham classifier wired to the binary classification evaluator with metric_type=precision. Eval set is designed so 4/5 datapoints pass but precision is 2/3 because of one deliberate false positive — demonstrates the dataset-level metric diverging from a simple per-row pass rate.
- multiclass_classification_simple/ — rule-based 3-class router (payments / support / spam) wired to the multiclass classification evaluator with averaging=macro. Eval set forces a misroute that hurts both payments precision and support recall, giving macro F1 = (0.8 + 0.8 + 1.0) / 3.
End-to-end test at packages/uipath/tests/cli/eval/test_classification_samples_e2e.py — loads each sample's eval set, wires its main.py into a stand-in runtime, calls evaluate(), and asserts both the per-row scores and the aggregated metric produced by reduce_scores. Locks in the dataset-level math.
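The gap between the per-row pass rate and the dataset-level metric can be checked by hand. The sketch below uses hypothetical (expected, predicted) rows chosen only to match the counts stated above (five rows, one false positive); the actual rows in the sample eval sets may differ.

```python
from fractions import Fraction

# Hypothetical rows (expected, predicted) that match the stated counts;
# they are not copied from the sample eval set.
rows = [
    ("spam", "spam"),
    ("spam", "spam"),
    ("ham", "ham"),
    ("ham", "ham"),
    ("ham", "spam"),  # the deliberate false positive
]

# Per-row view: a row passes when the prediction equals the expectation.
pass_rate = Fraction(sum(e == p for e, p in rows), len(rows))

# Dataset-level view: precision for the positive class "spam".
tp = sum(e == "spam" and p == "spam" for e, p in rows)
fp = sum(e != "spam" and p == "spam" for e, p in rows)
precision = Fraction(tp, tp + fp)

print(pass_rate, precision)  # 4/5 2/3

# Macro F1 for the multiclass sample: the mean of the per-class F1 values
# quoted above (0.8, 0.8, 1.0).
macro_f1 = (Fraction(4, 5) + Fraction(4, 5) + Fraction(1)) / 3
print(macro_f1)  # 13/15, i.e. 26/30
```

Exact fractions are used so the results compare equal to the figures in the description without floating-point tolerance.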

Why split this PR

PR #1397 added the Python implementation and registered the new evaluator type IDs (uipath-binary-classification, uipath-multiclass-classification) in the coded-evaluator discriminator, but didn't regenerate the JSON type files or add a runnable example. Without these the evaluators are merged-in-name-only.

Test plan

pytest tests/cli/eval/test_classification_samples_e2e.py — both samples pass
ruff check tests/cli/eval/test_classification_samples_e2e.py — clean
ruff format --check — clean
cat packages/uipath/src/uipath/eval/evaluators_types/BinaryClassificationEvaluator.json exposes positive_class, metric_type, f_value in evaluatorConfigSchema.properties
cat packages/uipath/src/uipath/eval/evaluators_types/MulticlassClassificationEvaluator.json exposes classes, averaging, metric_type, f_value
CI passes

Related PRs

chore(eval): resync evaluator type schemas with Python source #1664 — companion PR that refreshes the 11 unrelated stale schemas in the same directory (split out for review hygiene; no functional overlap with this PR).
UiPath/cli#2128 — TypeScript-side flow-tool registry entries that wire these evaluators into the Flow UI evaluator picker.

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

🤖 Generated with Claude Code

Generates BinaryClassificationEvaluator.json and MulticlassClassificationEvaluator.json from the new evaluators added in #1397 so external tooling (Flow UI evaluator picker, `uip maestro flow eval`) can read the config / criteria / justification schemas. Files produced by `python -m uipath.eval.evaluators_types.generate_types`, restricted to the two new evaluator types. A companion PR refreshes the other 11 stale schemas in evaluators_types/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tors

Adds two sample projects under packages/uipath/samples/ that double as end-to-end test fixtures for the binary and multiclass classification evaluators added in #1397:

- binary_classification_agent: rule-based spam/ham classifier wired up to the binary classification evaluator with metric_type=precision. Eval set is designed so 4/5 datapoints pass but precision is 2/3 because of one deliberate false positive.
- multiclass_classification_simple: rule-based 3-class router (payments / support / spam) wired up to the multiclass classification evaluator with macro-averaged F1. Eval set forces a misroute that hurts both payments precision and support recall, giving macro F1 = 26/30.

Adds tests/cli/eval/test_classification_samples_e2e.py, which loads each sample's eval-sets/default.json, wires its main.py into a stand-in runtime, calls evaluate(), and asserts both the per-row scores and the aggregated metric produced by reduce_scores. Locks in the dataset-level math, not just per-row correct/incorrect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ll/f-score

Introduces a new BaseDatasetEvaluator concept that runs once per evaluation set after all per-datapoint evaluators complete. It consumes per-datapoint EvaluationResultDto values from a named source evaluator and emits a single run-level EvaluationResult.

Includes three starter evaluators for multiclass classification metrics:

- PrecisionDatasetEvaluator
- RecallDatasetEvaluator
- FScoreDatasetEvaluator (configurable beta)

Each takes a required classes list (populated from the UI), supports micro or macro averaging, and emits per-class TP/TN/FP/FN plus the confusion matrix in details. Binary is the 2-class case, with no separate binary path.

Architecture: BaseDatasetEvaluator is a parallel hierarchy to GenericBaseEvaluator (not a subclass) so the per-datapoint dispatch loop cannot accidentally pick up a dataset evaluator. Each dataset evaluator declares a single source_evaluator by name; the runtime groups per-datapoint results by evaluator name and routes the right list to each dataset evaluator. Configs load from <eval_set>/../dataset_evaluators/*.json, mirroring the evaluators directory layout.

Patch version bumped: 2.10.68 -> 2.10.69.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
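As a rough illustration of the per-class counts and the two averaging modes named in this commit, here is a minimal sketch. The function names, the input pairs, and the zero-division convention (return 0) are assumptions made for the example, not the SDK's implementation.

```python
from fractions import Fraction

def per_class_counts(pairs, classes):
    """Tally TP/FP/FN/TN per class from (expected, predicted) pairs."""
    counts = {c: {"tp": 0, "fp": 0, "fn": 0, "tn": 0} for c in classes}
    for expected, predicted in pairs:
        for c in classes:
            if predicted == c:
                counts[c]["tp" if expected == c else "fp"] += 1
            else:
                counts[c]["fn" if expected == c else "tn"] += 1
    return counts

def precision(counts, averaging):
    """Macro averages per-class precision; micro pools TP and FP first."""
    if averaging == "micro":
        tp = sum(c["tp"] for c in counts.values())
        fp = sum(c["fp"] for c in counts.values())
        return Fraction(tp, tp + fp) if tp + fp else Fraction(0)
    per_class = [
        # Assumption: a class that is never predicted contributes 0.
        Fraction(c["tp"], c["tp"] + c["fp"]) if c["tp"] + c["fp"] else Fraction(0)
        for c in counts.values()
    ]
    return sum(per_class) / len(per_class)

classes = ["payments", "support", "spam"]
# Hypothetical data: one support item misrouted to payments.
pairs = (
    [("payments", "payments")] * 2
    + [("support", "support")] * 2
    + [("support", "payments")]
    + [("spam", "spam")] * 2
)
counts = per_class_counts(pairs, classes)
print(counts["payments"])          # {'tp': 2, 'fp': 1, 'fn': 0, 'tn': 4}
print(precision(counts, "macro"))  # 8/9
print(precision(counts, "micro"))  # 6/7
```

Macro weights every class equally, so a single misroute in a small class moves it more than it moves the pooled micro figure.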

…10.69

examples/dataset_evaluators_demo.py walks the new dataset-level evaluators (Precision / Recall / F-score) through five scenarios that exercise the math end-to-end at the SDK layer:

1. Balanced 3-class: symmetric confusion matrix, macro == micro
2. Imbalanced 2-class: shows where macro and micro diverge
3. Same data, four metrics (Precision, Recall, F1, F2): proves the F-beta knob actually moves per-class numbers
4. Out-of-vocab + malformed details: n_skipped surfaces, no silent drops
5. Realistic 4-class intent classifier: uneven per-class performance

Each scenario prints the confusion matrix as a table, the per-class TP/TN/FP/FN + the metric, and a snippet of the wire JSON that AutoMapper will surface to the frontend.

Run::

    cd packages/uipath && uv run python examples/dataset_evaluators_demo.py

uv.lock reflects the pyproject.toml version bump (2.10.68 -> 2.10.69) already in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
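Scenario 3 turns on the standard F-beta definition, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). The sketch below is a standalone illustration with made-up precision and recall values; the function name `f_beta` and the zero-denominator convention are assumptions, not taken from the demo script.

```python
from fractions import Fraction

def f_beta(precision, recall, beta=1):
    """Standard F-beta; beta > 1 weights recall more heavily than precision."""
    b2 = beta * beta
    denom = b2 * precision + recall
    # Assumption: report 0 when both precision and recall are 0.
    return (1 + b2) * precision * recall / denom if denom else Fraction(0)

# Hypothetical class with precision 2/3 and perfect recall.
p, r = Fraction(2, 3), Fraction(1)
print(f_beta(p, r, 1))  # 4/5
print(f_beta(p, r, 2))  # 10/11
```

Because recall is the stronger of the two values here, moving from F1 to F2 raises the score, which is the kind of shift the scenario is meant to show.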

ajay-kesavan · 2026-05-22T03:38:18Z

Superseded by #1674 (ClassifierEvaluator). The schema/sample work here was replaced by the simpler single-evaluator approach.

…luators # Conflicts: # packages/uipath/pyproject.toml # packages/uipath/uv.lock

…valuator-types

…figs

Pivot dataset evaluators from a separate hierarchy with source_evaluator pointers to an embedded aggregator-spec design: each per-datapoint classification evaluator's config carries a self-contained list of aggregators (precision / recall / fscore), each with its own classes, averaging, and f_value. No properties are shared up to the evaluator level; aggregators are fully self-describing.

- Drop source_evaluator pointer from BaseDatasetEvaluatorConfig.
- Add discriminated AggregatorSpec union (precision/recall/fscore).
- Add aggregators field to Binary/Multiclass classification configs.
- Refactor build_dataset_evaluator + compute_dataset_evaluator_results to consume aggregator specs from per-datapoint configs directly.
- Drop EvaluationSet.dataset_evaluator_refs (no separate list).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
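To make the embedded design concrete, the following sketch shows one possible shape for a self-describing aggregator entry and a parser that dispatches on its `type` field. It uses a plain dataclass as a stand-in for the pydantic discriminated union mentioned in the commit; the class name, the `parse_aggregator` helper, and the exact key spellings are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple

@dataclass(frozen=True)
class AggregatorSpec:
    """Hypothetical stand-in for one entry of aggregators[]."""
    type: str
    classes: Tuple[str, ...]
    averaging: str
    f_value: Optional[float] = None  # only meaningful for fscore

def parse_aggregator(raw: Dict[str, Any]) -> AggregatorSpec:
    """Dispatch on the 'type' discriminator and reject unknown values."""
    kind = raw.get("type")
    if kind not in ("precision", "recall", "fscore"):
        raise ValueError(f"unknown aggregator type: {kind!r}")
    return AggregatorSpec(
        type=kind,
        classes=tuple(raw["classes"]),
        averaging=raw["averaging"],
        f_value=raw.get("f_value") if kind == "fscore" else None,
    )

# Each entry carries its own classes/averaging; nothing is inherited
# from the enclosing evaluator config.
config = {
    "aggregators": [
        {"type": "precision", "classes": ["spam", "ham"], "averaging": "macro"},
        {"type": "fscore", "classes": ["spam", "ham"], "averaging": "macro", "f_value": 2.0},
    ]
}
specs = [parse_aggregator(a) for a in config["aggregators"]]
print([s.type for s in specs])  # ['precision', 'fscore']
```

Keeping every field on the aggregator itself means an entry can be validated and evaluated without looking at its parent config.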

…evaluator-types

…tors Update binary_classification_agent and multiclass_classification_simple sample evaluator JSONs to include the new aggregators[] field. Each aggregator carries its own classes, averaging, and (for fscore) fValue. Update the e2e test to also assert the dataset-level results land in UiPathEvalOutput.dataset_evaluator_results, keyed "{evaluator_name}.{aggregator_type}". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Collapse Precision/Recall/FScore into one ClassificationDatasetEvaluator switching on spec.type; factory becomes a one-liner.
- Inline _precision_of/_recall_of/_f_score_of and the one-use _ConfusionData helpers; switch _ConfusionData to @dataclass(slots=True).
- Drop dead get_evaluator_id() abstract + 3 overrides + matching EvaluatorType enum entries (factory dispatches on spec.type).
- Pull repeated model_config into a private _AggregatorSpecBase.
- Drop registry + impossible-case ValueError in dataset_evaluator_factory (pydantic discriminator catches unknown types).
- Have _coerce_justification return the typed justification object.
- Drop the _source_evaluator private/property pair on BaseDatasetEvaluator.

No behavior change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…evaluator-types

- Add BaseEvaluatorJustification.try_from classmethod and collapse the three duplicate "instance | dict | other" coercion blocks in classification_dataset_evaluators, binary_classification_evaluator, and multiclass_classification_evaluator down to one line each.
- Replace the 80-line ASCII confusion-matrix pretty-printer in dataset_evaluators_demo with the structured JSON wire shape, which is the thing readers actually want to inspect.

Deferred from this PR: dropping reduce_scores / _micro_metric / _macro_metric on Binary/Multiclass evaluators, and the matching metric_type/averaging/f_value config fields. The runtime calls GenericBaseEvaluator.reduce_scores per-evaluator to compute the top-level evaluator score; the dataset evaluator framework adds {source}.{type}-keyed metrics in addition to that score, it doesn't replace it. Removing them would break the existing per-evaluator headline. Worth a follow-up that either makes reduce_scores delegate to the dataset evaluator framework or formally splits the two paths.

No behavior change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
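The "instance | dict | other" coercion that `try_from` centralizes might look roughly like the sketch below. The `Justification` class, its fields, and the choice to return `None` for values that cannot be coerced are assumptions for illustration; the real classmethod lives on a pydantic model and may behave differently.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Justification:
    """Hypothetical stand-in for a typed justification object."""
    expected: str
    predicted: str

    @classmethod
    def try_from(cls, value: Any) -> Optional["Justification"]:
        # Case 1: already the right type, pass it through unchanged.
        if isinstance(value, cls):
            return value
        # Case 2: a plain dict, e.g. deserialized from JSON details.
        if isinstance(value, dict):
            try:
                return cls(**value)
            except TypeError:
                return None  # missing or unexpected keys
        # Case 3: anything else cannot be coerced.
        return None

j = Justification("spam", "ham")
print(Justification.try_from(j) is j)  # True
print(Justification.try_from({"expected": "spam", "predicted": "ham"}) == j)  # True
print(Justification.try_from("not a justification"))  # None
```

With one helper in place, each caller reduces to a single `try_from(...)` call followed by a `None` check.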

- M2: drop _METRIC_NAME indirection. metric field on ClassificationDetails now uses spec.type verbatim ("fscore" not "f_score"), matching the discriminator on the wire.
- M3: document confusion_matrix orientation via Field(description=...). Matrix is [predicted_idx][expected_idx], opposite of sklearn's convention. Add a regression test pinning the orientation.
- M4: _metric raises ValueError on unknown metric_type instead of silently falling through to the F-beta formula. Defense in depth on top of pydantic's discriminator.
- M6: replace defensive getattr chain in compute_dataset_evaluator_results with isinstance narrowing on the classification config types. Mypy-clean; intent is now "classification configs declare aggregators" rather than "anything might have an aggregators attribute".
- L1: rename duplicate test_two_class_macro tests so pytest output disambiguates Precision vs Recall.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
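The orientation note in M3 is easy to get backwards, so here is a small sketch of a matrix indexed as [predicted_idx][expected_idx] and its transpose, which matches the scikit-learn layout (rows are true labels, columns are predictions). The helper name `build_confusion` and the sample pairs are illustrative assumptions.

```python
def build_confusion(pairs, classes):
    """Return matrix[predicted_idx][expected_idx] for (expected, predicted) pairs."""
    index = {c: i for i, c in enumerate(classes)}
    n = len(classes)
    matrix = [[0] * n for _ in range(n)]
    for expected, predicted in pairs:
        matrix[index[predicted]][index[expected]] += 1
    return matrix

classes = ["payments", "support", "spam"]
pairs = [
    ("payments", "payments"),
    ("support", "payments"),  # expected support, predicted payments
    ("spam", "spam"),
]
m = build_confusion(pairs, classes)
print(m)  # [[1, 1, 0], [0, 0, 0], [0, 0, 1]]

# Transposing yields rows = expected, columns = predicted.
rows_expected = [list(col) for col in zip(*m)]
print(rows_expected)  # [[1, 0, 0], [1, 0, 0], [0, 0, 1]]
```

In the first layout the misroute appears at `m[0][1]`; after transposing it appears at `rows_expected[1][0]`, so any consumer that assumes the scikit-learn layout would swap false positives and false negatives.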

…evaluator-types

- H1/H2: pydantic model_validator on Binary/Multiclass classification configs cross-checks aggregators against evaluator-level fields. Binary rejects aggregators whose `classes` doesn't include `positive_class`, and aggregators of the same metric type with a different `f_value`. Multiclass extends this with the full class-coverage check and an `averaging` consistency check. Without this, a user could ship configs where the per-evaluator headline and the dataset aggregator silently scored disjoint label spaces or used different averaging.
- H3: binary e2e test now asserts the precision/recall/fscore aggregator scores (5/6, 5/6, 0.8) instead of only the key set. A regression that zeros out all aggregator scores would now fail the test.
- H4: multiclass `evaluate()` no longer raises on out-of-vocab predicted class. It now returns score=0.0 with the OOV label preserved in the justification, mirroring binary's behavior. The dataset evaluator's confusion matrix already accounts for this via `n_skipped`. Configuration errors (expected_class outside vocab) still raise.
- M1: drop the `_coerce_justification` one-line wrapper; inline `BaseEvaluatorJustification.try_from(r.details)` at the single caller in `_build_confusion`.
- M2: preserve user-supplied class casing in `_ConfusionData.classes` and the `per_class` keys. The lowercase normalization is now only used for the internal lookup index, so a config with classes=["Spam","Ham"] surfaces "Spam"/"Ham" in the output rather than "spam"/"ham".
- M3 (multiclass `reduce_scores` + ClassificationDatasetEvaluator double-walking the same confusion matrix): deferred. Cleanest fix is to drop the evaluator-level `metric_type`/`averaging`/`f_value` fields and route the per-evaluator headline through the dataset evaluator framework, which is out of scope for this commit. Tracked as a follow-up.
- L1: refreshed test_classification_samples_e2e docstring to reflect the new aggregator-score coverage on the binary side.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
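The casing and out-of-vocab behavior described in M2 and H4 can be pictured with the sketch below: labels are matched through a lowercase lookup index, output keys keep the user's casing, and unknown labels are counted rather than dropped silently. The function name `tally_predictions` and the return shape are assumptions, not the SDK's internals.

```python
def tally_predictions(predicted_labels, classes):
    """Count predictions per class, matching case-insensitively."""
    # Lowercase is used only for the lookup; values keep the user's casing.
    lookup = {c.lower(): c for c in classes}
    per_class = {c: 0 for c in classes}
    n_skipped = 0
    for label in predicted_labels:
        canonical = lookup.get(label.lower())
        if canonical is None:
            n_skipped += 1  # out-of-vocab: surfaced, not silently dropped
            continue
        per_class[canonical] += 1
    return per_class, n_skipped

per_class, n_skipped = tally_predictions(
    ["SPAM", "ham", "phishing"], classes=["Spam", "Ham"]
)
print(per_class)  # {'Spam': 1, 'Ham': 1}
print(n_skipped)  # 1
```

Separating the lookup key from the display key lets the output echo the config exactly as the user wrote it while still tolerating case differences in model output.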

ajay-kesavan · 2026-06-19T05:52:32Z

@chatgpt-codex-connector — please run an adversarial code review on this PR using the methodology below. Treat the code as a rival's, not a colleague's. Start from "this change is broken" and hunt evidence. If you can't find a real bug after genuinely trying, a positive review is fine — but the search has to happen first.

Three non-negotiable promises

Cross-check every invoked interface. When A calls B, open B and read its signature. Prop drift is the #1 source of silent bugs in AI-generated code.
Read test bodies, not test titles. PR descriptions over-claim, commit messages over-claim, and tests get renamed to match the new implementation rather than the original behavior.
Severity is factual, not emotional. "Loses user data" = 🔴 even when the author is senior. Typo = 🔵 even when it annoys you.

7-step method (skip none)

1. Inventory the change: PR metadata, files, commits, full diff. Look for: additions ≈ deletions + "no behavior change" claim (behavior probably changed); commit titled "address review" (open the file it claims to forward into; verify the prop actually lands); mergeable=CONFLICTING (surface it).
2. Read the full diff top-to-bottom in one pass. Bugs live at seams between unrelated files in the same diff.
3. Cross-check every consumer of every new prop / new component / new DI seam. Grep call sites; confirm every prop both exists on the receiver's type AND is used (not silently dropped in one branch of an if).
4. Trace data flow end-to-end. user input → onChange → state → validation → persist → reload. Flag silent drops, empty-string coercion to {}/null, stale-closure useState(propDefault) that never updates, missing memoization on every-render stringification.
5. Diff the deleted code against what replaced it. Every - line is an unverified claim. Verify: vanished behavioral features (truncation, maxLines, special cases, fallbacks), edge-case handling (empty string, null, arrays), a11y attrs (id, htmlFor, aria-*), layout-critical CSS classes.
6. Sanity-check the tests. Read bodies. Ask "does this verify NEW BEHAVIOR or just the NEW IMPLEMENTATION?" Renames like "renders X" → "renders Y" often launder a regression. Any DI seam MUST have a test that exercises the injected path; fallback-only tests prove nothing.
7. Classify with severity tiers and post inline anchored comments.

Severity tiers

🔴 Critical — blocking — data loss, crash, security, broken contract, a11y regression
🟠 High — should fix before merge — structural regression, missing coverage for new code path, error-handling gap
🟡 Medium — quality nit — smell, redundancy, naming, small perf with evidence
🔵 Low — observation — style, docs, process

Decision rule: if the PR cannot merge without introducing the bug you describe → 🔴. Merges but degrades UX → 🟠. Below that → 🟡 / 🔵.

Output format

Per finding: file:line citation + concrete fix. State issues as facts ("This loses user input" — not "may potentially"). Criticism first, positives last. Drop softening qualifiers (no "might", "perhaps", "consider"). Adversarial ≠ abusive: attack the code, name the pattern, never the person.

Red flags — stop and re-read

"This looks fine" → you haven't cross-checked invoked interfaces
"Tests pass so it's good" → tests assert the new impl, not the old behavior
"Description says no behavior change" → diff the deleted lines
"Small PR, quick review" → small PRs hide big bugs in prop drops
"AI-generated, probably clean" → AI code is exactly where prop drift hides

This PR's specific surface area

aggregators[] embedded in BinaryClassificationEvaluatorConfig and MulticlassClassificationEvaluatorConfig
New model_validators cross-check positive_class / classes containment AND averaging / f_value divergence between evaluator-level and aggregator-level fields when types match
reduce_scores retained on Binary/Multiclass for the per-evaluator headline — the dataset evaluator framework runs additively; verify the two paths agree on identical inputs
BaseEvaluatorJustification.try_from classmethod collapsed 3 duplicate coercion blocks
Sample JSONs (binary_classification_agent/, multiclass_classification_simple/) carry new aggregators[] arrays
E2E test asserts specific aggregated scores (binary: P=R=5/6, F1=0.8). Read the multiclass assertions too — they assert specific fscore numbers
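The binary figures quoted above (P=R=5/6, F1=0.8) are what macro averaging over both classes yields on a five-row set with one false positive. The sketch below reproduces that arithmetic; the per-class counts and the assumption that the aggregators use macro averaging are inferred from the numbers, not read from the sample files.

```python
from fractions import Fraction

# Assumed counts per class as (tp, fp, fn): one ham row predicted as spam.
counts = {"spam": (2, 1, 0), "ham": (2, 0, 1)}

def prf(tp, fp, fn):
    """Per-class precision, recall, and F1 from raw counts."""
    p = Fraction(tp, tp + fp)
    r = Fraction(tp, tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

per_class = [prf(*c) for c in counts.values()]
# Macro: unweighted mean of each metric across classes.
macro_p, macro_r, macro_f1 = (sum(col) / len(per_class) for col in zip(*per_class))
print(macro_p, macro_r, macro_f1)  # 5/6 5/6 4/5
```

Note that both classes land on F1 = 4/5 even though their precision and recall are mirrored, which is why the macro F1 differs from the macro precision and recall.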

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 027901c96b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

- Bump uipath version 2.11.5 -> 2.11.6 (2.11.5 already on PyPI).
- Widen examples/dataset_evaluators_demo.py:report() to accept the full EvaluationResult union and narrow once inside with isinstance, fixing 6 mypy "expected NumericEvaluationResult" errors at the call sites.
- Address Codex P1 (runtime.py:268, result-key collision): two aggregators of the same type on the same source (e.g. macro+micro precision) previously produced identical {source}.{type} keys, with the second silently overwriting the first. compute_dataset_evaluator_results now counts type occurrences per source and disambiguates duplicate-type aggregators as {source}.{type}.{averaging} (plus ".fb{f_value}" for fscore variants), preserving the simple key shape for the common single-aggregator case. Docstring updated; 2 new tests cover both the precision-duplicate and fscore-duplicate paths.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
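The key-disambiguation rule described in this commit can be sketched as follows. The function name `dataset_result_keys`, the dict-based aggregator shape, and the exact rendering of `f_value` in the suffix are assumptions; only the key patterns themselves come from the commit message.

```python
from collections import Counter

def dataset_result_keys(source, aggregators):
    """Build one result key per aggregator, disambiguating duplicate types."""
    type_counts = Counter(a["type"] for a in aggregators)
    keys = []
    for a in aggregators:
        key = f"{source}.{a['type']}"
        if type_counts[a["type"]] > 1:
            # Duplicate type on the same source: append the averaging mode.
            key += f".{a['averaging']}"
            if a["type"] == "fscore":
                key += f".fb{a['f_value']}"
        keys.append(key)
    return keys

aggregators = [
    {"type": "precision", "averaging": "macro"},
    {"type": "precision", "averaging": "micro"},
    {"type": "recall", "averaging": "macro"},
]
print(dataset_result_keys("router", aggregators))
# ['router.precision.macro', 'router.precision.micro', 'router.recall']
```

The single `recall` aggregator keeps the short `{source}.{type}` key, so existing consumers of the common case see no change.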

…evaluator-types

…alidator The fscore-duplicate disambiguation test added in 4d6afcc conflicts with the H2 model_validator on #1663, which cross-checks aggregator f_value against the evaluator's f_value when types match. The precision-duplicate test still exercises the new _dataset_result_key path; the FScore branch is exercised by the factory + math tests. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Regenerate BinaryClassificationEvaluator.json and MulticlassClassificationEvaluator.json from the updated pydantic models so schema-driven consumers can discover and validate the new evaluatorConfig.aggregators array + Precision/Recall/FScore variants. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

sonarqubecloud · 2026-06-19T06:16:27Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
98.2% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

github-actions · 2026-06-19T06:16:36Z

🚨 Heads up: `uipath-integrations` cross-tests are FAILING 🚨

Your changes may break one or more integrations in uipath-integrations-python:

uipath-openai-agents
uipath-google-adk
uipath-agent-framework
uipath-llamaindex
uipath-pydantic-ai

⚠️ These checks are NOT enforced by branch protection rules. Please review the failures before merging.

🔍 Inspect the failed run →

github-actions · 2026-06-19T06:23:02Z

🚨 Heads up: `uipath-langchain` cross-tests are FAILING 🚨

Your changes may break the uipath-langchain-python integration.

⚠️ These checks are NOT enforced by branch protection rules. Please review the failures before merging.

🔍 Inspect the failed run →

ajay-kesavan force-pushed the feat/classification-evaluator-types branch from 6931598 to 6b11767 Compare May 20, 2026 00:54

ajay-kesavan mentioned this pull request May 20, 2026

chore(eval): resync evaluator type schemas with Python source #1664

Draft

3 tasks

ajay-kesavan changed the title ~~chore(eval): regenerate evaluator type schemas with classification evaluators~~ feat(eval): add evaluator type schemas for classification evaluators May 20, 2026

ajay-kesavan changed the title ~~feat(eval): add evaluator type schemas for classification evaluators~~ feat(eval): classification evaluator schemas + sample projects + e2e tests May 20, 2026

ajay-kesavan and others added 2 commits May 20, 2026 14:05

ajay-kesavan closed this May 22, 2026

ajay-kesavan added 2 commits June 18, 2026 19:53

Merge remote-tracking branch 'origin/main' into feat/eval-dataset-eva…

e9ba8aa

…luators # Conflicts: # packages/uipath/pyproject.toml # packages/uipath/uv.lock

Merge remote-tracking branch 'origin/main' into feat/classification-e…

46c24e1

…valuator-types

ajay-kesavan reopened this Jun 19, 2026

ajay-kesavan and others added 3 commits June 18, 2026 21:26

Merge branch 'feat/eval-dataset-evaluators' into feat/classification-…

d4e06b1

…evaluator-types

github-actions Bot added test:uipath-langchain Triggers tests in the uipath-langchain-python repository test:uipath-integrations labels Jun 19, 2026

ajay-kesavan and others added 3 commits June 18, 2026 21:49

Merge branch 'feat/eval-dataset-evaluators' into feat/classification-…

05f6697

…evaluator-types

ajay-kesavan mentioned this pull request Jun 19, 2026

feat(eval): add dataset-level evaluator framework with precision/recall/f-score #1669

Draft

ajay-kesavan and others added 3 commits June 18, 2026 22:27

Merge branch 'feat/eval-dataset-evaluators' into feat/classification-…

cbbaf5f

…evaluator-types

chatgpt-codex-connector Bot reviewed Jun 19, 2026


Comment thread packages/uipath/src/uipath/eval/evaluators_types/BinaryClassificationEvaluator.json

Comment thread packages/uipath/src/uipath/eval/evaluators/base_dataset_evaluator.py

ajay-kesavan and others added 3 commits June 18, 2026 23:05

Merge branch 'feat/eval-dataset-evaluators' into feat/classification-…

c347fc7

…evaluator-types

ajay-kesavan wants to merge 19 commits into main from feat/classification-evaluator-types
