feat(eval): add dataset-level evaluator framework with precision/recall/f-score by ajay-kesavan · Pull Request #1669 · UiPath/uipath-python

ajay-kesavan · 2026-05-20T21:06:23Z

Summary

Adds a dataset-level evaluator framework that consumes the per-datapoint results of an existing classification evaluator and emits one run-level metric (precision / recall / F-beta) with a structured ClassificationDetails payload (per-class TP/TN/FP/FN/support/value, full confusion matrix, micro/macro, skip counts).

The dataset evaluators' config is embedded in the per-datapoint evaluator's JSON via an aggregators[] array. Each aggregator entry is self-contained — its own classes, averaging, and (for fscore only) fValue — and there is no separate dataset-evaluator entity or external pointer.

Shape

{
  "id": "intent_classifier",
  "evaluatorTypeId": "uipath-multiclass-classification",
  "evaluatorConfig": {
    "name": "intent_classifier",
    "targetOutputKey": "intent",
    "classes": ["book", "cancel", "reschedule"],
    "aggregators": [
      { "type": "precision", "classes": ["book","cancel","reschedule"], "averaging": "macro" },
      { "type": "recall",    "classes": ["book","cancel","reschedule"], "averaging": "macro" },
      { "type": "fscore",    "classes": ["book","cancel","reschedule"], "averaging": "macro", "fValue": 1.0 }
    ]
  }
}
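
To make the embedded shape concrete, here is a minimal sketch of reading the aggregators[] array out of an evaluator JSON and rejecting unknown type values. It uses plain dataclasses rather than the Pydantic discriminated union the PR describes, so it runs without dependencies. The names AggregatorSpec (as a single dataclass), parse_aggregators, and ALLOWED_TYPES are hypothetical, and the fValue default of 1.0 is an assumption, not something the PR states.

```python
import json
from dataclasses import dataclass
from typing import Optional

ALLOWED_TYPES = ("precision", "recall", "fscore")


@dataclass(frozen=True)
class AggregatorSpec:
    """Hypothetical stand-in for the PR's AggregatorSpec union."""
    type: str
    classes: tuple
    averaging: str
    f_value: Optional[float] = None  # only meaningful for "fscore"


def parse_aggregators(evaluator_json: str) -> list:
    """Read evaluatorConfig.aggregators[] and validate each entry's type."""
    config = json.loads(evaluator_json)["evaluatorConfig"]
    specs = []
    for raw in config.get("aggregators") or []:
        kind = raw["type"]
        if kind not in ALLOWED_TYPES:
            raise ValueError(f"unknown aggregator type: {kind!r}")
        # Assumption: fValue defaults to 1.0 (F1) when omitted on an fscore entry.
        f_value = float(raw.get("fValue", 1.0)) if kind == "fscore" else None
        specs.append(
            AggregatorSpec(kind, tuple(raw["classes"]), raw["averaging"], f_value)
        )
    return specs


EXAMPLE = """
{
  "id": "intent_classifier",
  "evaluatorTypeId": "uipath-multiclass-classification",
  "evaluatorConfig": {
    "name": "intent_classifier",
    "targetOutputKey": "intent",
    "classes": ["book", "cancel", "reschedule"],
    "aggregators": [
      {"type": "precision", "classes": ["book", "cancel", "reschedule"], "averaging": "macro"},
      {"type": "recall", "classes": ["book", "cancel", "reschedule"], "averaging": "macro"},
      {"type": "fscore", "classes": ["book", "cancel", "reschedule"], "averaging": "macro", "fValue": 1.0}
    ]
  }
}
"""

specs = parse_aggregators(EXAMPLE)
for spec in specs:
    print(spec.type, spec.averaging, spec.f_value)
```

Each entry carries its own classes and averaging, so nothing has to be looked up on the parent evaluator config to interpret an aggregator.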

What's in the PR

_aggregator_specs.py — discriminated AggregatorSpec union with type field. Three concrete variants: PrecisionAggregatorSpec, RecallAggregatorSpec, FScoreAggregatorSpec. Pydantic catches unknown type values at parse time.
base_dataset_evaluator.py — BaseDatasetEvaluator abstract class, constructed from (spec, source_evaluator_name). No source_evaluator pointer in the spec itself.
classification_dataset_evaluators.py — single ClassificationDatasetEvaluator class dispatches on spec.type to choose precision / recall / F-beta math. Confusion matrix oriented [predicted_idx][expected_idx] (documented as differing from sklearn's [true][predicted]).
dataset_evaluator_factory.py — build_dataset_evaluator(spec, source_evaluator_name) returns the right instance for the spec's type.
binary_classification_evaluator.py / multiclass_classification_evaluator.py — gained an aggregators: list[AggregatorSpec] | None field on their config. Other evaluator configs are unchanged.
runtime.py — compute_dataset_evaluator_results() walks each evaluator's evaluator_config.aggregators, builds the dataset evaluators on demand, runs them against the per-datapoint results, and returns a dict keyed "{source_evaluator_name}.{aggregator_type}" (e.g. intent_classifier.precision).
Tests — 488 evaluator tests pass, including math verification (2-class, 3-class, macro vs micro, F-beta), confusion matrix orientation, out-of-vocab skipping, malformed-details skipping, factory dispatch, JSON round-trip with fValue camelCase alias.
Demo — examples/dataset_evaluators_demo.py runs the framework end-to-end and prints the resulting ClassificationDetails via model_dump_json(indent=2, by_alias=True).
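
The orientation and averaging choices above can be illustrated with a small standalone sketch. It derives per-class TP/FP/FN from a matrix indexed [predicted_idx][expected_idx] and applies the standard precision, recall, and F-beta definitions with macro or micro averaging. The function names are hypothetical and not taken from the PR, and treating a zero denominator as 0.0 is an assumption.

```python
def per_class_counts(matrix, k):
    """Return (tp, fp, fn) for class k from matrix[predicted_idx][expected_idx]."""
    tp = matrix[k][k]
    fp = sum(matrix[k]) - tp                  # row k: predicted k, expected another class
    fn = sum(row[k] for row in matrix) - tp   # column k: expected k, predicted another class
    return tp, fp, fn


def _ratio(num, den):
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    return num / den if den else 0.0


def precision(tp, fp, fn):
    return _ratio(tp, tp + fp)


def recall(tp, fp, fn):
    return _ratio(tp, tp + fn)


def f_beta(beta):
    """Build an F-beta metric; beta > 1 weights recall more heavily."""
    def metric(tp, fp, fn):
        b2 = beta * beta
        return _ratio((1 + b2) * tp, (1 + b2) * tp + b2 * fn + fp)
    return metric


def aggregate(matrix, metric, averaging):
    counts = [per_class_counts(matrix, k) for k in range(len(matrix))]
    if averaging == "macro":
        # Average the per-class metric values with equal weight per class.
        return sum(metric(*c) for c in counts) / len(counts)
    if averaging == "micro":
        # Pool the counts across classes, then compute the metric once.
        tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
        return metric(tp, fp, fn)
    raise ValueError(f"unknown averaging: {averaging!r}")


# Imbalanced 2-class example: 10 expected "pos", 90 expected "neg".
# Rows are predictions, columns are expectations.
matrix = [
    [8, 5],   # predicted pos: 8 expected pos, 5 expected neg
    [2, 85],  # predicted neg: 2 expected pos, 85 expected neg
]

print("macro precision:", aggregate(matrix, precision, "macro"))
print("micro precision:", aggregate(matrix, precision, "micro"))
print("macro F2:", aggregate(matrix, f_beta(2.0), "macro"))
```

With this orientation, false positives for a class come from its row and false negatives from its column, which is the transpose of what a reader used to sklearn's [true][predicted] layout would expect.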

What's NOT in the PR (consciously)

No EvaluationSet.dataset_evaluator_refs — aggregators live on per-datapoint evaluator configs, not as a separate top-level list.
No EvalHelpers.load_dataset_evaluators / UiPathEvalContext.dataset_evaluators — the SDK runtime discovers aggregators directly from the per-datapoint evaluator configs.
No separate PrecisionDatasetEvaluator / RecallDatasetEvaluator / FScoreDatasetEvaluator classes — collapsed into one ClassificationDatasetEvaluator that switches on spec.type.
No EvaluatorType.DATASET_PRECISION/RECALL/F_SCORE enum entries — the discriminator field on the spec is the dispatch key.

How to test

cd packages/uipath
uv run pytest tests/evaluators -x          # 488 pass
uv run ruff check . && uv run ruff format --check .
uv run mypy src tests
uv run python examples/dataset_evaluators_demo.py

Companion PRs

| Repo | Branch | What |
| --- | --- | --- |
| uipath-python | `feat/classification-evaluator-types` (#1663) | Sample projects + e2e tests wiring the framework to Binary/Multiclass classification samples |
| Agents | `feat/eval-dataset-evaluators-backend` (#5307) | Backend storage + python-eval-worker `DatasetEvaluatorsWorkflow` that consumes the embedded aggregator specs |

…ll/f-score

Introduces a new BaseDatasetEvaluator concept that runs once per evaluation set after all per-datapoint evaluators complete. It consumes per-datapoint EvaluationResultDto values from a named source evaluator and emits a single run-level EvaluationResult.

Includes three starter evaluators for multiclass classification metrics:
- PrecisionDatasetEvaluator
- RecallDatasetEvaluator
- FScoreDatasetEvaluator (configurable beta)

Each takes a required classes list (populated from the UI), supports micro or macro averaging, and emits per-class TP/TN/FP/FN plus the confusion matrix in details. Binary is the 2-class case; there is no separate binary path.

Architecture: BaseDatasetEvaluator is a parallel hierarchy to GenericBaseEvaluator (not a subclass) so the per-datapoint dispatch loop cannot accidentally pick up a dataset evaluator. Each dataset evaluator declares a single source_evaluator by name; the runtime groups per-datapoint results by evaluator name and routes the right list to each dataset evaluator.

Configs load from <eval_set>/../dataset_evaluators/*.json, mirroring the evaluators directory layout.

Patch version bumped: 2.10.68 -> 2.10.69.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…10.69

examples/dataset_evaluators_demo.py walks the new dataset-level evaluators (Precision / Recall / F-score) through five scenarios that exercise the math end-to-end at the SDK layer:

1. Balanced 3-class: symmetric confusion matrix, macro == micro
2. Imbalanced 2-class: shows where macro and micro diverge
3. Same data, four metrics (Precision, Recall, F1, F2): proves the F-beta knob actually moves per-class numbers
4. Out-of-vocab + malformed details: n_skipped surfaces, no silent drops
5. Realistic 4-class intent classifier: uneven per-class performance

Each scenario prints the confusion matrix as a table, the per-class TP/TN/FP/FN + the metric, and a snippet of the wire JSON that AutoMapper will surface to the frontend.

Run::

    cd packages/uipath && uv run python examples/dataset_evaluators_demo.py

uv.lock reflects the pyproject.toml version bump (2.10.68 -> 2.10.69) already in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
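
Scenario 4 (out-of-vocab and malformed details) can be sketched as follows. This is an illustration only: the per-datapoint payload shape (a dict with "predicted" and "expected" string keys) and the function name build_confusion_matrix are assumptions, not the SDK's actual types. The point is that every datapoint that cannot be placed in the matrix is counted rather than silently dropped.

```python
def build_confusion_matrix(results, classes):
    """Tally results into matrix[predicted_idx][expected_idx] and count skips.

    `results` is a list of hypothetical per-datapoint detail payloads.
    """
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    n_skipped = 0
    for details in results:
        # Malformed: not a dict, or missing / non-string labels.
        if not isinstance(details, dict):
            n_skipped += 1
            continue
        predicted = details.get("predicted")
        expected = details.get("expected")
        if not isinstance(predicted, str) or not isinstance(expected, str):
            n_skipped += 1
            continue
        # Out-of-vocab: a label that is not in the aggregator's classes list.
        if predicted not in index or expected not in index:
            n_skipped += 1
            continue
        matrix[index[predicted]][index[expected]] += 1
    return matrix, n_skipped


results = [
    {"predicted": "book", "expected": "book"},
    {"predicted": "cancel", "expected": "book"},
    {"predicted": "refund", "expected": "book"},  # out-of-vocab prediction
    {"expected": "cancel"},                       # malformed: no prediction
    None,                                         # malformed: not a dict
]

matrix, n_skipped = build_confusion_matrix(results, ["book", "cancel"])
print(matrix, n_skipped)
```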

ajay-kesavan · 2026-05-22T03:38:15Z

Superseded by #1674 (ClassifierEvaluator pure-metadata aggregator). The earlier dataset-level evaluator framework was replaced by the cleaner Classifier design that delegates aggregation math to the C# layer.

…luators

# Conflicts:
#	packages/uipath/pyproject.toml
#	packages/uipath/uv.lock

…figs

Pivot dataset evaluators from a separate hierarchy with source_evaluator pointers to an embedded aggregator-spec design: each per-datapoint classification evaluator's config carries a self-contained list of aggregators (precision / recall / fscore), each with its own classes, averaging, and f_value. No properties are shared up to the evaluator level; aggregators are fully self-describing.

- Drop source_evaluator pointer from BaseDatasetEvaluatorConfig.
- Add discriminated AggregatorSpec union (precision/recall/fscore).
- Add aggregators field to Binary/Multiclass classification configs.
- Refactor build_dataset_evaluator + compute_dataset_evaluator_results to consume aggregator specs from per-datapoint configs directly.
- Drop EvaluationSet.dataset_evaluator_refs (no separate list).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Collapse Precision/Recall/FScore into one ClassificationDatasetEvaluator switching on spec.type; factory becomes a one-liner.
- Inline _precision_of/_recall_of/_f_score_of and the one-use _ConfusionData helpers; switch _ConfusionData to @dataclass(slots=True).
- Drop dead get_evaluator_id() abstract + 3 overrides + matching EvaluatorType enum entries (factory dispatches on spec.type).
- Pull repeated model_config into a private _AggregatorSpecBase.
- Drop registry + impossible-case ValueError in dataset_evaluator_factory (pydantic discriminator catches unknown types).
- Have _coerce_justification return the typed justification object.
- Drop the _source_evaluator private/property pair on BaseDatasetEvaluator.

No behavior change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- M2: drop _METRIC_NAME indirection. The metric field on ClassificationDetails now uses spec.type verbatim ("fscore" not "f_score"), matching the discriminator on the wire.
- M3: document confusion_matrix orientation via Field(description=...). Matrix is [predicted_idx][expected_idx], opposite of sklearn's convention. Add a regression test pinning the orientation.
- M4: _metric raises ValueError on unknown metric_type instead of silently falling through to the F-beta formula. Defense in depth on top of pydantic's discriminator.
- M6: replace defensive getattr chain in compute_dataset_evaluator_results with isinstance narrowing on the classification config types. Mypy-clean; intent is now "classification configs declare aggregators" rather than "anything might have an aggregators attribute".
- L1: rename duplicate test_two_class_macro tests so pytest output disambiguates Precision vs Recall.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
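
The M4 change (fail loudly instead of falling through to F-beta) might look roughly like the sketch below. The function name metric_value and its signature are hypothetical, and returning 0.0 for a zero denominator is an assumption; only the dispatch-then-raise pattern and the "fscore" spelling come from the commit message.

```python
def metric_value(metric_type, tp, fp, fn, f_value=1.0):
    """Dispatch on the aggregator type; unknown types raise instead of defaulting."""
    def ratio(num, den):
        # Assumption: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    if metric_type == "precision":
        return ratio(tp, tp + fp)
    if metric_type == "recall":
        return ratio(tp, tp + fn)
    if metric_type == "fscore":
        b2 = f_value * f_value
        return ratio((1 + b2) * tp, (1 + b2) * tp + b2 * fn + fp)
    # No implicit fallback: a typo such as "f_score" must not be scored as F-beta.
    raise ValueError(f"unknown metric type: {metric_type!r}")


print(metric_value("precision", tp=8, fp=5, fn=2))
print(metric_value("fscore", tp=8, fp=5, fn=2, f_value=2.0))
```

Without the final raise, a misspelled type would quietly produce an F-beta number under the wrong label, which is harder to notice than an exception.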

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ad32c22c64

- Bump uipath version 2.11.5 -> 2.11.6 (2.11.5 already on PyPI).
- Widen examples/dataset_evaluators_demo.py:report() to accept the full EvaluationResult union and narrow once inside with isinstance, fixing 6 mypy "expected NumericEvaluationResult" errors at the call sites.
- Address Codex P1 (runtime.py:268, result-key collision): two aggregators of the same type on the same source (e.g. macro+micro precision) previously produced identical {source}.{type} keys, with the second silently overwriting the first. compute_dataset_evaluator_results now counts type occurrences per source and disambiguates duplicate-type aggregators as {source}.{type}.{averaging} (plus ".fb{f_value}" for fscore variants), preserving the simple key shape for the common single-aggregator case. Docstring updated; 2 new tests cover both the precision-duplicate and fscore-duplicate paths.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
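
The key-disambiguation rule described in this commit can be sketched as below. The function name result_keys and the dict-based specs are hypothetical, and the exact rendering of f_value inside the ".fb" suffix is an assumption (here Python's default float formatting).

```python
from collections import Counter


def result_keys(source, specs):
    """Return one result key per aggregator spec, in order.

    A type that appears once keeps the short "{source}.{type}" key; a type
    that appears more than once gets its averaging (and, for fscore, the
    f-value) appended so the keys no longer collide.
    """
    type_counts = Counter(spec["type"] for spec in specs)
    keys = []
    for spec in specs:
        key = f"{source}.{spec['type']}"
        if type_counts[spec["type"]] > 1:
            key += f".{spec['averaging']}"
            if spec["type"] == "fscore":
                # Assumption: the suffix uses the float's default string form.
                key += f".fb{spec['f_value']}"
        keys.append(key)
    return keys


specs = [
    {"type": "precision", "averaging": "macro"},
    {"type": "precision", "averaging": "micro"},
    {"type": "recall", "averaging": "macro"},
    {"type": "fscore", "averaging": "macro", "f_value": 1.0},
    {"type": "fscore", "averaging": "macro", "f_value": 2.0},
]

keys = result_keys("intent_classifier", specs)
for key in keys:
    print(key)
```

The single recall aggregator keeps the short key, so existing consumers of the common one-aggregator-per-type case see no change.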

sonarqubecloud · 2026-06-19T06:08:20Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
98.9% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

github-actions · 2026-06-19T06:15:09Z

🚨 Heads up: `uipath-langchain` cross-tests are FAILING 🚨

Your changes may break the uipath-langchain-python integration.

⚠️ These checks are NOT enforced by branch protection rules. Please review the failures before merging.

🔍 Inspect the failed run →

github-actions Bot added the test:uipath-langchain (Triggers tests in the uipath-langchain-python repository) and test:uipath-integrations labels May 20, 2026

ajay-kesavan mentioned this pull request May 21, 2026

feat(eval): add ClassifierEvaluator (pure-metadata aggregator) #1674

Closed

4 tasks

ajay-kesavan closed this May 22, 2026

Merge remote-tracking branch 'origin/main' into feat/eval-dataset-eva…

e9ba8aa

ajay-kesavan reopened this Jun 19, 2026

ajay-kesavan and others added 3 commits June 18, 2026 21:26

chatgpt-codex-connector Bot reviewed Jun 19, 2026

Comment thread packages/uipath/src/uipath/eval/runtime/runtime.py Outdated

ajay-kesavan wants to merge 7 commits into main from feat/eval-dataset-evaluators
