feat(eval): add dataset-level evaluator framework with precision/recall/f-score#1669

Draft
ajay-kesavan wants to merge 7 commits into main from feat/eval-dataset-evaluators

@ajay-kesavan ajay-kesavan commented May 20, 2026

Contributor

Summary

Adds a dataset-level evaluator framework that consumes the per-datapoint results of an existing classification evaluator and emits one run-level metric (precision / recall / F-beta) with a structured ClassificationDetails payload (per-class TP/TN/FP/FN/support/value, full confusion matrix, micro/macro, skip counts).

The dataset evaluators' config is embedded in the per-datapoint evaluator's JSON via an aggregators[] array. Each aggregator entry is self-contained — its own classes, averaging, and (for fscore only) fValue — and there is no separate dataset-evaluator entity or external pointer.

Shape

{
  "id": "intent_classifier",
  "evaluatorTypeId": "uipath-multiclass-classification",
  "evaluatorConfig": {
    "name": "intent_classifier",
    "targetOutputKey": "intent",
    "classes": ["book", "cancel", "reschedule"],
    "aggregators": [
      { "type": "precision", "classes": ["book","cancel","reschedule"], "averaging": "macro" },
      { "type": "recall",    "classes": ["book","cancel","reschedule"], "averaging": "macro" },
      { "type": "fscore",    "classes": ["book","cancel","reschedule"], "averaging": "macro", "fValue": 1.0 }
    ]
  }
}
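A minimal, stdlib-only sketch of reading this shape. The PR itself validates the entries with a pydantic discriminated union; the `AggregatorEntry` dataclass and `parse_aggregators` function below are hypothetical stand-ins, and two behaviours are assumptions not confirmed by the PR: rejecting `fValue` on non-fscore entries, and defaulting `fValue` to 1.0 when an fscore entry omits it.

```python
from __future__ import annotations

import json
from dataclasses import dataclass

ALLOWED_TYPES = {"precision", "recall", "fscore"}
ALLOWED_AVERAGING = {"micro", "macro"}


@dataclass(frozen=True)
class AggregatorEntry:
    """Hypothetical stand-in for one entry of the aggregators[] array."""
    type: str
    classes: tuple[str, ...]
    averaging: str
    f_value: float | None = None  # only meaningful for type == "fscore"


def parse_aggregators(evaluator_json: str) -> list[AggregatorEntry]:
    """Parse and validate evaluatorConfig.aggregators from an evaluator JSON document."""
    config = json.loads(evaluator_json)["evaluatorConfig"]
    entries = []
    for raw in config.get("aggregators") or []:
        kind = raw["type"]
        if kind not in ALLOWED_TYPES:
            raise ValueError(f"unknown aggregator type: {kind!r}")
        if raw["averaging"] not in ALLOWED_AVERAGING:
            raise ValueError(f"unknown averaging: {raw['averaging']!r}")
        if kind != "fscore" and "fValue" in raw:
            # Assumption: fValue is rejected outside fscore entries.
            raise ValueError("fValue is only valid on fscore aggregators")
        # Assumption: fscore entries default to beta = 1.0.
        f_value = float(raw.get("fValue", 1.0)) if kind == "fscore" else None
        entries.append(
            AggregatorEntry(kind, tuple(raw["classes"]), raw["averaging"], f_value)
        )
    return entries


EXAMPLE = json.dumps({
    "id": "intent_classifier",
    "evaluatorTypeId": "uipath-multiclass-classification",
    "evaluatorConfig": {
        "name": "intent_classifier",
        "classes": ["book", "cancel", "reschedule"],
        "aggregators": [
            {"type": "precision", "classes": ["book", "cancel", "reschedule"], "averaging": "macro"},
            {"type": "fscore", "classes": ["book", "cancel", "reschedule"], "averaging": "macro", "fValue": 2.0},
        ],
    },
})

for entry in parse_aggregators(EXAMPLE):
    print(entry.type, entry.averaging, entry.f_value)
```

Each entry carries its own classes and averaging, so nothing has to be looked up on the enclosing evaluator config.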

What's in the PR

  • _aggregator_specs.py: discriminated AggregatorSpec union with a type field. Three concrete variants: PrecisionAggregatorSpec, RecallAggregatorSpec, FScoreAggregatorSpec. Pydantic catches unknown type values at parse time.
  • base_dataset_evaluator.py: BaseDatasetEvaluator abstract class, constructed from (spec, source_evaluator_name). No source_evaluator pointer in the spec itself.
  • classification_dataset_evaluators.py: a single ClassificationDatasetEvaluator class dispatches on spec.type to choose precision / recall / F-beta math. The confusion matrix is oriented [predicted_idx][expected_idx] (documented as differing from sklearn's [true][predicted]).
  • dataset_evaluator_factory.py: build_dataset_evaluator(spec, source_evaluator_name) returns the right instance for the spec's type.
  • binary_classification_evaluator.py / multiclass_classification_evaluator.py: gained an aggregators: list[AggregatorSpec] | None field on their config. Other evaluator configs are unchanged.
  • runtime.py: compute_dataset_evaluator_results() walks each evaluator's evaluator_config.aggregators, builds the dataset evaluators on demand, runs them against the per-datapoint results, and returns a dict keyed "{source_evaluator_name}.{aggregator_type}" (e.g. intent_classifier.precision).
  • Tests: 488 evaluator tests pass, including math verification (2-class, 3-class, macro vs micro, F-beta), confusion matrix orientation, out-of-vocab skipping, malformed-details skipping, factory dispatch, and JSON round-trip with the fValue camelCase alias.
  • Demo: examples/dataset_evaluators_demo.py runs the framework end-to-end and prints the resulting ClassificationDetails via model_dump_json(indent=2, by_alias=True).
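The metric math described in the list above can be sketched roughly as follows. This is an illustration rather than the PR's code: the function names are hypothetical, labels are assumed to be already filtered to the declared classes, and returning 0.0 for a zero denominator is an assumption. The matrix follows the [predicted_idx][expected_idx] orientation stated above.

```python
from __future__ import annotations


def build_matrix(pairs, classes):
    """Count (expected, predicted) pairs into matrix[predicted_idx][expected_idx]."""
    index = {label: i for i, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for expected, predicted in pairs:
        matrix[index[predicted]][index[expected]] += 1
    return matrix


def per_class_counts(matrix):
    """Return one (tp, fp, fn) tuple per class index."""
    size = len(matrix)
    counts = []
    for k in range(size):
        tp = matrix[k][k]
        fp = sum(matrix[k]) - tp                              # predicted k, expected something else
        fn = sum(matrix[row][k] for row in range(size)) - tp  # expected k, predicted something else
        counts.append((tp, fp, fn))
    return counts


def _ratio(num, den):
    # Assumption: an undefined ratio is reported as 0.0.
    return num / den if den else 0.0


def score(tp, fp, fn, metric, beta=1.0):
    """Precision, recall, or F-beta from one set of counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    if metric == "precision":
        return precision
    if metric == "recall":
        return recall
    if metric == "fscore":
        b2 = beta * beta
        return _ratio((1 + b2) * precision * recall, b2 * precision + recall)
    raise ValueError(f"unknown metric: {metric!r}")


def aggregate(matrix, metric, averaging, beta=1.0):
    """Macro averages per-class scores; micro pools the counts first."""
    counts = per_class_counts(matrix)
    if averaging == "micro":
        pooled = tuple(sum(column) for column in zip(*counts))
        return score(*pooled, metric, beta)
    if averaging == "macro":
        return sum(score(*c, metric, beta) for c in counts) / len(counts)
    raise ValueError(f"unknown averaging: {averaging!r}")


# Imbalanced 2-class data as (expected, predicted) pairs.
PAIRS = (
    [("neg", "neg")] * 7 + [("neg", "pos")]
    + [("pos", "pos"), ("pos", "neg")]
)
MATRIX = build_matrix(PAIRS, ["pos", "neg"])
print(aggregate(MATRIX, "precision", "macro"))
print(aggregate(MATRIX, "precision", "micro"))
```

On this data the rare class drags the macro average down while the micro average is dominated by the frequent class, which is the divergence the 2-class tests check for.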

What's NOT in the PR (consciously)

  • No EvaluationSet.dataset_evaluator_refs — aggregators live on per-datapoint evaluator configs, not as a separate top-level list.
  • No EvalHelpers.load_dataset_evaluators / UiPathEvalContext.dataset_evaluators — the SDK runtime discovers aggregators directly from the per-datapoint evaluator configs.
  • No separate PrecisionDatasetEvaluator / RecallDatasetEvaluator / FScoreDatasetEvaluator classes — collapsed into one ClassificationDatasetEvaluator that switches on spec.type.
  • No EvaluatorType.DATASET_PRECISION/RECALL/F_SCORE enum entries — the discriminator field on the spec is the dispatch key.

How to test

cd packages/uipath
uv run pytest tests/evaluators -x          # 488 pass
uv run ruff check . && uv run ruff format --check .
uv run mypy src tests
uv run python examples/dataset_evaluators_demo.py

Companion PRs

| Repo | Branch | What |
| --- | --- | --- |
| uipath-python | feat/classification-evaluator-types (#1663) | Sample projects + e2e tests wiring the framework to Binary/Multiclass classification samples |
| Agents | feat/eval-dataset-evaluators-backend (#5307) | Backend storage + python-eval-worker DatasetEvaluatorsWorkflow that consumes the embedded aggregator specs |

…ll/f-score

Introduces a new BaseDatasetEvaluator concept that runs once per evaluation
set after all per-datapoint evaluators complete. It consumes per-datapoint
EvaluationResultDto values from a named source evaluator and emits a single
run-level EvaluationResult.

Includes three starter evaluators for multiclass classification metrics:

- PrecisionDatasetEvaluator
- RecallDatasetEvaluator
- FScoreDatasetEvaluator (configurable beta)

Each takes a required classes list (populated from the UI), supports micro
or macro averaging, and emits per-class TP/TN/FP/FN plus the confusion
matrix in details. Binary is the 2-class case — no separate binary path.
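
For reference, the standard definitions behind these three metrics, written for class k out of K classes. The symbols are introduced here for illustration; M_k stands for whichever per-class metric is being averaged.

```latex
\begin{aligned}
P_k &= \frac{TP_k}{TP_k + FP_k}, \qquad R_k = \frac{TP_k}{TP_k + FN_k}, \\
F_{\beta,k} &= \frac{(1+\beta^2)\,P_k R_k}{\beta^2 P_k + R_k}, \\
M_{\text{macro}} &= \frac{1}{K}\sum_{k=1}^{K} M_k, \\
P_{\text{micro}} &= \frac{\sum_k TP_k}{\sum_k (TP_k + FP_k)}, \qquad
R_{\text{micro}} = \frac{\sum_k TP_k}{\sum_k (TP_k + FN_k)}.
\end{aligned}
```

In single-label classification every false positive for one class is a false negative for another, so the pooled FP and FN totals are equal and micro precision equals micro recall.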

Architecture: BaseDatasetEvaluator is a parallel hierarchy to
GenericBaseEvaluator (not a subclass) so the per-datapoint dispatch loop
cannot accidentally pick up a dataset evaluator. Each dataset evaluator
declares a single source_evaluator by name; the runtime groups
per-datapoint results by evaluator name and routes the right list to each
dataset evaluator. Configs load from <eval_set>/../dataset_evaluators/*.json
mirroring the evaluators directory layout.
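
The routing step described in this commit message (group per-datapoint results by evaluator name, then hand each dataset evaluator the list for its source) might look roughly like the sketch below. The dict-based result records, the `evaluator_name` key, and both function names are hypothetical; the real code works on EvaluationResultDto values.

```python
from collections import defaultdict


def group_by_evaluator(results):
    """Group per-datapoint result records by the evaluator that produced them."""
    grouped = defaultdict(list)
    for record in results:
        grouped[record["evaluator_name"]].append(record)
    return dict(grouped)


def route(grouped, source_by_dataset_evaluator):
    """Map each dataset evaluator name to the result list of its single source evaluator."""
    return {
        name: grouped.get(source, [])
        for name, source in source_by_dataset_evaluator.items()
    }


RESULTS = [
    {"evaluator_name": "intent_classifier", "score": 1.0},
    {"evaluator_name": "exact_match", "score": 0.0},
    {"evaluator_name": "intent_classifier", "score": 0.0},
]
ROUTED = route(
    group_by_evaluator(RESULTS),
    {"intent_precision": "intent_classifier", "orphan": "missing_source"},
)
print({name: len(items) for name, items in ROUTED.items()})
```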

Patch version bumped: 2.10.68 -> 2.10.69.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the test:uipath-langchain (triggers tests in the uipath-langchain-python repository) and test:uipath-integrations labels May 20, 2026
…10.69

examples/dataset_evaluators_demo.py walks the new dataset-level evaluators
(Precision / Recall / F-score) through five scenarios that exercise the
math end-to-end at the SDK layer:

  1. Balanced 3-class — symmetric confusion matrix, macro == micro
  2. Imbalanced 2-class — shows where macro and micro diverge
  3. Same data, four metrics (Precision, Recall, F1, F2) — proves the
     F-beta knob actually moves per-class numbers
  4. Out-of-vocab + malformed details — n_skipped surfaces, no silent drops
  5. Realistic 4-class intent classifier — uneven per-class performance

Each scenario prints the confusion matrix as a table, the per-class
TP/TN/FP/FN + the metric, and a snippet of the wire JSON that AutoMapper
will surface to the frontend.
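
Scenario 4 depends on skipped datapoints being counted rather than dropped silently. A rough sketch of that filtering step, assuming each per-datapoint details payload is a dict with string `expected` and `predicted` keys (hypothetical key names, as is the split into two skip counters):

```python
def extract_pairs(details_list, classes):
    """Return usable (expected, predicted) pairs plus counts of what was skipped."""
    vocab = set(classes)
    pairs = []
    skipped = {"malformed": 0, "out_of_vocab": 0}
    for details in details_list:
        if not isinstance(details, dict):
            skipped["malformed"] += 1
            continue
        expected = details.get("expected")
        predicted = details.get("predicted")
        if not isinstance(expected, str) or not isinstance(predicted, str):
            skipped["malformed"] += 1
            continue
        if expected not in vocab or predicted not in vocab:
            skipped["out_of_vocab"] += 1
            continue
        pairs.append((expected, predicted))
    return pairs, skipped


DETAILS = [
    {"expected": "book", "predicted": "book"},
    {"expected": "book", "predicted": "refund"},  # label outside the declared classes
    {"expected": "cancel"},                       # missing predicted label
    None,                                         # no details payload at all
]
PAIRS, SKIPPED = extract_pairs(DETAILS, ["book", "cancel"])
print(PAIRS, SKIPPED)
```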

Run::

    cd packages/uipath && uv run python examples/dataset_evaluators_demo.py

uv.lock reflects the pyproject.toml version bump (2.10.68 -> 2.10.69)
already in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ajay-kesavan

Contributor Author

Superseded by #1674 (ClassifierEvaluator pure-metadata aggregator). The earlier dataset-level evaluator framework was replaced by the cleaner Classifier design that delegates aggregation math to the C# layer.

…luators

# Conflicts:
#	packages/uipath/pyproject.toml
#	packages/uipath/uv.lock
@ajay-kesavan ajay-kesavan reopened this Jun 19, 2026
ajay-kesavan and others added 3 commits June 18, 2026 21:26
…figs

Pivot dataset evaluators from a separate hierarchy with source_evaluator
pointers to an embedded aggregator-spec design: each per-datapoint
classification evaluator's config carries a self-contained list of
aggregators (precision / recall / fscore), each with its own classes,
averaging, and f_value. No properties are shared up to the evaluator
level — aggregators are fully self-describing.

- Drop source_evaluator pointer from BaseDatasetEvaluatorConfig.
- Add discriminated AggregatorSpec union (precision/recall/fscore).
- Add aggregators field to Binary/Multiclass classification configs.
- Refactor build_dataset_evaluator + compute_dataset_evaluator_results
  to consume aggregator specs from per-datapoint configs directly.
- Drop EvaluationSet.dataset_evaluator_refs (no separate list).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Collapse Precision/Recall/FScore into one ClassificationDatasetEvaluator
  switching on spec.type; factory becomes a one-liner.
- Inline _precision_of/_recall_of/_f_score_of and the one-use _ConfusionData
  helpers; switch _ConfusionData to @dataclass(slots=True).
- Drop dead get_evaluator_id() abstract + 3 overrides + matching
  EvaluatorType enum entries (factory dispatches on spec.type).
- Pull repeated model_config into a private _AggregatorSpecBase.
- Drop registry + impossible-case ValueError in dataset_evaluator_factory
  (pydantic discriminator catches unknown types).
- Have _coerce_justification return the typed justification object.
- Drop the _source_evaluator private/property pair on BaseDatasetEvaluator.

No behavior change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- M2: drop _METRIC_NAME indirection. metric field on
  ClassificationDetails now uses spec.type verbatim ("fscore" not
  "f_score"), matching the discriminator on the wire.
- M3: document confusion_matrix orientation via Field(description=...).
  Matrix is [predicted_idx][expected_idx], opposite of sklearn's
  convention. Add a regression test pinning the orientation.
- M4: _metric raises ValueError on unknown metric_type instead of
  silently falling through to the F-beta formula. Defense in depth
  on top of pydantic's discriminator.
- M6: replace defensive getattr chain in
  compute_dataset_evaluator_results with isinstance narrowing on the
  classification config types. Mypy-clean; intent is now
  "classification configs declare aggregators" rather than "anything
  might have an aggregators attribute".
- L1: rename duplicate test_two_class_macro tests so pytest output
  disambiguates Precision vs Recall.
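
To make the M3 orientation note concrete: scikit-learn's `confusion_matrix` indexes rows by true label and columns by predicted label, so a [predicted_idx][expected_idx] matrix is its transpose. The sketch below pins that with a single misclassified datapoint; `build_matrix` and `to_sklearn_orientation` are hypothetical helpers, not names from the PR.

```python
def build_matrix(pairs, classes):
    """Count (expected, predicted) pairs into matrix[predicted_idx][expected_idx]."""
    index = {label: i for i, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for expected, predicted in pairs:
        matrix[index[predicted]][index[expected]] += 1
    return matrix


def to_sklearn_orientation(matrix):
    """Transpose into the [true_idx][predicted_idx] layout."""
    return [list(column) for column in zip(*matrix)]


CLASSES = ["book", "cancel"]
# One datapoint: expected "book" (index 0), predicted "cancel" (index 1).
MATRIX = build_matrix([("book", "cancel")], CLASSES)
print(MATRIX)
print(to_sklearn_orientation(MATRIX))
```

An asymmetric input like this one is what makes an orientation regression test meaningful, since a symmetric matrix reads the same either way.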

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ad32c22c64


Comment thread packages/uipath/src/uipath/eval/runtime/runtime.py Outdated
- Bump uipath version 2.11.5 -> 2.11.6 (2.11.5 already on PyPI).
- Widen examples/dataset_evaluators_demo.py:report() to accept the full
  EvaluationResult union and narrow once inside with isinstance, fixing
  6 mypy "expected NumericEvaluationResult" errors at the call sites.
- Address Codex P1 (runtime.py:268, result-key collision): two
  aggregators of the same type on the same source (e.g. macro+micro
  precision) previously produced identical {source}.{type} keys, with
  the second silently overwriting the first.
  compute_dataset_evaluator_results now counts type occurrences per
  source and disambiguates duplicate-type aggregators as
  {source}.{type}.{averaging} (plus ".fb{f_value}" for fscore
  variants), preserving the simple key shape for the common
  single-aggregator case. Docstring updated; 2 new tests cover both
  the precision-duplicate and fscore-duplicate paths.
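
A rough sketch of the key scheme this commit describes, using plain dicts in place of the aggregator specs. The `result_keys` function is hypothetical, and the way `f_value` is rendered into the key (Python's default float formatting) is an assumption; the commit message only gives the ".fb{f_value}" pattern.

```python
from collections import Counter


def result_keys(source, aggregators):
    """Build one result key per aggregator, disambiguating only duplicated types."""
    type_counts = Counter(spec["type"] for spec in aggregators)
    keys = []
    for spec in aggregators:
        key = f"{source}.{spec['type']}"
        if type_counts[spec["type"]] > 1:
            key += f".{spec['averaging']}"
            if spec["type"] == "fscore":
                # Assumption: f_value is rendered with default float formatting.
                key += f".fb{spec['f_value']}"
        keys.append(key)
    return keys


KEYS = result_keys(
    "intent_classifier",
    [
        {"type": "precision", "averaging": "macro"},
        {"type": "precision", "averaging": "micro"},
        {"type": "recall", "averaging": "macro"},
        {"type": "fscore", "averaging": "macro", "f_value": 1.0},
        {"type": "fscore", "averaging": "macro", "f_value": 2.0},
    ],
)
print(KEYS)
```

The lone recall aggregator keeps the short {source}.{type} key, so existing consumers of the single-aggregator case are unaffected.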

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@sonarqubecloud


@github-actions


🚨 Heads up: uipath-langchain cross-tests are FAILING 🚨

Your changes may break the uipath-langchain-python integration.

⚠️ These checks are NOT enforced by branch protection rules. Please review the failures before merging.

🔍 Inspect the failed run →

