feat(eval): add dataset-level evaluator framework with precision/recall/f-score #1669
ajay-kesavan wants to merge 7 commits into
…ll/f-score
Introduces a new BaseDatasetEvaluator concept that runs once per evaluation set after all per-datapoint evaluators complete. It consumes per-datapoint EvaluationResultDto values from a named source evaluator and emits a single run-level EvaluationResult.
Includes three starter evaluators for multiclass classification metrics:
- PrecisionDatasetEvaluator
- RecallDatasetEvaluator
- FScoreDatasetEvaluator (configurable beta)
Each takes a required classes list (populated from the UI), supports micro or macro averaging, and emits per-class TP/TN/FP/FN plus the confusion matrix in details. Binary is the 2-class case; there is no separate binary path.
Architecture: BaseDatasetEvaluator is a parallel hierarchy to GenericBaseEvaluator (not a subclass) so the per-datapoint dispatch loop cannot accidentally pick up a dataset evaluator. Each dataset evaluator declares a single source_evaluator by name; the runtime groups per-datapoint results by evaluator name and routes the right list to each dataset evaluator. Configs load from <eval_set>/../dataset_evaluators/*.json, mirroring the evaluators directory layout.
Patch version bumped: 2.10.68 -> 2.10.69.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
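As a rough illustration of the per-class TP/TN/FP/FN counts and the micro versus macro averaging mentioned in this commit, here is a minimal sketch in plain Python. The function names `per_class_counts` and `precision` are hypothetical and are not the SDK's API. The sketch also assumes every label appears in `classes` and that an undefined precision (no predictions for a class) is reported as 0.0, which the PR does not state.

```python
from collections import Counter

def per_class_counts(pairs, classes):
    """Derive TP/FP/FN/TN per class from (expected, predicted) label pairs.

    Hypothetical helper; assumes every label in `pairs` is in `classes`.
    """
    total = len(pairs)
    tp = Counter(e for e, p in pairs if e == p)
    predicted = Counter(p for _, p in pairs)
    expected = Counter(e for e, _ in pairs)
    counts = {}
    for cls in classes:
        fp = predicted[cls] - tp[cls]   # predicted cls, but expected something else
        fn = expected[cls] - tp[cls]    # expected cls, but predicted something else
        tn = total - tp[cls] - fp - fn  # everything not involving cls
        counts[cls] = {"tp": tp[cls], "fp": fp, "fn": fn, "tn": tn}
    return counts

def precision(counts, averaging):
    """Micro pools the counts first; macro averages per-class precision."""
    if averaging == "micro":
        tp = sum(c["tp"] for c in counts.values())
        fp = sum(c["fp"] for c in counts.values())
        return tp / (tp + fp) if tp + fp else 0.0  # assumption: 0.0 when undefined
    if averaging == "macro":
        per_class = [
            c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
            for c in counts.values()
        ]
        return sum(per_class) / len(per_class)
    raise ValueError(f"unknown averaging: {averaging!r}")

# Imbalanced 2-class data: nine "neg" datapoints, one "pos" datapoint.
pairs = [("neg", "neg")] * 8 + [("neg", "pos")] + [("pos", "pos")]
counts = per_class_counts(pairs, ["neg", "pos"])
micro = precision(counts, "micro")
macro = precision(counts, "macro")
```

On this imbalanced input the two averages diverge because macro weights the rare class as heavily as the common one.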
…10.69
examples/dataset_evaluators_demo.py walks the new dataset-level evaluators
(Precision / Recall / F-score) through five scenarios that exercise the
math end-to-end at the SDK layer:
1. Balanced 3-class — symmetric confusion matrix, macro == micro
2. Imbalanced 2-class — shows where macro and micro diverge
3. Same data, four metrics (Precision, Recall, F1, F2) — proves the
F-beta knob actually moves per-class numbers
4. Out-of-vocab + malformed details — n_skipped surfaces, no silent drops
5. Realistic 4-class intent classifier — uneven per-class performance
Each scenario prints the confusion matrix as a table, the per-class
TP/TN/FP/FN + the metric, and a snippet of the wire JSON that AutoMapper
will surface to the frontend.
Run::
cd packages/uipath && uv run python examples/dataset_evaluators_demo.py
uv.lock reflects the pyproject.toml version bump (2.10.68 -> 2.10.69)
already in this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
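The F-beta knob exercised in scenario 3 can be sketched with the standard formula F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). The function name `f_beta` and the choice to return 0.0 when the denominator is zero are assumptions for illustration, not taken from the demo script.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score; beta > 1 weights recall more, beta < 1 weights precision more."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    denominator = beta**2 * precision + recall
    if denominator == 0:
        return 0.0  # assumption: report 0.0 when both precision and recall are zero
    return (1 + beta**2) * precision * recall / denominator

# Same precision and recall, two different beta values.
f1 = f_beta(0.5, 1.0, beta=1.0)
f2 = f_beta(0.5, 1.0, beta=2.0)
```

With recall higher than precision, F2 comes out above F1, which is the kind of per-class movement the scenario is meant to show.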
Superseded by #1674 (ClassifierEvaluator pure-metadata aggregator). The earlier dataset-level evaluator framework was replaced by the cleaner Classifier design that delegates aggregation math to the C# layer.
…luators
# Conflicts:
#	packages/uipath/pyproject.toml
#	packages/uipath/uv.lock
…figs
Pivot dataset evaluators from a separate hierarchy with source_evaluator pointers to an embedded aggregator-spec design: each per-datapoint classification evaluator's config carries a self-contained list of aggregators (precision / recall / fscore), each with its own classes, averaging, and f_value. No properties are shared up to the evaluator level; aggregators are fully self-describing.
- Drop source_evaluator pointer from BaseDatasetEvaluatorConfig.
- Add discriminated AggregatorSpec union (precision/recall/fscore).
- Add aggregators field to Binary/Multiclass classification configs.
- Refactor build_dataset_evaluator + compute_dataset_evaluator_results to consume aggregator specs from per-datapoint configs directly.
- Drop EvaluationSet.dataset_evaluator_refs (no separate list).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Collapse Precision/Recall/FScore into one ClassificationDatasetEvaluator switching on spec.type; factory becomes a one-liner.
- Inline _precision_of/_recall_of/_f_score_of and the one-use _ConfusionData helpers; switch _ConfusionData to @dataclass(slots=True).
- Drop dead get_evaluator_id() abstract + 3 overrides + matching EvaluatorType enum entries (factory dispatches on spec.type).
- Pull repeated model_config into a private _AggregatorSpecBase.
- Drop registry + impossible-case ValueError in dataset_evaluator_factory (pydantic discriminator catches unknown types).
- Have _coerce_justification return the typed justification object.
- Drop the _source_evaluator private/property pair on BaseDatasetEvaluator.
No behavior change.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- M2: drop _METRIC_NAME indirection. metric field on
ClassificationDetails now uses spec.type verbatim ("fscore" not
"f_score"), matching the discriminator on the wire.
- M3: document confusion_matrix orientation via Field(description=...).
Matrix is [predicted_idx][expected_idx], opposite of sklearn's
convention. Add a regression test pinning the orientation.
- M4: _metric raises ValueError on unknown metric_type instead of
silently falling through to the F-beta formula. Defense in depth
on top of pydantic's discriminator.
- M6: replace defensive getattr chain in compute_dataset_evaluator_results
with isinstance narrowing on the classification config types.
Mypy-clean; intent is now "classification configs declare
aggregators" rather than "anything might have an aggregators
attribute".
- L1: rename duplicate test_two_class_macro tests so pytest output
disambiguates Precision vs Recall.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
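The orientation note in M3 is easy to get wrong, so here is a small sketch of a matrix indexed `[predicted_idx][expected_idx]` and its transpose, which gives the rows-are-true, columns-are-predicted layout that sklearn's `confusion_matrix` uses. The helper names are hypothetical and the code only illustrates the indexing, not the PR's implementation.

```python
def confusion_matrix(pairs, classes):
    """Build a matrix indexed [predicted_idx][expected_idx] from (expected, predicted) pairs."""
    index = {cls: i for i, cls in enumerate(classes)}
    size = len(classes)
    matrix = [[0] * size for _ in range(size)]
    for expected, predicted in pairs:
        matrix[index[predicted]][index[expected]] += 1
    return matrix

def to_true_by_predicted(matrix):
    """Transpose into the [true_idx][predicted_idx] layout."""
    return [list(row) for row in zip(*matrix)]

# Expected "a" is predicted as "b" twice, so the off-diagonal cell is
# asymmetric and the two orientations are distinguishable.
pairs = [("a", "a"), ("a", "b"), ("a", "b"), ("b", "b")]
matrix = confusion_matrix(pairs, ["a", "b"])
sklearn_style = to_true_by_predicted(matrix)
```

A regression test of the kind M3 describes would use an asymmetric input like this one, since a symmetric matrix cannot detect a flipped orientation.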
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ad32c22c64
- Bump uipath version 2.11.5 -> 2.11.6 (2.11.5 already on PyPI).
- Widen examples/dataset_evaluators_demo.py:report() to accept the full
EvaluationResult union and narrow once inside with isinstance, fixing
6 mypy "expected NumericEvaluationResult" errors at the call sites.
- Address Codex P1 (runtime.py:268 — result-key collision): two
aggregators of the same type on the same source (e.g. macro+micro
precision) previously produced identical {source}.{type} keys, with
the second silently overwriting the first. compute_dataset_evaluator_results
now counts type occurrences per source and disambiguates
duplicate-type aggregators as {source}.{type}.{averaging} (plus
".fb{f_value}" for fscore variants), preserving the simple key shape
for the common single-aggregator case. Docstring updated; 2 new
tests cover both the precision-duplicate and fscore-duplicate paths.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
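The key-disambiguation rule described in this commit might look roughly like the sketch below. `result_keys` is a hypothetical function operating on plain dicts rather than the real spec objects, and the exact rendering of the F-value in the suffix (here Python's default float formatting) is an assumption.

```python
from collections import Counter

def result_keys(source, aggregators):
    """Return one result key per aggregator, adding suffixes only when a type repeats."""
    type_counts = Counter(agg["type"] for agg in aggregators)
    keys = []
    for agg in aggregators:
        key = f"{source}.{agg['type']}"
        if type_counts[agg["type"]] > 1:
            # Duplicate type on the same source: append the averaging mode.
            key += f".{agg['averaging']}"
            if agg["type"] == "fscore":
                # Assumption: f_value is formatted with default float repr.
                key += f".fb{agg['f_value']}"
        keys.append(key)
    return keys

mixed = result_keys(
    "intent_classifier",
    [
        {"type": "precision", "averaging": "macro"},
        {"type": "precision", "averaging": "micro"},
        {"type": "recall", "averaging": "macro"},
    ],
)
fscores = result_keys(
    "intent_classifier",
    [
        {"type": "fscore", "averaging": "macro", "f_value": 1.0},
        {"type": "fscore", "averaging": "macro", "f_value": 2.0},
    ],
)
```

Counting occurrences up front keeps the simple `{source}.{type}` key for any type that appears only once, so existing single-aggregator consumers see no change.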
Summary
Adds a dataset-level evaluator framework that consumes the per-datapoint results of an existing classification evaluator and emits one run-level metric (precision / recall / F-beta) with a structured `ClassificationDetails` payload (per-class TP/TN/FP/FN/support/value, full confusion matrix, micro/macro, skip counts).

The dataset evaluators' config is embedded in the per-datapoint evaluator's JSON via an `aggregators[]` array. Each aggregator entry is self-contained, with its own `classes`, `averaging`, and (for `fscore` only) `fValue`, and there is no separate dataset-evaluator entity or external pointer.

Shape

```json
{
  "id": "intent_classifier",
  "evaluatorTypeId": "uipath-multiclass-classification",
  "evaluatorConfig": {
    "name": "intent_classifier",
    "targetOutputKey": "intent",
    "classes": ["book", "cancel", "reschedule"],
    "aggregators": [
      { "type": "precision", "classes": ["book", "cancel", "reschedule"], "averaging": "macro" },
      { "type": "recall", "classes": ["book", "cancel", "reschedule"], "averaging": "macro" },
      { "type": "fscore", "classes": ["book", "cancel", "reschedule"], "averaging": "macro", "fValue": 1.0 }
    ]
  }
}
```

What's in the PR
- `_aggregator_specs.py`: discriminated `AggregatorSpec` union with a `type` field. Three concrete variants: `PrecisionAggregatorSpec`, `RecallAggregatorSpec`, `FScoreAggregatorSpec`. Pydantic catches unknown `type` values at parse time.
- `base_dataset_evaluator.py`: `BaseDatasetEvaluator` abstract class, constructed from `(spec, source_evaluator_name)`. No `source_evaluator` pointer in the spec itself.
- `classification_dataset_evaluators.py`: single `ClassificationDatasetEvaluator` class dispatches on `spec.type` to choose precision / recall / F-beta math. Confusion matrix oriented `[predicted_idx][expected_idx]` (documented as differing from sklearn's `[true][predicted]`).
- `dataset_evaluator_factory.py`: `build_dataset_evaluator(spec, source_evaluator_name)` returns the right instance for the spec's type.
- `binary_classification_evaluator.py` / `multiclass_classification_evaluator.py`: gained an `aggregators: list[AggregatorSpec] | None` field on their config. Other evaluator configs are unchanged.
- `runtime.py`: `compute_dataset_evaluator_results()` walks each evaluator's `evaluator_config.aggregators`, builds the dataset evaluators on demand, runs them against the per-datapoint results, and returns a `dict` keyed `"{source_evaluator_name}.{aggregator_type}"` (e.g. `intent_classifier.precision`).
- `fValue` camelCase alias.
- `examples/dataset_evaluators_demo.py` runs the framework end-to-end and prints the resulting `ClassificationDetails` via `model_dump_json(indent=2, by_alias=True)`.

What's NOT in the PR (consciously)
- `EvaluationSet.dataset_evaluator_refs`: aggregators live on per-datapoint evaluator configs, not as a separate top-level list.
- `EvalHelpers.load_dataset_evaluators` / `UiPathEvalContext.dataset_evaluators`: the SDK runtime discovers aggregators directly from the per-datapoint evaluator configs.
- `PrecisionDatasetEvaluator` / `RecallDatasetEvaluator` / `FScoreDatasetEvaluator` classes: collapsed into one `ClassificationDatasetEvaluator` that switches on `spec.type`.
- `EvaluatorType.DATASET_PRECISION/RECALL/F_SCORE` enum entries: the discriminator field on the spec is the dispatch key.

How to test

Companion PRs
- `feat/classification-evaluator-types` (#1663)
- `feat/eval-dataset-evaluators-backend` (#5307): `DatasetEvaluatorsWorkflow` that consumes the embedded aggregator specs
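To make the embedded-aggregator layout concrete, the following sketch reads the example evaluator config with the standard `json` module and walks its `aggregators` array, producing the base `{source_evaluator_name}.{aggregator_type}` keys. It does not use the SDK: `iter_aggregators` is a hypothetical helper, and taking the source name from `evaluatorConfig.name` (falling back to `id`) is an assumption.

```python
import json

CONFIG_JSON = """
{
  "id": "intent_classifier",
  "evaluatorTypeId": "uipath-multiclass-classification",
  "evaluatorConfig": {
    "name": "intent_classifier",
    "targetOutputKey": "intent",
    "classes": ["book", "cancel", "reschedule"],
    "aggregators": [
      {"type": "precision", "classes": ["book", "cancel", "reschedule"], "averaging": "macro"},
      {"type": "recall", "classes": ["book", "cancel", "reschedule"], "averaging": "macro"},
      {"type": "fscore", "classes": ["book", "cancel", "reschedule"], "averaging": "macro", "fValue": 1.0}
    ]
  }
}
"""

def iter_aggregators(evaluator):
    """Yield (source_name, aggregator_spec) pairs from one evaluator config dict."""
    config = evaluator.get("evaluatorConfig", {})
    # Assumption: the source name is the config's "name", else the evaluator "id".
    source_name = config.get("name", evaluator.get("id"))
    # "aggregators" is optional; treat a missing or null value as an empty list.
    for spec in config.get("aggregators") or []:
        yield source_name, spec

evaluator = json.loads(CONFIG_JSON)
keys = [f"{name}.{spec['type']}" for name, spec in iter_aggregators(evaluator)]
with_f_value = [spec["type"] for _, spec in iter_aggregators(evaluator) if "fValue" in spec]
```

Because each entry repeats its own `classes` and `averaging`, an aggregator can be evaluated without looking anything up on the parent evaluator config.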