feat(eval): classification evaluator schemas + sample projects + e2e tests #1663
ajay-kesavan wants to merge 19 commits into
Generates BinaryClassificationEvaluator.json and MulticlassClassificationEvaluator.json from the new evaluators added in #1397 so external tooling (Flow UI evaluator picker, `uip maestro flow eval`) can read the config / criteria / justification schemas.
Files produced by `python -m uipath.eval.evaluators_types.generate_types`, restricted to the two new evaluator types. A companion PR refreshes the other 11 stale schemas in evaluators_types/.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6931598 to 6b11767
…tors
Adds two sample projects under packages/uipath/samples/ that double as end-to-end test fixtures for the binary and multiclass classification evaluators added in #1397:
- binary_classification_agent: rule-based spam/ham classifier wired up to the binary classification evaluator with metric_type=precision. Eval set is designed so 4/5 datapoints pass but precision is 2/3 because of one deliberate false positive.
- multiclass_classification_simple: rule-based 3-class router (payments / support / spam) wired up to the multiclass classification evaluator with macro-averaged F1. Eval set forces a misroute that hurts both payments precision and support recall, giving macro F1 = 26/30.
Adds tests/cli/eval/test_classification_samples_e2e.py, which loads each sample's eval-sets/default.json, wires its main.py into a stand-in runtime, calls evaluate(), and asserts both the per-row scores and the aggregated metric produced by reduce_scores. Locks in the dataset-level math, not just per-row correct/incorrect.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
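The two headline numbers in this commit message can be checked by hand. The sketch below is only an illustration: the rows and per-class counts are hypothetical, chosen to match the description (one false positive in five binary rows, one support ticket misrouted to payments), and are not copied from either sample's eval-sets/default.json.

```python
from fractions import Fraction

# Hypothetical (expected, predicted) rows for the binary sample.
binary_rows = [
    ("spam", "spam"),
    ("spam", "spam"),
    ("ham", "spam"),  # the deliberate false positive
    ("ham", "ham"),
    ("ham", "ham"),
]

# Per-row pass rate: fraction of rows where prediction matches expectation.
pass_rate = Fraction(sum(e == p for e, p in binary_rows), len(binary_rows))

# Precision for the positive class "spam": TP / (TP + FP).
tp = sum(e == "spam" and p == "spam" for e, p in binary_rows)
fp = sum(e != "spam" and p == "spam" for e, p in binary_rows)
precision = Fraction(tp, tp + fp)


def f1(tp: int, fp: int, fn: int) -> Fraction:
    """F1 written directly in counts: 2TP / (2TP + FP + FN)."""
    return Fraction(2 * tp, 2 * tp + fp + fn)


# One support ticket misrouted to payments: payments gains a false positive,
# support gains a false negative, spam is untouched (assumed counts).
per_class_f1 = [
    f1(tp=2, fp=1, fn=0),  # payments -> 4/5
    f1(tp=2, fp=0, fn=1),  # support  -> 4/5
    f1(tp=2, fp=0, fn=0),  # spam     -> 1
]
macro_f1 = sum(per_class_f1) / len(per_class_f1)

print(pass_rate, precision, macro_f1)  # 4/5 2/3 13/15
```

Here 13/15 is 26/30 in lowest terms, so the pass rate (4/5), the precision (2/3), and the macro F1 all agree with the figures quoted above.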
…ll/f-score
Introduces a new BaseDatasetEvaluator concept that runs once per evaluation set after all per-datapoint evaluators complete. It consumes per-datapoint EvaluationResultDto values from a named source evaluator and emits a single run-level EvaluationResult.
Includes three starter evaluators for multiclass classification metrics:
- PrecisionDatasetEvaluator
- RecallDatasetEvaluator
- FScoreDatasetEvaluator (configurable beta)
Each takes a required classes list (populated from the UI), supports micro or macro averaging, and emits per-class TP/TN/FP/FN plus the confusion matrix in details. Binary is the 2-class case; there is no separate binary path.
Architecture: BaseDatasetEvaluator is a parallel hierarchy to GenericBaseEvaluator (not a subclass) so the per-datapoint dispatch loop cannot accidentally pick up a dataset evaluator. Each dataset evaluator declares a single source_evaluator by name; the runtime groups per-datapoint results by evaluator name and routes the right list to each dataset evaluator. Configs load from <eval_set>/../dataset_evaluators/*.json, mirroring the evaluators directory layout.
Patch version bumped: 2.10.68 -> 2.10.69.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
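To make the micro versus macro distinction concrete, here is a minimal sketch of how per-class TP/TN/FP/FN and the two precision averages could be derived from (expected, predicted) pairs. It is not the PR's implementation: the function names, the data, and the convention of returning 0.0 when a class is never predicted are all assumptions made for illustration.

```python
def per_class_counts(
    pairs: list[tuple[str, str]], classes: list[str]
) -> dict[str, dict[str, int]]:
    """One-vs-rest TP/FP/FN/TN for every class."""
    counts = {c: {"tp": 0, "fp": 0, "fn": 0, "tn": 0} for c in classes}
    for expected, predicted in pairs:
        for c in classes:
            if predicted == c and expected == c:
                counts[c]["tp"] += 1
            elif predicted == c:
                counts[c]["fp"] += 1
            elif expected == c:
                counts[c]["fn"] += 1
            else:
                counts[c]["tn"] += 1
    return counts


def precision(tp: int, fp: int) -> float:
    # Assumed convention: a class that is never predicted scores 0.0.
    return tp / (tp + fp) if (tp + fp) else 0.0


def macro_precision(counts: dict[str, dict[str, int]]) -> float:
    """Unweighted mean of the per-class precisions."""
    values = [precision(c["tp"], c["fp"]) for c in counts.values()]
    return sum(values) / len(values)


def micro_precision(counts: dict[str, dict[str, int]]) -> float:
    """Precision over the pooled TP and FP of all classes."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    return precision(tp, fp)


# Hypothetical imbalanced 2-class data: 9 ham rows, 1 spam row.
pairs = [("ham", "ham")] * 8 + [("ham", "spam"), ("spam", "spam")]
counts = per_class_counts(pairs, ["ham", "spam"])

print(counts["spam"])           # {'tp': 1, 'fp': 1, 'fn': 0, 'tn': 8}
print(macro_precision(counts))  # 0.75
print(micro_precision(counts))  # 0.9
```

The minority class drags the macro average down to 0.75 while the pooled micro figure stays at 0.9, which is the kind of divergence an imbalanced dataset produces.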
…10.69
examples/dataset_evaluators_demo.py walks the new dataset-level evaluators
(Precision / Recall / F-score) through five scenarios that exercise the
math end-to-end at the SDK layer:
1. Balanced 3-class — symmetric confusion matrix, macro == micro
2. Imbalanced 2-class — shows where macro and micro diverge
3. Same data, four metrics (Precision, Recall, F1, F2) — proves the
F-beta knob actually moves per-class numbers
4. Out-of-vocab + malformed details — n_skipped surfaces, no silent drops
5. Realistic 4-class intent classifier — uneven per-class performance
Each scenario prints the confusion matrix as a table, the per-class
TP/TN/FP/FN + the metric, and a snippet of the wire JSON that AutoMapper
will surface to the frontend.
Run::
cd packages/uipath && uv run python examples/dataset_evaluators_demo.py
uv.lock reflects the pyproject.toml version bump (2.10.68 -> 2.10.69)
already in this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
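Scenario 3 relies on the F-beta score responding to beta. The following is a small worked sketch of that effect using the standard count-based formula, F_beta = (1 + beta^2) TP / ((1 + beta^2) TP + beta^2 FN + FP). The counts are invented for illustration and the zero-denominator fallback is an assumption, not something the demo script is known to do.

```python
from fractions import Fraction


def f_beta(tp: int, fp: int, fn: int, beta: int = 1) -> Fraction:
    """F-beta from raw counts; beta > 1 weights recall more than precision."""
    b2 = Fraction(beta) ** 2
    numerator = (1 + b2) * tp
    denominator = numerator + b2 * fn + fp
    if denominator == 0:
        return Fraction(0)  # assumed convention for an empty class
    return numerator / denominator


# Hypothetical class with precision 6/8 = 0.75 and recall 6/10 = 0.6.
tp, fp, fn = 6, 2, 4

f1 = f_beta(tp, fp, fn, beta=1)
f2 = f_beta(tp, fp, fn, beta=2)

print(f1, f2)  # 2/3 5/8
```

Because recall (0.6) is the weaker of the two here, F2 (0.625) lands closer to recall than F1 (about 0.667) does.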
Superseded by #1674 (ClassifierEvaluator). The schema/sample work here was replaced by the simpler single-evaluator approach.
…luators
# Conflicts:
#	packages/uipath/pyproject.toml
#	packages/uipath/uv.lock
…figs
Pivot dataset evaluators from a separate hierarchy with source_evaluator pointers to an embedded aggregator-spec design: each per-datapoint classification evaluator's config carries a self-contained list of aggregators (precision / recall / fscore), each with its own classes, averaging, and f_value. No properties are shared up to the evaluator level; aggregators are fully self-describing.
- Drop source_evaluator pointer from BaseDatasetEvaluatorConfig.
- Add discriminated AggregatorSpec union (precision/recall/fscore).
- Add aggregators field to Binary/Multiclass classification configs.
- Refactor build_dataset_evaluator + compute_dataset_evaluator_results to consume aggregator specs from per-datapoint configs directly.
- Drop EvaluationSet.dataset_evaluator_refs (no separate list).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
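As a rough picture of the embedded design, the sketch below parses a list of self-describing aggregator entries out of an evaluator config. The PR uses a pydantic discriminated union; this stand-in uses plain dataclasses to stay dependency-free, and the class name, parser function, defaults, and sample config are all hypothetical. Only the field names (type, classes, averaging, f_value) and the three type values come from the commit message.

```python
from dataclasses import dataclass
from typing import Any

ALLOWED_TYPES = ("precision", "recall", "fscore")


@dataclass(frozen=True)
class AggregatorSpecSketch:
    """Stand-in for one aggregators[] entry (not the PR's pydantic model)."""

    type: str                   # discriminator: precision | recall | fscore
    classes: tuple[str, ...]    # each aggregator carries its own label space
    averaging: str = "macro"    # assumed default
    f_value: float = 1.0        # only meaningful when type == "fscore"


def parse_aggregators(evaluator_config: dict[str, Any]) -> list[AggregatorSpecSketch]:
    """Read every aggregator entry, rejecting unknown discriminator values."""
    specs = []
    for raw in evaluator_config.get("aggregators", []):
        if raw["type"] not in ALLOWED_TYPES:
            raise ValueError(f"unknown aggregator type: {raw['type']!r}")
        specs.append(
            AggregatorSpecSketch(
                type=raw["type"],
                classes=tuple(raw["classes"]),
                averaging=raw.get("averaging", "macro"),
                f_value=raw.get("f_value", 1.0),
            )
        )
    return specs


# Hypothetical evaluator config with two self-contained aggregators.
config = {
    "aggregators": [
        {"type": "precision", "classes": ["spam", "ham"], "averaging": "macro"},
        {"type": "fscore", "classes": ["spam", "ham"], "averaging": "micro", "f_value": 2.0},
    ]
}
specs = parse_aggregators(config)
print([s.type for s in specs])  # ['precision', 'fscore']
```

Nothing is inherited from the enclosing evaluator in this sketch, which mirrors the "fully self-describing" property described above.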
…tors
Update binary_classification_agent and multiclass_classification_simple
sample evaluator JSONs to include the new aggregators[] field. Each
aggregator carries its own classes, averaging, and (for fscore) fValue.
Update the e2e test to also assert the dataset-level results land in
UiPathEvalOutput.dataset_evaluator_results, keyed
"{evaluator_name}.{aggregator_type}".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Collapse Precision/Recall/FScore into one ClassificationDatasetEvaluator switching on spec.type; factory becomes a one-liner.
- Inline _precision_of/_recall_of/_f_score_of and the one-use _ConfusionData helpers; switch _ConfusionData to @dataclass(slots=True).
- Drop dead get_evaluator_id() abstract + 3 overrides + matching EvaluatorType enum entries (factory dispatches on spec.type).
- Pull repeated model_config into a private _AggregatorSpecBase.
- Drop registry + impossible-case ValueError in dataset_evaluator_factory (pydantic discriminator catches unknown types).
- Have _coerce_justification return the typed justification object.
- Drop the _source_evaluator private/property pair on BaseDatasetEvaluator.
No behavior change.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add BaseEvaluatorJustification.try_from classmethod and collapse the
three duplicate "instance | dict | other" coercion blocks in
classification_dataset_evaluators, binary_classification_evaluator,
and multiclass_classification_evaluator down to one line each.
- Replace the 80-line ASCII confusion-matrix pretty-printer in
dataset_evaluators_demo with the structured JSON wire shape — the
thing readers actually want to inspect.
Deferred from this PR: dropping reduce_scores / _micro_metric /
_macro_metric on Binary/Multiclass evaluators, and the matching
metric_type/averaging/f_value config fields. The runtime calls
GenericBaseEvaluator.reduce_scores per-evaluator to compute the
top-level evaluator score; the dataset evaluator framework adds
{source}.{type}-keyed metrics in addition to that score, it doesn't
replace it. Removing them would break the existing per-evaluator
headline. Worth a follow-up that either makes reduce_scores delegate
to the dataset evaluator framework or formally splits the two paths.
No behavior change.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- M2: drop _METRIC_NAME indirection. metric field on
ClassificationDetails now uses spec.type verbatim ("fscore" not
"f_score"), matching the discriminator on the wire.
- M3: document confusion_matrix orientation via Field(description=...).
Matrix is [predicted_idx][expected_idx], opposite of sklearn's
convention. Add a regression test pinning the orientation.
- M4: _metric raises ValueError on unknown metric_type instead of
silently falling through to the F-beta formula. Defense in depth
on top of pydantic's discriminator.
- M6: replace defensive getattr chain in
compute_dataset_evaluator_results with isinstance narrowing on the
classification config types.
Mypy-clean; intent is now "classification configs declare
aggregators" rather than "anything might have an aggregators
attribute".
- L1: rename duplicate test_two_class_macro tests so pytest output
disambiguates Precision vs Recall.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
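The M3 orientation note is easy to get backwards, so here is a minimal sketch of what [predicted_idx][expected_idx] means in practice and how it relates to sklearn's [true][predicted] layout. The helper names and data are hypothetical; only the orientation itself comes from the commit message.

```python
def build_confusion(
    pairs: list[tuple[str, str]], classes: list[str]
) -> list[list[int]]:
    """Confusion matrix indexed [predicted_idx][expected_idx]."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for expected, predicted in pairs:
        matrix[index[predicted]][index[expected]] += 1
    return matrix


def to_sklearn_orientation(matrix: list[list[int]]) -> list[list[int]]:
    """Transpose into [expected_idx][predicted_idx], the sklearn layout."""
    return [list(row) for row in zip(*matrix)]


classes = ["spam", "ham"]
# Hypothetical (expected, predicted) rows with one ham predicted as spam.
pairs = [
    ("spam", "spam"),
    ("spam", "spam"),
    ("ham", "spam"),
    ("ham", "ham"),
    ("ham", "ham"),
]

matrix = build_confusion(pairs, classes)
print(matrix)                          # [[2, 1], [0, 2]]
print(to_sklearn_orientation(matrix))  # [[2, 0], [1, 2]]
```

In the [predicted][expected] layout each row sums to the number of times a class was predicted (TP + FP), so the false positive appears in the spam row; after transposing, it appears in the ham row, as it would in sklearn's output.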
- H1/H2: pydantic model_validator on Binary/Multiclass classification configs cross-checks aggregators against evaluator-level fields. Binary rejects aggregators whose `classes` doesn't include `positive_class`, and aggregators of the same metric type with a different `f_value`. Multiclass extends this with the full class-coverage check and an `averaging` consistency check. Without this, a user could ship configs where the per-evaluator headline and the dataset aggregator silently scored disjoint label spaces or used different averaging.
- H3: binary e2e test now asserts the precision/recall/fscore aggregator scores (5/6, 5/6, 0.8) instead of only the key set. A regression that zeros out all aggregator scores would now fail the test.
- H4: multiclass `evaluate()` no longer raises on out-of-vocab predicted class. It now returns score=0.0 with the OOV label preserved in the justification, mirroring binary's behavior. The dataset evaluator's confusion matrix already accounts for this via `n_skipped`. Configuration errors (expected_class outside vocab) still raise.
- M1: drop the `_coerce_justification` one-line wrapper; inline `BaseEvaluatorJustification.try_from(r.details)` at the single caller in `_build_confusion`.
- M2: preserve user-supplied class casing in `_ConfusionData.classes` and the `per_class` keys. The lowercase normalization is now only used for the internal lookup index, so a config with classes=["Spam","Ham"] surfaces "Spam"/"Ham" in the output rather than "spam"/"ham".
- M3 (multiclass `reduce_scores` + ClassificationDatasetEvaluator double-walking the same confusion matrix): deferred. Cleanest fix is to drop the evaluator-level `metric_type`/`averaging`/`f_value` fields and route the per-evaluator headline through the dataset evaluator framework, which is out of scope for this commit. Tracked as a follow-up.
- L1: refreshed test_classification_samples_e2e docstring to reflect the new aggregator-score coverage on the binary side.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
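The M2 and H4 items combine two ideas: lowercase only for lookup while keeping the user's casing in the output, and counting out-of-vocab labels instead of raising. A minimal sketch of that pattern follows; the function name, return shape, and data are hypothetical and it tallies only correct predictions rather than building the PR's full confusion data.

```python
def count_correct(
    pairs: list[tuple[str, str]], classes: list[str]
) -> tuple[dict[str, int], int]:
    """Tally correct predictions per class, keyed by the user's casing.

    Labels are matched case-insensitively; rows whose expected or predicted
    label is outside the vocabulary are counted in n_skipped, not dropped
    silently.
    """
    lookup = {c.lower(): c for c in classes}  # internal index only
    correct = {c: 0 for c in classes}         # output keeps original casing
    n_skipped = 0
    for expected, predicted in pairs:
        exp = lookup.get(expected.lower())
        pred = lookup.get(predicted.lower())
        if exp is None or pred is None:
            n_skipped += 1
            continue
        if exp == pred:
            correct[exp] += 1
    return correct, n_skipped


# Hypothetical rows: mixed casing plus one out-of-vocab prediction.
pairs = [("spam", "SPAM"), ("Ham", "ham"), ("ham", "phishing")]
correct, n_skipped = count_correct(pairs, ["Spam", "Ham"])

print(correct)    # {'Spam': 1, 'Ham': 1}
print(n_skipped)  # 1
```

The output keys are "Spam" and "Ham" exactly as configured, even though every comparison went through the lowercase index.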
@chatgpt-codex-connector please run an adversarial code review on this PR using the methodology below. Treat the code as a rival's, not a colleague's. Start from "this change is broken" and hunt evidence. If you can't find a real bug after genuinely trying, a positive review is fine, but the search has to happen first.
Three non-negotiable promises
7-step method (skip none)
Severity tiers
Decision rule: if the PR cannot merge without introducing the bug you describe → 🔴. Merges but degrades UX → 🟠. Below that → 🟡 / 🔵.
Output format
Per finding:
Red flags: stop and re-read
This PR's specific surface area
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 027901c96b
- Bump uipath version 2.11.5 -> 2.11.6 (2.11.5 already on PyPI).
- Widen examples/dataset_evaluators_demo.py:report() to accept the full
EvaluationResult union and narrow once inside with isinstance, fixing
6 mypy "expected NumericEvaluationResult" errors at the call sites.
- Address Codex P1 (runtime.py:268 — result-key collision): two
aggregators of the same type on the same source (e.g. macro+micro
precision) previously produced identical {source}.{type} keys, with
the second silently overwriting the first.
compute_dataset_evaluator_results now counts type occurrences per source
and disambiguates
duplicate-type aggregators as {source}.{type}.{averaging} (plus
".fb{f_value}" for fscore variants), preserving the simple key shape
for the common single-aggregator case. Docstring updated; 2 new
tests cover both the precision-duplicate and fscore-duplicate paths.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
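The key-collision fix described in this commit can be sketched as follows. This is an illustration of the naming rule only: the function name, the dict-based spec shape, the evaluator name, and the exact formatting of f_value inside the ".fb" suffix are assumptions, not the runtime's actual code.

```python
from collections import Counter


def dataset_result_keys(source: str, specs: list[dict]) -> list[str]:
    """Build one result key per aggregator on a single source evaluator.

    A type that appears once keeps the simple {source}.{type} key. Types
    that appear more than once are disambiguated with the averaging mode
    and, for fscore, an f_value suffix.
    """
    type_counts = Counter(spec["type"] for spec in specs)
    keys = []
    for spec in specs:
        key = f"{source}.{spec['type']}"
        if type_counts[spec["type"]] > 1:
            key += f".{spec['averaging']}"
            if spec["type"] == "fscore":
                key += f".fb{spec['f_value']}"  # suffix format is assumed
        keys.append(key)
    return keys


# Hypothetical aggregators: macro and micro precision plus a single recall.
specs = [
    {"type": "precision", "averaging": "macro"},
    {"type": "precision", "averaging": "micro"},
    {"type": "recall", "averaging": "macro"},
]
keys = dataset_result_keys("spam_classifier", specs)
print(keys)
# ['spam_classifier.precision.macro', 'spam_classifier.precision.micro',
#  'spam_classifier.recall']
```

The single recall aggregator keeps its short key, so existing consumers of the common one-aggregator case would see no change in key shape.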
…alidator
The fscore-duplicate disambiguation test added in 4d6afcc conflicts with the H2 model_validator on #1663, which cross-checks aggregator f_value against the evaluator's f_value when types match. The precision-duplicate test still exercises the new _dataset_result_key path; the FScore branch is exercised by the factory + math tests.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Regenerate BinaryClassificationEvaluator.json and MulticlassClassificationEvaluator.json from the updated pydantic models so schema-driven consumers can discover and validate the new evaluatorConfig.aggregators array + Precision/Recall/FScore variants.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Completes the classification evaluator feature shipped in #1397 by adding the three pieces that PR didn't carry:
- Generated type schemas: `BinaryClassificationEvaluator.json` and `MulticlassClassificationEvaluator.json` under `packages/uipath/src/uipath/eval/evaluators_types/`, produced by `python -m uipath.eval.evaluators_types.generate_types`. These are the machine-readable schemas external tooling (Flow UI evaluator picker, `uip maestro flow eval`) uses to know each evaluator's config / criteria / justification shape.
- Sample projects under `packages/uipath/samples/`:
  - `binary_classification_agent/`: rule-based spam/ham classifier wired to the binary classification evaluator with `metric_type=precision`. Eval set is designed so 4/5 datapoints pass but precision is 2/3 because of one deliberate false positive, which demonstrates the dataset-level metric diverging from a simple per-row pass rate.
  - `multiclass_classification_simple/`: rule-based 3-class router (payments / support / spam) wired to the multiclass classification evaluator with `averaging=macro`. Eval set forces a misroute that hurts both payments precision and support recall, giving macro F1 = (0.8 + 0.8 + 1.0) / 3 = 26/30.
- End-to-end test at `packages/uipath/tests/cli/eval/test_classification_samples_e2e.py`: loads each sample's eval set, wires its `main.py` into a stand-in runtime, calls `evaluate()`, and asserts both the per-row scores and the aggregated metric produced by `reduce_scores`. Locks in the dataset-level math.
Why split this PR
PR #1397 added the Python implementation and registered the new evaluator type IDs (`uipath-binary-classification`, `uipath-multiclass-classification`) in the coded-evaluator discriminator, but didn't regenerate the JSON type files or add a runnable example. Without these, the evaluators are merged in name only.
Test plan
- `pytest tests/cli/eval/test_classification_samples_e2e.py`: both samples pass
- `ruff check tests/cli/eval/test_classification_samples_e2e.py`: clean
- `ruff format --check`: clean
- `cat packages/uipath/src/uipath/eval/evaluators_types/BinaryClassificationEvaluator.json` exposes `positive_class`, `metric_type`, `f_value` in `evaluatorConfigSchema.properties`
- `cat packages/uipath/src/uipath/eval/evaluators_types/MulticlassClassificationEvaluator.json` exposes `classes`, `averaging`, `metric_type`, `f_value`
Related PRs
Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com
🤖 Generated with Claude Code