feat(eval): classification evaluator schemas + sample projects + e2e tests #1663

Draft
ajay-kesavan wants to merge 19 commits into main from feat/classification-evaluator-types

Conversation

@ajay-kesavan

@ajay-kesavan ajay-kesavan commented May 20, 2026

Contributor

Summary

Completes the classification evaluator feature shipped in #1397 by adding the three pieces that PR didn't carry:

  1. Generated type schemas: BinaryClassificationEvaluator.json and MulticlassClassificationEvaluator.json under packages/uipath/src/uipath/eval/evaluators_types/, produced by python -m uipath.eval.evaluators_types.generate_types. These are the machine-readable schemas that external tooling (Flow UI evaluator picker, uip maestro flow eval) uses to learn each evaluator's config / criteria / justification shape.

  2. Sample projects under packages/uipath/samples/:

    • binary_classification_agent/ — rule-based spam/ham classifier wired to the binary classification evaluator with metric_type=precision. Eval set is designed so 4/5 datapoints pass but precision is 2/3 because of one deliberate false positive — demonstrates the dataset-level metric diverging from a simple per-row pass rate.
    • multiclass_classification_simple/ — rule-based 3-class router (payments / support / spam) wired to the multiclass classification evaluator with averaging=macro. Eval set forces a misroute that hurts both payments precision and support recall, giving macro F1 = (0.8 + 0.8 + 1.0) / 3.
  3. End-to-end test at packages/uipath/tests/cli/eval/test_classification_samples_e2e.py — loads each sample's eval set, wires its main.py into a stand-in runtime, calls evaluate(), and asserts both the per-row scores and the aggregated metric produced by reduce_scores. Locks in the dataset-level math.
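
The binary sample's numbers can be checked by hand. The sketch below is not the sample's actual eval set; it uses a hypothetical five-row dataset with the same shape (one deliberate false positive) to show how a 4/5 per-row pass rate and a 2/3 precision come from the same rows.

```python
from fractions import Fraction

# Hypothetical five-row eval set as (expected, predicted) pairs.
# The third row is the deliberate false positive: ham flagged as spam.
rows = [
    ("spam", "spam"),
    ("spam", "spam"),
    ("ham", "spam"),
    ("ham", "ham"),
    ("ham", "ham"),
]
positive_class = "spam"

# Per-row pass rate: fraction of rows where the prediction matches.
pass_rate = Fraction(sum(e == p for e, p in rows), len(rows))

# Precision: of everything predicted positive, how much was truly positive.
tp = sum(1 for e, p in rows if p == positive_class and e == positive_class)
fp = sum(1 for e, p in rows if p == positive_class and e != positive_class)
precision = Fraction(tp, tp + fp)

print(pass_rate, precision)  # 4/5 2/3
```

Only the single false positive counts against precision, while the pass rate spreads that one failure over all five rows, which is why the two numbers diverge.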

Why split this PR

PR #1397 added the Python implementation and registered the new evaluator type IDs (uipath-binary-classification, uipath-multiclass-classification) in the coded-evaluator discriminator, but didn't regenerate the JSON type files or add a runnable example. Without these the evaluators are merged-in-name-only.

Test plan

  • pytest tests/cli/eval/test_classification_samples_e2e.py — both samples pass
  • ruff check tests/cli/eval/test_classification_samples_e2e.py — clean
  • ruff format --check — clean
  • cat packages/uipath/src/uipath/eval/evaluators_types/BinaryClassificationEvaluator.json exposes positive_class, metric_type, f_value in evaluatorConfigSchema.properties
  • cat packages/uipath/src/uipath/eval/evaluators_types/MulticlassClassificationEvaluator.json exposes classes, averaging, metric_type, f_value
  • CI passes

Related PRs

  • chore(eval): resync evaluator type schemas with Python source #1664 — companion PR that refreshes the 11 unrelated stale schemas in the same directory (split out for review hygiene; no functional overlap with this PR).
  • UiPath/cli#2128 — TypeScript-side flow-tool registry entries that wire these evaluators into the Flow UI evaluator picker.

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

🤖 Generated with Claude Code

Generates BinaryClassificationEvaluator.json and MulticlassClassificationEvaluator.json
from the new evaluators added in #1397 so external tooling (Flow UI evaluator
picker, `uip maestro flow eval`) can read the config / criteria / justification
schemas.

Files produced by `python -m uipath.eval.evaluators_types.generate_types`,
restricted to the two new evaluator types. A companion PR refreshes the other
11 stale schemas in evaluators_types/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ajay-kesavan ajay-kesavan force-pushed the feat/classification-evaluator-types branch from 6931598 to 6b11767 Compare May 20, 2026 00:54
@ajay-kesavan ajay-kesavan changed the title chore(eval): regenerate evaluator type schemas with classification evaluators feat(eval): add evaluator type schemas for classification evaluators May 20, 2026
…tors

Adds two sample projects under packages/uipath/samples/ that double as
end-to-end test fixtures for the binary and multiclass classification
evaluators added in #1397:

- binary_classification_agent — rule-based spam/ham classifier wired up
  to the binary classification evaluator with metric_type=precision.
  Eval set is designed so 4/5 datapoints pass but precision is 2/3
  because of one deliberate false positive.
- multiclass_classification_simple — rule-based 3-class router (payments
  / support / spam) wired up to the multiclass classification evaluator
  with macro-averaged F1. Eval set forces a misroute that hurts both
  payments precision and support recall, giving macro F1 = 26/30.
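
The 26/30 figure is the same value as (0.8 + 0.8 + 1.0) / 3. A minimal sketch, assuming a hypothetical six-row eval set in which one support ticket is misrouted to payments (the real sample's rows may differ):

```python
from fractions import Fraction

# Hypothetical eval set as (expected, predicted) pairs.
rows = [
    ("payments", "payments"),
    ("payments", "payments"),
    ("support", "payments"),  # the forced misroute
    ("support", "support"),
    ("support", "support"),
    ("spam", "spam"),
]
classes = ["payments", "support", "spam"]


def f1_for(cls):
    """One-vs-rest F1 for a single class: 2TP / (2TP + FP + FN)."""
    tp = sum(1 for e, p in rows if e == cls and p == cls)
    fp = sum(1 for e, p in rows if e != cls and p == cls)
    fn = sum(1 for e, p in rows if e == cls and p != cls)
    denom = 2 * tp + fp + fn
    return Fraction(2 * tp, denom) if denom else Fraction(0)


per_class = {c: f1_for(c) for c in classes}
# Macro averaging: unweighted mean of the per-class scores.
macro_f1 = sum(per_class.values(), Fraction(0)) / len(classes)

print(macro_f1)  # 13/15, i.e. 26/30
```

The single misroute costs payments a false positive (precision 2/3) and support a false negative (recall 2/3), so both classes land at F1 = 0.8 while spam stays at 1.0.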

Adds tests/cli/eval/test_classification_samples_e2e.py which loads each
sample's eval-sets/default.json, wires its main.py into a stand-in runtime,
calls evaluate(), and asserts both the per-row scores and the aggregated
metric produced by reduce_scores. Locks in the dataset-level math, not just
per-row correct/incorrect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ajay-kesavan ajay-kesavan changed the title feat(eval): add evaluator type schemas for classification evaluators feat(eval): classification evaluator schemas + sample projects + e2e tests May 20, 2026
ajay-kesavan and others added 2 commits May 20, 2026 14:05
…ll/f-score

Introduces a new BaseDatasetEvaluator concept that runs once per evaluation
set after all per-datapoint evaluators complete. It consumes per-datapoint
EvaluationResultDto values from a named source evaluator and emits a single
run-level EvaluationResult.

Includes three starter evaluators for multiclass classification metrics:

- PrecisionDatasetEvaluator
- RecallDatasetEvaluator
- FScoreDatasetEvaluator (configurable beta)

Each takes a required classes list (populated from the UI), supports micro
or macro averaging, and emits per-class TP/TN/FP/FN plus the confusion
matrix in details. Binary is the 2-class case — no separate binary path.

Architecture: BaseDatasetEvaluator is a parallel hierarchy to
GenericBaseEvaluator (not a subclass) so the per-datapoint dispatch loop
cannot accidentally pick up a dataset evaluator. Each dataset evaluator
declares a single source_evaluator by name; the runtime groups
per-datapoint results by evaluator name and routes the right list to each
dataset evaluator. Configs load from <eval_set>/../dataset_evaluators/*.json
mirroring the evaluators directory layout.
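
To make the micro/macro distinction concrete, here is a small standalone sketch of the per-class TP/TN/FP/FN bookkeeping and the two averaging modes for precision. The function names and the (expected, predicted) tuple input are illustrative assumptions, not the API of the dataset evaluators in this commit.

```python
from fractions import Fraction


def per_class_counts(rows, classes):
    """Return {class: (tp, tn, fp, fn)} from (expected, predicted) pairs."""
    counts = {}
    for cls in classes:
        tp = sum(1 for e, p in rows if e == cls and p == cls)
        fp = sum(1 for e, p in rows if e != cls and p == cls)
        fn = sum(1 for e, p in rows if e == cls and p != cls)
        tn = len(rows) - tp - fp - fn
        counts[cls] = (tp, tn, fp, fn)
    return counts


def precision(counts, averaging):
    """Micro pools the counts first; macro averages per-class precision."""
    if averaging == "micro":
        tp = sum(c[0] for c in counts.values())
        fp = sum(c[2] for c in counts.values())
        return Fraction(tp, tp + fp) if tp + fp else Fraction(0)
    if averaging == "macro":
        per = [
            Fraction(tp, tp + fp) if tp + fp else Fraction(0)
            for tp, _tn, fp, _fn in counts.values()
        ]
        return sum(per, Fraction(0)) / len(per)
    raise ValueError(f"unknown averaging: {averaging!r}")


# Imbalanced two-class data: 8 ham, 2 spam, one spam predicted as ham.
rows = [("ham", "ham")] * 8 + [("spam", "spam"), ("spam", "ham")]
counts = per_class_counts(rows, ["ham", "spam"])

print(precision(counts, "micro"), precision(counts, "macro"))  # 9/10 17/18
```

On imbalanced data the two modes disagree because macro gives the minority class the same weight as the majority class.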

Patch version bumped: 2.10.68 -> 2.10.69.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…10.69

examples/dataset_evaluators_demo.py walks the new dataset-level evaluators
(Precision / Recall / F-score) through five scenarios that exercise the
math end-to-end at the SDK layer:

  1. Balanced 3-class — symmetric confusion matrix, macro == micro
  2. Imbalanced 2-class — shows where macro and micro diverge
  3. Same data, four metrics (Precision, Recall, F1, F2) — proves the
     F-beta knob actually moves per-class numbers
  4. Out-of-vocab + malformed details — n_skipped surfaces, no silent drops
  5. Realistic 4-class intent classifier — uneven per-class performance

Each scenario prints the confusion matrix as a table, the per-class
TP/TN/FP/FN + the metric, and a snippet of the wire JSON that AutoMapper
will surface to the frontend.
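
For reference, the F-beta knob from scenario 3 follows the standard formula F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). The sketch below is a standalone illustration with made-up precision and recall values; it is not taken from the demo script.

```python
from fractions import Fraction


def f_beta(precision, recall, beta=1):
    """Standard F-beta score; beta > 1 weights recall more heavily."""
    b2 = Fraction(beta) ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else Fraction(0)


# Illustrative values only: low precision, perfect recall.
p, r = Fraction(1, 2), Fraction(1)

f1 = f_beta(p, r, beta=1)
f2 = f_beta(p, r, beta=2)
print(f1, f2)  # 2/3 5/6
```

With recall higher than precision, moving from F1 to F2 raises the score, which is the kind of movement scenario 3 is meant to demonstrate.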

Run::

    cd packages/uipath && uv run python examples/dataset_evaluators_demo.py

uv.lock reflects the pyproject.toml version bump (2.10.68 -> 2.10.69)
already in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ajay-kesavan

Contributor Author

Superseded by #1674 (ClassifierEvaluator). The schema/sample work here was replaced by the simpler single-evaluator approach.

…luators

# Conflicts:
#	packages/uipath/pyproject.toml
#	packages/uipath/uv.lock
@ajay-kesavan ajay-kesavan reopened this Jun 19, 2026
ajay-kesavan and others added 3 commits June 18, 2026 21:26
…figs

Pivot dataset evaluators from a separate hierarchy with source_evaluator
pointers to an embedded aggregator-spec design: each per-datapoint
classification evaluator's config carries a self-contained list of
aggregators (precision / recall / fscore), each with its own classes,
averaging, and f_value. No properties are shared up to the evaluator
level — aggregators are fully self-describing.

- Drop source_evaluator pointer from BaseDatasetEvaluatorConfig.
- Add discriminated AggregatorSpec union (precision/recall/fscore).
- Add aggregators field to Binary/Multiclass classification configs.
- Refactor build_dataset_evaluator + compute_dataset_evaluator_results
  to consume aggregator specs from per-datapoint configs directly.
- Drop EvaluationSet.dataset_evaluator_refs (no separate list).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tors

Update binary_classification_agent and multiclass_classification_simple
sample evaluator JSONs to include the new aggregators[] field. Each
aggregator carries its own classes, averaging, and (for fscore) fValue.
Update the e2e test to also assert the dataset-level results land in
UiPathEvalOutput.dataset_evaluator_results, keyed
"{evaluator_name}.{aggregator_type}".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions github-actions Bot added the test:uipath-langchain (Triggers tests in the uipath-langchain-python repository) and test:uipath-integrations labels Jun 19, 2026
ajay-kesavan and others added 3 commits June 18, 2026 21:49
- Collapse Precision/Recall/FScore into one ClassificationDatasetEvaluator
  switching on spec.type; factory becomes a one-liner.
- Inline _precision_of/_recall_of/_f_score_of and the one-use _ConfusionData
  helpers; switch _ConfusionData to @dataclass(slots=True).
- Drop dead get_evaluator_id() abstract + 3 overrides + matching
  EvaluatorType enum entries (factory dispatches on spec.type).
- Pull repeated model_config into a private _AggregatorSpecBase.
- Drop registry + impossible-case ValueError in dataset_evaluator_factory
  (pydantic discriminator catches unknown types).
- Have _coerce_justification return the typed justification object.
- Drop the _source_evaluator private/property pair on BaseDatasetEvaluator.

No behavior change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add BaseEvaluatorJustification.try_from classmethod and collapse the
  three duplicate "instance | dict | other" coercion blocks in
  classification_dataset_evaluators, binary_classification_evaluator,
  and multiclass_classification_evaluator down to one line each.
- Replace the 80-line ASCII confusion-matrix pretty-printer in
  dataset_evaluators_demo with the structured JSON wire shape — the
  thing readers actually want to inspect.
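
The "instance | dict | other" coercion pattern can be sketched as follows. This is a hypothetical stand-in built on a plain dataclass; the real BaseEvaluatorJustification.try_from is presumably pydantic-based, and its exact signature and failure behavior are not shown in this PR text.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Justification:
    """Hypothetical stand-in for a typed justification object."""

    expected: str
    predicted: str

    @classmethod
    def try_from(cls, value: Any) -> Optional["Justification"]:
        """Coerce an instance or dict; return None for anything else."""
        if isinstance(value, cls):
            return value
        if isinstance(value, dict):
            try:
                return cls(**value)
            except TypeError:
                # Missing or unexpected keys: treat as malformed details.
                return None
        return None


existing = Justification("spam", "ham")
from_dict = Justification.try_from({"expected": "spam", "predicted": "spam"})
```

Callers then reduce to a single `Justification.try_from(details)` call plus a None check, instead of repeating the three-way branch at each site.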

Deferred from this PR: dropping reduce_scores / _micro_metric /
_macro_metric on Binary/Multiclass evaluators, and the matching
metric_type/averaging/f_value config fields. The runtime calls
GenericBaseEvaluator.reduce_scores per-evaluator to compute the
top-level evaluator score; the dataset evaluator framework adds
{source}.{type}-keyed metrics in addition to that score, it doesn't
replace it. Removing them would break the existing per-evaluator
headline. Worth a follow-up that either makes reduce_scores delegate
to the dataset evaluator framework or formally splits the two paths.

No behavior change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ajay-kesavan and others added 3 commits June 18, 2026 22:27
- M2: drop _METRIC_NAME indirection. metric field on
  ClassificationDetails now uses spec.type verbatim ("fscore" not
  "f_score"), matching the discriminator on the wire.
- M3: document confusion_matrix orientation via Field(description=...).
  Matrix is [predicted_idx][expected_idx], opposite of sklearn's
  convention. Add a regression test pinning the orientation.
- M4: _metric raises ValueError on unknown metric_type instead of
  silently falling through to the F-beta formula. Defense in depth
  on top of pydantic's discriminator.
- M6: replace defensive getattr chain in
  compute_dataset_evaluator_results with isinstance narrowing on the
  classification config types. Mypy-clean; intent is now
  "classification configs declare aggregators" rather than "anything
  might have an aggregators attribute".
- L1: rename duplicate test_two_class_macro tests so pytest output
  disambiguates Precision vs Recall.
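
The orientation point in M3 is easy to get wrong, so here is a small standalone illustration. It builds a matrix indexed [predicted_idx][expected_idx] as described above and shows that its transpose is the sklearn-style layout (rows are true labels, columns are predictions). The variable names are illustrative and not taken from the PR code.

```python
classes = ["spam", "ham"]
index = {c: i for i, c in enumerate(classes)}

# (expected, predicted) pairs; the second row is a ham predicted as spam.
rows = [("spam", "spam"), ("ham", "spam"), ("ham", "ham")]

# Orientation described in M3: matrix[predicted_idx][expected_idx].
matrix = [[0] * len(classes) for _ in classes]
for expected, predicted in rows:
    matrix[index[predicted]][index[expected]] += 1

# sklearn-style orientation is the transpose: [expected_idx][predicted_idx].
sklearn_style = [list(col) for col in zip(*matrix)]

print(matrix)         # [[1, 1], [0, 1]]
print(sklearn_style)  # [[1, 0], [1, 1]]
```

The false positive shows up in the first row here but in the first column of the sklearn-style matrix, so a consumer that assumes the wrong orientation will swap false positives and false negatives.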

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- H1/H2: pydantic model_validator on Binary/Multiclass classification
  configs cross-checks aggregators against evaluator-level fields. Binary
  rejects aggregators whose `classes` doesn't include `positive_class`,
  and aggregators of the same metric type with a different `f_value`.
  Multiclass extends this with the full class-coverage check and an
  `averaging` consistency check. Without this, a user could ship configs
  where the per-evaluator headline and the dataset aggregator silently
  scored disjoint label spaces or used different averaging.
- H3: binary e2e test now asserts the precision/recall/fscore aggregator
  scores (5/6, 5/6, 0.8) instead of only the key set. A regression that
  zeros out all aggregator scores would now fail the test.
- H4: multiclass `evaluate()` no longer raises on out-of-vocab predicted
  class — it now returns score=0.0 with the OOV label preserved in the
  justification, mirroring binary's behavior. The dataset evaluator's
  confusion matrix already accounts for this via `n_skipped`.
  Configuration errors (expected_class outside vocab) still raise.
- M1: drop the `_coerce_justification` one-line wrapper; inline
  `BaseEvaluatorJustification.try_from(r.details)` at the single caller
  in `_build_confusion`.
- M2: preserve user-supplied class casing in `_ConfusionData.classes` and
  the `per_class` keys. The lowercase normalization is now only used for
  the internal lookup index, so a config with classes=["Spam","Ham"]
  surfaces "Spam"/"Ham" in the output rather than "spam"/"ham".
- M3 (multiclass `reduce_scores` + ClassificationDatasetEvaluator
  double-walking the same confusion matrix): deferred. Cleanest fix is
  to drop the evaluator-level `metric_type`/`averaging`/`f_value` fields
  and route the per-evaluator headline through the dataset evaluator
  framework — out of scope for this commit. Tracked as a follow-up.
- L1: refreshed test_classification_samples_e2e docstring to reflect
  the new aggregator-score coverage on the binary side.
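
A minimal sketch of the M2 and H4 behavior described above: lowercase is used only for the lookup index, output keys keep the configured casing, and rows that cannot be mapped are counted rather than dropped. The function name, the tuple input, and the rule that any label outside the vocabulary skips the row are assumptions for illustration; the PR's `_build_confusion` may differ.

```python
def build_confusion(classes, rows):
    """Count per-class TP/FP/FN while preserving the configured casing."""
    # Lowercase is used only for matching; output keys keep user casing.
    lookup = {c.lower(): c for c in classes}
    per_class = {c: {"tp": 0, "fp": 0, "fn": 0} for c in classes}
    n_skipped = 0
    for expected, predicted in rows:
        exp = lookup.get(expected.lower())
        pred = lookup.get(predicted.lower())
        if exp is None or pred is None:
            n_skipped += 1  # surfaced in the result, not silently dropped
            continue
        if exp == pred:
            per_class[exp]["tp"] += 1
        else:
            per_class[exp]["fn"] += 1
            per_class[pred]["fp"] += 1
    return per_class, n_skipped


per_class, n_skipped = build_confusion(
    ["Spam", "Ham"],
    [("spam", "SPAM"), ("ham", "spam"), ("ham", "phishing")],
)
print(list(per_class), n_skipped)  # ['Spam', 'Ham'] 1
```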

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ajay-kesavan

Contributor Author

@chatgpt-codex-connector — please run an adversarial code review on this PR using the methodology below. Treat the code as a rival's, not a colleague's. Start from "this change is broken" and hunt evidence. If you can't find a real bug after genuinely trying, a positive review is fine — but the search has to happen first.

Three non-negotiable promises

  1. Cross-check every invoked interface. When A calls B, open B and read its signature. Prop drift is the #1 source of silent bugs in AI-generated code.
  2. Read test bodies, not test titles. PR descriptions over-claim, commit messages over-claim, and tests get renamed to match the new implementation rather than the original behavior.
  3. Severity is factual, not emotional. "Loses user data" = 🔴 even when the author is senior. Typo = 🔵 even when it annoys you.

7-step method (skip none)

  1. Inventory the change — PR metadata, files, commits, full diff. Look for: additions ≈ deletions + "no behavior change" claim (behavior probably changed); commit titled "address review" (open the file it claims to forward into; verify the prop actually lands); mergeable=CONFLICTING (surface it).
  2. Read the full diff top-to-bottom in one pass. Bugs live at seams between unrelated files in the same diff.
  3. Cross-check every consumer of every new prop / new component / new DI seam. Grep call sites; confirm every prop both exists on the receiver's type AND is used (not silently dropped in one branch of an if).
  4. Trace data flow end-to-end. user input → onChange → state → validation → persist → reload. Flag silent drops, empty-string coercion to {}/null, stale-closure useState(propDefault) that never updates, missing memoization on every-render stringification.
  5. Diff the deleted code against what replaced it. Every - line is an unverified claim. Verify: vanished behavioral features (truncation, maxLines, special cases, fallbacks), edge-case handling (empty string, null, arrays), a11y attrs (id, htmlFor, aria-*), layout-critical CSS classes.
  6. Sanity-check the tests. Read bodies. Ask "does this verify NEW BEHAVIOR or just the NEW IMPLEMENTATION?" Renames like "renders X" → "renders Y" often launder a regression. Any DI seam MUST have a test that exercises the injected path; fallback-only tests prove nothing.
  7. Classify with severity tiers and post inline anchored comments.

Severity tiers

  • 🔴 Critical — blocking — data loss, crash, security, broken contract, a11y regression
  • 🟠 High — should fix before merge — structural regression, missing coverage for new code path, error-handling gap
  • 🟡 Medium — quality nit — smell, redundancy, naming, small perf with evidence
  • 🔵 Low — observation — style, docs, process

Decision rule: if the PR cannot merge without introducing the bug you describe → 🔴. Merges but degrades UX → 🟠. Below that → 🟡 / 🔵.

Output format

Per finding: file:line citation + concrete fix. State issues as facts ("This loses user input" — not "may potentially"). Criticism first, positives last. Drop softening qualifiers (no "might", "perhaps", "consider"). Adversarial ≠ abusive: attack the code, name the pattern, never the person.

Red flags — stop and re-read

  • "This looks fine" → you haven't cross-checked invoked interfaces
  • "Tests pass so it's good" → tests assert the new impl, not the old behavior
  • "Description says no behavior change" → diff the deleted lines
  • "Small PR, quick review" → small PRs hide big bugs in prop drops
  • "AI-generated, probably clean" → AI code is exactly where prop drift hides

This PR's specific surface area

  • aggregators[] embedded in BinaryClassificationEvaluatorConfig and MulticlassClassificationEvaluatorConfig
  • New model_validators cross-check positive_class / classes containment AND averaging / f_value divergence between evaluator-level and aggregator-level fields when types match
  • reduce_scores retained on Binary/Multiclass for the per-evaluator headline — the dataset evaluator framework runs additively; verify the two paths agree on identical inputs
  • BaseEvaluatorJustification.try_from classmethod collapsed 3 duplicate coercion blocks
  • Sample JSONs (binary_classification_agent/, multiclass_classification_simple/) carry new aggregators[] arrays
  • E2E test asserts specific aggregated scores (binary: P=R=5/6, F1=0.8). Read the multiclass assertions too — they assert specific fscore numbers

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 027901c96b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

ajay-kesavan and others added 3 commits June 18, 2026 23:05
- Bump uipath version 2.11.5 -> 2.11.6 (2.11.5 already on PyPI).
- Widen examples/dataset_evaluators_demo.py:report() to accept the full
  EvaluationResult union and narrow once inside with isinstance, fixing
  6 mypy "expected NumericEvaluationResult" errors at the call sites.
- Address Codex P1 (runtime.py:268 — result-key collision): two
  aggregators of the same type on the same source (e.g. macro+micro
  precision) previously produced identical {source}.{type} keys, with
  the second silently overwriting the first.
  compute_dataset_evaluator_results now counts type occurrences per
  source and disambiguates
  duplicate-type aggregators as {source}.{type}.{averaging} (plus
  ".fb{f_value}" for fscore variants), preserving the simple key shape
  for the common single-aggregator case. Docstring updated; 2 new
  tests cover both the precision-duplicate and fscore-duplicate paths.
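
The key-disambiguation rule described above can be sketched as a small standalone function. The dict-based spec shape and the function name are assumptions for illustration; the PR's `_dataset_result_key` works on the typed aggregator specs, and its exact `f_value` formatting is not shown here.

```python
from collections import Counter


def dataset_result_keys(source, specs):
    """Build one result key per aggregator spec, disambiguating duplicates."""
    type_counts = Counter(spec["type"] for spec in specs)
    keys = []
    for spec in specs:
        key = f"{source}.{spec['type']}"
        # Only duplicate-type aggregators get the longer key shape.
        if type_counts[spec["type"]] > 1:
            key += f".{spec['averaging']}"
            if spec["type"] == "fscore":
                key += f".fb{spec['f_value']}"
        keys.append(key)
    return keys


specs = [
    {"type": "precision", "averaging": "macro"},
    {"type": "precision", "averaging": "micro"},
    {"type": "recall", "averaging": "macro"},
]
keys = dataset_result_keys("clf", specs)
print(keys)  # ['clf.precision.macro', 'clf.precision.micro', 'clf.recall']
```

The single recall aggregator keeps the simple `{source}.{type}` key, so existing consumers of the common single-aggregator case see no change.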

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…alidator

The fscore-duplicate disambiguation test added in 4d6afcc conflicts
with the H2 model_validator on #1663, which cross-checks aggregator
f_value against the evaluator's f_value when types match. The
precision-duplicate test still exercises the new
_dataset_result_key path; the FScore branch is exercised by the
factory + math tests.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Regenerate BinaryClassificationEvaluator.json and
MulticlassClassificationEvaluator.json from the updated pydantic models
so schema-driven consumers can discover and validate the new
evaluatorConfig.aggregators array + Precision/Recall/FScore variants.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@github-actions


🚨 Heads up: uipath-integrations cross-tests are FAILING 🚨

Your changes may break one or more integrations in uipath-integrations-python:

  • uipath-openai-agents
  • uipath-google-adk
  • uipath-agent-framework
  • uipath-llamaindex
  • uipath-pydantic-ai

⚠️ These checks are NOT enforced by branch protection rules. Please review the failures before merging.

🔍 Inspect the failed run →

@github-actions


🚨 Heads up: uipath-langchain cross-tests are FAILING 🚨

Your changes may break the uipath-langchain-python integration.

⚠️ These checks are NOT enforced by branch protection rules. Please review the failures before merging.

🔍 Inspect the failed run →
