feat(eval): id-aware match with sanitised-name fallback (Option C, layer 1/3) #1736

Merged
ajay-kesavan merged 8 commits into main from ajay/eval-option-c-sdk
Jun 20, 2026

@ajay-kesavan

Contributor

Summary

SDK layer of the Option C eval-matching architecture. The matcher prefers id-equality when both sides carry an id; otherwise falls back to sanitised-name equality so display-name-keyed eval-sets (the current default) keep working while new id-keyed eval-sets become rename-safe.

Changes

  • _match_key / _calls_match — id wins first, then sanitised name fallback
  • tool_calls_count_score — try raw key first, then sanitised key on miss
  • _normalize_tool_name — pinned reference implementation of the LangChain sanitiser algorithm (split-on-whitespace → strip non [A-Za-z0-9_-] → cap at 64 chars)
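
The three sanitiser steps listed above can be sketched roughly as follows. This is a minimal illustration, not the SDK's code: the function name `sanitize_tool_name_sketch` and the constant `MAX_TOOL_NAME_LEN` are hypothetical, and rejoining the whitespace-split parts with `_` is assumed from the "Web Search" vs "Web_Search" example in this PR.

```python
import re

MAX_TOOL_NAME_LEN = 64  # cap taken from the PR description; the constant name is hypothetical


def sanitize_tool_name_sketch(name: str) -> str:
    """Illustrative sanitiser: whitespace -> "_", strip disallowed chars, truncate."""
    # Split on any run of whitespace and rejoin with underscores.
    joined = "_".join(name.split())
    # Drop every character outside [A-Za-z0-9_-].
    cleaned = re.sub(r"[^A-Za-z0-9_-]", "", joined)
    # Cap the result at 64 characters.
    return cleaned[:MAX_TOOL_NAME_LEN]


print(sanitize_tool_name_sketch("Web Search"))
```

Under these assumptions a display name and its runtime form normalise to the same string, which is what lets the name fallback compare them.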

Why

Closes the display-vs-sanitised gap (eval-set "Web Search" vs span "Web_Search"). When the producer side later starts emitting tool.id on the span (companion PRs in uipath-agents-python + flow-workbench), this same matcher uses the id-equality path automatically — same code, two backward-compatible match modes.

Companions

  • uipath-agents-python: inject canvas node id into tool.metadata, instrumentor emits tool.id
  • flow-workbench: picker captures + stores canvas node id

Tests

  • pytest tests/evaluators/test_evaluator_helpers.py::TestSanitizedNameMatch — 19 passed
  • pytest tests/cli/eval/ tests/evaluators/ — 889 passed
  • ruff check / format / mypy — clean

🤖 Generated with Claude Code

… evaluators

Option C of the eval matching architecture:
- _match_key / _calls_match prefer id-equality when both sides carry an id
- name fallback normalises both sides through the LangChain sanitiser so an
  editor-persisted display name ("Web Search") matches a runtime span whose
  tool.name is the sanitised form ("Web_Search")
- tool_calls_count_score's direct dict lookup also tries the sanitised key
  on miss so the count path matches the same semantics
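
A rough sketch of the id-first, name-fallback contract described in this commit message. The `Call` dataclass, `_sanitise`, and `calls_match_sketch` are hypothetical stand-ins for illustration and do not reproduce the SDK's `ToolCall`, `_match_key`, or `_calls_match`. A later commit in this PR tightens the behaviour when the actual call carries an id.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Call:
    """Hypothetical stand-in for an expected or actual tool call."""
    name: str
    id: Optional[str] = None


def _sanitise(name: str) -> str:
    # Assumed sanitiser: whitespace -> "_", strip non [A-Za-z0-9_-], cap at 64.
    return re.sub(r"[^A-Za-z0-9_-]", "", "_".join(name.split()))[:64]


def calls_match_sketch(expected: Call, actual: Call) -> bool:
    # id path: used only when both sides carry an id.
    if expected.id is not None and actual.id is not None:
        return expected.id == actual.id
    # name path: compare the sanitised forms of both names.
    return _sanitise(expected.name) == _sanitise(actual.name)


# Display name vs sanitised span name: matches through the name path.
print(calls_match_sketch(Call("Web Search"), Call("Web_Search")))
# Same id after a rename: matches through the id path.
print(calls_match_sketch(Call("Old Name", id="n1"), Call("New_Name", id="n1")))
```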

Backward compatibility:
- Old eval-sets keyed by display name match via the sanitised name path
- New eval-sets keyed by canvas node id match via the id path
- No data migration required

Test coverage:
- Pinned reference sanitiser matching uipath_langchain's algorithm
- _match_key id-wins-first + name-fallback paths
- _calls_match for ToolCall + ToolOutput
- count_score display-name + id-keyed cases

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 19, 2026 17:20
@github-actions github-actions Bot added the test:uipath-langchain and test:uipath-integrations labels Jun 19, 2026

Copilot AI left a comment

Contributor

Pull request overview

This PR updates the SDK-side evaluator matching logic to prefer stable tool id equality when available, and otherwise fall back to LangChain-style sanitised-name equality so display-name-keyed eval sets continue to match runtime spans whose tool.name is sanitised.

Changes:

  • Added _normalize_tool_name implementing a pinned LangChain sanitiser (whitespace → _, strip non [A-Za-z0-9_-], truncate to 64).
  • Updated _match_key and _calls_match to use id-first matching with sanitised-name fallback.
  • Updated tool_calls_count_score to attempt raw expected keys first, then their sanitised form on lookup miss.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| packages/uipath/src/uipath/eval/_helpers/evaluators_helpers.py | Implements sanitised-name normalisation and applies it across id-aware matching and count scoring. |
| packages/uipath/tests/evaluators/test_evaluator_helpers.py | Adds targeted tests that pin the sanitiser behavior and validate display-vs-sanitised fallback plus id-first semantics. |

@ajay-kesavan ajay-kesavan requested a review from Chibionos June 19, 2026 18:42
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ajay-kesavan and others added 4 commits June 19, 2026 14:11
- count_score: split the two-tier .get() into `is None` short-circuit so
  the sanitiser regex only runs when the raw key is absent (P2). `or`
  would have been wrong: a real count of 0 is a hit, not a fallback
  trigger.
- TestSanitizedNameMatch: hoist nine in-method imports of
  `_match_key`/`_calls_match`/`_normalize_tool_name` to the module-level
  import block (shrink).
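
The `is None` point in the count_score item above can be illustrated with a small sketch. The helper names and the dictionary contents are made up for the example; only the contrast between `is None` and `or` is taken from the commit message.

```python
import re
from typing import Dict, Optional


def _sanitise(name: str) -> str:
    # Assumed sanitiser: whitespace -> "_", strip non [A-Za-z0-9_-], cap at 64.
    return re.sub(r"[^A-Za-z0-9_-]", "", "_".join(name.split()))[:64]


def lookup_count_sketch(counts: Dict[str, int], key: str) -> Optional[int]:
    count = counts.get(key)
    if count is None:
        # Only a genuine miss pays for the regex and the second lookup.
        count = counts.get(_sanitise(key))
    return count


def lookup_count_with_or(counts: Dict[str, int], key: str) -> Optional[int]:
    # Buggy variant: a stored count of 0 is falsy, so it falls through.
    return counts.get(key) or counts.get(_sanitise(key))


counts = {"Web Search": 0, "Web_Search": 3}  # contrived data for the contrast
print(lookup_count_sketch(counts, "Web Search"))
print(lookup_count_with_or(counts, "Web Search"))
```

With this contrived dictionary the first lookup returns the stored 0, while the `or` variant skips it and returns the count from the sanitised key.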

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The sibling `uipath.eval.mocks._mock_context._normalize_tool_name` does
the OPPOSITE transform (`"my_tool"` -> `"my tool"`, snake-case -> words).
Keeping the LangChain-mirroring sanitiser under the same name in a
neighbouring module is a foot-gun for any future reader who grabs the
wrong import. Rename to match the upstream `sanitize_tool_name` and
drop the collision.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per repo convention: comments carry one short line for non-obvious
constraints; reasoning lives in commit / PR descriptions.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ajay-kesavan ajay-kesavan force-pushed the ajay/eval-option-c-sdk branch from 3840732 to 039b3f5 Compare June 19, 2026 21:13
When an actual ToolCall carries an id, the matcher uses id-only mode and
does not fall back to sanitised-name. When it has no id, sanitised-name is
the only path. This makes the matching behaviour symmetric with the user's
mental model: id-vs-id when ids exist, name-vs-name when they don't, never
the two crossed.

- _match_key: drop the name fallback when actual_id is present.
- _calls_match: when actual.id is set, compare against expected.id (or
  expected.name when picker stored the id under that field). No sanitised-
  name path while id is in play.
- count_tool_calls_by_name_and_id: bucket each call under one key — id
  when present, name otherwise. The dict no longer mixes kinds, so a
  later lookup can't cross-match a display name against an id bucket.
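
One possible reading of the stricter contract above, as a sketch. `Call`, `count_calls_sketch`, and `calls_match_strict_sketch` are hypothetical names; in particular, treating `expected.name` as the id only when `expected.id` is absent is an assumption about what "stored the id under that field" means, and whether bucketed names are sanitised is not stated in the commit, so raw names are used here.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Call:
    """Hypothetical stand-in for an expected or actual tool call."""
    name: str
    id: Optional[str] = None


def _sanitise(name: str) -> str:
    # Assumed sanitiser: whitespace -> "_", strip non [A-Za-z0-9_-], cap at 64.
    return re.sub(r"[^A-Za-z0-9_-]", "", "_".join(name.split()))[:64]


def count_calls_sketch(calls: Iterable[Call]) -> Counter:
    # Each call lands in exactly one bucket: its id if present, else its name.
    return Counter(c.id if c.id is not None else c.name for c in calls)


def calls_match_strict_sketch(expected: Call, actual: Call) -> bool:
    if actual.id is not None:
        # id-only mode: no sanitised-name fallback while an id is in play.
        expected_key = expected.id if expected.id is not None else expected.name
        return actual.id == expected_key
    # No id on the actual call: sanitised-name comparison is the only path.
    return _sanitise(expected.name) == _sanitise(actual.name)


calls = [Call("Web_Search", id="n1"), Call("Web_Search", id="n1"), Call("Calculator")]
print(count_calls_sketch(calls))
```

In this reading, an eval-set entry keyed by the display name "Web Search" no longer matches a span that carries id "n1", which is the trade-off the commit message goes on to describe.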

Trade-off: legacy name-keyed eval-sets stop matching against post-Layer-5
id-bearing spans. Acceptable since the picker now always stores the id
when one is available, and legacy authors can re-pick to upgrade.

Tests updated to reflect the new contract.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@sonarqubecloud


@UiPath UiPath deleted a comment from github-actions Bot Jun 20, 2026
@ajay-kesavan ajay-kesavan merged commit 6d94685 into main Jun 20, 2026
251 of 255 checks passed
@ajay-kesavan ajay-kesavan deleted the ajay/eval-option-c-sdk branch June 20, 2026 02:21

Labels

test:uipath-integrations test:uipath-langchain Triggers tests in the uipath-langchain-python repository

3 participants