feat(eval): id-aware match with sanitised-name fallback (Option C, layer 1/3) #1736
Merged
… evaluators
Option C of the eval matching architecture:
- _match_key / _calls_match prefer id-equality when both sides carry an id
- name fallback normalises both sides through the LangChain sanitiser so an
editor-persisted display name ("Web Search") matches a runtime span whose
tool.name is the sanitised form ("Web_Search")
- tool_calls_count_score's direct dict lookup also tries the sanitised key
on miss so the count path matches the same semantics
Backward compatibility:
- Old eval-sets keyed by display name match via the sanitised name path
- New eval-sets keyed by canvas node id match via the id path
- No data migration required
Test coverage:
- Pinned reference sanitiser matching uipath_langchain's algorithm
- _match_key id-wins-first + name-fallback paths
- _calls_match for ToolCall + ToolOutput
- count_score display-name + id-keyed cases
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pull request overview
This PR updates the SDK-side evaluator matching logic to prefer stable tool id equality when available, and otherwise fall back to LangChain-style sanitised-name equality so display-name-keyed eval sets continue to match runtime spans whose tool.name is sanitised.
Changes:
- Added `_normalize_tool_name`, implementing a pinned LangChain sanitiser (whitespace → `_`, strip non-`[A-Za-z0-9_-]`, truncate to 64).
- Updated `_match_key` and `_calls_match` to use id-first matching with sanitised-name fallback.
- Updated `tool_calls_count_score` to attempt raw expected keys first, then their sanitised form on lookup miss.
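The three sanitiser steps listed above can be sketched as follows. This is a minimal illustration built only from the description in this PR (whitespace → `_`, strip characters outside `[A-Za-z0-9_-]`, truncate to 64); the function name `sanitise_tool_name_sketch` and the constant are hypothetical, and the actual helper in `evaluators_helpers.py` may differ in detail.

```python
import re

# Cap stated in the PR description; the constant name is hypothetical.
MAX_TOOL_NAME_LENGTH = 64


def sanitise_tool_name_sketch(name: str) -> str:
    """Illustrative sanitiser: not the SDK's actual implementation."""
    # Split on whitespace and rejoin with underscores. As a side effect,
    # leading/trailing whitespace is dropped and whitespace runs collapse.
    joined = "_".join(name.split())
    # Remove every character outside [A-Za-z0-9_-].
    cleaned = re.sub(r"[^A-Za-z0-9_-]", "", joined)
    # Truncate to the maximum length.
    return cleaned[:MAX_TOOL_NAME_LENGTH]


if __name__ == "__main__":
    print(sanitise_tool_name_sketch("Web Search"))  # Web_Search
```

Under these assumptions the transform is idempotent: an already-sanitised name such as `"Web_Search"` passes through unchanged, which is what lets both sides of a comparison be normalised the same way.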
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| packages/uipath/src/uipath/eval/_helpers/evaluators_helpers.py | Implements sanitised-name normalisation and applies it across id-aware matching and count scoring. |
| packages/uipath/tests/evaluators/test_evaluator_helpers.py | Adds targeted tests that pin the sanitiser behavior and validate display-vs-sanitised fallback plus id-first semantics. |
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Chibionos
approved these changes
Jun 19, 2026
- count_score: split the two-tier .get() into an `is None` short-circuit so the sanitiser regex only runs when the raw key is absent (P2). `or` would have been wrong: a real count of 0 is a hit, not a fallback trigger.
- TestSanitizedNameMatch: hoist nine in-method imports of `_match_key`/`_calls_match`/`_normalize_tool_name` to the module-level import block (shrink).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
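The difference between the `is None` short-circuit and an `or` chain is easiest to see side by side. The sketch below is an assumption-laden illustration: `lookup_count`, `lookup_count_with_or`, and `_sanitise` are hypothetical names, and the real `tool_calls_count_score` is not reproduced here.

```python
import re
from typing import Dict, Optional


def _sanitise(name: str) -> str:
    # Hypothetical stand-in for the sanitiser described in this PR.
    return re.sub(r"[^A-Za-z0-9_-]", "", "_".join(name.split()))[:64]


def lookup_count(actual: Dict[str, int], expected_key: str) -> Optional[int]:
    """Raw key first; sanitised key only when the raw key is absent."""
    count = actual.get(expected_key)
    if count is None:
        # The regex runs only on a genuine miss.
        count = actual.get(_sanitise(expected_key))
    return count


def lookup_count_with_or(actual: Dict[str, int], expected_key: str) -> Optional[int]:
    """Shown for contrast: `or` treats a stored 0 as a miss."""
    return actual.get(expected_key) or actual.get(_sanitise(expected_key))


if __name__ == "__main__":
    counts = {"Web Search": 0, "Web_Search": 3}
    print(lookup_count(counts, "Web Search"))          # 0
    print(lookup_count_with_or(counts, "Web Search"))  # 3
```

With `or`, the falsy value `0` falls through to the sanitised lookup and a different bucket's count is returned; the `is None` check keeps the stored zero.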
The sibling `uipath.eval.mocks._mock_context._normalize_tool_name` does the OPPOSITE transform (`"my_tool"` -> `"my tool"`, snake-case -> words). Keeping the LangChain-mirroring sanitiser under the same name in a neighbouring module is a foot-gun for any future reader who grabs the wrong import. Rename to match the upstream `sanitize_tool_name` and drop the collision.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per repo convention: comments carry one short line for non-obvious constraints; reasoning lives in commit / PR descriptions.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
3840732 to 039b3f5
When an actual ToolCall carries an id, the matcher uses id-only mode and does not fall back to sanitised-name. When it has no id, sanitised-name is the only path. This makes the matching behaviour symmetric with the user's mental model: id-vs-id when ids exist, name-vs-name when they don't, never the two crossed.
- _match_key: drop the name fallback when actual_id is present.
- _calls_match: when actual.id is set, compare against expected.id (or expected.name when picker stored the id under that field). No sanitised-name path while id is in play.
- count_tool_calls_by_name_and_id: bucket each call under one key: id when present, name otherwise. The dict no longer mixes kinds, so a later lookup can't cross-match a display name against an id bucket.
Trade-off: legacy name-keyed eval-sets stop matching against post-Layer-5 id-bearing spans. Acceptable since the picker now always stores the id when one is available, and legacy authors can re-pick to upgrade.
Tests updated to reflect the new contract.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
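A rough sketch of the stricter contract described in this commit, under stated assumptions: `Call` is a hypothetical stand-in for the SDK's `ToolCall`, `calls_match_strict` and `count_calls` are illustrative names rather than the real `_calls_match` and `count_tool_calls_by_name_and_id`, and bucketing id-less calls under their raw name is a guess at the detail.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Call:
    """Hypothetical stand-in for the SDK's ToolCall."""
    name: str
    id: Optional[str] = None


def _sanitise(name: str) -> str:
    # Hypothetical stand-in for the sanitiser described in this PR.
    return re.sub(r"[^A-Za-z0-9_-]", "", "_".join(name.split()))[:64]


def calls_match_strict(expected: Call, actual: Call) -> bool:
    if actual.id:
        # Id-only mode: the expected id may live in `id`, or in `name`
        # when the picker stored it there. No name fallback.
        return actual.id in (expected.id, expected.name)
    # No actual id: sanitised-name comparison is the only path.
    return _sanitise(expected.name) == _sanitise(actual.name)


def count_calls(calls: Iterable[Call]) -> Counter:
    # One bucket per call: id when present, name otherwise.
    return Counter(c.id if c.id else c.name for c in calls)


if __name__ == "__main__":
    spans = [Call("Web_Search", id="node-1"),
             Call("Web_Search", id="node-1"),
             Call("Calculator")]
    print(count_calls(spans))
```

Note how a legacy expectation such as `Call("Web Search")` no longer matches an id-bearing span in this sketch, which mirrors the trade-off the commit message accepts.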



Summary
SDK layer of the Option C eval-matching architecture. The matcher prefers id-equality when both sides carry an id; otherwise falls back to sanitised-name equality so display-name-keyed eval-sets (the current default) keep working while new id-keyed eval-sets become rename-safe.
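As a minimal sketch of the two match modes this summary describes (id-equality when both sides carry an id, sanitised-name equality otherwise): the function `matches_layer1` and its parameters are hypothetical and only illustrate the rule as stated here, not the actual `_match_key` / `_calls_match` code.

```python
import re
from typing import Optional


def _sanitise(name: str) -> str:
    # Hypothetical stand-in for the sanitiser described in this PR.
    return re.sub(r"[^A-Za-z0-9_-]", "", "_".join(name.split()))[:64]


def matches_layer1(
    expected_name: str,
    expected_id: Optional[str],
    actual_name: str,
    actual_id: Optional[str],
) -> bool:
    """Id wins when both sides have one; otherwise compare sanitised names."""
    if expected_id and actual_id:
        return expected_id == actual_id
    return _sanitise(expected_name) == _sanitise(actual_name)


if __name__ == "__main__":
    # Display-name-keyed eval-set vs sanitised runtime span name.
    print(matches_layer1("Web Search", None, "Web_Search", None))   # True
    # Id-keyed eval-set survives a rename of the display name.
    print(matches_layer1("Renamed Tool", "n1", "Web_Search", "n1"))  # True
```

The second call shows why id-keyed eval-sets are rename-safe under this rule: once both ids are present, the names are never consulted.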
Changes
- `_match_key` / `_calls_match`: id wins first, then sanitised-name fallback
- `tool_calls_count_score`: try raw key first, then sanitised key on miss
- `_normalize_tool_name`: pinned reference implementation of the LangChain sanitiser algorithm (split-on-whitespace → strip non-`[A-Za-z0-9_-]` → cap at 64 chars)
Why
Closes the display-vs-sanitised gap (eval-set "Web Search" vs span "Web_Search"). When the producer side later starts emitting `tool.id` on the span (companion PRs in `uipath-agents-python` + `flow-workbench`), this same matcher uses the id-equality path automatically: same code, two backward-compatible match modes.
Companions
`tool.metadata`, instrumentor emits `tool.id`
Tests
- `pytest tests/evaluators/test_evaluator_helpers.py::TestSanitizedNameMatch`: 19 passed
- `pytest tests/cli/eval/ tests/evaluators/`: 889 passed
- `ruff check / format / mypy`: clean
🤖 Generated with Claude Code