feat(observability): add auth.verify, channel.<adapter>.deliver, schedule.fire spans (closes #187) #200

Merged: initializ-mk merged 1 commit into main from feat/issue-187-runtime-spans on Jun 26, 2026

@initializ-mk

Contributor

Summary

Adds spans for three runtime surfaces that own user-perceived latency or causality but previously did not appear in traces. All three land in one PR, per the issue.

| Span | What it covers | Headline payoff |
| --- | --- | --- |
| auth.verify | wraps Provider.Chain.Verify in forge-core/auth/middleware.go | provider HTTP calls (JWKS / STS / IAP / Graph) stop showing as orphan roots; total auth latency is visible |
| channel.<adapter>.deliver | wraps the per-message handler in Slack / Telegram / Teams; the router's internal A2A POST now injects W3C traceparent | "Slack→agent latency" is answerable from one trace |
| schedule.fire | wraps Scheduler.fire (file backend) | scheduled-job downstream work has a parent instead of orphan agent.execute |

All three use the existing global Tracer() — no new tracer install. When tracing is off the no-op tracer makes them zero-allocation. Status=Error on the failure path keeps error-rate dashboards uniform.

Attributes

auth.verify             forge.auth.provider, .token_kind, .decision,
                        .user_id, .org_id, .fail_reason
channel.X.deliver       forge.channel.adapter, .target, .message_id, .user_id
schedule.fire           forge.schedule.id, .cron, .source (yaml|llm)

Cross-cutting

  • Lifted authFailReason from forge-cli/runtime into the exported auth.FailReason(err) helper so the span's forge.auth.fail_reason attribute and the audit auth_fail.reason field share one vocabulary. Single source of truth.
  • New channels.StartDeliverSpan(ctx, adapter, event) (ctx, span, finish func(*error)) helper consumed identically by the three adapters so the span shape stays consistent.
  • Span tests use the upstream tracetest.SpanRecorder swap-the-global-provider pattern; cleanup restores the no-op default.

Out of scope (deferred follow-ups)

  • K8s-backend schedule.fire — the trigger Pod is a separate curl-based Pod, so propagation needs traceparent injected into the rendered CronJob YAML at forge package time. Tracked as a follow-up.
  • Egress allow/block span — decision lives inside the existing http.client span already; a child span would duplicate. If the decision needs visibility, a forge.egress.decision attribute on http.client is the right move.
  • MCP server startup span — one-shot at startup; the mcp_server_started audit suffices.
  • Memory read/write spans — in-process, microseconds; not worth the span-count multiplication without a specific perf question.

Test plan

  • golangci-lint run across all four modules — 0 issues
  • gofmt -w across all modules
  • go test ./... in forge-core/, forge-cli/, forge-plugins/ — all green
  • New unit tests pin:
    • Auth: success records provider / token_kind / decision / user_id / org_id; failure sets decision=fail + fail_reason from the FailReason() vocabulary + Status=Error (5 sub-cases: rejected / invalid / provider_unavailable / not_for_me / infrastructure); missing-bearer path opens a span with missing_token; provider's outbound HTTP calls inherit auth.verify as parent (the issue's motivating use case).
    • Channel: adapter / target / message_id / user_id attributes stamped; error path sets Status=Error; span name is channel.<adapter>.deliver for each of slack / telegram / msteams; returned ctx carries the active span (drives traceparent injection).
    • Schedule: id / cron / source attributes stamped; dispatch ctx is a child of schedule.fire; error path sets Status=Error; yaml vs llm source surfaces correctly.
  • Manual smoke: enable tracing, hit an authenticated tasks/send and confirm the auth.verify span parents the provider's outbound HTTP call.

…dule.fire spans (closes #187)

Three runtime surfaces previously invisible in traces:

auth.verify wraps Provider.Chain.Verify in forge-core/auth/middleware.go.
Pre-187 the provider's outbound HTTP calls (JWKS / STS / IAP / Graph)
appeared as orphan root spans with no "why was this called" context,
and total auth latency had no measurement. The new span parents the
provider's http.client spans and carries forge.auth.provider /
.token_kind / .decision / .user_id / .org_id / .fail_reason attributes
that mirror the audit auth_verify / auth_fail event fields exactly.

channel.<adapter>.deliver wraps the per-message handler in every
channel adapter (Slack / Telegram / Teams) via the new
channels.StartDeliverSpan helper. The internal A2A POST in
forge-cli/channels/router.go now injects the W3C traceparent via the
global propagator, so the downstream a2a.tasks/send span nests under
channel.<adapter>.deliver in the flame graph. Highest user-visible
payoff of the three — "Slack→agent latency" is now answerable from
one trace.

schedule.fire wraps Scheduler.fire in forge-core/scheduler/scheduler.go
around the dispatch. Attributes: forge.schedule.id / cron / source
(yaml vs llm). File-backend only for v1 — the K8s backend's trigger Pod
is a separate curl-based Pod and would need traceparent injected into
the rendered CronJob YAML at forge package time (deferred).

All three use the existing global Tracer() from forge-core/runtime —
no new tracer install. When tracing is off the no-op tracer makes them
zero-allocation. Status=Error on the failure path keeps error-rate
dashboards uniform across the existing span types.

Cross-cutting hygiene: lifted authFailReason from forge-cli/runtime
into the exported auth.FailReason(err) helper so the span attribute
and the audit field share one reason vocabulary, with the cli call
site now delegating.

Pinned by TestAuthVerifySpan_{SuccessRecordsProviderTokenKindDecision,
FailureSetsErrorStatusAndFailReason,
MissingBearerOpensZeroDurationSpan, ParentsProviderHTTPClientSpans},
TestStartDeliverSpan_{StampsAdapterAndEventAttributes,
ErrorSetsStatus, AdapterNameDrivesSpanName,
ChildContextCarriesActiveSpan, NilEventDoesNotCrash},
TestScheduleFireSpan_{StampsAttributesAndParentsDispatch,
ErrorSetsStatusError, SourceSurfacesLLMOriginatedSchedules}.
@initializ-mk merged commit d452224 into main on Jun 26, 2026
10 checks passed