feat(observability): add auth.verify, channel.<adapter>.deliver, schedule.fire spans (closes #187)#200
Merged
Merged
Conversation
…dule.fire spans (closes #187) Three runtime surfaces previously invisible in traces: auth.verify wraps Provider.Chain.Verify in forge-core/auth/middleware.go. Pre-187 the provider's outbound HTTP calls (JWKS / STS / IAP / Graph) appeared as orphan root spans with no "why was this called" context, and total auth latency had no measurement. The new span parents the provider's http.client spans and carries forge.auth.provider / .token_kind / .decision / .user_id / .org_id / .fail_reason attributes that mirror the audit auth_verify / auth_fail event fields exactly. channel.<adapter>.deliver wraps the per-message handler in every channel adapter (Slack / Telegram / Teams) via the new channels.StartDeliverSpan helper. The internal A2A POST in forge-cli/channels/router.go now injects the W3C traceparent via the global propagator, so the downstream a2a.tasks/send span nests under channel.<adapter>.deliver in the flame graph. Highest user-visible payoff of the three — "Slack→agent latency" is now answerable from one trace. schedule.fire wraps Scheduler.fire in forge-core/scheduler/scheduler.go around the dispatch. Attributes: forge.schedule.id / cron / source (yaml vs llm). File-backend only for v1 — the K8s backend's trigger Pod is a separate curl-based Pod and would need traceparent injected into the rendered CronJob YAML at forge package time (deferred). All three use the existing global Tracer() from forge-core/runtime — no new tracer install. When tracing is off the no-op tracer makes them zero-allocation. Status=Error on the failure path keeps error-rate dashboards uniform across the existing span types. Cross-cutting hygiene: lifted authFailReason from forge-cli/runtime into the exported auth.FailReason(err) helper so the span attribute and the audit field share one reason vocabulary, with the cli call site now delegating. Pinned by TestAuthVerifySpan_{SuccessRecordsProviderTokenKindDecision, FailureSetsErrorStatusAndFailReason, MissingBearerOpensZeroDurationSpan, ParentsProviderHTTPClientSpans}, TestStartDeliverSpan_{StampsAdapterAndEventAttributes, ErrorSetsStatus, AdapterNameDrivesSpanName, ChildContextCarriesActiveSpan, NilEventDoesNotCrash}, TestScheduleFireSpan_{StampsAttributesAndParentsDispatch, ErrorSetsStatusError, SourceSurfacesLLMOriginatedSchedules}.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Three runtime surfaces that own user-perceived latency or causality but didn't show up in traces before. All three landed under one PR per the issue.
auth.verifyProvider.Chain.Verifyinforge-core/auth/middleware.gochannel.<adapter>.delivertraceparentschedule.fireScheduler.fire(file backend)agent.executeAll three use the existing global
Tracer()— no new tracer install. When tracing is off the no-op tracer makes them zero-allocation. Status=Error on the failure path keeps error-rate dashboards uniform.Attributes
Cross-cutting
authFailReasonfromforge-cli/runtimeinto the exportedauth.FailReason(err)helper so the span'sforge.auth.fail_reasonattribute and the auditauth_fail.reasonfield share one vocabulary. Single source of truth.channels.StartDeliverSpan(ctx, adapter, event) (ctx, span, finish func(*error))helper consumed identically by the three adapters so the span shape stays consistent.tracetest.SpanRecorderswap-the-global-provider pattern; cleanup restores the no-op default.Out of scope (deferred follow-ups)
schedule.fire— the trigger Pod is a separate curl-based Pod, so propagation needstraceparentinjected into the rendered CronJob YAML atforge packagetime. Tracked as a follow-up.http.clientspan already; a child span would duplicate. If the decision needs visibility, aforge.egress.decisionattribute onhttp.clientis the right move.mcp_server_startedaudit suffices.Test plan
golangci-lint runacross all four modules — 0 issuesgofmt -wacross all modulesgo test ./...inforge-core/,forge-cli/,forge-plugins/— all greenchannel.<adapter>.deliverfor each of slack / telegram / msteams; returned ctx carries the active span (drives traceparent injection).yamlvsllmsource surfaces correctly.tasks/sendand confirm the auth.verify span parents the provider's outbound HTTP call.