Skip to content

Add EdgeZero canary routing (rollout_pct flag + FNV-1a bucket assignment) (PR 19)#727

Merged
prk-Jr merged 446 commits into
mainfrom
feature/edgezero-pr19-cutover-canary
Jun 29, 2026
Merged

Add EdgeZero canary routing (rollout_pct flag + FNV-1a bucket assignment) (PR 19)#727
prk-Jr merged 446 commits into
mainfrom
feature/edgezero-pr19-cutover-canary

Conversation

@prk-Jr

@prk-Jr prk-Jr commented May 21, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Adds edgezero_rollout_pct (0–100) to the trusted_server_config store so EdgeZero traffic can be shifted incrementally without a deploy
  • Uses FNV-1a 32-bit hash of client_ip for deterministic, sticky per-user bucket assignment (0–99); routing predicate is bucket < rollout_pct
  • Fail-safe defaults: read error or invalid value → 0 (all legacy); absent key (backward compat) → 100 (all EdgeZero); rollout_pct = 0 is the immediate no-deploy rollback

Changes

File Change
crates/trusted-server-adapter-fastly/src/main.rs Added parse_rollout_pct, fnv1a_bucket, canary_routes_to_edgezero, read_rollout_pct; updated main() with canary dispatch; 11 unit tests (pinned FNV-1a vectors, distribution, boundary routing)
fastly.toml Added edgezero_rollout_pct = "0" to local Viceroy config store with ops comment
docs/internal/EDGEZERO_MIGRATION.md New ops runbook: config store keys, failure modes, pre-flight activation, 4 canary stages, pass/fail thresholds, rollback procedure, monitoring

Closes

Closes #500

Test plan

  • cargo test-fastly — 65 + 956 + 21 + 3 tests green
  • cargo test-axum — 31 tests green
  • cargo clippy-fastly && cargo clippy-axum — no warnings
  • cargo fmt --all -- --check — clean
  • JS tests: cd crates/js/lib && npx vitest run — 291 tests green
  • JS format: cd crates/js/lib && npm run format — clean
  • Docs format: cd docs && npm run format — clean
  • WASM build: cargo build --package trusted-server-adapter-fastly --release --target wasm32-wasip1
  • Manual testing via fastly compute serve with edgezero_rollout_pct = "1" and "0" (rollback)

Checklist

  • Changes follow CLAUDE.md conventions
  • No unwrap() in production code — use expect("should ...")
  • Uses log macros (not println!)
  • New code has tests
  • No secrets or credentials committed

prk-Jr added 30 commits April 27, 2026 16:50
…on' into feature/edgezero-pr14-entry-point-dual-path
…dispatch_unregistered_method_returns_405_at_router_level
Incorporates the round-3 review fixes from PR14 (E2E tests, finalize-header
coverage for unregistered methods, allowed_domains documentation) and the
broader PR14 refactoring (EC ID rename from synthetic_id, kv_ops removal from
handle_auction, runtime_services_for_consent_route consent KV approach,
async tokio tests in proxy/orchestrator, PublisherResponse streaming).

Conflict resolutions:
- Take PR14's refactored core files (endpoints, orchestrator, consent,
  publisher, proxy, platform/mod, testlight)
- Keep PR15's deletion of core/compat.rs and storage/mod.rs (Fastly code
  stays in adapter, not core)
- Add from_fastly_response to adapter compat.rs (needed by EdgeZero path
  for entry-point finalize-header wrap introduced in PR14)
- Remove obsolete FastlyConsentKvStore / open_consent_kv / ConsentKvOps
  references from platform.rs and app.rs (trait removed in PR14)
- Fix orchestrator backend_name call to pass services (PR15 trait signature)
- Add minimal backend.rs and storage/mod.rs stubs to core for
  DEFAULT_FIRST_BYTE_TIMEOUT and storage::kv_store module path
- Add tokio as dev-dependency to trusted-server-core for async tests
Conflict resolutions:
- Cargo.lock: regenerated after adding futures dep
- docs/guide/testing.md: kept Axum adapter test command (PR16) and
  EC ID rename + specific-test example (PR15)
- .cargo/config.toml: kept PR16 version (no WASM default target) because
  trusted-server-adapter-axum is now a workspace member and tokio/reqwest
  cannot compile to wasm32-wasip1
- AGENTS.md: updated cargo test instruction to document both required commands
  now that the workspace contains both WASM-only and native-only crates

API breakage from PR15 (handle_publisher_request now returns PublisherResponse):
- Add resolve_publisher_response() helper mirroring the Fastly adapter pattern
  (Buffered / Stream / PassThrough -> Response)
- Drop the removed None arg from handle_auction and handle_publisher_request
  call sites
PR15 merge — modifications to existing files not captured in the merge commit:
- Fastly adapter: migrate to EdgeZero router/middleware, rewrite management API,
  update platform/compat/logging/backend to new EdgeZero abstractions
- Core: remove KV-backed consent layer (consent/kv.rs deleted), update auction
  orchestrator and integrations to new API surface, refactor html_processor,
  cookies, and error types
- Delete stale Fastly KV store test fixtures (counter_store.json, opid_store.json)
- Update fastly.toml, trusted-server.toml, and integration test config to match

PR16 workspace restructuring:
- Add both adapters to workspace members; set default-members=[core, axum] so
  bare `cargo test` runs natively without WASM target conflicts
- Replace test-wasm alias with test-fastly (--workspace --exclude axum
  --target wasm32-wasip1) and keep test-axum (-p trusted-server-adapter-axum)
- Update CI: cargo test-fastly in the Fastly job, cargo test-axum in native job
- Axum admin routes (/admin/keys/rotate, /admin/keys/deactivate) return 501
  instead of falling through to an error handler
- Update CLAUDE.md, AGENTS.md, README, and all guide docs to use new aliases
Resolved conflicts:
- .cargo/config.toml: extend test-fastly alias to also exclude trusted-server-adapter-cloudflare
- .github/workflows/test.yml: use cargo test-fastly alias in CI (from PR16) and keep test-cloudflare job (from PR17)
- CLAUDE.md: keep Cloudflare workspace entry and cargo test-axum alias; merge both adapters' commands
- crates/trusted-server-core/src/proxy.rs: use #[tokio::test] + std::time pattern (consistent with PR16 tests)
- Cargo.lock: regenerated to reflect merged workspace state
Viceroy internally runs `cargo run --bin trusted-server-adapter-fastly`
against the default-run packages. With core + axum as defaults, Cargo
fails to find the binary. Restricting default-members to the fastly
adapter fixes the lookup. Updated stale comments in CLAUDE.md and
.cargo/config.toml to reflect the new setup.
Brings in the default-members fix: restricts workspace default-members
to `trusted-server-adapter-fastly` so Viceroy can locate the binary
via `cargo run --bin` during `cargo test-fastly`.
…/edgezero-pr13-integration-provider-type-migration
Blocking fixes:
- Remove debug_assert! in compat::to_fastly_response that contradicted the
  documented silent-drop fallback and caused to_fastly_response_with_streaming_body_produces_empty_body
  to panic in debug builds
- Restore streaming dispatch for PublisherResponse::Stream, which was regressed
  by the PR 11 compat shim. Introduce HandlerOutcome enum so route_request can
  signal streaming intent back to the adapter. main() handles HandlerOutcome::Streaming
  via to_fastly_response_skeleton + stream_to_client + stream_publisher_body directly,
  avoiding the Vec<u8> buffer that delayed TTFB and scaled WASM memory with body size.

Non-blocking fixes:
- Add to_fastly_response_skeleton for headers-only fastly response conversion
- Add geo-401 comment: skip geo lookup for unauthenticated callers
- Add audited-callers comment on EdgeBody::Stream arms in compat
- Add O(N) / PR 15 comment on bytes.to_vec() calls
- Deduplicate auction/publisher status extraction in route_tests via outcome_status helper
Resolve conflicts in main.rs, compat.rs, auction/endpoints.rs,
cookies.rs, edge_cookie.rs, migration_guards.rs, publisher.rs,
and request_signing/endpoints.rs. Includes main's JA4 debug
endpoint, Sourcepoint integration, and Prebid bid param override
rules. Removed unused try_build_ec_cookie_value and
get_ec_id_from_http_request stubs surfaced by clippy after merge.
StreamingBody in fastly 0.11.12 has no Drop impl — dropping it without
calling finish() leaves the chunked HTTP stream with no terminal chunk,
causing clients to receive "unexpected EOF during chunk size line".

Match on stream_publisher_body result and call finish() in both the
success and error paths so the response is always properly terminated.
prk-Jr added 4 commits June 20, 2026 19:40
…er' into feature/edgezero-pr19-cutover-canary
EDGEZERO_LOG_LEVEL is not propagated into the Compute guest by
fastly compute serve, so the documented local route-decision validation
path never raised the level. Detect the Viceroy environment through the
guest-visible FASTLY_HOSTNAME=localhost signal and auto-raise to debug
there, keeping production at the safe Info default. The explicit env
override is retained for harnesses that populate the guest environment.
max_level was applied only to the inner log_fastly Logger, so the outer
fern Dispatch still ran timestamp and format construction for records the
child logger later dropped. Set the same level on the Dispatch so
below-threshold records are filtered before formatting at production QPS.
Update the migration runbook and implementation plan to describe local
route-decision logging via the guest-visible FASTLY_HOSTNAME=localhost
signal instead of the EDGEZERO_LOG_LEVEL env var, which the Compute guest
never sees. Replace the stale Ok(None) => 100 plan snippet with the
fail-safe absent-key arm, correct the stickiness invariant to per-client-IP
(NAT and IP changes can re-bucket), switch verification commands to the
cargo clippy-fastly alias, and drop the tail/head pipes that masked the
real exit status of test, fmt, clippy, build, and docs-format checks.

@ChristianPavilonis ChristianPavilonis left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated review:

Review Summary

Reviewed PR #727 against feature/edgezero-pr19-spin-adapter, focusing on the Fastly canary routing implementation, rollout defaults/failure behavior, logging-level changes, local config, migration runbook, duplicated implementation-plan guidance, CI, and existing review feedback. I did not find any blocking correctness, security, data-loss, authorization, or severe compatibility issues. I left one P2 inline comment on stale plan guidance that now contradicts the implemented tests/helper names and could mislead future agent handoffs.

Findings

P0 / Blockers

None.

P1 / High

None.

P2 / Medium

1 inline comment submitted.

P3 / Low

None.

CI / Existing Reviews

All reported PR checks are passing. Existing review feedback about absent-key fail-safe behavior, per-client-IP stickiness, route-decision observability, logging override scope, stale fail-open snippets, and validation command exit-status masking appears addressed in the runtime code, fastly.toml, the canonical migration runbook, and portions of the implementation-plan document. Locally, cargo test -p trusted-server-adapter-fastly --target wasm32-wasip1 rollout_pct --no-run completed successfully.

Comment thread docs/superpowers/plans/2026-05-21-pr19-cutover-canary-rollout.md Outdated

@aram356 aram356 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Re-review against the PR base (feature/edgezero-pr19-spin-adapter). All findings from the prior review rounds are resolved:

  • Absent edgezero_rollout_pct now fails safe to 0 (legacy), with the doc table, runbook failure-modes table, and setup precondition updated to match.
  • The runbook stickiness claim is corrected to "sticky per client IP (not per user)".
  • The log-level override docs no longer imply it only raises (it can lower too), and the FASTLY_HOSTNAME signal is documented as Fastly-specific and not to be copied into the axum/spin/cloudflare adapters.
  • .level(max_level) on the fern Dispatch now filters route-decision debug logs out of production (without it, fern's default would have overridden the Logger's Info).

Approving. Three non-blocking items remain inline: a test-coverage gap on the log-level detection logic (♻️), the two-per-request config-store reads (🤔, carried), and an EDGEZERO_LOG_LEVEL precedence nit (⛏).

Non-blocking

♻️ refactor

  • Log-level detection logic is untested — extract a pure helper and pin the production-safety property (real hostname must stay Info). (logging.rs:76)

🤔 thinking

  • Two config-store reads per request — consolidatable into one read. (main.rs:328)

⛏ nitpick

  • Unparseable EDGEZERO_LOG_LEVEL suppresses local auto-debug — minor dev-surprise. (logging.rs:71)

CI Status

  • fmt: PASS
  • clippy: PASS
  • rust tests: PASS
  • js tests: PASS

Comment thread crates/trusted-server-adapter-fastly/src/logging.rs Outdated
Comment thread crates/trusted-server-adapter-fastly/src/logging.rs Outdated
Comment thread crates/trusted-server-adapter-fastly/src/main.rs Outdated
prk-Jr added 3 commits June 22, 2026 21:29
Resolve conflicts keeping this branch's full fastly-removal architecture
while preserving main's independent improvements:

- Cargo.toml / core Cargo.toml: keep tower (axum dep); drop the residual
  fastly dependency from core (this branch removes it entirely via the
  EcKvStore platform trait + adapter-local rate limiter, superseding main's
  allowlist approach).
- migration_guards.rs: keep the no-fastly guards; drop main's deferred-EC
  fastly allowlist tests.
- fastly app.rs / main.rs: keep this branch's fastly-in-adapter wiring
  (ec_kv, rate_limiter, use_finalize_kv, core buffer_publisher_response)
  and adopt main's full ClientInfo capture into request extensions, which
  fixes the EdgeZero bot-protection metadata loss for integration request
  filters (DataDome).
- edge_cookie.rs / storage/kv_store.rs: take main's identifier-free consent
  and EC log messages.
- proxy.rs: take main's SVG image-optimizer tests (superset of this branch's).
- auction/orchestrator.rs: keep the sequential-fan-out rejection test.
Blocking:
- Force manual redirect handling on the Cloudflare fetch path
  (RequestRedirect::Manual). The Workers runtime defaults to Follow and
  auto-chases 3xx to any host, bypassing core's per-hop allowed_domains
  enforcement in proxy_with_redirects (SSRF). Surfaces the 3xx back to core
  unfollowed, matching the Axum adapter.
- Return a fixed 501 from /admin/keys/rotate and /admin/keys/deactivate.
  Cloudflare's config/secret stores reject writes, so the real handlers
  surfaced an opaque 500; 501 makes the unsupported-runtime contract explicit,
  matching Axum.

Non-blocking parity/safety:
- Cap the buffered upstream response body at 10 MiB (Content-Length pre-check
  + post-buffer check), mirroring the Fastly adapter, to avoid OOMing the
  isolate on a large/hostile origin.
- Reject image_optimizer and stream_response in the fetch path rather than
  silently dropping the behavior, honoring the PlatformHttpRequest contract.
- Strip hop-by-hop response headers (and Connection-listed tokens) before
  forwarding, matching the Axum adapter's is_hop_by_hop_response_header.
- Derive the S3 SigV4 signing time from the wasm-safe web_time clock so it no
  longer calls std::time::SystemTime::now(), which panics on
  wasm32-unknown-unknown (latent once asset proxying lands on Cloudflare).

Tests:
- Add routes_with_settings/build_router seam to the Cloudflare adapter and
  assert admin routes return 501 and the tsjs catch-all returns < 500,
  matching the Axum route-test coverage that would have caught the admin
  status divergence.
The build script only installed worker-build when none was on PATH, so a
globally installed worker-build 0.8.x was used as-is and rejected the worker
0.7 crate pinned in Cargo.toml ("Unsupported version worker@0.7.x, expected at
least worker@0.8.4"), failing `wrangler dev`. Check the installed major.minor
and reinstall ^0.7 when it is absent or mismatched.

@ChristianPavilonis ChristianPavilonis left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

👍 Approving. The Fastly canary runtime path now fails safe to legacy, the critical rollout helper branches are covered, and reported CI is green. I left non-blocking inline comments using the CREG style from CONTRIBUTING.md.

Folded follow-ups

🔧 wrench — Configuration guide still describes the old one-key rollout flowdocs/guide/configuration.md:1163

This file is not in the PR diff, so I could not leave an inline suggestion. The guide says edgezero_enabled=true routes traffic through EdgeZero and the examples only set that key. With this PR, absent edgezero_rollout_pct defaults to 0, so following the guide produces 0% EdgeZero. Please update the guide with both config keys, staged values, fallback behavior, and rollback via edgezero_rollout_pct = "0".

📝 memo — Spec drift to reconcile laterdocs/superpowers/specs/2026-03-19-edgezero-migration-design.md:1512 still says rollout hashes the request ID, while the implementation and runbook hash client IP. Please align that source doc when updating rollout documentation.

Comment thread docs/internal/EDGEZERO_MIGRATION.md
Comment thread crates/trusted-server-adapter-fastly/src/main.rs Outdated
Comment thread crates/trusted-server-adapter-fastly/src/main.rs Outdated
Comment thread crates/trusted-server-adapter-fastly/src/main.rs
prk-Jr added 15 commits June 23, 2026 20:52
Resolve conflicts from main's crate rename (js→trusted-server-js,
openrtb→trusted-server-openrtb, integration-tests→trusted-server-integration-tests),
toolchain bump (1.95.0), and the edgezero #269 repin.

Key resolutions:
- Keep the branch's pinned edgezero rev 38198f9 and fastly 0.11.12. That rev's
  Body::into_bytes returns Bytes (not Option) and the Fastly 0.11 TLS getters
  return Option directly, so main's #269 glue (.into_bytes().unwrap_or_default(),
  .ok().flatten() on TLS getters, oneshot().expect()/?) is reverted to the
  branch's API shape. Forward repin to fastly 0.12 is deferred to its own PR.
- Adopt toolchain 1.95.0 but keep both wasm targets (Cloudflare needs
  wasm32-unknown-unknown).
- Take the platform-neutral EC KV rewrite (ec/kv.rs) over main's Fastly-coupled
  version; main's changes there were cosmetic.
- Carry main's new features onto the branch: tester-cookie endpoints
  (/_ts/set-tester, /_ts/clear-tester) wired into the Fastly router, and the
  osano integration.
- Fix the Cloudflare adapter's trusted-server-js path after the rename, drop the
  leftover renamed-away crate dirs, and restore repo-wide node_modules/dist
  ignores (the old crates/js/.gitignore was removed by the rename).

All adapters build, lint, and test green (fastly+core, axum, cloudflare).
Resolve PR19 conflicts against the latest PR18 base.

Keep PR18's renamed crate layout/toolchain updates while preserving PR19's Spin workspace member, CI checks, route parity, and worker 0.8 compatibility.

Combine Fastly TLS header hardening with PR18's client-info extraction helper, and update Spin/integration-test paths for the trusted-server-js rename.
…ezero-pr18-phase5-verification

# Conflicts:
#	.github/workflows/integration-tests.yml
…ezero-pr18-phase5-verification

# Conflicts:
#	crates/trusted-server-integration-tests/Cargo.lock
…/edgezero-pr19-spin-adapter

# Conflicts:
#	crates/trusted-server-integration-tests/Cargo.lock
@prk-Jr prk-Jr changed the base branch from feature/edgezero-pr19-spin-adapter to main June 29, 2026 11:10
prk-Jr and others added 2 commits June 29, 2026 16:41
The cutover-canary gating in the Fastly adapter main() routes to
edgezero_main only when edgezero_enabled=true and read_rollout_pct()
covers the request. read_rollout_pct fails safe to 0 (legacy) when the
edgezero_rollout_pct key is absent, so the EdgeZero canary fixture must
set it explicitly. Without it, main() fell back to legacy_main and the
test_ec_lifecycle_fastly TRACE canary saw a proxied 200 instead of the
EdgeZero router 405.

Set the key to 100 (full cutover) so every request routes to EdgeZero
deterministically, independent of client-IP bucketing.
@prk-Jr prk-Jr merged commit c5a1f5f into main Jun 29, 2026
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cutover and canary rollout

3 participants