Review: 17 findings raised, 17 confirmed (0 refuted). Deduplicated below. Many
collapse into ONE unifying fix: resolve the per-request proxy exactly once
(BYOP > config), store it in REQUEST_PROXY, and make BOTH the HTTP and CDP
paths consume that single entry — no re-picking, no path-specific resolution.
crw-core/proxy.rs— makeStickyPerHoststateless:idx = fxhash(host) % len. Removes theMutex<HashMap>(→ fixes #13 unbounded growth) and the first-insertnext_rr()advance (→ removes a cursor side-effect). Replacedebug_assert!(len>0)with a real guard / keep build the sole non-empty constructor (#14).- Single resolution point. Add
ProxyRotator::pickusage so callers resolve ONEProxyEntry:single.rs::scrape_url: build the BYOP rotator fromreq.proxy_list/req.proxyFIRST (InvalidRequest on bad);resolved = byop.pick(host) ?? renderer.pick_proxy_for_url(url); scopeREQUEST_PROXY(resolved). (Fixes A/#1/#4/#9.)crawl.rs: per page build BYOP rotator fromCrawlRequest.proxy_list(job-scoped, fail the job on bad) and resolveREQUEST_PROXYfrom it, else config. (Fixes E/#8.)
- HTTP path honors
REQUEST_PROXY.RotatingHttpFetcher::fetch: ifREQUEST_PROXYis set, use the warm client whoseentry.raw()matches; if absent (BYOP not in the config pool) build a one-off client; only whenREQUEST_PROXYisNonefall back torotator.pick_index. (Fixes D/#11/#12 — single pick, HTTP+CDP aligned.) Also make the plain (no-config) HTTP path honorREQUEST_PROXYso BYOP-with-no-config-proxy is proxied, never direct. - Drop the now-redundant per-request temp-fetcher branch in
single.rs(the shared renderer +REQUEST_PROXYcovers proxy; keep the temp fetcher only for the stealth-only override case).
- B/#2/#3 — LightPanda escape hatch. In
fetch_with_js, whenproxy_activeand filtering removes all renderers, return a hardCrwError("proxy required but no proxy-capable JS renderer"), never keep LightPanda. AddPageFetcher::supports_proxy()(default true; LightPanda false) to generalize. - C/#5/#6 — CLI silent fallback. Route CLI
--proxythroughProxyRotator::build+with_proxy_rotator(like the server), and/or splitHttpFetcher: infallible no-proxy ctor + fallible proxy ctor. Remove theOption<&str> proxy → unwrap_or_else(default client)foot-gun so a bad proxy is a hard error everywhere. - F/#10 — robots/discovery. Build the robots/discovery reqwest client from the
rotator (pick for the origin host) instead of the single
proxy; replace thewarn + continue directon bad proxy with a hard error. - G/#7 — SOCKS5+auth on CDP. In
ProxyEntry, exposesupports_cdp_auth()(false for socks5/socks5h with auth) and mapsocks5h→socks5forchrome_proxy_server. Reject (or skip-to-error) a socks5+auth entry on the CDP path so it never silently hangs.
- #15 dispose the browser context on the malformed-
createBrowserContextparse-failure path. - #16
proxyBypassList: omit it (Chrome bypasses loopback by default) or use"<local>". - #17 map
RotatingHttpFetcherbuild error toInvalidRequeston the BYOP call site.
cargo clippy --workspace(default +cdp) clean; full test suite green.- New unit tests: stateless-sticky determinism; round_robin single-advance per request;
socks5+auth → rejected for CDP; socks5h→socks5 mapping; lightpanda-only+proxy → error;
proxy-active prefers chrome over lightpanda;
HttpFetcher::with_proxyfail-closed. - Re-run the multi-agent review workflow until 0 confirmed blocker/high findings.
- Round 1 — 17 confirmed (3 blocker, then highs/mediums/lows). Fixed in commits
92c5027+3ff9145. - Round 2 — 11 confirmed, 4 high — all one root: the
/mapdiscover_urlspath was left on the old direct/single-proxy path (BFS fetch unscoped + discovery client ignored the rotator + silent-swallow). Fixed inc920a32. - Round 3 — 8 confirmed, 0 blocker/high →
satisfied: true. Per-request client warm-pool regression and the missing fail-closed tests were then addressed. - Round 4 — verifying the round-3 polish surfaced 1 high (new):
crw crawl --js --proxyrouted only the HTTP tier through the proxy — the CLI crawl/map renderer never attached a rotator, soREQUEST_PROXYstayedNoneand the JS/CDP tier (+ the lightpanda fail-closed guard) was bypassed → real-IP leak. Fixed inca5d7e2(CLI crawl/map build the renderer with a rotator from--proxy). - Round 5 — 6 confirmed, 0 blocker/high →
satisfied: true. Loop converged.
- robots/sitemap client is origin-scoped: under a multi-host crawl with
sticky_per_host, non-origin hosts' robots/sitemap fetches use the origin host's proxy while their page fetches use their own — still proxied (no leak), just an IP-correlation inconsistency. Per-host robots client or documentation. - /map drops per-request BYOP:
MapRequest/DiscoverOptionshave noproxy_list; /map honors only the server-config rotator (fail-safe, never direct when config proxy exists). Add fields if /map BYOP is desired. - proxy client cache eviction:
http_fetcher_for_requestclears the whole cache at 512 entries (coarse safety valve for arbitrary BYOP; config pools never hit it). Swap for LRU if a pathological BYOP workload warrants it. - #15 createBrowserContext parse-failure: if Chrome returns success without a
browserContextId, the context isn't disposed — effectively unreachable (Chrome always returns the id; there is no id to dispose), accepted.
- #17 error class:
build_clientmaps a post-parsereqwest build failure toConfigError(HTTP 500) even on the per-request BYOP path (should beInvalidRequest/400). Window is tiny —ProxyEntry::parsealready validated scheme+host, soreqwest::Proxy::allfailing afterward is unlikely. Map at thewith_proxycall sites if surfaced. - chrome_proxy tier override: when a config/BYOP
REQUEST_PROXYis active AND a request escalates to the residentialchrome_proxytier, the per-request proxy overrides the DataImpulse residential egress. Uncommon combo; document or skip the per-request context whenself.name == "chrome_proxy". - sticky key normalization: BYOP sticky uses the raw host; the config rotator uses
normalize_host(eTLD+1). Mutually exclusive per request (no correctness impact); unify by normalizing the BYOP host key too.