Skip to content

Conditional compilation of compute drivers #1943

@dimityrmirchev

Description

@dimityrmirchev

Problem Statement

The openshell-gateway binary unconditionally compiles in all four compute drivers (Kubernetes, Docker, Podman, VM). Operators who only need one pay for the others in:

  • Supply-chain surface. A CVE in bollard, kube-rs, oci-client, etc., applies to the shipped binary even when that driver is never instantiated.
  • Build time and binary size. Every clean build pays for every driver.
  • Downstream packaging. Producing e.g. a "Kubernetes-only" or "Docker-only" gateway image today requires forking Cargo.toml.

Measured against the gateway's current dependency graph, a single-driver build (e.g. K8s-only or Docker-only) sheds a substantial fraction of the transitive crate set. Trimming one driver from the full default set saves less, because the four drivers share most of their tonic/hyper/serde stack. The headline savings come from picking a single driver, not from removing one.

The same need exists for telemetry as a build-time switch — already partially solved (a telemetry cargo feature exists on openshell-server), so this issue mainly extends an established pattern.

Proposed Design

Add per-driver cargo features on openshell-server, all enabled by default. Building with --no-default-features and selecting a subset produces a gateway that contains only the requested drivers (and optionally telemetry).

Feature surface:

  • default = ["telemetry", "driver-kubernetes", "driver-docker", "driver-podman", "driver-vm"]
  • telemetry — already exists; retained as-is.
  • driver-kubernetes, driver-docker, driver-podman, driver-vm — new; each gates the corresponding driver crate as an optional dependency.

Behavior contract:

  1. Default build is byte-equivalent to today (all drivers + telemetry).
  2. Any non-empty combination of driver features must compile and produce a working gateway that supports exactly those drivers.
  3. A build with zero driver features must fail at compile time with a clear message — not at first sandbox creation.
  4. If gateway.toml configures a driver that wasn't compiled in, the gateway must fail at startup with a clear error naming the missing driver and the cargo flag that re-enables it.
  5. Auto-detection (detect_driver() when no driver is configured) must skip drivers not compiled in and try the next available one in a deterministic order, rather than failing.
  6. Runtime telemetry behavior (the OPENSHELL_TELEMETRY_ENABLED env var) is unchanged. The cargo feature controls whether the code is present; the env var controls whether present code emits.

CI: A build matrix that exercises the default build, each driver in isolation, and at least one mixed combo. Reuse the approach already used by tasks/scripts/verify-telemetry-compiled-out.sh (grep the binary for known strings from omitted crates) to verify drivers actually compile out.

Alternatives Considered

  1. Runtime-only gating. Cheapest, but doesn't shrink the binary, dependency graph, or supply-chain surface. The explicit ask is to not include the code, not just disable it.
  2. A single slim feature that drops all non-K8s drivers. Inflexible — a Docker-only operator can't use it. Per-driver features cost the same to implement and give every combination for free.
  3. One gateway binary per driver. Multiplies release artifacts, breaks the single-binary deployment story, forces every config/auth change to ripple across N binaries.
  4. Dynamic plugin loading (cdylib). True runtime selection but introduces ABI-stability obligations, breaks single-static-binary deploys, and complicates signing/distribution. Overkill for the stated need.
  5. Workspace default-members exclusion. Doesn't change openshell-server's own dependency graph — solves a different problem.

Agent Investigation

  • Dependency footprint was measured directly with cargo tree -p <crate> --edges normal --prefix none --no-dedupe. Numbers in the Problem Statement table are reproducible from HEAD. Note that "exclusive crates given the other three drivers are present" is a different number from "crates added on top of openshell-core" — the former is small for Docker (1) and Podman (1) because they share their transitive graph with K8s through tonic/hyper/serde; the latter (+19, +33) is what an operator actually saves in a single-driver build. The proposal motivates the feature in terms of single-driver builds, where the savings are real.
  • Driver wiring is centralized in three sites in openshell-server: imports/re-exports in crates/openshell-server/src/compute/mod.rs:9-38, four ComputeRuntime::new_* constructors in crates/openshell-server/src/compute/mod.rs:295-413, and the match driver block in crates/openshell-server/src/lib.rs:716-780. No driver type leaks elsewhere into the gateway, so feature-gating is mechanically straightforward.
  • The CLI does not depend on driver crates (confirmed in crates/openshell-cli/Cargo.toml) — it talks to the gateway over gRPC. No CLI changes needed.
  • The telemetry feature already follows the exact pattern proposed here. crates/openshell-server/Cargo.toml:99-107 defines default = ["telemetry"] and propagates to openshell-core/telemetry. crates/openshell-driver-vm/Cargo.toml:49-54 does the same, and crates/openshell-core/src/telemetry.rs uses #[cfg(feature = "telemetry")] on imports, the background sender thread, the HTTP client, and emission helpers (which become no-ops when the feature is off). tasks/scripts/verify-telemetry-compiled-out.sh already proves out the compile-out property in CI by grepping for events.telemetry.data.nvidia.com and the telemetry client ID. The driver feature surface should reuse this exact shape — no new pattern to invent.
  • The VM driver is already out-of-process — the gateway just spawns openshell-driver-vm as a subprocess and talks to it over a UDS gRPC channel (crates/openshell-server/src/compute/vm.rs). Its in-server footprint is small (a spawn helper + RemoteComputeDriver gRPC client), so a driver-vm feature is essentially gating that helper module.
  • Existing TODO(driver-abstraction) at crates/openshell-server/src/lib.rs:12-20 already anticipates collapsing the per-arm wiring. This proposal does not require that refactor — but lands well alongside it whenever it happens.
  • README already documents OPENSHELL_TELEMETRY_ENABLED as the runtime knob (README.md mentions it around line 251). This issue does not change the runtime env-var contract; it adds a parallel compile-time knob, exactly as the existing telemetry cargo feature does.

Checklist

  • I've reviewed existing issues and the architecture docs
  • This is a design proposal, not a "please build this" request

Metadata

Metadata

Assignees

No one assigned

    Labels

    state:triage-neededOpened without agent diagnostics and needs triage

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions