craftling-go

The control plane and host agent for a multi-host, Firecracker-microVM Minecraft hosting platform. A Gin HTTP API owns users, auth, and game-server desired state; a reconciler drives each server onto a fleet host; and a per-host agent boots it as a real Firecracker microVM. See docs/PLAN.md for the phased roadmap. P0–P6 have landed (foundations, host fleet, scheduler, agent split, Firecracker runtime, world persistence, and the eBPF NAT dataplane); P7–P10 (observability, reliability, quotas, and hardening) remain.

Requirements

Go 1.26+
PostgreSQL 13+
For the real VM backend (AGENT_RUNTIME=firecracker): a Linux host with /dev/kvm, the firecracker binary, a vmlinux kernel, and rootfs images.
For (re)building the eBPF dataplane: clang + libbpf + bpftool on a Linux ≥6.6 host (see docs/ebpf-nat-dataplane.md).

Getting started

cp .env.example .env   # set DATABASE_URL, JWT_SECRET, etc.

# Spin up Postgres (example via Docker):
docker run -d --name craftling-pg -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=craftling -p 5432:5432 postgres:16-alpine

make run               # or: go run ./cmd/server

The server listens on :8080 by default and applies the embedded goose migrations (internal/db/migrations) automatically on startup. The migration step runs cleanly against both a fresh database and one already at an older revision.

Configuration

Control plane (`make run`)

Variable	Default	Description
`PORT`	`8080`	HTTP listen port
`GRPC_PORT`	`8090`	gRPC AgentLink listen port — agents dial it and hold a command stream open
`APP_ENV`	`development`	`production` → release mode + JSON logs
`DATABASE_URL`	`postgres://postgres:postgres@localhost:5432/craftling?sslmode=disable`	Postgres connection string
`JWT_SECRET`	`dev-secret-change-me`	HMAC signing secret (set in prod)
`ACCESS_TTL`	`15m`	Access-token (JWT) lifetime
`REFRESH_TTL`	`720h` (30d)	Refresh-token lifetime
`ADMIN_EMAIL`	(unset)	If set with `ADMIN_PASSWORD`, seeds/promotes this admin on startup
`ADMIN_PASSWORD`	(unset)	Admin bootstrap password
`TEMPLATE_INDEX_URL`	(unset)	Marketplace registry index the `/templates` API fetches from

Host agent (`make run-agent`, `MODE=agent`)

Variable	Default	Description
`CONTROL_PLANE_GRPC_ADDR`	`localhost:8090`	Control plane's gRPC AgentLink the agent dials and holds a stream open to
`ADVERTISE_HOST`	`127.0.0.1`	Player-facing connect host VMs report
`AGENT_RUNTIME`	`fake`	VM backend: `fake` (in-memory) or `firecracker` (real microVMs)
`FC_BINARY`	`firecracker`	Firecracker executable (PATH lookup by default)
`FC_KERNEL`	(required)	Uncompressed `vmlinux` all VMs boot
`FC_IMAGE_REF`	(required)	OCI image converted to a read-only squashfs rootfs; `{version}` is substituted with the server's version (a ref without it is used as-is)
`FC_IMAGE_REF_DEFAULT`	(unset)	Fallback OCI ref when a server has no version and `FC_IMAGE_REF` is templated
`FC_IMAGE_CACHE_DIR`	`images/` under `FC_WORK_DIR`	Where converted, content-addressed squashfs rootfs files are cached
`FC_INIT_BIN_AMD64` / `FC_INIT_BIN_ARM64`	(required)	Prebuilt `cmd/init` binaries injected as guest PID 1, one per guest arch
`FC_WORK_DIR`	OS temp dir	Per-VM working dirs (sockets, logs, vsock UDS)
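
The `{version}` substitution described for `FC_IMAGE_REF` and `FC_IMAGE_REF_DEFAULT` can be sketched roughly as follows. The function name resolveImageRef and the example registry path are hypothetical, and returning an error when a templated ref has neither a version nor a default is an assumption rather than documented behavior.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

const versionPlaceholder = "{version}"

// resolveImageRef picks the OCI ref for one server.
//   - ref:      value of FC_IMAGE_REF (may contain "{version}")
//   - fallback: value of FC_IMAGE_REF_DEFAULT (may be empty)
//   - version:  the server's version (may be empty)
func resolveImageRef(ref, fallback, version string) (string, error) {
	if ref == "" {
		return "", errors.New("FC_IMAGE_REF is required")
	}
	// A ref without the placeholder is used as-is.
	if !strings.Contains(ref, versionPlaceholder) {
		return ref, nil
	}
	if version != "" {
		return strings.ReplaceAll(ref, versionPlaceholder, version), nil
	}
	// Templated ref but no version: fall back to the default ref.
	if fallback != "" {
		return fallback, nil
	}
	// Assumption: with nothing to substitute and no default, fail loudly.
	return "", errors.New("templated FC_IMAGE_REF needs a version or FC_IMAGE_REF_DEFAULT")
}

func main() {
	// registry.example/minecraft is a made-up image name for illustration.
	ref, err := resolveImageRef("registry.example/minecraft:{version}", "", "1.20.4")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ref)
}
```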

World persistence (P5) is off until you opt in:

Variable	Default	Description
`FC_WORLD_PERSIST`	`false`	Attach a per-server world disk overlaid on the rootfs (needs `mkfs.ext4` + a guest kernel with `CONFIG_OVERLAY_FS`/`CONFIG_EXT4_FS`)
`FC_DATA_DIR`	`worlds/` under `FC_WORK_DIR`	Where per-server world disks live (survive stop/start)
`FC_WORLD_DISK_MB`	driver default	Size of a freshly created world disk
`FC_MKFS_EXT4`	PATH lookup	`mkfs.ext4` executable
`FC_WORLD_STORE_DIR`	(unset)	Directory/NFS mount used as the durable world store (restore-on-boot, snapshot-on-stop, cross-host reschedule)
`FC_WORLD_STORE_S3_*`	(unset)	S3-compatible world store (`_ENDPOINT`, `_BUCKET`, `_REGION`, `_ACCESS_KEY`, `_SECRET_KEY`, `_USE_SSL`, `_PREFIX`); takes precedence over `FC_WORLD_STORE_DIR`
`FC_SNAPSHOT_INTERVAL`	`0` (off)	Periodic application-consistent snapshots of running VMs (needs a world store)
`FC_RCON_PORT` / `FC_RCON_PASSWORD`	(unset)	Let the guest flush the workload via RCON before freezing its disk for a live snapshot

The eBPF NAT dataplane (P6) activates automatically on a Linux ≥6.6 agent host with nf_conntrack/nf_nat available; it needs no extra env to run.

AGENT_ID, AGENT_HOSTNAME, ZONE, CPUS_TOTAL, MEMORY_MB_TOTAL, and AGENT_VERSION further describe the host in its registration.

Endpoints

Method	Path	Auth	Description
GET	`/healthz`	—	Liveness probe
GET	`/api/v1/ping`	—	Example endpoint
POST	`/api/v1/auth/register`	—	Create user, returns a token pair
POST	`/api/v1/auth/login`	—	Verify credentials, returns a pair
POST	`/api/v1/auth/refresh`	—	Rotate refresh token, returns a pair
POST	`/api/v1/auth/logout`	—	Revoke a refresh token (`204`)
GET	`/api/v1/me`	Bearer	Current authenticated user
POST	`/api/v1/servers`	Bearer	Create a game server
GET	`/api/v1/servers`	Bearer	List your game servers
GET	`/api/v1/servers/:id`	Bearer	Get one of your servers
PATCH	`/api/v1/servers/:id`	Bearer	Rename, or set `desired_state` (running/stopped)
DELETE	`/api/v1/servers/:id`	Bearer	Tear down a server (`202`)
POST	`/api/v1/servers/:id/snapshot`	Bearer	Request an on-demand world backup
GET	`/api/v1/templates`	Bearer	List marketplace templates
GET	`/api/v1/templates/:id`	Bearer	Get one template's manifest
GET	`/api/v1/admin/users`	Admin	List all users (role `admin` only)
GET	`/api/v1/admin/servers`	Admin	List all servers across all owners
GET	`/api/v1/admin/hosts`	Admin	List the fleet

Agents do not use the HTTP API. Each host agent dials the control plane's gRPC AgentLink service (GRPC_PORT, default :8090) and holds one bidirectional stream open: it registers and heartbeats on that stream, the control plane pushes VM lifecycle commands down it, and the agent answers on the same stream. The agent exposes no inbound API of its own.

Auth endpoints return a token pair:

{
  "access_token": "<jwt>",
  "refresh_token": "<opaque>",
  "token_type": "Bearer",
  "expires_in": 900
}

The short-lived access token is sent as Authorization: Bearer <access_token> on protected routes. When it expires, exchange the refresh token at /auth/refresh for a new pair. Refresh tokens are rotated on every use (the old one is revoked); replaying a rotated token is treated as theft and revokes all of that user's tokens.

# Register (or login) to get a token pair
PAIR=$(curl -s -X POST localhost:8080/api/v1/auth/register \
  -H 'Content-Type: application/json' \
  -d '{"email":"alice@example.com","password":"hunter2pass"}')
ACCESS=$(echo "$PAIR" | jq -r .access_token)
REFRESH=$(echo "$PAIR" | jq -r .refresh_token)

# Call a protected route
curl localhost:8080/api/v1/me -H "Authorization: Bearer $ACCESS"

# Later: rotate for a fresh pair
curl -s -X POST localhost:8080/api/v1/auth/refresh \
  -H 'Content-Type: application/json' \
  -d "{\"refresh_token\":\"$REFRESH\"}"

# Log out (revoke the refresh token)
curl -s -X POST localhost:8080/api/v1/auth/logout \
  -H 'Content-Type: application/json' \
  -d "{\"refresh_token\":\"$REFRESH\"}"

Roles

Each user has a single role (user by default, or admin). The role is embedded in the access-token claims, and RequireRole middleware guards admin routes — so authorization is checked without a database round-trip.

Because the role lives in the JWT, a role change takes effect on the user's next token refresh (within ACCESS_TTL), not instantly.

Set ADMIN_EMAIL + ADMIN_PASSWORD to bootstrap an admin on startup: a matching user is created if absent, or promoted to admin if it already exists (its password is left untouched).

Game servers & the reconciler

A game server separates desired state (running / stopped / deleted, set by the API) from observed status (pending → provisioning → running → stopping → stopped → deleting, plus error). A background reconciler (internal/reconciler) ticks periodically, finds servers whose status doesn't match their desired state, and drives them one step at a time via a Provisioner backend.

Placement (P2). Before a server can run it must be placed on a fleet host. The internal/scheduler picks a ready host with enough allocatable CPU and memory (least-loaded first) and atomically reserves that capacity; the assignment is recorded as game_servers.host_id. A server that fits no host right now is marked unschedulable and retried; a spec larger than any host is rejected at create time.

Agent split (P3). The control plane never touches KVM. The reconciler's backend is provisioner.RemoteProvisioner, which routes each call to the assigned host's agent (cmd/agent / internal/agent) by pushing a command down the persistent gRPC stream that the agent holds open to the control plane (internal/agentlink). The control plane never dials the agent, so agents need no inbound reachability. The agent runs a Runtime selected by AGENT_RUNTIME (fake for the in-memory stub, or firecracker for real microVMs); the API and reconciler behave identically either way because both runtimes satisfy the same agent.Runtime interface. Agents register and heartbeat over that same stream, and a dropped stream marks the host down at once, so the scheduler's view of the fleet stays current.

# Create a server (desired_state defaults to running)
curl -s -X POST localhost:8080/api/v1/servers \
  -H "Authorization: Bearer $ACCESS" -H 'Content-Type: application/json' \
  -d '{"name":"survival","version":"1.20.4"}'

# Or launch from a marketplace template: the control plane fetches the manifest
# from the trusted registry, validates the answers, and resolves the OCI image
# and environment server-side (the client never supplies the image or raw env).
curl -s -X POST localhost:8080/api/v1/servers \
  -H "Authorization: Bearer $ACCESS" -H 'Content-Type: application/json' \
  -d '{"name":"survival","template_id":"vanilla-1","eula_accepted":true,
       "answers":{"DIFFICULTY":"hard"},"cpus":4,"memory_mb":8192}'

# Poll until the reconciler reports status "running" with host/port
curl -s localhost:8080/api/v1/servers/<id> -H "Authorization: Bearer $ACCESS"

# Stop / start
curl -s -X PATCH localhost:8080/api/v1/servers/<id> \
  -H "Authorization: Bearer $ACCESS" -H 'Content-Type: application/json' \
  -d '{"desired_state":"stopped"}'

# Tear down (reconciler deprovisions, then soft-deletes the row)
curl -s -X DELETE localhost:8080/api/v1/servers/<id> -H "Authorization: Bearer $ACCESS"

Deletes are soft: the reconciler deprovisions the VM and then sets status = deleted with a deleted_at timestamp. The row is retained for history/audit but hidden from every API read (and from reconciliation).

Firecracker runtime, the image pipeline & networking

Runtime (P4). internal/agent/firecracker boots each server as a real Firecracker microVM, driving Firecracker through the in-repo generated REST client (internal/firecracker) over the per-VM API Unix socket and managing the VMM process directly. Provision stages a per-VM writable copy of the version's base rootfs, launches firecracker, configures machine/boot-source/drive, then starts the instance. Stop sends SendCtrlAltDel (force-killing the VMM after a grace period) but keeps the rootfs so the world survives a restart on that host, and Start reboots from it. make test-kvm runs the gated lifecycle integration test on a /dev/kvm host.

World persistence (P5). With FC_WORLD_PERSIST on, each server gets a writable world disk (/dev/vdb, ext4) overlaid by the guest onto the workload's working dir, so worlds survive stop/start. A durable world store (FC_WORLD_STORE_DIR or FC_WORLD_STORE_S3_*) restores a world on Provision and snapshots it on Stop — which is what makes a cross-host reschedule safe. Live snapshots quiesce the running game (RCON save-off/save-all flush + fsfreeze) over a vsock control channel, and a control-plane reaper GCs orphan worlds.

Image pipeline. internal/image converts an OCI/Docker image into a read-only squashfs rootfs (internal/squashfs, a from-scratch writer), injecting the Go init binary (cmd/init) as PID 1. The init agent mounts the kernel filesystems, fetches a run spec (internal/runspec) from Firecracker's MMDS over the link-local address, applies per-VM networking + the persist overlay, then execs and supervises the workload — powering the VM off when it exits. Production Provision resolves the server's OCI ref (FC_IMAGE_REF), converts it through this pipeline (cached by digest), attaches the squashfs read-only as /dev/vda, and boots cmd/init with root=/dev/vda ro init=/.craftling/init — the run-spec/persist/NAT machinery layers on top. The control plane resolves launchable templates from a marketplace registry (internal/registry, the /templates API).

Networking (P6). An eBPF NAT dataplane gives each VM real connectivity with no Linux bridge and no iptables/nftables rules: TCX-attached programs SNAT egress and DNAT a per-server host port (allocated by an in-agent IPAM) to the in-VM service port, reusing the kernel's nf_conntrack via bpf_ct_* kfuncs. The allocated host/port is written back to the game server, and the guest applies its address, gateway neighbor, and default route from the run spec. Needs a Linux ≥6.6 agent host. Design and status: docs/ebpf-nat-dataplane.md.

Layout

cmd/server          control-plane entry point: DB connect/migrate + HTTP API + gRPC AgentLink + graceful shutdown
cmd/agent           host-agent entry point: dials the control plane, serves pushed VM commands over a persistent gRPC stream; selects the VM backend
cmd/init            in-VM PID 1: mount, fetch run spec from MMDS, apply networking, supervise workload
internal/config     environment configuration (control plane + agent + Firecracker)
internal/db         pgx pool + embedded goose migrations
internal/model      domain types (User, GameServer, Host)
internal/repository data access (Postgres; in-memory host inventory)
internal/auth       bcrypt password hashing + JWT issue/verify
internal/handler    control-plane routes (auth, servers, templates, admin)
internal/middleware request ID, request logging, JWT auth guard
internal/registry   template registry (marketplace) client
internal/scheduler  host placement + atomic capacity reservation (P2)
internal/reconciler desired-state → observed-status convergence loop
internal/provisioner Provisioner seam: Fake + RemoteProvisioner (P3)
internal/agent      host agent: Runtime interface, FakeRuntime, agent-side gRPC link loop (P3)
internal/agentlink  control-plane side of the agent gRPC stream: hub (connection registry + command push) + generated protobuf
internal/agent/firecracker  real Firecracker microVM Runtime + eBPF NAT dataplane (P4/P6)
internal/firecracker generated Firecracker REST client (go-openapi, over the API Unix socket)
internal/image      OCI/Docker image → squashfs rootfs converter (injects cmd/init)
internal/squashfs   squashfs writer/compressor used by the converter
internal/runspec    host↔guest run-spec contract delivered via MMDS
internal/reaper     periodic background cleanup (stale hosts → down)
internal/seed       initial-data bootstrap (admin account)
internal/logger     zap logger + request-scoped context helpers

Commands

make run          # run the control plane
make run-agent    # run a host agent (MODE=agent; see Makefile/.env.example for env)
make build        # build control plane to ./bin/server
make build-agent  # build agent to ./bin/agent
make test         # run unit tests
make test-e2e     # run end-to-end tests (requires Docker)
make test-kvm     # Firecracker lifecycle test (requires /dev/kvm + host artifacts)
make test-bpf     # eBPF NAT/tapfilter dataplane in-kernel (sudo; Linux ≥6.6)
make bpf-generate # regenerate eBPF bindings/objects (Linux ≥6.6 + clang/libbpf/bpftool)
make tidy         # tidy go.mod
make fmt          # format code

Testing

End-to-end tests live in test/e2e/. They are gated behind the e2e build tag, so make test (plain go test ./...) never touches Docker. The e2e suite uses testcontainers to start a real Postgres, serves the actual router over HTTP, and drives the full lifecycle (register → login → server placement → reconcile → teardown):

make test-e2e          # or: go test -tags e2e ./test/e2e/...

The Firecracker lifecycle test is gated behind the kvm build tag and kept out of the default lane; run it on a KVM host with the required FC_* artifacts in place:

# The KVM tests convert a tiny busybox image themselves, so only the kernel
# (and optionally FC_BINARY) is required:
sudo -E FC_KERNEL=/path/vmlinux make test-kvm
