Skip to content

fix(scheduler): honor a pauseJob() that lands while a job is in flight#161

Open
truffle-dev wants to merge 1 commit into
ghostwright:mainfrom
truffle-dev:fix/scheduler-inflight-pause-stomp-148
Open

fix(scheduler): honor a pauseJob() that lands while a job is in flight#161
truffle-dev wants to merge 1 commit into
ghostwright:mainfrom
truffle-dev:fix/scheduler-inflight-pause-stomp-148

Conversation

@truffle-dev

Copy link
Copy Markdown
Contributor

Summary

executeJob captures newStatus from the in-memory job snapshot taken at pickup, then writes it back unconditionally after handleMessage returns. Because an agent task can run for minutes, a pauseJob() call landing during that window writes status='paused' to the row, but the write-back stomps it back to 'active' and the next scheduled fire runs anyway. The pause is silently lost.

This honors a concurrent pause by re-reading the live status just before the write-back.

Why this matters

pauseJob() (service.ts) does UPDATE ... SET status='paused' WHERE id=? AND status='active'. The SQL succeeds, but executeJob's final UPDATE scheduled_jobs SET ... status=? uses the stale captured value and overwrites it. An operator who pauses a long-running recurring job mid-execution sees it keep firing. There is no error, no log, nothing to debug against.

The fix

Right before the write-back, re-read the row's current status. If an external writer set it to 'paused' during the run, honor it (and clear next_run_at). The guard is scoped to the 'active' continuation case only: 'completed' and 'failed' are this run's terminal outcomes and must win over a concurrent pause. resumeJob() recomputes next_run_at from the schedule when the operator un-pauses, so clearing it here is safe.

if (newStatus === "active") {
	const live = ctx.db.query("SELECT status FROM scheduled_jobs WHERE id = ?").get(job.id) as {
		status: string;
	} | null;
	if (live?.status === "paused") {
		newStatus = "paused";
		nextRunAt = null;
	}
}

One extra single-row read per fire, only on the continuation path.

Test

Adds a regression test in scheduler-hardening.test.ts. The mocked handleMessage performs the concurrent pause mid-run (UPDATE ... SET status='paused'), then the test asserts the row stays 'paused' with next_run_at null after executeJob.

Mutation-proved: reverting the fix turns the test red (Expected: "paused"), restoring it turns it green.

  • bun test src/scheduler/ -> 116 pass, 0 fail
  • bun run lint -> clean
  • bun run typecheck -> clean

Fixes #148

executeJob captured newStatus from the job snapshot taken at pickup, then
wrote it back unconditionally after handleMessage returned. For an agent
task that runs for minutes, a pauseJob() landing in that window set
status='paused' in the row, but the write-back stomped it back to 'active'
and the next fire ran anyway.

Re-read the live status just before the write-back and honor an external
pause. Only the 'active' continuation case is at risk; 'completed' and
'failed' are this run's terminal outcomes and must win. resumeJob()
recomputes next_run_at, so clearing it here is safe.

Fixes ghostwright#148
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

scheduler: pauseJob() is silently no-op on in-flight jobs (executor caches status at pickup)

1 participant