Fix/scoring stuck on broker error #2450

Draft
ObadaS wants to merge 7 commits into develop from fix/scoring-stuck-on-broker-error

@ObadaS ObadaS commented Jun 26, 2026

Collaborator

Original PR

#2420

A brief description of the purpose of the changes contained in this PR

Submissions could get stuck in SCORING indefinitely because the Celery task that runs the next phase was enqueued inside the outer Django transaction, before the row was committed. The compute_worker could dequeue and execute the task before the new status (and updated FKs like queue / celery_task_id) were visible in the database, then bail out or operate on stale data.

This PR moves the send_task call into a transaction.on_commit() callback so the message hits RabbitMQ only after the surrounding transaction is committed. Behaviour is otherwise unchanged.

Issues this PR resolves

Closes #2419

Symptoms reported:

  • Submissions stuck in SCORING with no worker activity.
  • Sporadic Submission.DoesNotExist / stale-read errors in compute_worker logs right after a status transition.
  • celery_task_id occasionally NULL on rows that did get picked up.

Root cause: app.send_task(...) was called from inside the outer @transaction.atomic scope of _run_submission, so the broker received the task before PostgreSQL committed the writes. Under load (or with a fast worker / slow commit), the worker won the race.
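
The visibility problem behind this race can be reproduced in miniature with two database connections. The sketch below is only an illustration: it uses SQLite from the standard library rather than PostgreSQL, and the `submission` table and the `web` / `worker` connection names are hypothetical. It shows that a second connection still reads the old status while the writer's transaction is open.

```python
import os
import sqlite3
import tempfile

QUERY = "SELECT status FROM submission WHERE id = 1"


def status_seen_by_worker():
    """Return the status a second connection reads before and after the commit."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sqlite3")
        # isolation_level=None puts sqlite3 in autocommit mode, so the
        # BEGIN / COMMIT statements below mark the transaction explicitly.
        web = sqlite3.connect(path, isolation_level=None)
        worker = sqlite3.connect(path, isolation_level=None)
        try:
            web.execute("CREATE TABLE submission (id INTEGER PRIMARY KEY, status TEXT)")
            web.execute("INSERT INTO submission VALUES (1, 'Running')")

            web.execute("BEGIN")
            web.execute("UPDATE submission SET status = 'Scoring' WHERE id = 1")
            # A task enqueued at this point could reach the worker right now,
            # while the update above is still invisible to other connections.
            before = worker.execute(QUERY).fetchall()[0][0]
            web.execute("COMMIT")
            after = worker.execute(QUERY).fetchall()[0][0]
        finally:
            web.close()
            worker.close()
    return before, after


print(status_seen_by_worker())
```

A worker that reads in the window before `COMMIT` behaves like the `before` read here: it sees the previous status, which matches the stale-read symptom listed above.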

Fix: wrap the enqueue + celery_task_id write in a _enqueue_after_commit() closure and register it via transaction.on_commit(...). The closure runs only when the outer transaction commits successfully, and is silently dropped on rollback (no orphaned messages on the broker).
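
As a rough sketch of the shape of this fix (not the actual code in tasks.py), the example below models the commit hook with a toy context manager so it runs without Django or a broker. `ToyTransaction`, the `Submission` dataclass, the `send_task` parameter and the task name `"compute_worker_run"` are all hypothetical stand-ins; in the real code the hook is `transaction.on_commit(...)` and the publish call is `app.send_task(...)`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Submission:
    """Hypothetical stand-in for the Django model row."""
    pk: int
    status: str = "Submitted"
    celery_task_id: Optional[str] = None


class ToyTransaction:
    """Toy model of an atomic block: callbacks run on commit, vanish on rollback.

    It models only the callback behaviour, not the rollback of field writes.
    """

    def __init__(self) -> None:
        self._callbacks: List[Callable[[], None]] = []

    def on_commit(self, func: Callable[[], None]) -> None:
        self._callbacks.append(func)

    def __enter__(self) -> "ToyTransaction":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        callbacks, self._callbacks = self._callbacks, []
        if exc_type is None:  # clean exit: treat as a commit
            for func in callbacks:
                func()
        return False  # never swallow the exception


def run_submission(submission: Submission, txn: ToyTransaction,
                   send_task: Callable[[str, int], str]) -> None:
    submission.status = "Scoring"

    def _enqueue_after_commit() -> None:
        # Publish and record the task id together, only once the commit is done.
        submission.celery_task_id = send_task("compute_worker_run", submission.pk)

    txn.on_commit(_enqueue_after_commit)
```

Passing the publish function in as a parameter is purely a convenience for the sketch: it lets a list-backed fake replace the broker when exercising both the commit path and the rollback path.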

A checklist for hand testing

  • Create a fresh submission on a competition with a non-default queue → it reaches Finished (no stuck SCORING).
  • Create a fresh submission on a competition with the default queue → same.
  • Re-run an existing submission via the UI → it reaches Finished.
  • Submit, then immediately roll back the surrounding request (e.g. force an exception in a signal) → confirm no orphan message hits compute-worker (RabbitMQ management UI shows no dangling delivery).
  • Inspect submission.celery_task_id after enqueue → not NULL.
  • Restart compute_worker mid-submission lifecycle → submission still completes (does not regress M6 idempotency).
  • Cancel a SUBMITTED submission before its commit completes → celery_app.control.revoke(...) still works because the celery_task_id is set inside the same on_commit callback.

Any relevant files for testing

  • Modified: src/apps/competitions/tasks.py (around _run_submission: the _enqueue_after_commit closure + transaction.on_commit(...)).
  • Imports: from django.db import transaction (already present).
  • No model / migration changes required.

Checklist

  • Code review by me
  • Hand tested by me
  • I'm proud of my work
  • Code review by reviewer
  • Hand tested by reviewer
  • CircleCI tests are passing
  • Ready to merge

AybH26 and others added 7 commits June 16, 2026 18:47
…rows

When the compute worker PATCHes a submission to status=SCORING, the API serializer used to call run_submission() synchronously inside the same DB transaction. If the broker (RabbitMQ) was unreachable at that exact moment, the status row would commit but the scoring task would never be published, leaving the submission stuck in SCORING forever (no recovery: the 24h cleanup only rescues RUNNING rows).

Move the enqueue into transaction.on_commit so the task is only published after the SCORING status is durably committed, and explicitly mark the submission as Failed (with a clear status_details) if the publish still fails, so the row never stays in a non-terminal limbo state. Wrap update() in @transaction.atomic to make the commit boundary explicit.
fix(scoring): re-enqueue scoring after commit to avoid stuck SCORING …
This reverts commit d38a013.
This reverts commit 2aad1c9.
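
The first commit message above also describes marking the submission as Failed when the publish itself fails. A minimal sketch of that fallback, under assumptions, could look like the following. `enqueue_or_mark_failed`, the `Submission` dataclass and the `send_task` parameter are hypothetical names, and the broad `except Exception` is a simplification, since the concrete exception type depends on the broker client.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Submission:
    """Hypothetical stand-in for the Django model row."""
    pk: int
    status: str = "Scoring"
    status_details: str = ""
    celery_task_id: Optional[str] = None


def enqueue_or_mark_failed(submission: Submission,
                           send_task: Callable[[int], str]) -> bool:
    """Publish the scoring task; on failure move the row to a terminal state."""
    try:
        submission.celery_task_id = send_task(submission.pk)
    except Exception as exc:  # assumption: real code would catch narrower errors
        submission.status = "Failed"
        submission.status_details = f"Could not enqueue scoring task: {exc}"
        return False
    return True
```

In this sketch the function would be the body of the on-commit callback, so a broker outage leaves the row in Failed with an explanation instead of in SCORING.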

Development

Successfully merging this pull request may close these issues.

Submissions stuck in "Scoring" when broker error occurs during compute_worker PATCH (non-transactional re-enqueue)

2 participants