Skip to content

feat(node-removal): online node removal as inverse of cluster expansion#1104

Open
schmidt-scaled wants to merge 9 commits into
mainfrom
feature/node-removal
Open

feat(node-removal): online node removal as inverse of cluster expansion#1104
schmidt-scaled wants to merge 9 commits into
mainfrom
feature/node-removal

Conversation

@schmidt-scaled

Copy link
Copy Markdown
Contributor

Summary

Implements an online storage-node removal function that works as the inverse of cluster expansion (add_node). It is background, task-driven, idempotent and resumablesn remove validates preconditions then queues an FN_NODE_REMOVAL task; the new tasks_runner_node_removal service drives the flow:

shutdown → in_removal → rewire LVS replicas → remove/fail/migrate devices → removed

This replaces the old remove_storage_node semantics (which only handled an already-offline node and did no LVS rewiring).

Behavior

Preconditions (enforced before anything is queued):

  • target node is online;
  • every other non-removed node is online;
  • FTT headroom allows losing the node (_check_ftt_allows_node_removal);
  • the node hosts no LVols and no snapshots (operator migrates those separately, at a higher level);
  • every secondary/tertiary replica the node hosts for other primaries has a host-disjoint relocation target (catches e.g. 2-node clusters).

LVS rewiring:

  • Case A — the node's own primary LVS: tear down its (empty) secondary/tertiary replicas on the peers that host them, clear cross-references.
  • Case B — replicas this node hosts for other primaries: re-host each on a fresh, anti-affinity-valid node (get_secondary_nodes / get_secondary_nodes_2 + recreate_lvstore_on_non_leader) so those primaries keep their fault tolerance. Bookkeeping back-ref on the removed node is cleared only after a successful rebuild, so retries resume cleanly.

Devices: each data device is driven online → removed → failed (queuing failure-migration on the surviving online nodes), then the flow waits for failed_and_migrated before flipping the node to removed.

Changes

  • models/base_model.py: new STATUS_IN_REMOVAL (code 13)
  • models/job_schedule.py: new FN_NODE_REMOVAL
  • controllers/tasks_controller.py: add_node_removal_task + dedup branch + get_active_node_removal_task; skip IN_REMOVAL nodes when fanning out device-migration tasks (their SPDK is dead)
  • storage_node_ops.py: rewrite remove_storage_node (validate + queue) + node_removal_orchestrate + Case A/B + device-decommission/finalize helpers
  • services/tasks_runner_node_removal.py: new runner service (lease-aware, suspend-and-revisit for the long migration wait)
  • scripts/docker-compose-swarm.yml: register TasksNodeRemovalRunner
  • simplyblock_cli/cli.py: updated sn remove help; --force-remove now only cancels active tasks
  • tests/test_node_removal.py: 21 unit tests

Testing

  • New: tests/test_node_removal.py — 21 tests (preconditions, relocation feasibility/picking, Case A/B bookkeeping incl. idempotency/resume, device completion gate, status mapping). ✅
  • Full non-migration suite: 1137 passed, 213 skipped, no regressions.

Notes / review focus

  • Case B (recreating a replica on a fresh node while the cluster is online) is the net-new, highest-risk piece — it reuses recreate_lvstore_on_non_leader with the primary as the online leader. Worth a careful look from someone close to the LVS restart/recreate machinery.
  • Unit tests mock all data-plane RPCs; this has not yet been validated on a live cluster.

🤖 Generated with Claude Code

Add a background, task-driven node-removal orchestration that mirrors
add_node/cluster expansion in reverse. `sn remove` now validates and
queues an FN_NODE_REMOVAL task; tasks_runner_node_removal drives the
idempotent, resumable flow:

  shutdown -> in_removal -> rewire LVS replicas -> remove/fail/migrate
  devices -> removed

Preconditions (enforced before queueing): target ONLINE, every other
non-removed node ONLINE, FTT headroom OK, no LVols/snapshots on the node
(operator migrates those separately), and every secondary/tertiary
replica the node hosts for other primaries has a host-disjoint
relocation target.

LVS rewiring:
  * Case A - the node's own primary LVS: tear down its (empty)
    secondary/tertiary replicas on the peers that host them.
  * Case B - replicas this node hosts for other primaries: re-host each
    on a fresh, anti-affinity-valid node (get_secondary_nodes /
    get_secondary_nodes_2 + recreate_lvstore_on_non_leader) so those
    primaries keep their fault tolerance.

Devices are driven online -> removed -> failed (queuing failure
migration on the surviving online nodes) and the flow waits for
failed_and_migrated before flipping the node to removed.

Changes:
  * base_model: new STATUS_IN_REMOVAL (code 13)
  * job_schedule: new FN_NODE_REMOVAL
  * tasks_controller: add_node_removal_task + dedup + getter; skip
    IN_REMOVAL nodes when fanning out device-migration tasks
  * storage_node_ops: rewrite remove_storage_node + node_removal_orchestrate
    and Case A/B + device helpers
  * services/tasks_runner_node_removal.py: new runner service
  * docker-compose-swarm.yml: register TasksNodeRemovalRunner
  * tests/test_node_removal.py: 21 unit tests

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread simplyblock_core/storage_node_ops.py Fixed
Comment thread simplyblock_cli/cli.py Fixed

@mxsrc mxsrc left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new functions that are being introduced use return-value based error reporting, since they are new I feel they should use exceptions for the error handling. Other than that I think the code looks good, I just have a few comments on two locations. It requires changes in the helm-chart to add the container there as well.

return cands[0] if cands else None


def node_removal_orchestrate(node_id, force_remove=False):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be moved to the task module, including its helpers. The tasks should be as self-contained as possible, this code should not be executed from anywhere else.

Comment on lines +2247 to +2312
existing = tasks_controller.get_active_node_removal_task(snode.cluster_id, node_id)
if existing:
logger.info(f"Node removal already in progress for {node_id} (task {existing})")
return existing

if snode.status == StorageNode.STATUS_REMOVED:
logger.warning(f"Node already removed: {node_id}")
return False

if snode.status != StorageNode.STATUS_ONLINE:
logger.error(
f"Can not remove node {node_id}: it must be ONLINE to start removal "
f"(current status: {snode.status}). Removal shuts the node down itself.")
return False

# All other nodes must be online — removal rewires LVS replicas and drives
# device migration across the surviving nodes, which requires every peer up.
not_online = [
n for n in db_controller.get_storage_nodes_by_cluster_id(snode.cluster_id)
if n.get_id() != node_id
and n.status not in (StorageNode.STATUS_ONLINE, StorageNode.STATUS_REMOVED)
]
if not_online:
offending = ", ".join(f"{n.get_id()}({n.status})" for n in not_online)
logger.error(
f"Can not remove node {node_id}: all nodes must be online. "
f"Not-online node(s): {offending}")
return False

allowed, reason = _check_ftt_allows_node_removal(node_id, db_controller)
if not allowed:
logger.error(f"Can not remove node {node_id}: {reason}")
return False

lvols = db_controller.get_lvols_by_node_id(node_id)
if lvols:
logger.error(
f"Can not remove node {node_id}: {len(lvols)} LVol(s) present. "
f"Migrate or delete them first.")
return False

node_snaps = [
sn for sn in db_controller.get_snapshots()
if sn.lvol.node_id == node_id and sn.deleted is False
]
if node_snaps:
logger.error(
f"Can not remove node {node_id}: {len(node_snaps)} snapshot(s) present. "
f"Remove them first.")
return False

tasks = tasks_controller.get_active_node_tasks(snode.cluster_id, snode.get_id())
if tasks:
logger.warning(f"Task found: {len(tasks)}, can not remove storage node, or use --force")
logger.warning(f"Task found: {len(tasks)}, can not remove storage node, or use --force-remove")
if force_remove is False:
return False
for task in tasks:
tasks_controller.cancel_task(task.uuid)

lvols = db_controller.get_lvols_by_node_id(node_id)
if lvols:
if force_migrate:
for lvol in lvols:
# Case-B feasibility: every replica this node hosts for another primary must
# have somewhere host-disjoint to go. Catches e.g. 2-node clusters where the
# tertiary cannot be re-placed without violating anti-affinity.
feasible, reason = _check_replica_relocation_feasible(snode, db_controller)
if not feasible:
logger.error(f"Can not remove node {node_id}: {reason}")
return False

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we rework this function completely, I think we should convert this from returning a success-boolean to exceptions, in particular for PreconditionError.

This also introduces quite a few checks, I'm not sure whether some of them are new. In any case, did we check that after starting this task, none of these conditions can be removed by external factors? I.e. do we consistently prevent

  • this node (or any other node) being shut down/suspended/restarted
  • LVols/Snapshots being created/moved to this node
  • Tasks being scheduled for this node

I think at least for some of them we already have checks, but to avoid race conditions we'd have to at least set the IN_REMOVAL status here. Technically, a two phase commit would be needed, setting the status or a lock before running the checks, and clearing it if any fails.

# Conflicts:
#	simplyblock_cli/cli.py
#	simplyblock_core/controllers/tasks_controller.py
#	simplyblock_core/scripts/docker-compose-swarm.yml
#	simplyblock_core/storage_node_ops.py
try:
sec = db_controller.get_storage_node_by_id(primary.secondary_node_id)
exclude_mgmt_ips.append(sec.mgmt_ip)
except KeyError:
Comment thread simplyblock_core/storage_node_ops.py Fixed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants