feat(node-removal): online node removal as inverse of cluster expansion by schmidt-scaled · Pull Request #1104 · simplyblock/sbcli

schmidt-scaled · 2026-06-15T20:25:02Z

Summary

Implements an online storage-node removal function that works as the inverse of cluster expansion (add_node). It is background, task-driven, idempotent and resumable — sn remove validates preconditions then queues an FN_NODE_REMOVAL task; the new tasks_runner_node_removal service drives the flow:

shutdown → in_removal → rewire LVS replicas → remove/fail/migrate devices → removed

This replaces the old remove_storage_node semantics (which only handled an already-offline node and did no LVS rewiring).

Behavior

Preconditions (enforced before anything is queued):

target node is online;
every other non-removed node is online;
FTT headroom allows losing the node (_check_ftt_allows_node_removal);
the node hosts no LVols and no snapshots (operator migrates those separately, at a higher level);
every secondary/tertiary replica the node hosts for other primaries has a host-disjoint relocation target (catches e.g. 2-node clusters).

LVS rewiring:

Case A — the node's own primary LVS: tear down its (empty) secondary/tertiary replicas on the peers that host them, clear cross-references.
Case B — replicas this node hosts for other primaries: re-host each on a fresh, anti-affinity-valid node (get_secondary_nodes / get_secondary_nodes_2 + recreate_lvstore_on_non_leader) so those primaries keep their fault tolerance. Bookkeeping back-ref on the removed node is cleared only after a successful rebuild, so retries resume cleanly.

Devices: each data device is driven online → removed → failed (queuing failure-migration on the surviving online nodes), then the flow waits for failed_and_migrated before flipping the node to removed.

Changes

models/base_model.py: new STATUS_IN_REMOVAL (code 13)
models/job_schedule.py: new FN_NODE_REMOVAL
controllers/tasks_controller.py: add_node_removal_task + dedup branch + get_active_node_removal_task; skip IN_REMOVAL nodes when fanning out device-migration tasks (their SPDK is dead)
storage_node_ops.py: rewrite remove_storage_node (validate + queue) + node_removal_orchestrate + Case A/B + device-decommission/finalize helpers
services/tasks_runner_node_removal.py: new runner service (lease-aware, suspend-and-revisit for the long migration wait)
scripts/docker-compose-swarm.yml: register TasksNodeRemovalRunner
simplyblock_cli/cli.py: updated sn remove help; --force-remove now only cancels active tasks
tests/test_node_removal.py: 21 unit tests

Testing

New: tests/test_node_removal.py — 21 tests (preconditions, relocation feasibility/picking, Case A/B bookkeeping incl. idempotency/resume, device completion gate, status mapping). ✅
Full non-migration suite: 1137 passed, 213 skipped, no regressions.

Notes / review focus

Case B (recreating a replica on a fresh node while the cluster is online) is the net-new, highest-risk piece — it reuses recreate_lvstore_on_non_leader with the primary as the online leader. Worth a careful look from someone close to the LVS restart/recreate machinery.
Unit tests mock all data-plane RPCs; this has not yet been validated on a live cluster.

🤖 Generated with Claude Code

Add a background, task-driven node-removal orchestration that mirrors add_node/cluster expansion in reverse. `sn remove` now validates and queues an FN_NODE_REMOVAL task; tasks_runner_node_removal drives the idempotent, resumable flow: shutdown -> in_removal -> rewire LVS replicas -> remove/fail/migrate devices -> removed Preconditions (enforced before queueing): target ONLINE, every other non-removed node ONLINE, FTT headroom OK, no LVols/snapshots on the node (operator migrates those separately), and every secondary/tertiary replica the node hosts for other primaries has a host-disjoint relocation target. LVS rewiring: * Case A - the node's own primary LVS: tear down its (empty) secondary/tertiary replicas on the peers that host them. * Case B - replicas this node hosts for other primaries: re-host each on a fresh, anti-affinity-valid node (get_secondary_nodes / get_secondary_nodes_2 + recreate_lvstore_on_non_leader) so those primaries keep their fault tolerance. Devices are driven online -> removed -> failed (queuing failure migration on the surviving online nodes) and the flow waits for failed_and_migrated before flipping the node to removed. Changes: * base_model: new STATUS_IN_REMOVAL (code 13) * job_schedule: new FN_NODE_REMOVAL * tasks_controller: add_node_removal_task + dedup + getter; skip IN_REMOVAL nodes when fanning out device-migration tasks * storage_node_ops: rewrite remove_storage_node + node_removal_orchestrate and Case A/B + device helpers * services/tasks_runner_node_removal.py: new runner service * docker-compose-swarm.yml: register TasksNodeRemovalRunner * tests/test_node_removal.py: 21 unit tests Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

mxsrc

The new functions that are being introduced use return-value based error reporting, since they are new I feel they should use exceptions for the error handling. Other than that I think the code looks good, I just have a few comments on two locations. It requires changes in the helm-chart to add the container there as well.

mxsrc · 2026-06-17T14:44:46Z

+    return cands[0] if cands else None
+
+
+def node_removal_orchestrate(node_id, force_remove=False):


I think this should be moved to the task module, including its helpers. The tasks should be as self-contained as possible, this code should not be executed from anywhere else.

mxsrc · 2026-06-17T14:53:20Z

+    existing = tasks_controller.get_active_node_removal_task(snode.cluster_id, node_id)
+    if existing:
+        logger.info(f"Node removal already in progress for {node_id} (task {existing})")
+        return existing
+
+    if snode.status == StorageNode.STATUS_REMOVED:
+        logger.warning(f"Node already removed: {node_id}")
+        return False
+
+    if snode.status != StorageNode.STATUS_ONLINE:
+        logger.error(
+            f"Can not remove node {node_id}: it must be ONLINE to start removal "
+            f"(current status: {snode.status}). Removal shuts the node down itself.")
+        return False
+
+    # All other nodes must be online — removal rewires LVS replicas and drives
+    # device migration across the surviving nodes, which requires every peer up.
+    not_online = [
+        n for n in db_controller.get_storage_nodes_by_cluster_id(snode.cluster_id)
+        if n.get_id() != node_id
+        and n.status not in (StorageNode.STATUS_ONLINE, StorageNode.STATUS_REMOVED)
+    ]
+    if not_online:
+        offending = ", ".join(f"{n.get_id()}({n.status})" for n in not_online)
+        logger.error(
+            f"Can not remove node {node_id}: all nodes must be online. "
+            f"Not-online node(s): {offending}")
+        return False
+
+    allowed, reason = _check_ftt_allows_node_removal(node_id, db_controller)
+    if not allowed:
+        logger.error(f"Can not remove node {node_id}: {reason}")
+        return False
+
+    lvols = db_controller.get_lvols_by_node_id(node_id)
+    if lvols:
+        logger.error(
+            f"Can not remove node {node_id}: {len(lvols)} LVol(s) present. "
+            f"Migrate or delete them first.")
+        return False
+
+    node_snaps = [
+        sn for sn in db_controller.get_snapshots()
+        if sn.lvol.node_id == node_id and sn.deleted is False
+    ]
+    if node_snaps:
+        logger.error(
+            f"Can not remove node {node_id}: {len(node_snaps)} snapshot(s) present. "
+            f"Remove them first.")
        return False

    tasks = tasks_controller.get_active_node_tasks(snode.cluster_id, snode.get_id())
    if tasks:
-        logger.warning(f"Task found: {len(tasks)}, can not remove storage node, or use --force")
+        logger.warning(f"Task found: {len(tasks)}, can not remove storage node, or use --force-remove")
        if force_remove is False:
            return False
        for task in tasks:
            tasks_controller.cancel_task(task.uuid)

-    lvols = db_controller.get_lvols_by_node_id(node_id)
-    if lvols:
-        if force_migrate:
-            for lvol in lvols:
+    # Case-B feasibility: every replica this node hosts for another primary must
+    # have somewhere host-disjoint to go. Catches e.g. 2-node clusters where the
+    # tertiary cannot be re-placed without violating anti-affinity.
+    feasible, reason = _check_replica_relocation_feasible(snode, db_controller)
+    if not feasible:
+        logger.error(f"Can not remove node {node_id}: {reason}")
+        return False


Since we rework this function completely, I think we should convert this from returning a success-boolean to exceptions, in particular for PreconditionError.

This also introduces quite a few checks, I'm not sure whether some of them are new. In any case, did we check that after starting this task, none of these conditions can be removed by external factors? I.e. do we consistently prevent

this node (or any other node) being shut down/suspended/restarted

LVols/Snapshots being created/moved to this node

Tasks being scheduled for this node

I think at least for some of them we already have checks, but to avoid race conditions we'd have to at least set the IN_REMOVAL status here. Technically, a two phase commit would be needed, setting the status or a lock before running the checks, and clearing it if any fails.

# Conflicts: # simplyblock_cli/cli.py # simplyblock_core/controllers/tasks_controller.py # simplyblock_core/scripts/docker-compose-swarm.yml # simplyblock_core/storage_node_ops.py

+            try:
+                sec = db_controller.get_storage_node_by_id(primary.secondary_node_id)
+                exclude_mgmt_ips.append(sec.mgmt_ip)
+            except KeyError:


…ions

…e operations

…node removal process

…d improve JM reassignment logic in node removal process

schmidt-scaled requested review from Hamdy-khader and mxsrc June 15, 2026 20:25

github-code-quality Bot found potential problems Jun 15, 2026

View reviewed changes

Comment thread simplyblock_core/storage_node_ops.py Fixed

Comment thread simplyblock_cli/cli.py Fixed

mxsrc requested changes Jun 17, 2026

View reviewed changes

Merge branch 'main' into feature/node-removal

ee05e96

# Conflicts: # simplyblock_cli/cli.py # simplyblock_core/controllers/tasks_controller.py # simplyblock_core/scripts/docker-compose-swarm.yml # simplyblock_core/storage_node_ops.py

github-code-quality Bot found potential problems Jun 22, 2026

View reviewed changes

Comment thread simplyblock_core/storage_node_ops.py

try:

sec = db_controller.get_storage_node_by_id(primary.secondary_node_id)

exclude_mgmt_ips.append(sec.mgmt_ip)

except KeyError:

Hamdy-khader added 3 commits June 23, 2026 00:31

update: point SIMPLY_BLOCK_DOCKER_IMAGE to feature-node-removal

084fb3c

fix: update get_secondary_nodes to handle removed_node parameter

2181fad

remove: comment out phase 2 "in_removal" logic in storage node operat…

4e6ff2b

…ions

github-code-quality Bot found potential problems Jun 24, 2026

View reviewed changes

Comment thread simplyblock_core/storage_node_ops.py Fixed

Hamdy-khader added 4 commits June 24, 2026 23:59

update: exclude node removal tasks in job task listing

d75b049

update: replace device removal logic with state update in storage nod…

ef5170e

…e operations

update: reorder and adjust device removal and finalization phases in …

f264c3f

…node removal process

update: add support for overridden device names during replacement an…

ea86f13

…d improve JM reassignment logic in node removal process

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(node-removal): online node removal as inverse of cluster expansion#1104

feat(node-removal): online node removal as inverse of cluster expansion#1104
schmidt-scaled wants to merge 9 commits into
mainfrom
feature/node-removal

schmidt-scaled commented Jun 15, 2026

Uh oh!

Uh oh!

Uh oh!

mxsrc left a comment

Uh oh!

mxsrc Jun 17, 2026

Uh oh!

mxsrc Jun 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		return cands[0] if cands else None


		def node_removal_orchestrate(node_id, force_remove=False):

Uh oh!

Conversation

schmidt-scaled commented Jun 15, 2026

Summary

Behavior

Changes

Testing

Notes / review focus

Uh oh!

Uh oh!

Uh oh!

mxsrc left a comment

Choose a reason for hiding this comment

Uh oh!

mxsrc Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

mxsrc Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants