feat(node-removal): online node removal as inverse of cluster expansion#1104
feat(node-removal): online node removal as inverse of cluster expansion#1104schmidt-scaled wants to merge 9 commits into
Conversation
Add a background, task-driven node-removal orchestration that mirrors
add_node/cluster expansion in reverse. `sn remove` now validates and
queues an FN_NODE_REMOVAL task; tasks_runner_node_removal drives the
idempotent, resumable flow:
shutdown -> in_removal -> rewire LVS replicas -> remove/fail/migrate
devices -> removed
Preconditions (enforced before queueing): target ONLINE, every other
non-removed node ONLINE, FTT headroom OK, no LVols/snapshots on the node
(operator migrates those separately), and every secondary/tertiary
replica the node hosts for other primaries has a host-disjoint
relocation target.
LVS rewiring:
* Case A - the node's own primary LVS: tear down its (empty)
secondary/tertiary replicas on the peers that host them.
* Case B - replicas this node hosts for other primaries: re-host each
on a fresh, anti-affinity-valid node (get_secondary_nodes /
get_secondary_nodes_2 + recreate_lvstore_on_non_leader) so those
primaries keep their fault tolerance.
Devices are driven online -> removed -> failed (queuing failure
migration on the surviving online nodes) and the flow waits for
failed_and_migrated before flipping the node to removed.
Changes:
* base_model: new STATUS_IN_REMOVAL (code 13)
* job_schedule: new FN_NODE_REMOVAL
* tasks_controller: add_node_removal_task + dedup + getter; skip
IN_REMOVAL nodes when fanning out device-migration tasks
* storage_node_ops: rewrite remove_storage_node + node_removal_orchestrate
and Case A/B + device helpers
* services/tasks_runner_node_removal.py: new runner service
* docker-compose-swarm.yml: register TasksNodeRemovalRunner
* tests/test_node_removal.py: 21 unit tests
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
mxsrc
left a comment
There was a problem hiding this comment.
The new functions that are being introduced use return-value based error reporting, since they are new I feel they should use exceptions for the error handling. Other than that I think the code looks good, I just have a few comments on two locations. It requires changes in the helm-chart to add the container there as well.
| return cands[0] if cands else None | ||
|
|
||
|
|
||
| def node_removal_orchestrate(node_id, force_remove=False): |
There was a problem hiding this comment.
I think this should be moved to the task module, including its helpers. The tasks should be as self-contained as possible, this code should not be executed from anywhere else.
| existing = tasks_controller.get_active_node_removal_task(snode.cluster_id, node_id) | ||
| if existing: | ||
| logger.info(f"Node removal already in progress for {node_id} (task {existing})") | ||
| return existing | ||
|
|
||
| if snode.status == StorageNode.STATUS_REMOVED: | ||
| logger.warning(f"Node already removed: {node_id}") | ||
| return False | ||
|
|
||
| if snode.status != StorageNode.STATUS_ONLINE: | ||
| logger.error( | ||
| f"Can not remove node {node_id}: it must be ONLINE to start removal " | ||
| f"(current status: {snode.status}). Removal shuts the node down itself.") | ||
| return False | ||
|
|
||
| # All other nodes must be online — removal rewires LVS replicas and drives | ||
| # device migration across the surviving nodes, which requires every peer up. | ||
| not_online = [ | ||
| n for n in db_controller.get_storage_nodes_by_cluster_id(snode.cluster_id) | ||
| if n.get_id() != node_id | ||
| and n.status not in (StorageNode.STATUS_ONLINE, StorageNode.STATUS_REMOVED) | ||
| ] | ||
| if not_online: | ||
| offending = ", ".join(f"{n.get_id()}({n.status})" for n in not_online) | ||
| logger.error( | ||
| f"Can not remove node {node_id}: all nodes must be online. " | ||
| f"Not-online node(s): {offending}") | ||
| return False | ||
|
|
||
| allowed, reason = _check_ftt_allows_node_removal(node_id, db_controller) | ||
| if not allowed: | ||
| logger.error(f"Can not remove node {node_id}: {reason}") | ||
| return False | ||
|
|
||
| lvols = db_controller.get_lvols_by_node_id(node_id) | ||
| if lvols: | ||
| logger.error( | ||
| f"Can not remove node {node_id}: {len(lvols)} LVol(s) present. " | ||
| f"Migrate or delete them first.") | ||
| return False | ||
|
|
||
| node_snaps = [ | ||
| sn for sn in db_controller.get_snapshots() | ||
| if sn.lvol.node_id == node_id and sn.deleted is False | ||
| ] | ||
| if node_snaps: | ||
| logger.error( | ||
| f"Can not remove node {node_id}: {len(node_snaps)} snapshot(s) present. " | ||
| f"Remove them first.") | ||
| return False | ||
|
|
||
| tasks = tasks_controller.get_active_node_tasks(snode.cluster_id, snode.get_id()) | ||
| if tasks: | ||
| logger.warning(f"Task found: {len(tasks)}, can not remove storage node, or use --force") | ||
| logger.warning(f"Task found: {len(tasks)}, can not remove storage node, or use --force-remove") | ||
| if force_remove is False: | ||
| return False | ||
| for task in tasks: | ||
| tasks_controller.cancel_task(task.uuid) | ||
|
|
||
| lvols = db_controller.get_lvols_by_node_id(node_id) | ||
| if lvols: | ||
| if force_migrate: | ||
| for lvol in lvols: | ||
| # Case-B feasibility: every replica this node hosts for another primary must | ||
| # have somewhere host-disjoint to go. Catches e.g. 2-node clusters where the | ||
| # tertiary cannot be re-placed without violating anti-affinity. | ||
| feasible, reason = _check_replica_relocation_feasible(snode, db_controller) | ||
| if not feasible: | ||
| logger.error(f"Can not remove node {node_id}: {reason}") | ||
| return False |
There was a problem hiding this comment.
Since we rework this function completely, I think we should convert this from returning a success-boolean to exceptions, in particular for PreconditionError.
This also introduces quite a few checks, I'm not sure whether some of them are new. In any case, did we check that after starting this task, none of these conditions can be removed by external factors? I.e. do we consistently prevent
- this node (or any other node) being shut down/suspended/restarted
- LVols/Snapshots being created/moved to this node
- Tasks being scheduled for this node
I think at least for some of them we already have checks, but to avoid race conditions we'd have to at least set the IN_REMOVAL status here. Technically, a two phase commit would be needed, setting the status or a lock before running the checks, and clearing it if any fails.
# Conflicts: # simplyblock_cli/cli.py # simplyblock_core/controllers/tasks_controller.py # simplyblock_core/scripts/docker-compose-swarm.yml # simplyblock_core/storage_node_ops.py
| try: | ||
| sec = db_controller.get_storage_node_by_id(primary.secondary_node_id) | ||
| exclude_mgmt_ips.append(sec.mgmt_ip) | ||
| except KeyError: |
…node removal process
…d improve JM reassignment logic in node removal process
Summary
Implements an online storage-node removal function that works as the inverse of cluster expansion (
add_node). It is background, task-driven, idempotent and resumable —sn removevalidates preconditions then queues anFN_NODE_REMOVALtask; the newtasks_runner_node_removalservice drives the flow:This replaces the old
remove_storage_nodesemantics (which only handled an already-offline node and did no LVS rewiring).Behavior
Preconditions (enforced before anything is queued):
online;online;_check_ftt_allows_node_removal);LVS rewiring:
get_secondary_nodes/get_secondary_nodes_2+recreate_lvstore_on_non_leader) so those primaries keep their fault tolerance. Bookkeeping back-ref on the removed node is cleared only after a successful rebuild, so retries resume cleanly.Devices: each data device is driven
online → removed → failed(queuing failure-migration on the surviving online nodes), then the flow waits forfailed_and_migratedbefore flipping the node toremoved.Changes
models/base_model.py: newSTATUS_IN_REMOVAL(code 13)models/job_schedule.py: newFN_NODE_REMOVALcontrollers/tasks_controller.py:add_node_removal_task+ dedup branch +get_active_node_removal_task; skipIN_REMOVALnodes when fanning out device-migration tasks (their SPDK is dead)storage_node_ops.py: rewriteremove_storage_node(validate + queue) +node_removal_orchestrate+ Case A/B + device-decommission/finalize helpersservices/tasks_runner_node_removal.py: new runner service (lease-aware, suspend-and-revisit for the long migration wait)scripts/docker-compose-swarm.yml: registerTasksNodeRemovalRunnersimplyblock_cli/cli.py: updatedsn removehelp;--force-removenow only cancels active taskstests/test_node_removal.py: 21 unit testsTesting
tests/test_node_removal.py— 21 tests (preconditions, relocation feasibility/picking, Case A/B bookkeeping incl. idempotency/resume, device completion gate, status mapping). ✅Notes / review focus
recreate_lvstore_on_non_leaderwith the primary as the online leader. Worth a careful look from someone close to the LVS restart/recreate machinery.🤖 Generated with Claude Code